The 5-Phase Evaluation

We don't test your app the way a diligent student uses it. We test it the way a shortcut-seeking student games it.

The Core Methodology

Most app reviews are opinions. Invariant is an audit.

We deploy an agent that uses your app the way a cognitive miser would — actively trying to succeed without doing the intended thinking. Every violation is documented with evidence. Every pass is verified under adversarial conditions.

The result: a diagnosis, not a score.

Phase 0

Domain Sheet

Define What We're Testing

Before we touch your app, we define the evaluation criteria specific to your domain.

We establish:

  • Target Cognitive Process — The mental work that must happen for learning to occur
  • Knowledge Components — The smallest teachable units we'll test
  • Active Production Requirements — What counts as real output in your domain
  • Shortcut Strategies — The specific exploits we'll attempt

This ensures we're testing whether your app produces learning — not whether it has nice UX.

Artifact

Domain Sheet — Your app's custom evaluation specification

Phase 1

Triage Pass

Fast Structural Check

We run your app once as a normal user. Then we run it as an adversarial user.

Questions we answer in minutes:

  • Can we succeed without the target cognitive process?
  • Can we bypass mastery checks entirely?
  • Is progress driven by passive recognition instead of active production?
  • Does time-on-task substitute for demonstrated competence?

If we find a fatal flaw here, we stop. Your app is structurally unreliable.

Artifact

Triage Verdict — Pass/Fail with evidence documentation

Phase 2

Instruction Loop Audit

Deep Single-KC Trace

We pick one knowledge component and trace it through your entire instructional loop, observing what happens at each stage:

  • Instruction — How is the KC introduced?
  • Practice — What response format is required?
  • Error — We intentionally fail in 2-3 distinct ways
  • Remediation — Does the app diagnose and correct, or just repeat?
  • Mastery Gate — Is the decision binary and diagnostic?
  • Progression — What unlocks the next content?

Critical distinction: We're not checking if feedback exists. We're checking if errors are detectable, localizable, interpretable, and actionable.
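The four feedback properties can be expressed as a simple check. This is a hypothetical sketch — the `ErrorFeedback` record and its field names are our own illustration, not Invariant's internal format:

```python
from dataclasses import dataclass

@dataclass
class ErrorFeedback:
    """One observed feedback event during the error stage of a loop trace."""
    detectable: bool     # does the app register that an error occurred?
    localizable: bool    # does it identify which step or concept failed?
    interpretable: bool  # can the learner understand why it was wrong?
    actionable: bool     # does it say what to do differently next time?

    def passes(self) -> bool:
        # "Feedback exists" is not enough; all four properties must hold.
        return all([self.detectable, self.localizable,
                    self.interpretable, self.actionable])

# An app that merely repeats the question detects the error but nothing more:
repeat_only = ErrorFeedback(detectable=True, localizable=False,
                            interpretable=False, actionable=False)
print(repeat_only.passes())  # False
```

In this framing, "repeat the question" remediation fails three of the four checks even though feedback technically exists.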

Artifact

Loop Trace — Step-by-step documentation with invariant scoring

Phase 3

Breadth Sampling

Full Invariant Audit

Compliance can't be a one-screen illusion. We sample across your entire app:

  • Early content — Where first impressions form
  • Mid-sequence content — Where complexity increases
  • Transition points — Where skills integrate
  • Harder steps — Where gaps and struggles surface

Every sample point. All 11 invariants. Full evidence documentation.

Artifact

Complete Invariant Scoresheet — Pass/Fail per invariant with reproduction steps

Phase 4

External Validation

Reality Check

Your app claims mastery. We verify it.

Cold Probes

Same knowledge component, different surface form, no supports. Can the learner perform without the app's scaffolding?

Delayed Probes

Same test after time passes. Did learning stick, or just session memory?

The Verdict

If your "mastered" users fail our probes, your mastery claims are inflated. The app doesn't get to define its own success.
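One way to quantify that inflation is a mastery-probe mismatch rate. A minimal sketch — the record fields and function name are illustrative assumptions, not Invariant's actual analysis code:

```python
def mastery_mismatch_rate(records):
    """Fraction of users the app marked 'mastered' who then failed
    an external probe (cold or delayed) on the same knowledge component."""
    mastered = [r for r in records if r["app_says_mastered"]]
    if not mastered:
        return 0.0
    failed_probe = [r for r in mastered if not r["passed_probe"]]
    return len(failed_probe) / len(mastered)

records = [
    {"app_says_mastered": True,  "passed_probe": True},
    {"app_says_mastered": True,  "passed_probe": False},
    {"app_says_mastered": True,  "passed_probe": False},
    {"app_says_mastered": False, "passed_probe": False},
]
print(mastery_mismatch_rate(records))  # 2 of 3 "mastered" users failed the probe
```

A high mismatch rate means the app's internal mastery gate and external performance have come apart — the core failure this phase exists to detect.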

Artifact

Probe Results — External validation data with mismatch analysis

Phase 5

Diagnostic Report

Your Report

You receive two outputs:

Public Verdict

Simple. Binary. Defensible.

Meets Standard — Passed all 11 invariants
Does Not Meet — Failed one or more

Developer-Facing Diagnosis

Exactly what violated which invariant where:

  • Reproduction steps for every violation
  • Screenshots and screen recordings
  • Severity ratings (Critical / Major / Minor)
  • Specific fix suggestions
  • Priority order for remediation

This isn't a score to argue about. It's evidence to act on.

We Try to Break Your App

Our adversarial testing protocol includes systematic attempts to succeed without learning:

  • Rapid Guessing — Tapping at random, faster than reading or thinking is possible
  • Retry-Until-Lucky — Repeating an item until the correct answer is guessed
  • Pattern Exploitation — Using answer position, repeated sequences, or predictable stems
  • Hint Abuse — Using hints to reveal answers rather than scaffold thinking
  • Context Cueing — Inferring answers from pictures, layout, or framing
  • Recognition Gaming — Eliminating wrong answers instead of generating correct ones
  • Mode Switching — Finding escape hatches or alternative, easier paths
  • Time Farming — Accumulating progress through time-on-task rather than mastery

If any of these strategies yield "success" in your app, you fail Invariant 1.
The goal isn't to be harsh. It's to find what your users will find — before they find it.
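Retry-until-lucky works because naive gates make guessing cheap. A back-of-the-envelope sketch — the 4-option multiple-choice format and unlimited retries are assumptions for illustration:

```python
# Probability that pure guessing clears a 4-option multiple-choice item
# within k attempts, when the app allows unlimited retries:
#   P(pass within k) = 1 - (3/4)^k
for k in (1, 3, 5, 10):
    p = 1 - (3 / 4) ** k
    print(f"{k:2d} attempts: {p:.0%}")
```

Within ten free retries, a learner who knows nothing passes the item almost every time — which is why a gate that permits unlimited, unpenalized retries cannot certify mastery.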

Get your app evaluated.

Find out whether your learning app meets the standard.