The 5-Phase Evaluation
We don't test your app the way a diligent student uses it. We test it the way a shortcut-seeking student games it.
The Core Methodology
Most app reviews are opinions. Invariant is an audit.
We deploy an agent that uses your app the way a cognitive miser would — actively trying to succeed without doing the intended thinking. Every violation is documented with evidence. Every pass is verified under adversarial conditions.
The result: a diagnosis, not a score.
Domain Sheet
Define What We're Testing
Before we touch your app, we define the evaluation criteria specific to your domain.
We establish:
- Target Cognitive Process: the mental work that must happen for learning to occur
- Knowledge Components: the smallest teachable units we'll test
- Active Production Requirements: what counts as real output in your domain
- Shortcut Strategies: the specific exploits we'll attempt
This ensures we're testing whether your app produces learning — not whether it has nice UX.
Domain Sheet — Your app's custom evaluation specification
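As a rough sketch, the four items a Domain Sheet establishes could be held in a record like the one below. The class name, field names, and sample values are hypothetical illustrations, not Invariant's actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DomainSheet:
    """Hypothetical record of the four items a Domain Sheet establishes."""
    target_cognitive_process: str
    knowledge_components: List[str]
    active_production_requirements: List[str]
    shortcut_strategies: List[str]

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        missing = []
        if not self.target_cognitive_process.strip():
            missing.append("target_cognitive_process")
        for name in (
            "knowledge_components",
            "active_production_requirements",
            "shortcut_strategies",
        ):
            if not getattr(self, name):
                missing.append(name)
        return missing


# Illustrative values only; a real sheet is written per app and per domain.
sheet = DomainSheet(
    target_cognitive_process="Blend phonemes to decode an unfamiliar word",
    knowledge_components=["short-a CVC words"],
    active_production_requirements=["learner reads the word aloud"],
    shortcut_strategies=["rapid guessing", "context cueing"],
)
print(sheet.missing_sections())  # prints []
```

A sheet with any empty section is not ready to test against, which is why the sketch reports gaps instead of silently accepting them.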
Triage Pass
Fast Structural Check
We run your app once as a normal user. Then we run it as an adversarial user.
Questions we answer in minutes:
- Can we succeed without the target cognitive process?
- Can we bypass mastery checks entirely?
- Is progress driven by passive recognition instead of active production?
- Does time-on-task substitute for demonstrated competence?
If we find a fatal flaw here, we stop. Your app is structurally unreliable.
Triage Verdict — Pass/Fail with evidence documentation
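The stop-on-fatal-flaw rule can be sketched as a small function. This assumes that a "yes" to any one of the four questions counts as a fatal flaw; the key names and the function itself are hypothetical.

```python
from typing import Dict

# Hypothetical keys, one per triage question. True means "yes, the exploit works".
TRIAGE_QUESTIONS = (
    "succeeds_without_target_process",
    "bypasses_mastery_checks",
    "progress_from_passive_recognition",
    "time_on_task_substitutes_for_competence",
)


def triage_verdict(findings: Dict[str, bool]) -> str:
    """Return "FAIL" if any triage question is answered yes, otherwise "PASS"."""
    missing = [q for q in TRIAGE_QUESTIONS if q not in findings]
    if missing:
        # An unanswered question is not a pass; refuse to issue a verdict.
        raise ValueError(f"unanswered triage questions: {missing}")
    return "FAIL" if any(findings[q] for q in TRIAGE_QUESTIONS) else "PASS"


clean = {q: False for q in TRIAGE_QUESTIONS}
print(triage_verdict(clean))  # prints PASS
print(triage_verdict({**clean, "bypasses_mastery_checks": True}))  # prints FAIL
```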
Instruction Loop Audit
Deep Single-KC Trace
We pick one knowledge component and trace it through your entire instructional loop:
| Stage | What We Observe |
|---|---|
| Instruction | How is the KC introduced? |
| Practice | What response format is required? |
| Error | We fail on purpose in two or three distinct ways |
| Remediation | Does the app diagnose and correct, or just repeat? |
| Mastery Gate | Is the decision binary and diagnostic? |
| Progression | What unlocks the next content? |
Critical distinction: We're not checking if feedback exists. We're checking if errors are detectable, localizable, interpretable, and actionable.
Loop Trace — Step-by-step documentation with invariant scoring
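One way to record the Error stage is to log each deliberate failure against the four properties named above. The sketch below is an assumption about how that could look: the schema, the names, and the pass rule (at least two distinct failures, each meeting all four properties) are illustrative, not a published scoring rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorProbe:
    """One deliberate failure and what the app's feedback achieved (hypothetical schema)."""
    description: str
    detectable: bool
    localizable: bool
    interpretable: bool
    actionable: bool

    def unmet(self) -> List[str]:
        """Return the names of the properties this feedback did not meet."""
        names = ("detectable", "localizable", "interpretable", "actionable")
        return [n for n in names if not getattr(self, n)]


def error_stage_passes(probes: List[ErrorProbe]) -> bool:
    """Assumed rule: two or more distinct failures, all four properties met each time."""
    return len(probes) >= 2 and all(not p.unmet() for p in probes)


probes = [
    ErrorProbe("wrong operation chosen", True, True, True, True),
    ErrorProbe("right method, arithmetic slip", True, True, False, False),
]
print(probes[1].unmet())  # prints ['interpretable', 'actionable']
print(error_stage_passes(probes))  # prints False
```

Listing the unmet properties by name is what separates "feedback exists" from feedback that can be acted on.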
Breadth Sampling
Full Invariant Audit
Compliance can't be a one-screen illusion. We sample across your entire app:
- Early content: where first impressions form
- Mid-sequence content: where complexity increases
- Transition points: where skills integrate
- Harder steps: where gaps and struggles surface
Every sample point. All 11 invariants. Full evidence documentation.
Complete Invariant Scoresheet — Pass/Fail per invariant with reproduction steps
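A scoresheet of this shape can be sketched as a mapping from sample point to per-invariant results. The invariants are referred to by number only, since they are not listed in this section; the function name and the sample-point labels are hypothetical.

```python
from typing import Dict, List

NUM_INVARIANTS = 11  # the count comes from the text; the invariants are not named here


def failing_invariants(samples: Dict[str, Dict[int, bool]]) -> Dict[int, List[str]]:
    """Map each failed invariant number to the sample points where it failed."""
    failures: Dict[int, List[str]] = {}
    for point, results in samples.items():
        for invariant in range(1, NUM_INVARIANTS + 1):
            if invariant not in results:
                # Every sample point must be scored on every invariant.
                raise ValueError(f"{point}: invariant {invariant} was not scored")
            if not results[invariant]:
                failures.setdefault(invariant, []).append(point)
    return failures


all_pass = {i: True for i in range(1, NUM_INVARIANTS + 1)}
samples = {
    "early": dict(all_pass),
    "transition": {**all_pass, 3: False},
}
print(failing_invariants(samples))  # prints {3: ['transition']}
```

An invariant that holds on early screens but fails at a transition point still appears in the output, which is the point of sampling broadly.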
External Validation
Reality Check
Your app claims mastery. We verify it.
Cold Probes
Same knowledge component, different surface form, no supports. Can the learner perform without the app's scaffolding?
Delayed Probes
The same test, repeated after a delay. Did the learning stick, or was it only session memory?
The Verdict
If your "mastered" users fail our probes, your mastery claims are inflated. The app doesn't get to define its own success.
Probe Results — External validation data with mismatch analysis
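The mismatch between claimed mastery and probe performance can be expressed as a simple rate. This is a minimal sketch, assuming each learner record reduces to two booleans; the function name and record layout are hypothetical, and the sample data is made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def inflation_rate(records: Iterable[Tuple[bool, bool]]) -> Optional[float]:
    """Share of app-declared "mastered" learners who failed the external probe.

    Each record is (app_says_mastered, passed_probe). Returns None when no
    learner was marked mastered, because the rate is undefined in that case.
    """
    probe_results = [passed for mastered, passed in records if mastered]
    if not probe_results:
        return None
    return probe_results.count(False) / len(probe_results)


# Made-up data: four learners marked mastered, one of whom fails the probe.
records = [(True, True), (True, True), (True, True), (True, False), (False, False)]
print(inflation_rate(records))  # prints 0.25
```

Learners the app never marked as mastered are excluded, since the question is only whether the app's own mastery claims hold up.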
Diagnostic Report
Your Report
You receive two outputs:
Public Verdict
Simple. Binary. Defensible.
Developer-Facing Diagnosis
Exactly what violated which invariant, and where:
- Reproduction steps for every violation
- Screenshots and screen recordings
- Severity ratings (Critical / Major / Minor)
- Specific fix suggestions
- Priority order for remediation
This isn't a score to argue about. It's evidence to act on.
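For illustration, a remediation queue could be derived from the severity ratings alone. The sketch assumes priority follows severity first and invariant number second; the real priority order may weigh other factors, and the dictionary keys are hypothetical.

```python
from typing import Dict, List

# Lower rank sorts first. The three labels are the ones used in the report.
SEVERITY_RANK = {"Critical": 0, "Major": 1, "Minor": 2}


def remediation_order(violations: List[Dict]) -> List[Dict]:
    """Sort violations by severity, then by invariant number (assumed tie-break)."""
    return sorted(
        violations,
        key=lambda v: (SEVERITY_RANK[v["severity"]], v["invariant"]),
    )


violations = [
    {"invariant": 7, "severity": "Minor"},
    {"invariant": 4, "severity": "Critical"},
    {"invariant": 1, "severity": "Critical"},
    {"invariant": 2, "severity": "Major"},
]
print([v["invariant"] for v in remediation_order(violations)])  # prints [1, 4, 2, 7]
```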
We Try to Break Your App
Our adversarial testing protocol includes systematic attempts to succeed without learning:
- Rapid Guessing: tapping at random, faster than anyone could read or think
- Retry-Until-Lucky: repeating an item until the correct answer is guessed
- Pattern Exploitation: using answer position, repeated sequences, or predictable stems
- Hint Abuse: using hints to reveal answers rather than to scaffold thinking
- Context Cueing: inferring answers from pictures, layout, or frames
- Recognition Gaming: eliminating wrong answers instead of generating the correct one
- Mode Switching: finding escape hatches or easier alternative paths
- Time Farming: accumulating progress through time-on-task rather than mastery
If any of these strategies yield "success" in your app, you fail Invariant 1.
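To see why Retry-Until-Lucky matters, consider the arithmetic for a multiple-choice item answered by blind guessing. The sketch below is our own illustration, not part of the testing protocol; the function and parameter names are hypothetical. With independent guesses on an item with n options, the chance of at least one hit in k attempts is 1 - (1 - 1/n)^k; a guesser who never repeats a wrong answer reaches k/n.

```python
def retry_until_lucky(options: int, attempts: int, remembers_misses: bool = False) -> float:
    """Chance that blind guessing hits the right answer within `attempts` tries."""
    if options < 1 or attempts < 0:
        raise ValueError("options must be >= 1 and attempts must be >= 0")
    if remembers_misses:
        # Never repeats a wrong answer: guessing without replacement.
        return min(attempts, options) / options
    # Each guess is independent and uniform over the options.
    return 1.0 - (1.0 - 1.0 / options) ** attempts


print(retry_until_lucky(4, 1))                         # prints 0.25
print(retry_until_lucky(4, 3))                         # prints 0.578125
print(retry_until_lucky(4, 3, remembers_misses=True))  # prints 0.75
```

On a four-option item, three unpenalized retries already give a guesser better-than-even odds, so an app that counts the eventual correct tap as success is rewarding persistence, not knowledge.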
The goal isn't to be harsh. It's to find what your users will find — before they find it.