back to blog

testing AI is a different problem than testing code

7 min read
ai engineering testing

the problem

normal code is deterministic. you write a function, you know what it should return, you write a test that asserts the output matches the expectation. pass or fail. green or red. the relationship between input x and output y is governed by explicit logic you wrote yourself.

AI systems don’t work like that. they’re probabilistic. the same input can produce different outputs. there’s often no single “correct” answer. an LLM summarizing an article can produce a dozen valid summaries. a fraud detection model scores risk on a spectrum. the whole framework of “write assertion, check output, mark pass/fail” collapses when correctness is a distribution instead of a point.

yet most teams still test AI the way they test software. that’s the crisis.

why current AI testing is bad

benchmarks are memorized, not reasoned

most evaluation relies on hold-out datasets — MMLU, HumanEval, GSM8K. but modern LLMs are trained on vast swaths of the internet, and these test sets are often inadvertently included in training data. the model isn’t reasoning; it’s recalling. this is data contamination, and it makes benchmark scores unreliable indicators of real-world performance.

a 2025 arxiv paper put it bluntly: “training on the benchmark is not all you need.” models that ace benchmarks can fail catastrophically on slight variations of the same problem because they never learned the underlying logic — they memorized the exam.

the clever hans effect

in the early 1900s, a horse named clever hans appeared to do arithmetic. he was actually picking up on subtle cues from his trainer. AI models do the same thing. a medical imaging model might achieve 95% accuracy not by learning to identify disease, but by latching onto hospital watermarks or ruler markings in scans. a sentiment classifier might rely on the presence of specific formatting rather than understanding text.

this is shortcut learning. the model minimizes training loss by exploiting spurious correlations in the data rather than learning the true causal features. standard metrics reward this behavior because the scores look great. the failure only shows up when the model encounters real-world data without those shortcuts.

the oracle problem

in traditional testing, there’s an oracle — a known correct output for every input. for AI, there often isn’t one. asking “is this summary good?” is a subjective judgment. even human experts disagree. this means you can’t automate correctness checks the way you can with deterministic code.

the two failure modes are:

  • the miss: the system marks a wrong output as correct
  • the false alarm: the system flags a correct output as wrong

both are dangerous, and both get worse as model outputs become more open-ended.

what a good system looks like

metamorphic testing: testing without an oracle

the most promising approach to the oracle problem is metamorphic testing. instead of checking whether a specific output is correct, you check whether the relationship between outputs matches expected behavior when inputs are transformed.

examples:

  • invariance: rotating an image shouldn’t change what the model detects in it
  • directional: increasing someone’s income shouldn’t decrease their credit score
  • symmetry: cos(x) should equal cos(2π − x)
  • robustness: adding static noise to an audio clip shouldn’t change the transcription

if adding “pets allowed” to a hotel search increases results, something is broken — even if you don’t know the exact correct number of results. you’ve caught a bug without needing an oracle.

behavioral testing: the checklist methodology

the checklist framework (from the “beyond accuracy” paper out of microsoft research) treats NLP testing like a capability matrix. instead of just checking accuracy, you test invariance, directional expectations, and minimum functionality separately.

  • does sentiment analysis give the same score if you swap “john” for “mary”? it should.
  • does adding “the service was terrible” make sentiment more negative? it should.
  • can the model handle a simple negation like “this is not good”? under checklist-style testing, models that score 90%+ on standard benchmarks routinely fail these basic checks.

this is the real test: not “does it get the right answer” but “does it behave consistently under perturbation.”

red teaming: actively trying to break the model

you don’t test a vault by pushing on the door. you hire someone to pick the lock.

AI red teaming is the practice of simulating real attacks — prompt injection, data extraction, jailbreaks, bias exploitation. in 2025, this has moved from manual probing to automated systems:

  • PAIR (prompt automatic iterative refinement) — evolves attack prompts based on model responses
  • TAP (tree of attacks with pruning) — explores multiple conversation paths to find jailbreaks
  • Garak — 100+ attack modules that scan for hallucinations, slurs, and data extraction

non-english safety vulnerabilities can be up to 195% more prevalent than english ones. a model that’s safe in english might leak prompt instructions in kannada. automated red teaming catches what manual testing misses.

LLM-as-a-judge — useful but biased

as outputs scale, human evaluation becomes a bottleneck. the industry uses LLMs to evaluate other LLMs — “LLM-as-a-judge.” it’s scalable and shows human-like understanding of nuance. but it carries documented biases:

  • position bias: favors the first answer in a pairwise comparison
  • verbosity bias: rewards longer answers even when content is thin
  • affinity bias: GPT-4 prefers GPT-4’s writing style
  • reasoning limits: the judge can’t evaluate problems beyond its own capability — the referee must be smarter than the player

good systems mitigate this with order-swapping, reference grading, and cross-evaluation using different model architectures for judge and subject.

shift-left and continuous: testing doesn’t end at deployment

the biggest shift in thinking: testing is not a phase, it’s a lifecycle. AI systems degrade in production. model drift, distribution shift, adversarial adaptation — the model that tested well in january might hallucinate in march because user inputs shifted.

this means:

  • observability over monitoring: monitoring tells you latency went up. observability tells you why — correlating traces, prompts, model versions, and user segments to find the root cause of quality degradation.
  • automated drift detection: continuous comparison of live inference data against the committed schema. silent regressions get caught before users notice.
  • self-healing tests: when API specs change, tests update automatically. non-breaking changes are absorbed silently; actual regressions are flagged for review.

teams adopting shift-left, AI-first testing release features 3.4x faster with 62% fewer production incidents than those relying on traditional script-based QA.

the blueprint

a good AI testing system isn’t a collection of static test suites. it’s a multi-layered architecture:

  1. metamorphic relations — test behavioral consistency, not specific outputs
  2. behavioral testing (checklist-style) — verify capabilities, not just accuracy
  3. automated red teaming — actively hunt for failure modes
  4. LLM judges with bias mitigation — scalable evaluation, but cross-validated and reference-grounded
  5. continuous observability — production telemetry that catches drift before users do
  6. shift-left integration — tests run on every commit, not just before release

the transition from deterministic software to probabilistic AI means the end of the green checkmark era. a test suite that passes today doesn’t mean your model is correct — it means your model was correct on the specific inputs you tested, at the specific time you tested them, under the specific conditions that existed then. confidence, not certainty, is the new measure of quality.

we don’t get to eliminate uncertainty in probabilistic systems. we get to measure and manage it. that’s the job now.

tahzeer