
how to test an AI detector before you trust it

by Tuan Hoang · detection lead · last reviewed 2026-07-03
‘99% accurate’ is doing a lot of work.

don’t take an accuracy claim’s word for it, ours or anyone’s. run a 30-sample blind test: ten known-human texts, ten known-AI, ten hybrid, plus a pre-2022 false-positive baseline and one paraphrase stress test. an afternoon of work tells you more than any marketing page.

why the 99% number is a marketing artifact

a single accuracy percentage hides everything that matters: measured on what content, produced how, at what false-positive threshold? in the largest public benchmark (RAID, ACL 2024), detectors advertising 99%+ dropped sharply the moment they left their home domain. and in 2025 the U.S. FTC ordered Workado to stop advertising its AI Content Detector as roughly 98% accurate after its complaint alleged independent testing showed 53% on general-purpose content. that order changed the rules: accuracy claims now require competent, reliable evidence. the full evidence file lives in “are AI detectors accurate?”; this page is about what you do instead of trusting the number.

and to be clear about our own position: amige. doesn't claim to be the most accurate AI detector. nobody honest can, because the ranking shifts with content type, language, and how hard someone tried to hide the generation. a tool leading with that superlative has told you something useful about its marketing, if not its classifier.

[figure: why one accuracy number means nothing. the same detector under three conditions: advertised (vendor’s own test), out-of-domain (content it wasn’t trained for), paraphrased (one ‘humanizer’ pass). no scale on purpose; the sources have the numbers.]
the same detector, three test conditions: the shape third-party benchmarks keep finding

the 30-sample blind test

pasting one ChatGPT response into a detector isn't a test, it's an anecdote. a real evaluation needs variety and needs you blind to the answers while you score:

  • 10 known-human samples. your own pre-AI writing, or published text from before 2022: guaranteed human, so every flag here is a false positive.
  • 10 known-AI samples. generate them yourself, across more than one model and more than one prompt style. vary length and topic.
  • 10 hybrid samples. AI drafts you’ve genuinely edited, and human drafts an AI has polished. this is where most real-world content lives, and where most detectors wobble.

shuffle them, label them privately, run all thirty, and only then unblind. score the four outcomes separately (human called human, AI called AI, human called AI, AI called human), because a tool can ace one column while failing the one you care about. and watch your own bias: if you already suspect a text, you'll read a 55% as a conviction. the protocol exists to protect you from yourself.
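the shuffle-and-unblind steps above can be sketched in a few lines. this is a minimal sketch, not a prescribed tool: the function names, the `S01`-style sample ids, and the label strings (`human`, `ai`, `hybrid`) are all assumptions made for illustration. the answer key stays in a separate dict so you can keep it out of sight while you record verdicts.

```python
import random
from collections import Counter


def make_blind_sheet(samples, seed=None):
    """Shuffle (text, true_label) pairs and split them into a blind sheet and a key.

    Returns (blind, key): blind is a list of (sample_id, text) with no labels,
    key maps sample_id -> true_label and should stay hidden until scoring.
    The ids and label strings are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    blind, key = [], {}
    for i, (text, label) in enumerate(shuffled, start=1):
        sample_id = f"S{i:02d}"
        blind.append((sample_id, text))
        key[sample_id] = label
    return blind, key


def score_outcomes(key, verdicts):
    """Unblind: count every (true_label, detector_verdict) pair separately.

    For human and AI samples this yields the four outcomes
    (human/human, ai/ai, human/ai, ai/human). Hybrid samples land in
    their own rows instead of being forced into either column.
    """
    counts = Counter()
    for sample_id, truth in key.items():
        counts[(truth, verdicts[sample_id])] += 1
    return counts
```

record the detector's verdict for each id in a dict like `{"S01": "ai", ...}` while the key is still hidden, then call `score_outcomes(key, verdicts)`. reading `counts[("human", "ai")]` on its own is the point: that cell is your false-positive count, and it shouldn't be averaged away by a good score elsewhere.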

[figure: the control group. 10 human (pre-2022 = guaranteed), 10 AI (several models, prompts), 10 hybrid (edited both directions). shuffle → scan all 30 → unblind → score. four columns, scored separately.]
the 30-sample blind test: label privately, shuffle, scan, then unblind

weigh false positives first

the flag on innocent writing is the expensive mistake: it's an accusation with a person on the other end. the 2023 Stanford study in Patterns found that, on average, seven popular detectors misclassified 61.3% of TOEFL essays by non-native English speakers as AI-generated, against about 5.1% for native-speaker writing. structured, careful, second-language prose reads as “machine” to a naive classifier. if your test population includes non-native writers (a classroom, a hiring pipeline), weight your human samples accordingly, and read up on the false positive rate before you act on any flag.
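if your human samples come from more than one kind of writer, it helps to compute the false-positive rate per group rather than as one pooled figure. a minimal sketch, assuming you've recorded each known-human sample as a `(group, flagged_as_ai)` pair; the group names and the function name are hypothetical.

```python
from collections import defaultdict


def false_positive_rates(human_results):
    """Per-group false-positive rate over known-human samples only.

    human_results: iterable of (group, flagged_as_ai) pairs, where
    flagged_as_ai is True when the detector called a human text AI.
    Returns {group: flagged / total}.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, is_flagged in human_results:
        total[group] += 1
        if is_flagged:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}
```

keep the sample size in mind when you read the result: with ten human samples, each single flag moves the rate by ten percentage points, so treat a gap between groups as a reason to test more, not as a measurement.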

stress-test with a paraphrase

take three of your known-AI samples, run them through a paraphraser, and scan again. expect the scores to collapse: a NeurIPS 2023 study dropped one detector from 70.3% to 4.6% detection at a 1% false-positive rate with a single paraphrase pass, and a NeurIPS 2025 attack cut true-positive rates by an average of ~88% across detectors at the same threshold. the “humanizer” tools sold online exploit exactly this. any detector that claims paraphrased text can't hide from it is contradicting the published research. amige. can't reliably catch it either, and says so. what you're testing here isn't whether the tool survives (it mostly won't); it's whether the tool is honest about it.
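the studies above report detection "at a 1% false-positive rate", which means the threshold is fixed from human text first and only then applied to AI text. here's a minimal sketch of that bookkeeping, assuming your detector returns a numeric score where higher means "more likely AI"; the function names are hypothetical, and a real benchmark would use far more than ten human samples.

```python
import math


def threshold_at_fpr(human_scores, max_fpr=0.01):
    """Pick a score threshold from known-human samples.

    A text is flagged when its score is strictly greater than the threshold.
    The threshold is chosen so that at most max_fpr of the human samples
    would be flagged.
    """
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))  # false positives we tolerate
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def detection_rate(ai_scores, threshold):
    """Share of known-AI samples flagged at the given threshold."""
    return sum(score > threshold for score in ai_scores) / len(ai_scores)
```

compute the threshold once from your human samples, then call `detection_rate` on the AI scores before paraphrasing and again after, with the same threshold. with only ten human samples a 1% budget rounds down to zero allowed false positives, so the threshold is simply your highest human score; that's a coarse stand-in for the published setup, not a replication of it.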

the transparency checklist

the score matters less than whether you can interrogate it. before you adopt any detector, check:

  • can you see the disagreement? a single merged percentage hides how split the evidence was. amige. shows every detector’s read on each verdict; whatever tool you pick should show its work somehow.
  • is ‘uncertain’ a possible answer? a tool that always picks a side is guessing on the hard cases. abstention is a feature, not a bug.
  • are its accuracy claims cited? third-party benchmarks with dates beat self-reported numbers. post-Workado, an uncited ‘99%’ is a red flag.
  • does it hedge attribution? ‘looks like Midjourney’ is honest; ‘made with Midjourney’ is more than any closed-set classifier can know.
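to make the first two checks concrete, here's one way a verdict with abstention could work when you have scores from several detectors. this is an illustrative sketch only: it isn't how amige. or any named tool combines results, and the `low` and `high` cut-offs are arbitrary assumptions you'd tune on your own samples.

```python
def combined_verdict(scores, low=0.3, high=0.7):
    """Combine per-detector AI-probability scores (0..1) into one verdict.

    Picks a side only when every detector agrees; any split, any
    mid-range score, or an empty input yields "uncertain".
    The cut-offs are illustrative, not calibrated values.
    """
    if not scores:
        return "uncertain"
    if all(score >= high for score in scores):
        return "likely AI"
    if all(score <= low for score in scores):
        return "likely human"
    return "uncertain"
```

the design choice is that disagreement is never averaged away: three scores of 0.9, 0.2, and 0.6 come back as "uncertain" rather than as a merged 57%. show the individual scores next to the verdict so a reader can see why it abstained.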

that's the whole method: blind samples, false positives first, one paraphrase attack, then judge the tool on its honesty rather than its confidence. if you want a starting shortlist to run it against, “the best AI detectors in 2026” ranks the field by use case. and yes, the protocol applies to us too.

questions

because they’re measuring different things. each detector has its own training data, its own feature set, and its own decision threshold: one leans on perplexity, another on burstiness, another on learned generator fingerprints. there’s no industry-standard definition of ‘accuracy’ to calibrate against, so disagreement between tools is normal. it’s also informative: if three independent detectors split on a text, that text is genuinely ambiguous, and an honest verdict on it is ‘uncertain,’ not a coin-flip percentage.

no. not amige., not anyone. peer-reviewed testing shows every detector has a real false-positive rate, and it lands hardest on non-native English writers (61.3% of non-native TOEFL essays were misclassified, on average across seven detectors, in the 2023 Stanford study). a detector score is one probabilistic signal to weigh alongside drafts, revision history, and a conversation with the person, never the sole basis for an accusation. any tool that markets itself as accusation-grade is overclaiming.

sources.

  1. Dugan et al., RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors (ACL 2024)
     detectors advertising 99%+ accuracy drop sharply on out-of-domain content and under adversarial attack.
  2. Liang et al., GPT detectors are biased against non-native English writers (Patterns / Cell Press, 2023)
     61.3% of non-native TOEFL essays misclassified as AI on average across seven detectors, vs ~5.1% for native-speaker writing.
  3. FTC, Order Requires Workado to Back Up AI Detection Claims (April 2025)
     ~98% advertised vs an alleged 53% on general-purpose content; accuracy claims now require competent, reliable evidence.
  4. Krishna et al., Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense (NeurIPS 2023)
     one paraphrase pass dropped DetectGPT from 70.3% to 4.6% detection at a 1% false-positive rate.
  5. Cheng et al., Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text (NeurIPS 2025)
     average ~88% true-positive-rate drop across detectors at a 1% false-positive rate.
  6. Jabarian & Imas, Artificial Writing and Automated Detection (Chicago Booth / BFI Working Paper 2025-116)
     independent test across six genres and four frontier models; wide variance between tools, which is the case for testing on YOUR content.