Perplexity, burstiness, vocabulary entropy, and model fingerprinting — the four statistical signals that separate AI-generated text from human writing.
AI text detection is a classification problem rooted in the statistical properties of language model outputs. I have spent months analyzing how the major AI detectors work under the hood, and they all rely on a core insight: generative AI models produce text by sampling from learned probability distributions, and this process leaves statistical signatures that differ measurably from human writing.
In this guide I break down the four key signals that AI detectors use, explain why they work, discuss where they fail, and cover the emerging methods that may replace today's statistical approaches. Whether you are evaluating detectors for your organization, building detection into a content pipeline, or simply trying to understand why a tool flagged your writing, this deep dive will give you the technical foundation to interpret results with confidence.
Perplexity measures how predictable a text sequence is under a language model. Technically, it is the exponentiated average negative log-likelihood of a text sequence. In simpler terms: how surprised would a language model be by this text?
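In symbols, for a sequence of N tokens scored by a language model p, the definition reads as follows. The notation is introduced here for illustration and is not taken from any specific detector.

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

For instance, if the model assigns every token a probability of 0.25, the perplexity is exactly 4: the model is as uncertain as if it were choosing uniformly among four options at each step.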
AI-generated text tends toward low perplexity because it was produced by exactly the kind of probabilistic process the detector measures. The text follows the patterns the model learned during training. Human text tends toward higher perplexity because humans make unexpected word choices, use idiosyncratic phrasing, and produce text that is less statistically predictable.
The mathematical intuition is straightforward. A language model assigns a probability to each next token in a sequence. When the model generates text, it samples from those probabilities — meaning the resulting text is, by definition, aligned with what the model considers likely. When a human writes, they may choose words that are contextually appropriate but statistically improbable under the model's distribution. The detector exploits this gap.
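To make the gap concrete, here is a minimal sketch of the perplexity calculation. It assumes you already have per-token log-probabilities from some scoring model. The probability values below are hypothetical and chosen only to illustrate the contrast; they are not measurements from any real detector.

```python
import math

def perplexity(token_logprobs):
    """Exponentiated average negative log-likelihood of a token sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities under a scoring model.
# The "model-like" text stays close to what the model expects.
model_like = [math.log(p) for p in (0.60, 0.45, 0.70, 0.55, 0.50)]
# The "human-like" text contains two words the model finds improbable.
human_like = [math.log(p) for p in (0.60, 0.02, 0.70, 0.05, 0.50)]

ppl_model_like = perplexity(model_like)
ppl_human_like = perplexity(human_like)
print(f"model-like: {ppl_model_like:.2f}, human-like: {ppl_human_like:.2f}")
```

Two improbable tokens out of five are enough to push the second sequence's perplexity well above the first, which is the gap a perplexity-based detector thresholds on.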
This is the signal that GPTZero was originally built on, and it remains the foundation of most AI detection systems. However, perplexity alone is insufficient because some human writing is naturally low-perplexity — technical documentation, legal writing, and formulaic content all follow predictable patterns.
Burstiness captures the variance in sentence-level perplexity across a text. Human writers are inherently bursty — we alternate between flowing, predictable passages and labored, unconventional ones. One paragraph might be crisp and formulaic, the next might be wandering and exploratory. This variance is a natural byproduct of how humans think and write.
AI-generated text shows unnaturally uniform perplexity across sentences. Even when an AI model produces varied content, the sentence-to-sentence perplexity variance is typically lower than in human writing. This consistency is a strong signal because it is difficult to eliminate through simple post-processing or rephrasing.
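A minimal sketch of a burstiness score, assuming per-sentence perplexities have already been computed by some scoring model. Using the population variance is one reasonable choice here, not a standard that every detector follows, and the sample values are hypothetical.

```python
import statistics

def burstiness(sentence_perplexities):
    """Population variance of per-sentence perplexity scores."""
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences to measure variance")
    return statistics.pvariance(sentence_perplexities)

# Hypothetical per-sentence perplexities for two passages.
uniform_passage = [12.0, 13.5, 12.8, 13.1, 12.4]   # steady predictability
varied_passage = [8.0, 41.0, 15.5, 63.0, 11.2]     # swings between extremes

print(burstiness(uniform_passage), burstiness(varied_passage))
```

The uniform passage scores close to zero while the varied one scores far higher, mirroring the contrast described above.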
Several lexical metrics collectively measure how diverse the vocabulary in a text is. Type-token ratio (the number of unique words divided by total words), hapax legomenon rate (the share of words that appear exactly once), and n-gram entropy each contribute to the overall picture.
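The three metrics can be sketched in a few lines. The tokenizer is a deliberately crude assumption (lowercased runs of letters and apostrophes), and the hapax rate here divides by total tokens, which is one of several definitions in use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Crude word tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Unique words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def hapax_rate(tokens):
    """Words occurring exactly once, as a share of total words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)

def ngram_entropy(tokens, n=2):
    """Shannon entropy (in bits) of the observed n-gram distribution."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    total = len(ngrams)
    counts = Counter(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample_tokens = tokenize("The cat sat on the mat")
print(type_token_ratio(sample_tokens), hapax_rate(sample_tokens),
      ngram_entropy(sample_tokens))
```

All three metrics depend on text length, so they are only meaningful when comparing samples of similar size.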
AI-generated text tends toward lower vocabulary diversity and elevated frequency of specific transition phrases like "furthermore," "moreover," "it is important to note," and "in conclusion." These phrases appear at statistically unusual frequencies in AI output compared to human writing across equivalent content types.
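Counting these phrases is simple to sketch. The phrase list below contains only the four examples named above, and the function name is hypothetical. A real detector would presumably use a much larger list and compare the rate against a human baseline for the same content type.

```python
import re

# Only the four phrases named in the text; a real list would be longer.
TRANSITION_PHRASES = (
    "furthermore",
    "moreover",
    "it is important to note",
    "in conclusion",
)

def transition_rate_per_1000(text, phrases=TRANSITION_PHRASES):
    """Occurrences of the listed phrases per 1,000 words of text."""
    lowered = text.lower()
    word_count = len(re.findall(r"\b\w+\b", lowered))
    if word_count == 0:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )
    return 1000.0 * hits / word_count

demo = "Furthermore, the results hold. Moreover, it is important to note the limits."
print(transition_rate_per_1000(demo))
```

The raw rate says little on its own. It becomes a signal only when it is unusually high relative to human writing of the same type and length.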
I find this signal fascinating because it reflects a fundamental difference in how AI models and humans select words. Humans draw from their full vocabulary including rare and domain-specific terms. AI models sample from learned distributions that favor common, high-probability tokens. Even with high temperature settings, models gravitate toward a narrower band of "safe" word choices than a typical human writer uses.
The practical implication is that vocabulary entropy is particularly useful for detecting AI text in domains where humans naturally use specialized terminology — medical writing, legal documents, academic research. In these fields, the gap between human vocabulary diversity and AI vocabulary diversity is largest, making detection more reliable. Conversely, detection is hardest in domains where even humans write in formulaic, low-diversity language — form letters, standardized reports, template-driven content.
The most advanced AI detectors go beyond generic statistical analysis and maintain model-specific fingerprinting. Different AI models have identifiable quirks in their output distributions. Our testing shows that GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 each produce text with distinct statistical profiles.
Originality.ai appears to maintain per-model classifiers, which explains its consistently high accuracy across different AI sources. In our benchmark, Originality.ai's per-model accuracy ranged from 86% (Llama) to 94% (GPT-4o) — a pattern consistent with specialized rather than generic classification.
Model fingerprinting works because each model's training data, architecture, and RLHF tuning create unique token probability distributions. GPT-4o has a measurable preference for certain sentence-opening patterns. Claude shows distinctive hedging behavior. Gemini favors specific organizational structures. These fingerprints are subtle — invisible to human readers — but statistically significant across samples of 200+ words. The challenge is that fingerprints shift with each model update, requiring detectors to continuously retrain against the latest model versions.
No single signal is sufficient for reliable detection. The best commercial detectors use ensemble classification — multiple models, each trained on a different signal, whose outputs are combined through a meta-classifier. This is why the top-performing detectors in our benchmark consistently outperform any single-signal approach.
The combination also reduces false positives. A text that scores low on perplexity alone might be legitimate technical writing. But a text that scores low on perplexity, low on burstiness, and low on vocabulary entropy is far more likely to be AI-generated. When the signals are reasonably independent, each corroborating one multiplies the odds rather than merely adding to them, which is why multi-signal detectors achieve 90%+ accuracy while single-signal approaches plateau around 75-80%.
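The compounding effect of corroborating signals can be illustrated with a naive-Bayes style combination. This is a simplification: commercial detectors use trained meta-classifiers, and real signals are correlated, so actual gains are smaller. The likelihood ratio of 3 per signal is a made-up value for illustration.

```python
import math

def combine_signals(prior_prob_ai, likelihood_ratios):
    """Posterior P(AI) from a prior and per-signal likelihood ratios.

    Assumes the signals are conditionally independent, so their
    likelihood ratios multiply (equivalently, their logs add).
    """
    log_odds = math.log(prior_prob_ai / (1.0 - prior_prob_ai))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical: each signal alone makes "AI" three times as likely as "human".
one_signal = combine_signals(0.5, [3.0])               # 0.75
two_signals = combine_signals(0.5, [3.0, 3.0])         # 0.90
three_signals = combine_signals(0.5, [3.0, 3.0, 3.0])  # about 0.964
print(one_signal, two_signals, three_signals)
```

Starting from even odds, one weak signal gives 75% confidence, while three agreeing signals give roughly 96%, because the odds go from 3:1 to 27:1.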
Statistical detection can be defeated by humanization tools that rephrase AI-generated text to alter its statistical profile. These tools work by introducing the variance, vocabulary diversity, and perplexity patterns that detectors look for in human writing. The most effective humanizers can reduce detection accuracy by 15-25 percentage points.
Humanizers exploit a fundamental asymmetry: it is easier to add noise to a signal than to extract it. A humanizer injects enough burstiness, vocabulary variation, and structural irregularity to push the text past a detector's threshold, but the underlying content and reasoning structure often remain AI-derived. This creates a cat-and-mouse dynamic where detectors improve, humanizers adapt, and the cycle continues.
The most bypass-resistant approaches combine statistical detection with provenance-based methods: C2PA content credentials that embed origin information at creation time, or cryptographic watermarking like Google's SynthID that embeds invisible signals in AI output. Watermarks of this kind are harder to strip than statistical patterns because the signal is woven into the token selection process itself, not layered on top of the output. See our state of AI content report for more on this arms race.
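To show what "woven into the token selection process" can mean, here is a toy detector for a simplified "green-list" watermark of the kind described in the research literature. This is not SynthID's algorithm, and the key, hash construction, and function names are all assumptions made for illustration. The idea: a secret key and the previous token pseudorandomly mark a fraction of the vocabulary as "green", the generator favors green tokens, and the detector counts how many green tokens appear compared with chance.

```python
import hashlib
import math

def is_green(prev_token, token, key="demo-key", gamma=0.5):
    """Pseudorandomly assign `token` to the green list, seeded by key and context."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < gamma

def z_score(green_count, scored, gamma=0.5):
    """Standard deviations by which the green count exceeds chance."""
    return (green_count - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

def watermark_z_score(tokens, key="demo-key", gamma=0.5):
    """Score a token sequence; every token after the first is tested."""
    scored = len(tokens) - 1
    if scored < 1:
        raise ValueError("need at least two tokens")
    green = sum(
        is_green(tokens[i - 1], tokens[i], key, gamma)
        for i in range(1, len(tokens))
    )
    return z_score(green, scored, gamma)
```

Unwatermarked text should land near a z-score of zero, while text whose tokens were chosen to be green scores far above it. Rephrasing changes individual tokens but has to change many of them to pull the score back down, which is why this kind of signal is harder to strip.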
Manual paraphrasing also reduces detectability, though it requires significant effort. Our testing found that a skilled human editor can reduce a 90% AI-confidence score to below 50% with approximately 20 minutes of rewriting per 500 words — effective but not scalable, which is why automated humanizers dominate the evasion landscape.
Beyond the four core signals, several newer approaches are entering the detection landscape. Retrieval-based detection compares submitted text against known AI training data and common AI output patterns stored in large databases. Stylometric analysis goes deeper than vocabulary entropy, analyzing syntactic structure, rhetorical patterns, and discourse-level organization.
The most promising long-term approach may be classifier-free detection using the AI models themselves. If you have access to the suspected source model, you can measure how closely a text matches that model's output distribution directly — without training a separate classifier. This approach has shown accuracy above 95% in research settings, though it requires knowing or guessing which model generated the text, and it does not scale easily to commercial deployment.
Perplexity measures how predictable text is to a language model. AI-generated text typically has low perplexity because it follows learned probability distributions. Human text is less predictable, producing higher perplexity scores. It is calculated as the exponentiated average negative log-likelihood of the token sequence.
Burstiness measures the variance in sentence-level predictability. Human writers naturally alternate between predictable and unpredictable passages — one paragraph might be formulaic while the next is creative and unexpected. AI text is more uniform in its predictability across sentences, producing a measurably lower burstiness score.
Yes, AI detection can be bypassed. AI humanizer tools can reduce detection accuracy by 15-25 percentage points by altering the statistical signatures that detectors rely on. Manual paraphrasing also reduces detectability but requires significant time per passage. Provenance-based methods like watermarking are more resistant to these bypass techniques.
No single method is fully reliable. The best detectors combine multiple signals — perplexity, burstiness, vocabulary entropy, and model fingerprinting — in an ensemble approach that achieves 90%+ accuracy. For long-term reliability, provenance-based methods like watermarking may surpass statistical detection because they are harder to strip from the output.
Most detectors need at least 200-300 words for reliable classification. Below that threshold, there is not enough text for statistical signals to emerge clearly. Some detectors claim sentence-level detection, but accuracy drops significantly on passages shorter than 100 words. For high-stakes decisions, longer samples produce more reliable results.
AI detection works by exploiting measurable statistical differences between human and AI-generated text. Perplexity, burstiness, vocabulary entropy, and model fingerprinting each contribute a signal, and the best detectors combine all four in ensemble classifiers that achieve accuracy above 90%. Understanding how these signals work helps you interpret detection results with appropriate confidence — and appropriate skepticism. No statistical method is foolproof, and the arms race between detectors and humanizers will continue, but the underlying math is sound and improving. For tool recommendations based on these methods, see our AI detector comparison.