Hallucination is the polite industry term for the moments when a large language model generates information that sounds authoritative and is completely fabricated. Most compliance professionals have heard the word. Fewer have been told why it happens, or why due diligence work on a named individual is the application that suffers most when it does.
The conclusion is uncomfortable for anyone hoping consumer AI will mature into a defensible EDD tool: hallucination is not a bug to be patched in the next model release. It is a property of how LLMs produce language. The only defence against it is architectural — and it has to be built into the platform, not prompted out of the model.
01 Why LLMs hallucinate
A large language model does not look things up. It does not reason from facts stored in a database. It predicts what word or phrase is most likely to come next, based on patterns absorbed during training on vast quantities of text.
This makes LLMs extraordinarily good at producing fluent, confident prose. It also means they sometimes produce fluent, confident prose that is factually wrong. The model cannot reliably distinguish between something it knows to be true and something that simply sounds true in context. It generates the most statistically plausible continuation — and plausible is not the same as accurate.
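A toy sketch can make the mechanism concrete. The code below is not how a production LLM is built. It is a bigram counter over an invented four-sentence corpus, and the company names and dates are made up for illustration. It shows only the principle described above: the continuation is chosen by frequency, so a fact the model has seen once loses to a pattern it has seen three times.

```python
from collections import Counter, defaultdict

# Invented corpus. The fact about "zenith" appears once; a similar-looking
# pattern about other (hypothetical) companies appears three times.
CORPUS = [
    "alpha director was appointed in 2015",
    "beta director was appointed in 2015",
    "gamma director was appointed in 2015",
    "zenith director was appointed in 2019",
]

def train_bigrams(sentences):
    """Count which token follows which across the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def complete(counts, prompt, max_tokens=10):
    """Greedily append the most frequent next token until none is known."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        prev = tokens[-1]
        if prev not in counts:
            break
        tokens.append(counts[prev].most_common(1)[0][0])
    return " ".join(tokens)

counts = train_bigrams(CORPUS)
print(complete(counts, "zenith director was appointed in"))
```

The corpus contains the correct date for the zenith appointment, yet the completion ends in 2015 because that is the more common continuation of "in". Real models condition on far more context than one previous token, so this is a caricature, but the failure has the same shape: plausible by frequency, wrong on the facts.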
Example
The cleanest way to see this is to ask an LLM something a primary-school child can verify. Try it in any consumer chatbot today:
“How many vowels in unconscientiousness?”
The model will confidently produce a number. It is not counting letters; it is predicting what answer follows that question shape. To answer correctly, the model would have to inspect the word character by character, but it processes text as multi-character tokens, and generation does not work by inspection. The correct answer is eight, but you may well not get it.
“What is 8,294 × 13,617?”
The model will produce a number that looks like a multiplication result: correct order of magnitude, correct number of digits, often correct first and last digits. It is approximating the shape of the answer because its training data contained a great many worked multiplications and it has learned what they look like. The exact product is 112,939,398; the model's answer may or may not match it.
These failures are not edge cases — they are the architecture working as designed. The model is doing the same thing it always does: producing the most plausible continuation. The reason the failures are visible here is that the reader can verify them in seconds. In an EDD report — where the equivalent “answer” is a directorship date, a sanction reference number, or a UBO percentage — the reader cannot verify in seconds. The same mechanism is operating; it is just harder to catch.
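For contrast, both questions are trivial for deterministic code, which inspects the input instead of predicting an answer. A minimal sketch follows; the function name is ours.

```python
def count_vowels(word: str) -> int:
    """Count a, e, i, o, u by inspecting each character in turn."""
    return sum(1 for ch in word.lower() if ch in "aeiou")

print(count_vowels("unconscientiousness"))  # 8
print(8294 * 13617)                         # 112939398
```

Nothing here is clever. The point is that the result comes from examining the actual characters and digits, which is exactly the step generation skips.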
The research literature supports this reading. A comprehensive survey of hallucination research published in October 2025 concludes that the phenomenon is rooted in how these models work, not in any particular implementation. A paper from Temple University goes further, arguing that hallucination should not even be considered an engineering defect: under the “open world assumption” of any real-world deployment, it becomes “a structural feature to be tolerated” rather than a problem to be solved. Frontier labs acknowledge this in their own product documentation. OpenAI tells users that Deep Research “occasionally makes factual hallucinations or incorrect inferences” and may “reference rumors.” Google advises users of Gemini Deep Research to “always check that cited sources actually exist and support claims.” These are not warnings about edge cases; they are warnings about a property of the technology.
02 The scale of the problem — measured in the last 90 days
The clearest signal of where the technology stands sits in the courts that have been forced to deal with its output. In April 2026, Sullivan & Cromwell — one of the most prestigious law firms in the world, whose restructuring partners reportedly charge $2,000 an hour — wrote to the Chief Bankruptcy Judge for the Southern District of New York to apologise for an emergency motion containing AI-hallucinated citations. The list of corrections ran to three single-spaced pages. The firm noted in its letter that its internal review process “did not identify the inaccurate citations generated by AI.”
The same month, a federal judge in Oregon imposed $110,000 in sanctions on two lawyers who filed briefs containing 15 fabricated cases and 8 fabricated quotations — the largest single AI hallucination penalty on record. The Sixth Circuit Court of Appeals imposed a $30,000 sanction on two Tennessee attorneys for the same offence and dismissed the case entirely. Across U.S. courts in Q1 2026, total AI hallucination sanctions exceeded $145,000 — more than every previous year combined. In March 2026, a 30-year federal prosecutor resigned over hallucinated citations, telling the judge it was “the worst decision I’ve ever made.”
Researcher Damien Charlotin’s database at HEC Paris has now catalogued well over a thousand cases of AI-hallucinated content in court filings, with the pace recently reaching “ten cases from ten different courts on a single day.” If sophisticated professionals — operating in a context where the cost of being wrong is career-ending — are missing fabricated sources at this rate, the assumption that an EDD analyst reviewing a multi-page deep research report will catch every fabrication does not survive contact with the evidence.
03 Why EDD is uniquely exposed
Three features of due diligence work make it the worst possible application for a technology that hallucinates.
The facts that matter most are the most obscure.
Common knowledge is reinforced across training data and is less likely to be fabricated. The information that drives EDD decisions — specific directorship dates, regulatory action numbers, corporate relationship histories, adverse media in local-language press — is precisely the kind of low-frequency, high-specificity information where hallucination risk is highest. The model has seen these facts rarely or not at all, so it fills the gap with what such a fact would plausibly look like.
The consequences are asymmetric.
A hallucinated directorship creates a false positive that harms an innocent party. A fabricated enforcement action prompts a firm to refuse legitimate business. A hallucinated absence of risk — a clean-looking report on a subject who is, in fact, sanctioned — produces precisely the outcome AML regulation exists to prevent. There is no symmetric upside.
Citations are not verification.
In April 2026, a University of Pennsylvania study by Rao, Wong, and Callison-Burch became the first to systematically measure citation URL validity across commercial LLMs and deep research agents — testing 10 models on 53,090 URLs from DRBench and 3 models on 168,021 URLs across 32 academic fields from ExpertQA. The finding directly relevant to EDD work: between 3% and 13% of citation URLs returned by these tools are hallucinated — meaning they have no record in the Wayback Machine and likely never existed. Between 5% and 18% of citations are non-resolving overall. Critically for compliance work, the paper found that “deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates.” The more thorough the report looks, the more fabricated citations it tends to contain.
04 Why deep research multiplies the exposure
The Penn finding matters for more than its headline rates. Deep research agents hallucinate URLs at higher rates than search-augmented LLMs while generating substantially more citations per query. A higher per-citation error rate, multiplied across more citations, is what turns a small per-citation risk into a near-certainty at the level of the whole report.
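The arithmetic behind that compounding can be sketched in a few lines. The 3% and 13% rates are the range reported in the Penn study. The citation counts and the assumption that each citation fails independently are ours, chosen for illustration only; real reports may cluster errors differently.

```python
def p_at_least_one(per_citation_rate: float, n_citations: int) -> float:
    """Probability that a report with n citations contains at least one
    hallucinated URL, assuming citations fail independently (a simplification)."""
    return 1.0 - (1.0 - per_citation_rate) ** n_citations

def expected_fabricated(per_citation_rate: float, n_citations: int) -> float:
    """Expected number of hallucinated URLs in the report."""
    return per_citation_rate * n_citations

# Rates from the study's reported range; citation counts are illustrative.
for rate in (0.03, 0.13):
    for n in (10, 50, 100):
        print(
            f"rate={rate:.0%} citations={n:>3} "
            f"P(at least one)={p_at_least_one(rate, n):.1%} "
            f"expected={expected_fabricated(rate, n):.1f}"
        )
```

Under these assumptions, even the low end of the range gives roughly a 78% chance that a 50-citation report contains at least one fabricated URL, and at the high end the chance is above 99%.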
This is the inversion that makes the argument structural rather than incidental. Missing a finding is a failure of coverage. Fabricating a finding is an act of harm. Compliance teams are trained to think about the first risk and have far less institutional muscle for the second — because manual research, whatever else it gets wrong, does not invent sources from a probability distribution.
05 The only architectural defence
Hallucination occurs because the model synthesises from compressed memory rather than reasoning from source documents. By the time the LLM produces a sentence, the page it nominally relies on has been compressed, summarised, and merged with dozens of others into its working context. The model is not reading the page when it writes the sentence. It is generating the sentence that seems plausible given the residual signal of what it absorbed.
The defence is extraction. Each source document is processed individually. The model is asked the narrow question — “what does this specific document say about this subject?” — rather than the synthetic question — “what do you know about this subject?” Every statement in the report is tied to the source it came from. The source is archived at the time of retrieval so it can be independently verified.
This is a design choice, not a prompt engineering trick. It cannot be achieved by being more careful with how you talk to ChatGPT. It requires processing each page separately, linking every claim to its source, and preserving the source as it existed at the moment of investigation.
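The shape of such a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any vendor's implementation. Every name in it is hypothetical, `naive_extractor` is a rule-based placeholder for the narrow per-document model call, and the verbatim-substring grounding check is one strict choice among several possible.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List

@dataclass(frozen=True)
class ArchivedSource:
    url: str
    text: str
    retrieved_at: str  # UTC timestamp recorded at retrieval
    sha256: str        # hash of the text, so the snapshot can be re-verified

@dataclass(frozen=True)
class Claim:
    statement: str
    source: ArchivedSource  # every claim carries the source it came from

def archive(url: str, text: str) -> ArchivedSource:
    """Freeze a retrieved page as it existed at the moment of retrieval."""
    return ArchivedSource(
        url=url,
        text=text,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )

def naive_extractor(text: str, subject: str) -> List[str]:
    """Placeholder for the per-document question: what does THIS document
    say about the subject? Here: sentences that mention the subject."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if subject.lower() in s.lower()]

def extract_claims(
    source: ArchivedSource,
    subject: str,
    extractor: Callable[[str, str], List[str]] = naive_extractor,
) -> List[Claim]:
    """Keep a statement only if the archived text actually contains it."""
    return [
        Claim(statement, source)
        for statement in extractor(source.text, subject)
        if statement in source.text  # grounding check (assumed: verbatim match)
    ]

def build_report(sources, subject, extractor=naive_extractor) -> List[Claim]:
    """Process each document separately; nothing is merged before extraction."""
    report: List[Claim] = []
    for source in sources:
        report.extend(extract_claims(source, subject, extractor))
    return report

# Illustrative data only; the URLs and the subject are invented.
sources = [
    archive("https://example.com/registry",
            "Jane Doe was appointed director of Example Ltd. Accounts were filed late."),
    archive("https://example.com/news", "Local council approves new road."),
]
for claim in build_report(sources, "Jane Doe"):
    print(claim.statement, "|", claim.source.url, "|", claim.source.sha256[:12])
```

The design choice worth noticing is that the citation is never written by the extractor. It is the `ArchivedSource` record attached to the claim, so an extractor that invents a statement fails the grounding check and contributes nothing to the report.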
Every page DeepDive retrieves is processed individually. Every statement in the report is drawn from a specific source document, with that document archived and linkable from the report. The model never synthesises from compressed memory. If a source does not explicitly state a fact, that fact does not appear in the report. Citations are not generated — they are records of what was retrieved.
This is not a better citation system than the one consumer deep research tools provide. It is a categorically different one. A report built only from retrieved, archived sources has no mechanism for producing a fabricated citation, just as a report built on probabilistic generation has no mechanism for ruling one out. The risk is removed at the level of architecture, before any individual analyst, prompt, or model release enters the picture.
References
- Rao, D., Wong, E., Callison-Burch, C. (April 2026). Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents. University of Pennsylvania. arxiv.org/abs/2604.03173
- Zhan, Y. et al. (February 2026). Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory (DeepHalluBench). arxiv.org/abs/2601.22984
- Xu, B. (March 2026). Hallucination is Inevitable for LLMs with the Open World Assumption. Temple University. arxiv.org/abs/2510.05116
- Alansari, A. et al. (October 2025, revised March 2026). Large Language Models Hallucination: A Comprehensive Survey. arxiv.org/abs/2510.06265
- Charlotin, D. (live, updated daily). AI Hallucination Cases Database. HEC Paris — over 1,400 cases catalogued. damiencharlotin.com/hallucinations
- Strom, R. (April 2, 2026). Sanctions ramping up in cases involving AI hallucinations. ABA Journal. abajournal.com
- Bloomberg Law (April 21, 2026). Sullivan & Cromwell Apologizes to Judge for AI Hallucinations. news.bloomberglaw.com
- Robert, A. (April 17, 2026). Oregon federal judge hands down $110,000 penalty for AI errors. ABA Journal. abajournal.com
- Strom, R. (March 11, 2026). Federal prosecutor resigns after AI errors found in court filing. ABA Journal. abajournal.com
- ComplexDiscovery (April 9, 2026). The AI Sanction Wave: $145K in Q1 Penalties Signals Courts Have Lost Patience with GenAI Filing Failures. complexdiscovery.com