Hackers Built an AI-Developed Zero-Day. Google's AI Stopped It.

The First AI-Developed Zero-Day Confirmed in the Wild

Google’s Threat Intelligence Group (GTIG) has documented the first AI-developed zero-day exploit observed in the wild. Cybercriminals used an LLM to discover and weaponize a logic flaw in a popular open-source web administration tool, then planned a mass exploitation campaign around it. Google disrupted the operation before it launched — but only because its own AI agent found the same vulnerability first.

This is a milestone worth taking seriously. For years, “AI will change hacking” has been a talking point used to sell both fear and security products. GTIG’s report is different: it names specific technical evidence, describes exactly how AI was used, and explains why this particular vulnerability class is one that LLMs are structurally well-suited to exploit.

What the Vulnerability Actually Was

The target was a two-factor authentication bypass in an unnamed open-source web admin tool. The vulnerability was not a memory corruption bug, a buffer overflow, or an injection flaw. It was what GTIG calls a semantic logic flaw: a hardcoded exception in the 2FA enforcement code where the developer made an incorrect trust assumption.

In concrete terms: somewhere in the authentication logic, there is a condition that says “skip 2FA for this case” — and that assumption turned out to be exploitable. An attacker with valid credentials (obtained separately, via phishing or credential stuffing) could bypass the second factor entirely.
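
To make the pattern concrete, here is a minimal, hypothetical sketch of this kind of flaw. The affected code is not public, so every name below (`LoginRequest`, `requires_second_factor`, the `X-Internal-Client` header) is invented for illustration. It shows the general shape of a hardcoded trust exception, not the actual vulnerability.

```python
from dataclasses import dataclass, field


@dataclass
class LoginRequest:
    """Hypothetical login request; all field names are illustrative."""
    username: str
    password_ok: bool                  # first factor already verified
    otp_ok: bool = False               # second factor verified
    headers: dict = field(default_factory=dict)


def requires_second_factor(req: LoginRequest) -> bool:
    # Intent: enforce 2FA for all external logins.
    # Flaw (assumed for this sketch): the "internal client" exception
    # trusts a header that any caller can set, so the carve-out also
    # covers external requests.
    if req.headers.get("X-Internal-Client") == "1":
        return False
    return True


def login(req: LoginRequest) -> bool:
    if not req.password_ok:
        return False
    if requires_second_factor(req) and not req.otp_ok:
        return False
    return True


# Valid password, no second factor: rejected as intended.
honest = LoginRequest("alice", password_ok=True)
# Same credentials plus an attacker-supplied header: 2FA is skipped.
forged = LoginRequest("alice", password_ok=True,
                      headers={"X-Internal-Client": "1"})
print(login(honest), login(forged))
```

Nothing here crashes and no dangerous function is called. The bug exists only in the gap between the comment stating the intent and the condition that implements it.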

Traditional security tools are not built to catch this. Fuzzers look for crashes. Static analyzers look for dangerous function calls and input paths. Neither can reason about developer intent: “did this if statement correctly capture the security boundary the developer was trying to enforce?”

Why LLMs Find Logic Flaws That Tools Miss

GTIG’s report makes a precise claim about LLM capability that is worth quoting directly:

“Frontier LLMs excel at identifying these types of high-level flaws and hardcoded static anomalies… they have an increasing ability to perform contextual reasoning, effectively reading the developer’s intent to correlate the 2FA enforcement logic with the contradictions of its hardcoded exceptions.”

LLMs read code the way a senior engineer reads code — as language carrying intent, not just as tokens feeding into pattern-match rules. When a comment says “enforce 2FA for all external logins” and the code three lines below creates an exception for a broad class of requests, an LLM notices the contradiction. A fuzzer never even looks at the comment.

This capability gap has real implications. Every codebase has trust assumptions. Every authentication system has edge cases. Every access control implementation has a history of small carve-outs added as the product evolved. These are exactly the places a reasoning model will look — and where a human reviewer, skimming a 2,000-line file at code review, will not.

The Fingerprints of an LLM-Written Exploit

GTIG identified the AI attribution not through network forensics but through the exploit code itself. The script had characteristics consistent with LLM output:

Educational docstrings throughout: the kind of explanatory comments an LLM adds by default to demonstrate understanding
A hallucinated CVSS score embedded in the documentation: a precise score the AI invented, as if filling in expected fields
Clean, textbook-style Python: a structured _C ANSI color class, well-organized help menus, and the kind of consistent formatting that reflects training data rather than a hurried human exploit author
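
A rough sketch of how those three traits could be turned into reviewable signals follows. GTIG has not published its detection method, so the function name `llm_style_signals`, both regular expressions, and the sample script are assumptions made for illustration. No single signal proves authorship; careful human authors also write docstrings.

```python
import ast
import re

# Assumed patterns: a CVSS score written in prose, and a literal ANSI
# escape sequence such as "\033[91m" appearing in source text.
CVSS_RE = re.compile(
    r"CVSS(?:\s*v?\d\.\d)?\s*(?:base\s*)?(?:score)?\s*[:=]?\s*(\d{1,2}\.\d)\b",
    re.I,
)
ANSI_RE = re.compile(r"\\(?:033|x1b)\[[0-9;]*m", re.I)


def llm_style_signals(source: str) -> dict:
    """Collect stylistic signals from Python source (heuristic only)."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    documented = sum(1 for f in funcs if ast.get_docstring(f))
    # Classes whose body contains ANSI escape codes (color helper classes).
    ansi_classes = [
        n.name for n in ast.walk(tree)
        if isinstance(n, ast.ClassDef)
        and ANSI_RE.search(ast.get_source_segment(source, n) or "")
    ]
    return {
        "docstring_ratio": documented / len(funcs) if funcs else 0.0,
        "cvss_mentions": CVSS_RE.findall(source),
        "ansi_color_classes": ansi_classes,
    }


# Harmless stand-in script that mimics the described style.
SAMPLE = r'''
"""Demo tool. Severity: CVSS 9.8 (Critical)."""

class _C:
    RED = "\033[91m"
    END = "\033[0m"

def banner():
    """Print the banner so the user knows the tool started."""
    print(_C.RED + "demo" + _C.END)

def main():
    """Entry point: parse arguments and run."""
    banner()
'''

signals = llm_style_signals(SAMPLE)
print(signals)
```

In practice these values would feed a human triage queue or a trained classifier rather than a hard rule, since each one is easy to remove from a script.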

GTIG was clear that it does not believe Google’s own Gemini was used. The high-confidence assessment is that some LLM was used — the fingerprints are diagnostic of the output style of current-generation models regardless of which one.

This is actually useful for defenders. If AI-written exploit code has a recognizable style, detection models can be trained on it. One of GTIG’s recommendations is that LLM providers build signal logic to analyze API usage patterns — looking for clients that are querying code-analysis capabilities in ways consistent with vulnerability research on production systems.

Big Sleep Got There First

Big Sleep is Google’s own AI agent, built to proactively hunt for unknown vulnerabilities in production software. It independently found the same 2FA logic flaw that the AI-developed zero-day was built to exploit, before the criminal operation could deploy its campaign at scale. GTIG then worked with the vendor to responsibly disclose and patch the issue, cutting off the planned mass exploitation campaign.

The race is now explicit: which AI agent reaches a given vulnerability first — the attacker’s or the defender’s? Big Sleep’s companion tool, CodeMender, can automatically generate and apply fixes for critical vulnerabilities once they are found. The ambition is a closed loop: find the flaw with AI, patch it with AI, before human operators on either side have finished their morning standup.

We are not there yet across the ecosystem. But attackers and defenders are both moving toward it at the same time.

What This Means for the Products We Ship

Semantic logic flaws are not exotic. They are everywhere codebases have grown over time: authentication flows with accumulated exceptions, permission checks that made sense in v1 but were never updated when the data model changed, session handling with edge cases that pre-date the current security policy.

The lesson from this case is not “adopt AI security tooling immediately” — though that may be right for some teams. The lesson is that the vulnerability class LLMs are best at finding is the same class that human code review is worst at catching consistently. That gap is now being exploited in the wild.

For the products we build and ship at Dracode, this changes how we think about security review at the handoff stage. Static analysis and dependency scanning remain table stakes. But reasoning about trust assumptions and logic boundaries — the kind of review that requires actually understanding what a function is supposed to do — now needs to be explicit, structured, and ideally AI-assisted on both sides of the find-and-fix cycle.
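
One way to make that review explicit is to write the trust assumption down as an executable invariant: no combination of client-controlled input may produce a login without a verified second factor. The sketch below is an illustration under assumptions, not our actual tooling. `authenticate` is a hypothetical stand-in for the function under review, and `CANDIDATE_HEADERS` is an invented list of attacker-controllable inputs.

```python
def authenticate(password_ok: bool, otp_ok: bool, headers: dict) -> bool:
    """Hypothetical function under review, with a legacy carve-out."""
    if not password_ok:
        return False
    # Assumed carve-out: "internal network" requests skip 2FA, but the
    # address is read from a header the client controls.
    if headers.get("X-Forwarded-For", "").startswith("10."):
        return True
    return otp_ok


# Invented examples of inputs an external client can set freely.
CANDIDATE_HEADERS = [
    {},
    {"X-Forwarded-For": "10.0.0.5"},
    {"X-Forwarded-For": "203.0.113.7"},
    {"User-Agent": "internal-healthcheck"},
]


def find_2fa_bypasses(auth, candidate_headers):
    """Return each header set that logs in with a password but no OTP."""
    return [
        headers for headers in candidate_headers
        if auth(password_ok=True, otp_ok=False, headers=headers)
    ]


bypasses = find_2fa_bypasses(authenticate, CANDIDATE_HEADERS)
print(bypasses)
```

A check like this only covers the inputs someone thought to list, which is where an AI-assisted reviewer can help: proposing candidate inputs from reading the code, while the invariant itself stays fixed and human-owned.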

The Google case also illustrates something about defense posture: the team that found the flaw first won. That advantage came from running a proactive AI agent against their own targets. For teams shipping production software, the analog is clear — run the analysis on your own code before someone else does.

Sources

GTIG AI Threat Tracker: Adversaries Leverage AI for Vulnerability Exploitation, Augmented Operations, and Initial Access — Google Cloud Blog / GTIG, May 2026
Google stopped a zero-day hack that it says was developed with AI — The Verge, May 11 2026
Google says criminals used AI-built zero-day in planned mass hack spree — The Register, May 11 2026
Hackers Observed Using AI to Develop Zero-Day for the First Time — Infosecurity Magazine, May 11 2026