Code Metal recently raised $125 million to rewrite defense industry code using AI. Google and Microsoft both report that 25–30% of their new code is AI-generated. AWS used AI to modernize 40 million lines of COBOL for Toyota. Microsoft’s CTO predicts that 95% of all code will be AI-generated by 2030. The rewriting of the world’s software is not coming. It is underway.
Anthropic recently built a 100,000-line C compiler using parallel AI agents in two weeks, for under $20,000. It boots Linux and compiles SQLite, PostgreSQL, Redis, and Lua. AI can now produce large-scale software at astonishing speed. But can it prove the compiler correct? Not yet.
No one is formally verifying the result.
Andrej Karpathy described the pattern: “I ‘Accept All’ always, I don’t read the diffs anymore.” When AI code is good enough most of the time, humans stop reviewing carefully. Nearly half of AI-generated code fails basic security tests, and newer, larger models do not generate significantly more secure code than their predecessors. The errors are there. The reviewers are not. Even Karpathy does not trust it: he later outlined a cautious workflow for “code [he] actually care[s] about,” and when he built his own serious project, he hand-coded it.
Consider what happens at scale. A single bug in OpenSSL — Heartbleed — exposed the private communications of millions of users, survived two years of code review, and cost the industry hundreds of millions of dollars to remediate. That was one bug, introduced by one human, in one library. AI is now generating code at a thousand times the speed, across every layer of the software stack, and the defenses we relied on (code review, testing, manual inspection) are the same ones that missed Heartbleed for two years.
The Harvard Business Review recently documented what it calls “workslop”: AI-generated work that looks polished but requires someone downstream to fix it. When that work is a memo, it is annoying. When it is a cryptographic library, it is catastrophic. As AI accelerates the pace of software production, the verification gap does not shrink. It widens. Engineers stop understanding what their systems do. AI outsources not just the writing but the thinking.
The threat extends beyond accidental errors. When AI writes the software, the attack surface shifts: an adversary who can poison training data or compromise the model’s API can inject subtle vulnerabilities into every system that AI touches. These are not hypothetical risks. Supply chain attacks are already among the most damaging in cybersecurity, and AI-generated code creates a new supply chain at a scale that did not previously exist. Traditional code review cannot reliably detect deliberately subtle vulnerabilities, and a determined adversary can study the test suite and plant bugs specifically designed to evade it. A formal specification is the defense: it defines what “correct” means independently of the AI that produced the code. When something breaks, you know exactly which assumption failed, and so does the auditor.
Poor software quality already costs the U.S. economy $2.41 trillion per year, according to a 2022 study by the Consortium for Information & Software Quality. That number was calculated before AI began writing a quarter or more of new code at leading companies. Chris Lattner, the creator of LLVM and Clang, put it bluntly: AI amplifies both good and bad structure. Bad code at AI speed becomes “incomprehensible nightmares.” As AI generates an increasing share of the world’s critical infrastructure (financial systems, medical devices, defense, transportation), unverified code becomes a systemic risk, not just a quality problem.
Most critical software today works, maintained by talented engineers using testing, code review, and hard-won experience. The problem is not that everything is broken. It is that AI is changing the scale and speed of software production faster than our ability to verify it. What works at human pace may not survive AI pace.
It will get worse, and it will get worse fast, unless verification scales with generation.
Testing and proof are complementary. Testing, including property-based testing and fuzzing, is powerful: it catches bugs quickly, cheaply, and often in surprising ways. But testing provides confidence. Proof provides a guarantee. The difference matters, because it is hard to quantify how much confidence testing actually provides. Software can be accompanied by proofs of its correctness, proofs that a machine checks mechanically, with no room for error. When AI makes proof cheap, it becomes the stronger path: one proof covers every possible input, every edge case, every interleaving. A verified cryptographic library is not better engineering. It is a mathematical guarantee.
Consider an example. An AI rewrites a TLS library. The code passes every test. But the specification requires constant-time execution: no branch may depend on secret key material, no memory access pattern may leak information. The AI’s implementation contains a subtle conditional that varies with key bits, a timing side-channel invisible to testing, invisible to code review. A formal proof of constant-time behavior catches it instantly. Without the proof, that vulnerability ships to production. Proving such low-level properties requires verification at the right level of abstraction, which is why the platform must support specialized sublanguages for reasoning about timing, memory layout, and other hardware-level concerns.
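To make the failure mode concrete, here is an illustrative sketch in Lean (all names invented). The first function is an early-exit byte comparison whose running time depends on where the first mismatch occurs; the second always traverses its full input, folding mismatches into a mask instead of branching on the data. Source-level code like this cannot by itself guarantee constant-time execution of the compiled binary, which is exactly why verification at the right level of abstraction is needed; the sketch only shows the shape of the bug.

```lean
-- Leaky: returns at the first mismatch, so timing reveals
-- how long the matching prefix is (hypothetical illustration).
def eqLeaky : List UInt8 → List UInt8 → Bool
  | [], [] => true
  | x :: xs, y :: ys => if x == y then eqLeaky xs ys else false
  | _, _ => false

-- Fixed-traversal variant: scans both inputs fully and
-- accumulates an XOR mask instead of branching on the bytes.
def eqMasked (a b : List UInt8) : Bool :=
  a.length == b.length &&
    ((a.zip b).foldl (fun m p => m ||| (p.1 ^^^ p.2)) (0 : UInt8) == 0)
```

The two functions compute the same Boolean, which is why testing cannot separate them; only a property about execution, not about the result, distinguishes the safe version from the leaky one.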
The Claude C Compiler illustrates the other side: it optimizes for passing tests, not for correctness. It hard-codes values to satisfy the test suite. It will not generalize. Property-based testing would likely catch this particular case, but the general problem remains: for any fixed testing strategy, a sufficiently adversarial system can overfit to it. A proof cannot be gamed. It covers all inputs by construction.
The friction of writing code manually used to force careful design. AI removes that friction, including the beneficial friction. The answer is not to slow AI down. It is to replace human friction with mathematical friction: let AI move fast, but make it prove its work. The new friction is productive: writing specifications and models, defining precisely what “correct” means, designing before generating.
The obstacle has always been cost. Writing proofs by hand was too expensive to apply broadly. AI changes the economics. Proof is becoming practical at scale.
Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. That framing is outdated. When AI can generate verified software as easily as unverified software, verification is no longer a cost. It is a catalyst.
The value is not in verification as an end in itself. It is in what verified delivery enables. Consider a company delivering ML kernels for new hardware. Today, months of each delivery go to testing and qualification. When AI writes the kernel and proves it correct in one pass, that timeline collapses to hours. A provably correct hardware design delivered in weeks rather than the year it currently takes changes the economics of an entire industry.
Verification, testing, and specification have always been the bottleneck, not implementation. Good engineers know what they want to build. They just cannot afford to prove it correct. If that cost drops to near zero, every domain where correctness matters accelerates. Aerospace, automotive, and medical device certification currently takes years of qualification effort. Cloud providers invest similar effort qualifying security-critical services and cryptographic implementations. Verified code generation could collapse that timeline to weeks. Hardware verification, where a single bug can cost hundreds of millions of dollars, benefits equally.
This acceleration requires specifications: precise descriptions of what the software must do. As AI takes over implementation, specification becomes the core engineering discipline. Writing a specification forces clear thinking about what a system must do, what invariants it must maintain, what can go wrong. This is where the real engineering work has always lived. Implementation just used to be louder.
Writing specifications is not always easy, but it is easier than writing the optimized implementation. And a powerful shortcut exists: an inefficient program that is obviously correct can serve as its own specification. The user and the AI co-write a simple model; the AI then writes an efficient version and proves the two equivalent. The hard part shifts from implementation to design. That is the right kind of hard.
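A minimal sketch of this model-as-specification pattern in Lean (names invented for illustration): an obviously correct recursive sum serves as the specification, a tail-recursive version serves as the efficient implementation, and a short induction proves them equal on every input.

```lean
-- The model: obviously correct, written for clarity, not speed.
def sumModel : List Nat → Nat
  | [] => 0
  | x :: xs => x + sumModel xs

-- The implementation: tail-recursive, accumulator-passing.
def sumFast (xs : List Nat) : Nat := go 0 xs
where
  go (acc : Nat) : List Nat → Nat
    | [] => acc
    | x :: xs => go (acc + x) xs

-- The accumulator invariant, proved once by induction.
theorem sumFast_go_eq (acc : Nat) (xs : List Nat) :
    sumFast.go acc xs = acc + sumModel xs := by
  induction xs generalizing acc with
  | nil => simp [sumFast.go, sumModel]
  | cons x xs ih => simp [sumFast.go, sumModel, ih, Nat.add_assoc]

-- The model is the specification: the fast version provably matches it.
theorem sumFast_eq_model (xs : List Nat) : sumFast xs = sumModel xs := by
  simp [sumFast, sumFast_go_eq]
```

The theorem quantifies over every list, so no test suite is needed to cover edge cases: the empty list, overflow-free Nat arithmetic, and arbitrarily long inputs are all covered by the one proof.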
This is not limited to critical systems. Any non-trivial engineering project where bugs are expensive (which is most of them) accelerates when correctness is cheap.
What would a verification platform for the AI era require?
A small, trusted kernel: a few thousand lines of code that check every step of every proof mechanically. Everything else (the AI, the automation, the human guidance) is outside the trust boundary. Independent reimplementations of that kernel, in different languages (Lean, Rust), serve as cross-checks. You do not need to trust a complex AI or solver; you verify the proof independently with a kernel small enough to audit completely. The verification layer must be separate from the AI that generates the code. In a world where AI writes critical software, the verifier is the last line of defense. If the same vendor provides both the AI and the verification, there is a conflict of interest. Independent verification is not a philosophical preference. It is a security architecture requirement. The platform must be open source and controlled by no single vendor.
The platform must be both a programming language and a theorem prover, with code and proofs in one system, with no translation gap. It needs a rich and extensible tactic framework that gives AI structured, incremental feedback: here is the current goal, here are the hypotheses available, here is what changed after each step. AI must control the proof search, not delegate to a black box.
The alternative, push-button solvers that return a binary pass or fail with no intermediate state, gives AI nothing to learn from and no way to guide the search. Worse, proofs that rely on heuristic solvers often break when the solver updates or when developers make small changes to how they write their specifications, even when the changes are logically equivalent. You cannot build a reliable AI pipeline on a foundation that is not reproducible. (I discuss this in detail in a recent Stanford talk.)
It needs the largest possible library of formalized knowledge to build on. Verifying software is mathematics: the same reasoning that proves a theorem in abstract algebra proves that a cryptographic library correctly implements its specification. A platform that serves mathematicians and engineers is not a compromise. It is the recognition that rigorous reasoning is one discipline, whether applied to prime numbers or protocol correctness.
And it needs deep extensibility. Users and AI must be able to write extensions that access the system’s internals, building custom tools, automation, and domain-specific reasoning engines. This is already happening: AI agents build their own proof strategies on top of the platform. The platform adapts to its users, not the other way around.
The AI community has already made its choice. AlphaProof (Google DeepMind), Aristotle (Harmonic), SEED Prover (ByteDance), Axiom, Aleph (Logical Intelligence), and Mistral AI all build on Lean. Every major AI reasoning system that has achieved medal-level performance at the International Mathematical Olympiad used Lean. No competing platform was used by any of them. The future is much larger than today’s early applications.
Lean is backed by Mathlib, the largest coherent body of formalized mathematics ever created: over 200,000 formalized theorems and 750 contributors. Five Fields medalists engage with Lean. The same platform serves mathematicians formalizing theorems and engineers verifying production systems. ACM SIGPLAN recognized this convergence with its 2025 Programming Languages Software Award: “Lean has become the de facto choice for AI-based systems of mathematical reasoning.”
Enterprise teams already use Lean in production: AWS verified its Cedar authorization policy engine, and Microsoft is using Lean to verify its SymCrypt cryptographic library. Over 8,000 GitHub repositories contain Lean code. Over 200,000 users have installed the programming environment. More than 700 people are active in the Lean Zulip channel every day. Research groups worldwide contribute to the ecosystem. As Chris Lattner observed, manual rewrites and translation work are becoming AI-native tasks. AI will rewrite the world’s codebase. The platform it does so on matters enormously.
Lean already produces performance comparable to Haskell and OCaml. When higher performance is essential, Lean models can be translated into efficient imperative code embedded in Lean, with clean semantics and without C’s undefined behavior. We are actively working on closing the remaining gap for performance-critical code. The real comparison is not Lean versus C. It is verified code versus unverified code.
Lean is the result of over twelve years of continuous development. We designed and built every layer from scratch: the trusted kernel, the compiler, the language server, the IDE, the proof automation. The team is 20 people. The community independently chose Lean: mathematicians, AI researchers, and enterprise engineers, all building on the same platform.
At the Lean FRO, Kim Morrison, a Senior Research Software Engineer, recently ran an experiment that went well beyond our expectations. An AI agent converted zlib, a widely used C compression library embedded in countless systems, to Lean, with minimal human guidance. No special tooling was built: the agent was Claude, a general-purpose AI with no special training for theorem proving, used out of the box. The workflow had four steps. First, the AI produced a clean, readable Lean implementation of the zlib compression format, including the DEFLATE algorithm at its core. Second, the Lean version passed the library’s existing test suite, confirming behavioral equivalence. Third, key properties were stated and proved, not as tests, but as mathematical theorems. The capstone theorem:
theorem zlib_decompressSingle_compress (data : ByteArray) (level : UInt8)
    (hsize : data.size < 1024 * 1024 * 1024) :
    ZlibDecode.decompressSingle (ZlibEncode.compress data level) = .ok data
This is a machine-checked proof that decompressing a compressed buffer always returns the original data, at every compression level, for the full zlib format. The size precondition (hsize) is a proof engineering artifact: it bounds the fuel the AI used during proof construction. A human engineer would likely eliminate it; the AI, working without guidance, chose a bound that was sufficient to close the proof. As both AI capabilities and platform tooling improve, artifacts like this will disappear. The raw DEFLATE and gzip framing roundtrips are proved as well. Fourth, an optimized version is being developed and proved equivalent to the verified model. This work is ongoing.
The entire proof was constructed with minimal human guidance and no specialized tooling. The result demonstrates that AI can convert production software to a verified form today. This was not expected to be possible yet.
zlib is a sequential algorithm with a clean RFC specification, simpler than a database engine or a distributed system. The gap between this result and a verified software stack is real. But until recently, no one had demonstrated AI-generated proofs about production software. The trajectory matters more than the starting point. Crucially, the zlib proof was produced by a general-purpose AI with no specialized training for theorem proving. No custom model was needed. This means the barrier to verified software is no longer AI capability. It is platform readiness. As general-purpose AI improves, the bottleneck shifts entirely to the verification platform: how rich is the feedback it gives AI, how powerful is the automation, how large is the library of prior knowledge to build on.
The zlib workflow also reveals something testing structurally cannot do. When your tests and your code share the same wrong assumptions, testing finds nothing. The act of proving forces every assumption to be explicit: a designer states a property, the AI turns it into a precise formal statement, and then either constructs a proof or discovers that the specification needs revision. Wrong assumptions surface as proof obligations that cannot be discharged. Specifications become sharper through the attempt to prove them.
The approach extends beyond sequential algorithms to distributed systems, where the hardest bugs live. Ilya Sergey’s group at NUS built Veil, a distributed protocol verifier on top of Lean that combines model checking with full formal proof. When a property does not hold, Veil generates concrete counterexamples: actual execution traces that demonstrate the failure, just as model checkers do. When it does hold, Veil produces a full formal proof, not just the absence of counterexamples up to some bound. Veil has verified protocols including Rabia, a randomized consensus protocol, proving agreement and validity for any number of nodes. During verification, the team discovered an inconsistency in a prior formal verification of Rabia that had gone undetected across two separate tools. For simpler protocols, Claude Code can infer most of the required auxiliary invariants within minutes.
The broader picture confirms the trajectory. AI systems using Lean have solved open mathematical problems that resisted decades of human effort. A single mathematician working with an AI agent formalized the full Prime Number Theorem in three weeks: 25,000 lines of Lean, over 1,000 theorems. The previous formalization took over a year and dozens of contributors. Fields Medalist Maryna Viazovska’s proof that the E8 lattice is the optimal sphere packing in eight dimensions has been formally verified in Lean, with an AI agent completing the final steps. Google DeepMind’s AlphaEvolve discovered a new mathematical construction, and AlphaProof formalized the proof in Lean, a complete AI pipeline from discovery to verified proof.
The productivity gap is widening: teams with the best tools are pulling further ahead while others stagnate. Verification will be a decisive advantage.
Where this leads is clear. Layer by layer, the critical software stack will be reconstructed with mathematical proofs built in. The question is not whether this happens, but when.
It is already beginning. The ecosystem is growing. Startups are training AI on proof data, building verification tools, and growing talent. This is phase one: verifying existing code using Lean. Phase two is reconstruction: building the stack in Lean from the ground up, with proofs built in from the start. Both phases happen on the same platform. Both strengthen it.
The target is the foundation of the modern software stack: cryptography, because everything else trusts it. Core libraries (data structures, algorithms, compression) because they are the building blocks of all software. Storage engines like SQLite, embedded in every device on earth. Parsers and protocol implementations (JSON, HTTP, DNS, certificate validation) because every message passes through them. And compilers and runtimes, because they build everything else.
Consider SQLite, embedded in every smartphone, every browser, every major operating system. A verified SQLite carries proofs that a crash during a write cannot corrupt the database, that concurrent readers never see partial transactions. Testing tools like Jepsen and stateful property-based testing find many such bugs in practice, but they provide confidence, not a guarantee. Mathematical proof covers every crash point and every interleaving.
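As an illustration only, here is a toy Lean model (all names hypothetical, far simpler than SQLite’s actual write-ahead log) of the kind of crash-safety statement such a proof would make: no matter which point a crash interrupts, recovery yields one of the two consistent states, never a corrupted mixture.

```lean
-- Toy write-ahead-log model (invented for illustration).
inductive WalState (α : Type) where
  | clean (committed : α)               -- no write in flight
  | pending (committed : α) (next : α)  -- log written, not yet applied

-- Recovery after a crash: an unapplied log entry is replayed.
def recover {α : Type} : WalState α → α
  | .clean c => c
  | .pending _ next => next

-- A crash can land before or after the log write. In either case,
-- recovery produces one of the two consistent database states.
theorem recover_consistent {α : Type} (old new : α) (s : WalState α)
    (h : s = .clean old ∨ s = .pending old new) :
    recover s = old ∨ recover s = new := by
  cases h with
  | inl h => subst h; exact Or.inl rfl
  | inr h => subst h; exact Or.inr rfl
```

A real proof would quantify over every reachable state of the storage engine and every crash point in its execution, which is precisely what testing can only sample.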
Each verified component is a permanent public good. Unlike proprietary software, a verified open-source library cannot be degraded, cannot have its guarantees quietly revoked, and cannot be held hostage by a single company’s business decisions. The proofs are public. Anyone can audit them, build on them, or replace the implementation while preserving the guarantees. This is infrastructure in the deepest sense: it raises the floor for everyone. When you build on this stack, you inherit the proofs. Each layer is independent. The platform and methodology come first; then the work parallelizes naturally.
Once verified components are cheap, you compose them with confidence. Today, integration is where most bugs live: the boundaries between components, the assumptions that do not quite match. Integration testing exists precisely because component tests do not compose: passing unit tests on two components tells you nothing about whether they work together. Verified interfaces are fundamentally different. When each component carries a proof against a shared specification, the composition is guaranteed correct by construction. You can verify each piece independently and know the whole is sound. This advantage scales superlinearly: the larger the system, the wider the gap between tested and verified.
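A minimal Lean sketch of correctness by construction (the components are hypothetical): if each stage of a pipeline carries a roundtrip proof against its interface, the proof for the composed pipeline follows mechanically, with no integration testing.

```lean
-- Two independently verified stages, each with its own roundtrip proof
-- (h₁ and h₂ stand in for the proofs each component ships with).
theorem pipeline_roundtrip {A B C : Type}
    (enc₁ : A → B) (dec₁ : B → A) (h₁ : ∀ a, dec₁ (enc₁ a) = a)
    (enc₂ : B → C) (dec₂ : C → B) (h₂ : ∀ b, dec₂ (enc₂ b) = b)
    (a : A) : dec₁ (dec₂ (enc₂ (enc₁ a))) = a := by
  -- The composed guarantee is two rewrites: no new testing,
  -- no new assumptions, just the component proofs chained together.
  rw [h₂, h₁]
```

Nothing about the interiors of the components matters here; the interface proofs alone determine the correctness of the composition, which is why verified components compose where unit-tested ones do not.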
The goal is a verified software stack: open source, freely available, mathematically guaranteed correct. Developers building critical systems choose verified components the way they choose open-source libraries today, except these carry proofs, not just tests. No single team builds this alone, and no single platform is guaranteed to be the right one forever. What matters is that verified software becomes the norm. If something better than Lean emerges, the mission has still succeeded. It is an open challenge for the entire community.
The role of the engineer changes, but it does not shrink. Engineers spend more time writing specifications and models, designing systems at a higher level of abstraction, defining precisely what systems must do, what invariants they must maintain, what failures they must tolerate. The work becomes more creative, not less: designing before generating, thinking before building. Productivity comes not from generating more code, but from generating code that is provably correct on the first attempt.
Specifications also raise a question that is not purely technical: who decides what “correct” means? A specification for a medical device, a voting system, or an AI safety monitor encodes values, not just logic. Making specifications formal and public does not resolve these questions, but it makes them explicit and auditable in a way that buried code never can. AI itself can bridge the accessibility gap: the same systems that generate verified code can explain specifications in plain language, making formal guarantees accessible beyond the experts who write them.
The AI industry has a strategic stake in this outcome. An AI that generates provably correct code is qualitatively different from one that merely generates plausible code. Verification transforms AI code generation from a productivity tool into a trust infrastructure.
AI is going to write a great deal of the world’s software. It will advance mathematics, science, and engineering in ways we cannot yet anticipate. The question is whether anyone can prove the results correct.
Lean is open source and freely available at lean-lang.org. To learn Lean, start with our documentation. To join the community, visit the Lean Zulip. The Lean Focused Research Organization builds and maintains the platform. If you want to support this work, contact us at contact@lean-fro.org.