Consider this goal:

```lean
example {α} [CommRing α] [IsCharP α 0] [NoNatZeroDivisors α]
    (d t d_inv : α)
    (Δ40 : d * (d + t + d * t) = 0)
    (Δ41 : d^2 * (d + d * t - 2 * d * t^2 + d * t^4 + d^2 * t^4) = 0)
    (_ : d * d_inv = 1) :
    t + 2 * t^2 - t^3 - 2 * t^4 + t^5 = 0 := by
  grind
```
The proof term behind that `grind` call is thousands of lines long: every rewrite, every associativity permutation, every ring normalization step. `grind` produces it in milliseconds using Gröbner basis techniques from the 1960s. Yet AI systems routinely generate 20-step manual proofs for goals like this, because they don't know the tool exists or how to configure it. Training AI to synthesize proof terms is training it to simulate algorithms we already have. It's like training a model to output x86 assembly instead of Python.
AI should learn which tools to invoke and how to configure them. The tools do the mechanical work. And this creates a flywheel:
- Better annotations → `grind` closes more goals automatically
- AI learns to invoke `grind` more effectively → shorter proofs, faster synthesis
- Every goal `grind` solves → the AI learns what annotations enabled it

The bottleneck is annotations: metadata attached to theorems that tell the automation how to use them. `[simp]` marks a lemma for the simplifier. `[grind]` marks it for the `grind` tactic, with optional modifiers for fine-grained control (see the reference manual). `grind_pattern` specifies trigger patterns and constraints for quantifier instantiation.
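For concreteness, a minimal sketch of what these annotations look like in practice (the lemma here is a toy example, not from Mathlib):

```lean
-- Toy lemma marked for the simplifier.
@[simp] theorem two_mul_self (n : Nat) : n + n = 2 * n := by omega

-- The same fact marked for `grind`; the `=` modifier tells `grind`
-- to treat it as a rewrite rule, using the left-hand side as the
-- E-matching pattern.
@[grind =] theorem two_mul_self' (n : Nat) : n + n = 2 * n := by omega
```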
Most mathematicians can write proofs, but they can’t write good annotations. It requires understanding E-matching internals, instantiation cascades, and automation behavior—specialist knowledge that doesn’t scale. Mathlib has 200,000+ theorems. A small fraction have good annotations. Every missing or misconfigured annotation is a proof that grind could have solved but didn’t.
Current AI-for-math benchmarks ask: can AI prove theorems? We’re asking something different: can AI make the theorem prover’s automation better?
## `#grind_lint`

`#grind_lint check` analyzes all theorems annotated for E-matching and reports how many instantiations each triggers. It creates artificial goals to detect long or unbounded instantiation chains—exactly what makes automation fail.
```lean
#grind_lint check                             -- analyze all annotated theorems
#grind_lint check (min:=10) (detailed:=50)    -- configure reporting thresholds
#grind_lint check in Array HashMap            -- restrict to specific namespaces
#grind_lint inspect Array.back?_empty         -- focus on a specific theorem
```

```
info: instantiating `Array.back?_empty` triggers 17 additional `grind` theorem instantiations
info: instantiating `Array.count_singleton` triggers 20 additional `grind` theorem instantiations
```
This is a dense reward signal: not just "proof found" but "how expensive was the search?"
## `grind +suggestions`

`grind` has an extensible suggestion mechanism: it queries an oracle for additional theorems that might help close the goal. The current oracle is heuristic-based. You can replace it with a neural model.
Every invocation gives your model the current goal state, the solver’s internal state (known equalities, disequalities, etc.), and the available theorems in scope. Your model returns a ranked list of theorems to try. Reward: did grind close the goal?
This is a gym environment inside the theorem prover.
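A rough sketch of the shape of that interface (all names below are hypothetical illustrations, not the actual `grind` extension API):

```lean
import Lean
open Lean

-- Hypothetical types for illustration; the real extension API differs.
structure OracleQuery where
  goal          : String       -- pretty-printed goal state
  equalities    : List String  -- equalities known to the solver
  disequalities : List String  -- disequalities known to the solver
  inScope       : List Name    -- candidate theorems visible here

-- An oracle maps the query to a ranked list of theorems to try;
-- a neural model can sit behind this function.
def SuggestionOracle := OracleQuery → IO (List Name)
```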
Every grind invocation can report exactly what happened:
```lean
set_option diagnostics true in
grind
```

```
[thm] E-Matching instances
  [] Array.size_push ↦ 6
  [] HashMap.contains_iff_mem ↦ 3
  [] insert.eq_1 ↦ 1
[split] Case splits ▶
```
For every successful proof, you know which theorems were used and how often. This is labeled data.
Mathlib's existing human-written annotations, spread across its 200,000+ theorems, are your supervised learning baseline. They aren't perfect, but they're a starting point.
Annotations aren’t just “use this theorem.” They include constraints that control when and how quantifiers are instantiated.
Without constraints:
```lean
axiom fMono : x ≤ y → f x ≤ f y

grind_pattern fMono => f x, f y
```
Result: 25 instantiations, combinatorial explosion.
With constraints:
```lean
grind_pattern fMono => f x, f y where
  guard x ≤ y
  x =/= y
```
Result: 3 instantiations, exactly what’s needed.
The constraint language is simple—guards and disequalities. But predicting which constraints to add requires understanding the E-matching algorithm: why does `guard x ≤ y` prevent useless instantiations? Why does `x =/= y` eliminate symmetric redundancy? (Illustratively: with five ground terms of the form `f t` in the goal, the unconstrained pattern pair yields 5 × 5 = 25 candidate instantiations; the guard keeps only pairs already known to be ordered, and the disequality drops the trivial diagonal where `x = y`.)
Current LLMs can probably guess [simp] vs. not with reasonable accuracy. Predicting optimal constraints is genuinely hard. This is the frontier.
Level 1 (Easy): Given a lemma, predict whether it should have [simp] or [grind].
Level 2 (Medium): Given a lemma marked [grind], predict good trigger patterns.
Level 3 (Hard): Predict patterns with optimal constraints that minimize instantiations while preserving completeness.
Level 4 (Research Frontier): Given a goal that grind fails on, diagnose whether the failure is due to missing annotations or bad constraints, and propose a fix.
Functional Benchmark: Given a library and a test suite of goals, produce annotations such that grind closes the maximum number of goals with minimum instantiation count. Evaluate on the Pareto frontier.
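One way to make the Pareto evaluation precise (this scoring is a suggested formalization, not a fixed part of the benchmark): score an annotation set $A$ by the pair

```latex
G(A) = \#\{\, g \in \mathrm{TestSuite} \mid \texttt{grind} \text{ closes } g \text{ under } A \,\},
\qquad
I(A) = \sum_{g \text{ closed under } A} \mathrm{instantiations}(g, A).
```

$A$ dominates $B$ when $G(A) \ge G(B)$ and $I(A) \le I(B)$, with at least one inequality strict; submissions are then compared on the set of non-dominated points.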
Run grind +suggestions across all of Mathlib. For every proof it closes, log which suggestions were needed. Those theorems are candidates for [grind] annotation, and the trace tells you what patterns and constraints would work. This is automated annotation discovery—mining labels from successful proofs, not asking humans to label.
Then train a model on this data. Deploy it as the suggestion oracle. Measure whether it improves grind success rate on held-out goals.
Open questions:

- Can a model trained with `#grind_lint` as the reward signal learn to write good constraints?
- Can a learned oracle behind `grind +suggestions` improve the success rate on held-out goals?

Current Mathlib annotations were written by humans through trial and error over years. Can an AI, given the same feedback signals we use, learn to annotate a new library in hours instead of years? We don't know whether this is solvable with current methods. That's why we're bringing it to you.