Proof‑Refactor · 2026
Proof-Refactor.
A process‑guided agentic framework that turns compiling Lean proofs into modular, readable, reusable artifacts — without optimizing for a single proxy metric.
1Southern University of Science and Technology 2Peking University *Corresponding author
§ 01 / Pipeline
A single Lean file. Four focused sessions.
Each phase is a separate Claude Code session with lean‑lsp‑mcp; the
artifact passed forward is the Lean file itself. A custom extract tactic
lets the pipeline reify proof fragments into standalone theorems.
§ 02 / The idea
Refactoring,
not golfing.
Generated formal proofs always compile. They rarely read like Mathlib. They inline reusable arguments and reintroduce ad‑hoc definitions rather than factoring them out as reusable declarations.
Existing approaches optimise for length, the only objective that is reliable to measure. But readability, modularity and reuse are not reducible to an automatically computable objective — and length‑based rewards drift towards golfing rather than helper‑design.
Proof‑Refactor treats the problem the way humans do: obtain a proof, then refactor it. Identify reusable fragments, generalise them into declarations that fit the surrounding library, prove those declarations, and rewrite the original proof to call them.
§ 03 / Four phases
Extract. Design. Prove. Repair.
-
01
Extract
An external model proposes proof fragments worth lifting. The agent runs
lean_extract, which inserts anextractwrapper and produces a scaffold theorem. -
02
Design
Scaffolds are grouped and rewritten into clean, library‑style signatures, with Mathlib search results used as guidance. Every theorem is annotated with its dependencies; scaffolds are temporary and must never be depended on.
-
03
Prove
The agent fills in each
sorry, guided by the dependency order. Original proof fragments provide the proof idea even when their code can’t be reused verbatim. External help is consulted only on rare hard obligations. -
04
Repair
The owner proof is repaired by following dependency annotations and invoking the verified helpers. At each extraction site, the scaffold body guides the rewrite, but the repaired proof never calls the scaffold. The
extractwrapper and the scaffolds are removed, and mathematical annotations are added.
§ 04 / The extract tactic
Turn a tactic block
into a theorem.
extract lifts a tactic block into a standalone theorem — either preserving
the original goal, or stating the implication from the post‑block goal to the
pre‑block goal.
Only the variables the block actually uses survive. It supports both
have‑style subproofs and contiguous goal‑changing blocks,
and cleans up unused variables.
After extraction
The two inserted declarations are scaffolds. Set.inter_comm is the
owner declaration.
§ 05 / Case study
Putnam 1968 B6.
The original proof inlines an explicit open‑cover argument for compactness. Proof‑Refactor surfaces the two mathematical steps that were buried inside it — a density fact about the rationals, and a convergence criterion in a pseudo‑metric space — and rebuilds the main theorem around them.
Original
Proof‑Refactor
Full examples: See the showcase →
§ 06 / Main results
Rubric‑based quality,
two benchmarks.
A separate LLM‑as‑a‑judge — not Claude Code — evaluates each refactoring along five rubric dimensions, grounded by LeanSearch retrieval. Length is reported alongside as an additional diagnostic.
| Dataset | Metric | Baseline | Proof-Refactor | Δ |
|---|---|---|---|---|
| Putnam2025 · 12 problems | ||||
| Average score | 3.96 | 4.27 | +0.31 | |
| Structure | 4.01 | 4.31 | +0.29 | |
| Signature quality | 3.76 | 4.28 | +0.51 | |
| Tactic quality | 4.04 | 4.22 | +0.18 | |
| Reuse | 4.00 | 4.29 | +0.29 | |
| Human readability | 3.99 | 4.26 | +0.28 | |
| Word reduction rate | 8.3% | 4.5% | −3.8pp | |
| Avg. words reduced | 625.9 | 337.2 | −288.7 | |
| PutnamBench · 96 problems | ||||
| Average score | 3.75 | 4.19 | +0.45 | |
| Structure | 3.80 | 4.21 | +0.41 | |
| Signature quality | 3.46 | 4.14 | +0.68 | |
| Tactic quality | 3.97 | 4.19 | +0.22 | |
| Reuse | 3.84 | 4.24 | +0.41 | |
| Human readability | 3.67 | 4.18 | +0.50 | |
| Word reduction rate | 4.0% | 6.5% | +2.5pp | |
| Avg. words reduced | 105.5 | 118.1 | +12.6 | |
Largest gains: signature quality and human readability. Consistent with a method that targets proof architecture, not local tactic compression.
§ 07 / Ablation
External assistance carries
the design phase.
Remove the external‑assistance channel, keep everything else fixed. Average score flips from positive to negative; the loss rate climbs from 52% to 70%.
- +0.31 Full pipeline · avg. score Δ
- −0.55 No external assistance
- −1.34 Structure gap
- −1.17 Readability gap
Mixing helper design with tool‑heavy Lean interaction in a single context overloads the agent. Decoupling high‑level reasoning from low‑level interaction is the load‑bearing design choice.
§ 08 / Cite
BibTeX.
@misc{fu2026proofrefactorrefactoringgeneratedformal,
title={Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts},
author={Yiming Fu and Peixuan Liu and Zichen Wang and Kun yuan},
year={2026},
eprint={2606.03743},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.03743},
}