Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

Fu, Yiming; Liu, Peixuan; Wang, Zichen; Yuan, Kun

Proof‑Refactor · 2026

Proof-Refactor.

A process‑guided agentic framework that turns compiling Lean proofs into modular, readable, reusable artifacts — without optimizing for a single proxy metric.

Yiming Fu¹ Peixuan Liu¹ Zichen Wang² Kun Yuan^2,*

¹Southern University of Science and Technology ²Peking University ^*Corresponding author

Read the paper Open showcase Code Cite

§ 01 / Pipeline

A single Lean file. Four focused sessions.

Each phase is a separate Claude Code session with lean‑lsp‑mcp; the artifact passed forward is the Lean file itself. A custom extract tactic lets the pipeline reify proof fragments into standalone theorems.

Fig. 1 Overview of the Proof-Refactor pipeline. Each phase starts from the Lean file produced by the previous phase.

§ 02 / The idea

Refactoring,
not golfing.

Generated formal proofs always compile. They rarely read like Mathlib. They inline reusable arguments and reintroduce ad‑hoc definitions rather than factoring them out as reusable declarations.

Existing approaches optimise for length, the only objective that is reliable to measure. But readability, modularity and reuse are not reducible to an automatically computable objective — and length‑based rewards drift towards golfing rather than helper‑design.

Proof‑Refactor treats the problem the way humans do: obtain a proof, then refactor it. Identify reusable fragments, generalise them into declarations that fit the surrounding library, prove those declarations, and rewrite the original proof to call them.

§ 03 / Four phases

Extract. Design. Prove. Repair.

01
Extract

An external model proposes proof fragments worth lifting. The agent runs lean_extract, which inserts an extract wrapper and produces a scaffold theorem.
02
Design

Scaffolds are grouped and rewritten into clean, library‑style signatures, with Mathlib search results used as guidance. Every theorem is annotated with its dependencies; scaffolds are temporary and must never be depended on.
03
Prove

The agent fills in each sorry, guided by the dependency order. Original proof fragments provide the proof idea even when their code can’t be reused verbatim. External help is consulted only on rare hard obligations.
04
Repair

The owner proof is repaired by following dependency annotations and invoking the verified helpers. At each extraction site, the scaffold body guides the rewrite, but the repaired proof never calls the scaffold. The extract wrapper and the scaffolds are removed, and mathematical annotations are added.

§ 04 / The extract tactic

Turn a tactic block
into a theorem.

extract lifts a tactic block into a standalone theorem — either preserving the original goal, or stating the implication from the post‑block goal to the pre‑block goal.

Only the variables the block actually uses survive. It supports both have‑style subproofs and contiguous goal‑changing blocks, and cleans up unused variables.

Before extraction

owner declaration only

After extraction

two scaffolds

The two inserted declarations are scaffolds. Set.inter_comm is the owner declaration.

§ 05 / Case study

Putnam 1968 B6.

The original proof inlines an explicit open‑cover argument for compactness. Proof‑Refactor surfaces the two mathematical steps that were buried inside it — a density fact about the rationals, and a convergence criterion in a pseudo‑metric space — and rebuilds the main theorem around them.

Original

86 lines · inlined open‑cover proof

Proof‑Refactor

52 lines · two reusable helpers

Full examples: See the showcase →

§ 06 / Main results

Rubric‑based quality,
two benchmarks.

A separate LLM‑as‑a‑judge — not Claude Code — evaluates each refactoring along five rubric dimensions, grounded by LeanSearch retrieval. Length is reported alongside as an additional diagnostic.

Dataset	Metric	Baseline	Proof-Refactor	Δ
Putnam2025 · 12 problems
	Average score	3.96	4.27	+0.31
	Structure	4.01	4.31	+0.29
	Signature quality	3.76	4.28	+0.51
	Tactic quality	4.04	4.22	+0.18
	Reuse	4.00	4.29	+0.29
	Human readability	3.99	4.26	+0.28
	Word reduction rate	8.3%	4.5%	−3.8pp
	Avg. words reduced	625.9	337.2	−288.7
PutnamBench · 96 problems
	Average score	3.75	4.19	+0.45
	Structure	3.80	4.21	+0.41
	Signature quality	3.46	4.14	+0.68
	Tactic quality	3.97	4.19	+0.22
	Reuse	3.84	4.24	+0.41
	Human readability	3.67	4.18	+0.50
	Word reduction rate	4.0%	6.5%	+2.5pp
	Avg. words reduced	105.5	118.1	+12.6

Largest gains: signature quality and human readability. Consistent with a method that targets proof architecture, not local tactic compression.

§ 07 / Ablation

External assistance carries
the design phase.

Remove the external‑assistance channel, keep everything else fixed. Average score flips from positive to negative; the loss rate climbs from 52% to 70%.

+0.31 Full pipeline · avg. score Δ
−0.55 No external assistance
−1.34 Structure gap
−1.17 Readability gap

Mixing helper design with tool‑heavy Lean interaction in a single context overloads the agent. Decoupling high‑level reasoning from low‑level interaction is the load‑bearing design choice.

Explore

See it on real proofs.

Open showcase

§ 08 / Cite

BibTeX.

@misc{fu2026proofrefactorrefactoringgeneratedformal,
      title={Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts}, 
      author={Yiming Fu and Peixuan Liu and Zichen Wang and Kun yuan},
      year={2026},
      eprint={2606.03743},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.03743}, 
}

Proof-Refactor.

Refactoring,not golfing.

Extract

Design

Prove

Repair

Turn a tactic blockinto a theorem.

See it on real proofs.

Refactoring,
not golfing.

Turn a tactic block
into a theorem.