Proof-Refactor Paper

Proof‑Refactor · 2026

Proof-Refactor.

A process‑guided agentic framework that turns compiling Lean proofs into modular, readable, reusable artifacts — without optimizing for a single proxy metric.

1Southern University of Science and Technology   2Peking University   *Corresponding author

§ 01  /  Pipeline

A single Lean file. Four focused sessions.

Each phase is a separate Claude Code session with lean‑lsp‑mcp; the artifact passed forward is the Lean file itself. A custom extract tactic lets the pipeline reify proof fragments into standalone theorems.

Overview of the Proof-Refactor pipeline
Fig. 1 Overview of the Proof-Refactor pipeline. Each phase starts from the Lean file produced by the previous phase.

§ 02  /  The idea

Refactoring,
not golfing.

Generated formal proofs always compile. They rarely read like Mathlib. They inline reusable arguments and reintroduce ad‑hoc definitions rather than factoring them out as reusable declarations.

Existing approaches optimise for length, the only objective that is reliable to measure. But readability, modularity and reuse are not reducible to an automatically computable objective — and length‑based rewards drift towards golfing rather than helper‑design.

Proof‑Refactor treats the problem the way humans do: obtain a proof, then refactor it. Identify reusable fragments, generalise them into declarations that fit the surrounding library, prove those declarations, and rewrite the original proof to call them.

§ 03  /  Four phases

Extract. Design. Prove. Repair.

  1. 01

    Extract

    An external model proposes proof fragments worth lifting. The agent runs lean_extract, which inserts an extract wrapper and produces a scaffold theorem.

  2. 02

    Design

    Scaffolds are grouped and rewritten into clean, library‑style signatures, with Mathlib search results used as guidance. Every theorem is annotated with its dependencies; scaffolds are temporary and must never be depended on.

  3. 03

    Prove

    The agent fills in each sorry, guided by the dependency order. Original proof fragments provide the proof idea even when their code can’t be reused verbatim. External help is consulted only on rare hard obligations.

  4. 04

    Repair

    The owner proof is repaired by following dependency annotations and invoking the verified helpers. At each extraction site, the scaffold body guides the rewrite, but the repaired proof never calls the scaffold. The extract wrapper and the scaffolds are removed, and mathematical annotations are added.

§ 04  /  The extract tactic

Turn a tactic block
into a theorem.

extract lifts a tactic block into a standalone theorem — either preserving the original goal, or stating the implication from the post‑block goal to the pre‑block goal.

Only the variables the block actually uses survive. It supports both have‑style subproofs and contiguous goal‑changing blocks, and cleans up unused variables.

Before extraction

owner declaration only

After extraction

two scaffolds

The two inserted declarations are scaffolds. Set.inter_comm is the owner declaration.

§ 05  /  Case study

Putnam 1968 B6.

The original proof inlines an explicit open‑cover argument for compactness. Proof‑Refactor surfaces the two mathematical steps that were buried inside it — a density fact about the rationals, and a convergence criterion in a pseudo‑metric space — and rebuilds the main theorem around them.

Original

86 lines · inlined open‑cover proof

Proof‑Refactor

52 lines · two reusable helpers

Full examples: See the showcase →

§ 06  /  Main results

Rubric‑based quality,
two benchmarks.

A separate LLM‑as‑a‑judge — not Claude Code — evaluates each refactoring along five rubric dimensions, grounded by LeanSearch retrieval. Length is reported alongside as an additional diagnostic.

Dataset Metric Baseline Proof-Refactor Δ
Putnam2025 · 12 problems
Average score3.964.27+0.31
Structure4.014.31+0.29
Signature quality3.764.28+0.51
Tactic quality4.044.22+0.18
Reuse4.004.29+0.29
Human readability3.994.26+0.28
Word reduction rate8.3%4.5%−3.8pp
Avg. words reduced625.9337.2−288.7
PutnamBench · 96 problems
Average score3.754.19+0.45
Structure3.804.21+0.41
Signature quality3.464.14+0.68
Tactic quality3.974.19+0.22
Reuse3.844.24+0.41
Human readability3.674.18+0.50
Word reduction rate4.0%6.5%+2.5pp
Avg. words reduced105.5118.1+12.6

Largest gains: signature quality and human readability. Consistent with a method that targets proof architecture, not local tactic compression.

§ 07  /  Ablation

External assistance carries
the design phase.

Remove the external‑assistance channel, keep everything else fixed. Average score flips from positive to negative; the loss rate climbs from 52% to 70%.

  • +0.31 Full pipeline · avg. score Δ
  • −0.55 No external assistance
  • −1.34 Structure gap
  • −1.17 Readability gap

Mixing helper design with tool‑heavy Lean interaction in a single context overloads the agent. Decoupling high‑level reasoning from low‑level interaction is the load‑bearing design choice.

Explore

See it on real proofs.

Open showcase

§ 08  /  Cite

BibTeX.

@misc{fu2026proofrefactorrefactoringgeneratedformal,
      title={Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts}, 
      author={Yiming Fu and Peixuan Liu and Zichen Wang and Kun yuan},
      year={2026},
      eprint={2606.03743},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.03743}, 
}