pdf-edit-engine

Live

Change the words in a PDF without touching the layout — operator-level content-stream surgery that preserves fonts, kerning, and pixel-exact positioning.

~/pdf-edit-engine
$ pip install pdf-edit-engine
Installed pdf-edit-engine 0.2.0
>>> from pdf_edit_engine import replace
>>> r = replace("invoice.pdf",
...     "$12,500", "$13,250",
...     "invoice-v2.pdf")
>>> r.fidelity_report.font_preserved
True

1,200+ tests · 88% coverage · 414 audit probes

Install

pip install pdf-edit-engine

01 The challenge

Why this needed to exist.

Editing text in an existing PDF is a constant need — names, dates, typos, labels — but PDF was designed as a display format, not an editing format. Text is stored as positioned glyph indices, not editable strings, which is why every mainstream tool falls back to one of two approaches: redact the area and stamp new text over it with a substitute font, or extract to another format and re-render. Both silently destroy the original fonts, kerning, and exact pixel positioning. I hit this wall while building pdf-toolkit-mcp and realised there was no production-grade library that could change the words in a PDF while keeping everything else identical.

02 The approach

How I built it.

Instead of treating a PDF as a document, pdf-edit-engine treats it as an instruction stream. It interprets the content-stream operators inside BT/ET blocks, tracks graphics state (transformation matrix, active font, colour), and modifies the operators themselves. Where PyMuPDF, the mainstream Python tool, covers the original text with a white rectangle and stamps replacement text with a substitute font, pdf-edit-engine keeps the original glyphs and just changes the operators that position them.

A two-tier font system extends embedded subsets on demand: a CMap-only fast path when the needed glyphs already exist in the font binary, and in-place glyph injection that appends the missing outlines without renumbering existing ones when they don't. Replacement text has its kerning redistributed across glyphs so the output preserves the original string width exactly, with no visible spacing gaps.

Every edit returns a structured FidelityReport (font_preserved, overflow_detected, reflow_applied, glyphs_missing) so automated pipelines and AI agents can verify quality programmatically without visual review. Every function also supports dry_run=True to preview the report before touching disk.
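As a rough illustration of how a pipeline might gate on that report, here is a minimal sketch. The FidelityReport dataclass below is a hypothetical stand-in that mirrors only the four field names listed above; the field types (in particular glyphs_missing as a list of characters) and the is_safe_to_ship helper are assumptions, not the library's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FidelityReport:
    """Hypothetical stand-in: field names come from the text, types are assumed."""
    font_preserved: bool
    overflow_detected: bool
    reflow_applied: bool
    glyphs_missing: list[str] = field(default_factory=list)


def is_safe_to_ship(report: FidelityReport) -> bool:
    """Accept an edit only if the font survived, nothing overflowed, and no glyph is missing."""
    return (
        report.font_preserved
        and not report.overflow_detected
        and not report.glyphs_missing
    )


# Two made-up reports: one clean edit, one that lacked a glyph.
clean = FidelityReport(font_preserved=True, overflow_detected=False, reflow_applied=False)
lossy = FidelityReport(True, False, False, glyphs_missing=["€"])

print(is_safe_to_ship(clean))  # True
print(is_safe_to_ship(lossy))  # False
```

A check like this could presumably run on a dry_run=True preview as well, so a rejected edit never reaches disk.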

03 Impact

What it did.

Shipped to PyPI as pdf-edit-engine v0.2.0 with 1,200+ tests at 88% coverage under mypy strict. A 414-probe invariant audit suite across 14 layers runs as a permanent regression guard on every change. The CI matrix validates the engine against seven PDF generators (Chrome, Google Docs, four reportlab variants, pikepdf synthetic) with 100% character agreement across all of them. Benchmarks on a 100-page PDF: 0.3 s to index 900 matches, 0.03 s to replace on a single page, 0.1 s for a 50-edit batch, under 500 MB of memory. The engine powers pdf-edit-mcp — a 38-tool MCP server that brings format-preserving editing to AI agents.

04 Tech stack

What I used — and why.

Python 3.12 + pikepdf

pikepdf gives byte-level access to content streams and can unparse modified operators back into a valid PDF — the foundation of the in-place edit approach.
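To make "operator-level" concrete, here is a toy sketch on a hand-written content-stream fragment, using only the standard library. It is not how pikepdf or pdf-edit-engine parses streams: a regex cannot handle escaped parentheses, hex strings, or TJ arrays, and real streams often hold glyph codes rather than readable text. The fragment and the replace_tj_operand helper are illustrative assumptions. The sketch shows only the core idea: the string operand of Tj changes while the Tf and Td operators stay untouched.

```python
import re

# A hand-written text object: select font F1 at 12 pt, move to (72, 720), show a string.
STREAM = b"BT\n/F1 12 Tf\n72 720 Td\n(Total: $12,500) Tj\nET\n"

# Matches a simple literal string operand (no nesting, no escapes) followed by Tj.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def replace_tj_operand(stream: bytes, old: bytes, new: bytes) -> bytes:
    """Swap `old` for `new` inside Tj string operands, leaving other operators alone."""
    def rewrite(match: re.Match) -> bytes:
        text = match.group(1).replace(old, new)
        return b"(" + text + b") Tj"

    return TJ_PATTERN.sub(rewrite, stream)


edited = replace_tj_operand(STREAM, b"$12,500", b"$13,250")
print(edited.decode("ascii"))
```

The font selection and position lines pass through unchanged, which is the property the in-place approach relies on.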

fonttools

Font introspection, CMap parsing, and glyph metrics drive both the two-tier subset-extension algorithm and the kerning-redistribution that preserves original text widths.
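The width-preservation idea can be sketched with plain arithmetic. In a PDF TJ array, a number between strings shifts the next glyph by that many thousandths of a text-space unit, and a positive value moves it left. The sketch below assumes glyph advances are already known in thousandths of an em (the kind of metric fonttools can supply) and spreads the width difference evenly over the gaps. The function names, the even-spread policy, and the sample advances are assumptions, not the engine's actual algorithm.

```python
def redistribute_kerning(new_advances: list[float], target_width: float) -> list[float]:
    """Return one TJ adjustment per gap so the new string spans `target_width`.

    All values are in thousandths of an em. A positive adjustment pulls the
    next glyph left, a negative one pushes it right.
    """
    gaps = len(new_advances) - 1
    if gaps < 1:
        return []  # a single glyph has no gap to absorb the difference
    excess = sum(new_advances) - target_width
    return [excess / gaps] * gaps


def rendered_width(advances: list[float], adjustments: list[float]) -> float:
    """Natural width minus the TJ adjustments applied between glyphs."""
    return sum(advances) - sum(adjustments)


# Made-up advances: the replacement is 300 units wider than the original slot.
new_advances = [600.0, 500.0, 500.0, 600.0]  # natural width 2200
adjustments = redistribute_kerning(new_advances, target_width=1900.0)

print(adjustments)                                # [100.0, 100.0, 100.0]
print(rendered_width(new_advances, adjustments))  # 1900.0
```

An even spread is the simplest policy; a production implementation might weight gaps differently or refuse edits whose adjustment per gap becomes visibly large.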

pdfminer.six

Position-aware text extraction. The engine correlates pdfminer's layout with pikepdf's content streams so operator-level edits target the correct glyphs.

pytest + mypy strict

1,200+ tests across seven PDF generators catch encoding and font edge cases; mypy strict keeps the public API type-safe so downstream tools like pdf-edit-mcp get reliable typings.
