pdf-edit-engine
LiveChange the words in a PDF without touching the layout — operator-level content-stream surgery that preserves fonts, kerning, and pixel-exact positioning.
pip install pdf-edit-engineWhy this needed to exist.
Editing text in an existing PDF is a constant need — names, dates, typos, labels — but PDF was designed as a display format, not an editing format. Text is stored as positioned glyph indices, not editable strings, which is why every mainstream tool falls back to one of two approaches: redact the area and stamp new text over it with a substitute font, or extract to another format and re-render. Both silently destroy the original fonts, kerning, and exact pixel positioning. I hit this wall while building pdf-toolkit-mcp and realised there was no production-grade library that could change the words in a PDF while keeping everything else identical.
How I built it.
Instead of treating a PDF as a document, pdf-edit-engine treats it as an instruction stream. It interprets the content-stream operators inside BT/ET blocks, tracks graphics state — transformation matrix, active font, colour — and modifies the operators themselves. Where PyMuPDF — the mainstream Python tool — covers the original text with a white rectangle and stamps replacement text with a substitute font, pdf-edit-engine keeps the original glyphs and just changes the operators that position them. A two-tier font system extends embedded subsets on demand: a CMap-only fast path when the needed glyphs already exist in the font binary, and in-place glyph injection that appends the missing outlines without renumbering existing ones when they don't. Replacement text has its kerning redistributed across glyphs so the output preserves the original string width exactly — no visible spacing gaps. Every edit returns a structured FidelityReport (font_preserved, overflow_detected, reflow_applied, glyphs_missing) so automated pipelines and AI agents can verify quality programmatically without visual review. Every function also supports dry_run=True to preview the report before touching disk.
What it did.
Shipped to PyPI as pdf-edit-engine v0.2.0 with 1,200+ tests at 88% coverage under mypy strict. A 414-probe invariant audit suite across 14 layers runs as a permanent regression guard on every change. The CI matrix validates the engine against seven PDF generators (Chrome, Google Docs, four reportlab variants, pikepdf synthetic) with 100% character agreement across all of them. Benchmarks on a 100-page PDF: 0.3 s to index 900 matches, 0.03 s to replace on a single page, 0.1 s for a 50-edit batch, under 500 MB of memory. The engine powers pdf-edit-mcp — a 38-tool MCP server that brings format-preserving editing to AI agents.
What I used — and why.
Python 3.12 + pikepdf
pikepdf gives byte-level access to content streams and can unparse modified operators back into a valid PDF — the foundation of the in-place edit approach.
fonttools
Font introspection, CMap parsing, and glyph metrics drive both the two-tier subset-extension algorithm and the kerning-redistribution that preserves original text widths.
pdfminer.six
Position-aware text extraction. The engine correlates pdfminer's layout with pikepdf's content streams so operator-level edits target the correct glyphs.
pytest + mypy strict
1,200+ tests across seven PDF generators catch encoding and font edge cases; mypy strict keeps the public API type-safe so downstream tools like pdf-edit-mcp get reliable typings.
Want something similar?
Available for freelance projects and contract engineering. Usually reply within 24 hours.