pdf-edit-engine
Change the words in a PDF without touching the layout — operator-level content-stream surgery that preserves fonts, kerning, and pixel-exact positioning.
Install
pip install pdf-edit-engine
The Challenge
Editing text in an existing PDF is a constant need — names, dates, typos, labels — but PDF was designed as a display format, not an editing format. Text is stored as positioned glyph indices, not editable strings, which is why every mainstream tool falls back to one of two approaches: redact the area and stamp new text over it with a substitute font, or extract to another format and re-render. Both silently destroy the original fonts, kerning, and exact pixel positioning. I hit this wall while building pdf-toolkit-mcp and realised there was no production-grade library that could change the words in a PDF while keeping everything else identical.
The Approach
Instead of treating a PDF as a document, pdf-edit-engine treats it as an instruction stream. It interprets the content-stream operators inside BT/ET blocks, tracks graphics state (transformation matrix, active font, colour), and modifies the operators themselves. Where PyMuPDF, the mainstream Python tool, covers the original text with a white rectangle and stamps replacement text with a substitute font, pdf-edit-engine keeps the original font and modifies only the operators that show and position its glyphs.
A two-tier font system extends embedded subsets on demand: a CMap-only fast path when the needed glyphs already exist in the font binary, and a full re-embed with the fonttools subsetter's --retain-gids option when they don't. Replacement text has its kerning redistributed across glyphs so the output preserves the original string width exactly, with no visible spacing gaps.
Every edit returns a structured FidelityReport (font_preserved, overflow_detected, reflow_applied, glyphs_missing) so automated pipelines and AI agents can verify quality programmatically without visual review. Every function also supports dry_run=True to preview the report before touching disk.
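As an illustration of how a pipeline might act on such a report, here is a minimal sketch. The four field names are the ones listed above; the field types, the stand-in FidelityReport class, and the safe_to_commit helper are assumptions made for this example, not the library's actual definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FidelityReport:
    """Stand-in for the library's report; field types are assumed."""
    font_preserved: bool
    overflow_detected: bool
    reflow_applied: bool
    glyphs_missing: List[str] = field(default_factory=list)


def safe_to_commit(report: FidelityReport) -> bool:
    """Hypothetical gate: accept an edit only if nothing visible degraded.

    reflow_applied is treated as informational here, which is an
    assumption; a stricter pipeline could reject reflowed edits too.
    """
    return (
        report.font_preserved
        and not report.overflow_detected
        and not report.glyphs_missing
    )


clean = FidelityReport(True, False, False)
risky = FidelityReport(True, False, False, glyphs_missing=["é"])
print(safe_to_commit(clean), safe_to_commit(risky))
```

A gate like this pairs naturally with a dry run: inspect the previewed report first, and only repeat the call without dry_run=True when the gate passes.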
The Impact
Shipped to PyPI as pdf-edit-engine v0.1.0 with 628 tests at 85% coverage under mypy strict. The CI matrix validates the engine against seven PDF generators (Chrome, Google Docs, four reportlab variants, pikepdf synthetic) with 100% character agreement across all of them. Benchmarks on a 100-page PDF: 0.3 s to index 900 matches, 0.03 s to replace on a single page, 0.1 s for a 50-edit batch, under 500 MB of memory. The engine powers pdf-edit-mcp — a 38-tool MCP server that brings format-preserving editing to AI agents.
Tech Stack
Python 3.12 + pikepdf
pikepdf gives byte-level access to content streams and can unparse modified operators back into a valid PDF — the foundation of the in-place edit approach.
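To make the operator-level idea concrete, the sketch below models a parsed content stream as plain (operands, operator) pairs, which is the shape pikepdf's parse_content_stream yields and unparse_content_stream accepts. It uses plain tuples so it runs without a PDF. The replace_in_text_ops helper is hypothetical, and the toy assumes operands are readable strings, whereas real streams carry font-specific glyph codes that the engine has to decode first.

```python
from typing import List, Tuple

# One instruction: (operands, operator), mirroring a parsed content stream.
Instruction = Tuple[list, str]


def replace_in_text_ops(
    instructions: List[Instruction], old: str, new: str
) -> List[Instruction]:
    """Replace `old` with `new` in Tj operands that sit between BT and ET.

    Every other operator (font selection, positioning, graphics state)
    is copied through unchanged, so the layout instructions stay intact.
    """
    out: List[Instruction] = []
    in_text_block = False
    for operands, operator in instructions:
        if operator == "BT":
            in_text_block = True
        elif operator == "ET":
            in_text_block = False
        elif (
            in_text_block
            and operator == "Tj"
            and operands
            and isinstance(operands[0], str)
        ):
            operands = [operands[0].replace(old, new)]
        out.append((list(operands), operator))
    return out


stream = [
    ([], "BT"),
    (["/F1", 12], "Tf"),         # select font F1 at 12 pt
    ([72, 720], "Td"),           # move to the text origin
    (["Invoice 2023"], "Tj"),    # show text
    ([], "ET"),
]
edited = replace_in_text_ops(stream, "2023", "2024")
print(edited[3])
```

Only the Tj operand changes; the Tf and Td instructions that fix the font and position are emitted exactly as they came in.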
fonttools
Font introspection, CMap parsing, and glyph metrics drive both the two-tier subset-extension algorithm and the kerning-redistribution that preserves original text widths.
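The width-preservation idea can be sketched with simple arithmetic. In a TJ array, each number is subtracted from the text position in thousandths of a text-space unit, so the rendered width is the sum of glyph widths minus the sum of adjustments. The sketch below spreads the difference evenly across the gaps between glyphs; even spreading and both function names are assumptions for illustration, and the engine's actual distribution may differ.

```python
from typing import List, Union


def redistribute_kerning(
    glyph_widths: List[float], target_width: float
) -> List[float]:
    """Return one TJ adjustment per gap so the run matches target_width.

    Widths and adjustments are in 1/1000 text-space units. A positive
    adjustment tightens the gap, a negative one loosens it.
    """
    gaps = len(glyph_widths) - 1
    if gaps < 1:
        return []  # a single glyph has no gaps to adjust
    surplus = sum(glyph_widths) - target_width
    return [surplus / gaps] * gaps


def build_tj_array(
    glyphs: List[str], adjustments: List[float]
) -> List[Union[str, float]]:
    """Interleave glyphs and adjustments: [g0, a0, g1, a1, ..., gn]."""
    array: List[Union[str, float]] = []
    for i, glyph in enumerate(glyphs):
        array.append(glyph)
        if i < len(adjustments):
            array.append(adjustments[i])
    return array


# Replacement glyphs total 1500 units but the original run was 1400 wide.
widths = [500.0, 600.0, 400.0]
adjustments = redistribute_kerning(widths, target_width=1400.0)
tj_array = build_tj_array(["A", "B", "C"], adjustments)
print(tj_array)
```

In practice the glyph widths would come from the embedded font's metrics, which is where fonttools comes in.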
pdfminer.six
Position-aware text extraction. The engine correlates pdfminer's layout with pikepdf's content streams so operator-level edits target the correct glyphs.
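One way to picture the correlation step is a nearest-origin match within a small tolerance, sketched below. LayoutRun, TextOp, match_op, and the 0.5-point tolerance are all hypothetical names and values chosen for this example; they do not come from pdfminer.six, pikepdf, or the engine itself.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LayoutRun:
    """Hypothetical: a run of text as reported by a layout extractor."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class TextOp:
    """Hypothetical: a text-showing operator and its tracked page origin."""
    index: int  # position of the operator in the content stream
    x: float
    y: float


def match_op(
    run: LayoutRun, ops: Sequence[TextOp], tol: float = 0.5
) -> Optional[TextOp]:
    """Return the operator whose origin is closest to the run's origin.

    Distance is the larger of the x and y offsets; anything farther
    than `tol` points is treated as unrelated and yields None.
    """
    best: Optional[TextOp] = None
    best_distance = tol
    for op in ops:
        distance = max(abs(op.x - run.x), abs(op.y - run.y))
        if distance <= best_distance:
            best, best_distance = op, distance
    return best


ops = [TextOp(index=3, x=72.0, y=720.0), TextOp(index=7, x=72.0, y=700.0)]
hit = match_op(LayoutRun("Invoice", 72.1, 719.9), ops)
miss = match_op(LayoutRun("Footer", 300.0, 40.0), ops)
print(hit, miss)
```

Returning None for an unmatched run lets a caller refuse the edit instead of guessing which operator to change.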
pytest + mypy strict
628 tests across seven PDF generators catch encoding and font edge cases; mypy strict keeps the public API type-safe so downstream tools like pdf-edit-mcp get reliable typings.