Multi-renderer Strategy (HTML→PDF adapters)
We should not bet the product on a single PDF engine. iftypeset should be renderer-agnostic: the “meaning” is in the rule registry + profiles + QA gates; the PDF renderer is an interchangeable adapter.
Principles
- Determinism first: the adapter must emit render-log.json with engine name + version + key options.
- No-network capable: engines must run with --network=none / offline mode in CI where possible.
- Graceful degradation: if no PDF engine exists, HTML artifacts + HTML-based QA must still run.
- Capability disclosure: if a gate can’t be measured with an engine, report it explicitly (don’t silently pass).
Adapter interface (contract)
All PDF engines implement the same interface:
    from typing import Protocol

    class PdfEngine(Protocol):
        name: str

        def is_available(self) -> bool: ...
        def version(self) -> str: ...
        def render(self, *, html_path: str, css_path: str, assets_dir: str | None, out_pdf: str, options: dict) -> dict:
            """Returns a structured log: timings, warnings, engine opts, feature flags."""
            ...
The CLI should support:
- --engine auto|playwright|weasyprint|prince|antenna|vivliostyle|wkhtmltopdf
- --engine-opts <json>
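A minimal sketch of how these two flags could be parsed with argparse. The program name, the default of auto, and the rule that engine options must be a JSON object are assumptions for illustration, not decisions recorded in this document.

```python
import argparse
import json

ENGINES = ["auto", "playwright", "weasyprint", "prince",
           "antenna", "vivliostyle", "wkhtmltopdf"]


def parse_json_object(text: str) -> dict:
    """argparse type= hook: accept only a JSON object for --engine-opts."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(f"invalid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError("engine options must be a JSON object")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="iftypeset")  # prog name is assumed
    parser.add_argument("--engine", choices=ENGINES, default="auto")
    # A string default is run through type=, so the default becomes {}.
    parser.add_argument("--engine-opts", type=parse_json_object, default="{}")
    return parser
```

Validating the JSON at parse time means a malformed --engine-opts value fails before any rendering starts.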
“Majors” to target (pragmatic)
Tier 1 (easy to run, common)
- Playwright (browser-backed PDF; preferred)
  - Pros: ubiquitous, good HTML/CSS coverage, easy containerization.
  - Cons: paged-media features vary; footnotes/running headers are limited unless carefully built.
- WeasyPrint
  - Pros: pure Python workflow, good paged-media support, easy CI story.
  - Cons: CSS compatibility differs; some complex layouts may need workarounds.
Tier 2 (best print fidelity; commercial)
- PrinceXML
  - Pros: excellent paged media, footnotes, running headers, print-quality output.
  - Cons: license cost; needs binary distribution policy.
- Antenna House Formatter
  - Pros: top-tier print fidelity; standards publishing; robust PDF/A options.
  - Cons: license + operational complexity.
Tier 3 (useful but limited)
- Vivliostyle / Paged.js
  - Pros: strong paged-media model in the web ecosystem.
  - Cons: heavier runtime; often “HTML+JS render” rather than simple CLI.
- wkhtmltopdf
  - Pros: simple deploy story in legacy environments.
  - Cons: outdated rendering model; limited CSS; not ideal for “high quality”.
Capability matrix (what we care about)
We should encode an engine capability report (per run) for:
- paged media (margins, page size, running headers)
- hyphenation support + dictionaries
- font embedding/subsetting
- link handling (wrap/break strategy)
- footnotes (if we later support them)
- PDF/A options (later)
This capability map feeds QA:
- if an engine can’t support a gate (e.g., true widow/orphan detection on PDF), QA should:
  - run the best available approximation, and
  - mark the gate as skipped with a reason, not passed.
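The skip rule above could be sketched as follows. The capability names, the GateResult shape, and the resolve_gate helper are hypothetical; the point is only that a missing capability produces skipped plus a reason, even when an approximation ran and looked fine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    gate: str
    status: str  # "passed", "failed", or "skipped"
    reason: str = ""


def resolve_gate(gate: str, required: set[str], capabilities: set[str],
                 check_ok: bool) -> GateResult:
    """Turn a check outcome into a gate status, honouring engine capabilities.

    check_ok is the outcome of whatever check actually ran: the exact
    measurement when the engine supports it, otherwise the approximation.
    """
    missing = sorted(required - capabilities)
    if missing:
        approx = "ok" if check_ok else "not ok"
        return GateResult(
            gate, "skipped",
            f"engine lacks {', '.join(missing)}; approximation was {approx}",
        )
    return GateResult(gate, "passed" if check_ok else "failed")
```

For example, resolve_gate("widows_orphans", {"pdf_line_boxes"}, {"paged_media"}, True) yields a skipped result that names pdf_line_boxes in its reason, rather than a silent pass.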
Determinism knobs (must record)
For every PDF render, write out/render-log.json including:
- engine name + version
- invocation args
- environment hints (OS, locale)
- “self-contained” mode on/off
- fonts policy + resolution (requested primary fonts, what fontconfig matched, and what fonts were embedded in the PDF)
- any warnings from the engine
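A rough sketch of a writer for the log described above. The JSON key names and the function signature are assumptions; the document only fixes which facts must be recorded, not their layout.

```python
import json
import locale
import platform
from pathlib import Path


def write_render_log(out_dir, *, engine, version, args, self_contained,
                     fonts, warnings):
    """Write render-log.json into out_dir and return its path."""
    log = {
        "engine": {"name": engine, "version": version},
        "args": list(args),
        "environment": {
            "os": platform.platform(),
            "locale": locale.getlocale()[0],  # may be None if unset
        },
        "self_contained": self_contained,
        # Expected shape (assumed): requested / matched / embedded font lists.
        "fonts": fonts,
        "warnings": list(warnings),
    }
    path = Path(out_dir) / "render-log.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the file layout stable so logs diff cleanly across runs.
    path.write_text(json.dumps(log, indent=2, sort_keys=True) + "\n",
                    encoding="utf-8")
    return path
```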
If the engine is a browser:
- fix viewport
- disable external requests
- pin print settings (margins, background graphics, scaling)
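For the Playwright adapter, the three browser knobs above might be pinned roughly like this. The concrete viewport, page format, and margin values are placeholders, and blocking by URL scheme is one possible way to disable external requests, not a settled design.

```python
from pathlib import Path

# Placeholder values; the real ones would come from the active profile.
VIEWPORT = {"width": 1280, "height": 1024}
PDF_SETTINGS = {
    "format": "A4",
    "margin": {"top": "20mm", "right": "18mm", "bottom": "20mm", "left": "18mm"},
    "print_background": True,
    "scale": 1.0,
}


def allow_local_only(route) -> None:
    """Route handler: let file: and data: URLs through, abort everything else."""
    if route.request.url.startswith(("file://", "data:")):
        route.continue_()
    else:
        route.abort()


def render_pdf(html_path: str, out_pdf: str) -> None:
    from playwright.sync_api import sync_playwright  # lazy: optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # Fixed viewport; JavaScript disabled, in line with the security model.
        context = browser.new_context(viewport=VIEWPORT, java_script_enabled=False)
        context.route("**/*", allow_local_only)
        page = context.new_page()
        page.goto(Path(html_path).resolve().as_uri())
        page.pdf(path=out_pdf, **PDF_SETTINGS)
        browser.close()
```

Keeping the settings in module-level dictionaries makes it straightforward to copy them verbatim into the render log.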
Security model
- Assume untrusted Markdown input (CI context). Mitigations:
  - never execute embedded JS during HTML render (or use a hardened renderer container)
  - disable network
  - restrict filesystem access (mount only out/ and input)
- If using headless browsers, treat them as an attack surface; run in locked-down containers.
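One way to express the container lockdown above is to build the docker run invocation in code. The image name and the /work/in and /work/out mount points are hypothetical, and a headless browser may need further allowances (for example extra writable tmpfs paths) beyond what this sketch grants.

```python
from pathlib import Path


def sandboxed_render_command(input_dir: str, out_dir: str, image: str,
                             engine_args: list[str]) -> list[str]:
    """Build a docker command with no network and only two host mounts."""
    src = Path(input_dir).resolve()
    out = Path(out_dir).resolve()
    return [
        "docker", "run", "--rm",
        "--network=none",                    # no outbound requests
        "--read-only",                       # root filesystem is immutable
        "--cap-drop=ALL",
        "--security-opt=no-new-privileges",
        "--tmpfs", "/tmp",                   # scratch space for the engine
        "-v", f"{src}:/work/in:ro",          # input is mounted read-only
        "-v", f"{out}:/work/out",            # the only writable host path
        image, *engine_args,
    ]
```

The returned list can be passed to subprocess.run(..., check=True) without going through a shell.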
Recommended v0.1 path (fastest)
- Implement adapters for:
  - Playwright (auto-detect)
  - WeasyPrint (if installed)
- Keep Prince/AH as optional adapters (stub + docs) until needed.
- Use QA gates as the real value:
  - link wrap, code/table overflow, stranded headings (HTML and PDF when possible)
This keeps delivery fast while preserving “compatible with the majors”.
Future: “Engine parity” testing
Once adapters exist, add an integration job that renders the same fixtures through 2 engines (when available) and compares:
- gate metrics (should be within thresholds)
- file size ranges
- major layout regressions (e.g., table clipping incidents)
We don’t need pixel-perfect equivalence; we need “quality gates still pass”.
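A parity check in this spirit compares gate metrics per engine against tolerances instead of comparing pixels. The metric names and the tolerance format below are hypothetical.

```python
def compare_runs(metrics_a: dict[str, float], metrics_b: dict[str, float],
                 tolerances: dict[str, float]) -> list[str]:
    """Return one message per gate whose metrics differ beyond its tolerance."""
    problems = []
    for gate, tolerance in sorted(tolerances.items()):
        if gate not in metrics_a or gate not in metrics_b:
            # Surface unmeasured gates explicitly rather than passing silently.
            problems.append(f"{gate}: not measured by both engines")
            continue
        delta = abs(metrics_a[gate] - metrics_b[gate])
        if delta > tolerance:
            problems.append(f"{gate}: differs by {delta:g} (allowed {tolerance:g})")
    return problems
```

An empty list means the two engines agree within thresholds for every gate listed in tolerances; the integration job would fail on a non-empty result.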