Indic script shaping in PDF: why most PDF tools render Hindi as boxes
The reason Hindi, Bengali, Tamil, and other Indic scripts come out as boxes or scrambled letters in DOMPDF, mPDF, TCPDF, and most other PDF libraries is usually not a font problem. It is a text-shaping problem. Here is what shaping does, why HarfBuzz exists, and what running Chromium gets you for free.
Indic scripts look fine in your browser. The same scripts come out as boxes, isolated letters, or visually scrambled text in most PDF libraries. The reason is not that your font is missing the glyphs. It is that the PDF library never asked the question "which glyphs, in which order, with which positions?"
That question is what text shaping answers. Here is what it does.
What shaping does
A Hindi word like नमस्ते (namaste) is six Unicode codepoints in logical order:
न U+0928 DEVANAGARI LETTER NA
म U+092E DEVANAGARI LETTER MA
स U+0938 DEVANAGARI LETTER SA
् U+094D DEVANAGARI SIGN VIRAMA
त U+0924 DEVANAGARI LETTER TA
े U+0947 DEVANAGARI VOWEL SIGN E
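The breakdown above can be reproduced with Python's standard unicodedata module. This is only a small sketch for inspecting a string; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' line per codepoint, in logical (stored) order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# The word namaste, written with escapes so the codepoints are unambiguous.
NAMASTE = "\u0928\u092e\u0938\u094d\u0924\u0947"

for line in describe(NAMASTE):
    print(line)
```

Running the same helper on any string that renders badly is a quick way to confirm what the PDF library actually received before blaming the font.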
To render it correctly, three transformations have to happen.
1. Conjunct formation. The VIRAMA (U+094D) between स and त signals that the two consonants combine into the conjunct स्त. Depending on the font, that is a half form of स joined to त or a single ligature glyph. Either way the substitution lives in the font's GSUB table, and the shaping engine has to apply it.
2. Vowel sign placement. The vowel sign े (U+0947) follows त in logical order and is drawn above the consonant, so the shaping engine has to attach it at the right position on the cluster. Other vowel signs go further: ि (U+093F) is stored after its consonant but drawn to its left, so the engine has to reorder glyphs as well as position them.
3. Cluster identification. The whole sequence स्ते is one rendered unit, not four independent glyphs in a row. The shaping engine has to group it before passing it to the renderer.
A naive PDF library that does none of this prints न म स ् त े as six standalone glyphs. The reader sees nonsense.
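To make the grouping and reordering steps concrete, here is a deliberately simplified Python sketch. It is a toy approximation under stated assumptions, not how HarfBuzz or any real shaper works: it treats every combining mark as part of the preceding cluster, treats a virama as gluing the next letter on, and knows about only one left-side vowel sign. The function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"          # DEVANAGARI SIGN VIRAMA
PRE_BASE_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I, drawn left of its cluster

def clusters(text):
    """Group codepoints into approximate Devanagari clusters (toy rules only)."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(out) and out[-1].endswith(VIRAMA)
        if out and (is_mark or joins_previous):
            out[-1] += ch      # extend the current cluster
        else:
            out.append(ch)     # start a new cluster
    return out

def visual_order(cluster):
    """Move a left-side vowel sign to the front of its cluster."""
    if PRE_BASE_MATRA in cluster:
        return PRE_BASE_MATRA + cluster.replace(PRE_BASE_MATRA, "")
    return cluster

NAMASTE = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(len(NAMASTE), "codepoints ->", len(clusters(NAMASTE)), "clusters")
```

A library that skips this step is effectively drawing one glyph per codepoint, which is the six-glyph output described above.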
What HarfBuzz does
HarfBuzz is a text-shaping engine first released in 2007. It reads an OpenType font's GSUB and GPOS tables (the glyph substitution and glyph positioning tables) and applies script-specific shaping rules: a dedicated Indic shaper for scripts such as Devanagari, Bengali, and Tamil, the Universal Shaping Engine for many other complex scripts, Arabic joining rules for cursive scripts, and more. These rules largely follow the OpenType script development specifications and Unicode character data. The output is a list of glyph IDs and positions, ready for the renderer.
Most browsers ship HarfBuzz. Firefox uses it. Chrome uses it. Safari relies on Apple's Core Text, which does the same job with a different implementation. The reason a paragraph of Hindi looks correct in your browser is that the browser asked its shaping engine what glyphs to use, in what order, at what positions.
Chromium is a browser. A Chromium-backed PDF engine inherits HarfBuzz. The shaping happens for free.
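For a sense of what "asking HarfBuzz" looks like, here is a minimal sketch using uharfbuzz, a third-party Python binding for HarfBuzz. It assumes the package is installed and that you supply a path to a Devanagari-capable OpenType font; the function names and the tuple layout are choices made for this illustration, not part of any library.

```python
from collections import Counter

def shape_text(text, font_path):
    """Shape text with HarfBuzz.

    Returns (glyph_id, cluster, x_advance, x_offset, y_offset) tuples.
    """
    import uharfbuzz as hb  # third-party binding, imported lazily

    blob = hb.Blob.from_file_path(font_path)
    font = hb.Font(hb.Face(blob))

    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # infer script, language, and direction
    hb.shape(font, buf)             # applies the font's GSUB and GPOS rules

    return [
        (info.codepoint, info.cluster, pos.x_advance, pos.x_offset, pos.y_offset)
        for info, pos in zip(buf.glyph_infos, buf.glyph_positions)
    ]

def glyphs_per_cluster(shaped):
    """Count how many output glyphs belong to each input cluster."""
    return dict(Counter(item[1] for item in shaped))
```

After shaping, the codepoint field holds a glyph ID rather than a Unicode value, and glyphs that share a cluster value came from the same group of input characters. Calling shape_text on नमस्ते with a font you trust and comparing the glyph count to the six input codepoints shows the substitution at work.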
What DOMPDF, mPDF, and TCPDF do
DOMPDF, mPDF, and TCPDF are the three PHP PDF libraries that ship with most WordPress invoice plugins. They are excellent at simple Latin-script invoices. They share an architectural limitation: they do not implement shaping.
DOMPDF documents this honestly. The maintainers have replied on the WordPress.org support forum and on their issue tracker that complex scripts (Devanagari, Bengali, Tamil, Khmer) are not supported and there is no plan to add support. Adding shaping would require rewriting the rendering pipeline: the library maps each Unicode codepoint to a single glyph in source order, and source order is wrong for Indic and several other scripts.
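mPDF has limited Arabic and Hebrew support. Its author wrote the shaping by hand in PHP for the scripts they cared about rather than binding to an external engine. Indic support is not something to assume: it depends on the mPDF version and on per-font configuration (the useOTL setting), so test with your actual fonts and strings.
TCPDF is similar: it has partial Arabic support but no general OpenType shaping for Indic scripts.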
The result is that when a WordPress invoice plugin running on any of these engines tries to render a Hindi customer name, the customer sees either boxes (the font does not have the glyphs), isolated letters in source order (the font has the glyphs but no shaping happens), or a mix of correct and incorrect characters (some shaping rules are accidentally satisfied and others are not).
Why this matters for PDFs
A web page that renders Hindi incorrectly can be fixed by the reader. They reload, they install a font, they file a bug. A PDF is delivered as a final artifact. The reader does not have a way to ask for a re-render. If the PDF is wrong, the document is wrong, and the seller looks unprofessional.
For B2B invoices in India, Bangladesh, Sri Lanka, Nepal, Thailand, Saudi Arabia, the UAE, Israel, Japan, Korea, and China, the readability of the customer-facing PDF directly affects whether the buyer pays on time, calls support, or chooses a different supplier next quarter.
How LightningPDF handles it
Our rendering engine is Chromium. The same HarfBuzz that powers Chrome is what shapes our PDF text. The Unicode block-by-block fallback chain that Chrome ships with is what selects fonts for any script you put in the HTML.
What the API customer has to do:
Nothing.
No font configuration. No upload of .ttf files. No shaping library. You write Hindi (or Bengali, or Tamil, or Arabic, or Hebrew, or Thai, or Korean) in the HTML you pass to the API. The PDF comes back rendered the same way Chrome would render it.
What the alternatives could do
A PDF library that wants to support Indic scripts has three paths.
1. Implement shaping in their language. Reimplement the script-specific shaping rules in PHP / Python / Ruby / JavaScript. This is what mPDF's author did for a few scripts. It is years of work, the upstream specs are large, and the maintenance burden is permanent because each Unicode release adds characters and scripts that the shaper has to track.
2. Bind to HarfBuzz natively. Call HarfBuzz from the library's language. This works but most maintainers do not want to ship native dependencies in a pure-PHP package.
3. Delegate to a browser engine. Run Chromium or Firefox in headless mode and let the browser do the work. This is what Puppeteer / Playwright wrappers do. This is what LightningPDF does. The browser is heavy but the correctness comes for free.
There is no fourth option that produces a correct Hindi PDF from naive codepoint-to-glyph mapping.
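The third path can be sketched in a few lines with Playwright's Python API. This is a generic illustration of delegating to headless Chromium, not LightningPDF's API; it assumes the playwright package and its Chromium build are installed, and that the machine has a font covering the script. The helper names build_html and render_pdf are hypothetical.

```python
import html

def build_html(text, lang="hi"):
    """Wrap text in a minimal UTF-8 HTML document."""
    return (
        f'<!doctype html><html lang="{lang}">'
        '<head><meta charset="utf-8"></head>'
        f"<body><p>{html.escape(text)}</p></body></html>"
    )

def render_pdf(html_doc, out_path):
    """Render an HTML string to a PDF file with headless Chromium."""
    from playwright.sync_api import sync_playwright  # third-party, lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch()   # headless by default
        page = browser.new_page()
        page.set_content(html_doc)
        page.pdf(path=out_path)         # shaping happens inside Chromium
        browser.close()
```

Calling render_pdf(build_html("नमस्ते"), "namaste.pdf") produces a file you can compare against the output of your current library. Nothing in the script mentions shaping, which is the point: the browser handles it.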
FAQ
What scripts are affected by this?
All Indic scripts (Devanagari, Bengali, Gujarati, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala), Khmer, Lao, Tibetan, Mongolian, Arabic, Hebrew, Syriac, N'Ko, Mandaic, Adlam. Latin, Cyrillic, Greek, and most CJK ideograph rendering do not require shaping in the same way (CJK has its own challenges but the codepoint-to-glyph mapping is closer to one-to-one).
Does this affect Arabic and Hebrew?
Arabic requires both shaping (contextual letter forms: initial, medial, final, isolated) and RTL bidirectional layout. mPDF supports both at a basic level. TCPDF has partial Arabic support. DOMPDF has none. Hebrew is simpler than Arabic but still requires bidirectional layout that the PHP libraries handle inconsistently.
Can I test whether my current PDF tool handles this?
Generate a PDF that includes the string नमस्ते. If स and त render as the joined conjunct स्त, with no visible virama stroke under स and the vowel sign sitting above the conjunct, your tool shapes correctly. If स and त render as two full letters, a virama or dotted circle is visible, or the vowel sign is detached or misplaced, it does not.
Why is this not a font problem?
Fonts contain the glyphs. Without a Devanagari font, you get tofu boxes (the missing-glyph indicator). With a Devanagari font but no shaping, you get correct glyphs in the wrong positions, in the wrong order, with the wrong combining. Fixing the font without fixing shaping does not solve the rendering problem; it changes the visible failure mode.
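The two failure modes can be told apart by inspecting the font itself. The sketch below assumes the third-party fontTools package and a font path you supply; the helper names are hypothetical. A font can cover every codepoint in a string and still need its GSUB and GPOS tables applied before the text reads correctly.

```python
def missing_codepoints(text, cmap):
    """Return codepoints in text that the font's character map cannot map to a glyph."""
    return sorted({ord(ch) for ch in text if ord(ch) not in cmap})

def inspect_font(font_path, text):
    """Report glyph coverage and presence of shaping tables for one font."""
    from fontTools.ttLib import TTFont  # third-party, lazy import

    font = TTFont(font_path)
    cmap = font.getBestCmap()  # maps codepoint -> glyph name
    return {
        "missing": missing_codepoints(text, cmap),  # these would show as tofu
        "has_GSUB": "GSUB" in font,  # substitution rules (conjuncts, half forms)
        "has_GPOS": "GPOS" in font,  # positioning rules (mark attachment)
    }
```

An empty "missing" list with boxes-free but scrambled output points at the PDF library, not the font: the glyphs and the shaping tables are there, and nothing is reading the tables.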
LightningPDF
Building fast, reliable PDF generation tools for developers.