
Unset Pdf Font With Script

I'm creating PDFs automatically using xhtml2pdf library. A couple months ago I had this problem (the library embedded fonts that I didn't use, so the printing company can not print

Solution 1:

Java

It can be done with such tools as the iText library; see example here. But that is in Java.

(Actually, I have tried this and built a very simple JAR that does just the above: it opens a Stamper and calls unused-object removal. The manual says that this will remove unused fonts, so if your troublesome fonts really are unused, it ought to do the trick. If you have a PDF to test it on, I can give it a go, or I can send you the .java and .jar files. They are built against iText 5.4.2; you can upgrade them to 5.5.3.) It is run as:

java -jar pdftrim.jar input.pdf output.pdf

Other languages (in theory even bash script)

In Python, C or shell there are no tools that I know of that are capable of doing this, yet. But it is not impossible to write one yourself.

As a first step you need to uncompress the PDF file using pdftk (not coincidentally, it is built on iText). The resulting PDF is a text file (well, apart from the first line and multibyte considerations...) and can be examined at leisure; grep will work, for example.
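The uncompress step could be scripted from Python roughly like this. It is only a sketch: it assumes the pdftk executable is on the PATH, and the helper names are made up for illustration.

```python
import subprocess


def pdftk_uncompress_cmd(src, dst):
    """Build the pdftk command line that writes an uncompressed copy of src to dst."""
    return ["pdftk", src, "output", dst, "uncompress"]


def uncompress(src, dst):
    # Needs the pdftk executable on PATH; raises CalledProcessError if pdftk fails.
    subprocess.run(pdftk_uncompress_cmd(src, dst), check=True)
```

Calling uncompress("input.pdf", "plain.pdf") would leave a greppable copy in plain.pdf.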

To detect font usage, you need to check all lines in the format

/Font NNNNNN 0 R

which would tell you that font reference object NNNNNN is in use by some text. The list of font references (not fonts) is then given by

grep "^/Font " "$PDFFILE" | sort -n -k2 | uniq
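The same lookup in Python might look like the sketch below. It assumes, as above, that the uncompressed file puts each "/Font NNNNNN 0 R" entry at the start of its own line and that lines end in LF or CR LF; the function name is hypothetical.

```python
import re

# Matches lines such as b"/Font 12 0 R" at the start of a line.
FONT_REF = re.compile(rb"^/Font (\d+) 0 R", re.MULTILINE)


def font_resource_ids(pdf_bytes):
    """Return the sorted, de-duplicated object numbers named by '/Font N 0 R' lines."""
    return sorted({int(m.group(1)) for m in FONT_REF.finditer(pdf_bytes)})
```

Reading the file with open(path, "rb").read() and passing the bytes in avoids any trouble with the binary parts of the PDF.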

We now look in the file for an item like this

 NNNNNN 0 obj
 <<
 /F0 XXXXXX 0 R
 /F1 YYYYYY 0 R
 >>

This will give us more object numbers for the different typefaces of the same font. XXXXXX might be the header for the bold font and YYYYYY the one for the bold-italic font, say. XXXXXX and YYYYYY (and maybe ZZZZZZ...) are our "true" font numbers. In the objects with those numbers one would find something like

XXXXXX 0 obj
<<
/Encoding /WinAnsiEncoding
/ToUnicode AAAAAA 0 R
/FontDescriptor BBBBBB 0 R
/Widths [...]
/Subtype /TrueType
/Type /Font
/FirstChar 32
/LastChar 121
/BaseFont /Whatever+Font+Name
>>

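which tells us that this header references a font descriptor in object BBBBBB and a ToUnicode character map in object AAAAAA. The embedded font data itself is referenced from the descriptor (through a /FontFile, /FontFile2 or /FontFile3 entry), so the descriptor may in turn point to further streams.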
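A rough Python sketch for pulling these headers out of the uncompressed file follows. It assumes that every object has generation 0 and that font dictionaries are flat (no nested << >> inside them), which holds for the example above but is not guaranteed in general; the names are made up for illustration.

```python
import re

# "N 0 obj << ... >>" with a flat dictionary body (stops at the first ">>").
OBJ = re.compile(rb"^[ \t]*(\d+) 0 obj\s*<<(.*?)>>", re.MULTILINE | re.DOTALL)


def font_headers(pdf_bytes):
    """Map object number -> BaseFont name and referenced object numbers for /Type /Font objects."""
    fonts = {}
    for m in OBJ.finditer(pdf_bytes):
        body = m.group(2)
        # \b keeps /Type /FontDescriptor objects out of the result.
        if not re.search(rb"/Type\s*/Font\b", body):
            continue
        entry = {}
        name = re.search(rb"/BaseFont\s*/(\S+)", body)
        if name:
            entry["BaseFont"] = name.group(1).decode("latin-1")
        for key in (b"FontDescriptor", b"ToUnicode"):
            ref = re.search(rb"/" + key + rb"\s+(\d+) 0 R", body)
            if ref:
                entry[key.decode()] = int(ref.group(1))
        fonts[int(m.group(1))] = entry
    return fonts
```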

So with a bit of dictionary lookup storage to handle these levels of indirection (a directive such as /Font refers to one object number, while the corresponding /BaseFont lives in another object), we can now:

  • find what fonts are embedded (through the /BaseFont directive, following it if needed)
  • find what fonts are used (through the /Font directive)
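The bookkeeping for the two lists can be sketched in Python as below. The three inputs are assumed to have been collected already from the uncompressed file, and all names here are hypothetical.

```python
def unused_fonts(resource_dicts, used_resource_ids, font_names):
    """
    resource_dicts:    {font reference object number: [font header object numbers]}
                       (the /F0 XXXXXX 0 R, /F1 YYYYYY 0 R entries)
    used_resource_ids: object numbers seen in '/Font N 0 R' lines
    font_names:        {font header object number: BaseFont name}
    Returns the font headers that no used font reference object points to.
    """
    used = set()
    for rid in used_resource_ids:
        # Follow one level of indirection: reference object -> font headers.
        used.update(resource_dicts.get(rid, []))
    return {num: name for num, name in font_names.items() if num not in used}
```

Whatever comes back is a candidate for removal; it is worth checking the names by eye before deleting anything.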

Removal is possible (though not for the faint of heart): remove the unused font object subtree, starting at the objects that carry BaseFont and FontDescriptor, renumber the objects that have higher IDs, and then recalculate all file offsets (they are listed at the bottom of the PDF file). In practice this last step is achieved by copying the objects from the old PDF to the new one and reading the file offset in the new file via ftell(). Then the PDF XREF at the bottom can be rewritten:

xref                     -- start of XREF (NOT NECESSARILY AT A NEWLINE)
0 3315                   -- there are 3315 objects
0000000000 65535 f       -- not an object; flags
0000000015 00000 n       -- first object is 15 bytes past the beginning of the file
0000033003 00000 n
...
0010169101 00000 n
trailer
<<
/Info 3314 0 R           -- the info table, usually just before the XREF (needs renumbering)
/Root 3259 0 R           -- the root object ID (needs renumbering)
/Size 3315               -- number of objects, again
>>
startxref
10169367                 -- file offset of XREF table above.
%%EOF
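The copy-and-tell() approach could look like this in Python. It is a minimal sketch under several assumptions: the objects have already been renumbered consecutively from 1, every generation number is 0, and the /Info entry is left out of the trailer for brevity. The function name is hypothetical.

```python
import io


def write_pdf(objects, root_id, out):
    """
    objects: list of bytes, each a complete 'N 0 obj ... endobj' block,
             in order and already renumbered 1..len(objects).
    root_id: object number of the document catalog.
    out:     binary file-like object opened for writing.
    Returns the byte offset of the xref table.
    """
    out.write(b"%PDF-1.4\n")
    offsets = []
    for obj in objects:
        offsets.append(out.tell())           # offset where this object starts
        out.write(obj.rstrip(b"\n") + b"\n")
    xref_pos = out.tell()
    size = len(objects) + 1                   # +1 for the free entry, object 0
    out.write(b"xref\n0 %d\n" % size)
    out.write(b"0000000000 65535 f \n")       # each entry is exactly 20 bytes
    for off in offsets:
        out.write(b"%010d 00000 n \n" % off)
    out.write(b"trailer\n<<\n/Root %d 0 R\n/Size %d\n>>\n" % (root_id, size))
    out.write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)
    return xref_pos
```

Writing to an io.BytesIO first makes it easy to check the offsets before touching a real file.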

pdftk can then be used to recompress the resulting PDF file.

I've also tried using tools such as PDFEdit, but with little success.

Solution 2:

Typically, a font is embedded in the file if some of its characters have been used. A safer approach would be to embed all fonts in your PDF file. Assuming output.pdf must be of prepress quality, you can use

  gswin64c -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dCompressFonts=true -dSubsetFonts=true -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf -f input.pdf

You need to install Ghostscript (http://www.ghostscript.com/); the options are described at http://www.ghostscript.com/doc/9.14/Ps2pdf.htm#Options
