Fix ToUnicode handling for symbolic fonts #572

servn · 2025-11-13T16:07:14Z

Fix ToUnicode handling for symbolic fonts
Issue: #570

Problem
The previous implementation of Font#extract_base_info treated any truthy value in the ToUnicode field as a stream reference and tried to process it as one. That fails for certain PDF fonts, particularly symbolic fonts, whose ToUnicode entry may hold a value that is neither a stream nor a reference to a stream.

Solution
Added a new private method stream_or_reference? that properly validates whether an object is either a PDF::Reader::Stream or a PDF::Reader::Reference before attempting to process it as a ToUnicode stream
Updated the ToUnicode processing logic in extract_base_info to use this type check instead of a simple truthy check
This prevents attempts to dereference non-stream objects that could cause runtime errors

Changes
Added stream_or_reference?(obj) method with proper type checking
Modified the ToUnicode processing condition from if obj[:ToUnicode] to if stream_or_reference?(obj[:ToUnicode])
Added Sorbet type annotations for the new method
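
For readers skimming the description, here is a rough sketch of the kind of check listed above. It is not the code from the commit. StubStream and StubReference are stand-ins for PDF::Reader::Stream and PDF::Reader::Reference so the snippet runs without the gem, to_unicode_source is a hypothetical helper, the :"Identity-H" value is only an illustrative non-stream entry, and the Sorbet signature is left out.

```ruby
# Stand-ins for PDF::Reader::Stream and PDF::Reader::Reference so this sketch
# runs without the pdf-reader gem. They are NOT the gem's real classes.
StubStream    = Struct.new(:dict, :data)
StubReference = Struct.new(:id, :gen)

# Sketch of the predicate: true only for a stream or a reference to one.
# The real method would test against the gem's own classes instead.
def stream_or_reference?(obj)
  obj.is_a?(StubStream) || obj.is_a?(StubReference)
end

# Hypothetical helper showing the changed condition: the ToUnicode entry is
# returned for further processing only when it passes the type check,
# rather than whenever it is truthy.
def to_unicode_source(font_dict)
  candidate = font_dict[:ToUnicode]
  stream_or_reference?(candidate) ? candidate : nil
end

# Illustrative font dictionaries (made-up values, not taken from a real PDF).
with_reference = { BaseFont: :Example, ToUnicode: StubReference.new(12, 0) }
with_name      = { BaseFont: :Example, ToUnicode: :"Identity-H" }

puts to_unicode_source(with_reference).inspect
puts to_unicode_source(with_name).inspect
```

With the old truthy check, the second dictionary would have been passed on as if it were a stream; with the type check it is skipped.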

Impact
This fix ensures more robust font handling, particularly for PDFs containing symbolic fonts or fonts with non-standard ToUnicode field values, preventing potential crashes during font processing.

yob

Thanks! So the goal here is to avoid raising an error on PDFs like this, but not enable extracting the text?

I'd love to get a minimal test case and sample PDF in spec/integration_spec.rb. I assume the file you're testing with isn't suitable for adding to our test corpus? If not, I can look at adding one post-merge.

servn · 2025-11-14T13:53:58Z

Yes, the goal is to prevent avoidable errors for now. I can’t share the PDF I was testing due to PII restrictions, but I understand the importance of having proper specs. I’ll request a similar PDF that uses the same fonts but doesn’t contain any personal data.

servn · 2025-11-17T07:19:05Z

Basically, this change doesn’t just suppress the exception; it actually allows the parser to continue and process the font correctly.

Below is the data PDF::Reader extracted from the PDF, followed by what pdfalyze produced.

[1] pry(main)> pdf = PDF::Reader.new('PDF.pdf')
=> #<PDF::Reader:0x000000013f25ffb8
[2] pry(main)> pdf.pages.map { |p| p.fonts.map { |k,f| [k, f[:BaseFont], f[:DescendantFonts].pluck(:BaseFont)] } }
=> [[[:C2_0, :"FAUXTE+Arial", [:"FAUXTE+Arial"]], [:C2_1, :"TIQDHC+Arial,Bold", [:"TIQDHC+Arial,Bold"]], [:C2_2, :FreeSansBold, [:FreeSansBold]], [:C2_3, :FreeSans, [:FreeSans]]]]

pdfalyze PDF.pdf

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                                              │
│ 4 fonts found in PDF.pdf                                                                                                                                     │
│                                                                                                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯


╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 13. Font /C2_0 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬───────────────┐
│                sub_type │ /Type0        │
│               base_font │ /FAUXTE+Arial │
│                   flags │ None          │
│            bounding_box │ None          │
│      /Length properties │ None          │
│ total advertised length │ None          │
└─────────────────────────┴───────────────┘



╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 21. Font /C2_1 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬────────────────────┐
│                sub_type │ /Type0             │
│               base_font │ /TIQDHC+Arial,Bold │
│                   flags │ None               │
│            bounding_box │ None               │
│      /Length properties │ None               │
│ total advertised length │ None               │
└─────────────────────────┴────────────────────┘



╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 29. Font /C2_2 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬───────────────┐
│                sub_type │ /Type0        │
│               base_font │ /FreeSansBold │
│                   flags │ None          │
│            bounding_box │ None          │
│      /Length properties │ None          │
│ total advertised length │ None          │
└─────────────────────────┴───────────────┘



╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 34. Font /C2_3 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬───────────┐
│                sub_type │ /Type0    │
│               base_font │ /FreeSans │
│                   flags │ None      │
│            bounding_box │ None      │
│      /Length properties │ None      │
│ total advertised length │ None      │
└─────────────────────────┴───────────┘

Fix ToUnicode handling for symbolic fonts

36c526a

yob reviewed Nov 13, 2025

servn requested a review from yob November 17, 2025 06:20
