Skip to content

Conversation

@servn
Copy link

@servn servn commented Nov 13, 2025

Fix ToUnicode handling for symbolic fonts
Issue: #570

Problem
The previous implementation in Font#extract_base_info would attempt to process any truthy value in the ToUnicode field as a stream reference. This caused issues with certain PDF fonts, particularly symbolic fonts, where the ToUnicode field might contain symbolic values that are not actually streams or references to streams.

Solution
Added a new private method stream_or_reference? that properly validates whether an object is either a PDF::Reader::Stream or a PDF::Reader::Reference before attempting to process it as a ToUnicode stream
Updated the ToUnicode processing logic in extract_base_info to use this type check instead of a simple truthy check
This prevents attempts to dereference non-stream objects that could cause runtime errors

Changes
Added stream_or_reference?(obj) method with proper type checking
Modified the ToUnicode processing condition from if obj[:ToUnicode] to if stream_or_reference?(obj[:ToUnicode])
Added Sorbet type annotations for the new method

Impact
This fix ensures more robust font handling, particularly for PDFs containing symbolic fonts or fonts with non-standard ToUnicode field values, preventing potential crashes during font processing.

Copy link
Owner

@yob yob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! So the goal here is to avoid raising an error on PDFs like this, but not enable extracting the text?

I'd love to get a minimal test case and sample PDF in spec/integration_spec.rb. I assume the file you're testing with isn't suitable for adding to our test corpus? If not, I can look at adding one post-merge

@servn
Copy link
Author

servn commented Nov 14, 2025

Yes, the goal is to prevent avoidable errors for now. I can’t share the PDF I was testing due to PII restrictions, but I understand the importance of having proper specs. I’ll request a similar PDF that uses the same fonts but doesn’t contain any personal data.

@servn servn requested a review from yob November 17, 2025 06:20
@servn
Copy link
Author

servn commented Nov 17, 2025

Basically, this change doesn’t just suppress the exception — it actually allows the parser to continue and process the font correctly.

Below is the data PDF::Reader extracted from the PDF, followed by what pdfalyze produced.

[1] pry(main)> pdf = PDF::Reader.new('PDF.pdf')
=> #<PDF::Reader:0x000000013f25ffb8
[2] pry(main)> pdf.pages.map { |p| p.fonts.map { |k,f| [k, f[:BaseFont], f[:DescendantFonts].pluck(:BaseFont)] } }
=> [[[:C2_0, :"FAUXTE+Arial", [:"FAUXTE+Arial"]], [:C2_1, :"TIQDHC+Arial,Bold", [:"TIQDHC+Arial,Bold"]], [:C2_2, :FreeSansBold, [:FreeSansBold]], [:C2_3, :FreeSans, [:FreeSans]]]]

pdfalyze PDF.pdf

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                                              │
│ 4 fonts found in PDF.pdf                                                                                                                                     │
│                                                                                                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯


╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 13. Font /C2_0 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬───────────────┐
│                sub_type │ /Type0        │
│               base_font │ /FAUXTE+Arial │
│                   flags │ None          │
│            bounding_box │ None          │
│      /Length properties │ None          │
│ total advertised length │ None          │
└─────────────────────────┴───────────────┘



╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 21. Font /C2_1 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬────────────────────┐
│                sub_type │ /Type0             │
│               base_font │ /TIQDHC+Arial,Bold │
│                   flags │ None               │
│            bounding_box │ None               │
│      /Length properties │ None               │
│ total advertised length │ None               │
└─────────────────────────┴────────────────────┘



╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 29. Font /C2_2 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬───────────────┐
│                sub_type │ /Type0        │
│               base_font │ /FreeSansBold │
│                   flags │ None          │
│            bounding_box │ None          │
│      /Length properties │ None          │
│ total advertised length │ None          │
└─────────────────────────┴───────────────┘



╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                      │
│ 34. Font /C2_3 (Type0)                                                                                               │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
┌─────────────────────────┬───────────┐
│                sub_type │ /Type0    │
│               base_font │ /FreeSans │
│                   flags │ None      │
│            bounding_box │ None      │
│      /Length properties │ None      │
│ total advertised length │ None      │
└─────────────────────────┴───────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants