Fix ToUnicode handling for symbolic fonts #572
yob left a comment
Thanks! So the goal here is to avoid raising an error on PDFs like this, but not enable extracting the text?
I'd love to get a minimal test case and sample PDF in spec/integration_spec.rb. I assume the file you're testing with isn't suitable for adding to our test corpus? If not, I can look at adding one post-merge.
Yes, the goal is to prevent avoidable errors for now. I can’t share the PDF I was testing due to PII restrictions, but I understand the importance of having proper specs. I’ll request a similar PDF that uses the same fonts but doesn’t contain any personal data.
Basically, this change doesn’t just suppress the exception — it actually allows the parser to continue and process the font correctly. Below is the data
Fix ToUnicode handling for symbolic fonts
Issue: #570
Problem
The previous implementation in Font#extract_base_info would attempt to process any truthy value in the ToUnicode field as a stream reference. This caused issues with certain PDF fonts, particularly symbolic fonts, where the ToUnicode field might contain symbolic values that are not actually streams or references to streams.
Solution
- Added a new private method, stream_or_reference?, that properly validates whether an object is either a PDF::Reader::Stream or a PDF::Reader::Reference before attempting to process it as a ToUnicode stream
- Updated the ToUnicode processing logic in extract_base_info to use this type check instead of a simple truthy check
- This prevents attempts to dereference non-stream objects that could cause runtime errors
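For illustration, the type check described above might look roughly like the sketch below. It is not the code from this PR: Sketch::Stream and Sketch::Reference are hypothetical stand-ins for PDF::Reader::Stream and PDF::Reader::Reference so that the snippet runs without the gem, and the Sorbet signature in the comment is an assumed shape.

```ruby
# Minimal sketch of the predicate, using stand-in classes (assumptions)
# in place of PDF::Reader::Stream and PDF::Reader::Reference.
module Sketch
  Stream    = Struct.new(:dict, :data)
  Reference = Struct.new(:id, :gen)

  # With Sorbet, a signature along these lines could precede the method
  # (assumed shape, not copied from the PR):
  #   sig { params(obj: T.untyped).returns(T::Boolean) }
  def self.stream_or_reference?(obj)
    # Only these two types can be resolved to a ToUnicode CMap stream;
    # anything else (Symbol, String, nil, ...) is rejected.
    obj.is_a?(Stream) || obj.is_a?(Reference)
  end
end
```

Under these assumptions, a symbol value such as :"Identity-H" is rejected by the predicate instead of being passed on for dereferencing, whereas a plain truthy check would accept it.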
Changes
- Added stream_or_reference?(obj) method with proper type checking
- Modified the ToUnicode processing condition from if obj[:ToUnicode] to if stream_or_reference?(obj[:ToUnicode])
- Added Sorbet type annotations for the new method
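The effect of the changed condition can be shown with a small simulation. Everything in it is hypothetical scaffolding: GuardSketch, OBJECTS, deref_stream, old_guard and new_guard are illustrative names, the font dictionary is a plain Hash, and the symbol value is an assumed example of a non-stream ToUnicode entry. It only contrasts the truthy check with the type check.

```ruby
# Simulation of the old and new guard; none of these names come from pdf-reader.
module GuardSketch
  Stream    = Struct.new(:data)
  Reference = Struct.new(:id, :gen)

  # Hypothetical object table mapping [id, gen] to a stream.
  OBJECTS = { [7, 0] => Stream.new("cmap-bytes") }.freeze

  # Resolve a reference if needed, then read the stream data.
  # Calling this with a Symbol raises NoMethodError.
  def self.deref_stream(obj)
    obj = OBJECTS.fetch([obj.id, obj.gen]) if obj.is_a?(Reference)
    obj.data
  end

  def self.stream_or_reference?(obj)
    obj.is_a?(Stream) || obj.is_a?(Reference)
  end

  # Old condition: any truthy ToUnicode value is dereferenced.
  def self.old_guard(font)
    deref_stream(font[:ToUnicode]) if font[:ToUnicode]
  end

  # New condition: only a stream or a reference is dereferenced.
  def self.new_guard(font)
    deref_stream(font[:ToUnicode]) if stream_or_reference?(font[:ToUnicode])
  end
end
```

In this sketch, old_guard raises for a font whose ToUnicode entry is a symbol, while new_guard returns nil and lets processing continue. Both behave the same when the entry is a reference or a stream.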
Impact
This fix ensures more robust font handling, particularly for PDFs containing symbolic fonts or fonts with non-standard ToUnicode field values, preventing potential crashes during font processing.