-
Notifications
You must be signed in to change notification settings - Fork 1.5k
MAINT: Use CORE_FONT_METRICS for widths where possible #3526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This patch reads the character widths for the core 14 Type 1 PDF fonts from the _codecs/core_fontmetrics.py file.
The _font_widths.py file contains a set of standard widths for the core 14 Type 1 PDF fonts. However, it is incomplete; for instance, it sets all Helvetica variants to the same set of widths. Since we have imported FONT_METRICS from the original Adobe core fot afm files now, which are complete, complete, port over the Font class to use that data.
This file contained STANDARD_WIDTHS, which was contained font widths for the core 14 Type 1 PDF fonts. This is no longer used since we imported FONT_METRICS from the afm files. Also, it would appear that the _font_widths.py did not include the associated licensing information, which, with its removal, is now also remedied.
It appears that, with the new font metrics some space widths in two tests are recognised as slightly smaller than before. Change the associated tests accordingly.
Default width, if not given or defined in a font resource, is calculated either by averaging all font widths or, in the case of unembedded fonts, calculated as the width of two spaces. Previously, this width was part of a separate table in _cmap.py. This patch changes the associated logic and instead reads space width from CORE_FONT_METRICS, if a default width is not given or found elsewhere. Note that the patch removes the info for four space widths from pypdf, i.e., the four Helvetica-Narrow fonts. However, these are not among the fourteen Adobe standard fonts, so, they should be entirely embedded (and have a default value or be able to compute one by averaging all font widths). Note that it's slightly inconsistent now, that CID fonts that don't have a default width defined will need to get it from the value set by the user or when, by accident, the have the same name as one of the fourteen Adobe standard fonts, whereas other embedded fonts get a default width equal to the average character width, and the fourteen standard fonts get a default of two space widths. However, this inconsistency was already part of _cmap.py before.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3526 +/- ##
==========================================
- Coverage 97.16% 97.16% -0.01%
==========================================
Files 57 56 -1
Lines 9809 9805 -4
Branches 1781 1783 +2
==========================================
- Hits 9531 9527 -4
Misses 167 167
Partials 111 111 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| Style Desc: 1172 KNIT SHORTIE | ||
| Fit / Other: | ||
| PRODUCT SUMMARY | ||
| Date Printed: 17/Nov/2022 2022 AEO Management Co. All Rights Reserved. Proprietary and Confidential AEO Business Information. Subject to Legal Action if Disclosed Without Authorization from AEO. Page 1of 9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change regarding the page number looks like the results got worse here, compared to pdftotext which keeps the whitespace in layout mode as well.
I guess that we cannot really influence this as the font data is just like this? Which of the changes is most likely the culprit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll bisect and find it. In any case, that "extra" space would be there in the stream data.
This PR tries to harmonise on using the Adobe standard fonts information in CORE_FONT_METRICS where possible and applicable.
Currently, pypdf collects character widths in three different places. First, there is build_font_width_map
pypdf/pypdf/_cmap.py
Line 402 in e9e3735
pypdf/pypdf/_text_extraction/_layout_mode/_font.py
Line 39 in e9e3735
This PR tries to make use of the data in CORE_FONT_METRICS in both build_font_width_map and the Font class post_init method, and addresses resulting changes in test results. As it is, this information was unavailable in cmap, and only partially used (incomplete) in the Font class post_init. In that respect, this PR means improved coverage of available font widths in these places. Also, the PR removes incomplete information where it was previously used.
It possibly is a first step towards harmonising the ways in which we collect character widths, which now still seem to differ in important ways between cmap and the Font class. Ultimately, I hope to arrive at one solution that either is part of the FontDescriptor dataclass or that can be used there.