MAINT: Use CORE_FONT_METRICS for widths where possible #3526

PJBrs · 2025-11-19T10:30:08Z

This PR tries to harmonise on using the Adobe standard fonts information in CORE_FONT_METRICS where possible and applicable.

Currently, pypdf collects character widths in three different places. First, there is build_font_width_map

pypdf/pypdf/_cmap.py

Line 402 in e9e3735

def build_font_width_map(

in _cmap.py , then there is the post_init method

pypdf/pypdf/_text_extraction/_layout_mode/_font.py

Line 39 in e9e3735

def __post_init__(self) -> None:

of the Font class in _text_extraction/_layout_mode/_font.py, and there are the 14 adobe font widths in _font, in the FontDesciptor class.

This PR tries to make use of the data in CORE_FONT_METRICS in both build_font_width_map and the Font class post_init method, and addresses resulting changes in test results. As it is, this information was unavailable in cmap, and only partially used (incomplete) in the Font class post_init. In that respect, this PR means improved coverage of available font widths in these places. Also, the PR removes incomplete information where it was previously used.

It possibly is a first step towards harmonising the ways in which we collect character widths, which now still seem to differ in important ways between cmap and the Font class. Ultimately, I hope to arrive at one solution that either is part of the FontDescriptor dataclass or that can be used there.

This patch reads the character widths for the core 14 Type 1 PDF fonts from the _codecs/core_fontmetrics.py file.

The _font_widths.py file contains a set of standard widths for the core 14 Type 1 PDF fonts. However, it is incomplete; for instance, it sets all Helvetica variants to the same set of widths. Since we have imported FONT_METRICS from the original Adobe core fot afm files now, which are complete, complete, port over the Font class to use that data.

This file contained STANDARD_WIDTHS, which was contained font widths for the core 14 Type 1 PDF fonts. This is no longer used since we imported FONT_METRICS from the afm files. Also, it would appear that the _font_widths.py did not include the associated licensing information, which, with its removal, is now also remedied.

It appears that, with the new font metrics some space widths in two tests are recognised as slightly smaller than before. Change the associated tests accordingly.

Default width, if not given or defined in a font resource, is calculated either by averaging all font widths or, in the case of unembedded fonts, calculated as the width of two spaces. Previously, this width was part of a separate table in _cmap.py. This patch changes the associated logic and instead reads space width from CORE_FONT_METRICS, if a default width is not given or found elsewhere. Note that the patch removes the info for four space widths from pypdf, i.e., the four Helvetica-Narrow fonts. However, these are not among the fourteen Adobe standard fonts, so, they should be entirely embedded (and have a default value or be able to compute one by averaging all font widths). Note that it's slightly inconsistent now, that CID fonts that don't have a default width defined will need to get it from the value set by the user or when, by accident, the have the same name as one of the fourteen Adobe standard fonts, whereas other embedded fonts get a default width equal to the average character width, and the fourteen standard fonts get a default of two space widths. However, this inconsistency was already part of _cmap.py before.

codecov · 2025-11-19T11:41:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.16%. Comparing base (310e571) to head (8726179).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3526      +/-   ##
==========================================
- Coverage   97.16%   97.16%   -0.01%     
==========================================
  Files          57       56       -1     
  Lines        9809     9805       -4     
  Branches     1781     1783       +2     
==========================================
- Hits         9531     9527       -4     
  Misses        167      167              
  Partials      111      111

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

stefan6419846 · 2025-11-27T10:57:11Z

resources/AEO.1172.layout.txt

-                                      Style Desc:                 1172 KNIT SHORTIE
-                                      Fit / Other:
-                                                                                                           PRODUCT SUMMARY
-Date Printed: 17/Nov/2022                                        2022 AEO Management Co. All Rights Reserved. Proprietary and Confidential AEO Business Information. Subject to Legal Action if Disclosed Without Authorization from AEO.     Page  1of 9


The change regarding the page number looks like the results got worse here, compared to pdftotext which keeps the whitespace in layout mode as well.

I guess that we cannot really influence this as the font data is just like this? Which of the changes is most likely the culprit here?

I'll bisect and find it. In any case, that "extra" space would be there in the stream data.

PJBrs added 4 commits November 19, 2025 11:22

ENH: build_font_width_map: Add widths for core 14 fonts

efefa12

This patch reads the character widths for the core 14 Type 1 PDF fonts from the _codecs/core_fontmetrics.py file.

MAINT: Fix tests after moving to complete font metrics

6e1ca67

It appears that, with the new font metrics some space widths in two tests are recognised as slightly smaller than before. Change the associated tests accordingly.

PJBrs marked this pull request as draft November 19, 2025 10:47

PJBrs force-pushed the widths branch from a4dbf2e to 662e977 Compare November 19, 2025 11:31

PJBrs marked this pull request as ready for review November 19, 2025 12:06

Merge branch 'main' into widths

8726179

stefan6419846 reviewed Nov 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MAINT: Use CORE_FONT_METRICS for widths where possible #3526

MAINT: Use CORE_FONT_METRICS for widths where possible #3526

PJBrs commented Nov 19, 2025 •

edited

Loading

Uh oh!

codecov bot commented Nov 19, 2025 •

edited

Loading

Uh oh!

stefan6419846 Nov 27, 2025

Uh oh!

PJBrs Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MAINT: Use CORE_FONT_METRICS for widths where possible #3526

Are you sure you want to change the base?

MAINT: Use CORE_FONT_METRICS for widths where possible #3526

Conversation

PJBrs commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

stefan6419846 Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

PJBrs Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PJBrs commented Nov 19, 2025 •

edited

Loading

codecov bot commented Nov 19, 2025 •

edited

Loading