Improving Textract/Google Vision hOCR/HTML formatting #62
Unanswered
bluebox-steven
asked this question in Q&A
Replies: 1 comment
A feature that corrects low-quality or missing positioning and style data is planned and should come out in an upcoming release. Much of this logic already exists and is used for the internal recognition feature.
Hi all. I came across scribe.js, and so far the project looks great for what I need. However, I'm running into issues getting the formatting right when inputting Textract/Google Vision JSON files. I understand these importers are experimental, but I'm mostly looking for a discussion around improving the hOCR/HTML output.
At the bottom of this post is an example of the problem, which isn't present when using Tesseract. The difference, from what I can tell, is the font metrics and scribe's internal `x_x_height` and `x_asc_height` calculations, which look to be based on ascending/descending characters; that information is missing from the other engine imports.
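For reference, here's a rough sketch of how I understand that kind of ascender/descender-based estimate could work. This is purely illustrative and based on my own reading, not scribe.js's actual code; the character sets, data shape, and function names are mine. The point is that it needs per-character boxes, which the Textract/Google Vision imports don't provide in the same way.

```js
// Illustrative only: estimate x-height and ascender height for a line by
// splitting its glyphs into "x-height only" characters and ascender/capital
// characters, then comparing their bounding-box heights.
const X_HEIGHT_CHARS = new Set('aceimnorsuvwxz');
const ASCENDER_CHARS = new Set('bdfhklt' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789');

/**
 * @param {{text: string, bbox: {top: number, bottom: number}}[]} chars
 *   Per-character text and bounding boxes for one line (hypothetical shape).
 * @returns {{xHeight: number|null, ascHeight: number|null}}
 */
function estimateLineMetrics(chars) {
  // Heights of all characters in the given set.
  const heights = (set) =>
    chars.filter((c) => set.has(c.text)).map((c) => c.bbox.bottom - c.bbox.top);

  // Median is more robust than mean against the odd mis-sized glyph box.
  const median = (arr) => {
    if (arr.length === 0) return null;
    const sorted = [...arr].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
  };

  return {
    xHeight: median(heights(X_HEIGHT_CHARS)),
    ascHeight: median(heights(ASCENDER_CHARS)),
  };
}
```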
My thinking is to combine Tesseract and Textract/Google Vision results, using Tesseract for formatting and Textract/Google Vision as the source of truth for character recognition. There are obvious potential problems with this, such as differing bounding boxes and character recognition between the engines, but I'm hoping a fairly simple intersection calculation will do; a sketch of what I mean is below.
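To make the idea concrete, here's the kind of intersection calculation I had in mind, written as a standalone sketch. None of this is scribe.js code; the data shapes, threshold, and function names are placeholders.

```js
/** Intersection-over-union of two {left, top, right, bottom} boxes. */
function iou(a, b) {
  const ix = Math.max(0, Math.min(a.right, b.right) - Math.max(a.left, b.left));
  const iy = Math.max(0, Math.min(a.bottom, b.bottom) - Math.max(a.top, b.top));
  const inter = ix * iy;
  const areaA = (a.right - a.left) * (a.bottom - a.top);
  const areaB = (b.right - b.left) * (b.bottom - b.top);
  return inter / (areaA + areaB - inter);
}

/**
 * Match each Textract/Google Vision word to the Tesseract word whose box
 * overlaps it most, keeping the Textract text but borrowing Tesseract styling.
 * @param {{text: string, bbox: object}[]} textractWords  source of truth for text
 * @param {{text: string, bbox: object, style: object}[]} tesseractWords  source of formatting
 * @param {number} threshold  minimum IoU to accept a match
 */
function mergeWords(textractWords, tesseractWords, threshold = 0.5) {
  return textractWords.map((word) => {
    let best = null;
    let bestIou = threshold;
    for (const candidate of tesseractWords) {
      const overlap = iou(word.bbox, candidate.bbox);
      if (overlap > bestIou) {
        bestIou = overlap;
        best = candidate;
      }
    }
    // No sufficiently overlapping Tesseract word: fall back to the raw word.
    return best ? { ...word, style: best.style } : word;
  });
}
```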
Has anyone else tackled something like this, and/or have suggestions on how best to improve the output?
Thanks!
Examples (attached in the original post as images and hOCR snippets): the original PDF, the Tesseract and Textract renders, and the corresponding Tesseract and Textract hOCR output.