Improving Textract/Google Vision hOCR/HTML formatting #62
Unanswered
bluebox-steven
asked this question in Q&A
Replies: 1 comment
A feature that corrects low-quality or missing positioning and style data is planned and should come out in an upcoming release. Much of this logic already exists and is used for the internal recognition feature.
Hi all. I came across scribe.js, and so far the project looks great for what I need. However, I'm running into issues getting the formatting right when inputting Textract/Google Vision JSON files. I understand these importers are experimental, but I'm mostly looking for a discussion around improving the hOCR/HTML output.
At the bottom of this post is an example of the problem, which isn't present when using Tesseract. The difference, from what I can tell, is the font metrics and scribe's internal `x_x_height` and `x_asc_height` calculations, which look to be based on ascending/descending characters; that information is missing from the other engine imports.
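For reference, here's a rough sketch of how I understand that kind of ascender/descender-based estimate could work. This is purely illustrative and based on my own reading, not scribe.js's actual code; the character sets, data shape, and function names are mine. The point is that it needs per-character boxes, which the Textract/Google Vision imports don't provide in the same way.

```js
// Illustrative only: estimate x-height and ascender height for a line by
// splitting its glyphs into "x-height only" characters and ascender/capital
// characters, then comparing their bounding-box heights.
const X_HEIGHT_CHARS = new Set('aceimnorsuvwxz');
const ASCENDER_CHARS = new Set('bdfhklt' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789');

/**
 * @param {{text: string, bbox: {top: number, bottom: number}}[]} chars
 *   Per-character text and bounding boxes for one line (hypothetical shape).
 * @returns {{xHeight: number|null, ascHeight: number|null}}
 */
function estimateLineMetrics(chars) {
  // Heights of all characters in the given set.
  const heights = (set) =>
    chars.filter((c) => set.has(c.text)).map((c) => c.bbox.bottom - c.bbox.top);

  // Median is more robust than mean against the odd mis-sized glyph box.
  const median = (arr) => {
    if (arr.length === 0) return null;
    const sorted = [...arr].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
  };

  return {
    xHeight: median(heights(X_HEIGHT_CHARS)),
    ascHeight: median(heights(ASCENDER_CHARS)),
  };
}
```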
My thinking is to combine Tesseract and Textract/Google Vision results, using Tesseract for formatting and Textract/Google Vision as the source of truth for character recognition. There are obvious potential problems with this, such as differing bounding boxes and character recognition between the engines, but I'm hoping a fairly simple intersection calculation will do; a sketch of what I mean is below.
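To make the idea concrete, here's the kind of intersection calculation I had in mind, written as a standalone sketch. None of this is scribe.js code; the data shapes, threshold, and function names are placeholders.

```js
/** Intersection-over-union of two {left, top, right, bottom} boxes. */
function iou(a, b) {
  const ix = Math.max(0, Math.min(a.right, b.right) - Math.max(a.left, b.left));
  const iy = Math.max(0, Math.min(a.bottom, b.bottom) - Math.max(a.top, b.top));
  const inter = ix * iy;
  const areaA = (a.right - a.left) * (a.bottom - a.top);
  const areaB = (b.right - b.left) * (b.bottom - b.top);
  return inter / (areaA + areaB - inter);
}

/**
 * Match each Textract/Google Vision word to the Tesseract word whose box
 * overlaps it most, keeping the Textract text but borrowing Tesseract styling.
 * @param {{text: string, bbox: object}[]} textractWords  source of truth for text
 * @param {{text: string, bbox: object, style: object}[]} tesseractWords  source of formatting
 * @param {number} threshold  minimum IoU to accept a match
 */
function mergeWords(textractWords, tesseractWords, threshold = 0.5) {
  return textractWords.map((word) => {
    let best = null;
    let bestIou = threshold;
    for (const candidate of tesseractWords) {
      const overlap = iou(word.bbox, candidate.bbox);
      if (overlap > bestIou) {
        bestIou = overlap;
        best = candidate;
      }
    }
    // No sufficiently overlapping Tesseract word: fall back to the raw word.
    return best ? { ...word, style: best.style } : word;
  });
}
```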
Has anyone else tackled something like this, and/or have suggestions on how best to improve the output?
Thanks!
Examples (attached in the original post as images and hOCR snippets): the original PDF, the Tesseract and Textract renders, and the corresponding Tesseract and Textract hOCR output.