forked from PyYoshi/cChardet
Open
Description
UniversalDetector(), when fed data through feed(), returns {'encoding': None, 'confidence': None}, whereas cchardet.detect() produces the correct result for the same file.
import cchardet
from pathlib import Path

UniversalDetect = cchardet.UniversalDetector()
smi = Path('file.smi')


def cchardet_detect(input_file: Path):
    # Read the whole file and detect in one call.
    with input_file.open(mode="rb") as ifp:
        data = ifp.read()
    return cchardet.detect(data)


def cchardet_universal(input_file: Path):
    # Feed the file line by line to the incremental detector.
    with input_file.open(mode="rb") as ifp:
        for line in ifp:
            UniversalDetect.feed(line)
            if UniversalDetect.done:
                break
    return UniversalDetect.result


cchardet_detect(smi)
cchardet_universal(smi)
The file in question is encoded in UHC (aka CP949).
Output:
>>> cchardet_detect(smi)
{'encoding': 'UHC', 'confidence': 0.9900000095367432}
>>> cchardet_universal(smi)
{'encoding': None, 'confidence': None}
Also of note:
- This seems to be related to "cChardet returns {'encoding': None, 'confidence': None} on very large file" (PyYoshi/cChardet#69). That issue attributes the problem to the input file being large. My guess is that this attribution is incorrect, as I obtained the same result here with a file of only ~100 KiB.
- chardet.universaldetector.UniversalDetector() does not suffer from this.
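One thing that may be worth ruling out: chardet's documented usage calls close() on the detector after the feed loop and only then reads result. The sketch below follows that pattern. It assumes the detector exposes feed(), done, close() and result the way chardet's UniversalDetector does; whether a missing close() actually explains the None result with cChardet is a guess that has not been verified here. The helper name detect_incrementally and its detector_factory parameter are hypothetical, chosen only for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def detect_incrementally(
    input_file: Path, detector_factory: Callable[[], Any]
) -> Dict[str, Any]:
    """Feed a file line by line to a chardet-style detector, then finalize it.

    detector_factory is any zero-argument callable that returns an object
    with feed(), done, close() and result (an assumption about the API).
    """
    # Use a fresh detector per file so no state leaks between calls.
    detector = detector_factory()
    with input_file.open(mode="rb") as ifp:
        for line in ifp:
            detector.feed(line)
            if detector.done:
                break
    # Assumption: the result is only finalized once close() has been called,
    # as in chardet's documented usage.
    detector.close()
    return detector.result
```

With cChardet installed, this could be tried as detect_incrementally(Path('file.smi'), cchardet.UniversalDetector) and compared against the output of cchardet.detect() on the same file.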