
cchardet.UniversalDetector with feed returns None for everything #35

@slycordinator


cchardet.UniversalDetector(), when fed data through feed(), returns {'encoding': None, 'confidence': None}, even though cchardet.detect() produces the correct result for the same file.

```python
import cchardet
from pathlib import Path
UniversalDetect = cchardet.UniversalDetector()
smi = Path('file.smi')

def cchardet_detect(input_file: Path):
    with input_file.open(mode="rb") as ifp:
        data = ifp.read()
    return cchardet.detect(data)

def cchardet_universal(input_file: Path):
    with input_file.open(mode="rb") as ifp:
        for line in ifp:
            UniversalDetect.feed(line)
            if UniversalDetect.done:
                break
    return UniversalDetect.result

cchardet_detect(smi)
cchardet_universal(smi)
```

The file in question is encoded in UHC (aka CP949).

Output:

```python
>>> cchardet_detect(smi)
{'encoding': 'UHC', 'confidence': 0.9900000095367432}
>>> cchardet_universal(smi)
{'encoding': None, 'confidence': None}
```

Also of note:

  1. This seems to be related to PyYoshi/cChardet#69 ("cChardet returns {'encoding': None, 'confidence': None} on very large file").
    That issue attributes the problem to the input file being large. My guess is that this diagnosis is incorrect, since I get the same result here with a file that is only ~100 KiB.
  2. chardet.universaldetector.UniversalDetector() does not suffer from this.
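One difference worth ruling out: chardet's documented usage calls close() after the feed loop and only then reads result, whereas cchardet_universal() above reads result without calling close(). The following is a minimal sketch of a variant that does call it. It assumes, without having verified it here, that cchardet's detector follows the same feed()/done/close()/result protocol as chardet's. The names detect_incrementally and make_detector are hypothetical and exist only for this illustration.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def detect_incrementally(input_file: Path,
                         make_detector: Callable[[], Any]) -> Dict[str, Any]:
    """Feed a file line by line to a fresh detector and return its result.

    make_detector is a hypothetical parameter: any zero-argument callable that
    returns an object exposing feed(), done, close() and result, for example
    cchardet.UniversalDetector or chardet.universaldetector.UniversalDetector.
    """
    # Create a new detector per call instead of sharing a module-level one.
    detector = make_detector()
    with input_file.open(mode="rb") as ifp:
        for line in ifp:
            detector.feed(line)
            if detector.done:
                break
    # Assumption: as in chardet's documented usage, close() finalizes the
    # detection, so result is read only after it has been called.
    detector.close()
    return detector.result
```

It could be called as detect_incrementally(Path('file.smi'), cchardet.UniversalDetector). If this variant reports UHC while the original loop does not, the missing close() call would be the likely cause rather than the file size.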
