@@ -25,56 +25,47 @@ ELD is also available in [Javascript](https://github.com/nitotm/efficient-langua
2525``` bash
2626$ pip install eld
2727```
28- Alternatively, download / clone the files will work just fine .
28+ Alternatively, downloading or cloning the files also works, provided you update the import path.
2929
3030## How to use?
3131
3232``` python
33- # from src.eld.languageDetector import LanguageDetector # To load ELD without install. Update path.
3433from eld import LanguageDetector
3534detector = LanguageDetector()
36-
37- print (detector.detect(' Hola, cómo te llamas?' ))
38- ```
39- ` detect() ` expects a UTF-8 string, and returns a list, with a value named 'language', which will be either an * ISO 639-1 code* or ` False `
4035```
41- {'language': 'es'}
42- {'language': False, 'error': 'Some error', 'scores': {}}
43- ```
44-
45- - To get the best guess, turn off minimum length & confidence threshold; also used for benchmarking.
36+ `detect()` expects a UTF-8 string and returns an object with a `language` attribute, which is either an *ISO 639-1 code* or `None`
4637``` python
47- print (detector.detect(' To' , False , False , 0 , 1 ))
48- # To improve readability Named Parameters can be used
49- detector.detect(text = ' To' , clean_text = False , check_confidence = False , min_byte_length = 0 , min_ngrams = 1 )
50- # clean_text=True, Removes Urls, domains, emails, alphanumerical & numbers
51- ```
38+ print (detector.detect(' Hola, cómo te llamas?' ))
39+ # Object { language: "es", scores(): {"es": 0.53, "et": 0.21, ...}, is_reliable(): True }
40+ # Object { language: None|str, scores(): None|dict, is_reliable(): bool }
5241
53- - To retrieve the scores of all languages detected, we will set ` return_scores ` to ` True ` , just once
54- ``` python
55- detector.return_scores = True
56- print (detector.detect(' How are you? Bien, gracias' ))
57- # {'language': 'en', 'scores': {'en': 0.32, 'es': 0.31, ...}}
58- ```
42+ print (detector.detect(' Hola, cómo te llamas?' ).language)
43+ # "es"
5944
45+ # If clean_text(True), detect() removes URLs, domains, emails, alphanumerical strings & numbers
46+ detector.clean_text(True ) # Default is False
47+ ```
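
As a minimal sketch of how the result object described above might be consumed, the helper below falls back to a default code when `language` is `None` or `is_reliable()` is false. The function name `detect_language` and the `"und"` fallback are assumptions made for illustration; only `detect()`, `.language` and `.is_reliable()` come from the README text.

```python
def detect_language(detector, text, fallback="und"):
    """Return the detected ISO 639-1 code, or `fallback` when detection
    fails or is not considered reliable.

    `detector` is any object exposing detect(text) -> result, where result
    has a `language` attribute and an `is_reliable()` method, as described
    in the README. The "und" default is an illustrative choice.
    """
    result = detector.detect(text)
    if result.language is None or not result.is_reliable():
        return fallback
    return result.language


# Hypothetical usage with the real package (requires `pip install eld`):
# from eld import LanguageDetector
# print(detect_language(LanguageDetector(), "Hola, cómo te llamas?"))
```

Keeping the reliability check in one place means callers only ever deal with a plain string.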
6048- To reduce the languages to be detected, there are 3 different options; each only needs to be executed once. (Check available [languages](#languages) below)
6149``` python
6250lang_subset = [' en' , ' es' , ' fr' , ' it' , ' nl' , ' de' ]
6351
64- # with dynamic_lang_subset() the detector executes normally, and then filters excluded languages
52+ # Option 1
53+ # with dynamic_lang_subset(), detect() executes normally, and then filters excluded languages
6554detector.dynamic_lang_subset(lang_subset)
55+ # Returns an object with a list named 'languages', with the validated languages or 'None'
6656
67- # lang_subset() Will first remove the excluded languages, from the n-grams database
57+ # Option 2. lang_subset() will first remove the excluded languages from the n-grams database
6858# For a single detection it is slower than dynamic_lang_subset(), but for several it will be faster
6959# If save option is true (default), the new Ngrams subset will be stored, and loaded next call
7060detector.lang_subset(lang_subset) # lang_subset(langs, save=True)
61+ # Returns object {success: True, languages: ['de', 'en', ...], error: None, file: 'ngramsM60...'}
7162
72- # To remove either dynamic_lang_subset() or lang_subset(), call the methods with False as argument
73- detector.lang_subset(False )
63+ # To remove either dynamic_lang_subset() or lang_subset(), call the methods with None as argument
64+ detector.lang_subset(None )
7465
75- # Finally the fastest way to regularly use a languages subset: we create the instance with a file
76- # The file in the argument can be a subset by lang_subset() or another database like ngrams_L.php
77- langSubsetDetect = LanguageDetector(' ngrams_2f37045c74780aba1d36d6717f3244dc025fb935 ' )
66+ # Finally the optimal way to regularly use a languages subset: we create the instance with a file
67+ # The file in the argument can be a subset by lang_subset() or another database like 'ngramsL60'
68+ langSubsetDetect = LanguageDetector(' ngramsL60 ' )
7869```
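
To make the difference between the options more concrete, here is a small sketch of the idea behind dynamic filtering: detection scores are computed for all languages first, and excluded languages are dropped afterwards. This is not ELD's internal code. The helper `best_in_subset` is hypothetical, and the `pt` score is made up to extend the example scores shown earlier.

```python
def best_in_subset(scores, lang_subset):
    """Keep only the scores of languages in `lang_subset` and return the
    highest-scoring language code, or None if nothing remains.

    Illustrative only: mimics filtering after detection, as the README
    describes for dynamic_lang_subset().
    """
    allowed = {lang: score for lang, score in scores.items() if lang in lang_subset}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


scores = {"es": 0.53, "et": 0.21, "pt": 0.18}  # "pt" value is invented
print(best_in_subset(scores, ["en", "pt", "et"]))  # et
```

By contrast, `lang_subset()` is described as removing the excluded languages from the n-grams database up front, so the filtering cost is paid once rather than on every call.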
7970
8071## Benchmarks
@@ -106,7 +97,7 @@ These are the results, first, accuracy and then execution time.
10697| **CLD3** | 92.2% | 95.8% | 94.7% | 69.0% | 51.5% |
10798| **franc** | 89.8% | 92.0% | 90.5% | 65.9% | 52.9% |
10899-->
109- <img alt =" accuracy table " width =" 800 " src =" https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/benchmarks /table_accuracy_py.svg " >
100+ <img alt="accuracy table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/table_accuracy_py.svg">
110101
111102<!-- - Time table
112103| | Tweets | Big test | Sentences | Word pairs | Single words |
@@ -120,7 +111,7 @@ These are the results, first, accuracy and then execution time.
120111| **franc** | 1.2" | 8" | 7.8" | 2.8" | 2" |
121112| **Nito-ELD-php** | 0.31" | 2.5" | 2.2" | 0.66" | 0.48" |
122113-->
123- <img alt =" time table " width =" 800 " src =" https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/benchmarks /table_time_py.svg " >
114+ <img alt="time table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/table_time_py.svg">
124115
125116<sup style =" color :#08e " >1.</sup > <sup style =" color :#777 " >Lingua could have a small advantage as it participates with 54 languages, 6 fewer.</sup >
126117<sup style =" color :#08e " >2.</sup > <sup style =" color :#777 " >CLD2 and CLD3 return a list of languages; the ones not included in this test were discarded, but usually they return one language. I believe they have a disadvantage.
@@ -135,7 +126,7 @@ I added *ELD-L* for comparison, which has a 2.3x bigger database, but only incre
135126
136127Here is the average, per benchmark, of Tweets, Big test & Sentences.
137128
138- ![ Sentences tests average] ( https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/benchmarks /sentences_avg_py.png )
129+ ![Sentences tests average](https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/sentences_avg_py.png)
139130<!-- - Sentences average
140131| | Time | Accuracy |
141132|:--------------------|:------------:|:------------:|
@@ -154,11 +145,9 @@ These are the *ISO 639-1 codes* of the 60 supported languages for *Nito-ELD* v1
154145
155146> 'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko', 'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'
156147
157-
158148Full name languages:
159149
160- > 'Amharic', 'Arabic', 'Azerbaijani (Latin)', 'Belarusian', 'Bulgarian', 'Bengali', 'Catalan', 'Czech', 'Danish', 'German', 'Greek', 'English', 'Spanish', 'Estonian', 'Basque', 'Persian', 'Finnish', 'French', 'Gujarati', 'Hebrew', 'Hindi', 'Croatian', 'Hungarian', 'Armenian', 'Icelandic', 'Italian', 'Japanese', 'Georgian', 'Kannada', 'Korean', 'Kurdish (Arabic)', 'Lao', 'Lithuanian', 'Latvian', 'Malayalam', 'Marathi', 'Malay (Latin)', 'Dutch', 'Norwegian', 'Oriya', 'Punjabi', 'Polish', 'Portuguese', 'Romanian', 'Russian', 'Slovak', 'Slovene', 'Albanian', 'Serbian (Cyrillic)', 'Swedish', 'Tamil', 'Telugu', 'Thai', 'Tagalog', 'Turkish', 'Ukrainian', 'Urdu', 'Vietnamese', 'Yoruba', 'Chinese'
161-
150+ > Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese
162151
163152## Future improvements
164153