
Commit 84f30f8

Author: Nito (committed)
Update of the latest usability changes
1 parent 78b33b9 commit 84f30f8

File tree

2 files changed: +46 additions, -62 deletions

README.md

Lines changed: 24 additions & 35 deletions
@@ -25,56 +25,47 @@ ELD is also available in [Javascript](https://github.com/nitotm/efficient-langua
  ```bash
  $ pip install eld
  ```
- Alternatively, download / clone the files will work just fine.
+ Alternatively, downloading / cloning the files can work too, by changing the import path.

  ## How to use?

  ```python
- # from src.eld.languageDetector import LanguageDetector  # To load ELD without install. Update path.
  from eld import LanguageDetector
  detector = LanguageDetector()
-
- print(detector.detect('Hola, cómo te llamas?'))
- ```
- `detect()` expects a UTF-8 string, and returns a list, with a value named 'language', which will be either an *ISO 639-1 code* or `False`
  ```
- {'language': 'es'}
- {'language': False, 'error': 'Some error', 'scores': {}}
-
- - To get the best guess, turn off minimum length & confidence threshold; also used for benchmarking.
+ `detect()` expects a UTF-8 string, and returns an object with a 'language' variable, which is either an *ISO 639-1 code* or `None`
  ```python
- print(detector.detect('To', False, False, 0, 1))
- # To improve readability, named parameters can be used
- detector.detect(text='To', clean_text=False, check_confidence=False, min_byte_length=0, min_ngrams=1)
- # clean_text=True removes URLs, domains, emails, alphanumericals & numbers
- ```
+ print(detector.detect('Hola, cómo te llamas?'))
+ # Object { language: "es", scores(): {"es": 0.53, "et": 0.21, ...}, is_reliable(): True }
+ # Object { language: None|str, scores(): None|dict, is_reliable(): bool }

- - To retrieve the scores of all languages detected, we will set `return_scores` to `True`, just once
- ```python
- detector.return_scores = True
- print(detector.detect('How are you? Bien, gracias'))
- # {'language': 'en', 'scores': {'en': 0.32, 'es': 0.31, ...}}
- ```
+ print(detector.detect('Hola, cómo te llamas?').language)
+ # "es"

+ # If clean_text(True), detect() removes URLs, domains, emails, alphanumericals & numbers
+ detector.clean_text(True)  # Default is False
+ ```
  - To reduce the languages to be detected, there are 3 different options; they only need to be executed once. (Check the available [languages](#languages) below)
  ```python
  lang_subset = ['en', 'es', 'fr', 'it', 'nl', 'de']

+ # Option 1
- # with dynamic_lang_subset() the detector executes normally, and then filters excluded languages
+ # With dynamic_lang_subset(), detect() executes normally, and then filters out the excluded languages
  detector.dynamic_lang_subset(lang_subset)
+ # Returns an object with a list named 'languages', with the validated languages, or 'None'

- # lang_subset() Will first remove the excluded languages from the n-grams database
+ # Option 2. lang_subset() will first remove the excluded languages from the n-grams database
  # For a single detection it is slower than dynamic_lang_subset(), but for several it will be faster
  # If the save option is true (default), the new n-grams subset will be stored and loaded on the next call
  detector.lang_subset(lang_subset)  # lang_subset(langs, save=True)
+ # Returns object {success: True, languages: ['de', 'en', ...], error: None, file: 'ngramsM60...'}

- # To remove either dynamic_lang_subset() or lang_subset(), call the methods with False as the argument
- detector.lang_subset(False)
+ # To remove either dynamic_lang_subset() or lang_subset(), call the methods with None as the argument
+ detector.lang_subset(None)

- # Finally, the fastest way to regularly use a language subset: we create the instance with a file
- # The file in the argument can be a subset made by lang_subset() or another database like ngrams_L.php
- langSubsetDetect = LanguageDetector('ngrams_2f37045c74780aba1d36d6717f3244dc025fb935')
+ # Finally, the optimal way to regularly use a language subset: we create the instance with a file
+ # The file in the argument can be a subset made by lang_subset() or another database like 'ngramsL60'
+ langSubsetDetect = LanguageDetector('ngramsL60')
  ```
7970

8071
## Benchmarks
@@ -106,7 +97,7 @@ These are the results, first, accuracy and then execution time.
  | **CLD3** | 92.2% | 95.8% | 94.7% | 69.0% | 51.5% |
  | **franc** | 89.8% | 92.0% | 90.5% | 65.9% | 52.9% |
  -->
- <img alt="accuracy table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/benchmarks/table_accuracy_py.svg">
+ <img alt="accuracy table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/table_accuracy_py.svg">

  <!--- Time table
  | | Tweets | Big test | Sentences | Word pairs | Single words |
@@ -120,7 +111,7 @@ These are the results, first, accuracy and then execution time.
  | **franc** | 1.2" | 8" | 7.8" | 2.8" | 2" |
  | **Nito-ELD-php** | 0.31" | 2.5" | 2.2" | 0.66" | 0.48" |
  -->
- <img alt="time table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/benchmarks/table_time_py.svg">
+ <img alt="time table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/table_time_py.svg">

  <sup style="color:#08e">1.</sup> <sup style="color:#777">Lingua could have a small advantage, as it participates with 54 languages, 6 fewer.</sup>
  <sup style="color:#08e">2.</sup> <sup style="color:#777">CLD2 and CLD3 return a list of languages; the ones not included in this test were discarded. But they usually return one language, so I believe they have a disadvantage.
@@ -135,7 +126,7 @@ I added *ELD-L* for comparison, which has a 2.3x bigger database, but only incre

  Here is the average, per benchmark, of Tweets, Big test & Sentences.

- ![Sentences tests average](https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/benchmarks/sentences_avg_py.png)
+ ![Sentences tests average](https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/sentences_avg_py.png)
  <!--- Sentences average
  | | Time | Accuracy |
  |:--------------------|:------------:|:------------:|
@@ -154,11 +145,9 @@ These are the *ISO 639-1 codes* of the 60 supported languages for *Nito-ELD* v1

  > 'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko', 'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'
-
  Full name languages:

- > 'Amharic', 'Arabic', 'Azerbaijani (Latin)', 'Belarusian', 'Bulgarian', 'Bengali', 'Catalan', 'Czech', 'Danish', 'German', 'Greek', 'English', 'Spanish', 'Estonian', 'Basque', 'Persian', 'Finnish', 'French', 'Gujarati', 'Hebrew', 'Hindi', 'Croatian', 'Hungarian', 'Armenian', 'Icelandic', 'Italian', 'Japanese', 'Georgian', 'Kannada', 'Korean', 'Kurdish (Arabic)', 'Lao', 'Lithuanian', 'Latvian', 'Malayalam', 'Marathi', 'Malay (Latin)', 'Dutch', 'Norwegian', 'Oriya', 'Punjabi', 'Polish', 'Portuguese', 'Romanian', 'Russian', 'Slovak', 'Slovene', 'Albanian', 'Serbian (Cyrillic)', 'Swedish', 'Tamil', 'Telugu', 'Thai', 'Tagalog', 'Turkish', 'Ukrainian', 'Urdu', 'Vietnamese', 'Yoruba', 'Chinese'
-
+ > Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese

  ## Future improvements
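The headline change in this commit is the return type of `detect()`: the old dict (`{'language': 'es'}`) becomes a result object with a `language` attribute plus `scores()` and `is_reliable()` methods. A minimal self-contained sketch of that documented shape — a hypothetical stand-in class, not ELD's actual implementation:

```python
# MockDetection is a hypothetical stand-in mimicking the return shape the new
# README documents: .language, scores(), is_reliable(). It is NOT ELD's class.

class MockDetection:
    def __init__(self, language, raw_scores):
        self.language = language     # ISO 639-1 code, or None on failure
        self._scores = raw_scores    # per-language scores from detection

    def scores(self):
        # Documented shape: a dict of scores, or None when detection failed
        return self._scores or None

    def is_reliable(self):
        # Hypothetical rule for the sketch: reliable iff a language was found
        return self.language is not None

result = MockDetection('es', {'es': 0.53, 'et': 0.21})
print(result.language)       # "es"
print(result.is_reliable())  # True
```

The practical effect for callers is shown in the diff itself: `detector.detect(text).language` replaces indexing into a returned dict.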

demo.py

Lines changed: 22 additions & 27 deletions
@@ -14,28 +14,19 @@
  See the License for the specific language governing permissions and
  limitations under the License.
  """
-
- from src.eld.languageDetector import LanguageDetector
- # from eld import LanguageDetector
+ from eld import LanguageDetector

  detector = LanguageDetector()

- # detect() expects a UTF-8 string, and returns a dictionary, with key 'language', value: ISO 639-1 code or False
+ # detect() expects a UTF-8 string, and returns an object with a 'language' variable: ISO 639-1 code or None
  print(detector.detect('Hola, cómo te llamas?'))
- # {'language': 'es'}
- # {'language': False, 'error': 'Some error', 'scores': {}}
-
- # To get the best guess, turn off minimum length and confidence threshold; also used for benchmarking.
- print(detector.detect('To', False, False, 0, 1))
+ # Object { language: "es", scores(): {"es": 0.53, "et": 0.21, ...}, is_reliable(): True }
+ # Object { language: None|str, scores(): None|dict, is_reliable(): bool }
+ print(detector.detect('Hola, cómo te llamas?').language)
+ # "es"

- # To improve readability, named parameters can be used
- detector.detect(text='To', clean_text=False, check_confidence=False, min_byte_length=0, min_ngrams=1)
- # clean_text=True removes URLs, domains, emails, alphanumericals & numbers
-
- # To retrieve the scores of all languages detected, we will set return_scores to True, just once
- detector.return_scores = True
- print(detector.detect('How are you? Bien, gracias'))
- # {'language': 'en', 'scores': {'en': 0.32, 'es': 0.31, ...}}
+ # clean_text(True) removes URLs, domains, emails, alphanumericals & numbers
+ detector.clean_text(True)  # Default is False

  # To reduce the languages to be detected, there are 3 different options; they only need to be executed once.
  # This is the complete list of languages for ELD v1, using ISO 639-1 codes:
@@ -46,24 +37,28 @@
  """
  lang_subset = ['en', 'es', 'fr', 'it', 'nl', 'de']

- # dynamic_lang_subset() Will execute the detector normally, but at the end will filter the excluded languages.
+ # Option 1. With dynamic_lang_subset(), detect() executes normally, but at the end filters out the excluded languages.
  detector.dynamic_lang_subset(lang_subset)
+ # Returns an object with a list named 'languages', with the validated languages, or 'None'

  # to remove the subset
- detector.dynamic_lang_subset(False)
+ detector.dynamic_lang_subset(None)

- """ lang_subset(langs, save=True) Will previously remove the excluded languages from the Ngrams database; for a single
- detection might be slower than dynamic_lang_subset(), but for several strings will be faster. If save option is true
- (default), the new ngrams subset will be stored, and loaded for the same languages subset, increasing startup speed
+ """ Option 2. lang_subset(langs, save=True) will first remove the excluded languages from the n-grams database; for
+ a single detection it might be slower than dynamic_lang_subset(), but for several strings it will be faster. If the
+ 'save' option is true (default), the new n-grams subset will be stored and cached for next time.
  """
  detector.lang_subset(lang_subset)
+ # Returns object {success: True, languages: ['de', 'en', ...], error: None, file: 'ngramsM60...'}

  # to remove the subset
- detector.lang_subset(False)
+ detector.lang_subset(None)
+
+ print(detector.VERSION)

- """ Finally, the fastest option to regularly use the same language subset will be to add as an argument the file
- stored by lang_subset(), when creating an instance of the class. In this case the subset Ngrams database will
+ """ Finally, the optimal way to regularly use the same language subset is to add as an argument the file stored
+ (and returned) by lang_subset() when creating an instance of the class. In this case the subset n-grams database will
  be loaded directly, and not the default database. Also, you can use this option to load different n-gram databases
- stored at src/ngrams/
+ stored at eld/resources/ngrams
  """
- langSubsetDetect = LanguageDetector('ngrams_2f37045c74780aba1d36d6717f3244dc025fb935')
+ langSubsetDetect = LanguageDetector('ngramsM60-6_5ijqhj4oecs310zqtm8u9pgmd9ox2yd')
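The subset mechanisms demo.py walks through differ mainly in where the filtering happens. A hypothetical sketch (plain functions standing in for ELD's internals, not its implementation) contrasting Option 1's post-detection filter with Option 2's database pruning:

```python
# Hypothetical stand-ins illustrating the two subset strategies described
# above; ELD's real n-gram database and scoring are more involved.

def dynamic_subset_filter(scores, subset):
    # Option 1 (dynamic_lang_subset): detect over all languages as usual,
    # then drop the excluded ones from the resulting scores
    return {lang: s for lang, s in scores.items() if lang in subset}

def prune_ngram_database(database, subset):
    # Option 2 (lang_subset): remove excluded languages from the n-gram
    # database itself, so every subsequent detection scores fewer languages
    return {lang: ngrams for lang, ngrams in database.items() if lang in subset}

all_scores = {'en': 0.32, 'es': 0.31, 'pt': 0.20, 'fr': 0.10}
print(dynamic_subset_filter(all_scores, ['en', 'es']))
# {'en': 0.32, 'es': 0.31}
```

Option 3 — passing the stored n-grams file to the `LanguageDetector` constructor — amounts to doing Option 2's pruning once and reloading the pruned database directly, which is why the demo calls it the optimal choice for a fixed subset.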
