Pythainlp

Latest version: v5.0.3

3.0.0

After a long development period, we have released `PyThaiNLP 3.0`. `PyThaiNLP 3.0` has many improvements and new features to help with Thai language processing tasks.

You can install it with `pip install pythainlp` or upgrade with `pip install -U pythainlp`.

Documentation: https://pythainlp.github.io/docs/3.0/index.html

Report bug: https://github.com/PyThaiNLP/pythainlp/issues

See [PyThaiNLP 3.0 change log](https://github.com/PyThaiNLP/pythainlp/issues/545) #545

If you want to contribute to PyThaiNLP, you can read [Contributing to PyThaiNLP](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md).

News

> Starting with PyThaiNLP 3.0, we no longer support Python 3.6. Python 3.6 users can use PyThaiNLP 2.3.2.

> We have updated the Thai word dictionary and rules for newmm. We recommend retraining your model if you use newmm for word tokenization in your model.
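
The Python 3.6 note above can be turned into a small version check. This is only a sketch: the helper name `pythainlp_requirement` is hypothetical, the open-ended `>=3.0.0` specifier is an assumption (later releases may raise the minimum Python version again), and Python versions below 3.6 are not covered by these notes.

```python
import sys


def pythainlp_requirement(version_info=sys.version_info):
    """Return a pip requirement string for the given Python version.

    Hypothetical helper; the mapping follows the release note above.
    """
    major_minor = tuple(version_info[:2])
    if major_minor == (3, 6):
        # Python 3.6: the note points users to PyThaiNLP 2.3.2.
        return "pythainlp==2.3.2"
    if major_minor > (3, 6):
        # Newer Python: PyThaiNLP 3.0 or later (upper bound left open, an assumption).
        return "pythainlp>=3.0.0"
    raise ValueError("Python versions below 3.6 are not covered by these notes")


print(pythainlp_requirement())
```

The returned string can be passed to `pip install` or written into a requirements file.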

What is new?

Deprecation and other API changes

- `syllable_tokenize` is deprecated; use `subword_tokenize` instead.
- `pythainlp.tag.named_entity.ThaiNameTagger` has moved to `pythainlp.tag.thainer.ThaiNameTagger`. The old class will be deprecated in PyThaiNLP version 3.1.
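
Code that has to run across PyThaiNLP versions may want to try the new `ThaiNameTagger` location first and fall back to the old one. The sketch below uses only the two module paths named above; the helper `find_thai_name_tagger` is hypothetical and not part of PyThaiNLP. It only locates the class and does not instantiate it.

```python
import importlib

# Module paths from the note above: new location first, old one as fallback.
CANDIDATE_MODULES = ("pythainlp.tag.thainer", "pythainlp.tag.named_entity")


def find_thai_name_tagger(candidates=CANDIDATE_MODULES):
    """Return (module_path, class) for the first importable ThaiNameTagger.

    Returns (None, None) when no candidate module provides the class.
    """
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Module missing in this environment or version; try the next one.
            continue
        tagger_class = getattr(module, "ThaiNameTagger", None)
        if tagger_class is not None:
            return module_path, tagger_class
    return None, None


module_path, tagger_class = find_thai_name_tagger()
print(module_path or "ThaiNameTagger not found")
```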

Augment
- Add Thai Text Augmentation

Corpus
- Fix lots of misspellings in the dictionary (words_th.txt)
- Add get_corpus_default_db and thainer 1.5 model. You can now add a corpus to `default_db.json`, and the latest thainer model no longer needs to be loaded from the Internet.

Tag
- Add TLTK (pos_tag and ner): a TLTK wrapper for pythainlp functions such as ner, word_tokenize, and more.
- Add `NER` class for named-entity recognition tasks.

Translate
- Add `pythainlp.translate.Translate` Class
- Add Chinese-Thai Machine Translation
- Add Thai-French Machine Translation

Tokenization
- Tokenize repeating dots and commas from numbers (fix #461)
- Fix token_max_len bug that makes it always zero
- Retrained sentenceseg_crfcut.model for PyThaiNLP 2.4
- Add SEFR CUT to pythainlp
- Add TLTK (sentence_tokenize and word_tokenize): a TLTK wrapper for pythainlp functions such as ner, word_tokenize, and more.
- Add nlpo3

Transliterate
- Refactor Royin Transliterate: avoid nested if blocks and simplify consonant replacement operations
- Manually merge update-royin branch with dev branch to add O-ANG rule
- Add TLTK (g2p and ipa): a TLTK wrapper for pythainlp functions such as ner, word_tokenize, and more.
- Add pythainlp.transliterate.puan

Word Vector
- Fix token_max_len bug that makes it always zero
- Add `pythainlp.word_vector.WordVector`

Spell
- Add more spelling engines
- Add TLTK (spell): a TLTK wrapper for pythainlp functions such as ner, word_tokenize, and more.

Generate
- Add pythainlp.generate to generate a text.

Tool
- Add misspell module

Other
- Add TLTK: a TLTK wrapper for pythainlp functions such as ner, word_tokenize, and more.
- Update requirements from ssg 0.0.6 to ssg 0.0.8
- Spoonerism: add support for words with more than three syllables
- Add maiyamok: this function preprocesses MaiYaMok (ๆ) in a Thai sentence.
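
MaiYaMok (ๆ) is the Thai repetition mark: it signals that the preceding word is repeated. As an illustration of the kind of preprocessing involved, the sketch below expands each standalone ๆ token in an already tokenized sentence. It is not the PyThaiNLP implementation; `expand_mai_yamok` is a hypothetical helper, and the library function may differ in input types and edge cases.

```python
MAI_YAMOK = "\u0e46"  # ๆ, the Thai repetition mark


def expand_mai_yamok(tokens):
    """Replace each standalone ๆ token with a copy of the preceding token.

    A ๆ with nothing before it is left unchanged.
    """
    expanded = []
    for token in tokens:
        if token == MAI_YAMOK and expanded:
            expanded.append(expanded[-1])
        else:
            expanded.append(token)
    return expanded


print(expand_mai_yamok(["เด็ก", "ๆ", "ชอบ", "เล่น"]))
# ['เด็ก', 'เด็ก', 'ชอบ', 'เล่น']
```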

Contributors

<a href="https://github.com/PyThaiNLP/pythainlp/graphs/contributors">
<img src="https://contributors-img.firebaseapp.com/image?repo=PyThaiNLP/pythainlp" />
</a>

Thanks to all the [contributors](https://github.com/PyThaiNLP/pythainlp/graphs/contributors). (Image made with [contributors-img](https://contributors-img.firebaseapp.com))

If you want to contribute to PyThaiNLP, you can read [Contributing to PyThaiNLP](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md).

> This year is PyThaiNLP's 6th year, and PyThaiNLP has more than one million downloads. I started developing PyThaiNLP to help me with Thai language processing tasks. Now PyThaiNLP is used in research and other work worldwide. PyThaiNLP could not have grown without its contributors, sponsors, and users.
>
> Thank you for all your support.
>
> Thank you for using PyThaiNLP.

Wannaphong Phatthiyaphaibun

PyThaiNLP Founder

27 January 2022

3.0.0beta0

3.0.0dev0

Docs: https://pythainlp.github.io/dev-docs/index.html
Report bug: https://github.com/PyThaiNLP/pythainlp/issues
GitHub: https://github.com/PyThaiNLP/pythainlp

News
> Starting with PyThaiNLP 2.4, we no longer support Python 3.6. Python 3.6 users can use PyThaiNLP 2.3.1.
>
> We have updated the dictionary and rules for newmm. If you use newmm for word tokenization in your model, we recommend that you retrain your model.

What is new?

Deprecation and other API changes
- #550 `syllable_tokenize` is deprecated; use `subword_tokenize` instead.
- https://github.com/PyThaiNLP/pythainlp/commit/701fb3a7842b3abd0b2318ba9074f1902c2f32e9 `pythainlp.tag.named_entity.ThaiNameTagger` has moved to `pythainlp.tag.thainer.ThaiNameTagger`. The old class will be deprecated in PyThaiNLP version 2.5.

Augment
- #580 Add Thai Text Augmentation

Corpus
- #557 Fix lots of misspellings in the dictionary (words_th.txt)
- #576 Add get_corpus_default_db and thainer 1.5 model. You can now add a corpus to `default_db.json`, and the latest thainer model no longer needs to be loaded from the Internet.

Tag
- #599 Add tltk (pos_tag and ner): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.
- #600 Add `NER` class for named-entity recognition tasks.

Translate
- #589 Add `pythainlp.translate.Translate` Class
- #588 Add Chinese-Thai Machine Translation

Tokenization
- #562 Tokenize repeating dots and commas from numbers (fix #461)
- #585 Fix token_max_len bug that makes it always zero
- #594 Retrained sentenceseg_crfcut.model for PyThaiNLP 2.4
- https://github.com/PyThaiNLP/pythainlp/commit/314411086707b60ba8790724301224916f4670b8 Add SEFR CUT to pythainlp
- #599 Add tltk (sentence_tokenize and word_tokenize): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.
- #622 Add nlpo3

Transliterate
- #566 Refactor Royin Transliterate: avoid nested if blocks and simplify consonant replacement operations
- #585 Manually merge update-royin branch with dev branch to add O-ANG rule
- #599 Add tltk (g2p and ipa): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.
- #624 Add pythainlp.transliterate.puan

Word Vector
- #573 Fix token_max_len bug that makes it always zero
- #583 Add `pythainlp.word_vector.WordVector`

Spell
- #591 Add more spelling engines
- #599 Add tltk (spell): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.

Generate
- #579 Add pythainlp.generate

Tool
- #614 Add misspell module

Other
- #599 Add tltk: a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.
- https://github.com/PyThaiNLP/pythainlp/commit/e357cf8f9b626e3a633dc33b8557fe45dc837aba Update requirements from ssg 0.0.6 to ssg 0.0.8
- #631 Spoonerism: add support for words with more than 3 syllables
- #623 Add maiyamok: this function preprocesses MaiYaMok in a Thai sentence.

2.4.0dev0

Documentation: [https://pythainlp.github.io/dev-docs/index.html](https://pythainlp.github.io/dev-docs/index.html)
Report bug: [https://github.com/PyThaiNLP/pythainlp/issues](https://github.com/PyThaiNLP/pythainlp/issues)

See [PyThaiNLP 2.4 change log](https://github.com/PyThaiNLP/pythainlp/issues/545) #545

News
> Starting with PyThaiNLP 2.4, we no longer support Python 3.6. Python 3.6 users can use PyThaiNLP 2.3.1.
>
> We have updated the dictionary and rules for newmm. If you use newmm for word tokenization in your model, we recommend that you retrain your model.

Deprecation and other API changes
- #550 `syllable_tokenize` is deprecated; use `subword_tokenize` instead.
- https://github.com/PyThaiNLP/pythainlp/commit/701fb3a7842b3abd0b2318ba9074f1902c2f32e9 `pythainlp.tag.named_entity.ThaiNameTagger` has moved to `pythainlp.tag.thainer.ThaiNameTagger`. The old class will be deprecated in PyThaiNLP version 2.5.

Augment
- #580 Add Thai Text Augmentation

Corpus
- #557 Fix lots of misspellings in the dictionary (words_th.txt)
- #576 Add get_corpus_default_db and thainer 1.5 model. You can now add a corpus to `default_db.json`, and the latest thainer model no longer needs to be loaded from the Internet.

Tag
- #599 Add tltk (pos_tag and ner): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.
- #600 Add `NER` class for named-entity recognition tasks.

Translate
- #589 Add `pythainlp.translate.Translate` Class
- #588 Add Chinese-Thai Machine Translation

Tokenization
- #562 Tokenize repeating dots and commas from numbers (fix #461)
- #585 Fix token_max_len bug that makes it always zero
- #594 Retrained sentenceseg_crfcut.model for PyThaiNLP 2.4
- https://github.com/PyThaiNLP/pythainlp/commit/314411086707b60ba8790724301224916f4670b8 Add SEFR CUT to pythainlp
- #599 Add tltk (sentence_tokenize and word_tokenize): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.

Transliterate
- #566 Refactor Royin Transliterate: avoid nested if blocks and simplify consonant replacement operations
- #585 Manually merge update-royin branch with dev branch to add O-ANG rule
- #599 Add tltk (g2p and ipa): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.

Word Vector
- #573 Fix token_max_len bug that makes it always zero
- #583 Add `pythainlp.word_vector.WordVector`

Spell
- #591 Add more spelling engines
- #599 Add tltk (spell): a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.

Generate
- #579 Add pythainlp.generate

Other
- #599 Add tltk: a tltk wrapper for pythainlp functions such as ner, word_tokenize, and more.

2.3.2

`PyThaiNLP v2.3.2` is a bug fix release of PyThaiNLP 2.3.

**Bug Fixed**
- Fixed `clause_tokenize` returning an empty list. #609

Documentation: [https://pythainlp.github.io/docs/2.3/index.html](https://pythainlp.github.io/docs/2.3/index.html)
Report bug: [https://github.com/PyThaiNLP/pythainlp/issues](https://github.com/PyThaiNLP/pythainlp/issues)

You can install or upgrade using `pip install -U pythainlp`.

See [PyThaiNLP 2.3 change log](https://github.com/PyThaiNLP/pythainlp/issues/445) #445

2.3.1

`PyThaiNLP v2.3.1` is a bug fix release of PyThaiNLP 2.3.

**Bug Fixed**
- Fix gensim #546

Documentation: [https://pythainlp.github.io/docs/2.3/index.html](https://pythainlp.github.io/docs/2.3/index.html)
Report bug: [https://github.com/PyThaiNLP/pythainlp/issues](https://github.com/PyThaiNLP/pythainlp/issues)

You can install or upgrade using `pip install -U pythainlp`.

See [PyThaiNLP 2.3 change log](https://github.com/PyThaiNLP/pythainlp/issues/445) #445

Deprecation and other API changes
- NER now uses the ThaiNER 1.5 model instead of ThaiNER 1.4. If you need the ThaiNER 1.4 model, you can select it with the `version` argument of the ThaiNameTagger class: `pythainlp.tag.named_entity.ThaiNameTagger(version: str = '1.4')` (Docs: https://pythainlp.github.io/dev-docs/api/tag.html#pythainlp.tag.named_entity.ThaiNameTagger)

Tokenizer
- #484 Add: model option for `attacut.tokenize()`
- #502 Add: `corpus.util.revise_wordset()` to revise tokenization dictionary
- #503 Add: `NERCut` tokenization engine

Corpus
- **License change:**
  - All corpora, datasets, and documentation created by PyThaiNLP project are now released under [Creative Commons Zero 1.0 Universal Public Domain Dedication License](https://creativecommons.org/publicdomain/zero/1.0/) (CC0).
  - All language models created by PyThaiNLP project are released under [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/) (CC-by).
- #449 Fix: remove instances with `[` or `]` from etcc.txt
- #467 Add: `corpus.common.provinces()` can now return romanized names
- #476 Add: `thai_family_names()` to get a set of Thai family names
- #487 Fix: `thailand_provinces_th.csv` not found issue
- #492 Fix: remove erroneous `AITT` tag from ORCHID to UD table -- thanks c4n for the fix

POS Tagger
- #464 Add: `LST20` language model for part-of-speech tagging
- #468 Add: port `PerceptronTagger` from NLTK. POS tagging no longer depends on NLTK.
- #478 Update: ORCHID POS tags documentation

Named Entity Tagging
- #526 Update ThaiNER 1.4 to ThaiNER 1.5
- #538 Add ThaiNameTagger version and add ThaiNER 1.4 support

Transliterate
- #485 Fix: Romanize failed in some examples
- #511 Add Thai W2P (Thai Word-to-Phoneme converter)

Text Summarize
- #523 Add mT5 text summarization to `pythainlp.summarize`

Chunk parser
- #524 Add `pythainlp.tag.chunk`

Util
- #481 Fix: `remove_repeat_vowels()` bug that removes spaces between different vowels
- #483 Add: `remove()` method to remove a word from a trie -- thanks korakot
- #490 Fix: `thai_strftime()` - normalize output for unsupported directive (running in glibc and musl should produce the same output)
- #512 Add: `emoji_to_thai()` to convert emoji to Thai description -- thanks ppirch for the development
- #513 Add: `thai_keyboard_dist()` to calculate Euclidean distance between two characters according to their location on a Thai keyboard layout -- thanks ppirch for the development
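
The idea behind a keyboard distance can be sketched by mapping each character to a (row, column) position and taking the Euclidean distance between two positions. This is a minimal illustration, not the `thai_keyboard_dist()` implementation: `LAYOUT_ROWS` is an assumed, partial excerpt of an unshifted Thai Kedmanee layout, `keyboard_distance` is a hypothetical name, and physical key stagger and the shift layer are ignored.

```python
import math

# Assumed, partial excerpt of three rows of an unshifted Thai Kedmanee layout.
LAYOUT_ROWS = ["ๆไำพะ", "ฟหกดเ", "ผปแอ"]

# Map each character to its (row, column) position in the excerpt.
KEY_POSITIONS = {
    char: (row_index, col_index)
    for row_index, row in enumerate(LAYOUT_ROWS)
    for col_index, char in enumerate(row)
}


def keyboard_distance(first, second):
    """Euclidean distance between two keys, measured in key widths.

    Raises KeyError for characters outside the excerpt.
    """
    row_a, col_a = KEY_POSITIONS[first]
    row_b, col_b = KEY_POSITIONS[second]
    return math.hypot(row_a - row_b, col_a - col_b)


print(keyboard_distance("ฟ", "ก"))  # same row, two keys apart: 2.0
```

A distance like this can help rank spelling candidates, since characters typed by mistake tend to be close to the intended key.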

Thanks to all the [contributors](https://github.com/PyThaiNLP/pythainlp/graphs/contributors). (Image made with [contributors-img](https://contributors-img.firebaseapp.com))
<a href="https://github.com/PyThaiNLP/pythainlp/graphs/contributors">
<img src="https://contributors-img.firebaseapp.com/image?repo=PyThaiNLP/pythainlp" />
</a>

We build Thai NLP.

PyThaiNLP
