Fast-mosestokenizer

Latest version: v0.0.8.1

Page 1 of 2

0.0.8.1

Changes
- Exposed the `other_letters` option in the Python API.

0.0.8

Changes
- Segmentation by `\p{So}` (the Unicode "Other Symbol" category) is not enabled automatically.
- Drastically improved the performance of `\p{So}` segmentation.
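
To illustrate what segmenting by `\p{So}` means, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `split_other_symbols` is hypothetical, and the sketch assumes that "segmentation" means turning every character in the Unicode `So` category into its own token.

```python
import unicodedata
from typing import List


def split_other_symbols(text: str) -> List[str]:
    """Split text so that each `So` (Other Symbol) character is its own token.

    Hypothetical helper for illustration only; not part of fast-mosestokenizer.
    """
    tokens: List[str] = []
    buffer: List[str] = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Flush whatever was collected before the symbol.
            if buffer:
                tokens.append("".join(buffer))
                buffer = []
            tokens.append(ch)
        else:
            buffer.append(ch)
    if buffer:
        tokens.append("".join(buffer))
    return tokens


# U+00A9 (copyright sign) is in category So.
print(split_other_symbols("a\u00a9b"))
```

A per-character category lookup like this is simple but slow for long inputs, which may be one reason an optimized implementation matters.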

0.0.7.2

Hotfix
Fixed regex.

0.0.7.1

Hotfix
Hotfix for `other_letters`, since text in this category might contain nonspacing marks.
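
One reading of this hotfix is that a run of "other letters" (Unicode category `Lo`) can be interrupted by nonspacing marks (category `Mn`), as happens in Thai. The sketch below shows the effect using only the standard library. The function `letter_runs` and its `keep_marks` parameter are hypothetical and do not reflect the library's actual code.

```python
import unicodedata
from typing import List


def letter_runs(text: str, keep_marks: bool) -> List[str]:
    """Collect runs of `Lo` characters.

    When keep_marks is True, an `Mn` (nonspacing mark) character that follows
    a letter extends the current run instead of ending it.
    Hypothetical helper for illustration only.
    """
    runs: List[str] = []
    current = ""
    for ch in text:
        cat = unicodedata.category(ch)
        if cat == "Lo" or (keep_marks and cat == "Mn" and current):
            current += ch
        else:
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


# Thai word: U+0E01 (Lo), U+0E34 (Mn), U+0E19 (Lo).
word = "\u0e01\u0e34\u0e19"
print(letter_runs(word, keep_marks=False))  # the mark splits the word
print(letter_runs(word, keep_marks=True))   # the word stays whole
```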

0.0.6

Features
Improved tokenization rules for logographic languages.

0.0.5

Features
- The C++ library and command-line tools can now be installed using `make install`.
- `make build-cli` has been changed to `make build`

Bug fixes
- Fixed the case where `in_num_p` is not switched off.
  - Before: `"文字123汉语" -> ["文字", "123", "汉", "语"]`
  - After: `"文字123汉语" -> ["文字", "123", "汉语"]`
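
The kind of bug described above can be reproduced with a toy tokenizer. This is only a sketch under stated assumptions: the real logic lives in the C++ library and is not shown here, the `in_num` variable merely mirrors the name `in_num_p`, and the `reset_flag` parameter is a hypothetical switch added to show both behaviors side by side.

```python
from typing import List


def toy_tokenize(text: str, reset_flag: bool = True) -> List[str]:
    """Split digit runs from non-digit runs using an "inside a number" flag.

    Toy model only. With reset_flag=False the flag is never switched off
    after a number ends, so every later character is flushed on its own.
    """
    tokens: List[str] = []
    current = ""
    in_num = False
    for ch in text:
        if ch.isdigit():
            # Entering a number: flush the preceding non-digit run.
            if current and not in_num:
                tokens.append(current)
                current = ""
            in_num = True
        elif in_num:
            # The flag says a number just ended: flush what we have.
            if current:
                tokens.append(current)
                current = ""
            if reset_flag:
                in_num = False
        current += ch
    if current:
        tokens.append(current)
    return tokens


print(toy_tokenize("文字123汉语", reset_flag=False))  # stale flag
print(toy_tokenize("文字123汉语", reset_flag=True))   # flag switched off
```

With the stale flag, the toy version splits `汉` and `语` apart, matching the "Before" output; resetting the flag yields the "After" output.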

Todo
- Determine how characters belonging to the ["other letters"](https://www.compart.com/en/unicode/category) category should be handled by the tokenizer.
- Reduce the number of flags.
  - Remove those outside the scope of this package, e.g. lowercase.
  - Remove those that add unnecessary bloat to the logic, e.g. URL handling.
