SoMaJo

Latest version: v2.4.2

2.0.5

- Add heuristics for ambiguous quotation marks (issue 11).
- Avoid false positives for emoticons that contain a space (issue 12).
- Correctly tokenize obfuscated email addresses that contain spaces.
- Do not split tl;dr and its German variant zl;ng.

2.0.4

- Bugfix: Prevent race conditions between tokenizer and sentence
splitter in parallel processing (--parallel > 1).

2.0.3

- Skip tests for unimplemented features (some builds will fail if any
of the unit tests fail).

2.0.2

- Bugfix: Parallel tokenization (--parallel > 1) works again.
- Support for musical notes (sharps).

2.0.1

- Bugfix.

2.0.0

New features and improvements

- New API: Use new class SoMaJo instead of Tokenizer and
SentenceSplitter. Currently, the old API is still supported but will
issue deprecation warnings.
- Speed-up: Due to a new internal representation of the input text
during processing (as a doubly linked list of Token objects),
tokenization is now two to three times faster.
- Incremental and parallel processing of XML: If a sensible set of
eos_tags is specified, the XML input is processed incrementally
(allowing for arbitrarily large XML input), and processing can also
be parallelized.
- New option --strip-tags to suppress the output of XML tags.
- Support for textual representations of emojis (:smile:,
:stuck_out_tongue_winking_eye:, etc.).
- Support for textfaces (༼ʘ̚ل͜ʘ̚༽, ╚(ಠ_ಠ)=┐, etc.).
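
The speed-up item above mentions a doubly linked list of Token objects. The following is a minimal sketch of that kind of data structure, not SoMaJo's actual implementation: the names Token, TokenList and split are hypothetical and chosen only for illustration. It shows why such a representation suits a tokenizer: a token can be split in place in constant time, without copying or shifting the rest of the text.

```python
class Token:
    """A node in a doubly linked list (hypothetical, for illustration only)."""

    __slots__ = ("text", "prev", "next")

    def __init__(self, text):
        self.text = text
        self.prev = None
        self.next = None


class TokenList:
    """Doubly linked list of Token objects (a sketch, not SoMaJo's internals)."""

    def __init__(self, texts=()):
        self.head = None
        self.tail = None
        for text in texts:
            self.append(Token(text))

    def append(self, token):
        """Attach token at the end of the list."""
        token.prev = self.tail
        token.next = None
        if self.tail is None:
            self.head = token
        else:
            self.tail.next = token
        self.tail = token

    def split(self, token, left_text, right_text):
        """Replace token with two new tokens in O(1) by relinking neighbours."""
        left, right = Token(left_text), Token(right_text)
        left.prev, left.next = token.prev, right
        right.prev, right.next = left, token.next
        if token.prev is None:
            self.head = left
        else:
            token.prev.next = left
        if token.next is None:
            self.tail = right
        else:
            token.next.prev = right
        return left, right

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node
            node = node.next


# Split trailing punctuation off the last whitespace-delimited chunk.
tokens = TokenList(["Hello", "world!"])
tokens.split(tokens.tail, "world", "!")
print([t.text for t in tokens])
```

With a plain Python list, the same split would require inserting into the middle of the list, which shifts every later element.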

Breaking changes

- Removed the tokenizer script (deprecated since version 1.5.0
released in October 2017). Use somajo-tokenizer instead.
- Language codes contain the tokenization guideline: "de_CMC" instead
of "de" and "en_PTB" instead of "en".
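
As a rough illustration of migrating to the 2.0.0 API, the sketch below maps the old language codes to the new ones and constructs the new SoMaJo class. The helper names LEGACY_TO_CURRENT, migrate_language_code and tokenize are hypothetical and not part of SoMaJo. The SoMaJo class, its tokenize_text method, and the text and token_class attributes of tokens are assumed to behave as described in the SoMaJo README. The import is placed inside the function so that the mapping helper can be used without SoMaJo installed.

```python
# Old language codes and their 2.0.0 replacements, as listed in the changelog.
LEGACY_TO_CURRENT = {"de": "de_CMC", "en": "en_PTB"}


def migrate_language_code(code):
    """Return the 2.0.0 language code for a pre-2.0.0 code (hypothetical helper)."""
    return LEGACY_TO_CURRENT.get(code, code)


def tokenize(paragraphs, language="de"):
    """Tokenize paragraphs with the new API; yields one list of tokens per sentence.

    Requires SoMaJo >= 2.0.0 to be installed.
    """
    from somajo import SoMaJo  # imported lazily; third-party dependency

    tokenizer = SoMaJo(migrate_language_code(language))
    for sentence in tokenizer.tokenize_text(paragraphs):
        yield [(token.text, token.token_class) for token in sentence]


print(migrate_language_code("de"))
print(migrate_language_code("en_PTB"))
```

Calling, for example, list(tokenize(["Das ist ein Test."])) would then replace code that previously instantiated Tokenizer and SentenceSplitter separately.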
