Somajo

Latest version: v2.4.2

1.8.1

- Fixed the following bug: When using option -e, “nasty” characters
between whitespace within tokens that are allowed to contain
whitespace (e.g. XML tags) caused SoMaJo to hang.
- Added zero-width no-break space (FEFF) to “nasty” characters.

1.8.0

- New language: SoMaJo can tokenize English texts (using the new
option -l/--language).
- Small improvements to tokenization (URLs, emoticons, number
compounds, …).

1.7.0

SoMaJo now has full XML support. To tokenize an XML file, use the
option -x/--xml. Via the option --tag (can be used multiple times),
you can specify which tags always constitute sentence breaks, e.g.
title, h1 or p tags in an HTML file.

1.6.0

- XML declarations are recognized as single tokens.
- Additional “nasty” characters (zero-width joiners and non-joiners,
left-to-right and right-to-left marks) are removed from the input.
- The input is normalized to Unicode normal form C (NFC).
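
The two input-cleaning steps in this release (dropping invisible characters and normalizing to NFC) can be illustrated with the standard library alone. This is a minimal sketch, not SoMaJo's actual implementation: the names `NASTY_CHARS` and `clean_text` are hypothetical, and the character set is assumed from the changelog entries (including the FEFF character mentioned under 1.8.1).

```python
import unicodedata

# Assumed character set, taken from the changelog wording;
# SoMaJo's real list may differ.
NASTY_CHARS = {
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u200e",  # left-to-right mark
    "\u200f",  # right-to-left mark
    "\ufeff",  # zero-width no-break space
}
_STRIP_TABLE = {ord(char): None for char in NASTY_CHARS}


def clean_text(text: str) -> str:
    """Remove invisible characters, then normalize to Unicode NFC."""
    return unicodedata.normalize("NFC", text.translate(_STRIP_TABLE))


# "e" + combining acute and "u" + combining diaeresis are composed by NFC.
sample = "Cafe\u0301\u200d \ufeffMu\u0308nchen"
cleaned = clean_text(sample)
print(cleaned)
```

Stripping before normalizing means an invisible character sitting between a base letter and its combining mark cannot block composition.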

1.5.0

- Bugfix: Removed trailing space from last token in
paragraph/sentence.
- SoMaJo should be run as 'somajo-tokenizer'. The 'tokenizer' command
is deprecated.
- XML entities (&amp;, &#75;, &#x7f;) are recognized as single tokens.
- Some abbreviations (usw., usf., etc., uvam.) indicate sentence
boundaries if they are followed by a potential sentence start.
- We also print a log message that indicates tokenization speed.
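
To illustrate what treating XML entities as single tokens means, here is a small sketch using a regular expression over named, decimal, and hexadecimal references. It is an assumption-laden illustration, not SoMaJo's tokenizer: `ENTITY_RE` and `split_keeping_entities` are hypothetical names, and everything outside an entity is simply split on whitespace.

```python
import re

# Named (&amp;), decimal (&#75;) and hexadecimal (&#x7f;) references.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);")


def split_keeping_entities(text):
    """Split on whitespace, but keep each XML entity as one token."""
    tokens = []
    pos = 0
    for match in ENTITY_RE.finditer(text):
        # Text before the entity is split naively on whitespace.
        tokens.extend(text[pos:match.start()].split())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens


tokens = split_keeping_entities("Tom &amp; Jerry &#75; &#x7f;")
print(tokens)
```

A real tokenizer would also separate punctuation and handle many other token classes; the point here is only that the ampersand, the body, and the semicolon stay together.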

1.4.4

This release improves sentence splitting for sentences ending in
German closing quotation marks (“).
