Somajo

Latest version: v2.4.2



2.1.4

- Add a few abbreviations.
- Improve detection of sentence boundaries when punctuation is
followed by emoticons, mentions or hashtags.

2.1.3

- Add a few abbreviations.
- Improve tokenization of protocol-less URLs.
- Improve tokenization of a few emoticons and symbols/dingbats.
- Improve tokenization of gendered nouns (gender star, gender colon).
- Improve tokenization of simple arithmetic operations.

2.1.2

- Allow hyphens in hashtags. While hyphens cannot be part of Twitter
hashtags, we do not want to split compounds like
“Refugeeswelcome-Bewegung”.
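The effect of allowing hyphens can be illustrated with a small regular expression. This is only a sketch: the pattern and the helper name `find_hashtags` are assumptions made for illustration, not SoMaJo's actual tokenization rules.

```python
import re

# Hypothetical pattern (not taken from SoMaJo): a hashtag is "#" followed by
# word characters, optionally continued by hyphen-joined word segments.
# A trailing hyphen is not included, because each hyphen must be followed
# by at least one word character.
HASHTAG = re.compile(r"#\w+(?:-\w+)*")


def find_hashtags(text):
    """Return all hashtag-like spans in text, keeping hyphenated compounds whole."""
    return HASHTAG.findall(text)


print(find_hashtags("Die #Refugeeswelcome-Bewegung wächst."))
```

With a pattern that stopped at the first hyphen, the compound would be split into a hashtag and a separate remainder, which is the behaviour this release avoids.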

2.1.1

- Detection of quotes delimited by apostrophes ('…') is now more
conservative (issue 16).

2.1.0

- New feature: Delimit sentences with XML tags (via the command line
option --sentence-tag TAGNAME or by passing xml_sentences="TAGNAME"
to the constructor). When using this option with XML input, SoMaJo
tries hard to produce well-formed XML as output. To achieve this,
some tags will need to be closed and re-opened at sentence
boundaries. In this paragraph, for example, the italic region
contains a sentence boundary:

    <p>Hi <i>there! How</i> are you?</p>

SoMaJo will close the i tag before the end of the sentence and
re-open it afterwards:

    <p> <s> Hi <i> there ! </i> </s> <s> <i> How </i> are you ? </s> </p>
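The close-and-re-open step can be sketched in a few lines of plain Python. This is not SoMaJo's implementation: the function `tag_sentences`, its token-list input and the `sentence_ends` index set are all assumptions made for illustration. The sketch only covers bare tags without attributes that open inside a sentence, as in the example above.

```python
def tag_sentences(tokens, sentence_ends, tag="s"):
    """Wrap sentences in <tag> elements, keeping the output well-formed.

    tokens: list of strings; markup tokens look like "<i>" or "</i>".
    sentence_ends: set of indices of the tokens that end a sentence.
    Hypothetical helper for illustration only.
    """
    out = []
    inside = []    # tags opened within the current sentence and still open
    pending = []   # tags closed at a boundary, to be re-opened in the next sentence
    in_sentence = False
    for i, tok in enumerate(tokens):
        is_markup = tok.startswith("<") and tok.endswith(">")
        if is_markup and tok.startswith("</"):
            name = tok[2:-1]
            if in_sentence and inside and inside[-1] == name:
                inside.pop()
                out.append(tok)
            elif not in_sentence and pending and pending[-1] == name:
                pending.pop()  # already closed at the sentence boundary
            else:
                out.append(tok)  # closes a tag opened outside any sentence
        elif is_markup:
            if in_sentence:
                inside.append(tok[1:-1])
            out.append(tok)
        else:
            if not in_sentence:
                # The first word token starts a sentence; re-open pending tags.
                out.append(f"<{tag}>")
                in_sentence = True
                for name in pending:
                    out.append(f"<{name}>")
                    inside.append(name)
                pending = []
            out.append(tok)
            if i in sentence_ends:
                # Close still-open inner tags, innermost first, then the sentence.
                for name in reversed(inside):
                    out.append(f"</{name}>")
                out.append(f"</{tag}>")
                pending, inside = inside, []
                in_sentence = False
    return " ".join(out)


tokens = ["<p>", "Hi", "<i>", "there", "!", "How", "</i>", "are", "you", "?", "</p>"]
print(tag_sentences(tokens, sentence_ends={4, 9}))
# <p> <s> Hi <i> there ! </i> </s> <s> <i> How </i> are you ? </s> </p>
```

Tracking the tags opened inside the current sentence separately from the outer ones is what lets the sketch leave the p element untouched while splitting the i element in two.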

2.0.6

- Support all textual smileys and textfaces from Signal messenger.
- Raise a TypeError if tokenize_text is called with a string instead
of an iterable of strings (issue 13).
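The reason for such a check is that a Python string is itself an iterable of one-character strings, so passing a single string would otherwise be silently treated as many tiny paragraphs. A minimal sketch of this kind of guard, assuming a hypothetical helper `check_paragraphs` (this is not SoMaJo's code):

```python
def check_paragraphs(paragraphs):
    """Reject a bare string where an iterable of strings is expected.

    Hypothetical helper for illustration; it only shows the kind of
    guard described in the changelog entry.
    """
    if isinstance(paragraphs, str):
        raise TypeError(
            "expected an iterable of strings (e.g. a list of paragraphs), "
            "got a single string"
        )
    return paragraphs


# A list of paragraphs passes through unchanged.
print(check_paragraphs(["Hallo Welt!", "Wie geht's?"]))

# A bare string is rejected instead of being iterated character by character.
try:
    check_paragraphs("Hallo Welt!")
except TypeError as err:
    print(f"TypeError: {err}")
```

Callers with a single string can wrap it in a list before passing it on.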

