CLTK

Latest version: v1.3.0

0.1.29

This release adds basic Word2Vec support, including Greek and Latin Word2Vec models (https://github.com/cltk/latin_word2vec_cltk and https://github.com/cltk/greek_word2vec_cltk). The key new functionality is a keyword expander for use when querying the TLG and PHI5 corpora.

From the docs

Word2Vec is a vector space model that is especially powerful for comparing words in relation to one another. For instance, it is commonly used to discover words that appear in similar contexts (something akin to synonyms; think of them as lexical clusters).

The CLTK repository contains pre-trained Word2Vec models for Latin (import as `latin_word2vec_cltk`), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.

One of the most useful simple features of Word2Vec is keyword expansion: taking a query term, finding similar terms, and searching for those as well. Here's an example of its use:

```python
In [1]: from cltk.ir.query import search_corpus

In [2]: for x in search_corpus('amicitia', 'phi5', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.25):
   ...:     print(x)
   ...:
The following similar terms will be added to the 'amicitia' query: '['societate', 'praesentia', 'uita', 'sententia', 'promptu', 'beneuolentia', 'dignitate', 'monumentis', 'somnis', 'philosophia']'.
('L. Iunius Moderatus Columella', 'hospitem, nisi ex *amicitia* domini, quam raris-\nsime recipiat.')
('L. Iunius Moderatus Columella', ' \n Xenophon Atheniensis eo libro, Publi Siluine, qui Oeconomicus \ninscribitur, prodidit maritale coniugium sic comparatum esse \nnatura, ut non solum iucundissima, uerum etiam utilissima uitae \nsocietas iniretur: nam primum, quod etiam Cicero ait, ne genus \nhumanum temporis longinquitate occideret, propter \nhoc marem cum femina esse coniunctum, deinde, ut ex \nhac eadem *societate* mortalibus adiutoria senectutis nec \nminus propugnacula praeparentur.')
('L. Iunius Moderatus Columella', 'ac ne ista quidem \npraesidia, ut diximus, non adsiduus labor et experientia \nuilici, non facultates ac uoluntas inpendendi tantum pollent \nquantum uel una *praesentia* domini, quae nisi frequens \noperibus interuenerit, ut in exercitu, cum abest imperator, \ncuncta cessant officia.')
```

`threshold` sets how similar a neighboring word must be to the query term in order to be included. Note that when `expand_keyword=True`, the search term is stripped of any regular expression syntax.
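As a rough illustration of that last point, removing regex metacharacters might look like the sketch below. `strip_regex_syntax` is a hypothetical helper for illustration, not the CLTK's actual implementation:

```python
import re

def strip_regex_syntax(query):
    # Hypothetical helper: delete common regex metacharacters so the
    # expanded keyword is matched as a literal string.
    return re.sub(r"[.^$*+?{}\[\]\\|()]", "", query)

strip_regex_syntax(r"amicitia.*")  # 'amicitia'
```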

The keyword expander relies on `get_sims()` (which in turn builds on the Gensim package) to find similar terms. Some examples of it in action:

```python
In [3]: from cltk.vector.word2vec import get_sims

In [4]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.7)
Matches found, but below the threshold of 'threshold=0.7'. Lower it to see these results.
Out[4]: []

In [5]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.2)
Out[5]:
['lictor',
 'extemplo',
 'cena',
 'nuntio',
 'aduenio',
 'iniussus2',
 'forum',
 'dictator',
 'fabium',
 'caesarem']

In [6]: get_sims('iube', 'latin', lemmatized=True, threshold=0.7)
"word 'iube' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['iubet”', 'iubet', 'iubilo', 'iubĕ', 'iubar', 'iubes', 'iubatus', 'iuba1', 'iubeo']'.

In [7]: get_sims('dictator', 'latin', lemmatized=False, threshold=0.7)
Out[7]:
['consul',
 'caesar',
 'seruilius',
 'praefectus',
 'flaccus',
 'manlius',
 'sp',
 'fuluius',
 'fabio',
 'ualerius']
```

To add and subtract vectors, you need to load the models yourself with Gensim.

0.1.28

About

Addition of information retrieval module for pattern matching in text.

From the docs

Several functions are available for querying text to match regular expression patterns. `match_regex()` is the most basic. Punctuation rules are included for texts using Latin sentence-final punctuation ('.', '!', '?') and Greek ('.', ';'). For returned strings, you may choose a context of the match's sentence, its paragraph, or a custom number of characters on each side of a hit. Note that this function and the next each return a generator.

Here is an example in Latin with a sentence context, case-insensitive:

```python
In [1]: from cltk.ir.query import match_regex

In [2]: text = 'Ita fac, mi Lucili; vindica te tibi. et tempus, quod adhuc aut auferebatur aut subripiebatur aut excidebat, collige et serva.'

In [3]: matches = match_regex(text, r'tempus', language='latin', context='sentence', case_insensitive=True)

In [4]: for match in matches:
   ...:     print(match)
   ...:
et *tempus*, quod adhuc aut auferebatur aut subripiebatur aut excidebat, collige et serva.
```

And here with context of 40 characters:

```python
In [5]: matches = match_regex(text, r'tempus', language='latin', context=40, case_insensitive=True)

In [6]: for match in matches:
   ...:     print(match)
   ...:
Ita fac, mi Lucili; vindica te tibi. et *tempus*, quod adhuc aut auferebatur aut subripi
```

For querying the entirety of a corpus, see `search_corpus()`, which yields tuples of ('author_name', 'match_context').

```python
In [7]: from cltk.ir.query import search_corpus

In [8]: for match in search_corpus('ὦ ἄνδρες Ἀθηναῖοι', 'tlg', context='sentence'):
   ...:     print(match)
   ...:
('Ammonius Phil.', ' \nκαλοῦντας ἑτέρους ἢ προστάσσοντας ἢ ἐρωτῶντας ἢ εὐχομένους περί τινων, \nπολλάκις δὲ καὶ αὐτοπροσώπως κατά τινας τῶν ἐνεργειῶν τούτων ἐνεργοῦ-\nσι “πρῶτον μέν, *ὦ ἄνδρες Ἀθηναῖοι*, τοῖς θεοῖς εὔχομαι πᾶσι καὶ πάσαις” \nλέγοντες ἢ “ἀπόκριναι γὰρ δεῦρό μοι ἀναστάς”. οἱ οὖν περὶ τῶν τεχνῶν \nτούτων πραγματευόμενοι καὶ τοὺς λόγους εἰς θεωρίαν ')
('Sopater Rhet.', "θόντα, ἢ συγγνωμονηκέναι καὶ ἐλεῆσαι. ψυχῆς γὰρ \nπάθος ἐπὶ συγγνώμῃ προτείνεται. παθητικὴν οὖν ποιή-\nσῃ τοῦ πρώτου προοιμίου τὴν ἔννοιαν: ἁπάντων, ὡς ἔοι-\nκεν, *ὦ ἄνδρες Ἀθηναῖοι*, πειρασθῆναί με τῶν παραδό-\nξων ἀπέκειτο, πόλιν ἰδεῖν ἐν μέσῃ Βοιωτίᾳ κειμένην. καὶ \nμετὰ Θήβας οὐκ ἔτ' οὔσας, ὅτι μὴ στεφανοῦντας Ἀθη-\nναίους ἀπέδειξα παρὰ τὴ")
```

0.1.27

Update for compatibility with Python 3.5. This fix depended on NLTK 3.1, which fixed an error in the Punkt word tokenizer under Python 3.5.

0.1.26

Add wrapper function, `nltk_tokenize_words()`, for `PunktLanguageVars.word_tokenize()`.
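A minimal sketch of what such a wrapper might look like; `PunktLanguageVars.word_tokenize()` is NLTK's API, but the real CLTK wrapper may handle options this sketch omits:

```python
from nltk.tokenize.punkt import PunktLanguageVars

def nltk_tokenize_words(text):
    # Sketch of a thin wrapper around NLTK's Punkt word tokenizer;
    # returns a list of word tokens.
    return PunktLanguageVars().word_tokenize(text)

nltk_tokenize_words("Ita fac, mi Lucili; vindica te tibi.")
```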

Move NER to `cltk.tag` (from `cltk.ner`).

Switch versioning to semantic style ([read about here](http://semver.org/)).

0.0.1.24

Addition of basic named entity recognition (NER) for Greek and Latin.

This works by taking word lists ([Latin](https://github.com/cltk/latin_proper_names_cltk) and [Greek](https://github.com/cltk/greek_proper_names_cltk)) and matching them against the tokens of incoming text.
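A minimal sketch of this list-based approach. The name set below is a tiny stand-in for the linked word lists, and `tag_ner` is illustrative, not the CLTK function's exact signature:

```python
# Tiny stand-in for the Latin proper-name list linked above.
PROPER_NAMES = {"Caesar", "Cicero", "Roma"}

def tag_ner(tokens):
    # Mark a token as an entity when it appears in the name list;
    # otherwise pass it through untagged.
    return [(tok, "Entity") if tok in PROPER_NAMES else (tok,) for tok in tokens]

tag_ner("Caesar venit Romam".split())
# [('Caesar', 'Entity'), ('venit',), ('Romam',)]
```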

0.0.1.23

This release features the first robust interface for using the TLG indices, which were parsed by Stephen Margheim (smargh).

Also included are JSON versions of the PHI5 by Martín Pozzi (marpozzi), though no helper functions have been written yet.

Docs available at: http://docs.cltk.org/en/latest/greek.html#tlg-indices.
