Gensim

Latest version: v4.3.2

4.1

* [Ensemble LDA](https://radimrehurek.com/gensim/auto_examples/tutorials/run_ensemblelda.html) for robust training, selection and comparison of LDA models.
* [FastSS module](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/similarities/fastss.pyx) for super fast Levenshtein "fuzzy search" queries. Used e.g. for ["soft term similarity"](https://github.com/RaRe-Technologies/gensim/pull/3146) calculations.

There are several minor changes that are **not** backwards compatible with previous versions of Gensim.
The affected functionality is used relatively rarely, so the changes are unlikely to affect most users, and we have opted not to require a major version bump.
Nevertheless, we describe them below.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both ``positive`` and ``negative`` keyword parameters consistently.
They may now be either:

1. A string, in which case the value is reinterpreted as a list of one element (the string value)
2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
3. A list of strings
4. A list of vectors

So you can now simply do:

```python
model.most_similar(positive='war', negative='peace')
```


instead of the slightly more involved

```python
model.most_similar(positive=['war'], negative=['peace'])
```


Both invocations remain correct, so you can use whichever is most convenient.
If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

```python
model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
```


then you will need to specify the lists explicitly in gensim 4.1.
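
The four accepted forms can be pictured as a small normalization step. The following is only an illustrative sketch, not Gensim's actual code: `as_key_list` is a hypothetical helper, and `array.array` stands in for the NumPy vectors Gensim works with, so that the snippet needs nothing beyond the standard library.

```python
from array import array


def as_key_list(value, vector_type=array):
    """Return `value` as a list of keys (hypothetical helper, not Gensim API).

    A lone string or a lone vector becomes a one-element list;
    any other iterable is copied into a list unchanged.
    """
    if value is None:
        return []
    if isinstance(value, (str, vector_type)):
        return [value]
    return list(value)


vec = array("f", [0.1, 0.2, 0.3])  # stand-in for a NumPy vector

print(as_key_list("war"))            # a string is wrapped, not split into characters
print(as_key_list(["w", "a", "r"]))  # an explicit list is kept as given
print(len(as_key_list(vec)))         # a single vector is wrapped as one item
print(len(as_key_list([vec, vec])))  # a list of vectors keeps both items
```

The point of the sketch is the `isinstance` check: a bare string is treated as one key rather than iterated character by character, which is why character lists must now be spelled out explicitly.
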

Deprecated obsolete `steps` parameter from doc2vec

With the newer version, do this:

```python
model.infer_vector(..., epochs=123)
```


instead of this:

```python
model.infer_vector(..., steps=123)
```
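
If older call sites build their keyword arguments dynamically, a thin shim can ease the rename. This is a minimal sketch under stated assumptions: `migrate_infer_kwargs` is a hypothetical helper written for illustration and is not part of Gensim.

```python
import warnings


def migrate_infer_kwargs(kwargs):
    """Rename a legacy `steps` keyword to `epochs` (hypothetical helper)."""
    migrated = dict(kwargs)  # copy so the caller's dict is left untouched
    if "steps" in migrated:
        if "epochs" in migrated:
            raise TypeError("pass either 'steps' or 'epochs', not both")
        warnings.warn(
            "'steps' is obsolete; use 'epochs' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        migrated["epochs"] = migrated.pop("steps")
    return migrated


legacy_call = {"steps": 123}
print(migrate_infer_kwargs(legacy_call))  # {'epochs': 123}
```

The result could then be unpacked into the call, e.g. `model.infer_vector(doc_words, **migrate_infer_kwargs(old_kwargs))`, where `doc_words` and `old_kwargs` are placeholders for your own data.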


Plus a large number of smaller improvements and fixes, as usual.

**⚠️ If migrating from old Gensim 3.x, read the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) first.**

:+1: New features

* [3169](https://github.com/RaRe-Technologies/gensim/pull/3169): Implement `shrink_windows` argument for Word2Vec, by [M-Demay](https://github.com/M-Demay)
* [3163](https://github.com/RaRe-Technologies/gensim/pull/3163): Optimize word mover distance (WMD) computation, by [flowlight0](https://github.com/flowlight0)
* [3157](https://github.com/RaRe-Technologies/gensim/pull/3157): New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by [Witiko](https://github.com/Witiko)
* [3153](https://github.com/RaRe-Technologies/gensim/pull/3153): Vectorize word2vec.predict_output_word for speed, by [M-Demay](https://github.com/M-Demay)
* [3146](https://github.com/RaRe-Technologies/gensim/pull/3146): Use FastSS for fast kNN over Levenshtein distance, by [Witiko](https://github.com/Witiko)
* [3128](https://github.com/RaRe-Technologies/gensim/pull/3128): Materialize and copy the corpus passed to SoftCosineSimilarity, by [Witiko](https://github.com/Witiko)
* [3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [robguinness](https://github.com/robguinness)
* [3091](https://github.com/RaRe-Technologies/gensim/pull/3091): LsiModel: Only log top words that actually exist in the dictionary, by [kmurphy4](https://github.com/kmurphy4)
* [2980](https://github.com/RaRe-Technologies/gensim/pull/2980): Added EnsembleLda for stable LDA topics, by [sezanzeb](https://github.com/sezanzeb)
* [2978](https://github.com/RaRe-Technologies/gensim/pull/2978): Optimize performance of Author-Topic model, by [horpto](https://github.com/horpto)
* [3000](https://github.com/RaRe-Technologies/gensim/pull/3000): Tidy up KeyedVectors.most_similar() API, by [simonwiles](https://github.com/simonwiles)
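
To give a feel for what the new `shrink_windows` argument controls, here is a rough pure-Python sketch. It assumes the behavior described in the Gensim documentation (when enabled, the effective window is sampled uniformly from 1 to `window` for each target word; when disabled, the full `window` is always used). `context_words` is a hypothetical function for illustration only, not Gensim's implementation.

```python
import random


def context_words(tokens, position, window=5, shrink_windows=True, rng=random):
    """Return the context of tokens[position] (illustrative, not Gensim code)."""
    # With shrinking enabled, the effective window is drawn uniformly from 1..window.
    effective = rng.randint(1, window) if shrink_windows else window
    start = max(0, position - effective)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + effective]
    return left + right


sentence = "the quick brown fox jumps over the lazy dog".split()

# Fixed window: always two words on either side of "jumps".
print(context_words(sentence, 4, window=2, shrink_windows=False))
# Shrinking window: one or two words on either side, chosen at random.
print(context_words(sentence, 4, window=2, shrink_windows=True, rng=random.Random(0)))
```

Because nearby words fall inside the sampled window more often than distant ones, shrinking effectively weights context words by distance; a fixed window treats every position equally.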

:books: Tutorials and docs

* [3155](https://github.com/RaRe-Technologies/gensim/pull/3155): Correct parameter name in documentation of fasttext.py, by [bizzyvinci](https://github.com/bizzyvinci)
* [3148](https://github.com/RaRe-Technologies/gensim/pull/3148): Fix broken link to mycorpus.txt in documentation, by [rohit901](https://github.com/rohit901)
* [3142](https://github.com/RaRe-Technologies/gensim/pull/3142): Use more permanent pdf link and update code link, by [dymil](https://github.com/dymil)
* [3141](https://github.com/RaRe-Technologies/gensim/pull/3141): Update link for online LDA paper, by [dymil](https://github.com/dymil)
* [3133](https://github.com/RaRe-Technologies/gensim/pull/3133): Update link to Hoffman paper (online VB LDA), by [jonaschn](https://github.com/jonaschn)
* [3129](https://github.com/RaRe-Technologies/gensim/pull/3129): Add bronze sponsor: TechTarget, by [piskvorky](https://github.com/piskvorky)
* [3126](https://github.com/RaRe-Technologies/gensim/pull/3126): Fix typos in make_wiki_online.py and make_wikicorpus.py, by [nicolasassi](https://github.com/nicolasassi)
* [3125](https://github.com/RaRe-Technologies/gensim/pull/3125): Improve & unify docs for dirichlet priors, by [jonaschn](https://github.com/jonaschn)
* [3123](https://github.com/RaRe-Technologies/gensim/pull/3123): Fix hyperlink for doc2vec tutorial, by [AdityaSoni19031997](https://github.com/AdityaSoni19031997)
* [3121](https://github.com/RaRe-Technologies/gensim/pull/3121): Add bronze sponsor: eaccidents.com, by [piskvorky](https://github.com/piskvorky)
* [3120](https://github.com/RaRe-Technologies/gensim/pull/3120): Fix URL for ldamodel.py, by [jonaschn](https://github.com/jonaschn)
* [3118](https://github.com/RaRe-Technologies/gensim/pull/3118): Fix URL in doc string, by [jonaschn](https://github.com/jonaschn)
* [3107](https://github.com/RaRe-Technologies/gensim/pull/3107): Draw attention to sponsoring in README, by [piskvorky](https://github.com/piskvorky)
* [3105](https://github.com/RaRe-Technologies/gensim/pull/3105): Fix documentation links: Travis to Github Actions, by [piskvorky](https://github.com/piskvorky)
* [3057](https://github.com/RaRe-Technologies/gensim/pull/3057): Clarify doc comment in LdaModel.inference(), by [yocen](https://github.com/yocen)
* [2964](https://github.com/RaRe-Technologies/gensim/pull/2964): Document that preprocessing.strip_punctuation is limited to ASCII, by [sciatro](https://github.com/sciatro)


:red_circle: Bug fixes

* [3178](https://github.com/RaRe-Technologies/gensim/pull/3178): Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by [Witiko](https://github.com/Witiko)
* [3174](https://github.com/RaRe-Technologies/gensim/pull/3174): Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by [emgucv](https://github.com/emgucv)
* [3136](https://github.com/RaRe-Technologies/gensim/pull/3136): Fix indexing error in word2vec_inner.pyx, by [bluekura](https://github.com/bluekura)
* [3131](https://github.com/RaRe-Technologies/gensim/pull/3131): Add missing import to NMF docs and models/__init__.py, by [properGrammar](https://github.com/properGrammar)
* [3116](https://github.com/RaRe-Technologies/gensim/pull/3116): Fix bug where saved Phrases model did not load its connector_words, by [aloknayak29](https://github.com/aloknayak29)
* [2830](https://github.com/RaRe-Technologies/gensim/pull/2830): Fixed KeyError in coherence model, by [pietrotrope](https://github.com/pietrotrope)


:warning: Removed functionality & deprecations

* [3176](https://github.com/RaRe-Technologies/gensim/pull/3176): Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by [rock420](https://github.com/rock420)
* [2965](https://github.com/RaRe-Technologies/gensim/pull/2965): Remove strip_punctuation2 alias of strip_punctuation, by [sciatro](https://github.com/sciatro)
* [3180](https://github.com/RaRe-Technologies/gensim/pull/3180): Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by [rock420](https://github.com/rock420)

🔮 Testing, CI, housekeeping

* [3156](https://github.com/RaRe-Technologies/gensim/pull/3156): Update Numpy minimum version to 1.17.0, by [PrimozGodec](https://github.com/PrimozGodec)
* [3143](https://github.com/RaRe-Technologies/gensim/pull/3143): replace _mul function with explicit casts, by [mpenkov](https://github.com/mpenkov)
* [2952](https://github.com/RaRe-Technologies/gensim/pull/2952): Allow newer versions of the Morfessor module for the tests, by [pabs3](https://github.com/pabs3)

4.1.0

4.0.1

Bugfix release to address issues with Wheels on Windows:

- https://github.com/RaRe-Technologies/gensim/issues/3095
- https://github.com/RaRe-Technologies/gensim/issues/3097

4.0

Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

Main highlights (see also *👍 Improvements* below)

* Massively optimized popular algorithms the community has grown to love: [fastText](https://radimrehurek.com/gensim/models/fasttext.html), [word2vec](https://radimrehurek.com/gensim/models/word2vec.html), [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html), [phrases](https://radimrehurek.com/gensim/models/phrases.html):

a. **Efficiency**

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|----------|------------|--------|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / **1.26 GB** / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | **1.2h** / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. ([4.0 benchmarks](https://github.com/RaRe-Technologies/gensim/issues/2887#issuecomment-711097334))
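
As a quick sanity check of the summary, the ratios can be recomputed from the table. The snippet below is just arithmetic on the published figures; the `improvement` helper and the dictionary layout are illustrative choices, not part of Gensim.

```python
# Figures copied from the table above: (wall time in hours, peak RAM in GB, k words/s).
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(old, new):
    """Return (time, RAM, throughput) ratios; values above 1 favour the new release."""
    old_time, old_ram, old_rate = old
    new_time, new_ram, new_rate = new
    return old_time / new_time, old_ram / new_ram, new_rate / old_rate


for model, runs in benchmarks.items():
    time_x, ram_x, rate_x = improvement(runs["3.8.3"], runs["4.0.0"])
    print(f"{model}: wall time {time_x:.2f}x, peak RAM {ram_x:.2f}x, throughput {rate_x:.2f}x")
```

For fastText the peak RAM ratio works out to roughly 3.26, which matches the "3x less RAM" claim. The word2vec "2x faster init" and the phrase-detection speedup are not derivable from this table; they come from the linked benchmarks.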

b. **Robustness**. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. **Simplified OOP model** for easier model exports and integration with TensorFlow, PyTorch & co.

These improvements come to you transparently (aka "for free"), but see the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) for some changes that break the old Gensim 3.x API. **Update your code accordingly**.

* Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.
  - Code quality was not up to our standards. There was also no one to maintain and support these modules or to answer user questions.

So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone is interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

* Dropped Python 2. Gensim 4.0 is Py3.6+. Read our [Python version support policy](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility).
  - If you still need Python 2 for some reason, stay at [Gensim 3.8.3](https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3).

* A new [Gensim website](https://radimrehurek.com/gensim_4.0.0) – finally! 🙃

So, a major clean-up release overall. We're happy with this **tighter, leaner and faster Gensim**.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

Why pre-release?

This 4.0.0beta pre-release is for users who want the **cutting edge performance and bug fixes**. Plus users who want to help out, by **testing and providing feedback**: code, documentation, workflows… Please let us know on the [mailing list](https://groups.google.com/g/gensim)!

Install the pre-release with:

```bash
pip install --pre --upgrade gensim
```


What will change between this pre-release and a "full" 4.0 release?

Production stability is important to Gensim, so we're improving the process of **upgrading already-trained saved models**. There'll be an explicit model upgrade script between each `4.n` to `4.(n+1)` Gensim release. Check progress [here](https://github.com/RaRe-Technologies/gensim/milestone/3).


:+1: Improvements

* [2947](https://github.com/RaRe-Technologies/gensim/pull/2947): Bump minimum Python version to 3.6, by [gojomo](https://github.com/gojomo)
* [2939](https://github.com/RaRe-Technologies/gensim/pull/2939) + [2984](https://github.com/RaRe-Technologies/gensim/pull/2984): Code style & py3 migration clean up, by [piskvorky](https://github.com/piskvorky)
* [2300](https://github.com/RaRe-Technologies/gensim/pull/2300): Use less RAM in LdaMulticore, by [horpto](https://github.com/horpto)
* [2698](https://github.com/RaRe-Technologies/gensim/pull/2698): Streamline KeyedVectors & X2Vec API, by [gojomo](https://github.com/gojomo)
* [2864](https://github.com/RaRe-Technologies/gensim/pull/2864): Speed up random number generation in word2vec, by [zygm0nt](https://github.com/zygm0nt)
* [2976](https://github.com/RaRe-Technologies/gensim/pull/2976): Speed up phrase (collocation) detection, by [piskvorky](https://github.com/piskvorky)
* [2979](https://github.com/RaRe-Technologies/gensim/pull/2979): Allow skipping common English words in multi-word phrases, by [piskvorky](https://github.com/piskvorky)
* [2867](https://github.com/RaRe-Technologies/gensim/pull/2867): Expose `max_final_vocab` parameter in fastText constructor, by [mpenkov](https://github.com/mpenkov)
* [2931](https://github.com/RaRe-Technologies/gensim/pull/2931): Clear up job queue parameters in word2vec, by [lunastera](https://github.com/lunastera)
* [2939](https://github.com/RaRe-Technologies/gensim/pull/2939): X2Vec SaveLoad improvements, by [piskvorky](https://github.com/piskvorky)

:books: Tutorials and docs

* [2954](https://github.com/RaRe-Technologies/gensim/pull/2954): New theme for the Gensim website, by [dvorakvaclav](https://github.com/dvorakvaclav)
* [2960](https://github.com/RaRe-Technologies/gensim/issues/2960): Added [Gensim and Compatibility](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility) Wiki page, by [piskvorky](https://github.com/piskvorky)
* [2960](https://github.com/RaRe-Technologies/gensim/issues/2960): Reworked & simplified the [Developer Wiki page](https://github.com/RaRe-Technologies/gensim/wiki/Developer-page), by [piskvorky](https://github.com/piskvorky)
* [2968](https://github.com/RaRe-Technologies/gensim/pull/2968): Migrate tutorials & how-tos to 4.0.0, by [piskvorky](https://github.com/piskvorky)
* [2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Clean up of language and formatting of docstrings, by [piskvorky](https://github.com/piskvorky)
* [2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Added documentation for NMSLIB indexer, by [piskvorky](https://github.com/piskvorky)
* [2832](https://github.com/RaRe-Technologies/gensim/pull/2832): Clear up LdaModel documentation, by [FyzHsn](https://github.com/FyzHsn)
* [2871](https://github.com/RaRe-Technologies/gensim/pull/2871): Clarify that license is LGPL-2.1, by [pombredanne](https://github.com/pombredanne)
* [2896](https://github.com/RaRe-Technologies/gensim/pull/2896): Make docs clearer on `alpha` parameter in LDA model, by [xh2](https://github.com/xh2)
* [2897](https://github.com/RaRe-Technologies/gensim/pull/2897): Update Hoffman paper link for Online LDA, by [xh2](https://github.com/xh2)
* [2910](https://github.com/RaRe-Technologies/gensim/pull/2910): Refresh docs for run_annoy tutorial, by [piskvorky](https://github.com/piskvorky)
* [2935](https://github.com/RaRe-Technologies/gensim/pull/2935): Fix "generator" language in word2vec docs, by [polm](https://github.com/polm)

:red_circle: Bug fixes

* [2891](https://github.com/RaRe-Technologies/gensim/pull/2891): Fix fastText word-vectors with ngrams off, by [gojomo](https://github.com/gojomo)
* [2907](https://github.com/RaRe-Technologies/gensim/pull/2907): Fix doc2vec crash for large sets of doc-vectors, by [gojomo](https://github.com/gojomo)
* [2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Fix similarity bug in NMSLIB indexer, by [piskvorky](https://github.com/piskvorky)
* [2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Fix deprecation warnings in Annoy integration, by [piskvorky](https://github.com/piskvorky)
* [2901](https://github.com/RaRe-Technologies/gensim/pull/2901): Fix inheritance of WikiCorpus from TextCorpus, by [jenishah](https://github.com/jenishah)
* [2940](https://github.com/RaRe-Technologies/gensim/pull/2940): Fix deprecations in SoftCosineSimilarity, by [Witiko](https://github.com/Witiko)
* [2944](https://github.com/RaRe-Technologies/gensim/pull/2944): Fix `save_facebook_model` failure after update-vocab & other initialization streamlining, by [gojomo](https://github.com/gojomo)
* [2846](https://github.com/RaRe-Technologies/gensim/pull/2846): Fix for Python 3.9/3.10: remove `xml.etree.cElementTree`, by [hugovk](https://github.com/hugovk)
* [2973](https://github.com/RaRe-Technologies/gensim/issues/2973): phrases.export_phrases() doesn't yield all bigrams
* [2942](https://github.com/RaRe-Technologies/gensim/issues/2942): Segfault when training doc2vec

:warning: Removed functionality & deprecations

* [6](https://github.com/RaRe-Technologies/gensim-wheels/pull/6): No more binary wheels for x32 platforms, by [menshikh-iv](https://github.com/menshikh-iv)
* [2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Renamed overly broad `similarities.index` to the more appropriate `similarities.annoy`, by [piskvorky](https://github.com/piskvorky)
* [2958](https://github.com/RaRe-Technologies/gensim/pull/2958): Remove gensim.summarization subpackage, docs and test data, by [mpenkov](https://github.com/mpenkov)
* [2926](https://github.com/RaRe-Technologies/gensim/pull/2926): Rename `num_words` to `topn` in dtm_coherence, by [MeganStodel](https://github.com/MeganStodel)
* [2937](https://github.com/RaRe-Technologies/gensim/pull/2937): Remove Keras dependency, by [piskvorky](https://github.com/piskvorky)
* Removed all code, methods, attributes and functions marked as deprecated in [Gensim 3.8.3](https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3).
* Removed pattern dependency (PR [3012](https://github.com/RaRe-Technologies/gensim/pull/3012), [mpenkov](https://github.com/mpenkov)). If you need to lemmatize, do it prior to passing the corpus to gensim.
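
Since lemmatization now happens outside Gensim, one possible workflow is to lemmatize each tokenized document before building the corpus. The sketch below is only an illustration: `TOY_LEMMAS` is a made-up lookup table standing in for whatever real lemmatizer you choose, and `lemmatize_tokens` is a hypothetical helper.

```python
# Toy lookup table standing in for a real lemmatizer (an assumption for this sketch).
TOY_LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "mice": "mouse"}


def lemmatize_tokens(tokens, lemmas=TOY_LEMMAS):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [lemmas.get(token, token) for token in tokens]


raw_corpus = [
    ["cats", "ran", "fast"],
    ["mice", "running", "away"],
]

# Lemmatize first; the resulting token lists are what you would hand to Gensim.
lemmatized_corpus = [lemmatize_tokens(doc) for doc in raw_corpus]
print(lemmatized_corpus)
# [['cat', 'run', 'fast'], ['mouse', 'run', 'away']]
```

The lemmatized token lists can then be passed on as usual, for example to `gensim.corpora.Dictionary(lemmatized_corpus)`.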

---

4.0.0

4.0.0.rc1
