:star2: New features
* Fast Online NMF ([anotherbugmaster](https://github.com/anotherbugmaster), [#2007](https://github.com/RaRe-Technologies/gensim/pull/2007))
- Benchmark `wiki-english-20171001`
| Model | Perplexity | Coherence | L2 norm | Train time (minutes) |
|-------|------------|-----------|---------|----------------------|
| LDA | 4727.07 | -2.514 | 7.372 | 138 |
| NMF | **975.74** | -2.814 | **7.265** | **73** |
| NMF (with regularization) | 985.57 | **-2.436** | 7.269 | 441 |
- Simple to use (same interface as `LdaModel`)
```python
from gensim.models.nmf import Nmf
from gensim.corpora import Dictionary
import gensim.downloader as api

text8 = api.load('text8')

dictionary = Dictionary(text8)
dictionary.filter_extremes()

corpus = [
    dictionary.doc2bow(doc) for doc in text8
]

nmf = Nmf(
    corpus=corpus,
    num_topics=5,
    id2word=dictionary,
    chunksize=2000,
    passes=5,
    random_state=42,
)

nmf.show_topics()
"""
[(0, '0.007*"km" + 0.006*"est" + 0.006*"islands" + 0.004*"league" + 0.004*"rate" + 0.004*"female" + 0.004*"economy" + 0.003*"male" + 0.003*"team" + 0.003*"elections"'),
 (1, '0.006*"actor" + 0.006*"player" + 0.004*"bwv" + 0.004*"writer" + 0.004*"actress" + 0.004*"singer" + 0.003*"emperor" + 0.003*"jewish" + 0.003*"italian" + 0.003*"prize"'),
 (2, '0.036*"college" + 0.007*"institute" + 0.004*"jewish" + 0.004*"universidad" + 0.003*"engineering" + 0.003*"colleges" + 0.003*"connecticut" + 0.003*"technical" + 0.003*"jews" + 0.003*"universities"'),
 (3, '0.016*"import" + 0.008*"insubstantial" + 0.007*"y" + 0.006*"soviet" + 0.004*"energy" + 0.004*"info" + 0.003*"duplicate" + 0.003*"function" + 0.003*"z" + 0.003*"jargon"'),
 (4, '0.005*"software" + 0.004*"games" + 0.004*"windows" + 0.003*"microsoft" + 0.003*"films" + 0.003*"apple" + 0.003*"video" + 0.002*"album" + 0.002*"fiction" + 0.002*"characters"')]
"""
```
- See also:
- [NMF tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/nmf_tutorial.ipynb)
- [Full NMF Benchmark](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/nmf_wikipedia.ipynb)
* Massive improvement of `FastText` compatibility ([mpenkov](https://github.com/mpenkov), [#2313](https://github.com/RaRe-Technologies/gensim/pull/2313))
```python
from gensim.models import FastText

# 'cc.ru.300.bin' - Russian Facebook FT model trained on Common Crawl
# It can be downloaded from https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz

model = FastText.load_fasttext_format("cc.ru.300.bin")

# The fixed hash function produces the same output as FB FastText
# and works correctly for non-Latin languages (for example, Russian)
assert "мяу" in model.wv.vocab  # 'мяу' - vocab word
model.wv.most_similar("мяу")
"""
[('Мяу', 0.6820122003555298),
 ('МЯУ', 0.6373013257980347),
 ('мяу-мяу', 0.593108594417572),
 ('кис-кис', 0.5899622440338135),
 ('гав', 0.5866007804870605),
 ('Кис-кис', 0.5798211097717285),
 ('Кис-кис-кис', 0.5742273330688477),
 ('Мяу-мяу', 0.5699705481529236),
 ('хрю-хрю', 0.5508339405059814),
 ('ав-ав', 0.5479759573936462)]
"""

assert "котогород" not in model.wv.vocab  # 'котогород' - out-of-vocab word
model.wv.most_similar("котогород", topn=3)
"""
[('автогород', 0.5463314652442932),
 ('ТагилНовокузнецкНовомосковскНовороссийскНовосибирскНовотроицкНовочеркасскНовошахтинскНовый',
  0.5423436164855957),
 ('областьНовосибирскБарабинскБердскБолотноеИскитимКарасукКаргатКуйбышевКупиноОбьТатарскТогучинЧерепаново',
  0.5377570390701294)]
"""

# The full model is loaded now, so training can be continued
from gensim.test.utils import datapath
from smart_open import smart_open

with smart_open(datapath("crime-and-punishment.txt"), encoding="utf-8") as infile:  # Russian text
    corpus = [line.strip().split() for line in infile]

model.train(corpus, total_examples=len(corpus), epochs=5)
```
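
The hash-function fix mentioned in the FastText example concerns how character n-grams are mapped to buckets. As a rough sketch (not gensim's actual code), Facebook's fastText is generally understood to use a 32-bit FNV-1a hash over the UTF-8 bytes of an n-gram, with each byte cast to a signed 8-bit integer before the XOR. The function names below are hypothetical. The sketch only aims to show that the signed cast changes the result for bytes of 128 and above, so ASCII-only text hashes the same either way while Cyrillic text does not.

```python
FNV_OFFSET_BASIS = 2166136261
FNV_PRIME = 16777619
MASK32 = 0xFFFFFFFF


def fnv1a_plain(text):
    """Textbook 32-bit FNV-1a over UTF-8 bytes (bytes treated as unsigned)."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * FNV_PRIME) & MASK32
    return h


def fnv1a_signed_bytes(text):
    """FNV-1a variant where each byte is sign-extended before the XOR.

    Assumption: this mirrors a `uint32_t(int8_t(byte))` cast in the C++
    reference code; it is an illustration, not gensim's implementation.
    """
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        if byte >= 128:
            byte -= 256  # reinterpret the byte as a signed 8-bit value
        # `& MASK32` turns a negative value into its 32-bit two's complement
        h = ((h ^ (byte & MASK32)) * FNV_PRIME) & MASK32
    return h


if __name__ == "__main__":
    for ngram in ("cat", "мяу"):
        print(ngram, fnv1a_plain(ngram), fnv1a_signed_bytes(ngram))
```

For ASCII input both functions agree, which may explain why a mismatch of this kind can go unnoticed until a non-Latin model is loaded.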
* Similarity search improvements ([Witiko](https://github.com/Witiko), [#2016](https://github.com/RaRe-Technologies/gensim/pull/2016))
- Add similarity search using the Levenshtein distance in `gensim.similarities.LevenshteinSimilarityIndex`
- Performance optimizations to `gensim.similarities.SoftCosineSimilarity` ([full benchmark](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_benchmark.ipynb))
| Dictionary size | Corpus size | Speedup |
|-----------------|-------------|--------------:|
| 1000 | 100 | 1.0× |
| 1000 | 1000 | **53.4×** |
| 1000 | 100000 | **156784.8×** |
| 100000 | 100 | **3.8×** |
| 100000 | 1000 | **405.8×** |
| 100000 | 100000 | **66262.0×** |
- See [updated soft-cosine tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb) for more information and usage examples
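
To make the Levenshtein-based search more concrete, here is a minimal pure-Python sketch of the edit distance itself and one possible way to normalize it into a similarity in [0, 1]. The function names and the normalization are assumptions for illustration; they are not taken from `LevenshteinSimilarityIndex`, which may scale or threshold the score differently.

```python
def levenshtein_distance(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string `a` into string `b` (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Hypothetical normalization: 1.0 for identical strings, 0.0 when every
    character of the longer string has to be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    print(levenshtein_distance("kitten", "sitting"))
    print(levenshtein_similarity("kitten", "sitting"))
```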
* Add `python3.7` support ([menshikh-iv](https://github.com/menshikh-iv), [#2211](https://github.com/RaRe-Technologies/gensim/pull/2211))
- Wheels for Windows, OSX and Linux platforms ([menshikh-iv](https://github.com/menshikh-iv), [MacPython/gensim-wheels/#12](https://github.com/MacPython/gensim-wheels/pull/12))
- Faster installation
:+1: Improvements
Optimizations
* Reduce `Phraser` memory usage (drop frequencies) ([jenishah](https://github.com/jenishah), [#2208](https://github.com/RaRe-Technologies/gensim/pull/2208))
* Reduce memory consumption of summarizer ([horpto](https://github.com/horpto), [#2298](https://github.com/RaRe-Technologies/gensim/pull/2298))
* Replace the slow inline equivalent of `mean_absolute_difference` with the fast version ([horpto](https://github.com/horpto), [#2284](https://github.com/RaRe-Technologies/gensim/pull/2284))
* Reuse precalculated updated prior in `ldamodel.update_dir_prior` ([horpto](https://github.com/horpto), [#2274](https://github.com/RaRe-Technologies/gensim/pull/2274))
* Improve `KeyedVectors.wmdistance` ([horpto](https://github.com/horpto), [#2326](https://github.com/RaRe-Technologies/gensim/pull/2326))
* Optimize `remove_unreachable_nodes` in `gensim.summarization` ([horpto](https://github.com/horpto), [#2263](https://github.com/RaRe-Technologies/gensim/pull/2263))
* Optimize `mz_entropy` from `gensim.summarization` ([horpto](https://github.com/horpto), [#2267](https://github.com/RaRe-Technologies/gensim/pull/2267))
* Improve `filter_extremes` methods in `Dictionary` and `HashDictionary` ([horpto](https://github.com/horpto), [#2303](https://github.com/RaRe-Technologies/gensim/pull/2303))
Additions
* Add `KeyedVectors.relative_cosine_similarity` ([rsdel2007](https://github.com/rsdel2007), [#2307](https://github.com/RaRe-Technologies/gensim/pull/2307))
* Add `random_seed` to `LdaMallet` ([Zohaggie](https://github.com/Zohaggie) & [menshikh-iv](https://github.com/menshikh-iv), [#2153](https://github.com/RaRe-Technologies/gensim/pull/2153))
* Add `common_terms` parameter to `sklearn_api.PhrasesTransformer` ([pmlk](https://github.com/pmlk), [#2074](https://github.com/RaRe-Technologies/gensim/pull/2074))
* Add method to patch `corpora.Dictionary` based on special tokens ([Froskekongen](https://github.com/Froskekongen), [#2200](https://github.com/RaRe-Technologies/gensim/pull/2200))
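
As an illustration of the relative cosine similarity added above: as commonly described, it divides the cosine similarity between two words by the sum of cosine similarities between the first word and its top-n most similar words. The sketch below works on plain Python lists with a made-up toy vocabulary; `toy_vectors` and the helper names are hypothetical, and details such as the default `topn` are assumptions rather than a statement about the `KeyedVectors` method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the summed cosine similarity between `wa`
    and its `topn` nearest neighbours (excluding `wa` itself)."""
    neighbour_sims = sorted(
        (cosine(vectors[wa], vec) for word, vec in vectors.items() if word != wa),
        reverse=True,
    )[:topn]
    return cosine(vectors[wa], vectors[wb]) / sum(neighbour_sims)


# Made-up 2-D vectors, chosen only so the cosines are easy to check by hand.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [4.0, 3.0],     # cos(cat, dog) = 0.8
    "kitten": [3.0, 4.0],  # cos(cat, kitten) = 0.6
    "car": [0.0, 1.0],     # cos(cat, car) = 0.0
}

if __name__ == "__main__":
    print(relative_cosine_similarity(toy_vectors, "cat", "dog", topn=2))
```

With this definition, the scores of the top-n neighbours of a word sum to 1, so a single neighbour scoring well above 1/n stands out relative to the others.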
Cleanup
* Improve `six` usage (`xrange`, `map`, `zip`) ([horpto](https://github.com/horpto), [#2264](https://github.com/RaRe-Technologies/gensim/pull/2264))
* Refactor `line2doc` methods of `LowCorpus` and `MalletCorpus` ([horpto](https://github.com/horpto), [#2269](https://github.com/RaRe-Technologies/gensim/pull/2269))
* Get rid of most warnings in testing ([menshikh-iv](https://github.com/menshikh-iv), [#2191](https://github.com/RaRe-Technologies/gensim/pull/2191))
* Fix non-deterministic test failures (pin `PYTHONHASHSEED`) ([menshikh-iv](https://github.com/menshikh-iv), [#2196](https://github.com/RaRe-Technologies/gensim/pull/2196))
* Fix "aliasing chunkize to chunkize_serial" warning on Windows ([aquatiko](https://github.com/aquatiko), [#2202](https://github.com/RaRe-Technologies/gensim/pull/2202))
* Remove `getitem` code duplication in `gensim.models.phrases` ([jenishah](https://github.com/jenishah), [#2206](https://github.com/RaRe-Technologies/gensim/pull/2206))
* Add `flake8-rst` for docstring code examples ([kataev](https://github.com/kataev), [#2192](https://github.com/RaRe-Technologies/gensim/pull/2192))
* Get rid of `py26` stuff ([menshikh-iv](https://github.com/menshikh-iv), [#2214](https://github.com/RaRe-Technologies/gensim/pull/2214))
* Use `itertools.chain` instead of `sum` to concatenate lists ([Stigjb](https://github.com/Stigjb), [#2212](https://github.com/RaRe-Technologies/gensim/pull/2212))
* Fix flake8 warnings W605, W504 ([horpto](https://github.com/horpto), [#2256](https://github.com/RaRe-Technologies/gensim/pull/2256))
* Remove unnecessary list creations ([horpto](https://github.com/horpto), [#2261](https://github.com/RaRe-Technologies/gensim/pull/2261))
* Fix extra list creation in `utils.get_max_id` ([horpto](https://github.com/horpto), [#2254](https://github.com/RaRe-Technologies/gensim/pull/2254))
* Fix deprecation warning `np.sum(generator)` ([rsdel2007](https://github.com/rsdel2007), [#2296](https://github.com/RaRe-Technologies/gensim/pull/2296))
* Refactor `BM25` ([horpto](https://github.com/horpto), [#2275](https://github.com/RaRe-Technologies/gensim/pull/2275))
* Fix pyemd import ([ramprakash-94](https://github.com/ramprakash-94), [#2240](https://github.com/RaRe-Technologies/gensim/pull/2240))
* Set `metadata=True` for `make_wikicorpus` script by default ([Xinyi2016](https://github.com/Xinyi2016), [#2245](https://github.com/RaRe-Technologies/gensim/pull/2245))
* Remove unimportant warning from `Phrases` ([rsdel2007](https://github.com/rsdel2007), [#2331](https://github.com/RaRe-Technologies/gensim/pull/2331))
* Replace `open()` by `smart_open()` in `gensim.models.fasttext._load_fasttext_format` ([rsdel2007](https://github.com/rsdel2007), [#2335](https://github.com/RaRe-Technologies/gensim/pull/2335))
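
The `itertools.chain` item in the cleanup list reflects a general Python idiom: `sum(lists, [])` builds a new intermediate list at every step, so its cost grows quadratically with the total number of elements, while `chain.from_iterable` visits each element once. A small standalone illustration (the sample data is made up and unrelated to gensim's code):

```python
from itertools import chain

nested = [["human", "interface"], ["graph", "trees"], [], ["minors"]]

# Works, but copies the accumulated list on every addition.
flat_with_sum = sum(nested, [])

# Same result, produced lazily in a single pass over the sublists.
flat_with_chain = list(chain.from_iterable(nested))

if __name__ == "__main__":
    print(flat_with_chain)
```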
:red_circle: Bug fixes
* Fix overflow error for `*Vec` corpusfile-based training ([bm371613](https://github.com/bm371613), [#2239](https://github.com/RaRe-Technologies/gensim/pull/2239))
* Fix `malletmodel2ldamodel` conversion ([horpto](https://github.com/horpto), [#2288](https://github.com/RaRe-Technologies/gensim/pull/2288))
* Replace custom epsilons with numpy equivalent in `LdaModel` ([horpto](https://github.com/horpto), [#2308](https://github.com/RaRe-Technologies/gensim/pull/2308))
* Add missing content to tarball ([menshikh-iv](https://github.com/menshikh-iv), [#2194](https://github.com/RaRe-Technologies/gensim/pull/2194))
* Fix division by zero when `w_star_count==0` ([allenyllee](https://github.com/allenyllee), [#2259](https://github.com/RaRe-Technologies/gensim/pull/2259))
* Fix check for callbacks ([allenyllee](https://github.com/allenyllee), [#2251](https://github.com/RaRe-Technologies/gensim/pull/2251))
* Fix `SvmLightCorpus.serialize` when `labels` is an instance of `numpy.ndarray` ([aquatiko](https://github.com/aquatiko), [#2243](https://github.com/RaRe-Technologies/gensim/pull/2243))
* Fix Poincaré viz incompatibility with `plotly>=3.0.0` ([jenishah](https://github.com/jenishah), [#2226](https://github.com/RaRe-Technologies/gensim/pull/2226))
* Fix `keep_n` behavior for `Dictionary.filter_extremes` ([johann-petrak](https://github.com/johann-petrak), [#2232](https://github.com/RaRe-Technologies/gensim/pull/2232))
* Fix for `sphinx==1.8.1` (last r ([menshikh-iv](https://github.com/menshikh-iv), [#None](https://github.com/RaRe-Technologies/gensim/pull/None))
* Fix `np.issubdtype` warnings ([marioyc](https://github.com/marioyc), [#2210](https://github.com/RaRe-Technologies/gensim/pull/2210))
* Drop wrong key `-c` from `gensim.downloader` description ([horpto](https://github.com/horpto), [#2262](https://github.com/RaRe-Technologies/gensim/pull/2262))
* Fix gensim build (docs & pyemd issues) ([menshikh-iv](https://github.com/menshikh-iv), [#2318](https://github.com/RaRe-Technologies/gensim/pull/2318))
* Limit visdom version (avoid py2 issue from the latest visdom release) ([menshikh-iv](https://github.com/menshikh-iv), [#2334](https://github.com/RaRe-Technologies/gensim/pull/2334))
* Fix visdom integration (using `viz.line()` instead of `viz.updatetrace()`) ([allenyllee](https://github.com/allenyllee), [#2252](https://github.com/RaRe-Technologies/gensim/pull/2252))
:books: Tutorial and doc improvements
* Add gensim-data repo to `gensim.downloader` & fix rendering of code examples ([menshikh-iv](https://github.com/menshikh-iv), [#2327](https://github.com/RaRe-Technologies/gensim/pull/2327))
* Fix typos in `gensim.models` ([rsdel2007](https://github.com/rsdel2007), [#2323](https://github.com/RaRe-Technologies/gensim/pull/2323))
* Fix typos in notebooks ([rsdel2007](https://github.com/rsdel2007), [#2322](https://github.com/RaRe-Technologies/gensim/pull/2322))
* Update `Doc2Vec` documentation: how tags are assigned in `corpus_file` mode ([persiyanov](https://github.com/persiyanov), [#2320](https://github.com/RaRe-Technologies/gensim/pull/2320))
* Fix typos in `gensim/models/keyedvectors.py` ([rsdel2007](https://github.com/rsdel2007), [#2290](https://github.com/RaRe-Technologies/gensim/pull/2290))
* Add documentation about ranges to scoring functions for `Phrases` ([jenishah](https://github.com/jenishah), [#2242](https://github.com/RaRe-Technologies/gensim/pull/2242))
* Update return sections for `KeyedVectors.evaluate_word_*` ([Stigjb](https://github.com/Stigjb), [#2205](https://github.com/RaRe-Technologies/gensim/pull/2205))
* Fix return type in `KeyedVectors.evaluate_word_analogies` ([Stigjb](https://github.com/Stigjb), [#2207](https://github.com/RaRe-Technologies/gensim/pull/2207))
* Fix `WmdSimilarity` documentation ([jagmoreira](https://github.com/jagmoreira), [#2217](https://github.com/RaRe-Technologies/gensim/pull/2217))
* Replace `fify -> fifty` in `gensim.parsing.preprocessing.STOPWORDS` ([coderwassananmol](https://github.com/coderwassananmol), [#2220](https://github.com/RaRe-Technologies/gensim/pull/2220))
* Remove `alpha="auto"` from `LdaMulticore` (not supported yet) ([johann-petrak](https://github.com/johann-petrak), [#2225](https://github.com/RaRe-Technologies/gensim/pull/2225))
* Update Adopters in README ([piskvorky](https://github.com/piskvorky), [#2234](https://github.com/RaRe-Technologies/gensim/pull/2234))
* Fix broken link in `tutorials.md` ([rsdel2007](https://github.com/rsdel2007), [#2302](https://github.com/RaRe-Technologies/gensim/pull/2302))
:warning: Deprecations (will be removed in the next major release)
* Remove
- `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
- `gensim.examples`
- `gensim.nosy`
- `gensim.scripts.word2vec_standalone`
- `gensim.scripts.make_wiki_lemma`
- `gensim.scripts.make_wiki_online`
- `gensim.scripts.make_wiki_online_lemma`
- `gensim.scripts.make_wiki_online_nodebug`
- `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
- "deprecated" functions and attributes
* Move
- `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
- `gensim.summarization` ➡ `gensim.models.summarization`
- `gensim.topic_coherence` ➡ `gensim.models._coherence`
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`