Sadedegel

Latest version: v0.21.2


0.20.1

New Features
* Exceptional handling of `emoji`, `hashtag` and `mention` tokens by word tokenizers. Refer to `sadedegel config` for details.
* The same options are also available in `Text2Doc`, the text to sadedegel `Document` converter
* **[Incomplete]** `HashVectorizer` (Works far better than TfIdf or BM25 vectorization for majority of the prebuilt models)
* `unary` option for `idf`

Datasets
This release adds new datasets. Refer to [Dataset ReadMe](sadedegel/dataset/README.md) for details.
* Customer Review dataset
* Telco (Turkcell) Sentiment dataset
* Movie Sentiment dataset
* Hotel Sentiment dataset
* Categorized Product Sentiment dataset

Prebuilt Models
This release adds new prebuilt models. Refer to [Prebuilt Model ReadMe](sadedegel/prebuilt/README.md) for details.

* Turkish Movie Review Sentiment Classification
* Telco Brand Tweet Sentiment Classification
* Turkish Customer Reviews Classification

Others
* Lazy evaluation of the word `shape` property
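
Lazy evaluation of a property can be sketched as follows. This is a minimal illustration only: the `Word` class, its `shape_calls` counter and the shape rules (`X` for uppercase, `x` for lowercase, `d` for digits) are assumptions made for the example, not sadedegel's actual implementation.

```python
from functools import cached_property


class Word:
    """Hypothetical stand-in for a token; not sadedegel's actual class."""

    def __init__(self, text: str):
        self.text = text
        self.shape_calls = 0  # counts how often the shape is really computed

    @cached_property
    def shape(self) -> str:
        # Runs on first access only; the result is then cached on the instance.
        self.shape_calls += 1
        out = []
        for ch in self.text:
            if ch.isupper():
                out.append("X")
            elif ch.islower():
                out.append("x")
            elif ch.isdigit():
                out.append("d")
            else:
                out.append(ch)
        return "".join(out)


w = Word("Ankara2020")
print(w.shape_calls)     # shape not computed at construction time
print(w.shape, w.shape)  # second access reuses the cached value
print(w.shape_calls)
```

Words whose `shape` is never read pay no cost for it, which is the point of deferring the computation.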

Behavioural Changes

* Significant behavior change on the `tokens` property. Previously the property returned `List[str]`; it now returns `List[Token]`
* Sentence Tokenizer is renamed to Sentence Boundary Detector to prevent confusion with Word Tokenizer
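
Code that relied on the old `List[str]` return value may need an explicit conversion. The sketch below shows one way to do that; the `Token` dataclass and its `word` attribute are hypothetical stand-ins, so check the real `Token` class for the attribute that holds the surface string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in; `word` is an assumed attribute name."""

    word: str

    def __str__(self) -> str:
        return self.word


def token_strings(tokens: List[Token]) -> List[str]:
    # Convert token objects back to plain strings for code
    # written against the old List[str] behaviour.
    return [str(t) for t in tokens]


tokens = [Token("Merhaba"), Token("dünya")]
print(token_strings(tokens))
```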



Contribution

* Welcome our new contributor [ertugruldemir](https://github.com/ertugrul-dmr)

0.19.1

Sadedegel is no longer only "An extraction based Turkish news summarizer" but "A General Purpose NLP library for Turkish".
News

* We have added the icu tokenizer as the default word tokenizer, which is [very fast](sadedegel/bblock/TOKENIZER.md) and [accurate](sadedegel/bblock/TOKENIZER.md).
* We have made BERT an optional dependency, which can be installed using `pip install sadedegel[bert]`
* Word embeddings are introduced. To retrain use `pip install sadedegel[w2v]`
* By making those dependencies optional, sadedegel installation is now much faster than before
  * `pip install sadedegel` takes 3 minutes at 40 Mbps for **version 0.18**
  * `pip install sadedegel` takes 40 seconds at 40 Mbps for **version 0.19**
* Vocabulary files are now in hdf5 format. `bert`, `icu` and `simple` have their own vocabulary files.
  * Only the `icu` vocabulary file includes word embeddings.
* Relax dependencies (less strict module version coupling)
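
A common pattern for optional dependencies is to import them only when the feature is used and to fail with an install hint otherwise. The sketch below illustrates that pattern; `require_extra` is a hypothetical helper, not part of sadedegel, and the modules that each extra actually pulls in are not stated in these notes.

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Import an optional module, or raise an error with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install sadedegel[{extra}]"
        ) from exc


# Demonstration with a standard-library module that is always present.
json_module = require_extra("json", "bert")
print(json_module.dumps({"ok": True}))
```

Deferring the import this way keeps the base installation small while still giving a clear message when an extra is missing.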


Feature Drop & Deprecation

Others

* Pre-trained models under `prebuilt` are refreshed
  * They now use `icu` tokenizer
  * They now return class probabilities for predictions

0.18.2

0.18

News

* Our main contributor dafajon has implemented a new BM25Summarizer, similar to the TfIdf summarizer. BM25Summarizer performs slightly better on short summaries.
* We have packaged two new prebuilt models (refer to [README](sadedegel/prebuilt/README.md) for model accuracies)
  1. Tweet profanity classification (`sadedegel.prebuilt.tweet_profanity`)
  2. Tweet sentiment classification (`sadedegel.prebuilt.tweet_sentiment`)

* We changed the way we report summarizer performance. Instead of a grid search over summarizer options, we now use a RandomSearch to decide the optimal summarizer and parameters. Refer to [README](sadedegel/summarize/README.md) for details.
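
The difference from a grid search is that a random search samples a fixed number of configurations instead of enumerating all of them. A minimal sketch of the idea follows; the search space, the parameter names and `toy_score` are invented for illustration and do not reflect sadedegel's evaluation code or its real scoring metric.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical search space; names and values are illustrative only.
SPACE: Dict[str, List] = {
    "summarizer": ["tfidf", "bm25"],
    "k1": [1.0, 1.5, 2.0],
}


def toy_score(params: Dict) -> float:
    # Stand-in for a real summarizer evaluation score.
    base = 1.0 if params["summarizer"] == "bm25" else 0.8
    return base - 0.1 * abs(params["k1"] - 1.5)


def random_search(
    space: Dict[str, List],
    score: Callable[[Dict], float],
    n_trials: int,
    seed: int = 42,
) -> Tuple[Dict, float]:
    """Sample n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


best_params, best_score = random_search(SPACE, toy_score, n_trials=10)
print(best_params, best_score)
```

The trial budget stays fixed as more options are added to the space, whereas a grid grows multiplicatively with every new parameter.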

Feature Drop & Deprecation

* `sents` property on `Doc` is dropped. Use `__iter__(Doc)` instead.
* `tf` property on `Doc` is deprecated (will be dropped by 0.18) in favor of `get_tf` function which gives a more flexible way to access document level tf vectors.
* `tfidf` function on `Doc` is deprecated (will be dropped by 0.18) in favor of `get_tfidf` function which gives a more flexible way to access document level tf-idf vectors.
* `lexrank` external dependency is dropped and `LexRankPureSummarizer` is renamed to be `LexRankSummarizer`
* `set_config`, `get_config`, `describe_config` and `get_all_configs` are dropped in favor of new configuration implementation.
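
Deprecating a property in favor of a more flexible function usually follows the pattern sketched below. The `Doc` class here is a hypothetical stand-in, and the `binary` parameter is an assumption used only to show what "more flexible" could mean; the real `get_tf` signature may differ.

```python
import warnings
from collections import Counter
from typing import Dict, List


class Doc:
    """Hypothetical stand-in, not sadedegel's actual `Doc`."""

    def __init__(self, tokens: List[str]):
        self._tokens = tokens

    def get_tf(self, binary: bool = False) -> Dict[str, int]:
        # `binary` is an assumed option: report presence instead of counts.
        counts = Counter(self._tokens)
        if binary:
            return {term: 1 for term in counts}
        return dict(counts)

    @property
    def tf(self) -> Dict[str, int]:
        # Old access path: still works, but warns callers to migrate.
        warnings.warn(
            "`tf` is deprecated; use `get_tf()` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_tf()


doc = Doc(["hava", "güzel", "hava"])
print(doc.get_tf())
print(doc.get_tf(binary=True))
```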

Others

* `tf` property is now a part of `TfImpl` class using default configuration settings to yield a tf vector for a `Doc` or `Sentence`
* We've updated [documentation](sadedegel/dataset/README.md) for our datasets.
* `idf` property is now a part of `IdfImp` class using default configuration settings to yield an idf vector for a `Doc` or `Sentence`
* More default parameters in `default.ini` based on our summarizer performance.

0.17.1.1

0.17

News

* Starting with this release, sadedegel now ships prebuilt models for various basic NLP tasks. The purpose is to allow developers to load & use those models with minimal configuration.
* Our first model is a news classifier (Thanks [Taner Sezer](https://github.com/tanerim) for his corpus support)
* We [report](sadedegel/bblock/TOKENIZER.md) accuracy of our tokenizers (word) for potential enhancement points in future releases (Thanks [Taner Sezer](https://github.com/tanerim) for his corpus support)
* To support the development of prebuilt models, a `sklearn` compatible `extension.sklearn` module is introduced for feature engineering
* `Token.is_stopword` is added to flag stopword token types.
* `LexRankSummarizer` (based on the lexrank external module, to be deprecated in future releases) and `LexRankPureSummarizer` (pure sadedegel version of the same method) are added to the set of extractive summarizers.


Feature Drop & Deprecation

* `sents` property on `Doc` is dropped. Use `__iter__(Doc)` instead.
* `tf` property on `Doc` is deprecated (will be dropped by 0.18) in favor of `get_tf` function which gives a more flexible way to access document level tf vectors.
* `tfidf` function on `Doc` is deprecated (will be dropped by 0.18) in favor of `get_tfidf` function which gives a more flexible way to access document level tf-idf vectors.

Others

* We have pushed up TF and IDF implementations from Sentence and Doc to separate classes, using Python's multiple inheritance support to reduce code duplication.
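
The refactoring described above can be sketched with mixin classes. Everything in this example is illustrative: the class names `TfMixin` and `IdfMixin`, the toy document-frequency table and the smoothed idf formula are assumptions, not sadedegel's actual code.

```python
from collections import Counter
from math import log
from typing import Dict, List


class TfMixin:
    """Term-frequency behaviour for any class that exposes `tokens`."""

    def get_tf(self) -> Dict[str, int]:
        return dict(Counter(self.tokens))


class IdfMixin:
    """Inverse-document-frequency behaviour; statistics are toy values."""

    n_docs = 4
    doc_freq = {"hava": 2, "güzel": 1}

    def get_idf(self) -> Dict[str, float]:
        # Smoothed idf; the exact formula is an illustrative choice.
        return {
            term: log((1 + self.n_docs) / (1 + self.doc_freq.get(term, 0)))
            for term in set(self.tokens)
        }


class Sentence(TfMixin, IdfMixin):
    def __init__(self, tokens: List[str]):
        self.tokens = tokens


class Doc(TfMixin, IdfMixin):
    def __init__(self, sentences: List[Sentence]):
        self.sentences = sentences

    @property
    def tokens(self) -> List[str]:
        # A document's tokens are the concatenation of its sentences' tokens.
        return [t for s in self.sentences for t in s.tokens]


doc = Doc([Sentence(["hava", "güzel", "hava"]), Sentence(["bugün"])])
print(doc.get_tf())
print(doc.get_idf())
```

Both `Sentence` and `Doc` get the same TF and IDF logic from one place, and each only has to say where its tokens come from.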
