Gobbli

Latest version: v0.2.4


Page 1 of 2

0.2.3

Fix a bug when running interactive apps on Windows.

0.2.2

Fixes a bug that was causing errors in the classification report printed by experiment results.

0.2.1

Some minor fixes for the docs -- no functionality should be changed.

0.2.0

Major items in this release:

Python 3.8 support
We now run CI on both Python 3.7 and Python 3.8, making both versions officially supported. The only major change for the upgrade is that we now require the installed version of `ray` to be at least `0.8.4`.

Multilabel classification
Most models now transparently support true multilabel classification, where the output layer of the model reports a predicted probability for each label rather than a single predicted class. Simply pass a `List[List[str]]` in place of a `List[str]` whenever you're setting labels, where each inner `List[str]` is the list of labels that apply to one document. The model will infer the set of all labels from your data and generate a predicted probability for each label on all new data. This release also adds a benchmark dataset for the multilabel classification case: the [CMU Movie Summary dataset](http://www.cs.cmu.edu/~ark/personas/). The interactive apps should also work with multilabel data. Note that labels are delimited in CSV/TSV files using nested commas by default, but this can be changed using a command line argument.
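As a rough illustration of the label format described above (not gobbli's own implementation), the sketch below infers the label set from a `List[List[str]]` and builds one 0/1 indicator per label for each document. The helper names `infer_labels`, `to_indicators`, and `split_label_cell` are hypothetical and exist only for this example; the comma delimiter follows the default mentioned above.

```python
from typing import List


def infer_labels(y: List[List[str]]) -> List[str]:
    """Collect the sorted set of all labels seen across documents."""
    return sorted({label for doc_labels in y for label in doc_labels})


def to_indicators(y: List[List[str]], labels: List[str]) -> List[List[int]]:
    """Build one 0/1 indicator per label for every document."""
    return [
        [int(label in set(doc_labels)) for label in labels]
        for doc_labels in y
    ]


def split_label_cell(cell: str, delimiter: str = ",") -> List[str]:
    """Split one CSV/TSV label cell into labels, dropping empty entries."""
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


# Multiclass labels would be a List[str]; multilabel uses a List[List[str]].
# A document may carry several labels, exactly one, or none at all.
y_train = [["comedy", "romance"], ["drama"], [], ["comedy"]]
labels = infer_labels(y_train)
indicators = to_indicators(y_train, labels)
```

Each row of `indicators` lines up with `labels`, which is the shape a per-label probability output would be compared against.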

Backtranslation
Implemented a new data augmentation approach based on [transformers' implementation of the Marian Machine Translation model](https://huggingface.co/transformers/model_doc/marian.html). Pass a list of target languages, and the model will translate each document from English to each language and back to generate a list of texts which are similar but not exactly the same as the original.
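The round trip described above can be sketched as follows. This is a minimal illustration under stated assumptions, not gobbli's implementation: `backtranslate` and `toy_translate` are hypothetical names, and the toy word-lookup translator stands in for a real translation model such as Marian so the example runs without downloading any weights.

```python
from typing import Callable, Dict, List, Tuple

# A translator takes (text, source_language, target_language) and returns text.
Translator = Callable[[str, str, str], str]


def backtranslate(
    texts: List[str],
    target_languages: List[str],
    translate: Translator,
    source_language: str = "en",
) -> List[str]:
    """Translate each text to every target language and back again."""
    augmented = []
    for text in texts:
        for lang in target_languages:
            forward = translate(text, source_language, lang)
            augmented.append(translate(forward, lang, source_language))
    return augmented


# Hypothetical stand-in for a real model: a lossy word-level lookup table.
_TOY_TABLES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "fr"): {"big": "grand", "dog": "chien"},
    ("fr", "en"): {"grand": "large", "chien": "dog"},
}


def toy_translate(text: str, source: str, target: str) -> str:
    """Replace known words using the lookup table; keep unknown words as-is."""
    table = _TOY_TABLES[(source, target)]
    return " ".join(table.get(word, word) for word in text.split())


augmented = backtranslate(["the big dog"], ["fr"], toy_translate)
```

Because the round trip is lossy, the result ("the large dog") is close to the input without being identical, which is the property the augmentation relies on.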

Miscellaneous improvements

- Fix a potential error installing newer versions of sentencepiece (>=0.1.90).
- Fix an error installing an older version of gensim (<3.8.2).
- Fix errors running the interactive apps with tiny sample sizes (although you probably weren't trying to run them with 1 document... right?).
- Fix some encoding errors reading data in the `Transformer` and `SpaCyModel` models.
- Rework charting in benchmark output to prevent timeout errors during benchmarks.
- Upgraded the version of transformers in the `Transformer` model to 2.8.0, allowing for use of the [ELECTRA model](https://huggingface.co/transformers/model_doc/electra.html).

0.1.0

This is a large release including many new features:

New models

Implemented support for arbitrary scikit-learn models via `SKLearnClassifier` and TF-IDF as a baseline embedding approach via `TfidfEmbedder`. Implemented support for spaCy text categorizer models and spacy-transformers models via `SpaCyModel`. Upgraded `pytorch_transformers` v1.0.0 to `transformers` v2.4.1, which added support for several new models.
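For readers unfamiliar with TF-IDF as an embedding baseline, here is a small standalone sketch of the idea. It uses only the standard library and the plain, unsmoothed formula tf × log(N / df); it does not show `TfidfEmbedder` itself, whose weighting and tokenization may differ. The function name `tfidf_embed` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_embed(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency within the document times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf_embed(["good movie", "bad movie"])
```

Terms that appear in every document (here "movie") get a weight of zero, while terms that distinguish a document ("good", "bad") get a positive weight.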

Interactive apps

gobbli now comes bundled with a few [Streamlit](https://www.streamlit.io/) apps that can be used to explore datasets, evaluate gobbli model performance, and generate local explanations for gobbli model predictions. See [the docs](https://gobbli.readthedocs.io/en/latest/interactive_apps.html) for more information.

Overhauled benchmarks

Completely overhauled the benchmark framework. Benchmark output is now stored as Markdown files, which can much more easily be read on GitHub, and benchmarks can be selectively rerun when new models are added. Also developed a "benchmark" for embeddings, which plots the model embeddings in 2 dimensions and allows for a qualitative assessment of how well each model differentiates between the classes in the dataset. See [the benchmark output folder](https://github.com/RTIInternational/gobbli/tree/master/benchmark/benchmark_output).

Miscellaneous improvements

- Add new BERT weights from NCBI trained on PubMed data (`ncbi-bert-base-pubmed-uncased`, `ncbi-bert-base-pubmed-mimic-uncased`, `ncbi-bert-large-pubmed-uncased`, `ncbi-bert-large-pubmed-mimic-uncased`) (thanks pmbaumgartner!).
- Upgrade fastText to a more recent version which supports [autotuning parameters](https://gobbli.readthedocs.io/en/latest/auto/gobbli.model.fasttext.html#gobbli.model.fasttext.FastText.init).
- Add support for optional [gradient accumulation](https://gobbli.readthedocs.io/en/latest/auto/gobbli.model.transformer.html#gobbli.model.transformer.Transformer.init) in Transformer models, allowing for smaller batch sizes and larger models while retaining performance.
- Upgrade [USE implementation](https://gobbli.readthedocs.io/en/latest/auto/gobbli.model.use.html#gobbli.model.use.USE) to the TensorFlow 2.0 version and add support for multilingual weights (`universal-sentence-encoder-multilingual`, `universal-sentence-encoder-multilingual-large`).
- Add a couple of utilities for [inspecting and cleaning up disk usage](https://gobbli.readthedocs.io/en/latest/advanced_usage.html#housekeeping).
- Fix memory issues with USE model by batching input data.
- Fix potential encoding issues with non-ASCII text in USE model.
- Reuse static pretrained weights across instances of models instead of redownloading every time.
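The gradient accumulation item in the list above can be illustrated with a toy example. The sketch below is not the Transformer model's training code; it uses a hypothetical one-parameter linear model in plain Python to show why summing suitably weighted micro-batch gradients gives the same update as one large batch, while only a small batch has to be processed at a time.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pair


def mse_gradient(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(
    w: float, data: List[Example], micro_batch_size: int
) -> float:
    """Accumulate micro-batch gradients into one full-batch gradient."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        batch = data[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the data so the sum
        # equals the mean gradient over the whole batch.
        total += mse_gradient(w, batch) * len(batch) / len(data)
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_gradient(0.5, data)
accumulated = accumulated_gradient(0.5, data, micro_batch_size=2)
```

In a real training loop the optimizer step would be taken once per accumulated gradient rather than once per micro-batch.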

0.0.7

Adds requirements.txt (which is required for building some models) to the PyPI upload.

