Flair

Latest version: v0.13.1

Page 5 of 6

0.4.2

New way of loading data (768)

The data loading part has been completely refactored to enable streaming data loading from disk using PyTorch's DataLoaders. Training no longer requires the full dataset to be kept in memory, which allows us to train models over much larger datasets. This version also changes the syntax for loading datasets.

Old way (now deprecated):
```python
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
```


New way:
```python
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()
```


To use streaming loading, i.e. to avoid loading the full dataset into memory, pass `in_memory=False`:
```python
import flair.datasets
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
```
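
The trade-off behind `in_memory` can be illustrated without Flair. The sketch below is not Flair's implementation; `LazySentenceDataset` and `write_demo_corpus` are hypothetical names. It indexes the byte offset of each line once and then reads a single sentence per access, so the file contents never have to be held in memory as a whole.

```python
import os
import tempfile


class LazySentenceDataset:
    """Hypothetical stand-in: reads one sentence per line from disk on demand."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # byte offset of the start of each line
        with open(path, "rb") as handle:
            offset = 0
            for line in handle:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Only the requested line is read from disk.
        with open(self.path, "rb") as handle:
            handle.seek(self.offsets[index])
            return handle.readline().decode("utf-8").rstrip("\r\n")


def write_demo_corpus():
    """Write a tiny three-sentence corpus to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    )
    handle.write("The grass is green .\nDort kaufte er einen Hut .\nGeorge went home .\n")
    handle.close()
    return handle.name


demo_path = write_demo_corpus()
dataset = LazySentenceDataset(demo_path)
second_sentence = dataset[1]  # read before the temporary file is deleted
os.remove(demo_path)
```

The cost of this design is one disk read per access, which is why keeping data in memory remains the faster choice for corpora that fit.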


Embeddings

Flair embeddings (614)

This release brings Flair embeddings to 11 new languages (thanks stefan-it!): Arabic (ar), Danish (da), Persian (fa), Finnish (fi), Hebrew (he), Hindi (hi), Croatian (hr), Indonesian (id), Italian (it), Norwegian (no) and Swedish (sv). It also improves support for Bulgarian (bg), Czech, Basque (eu), Dutch (nl) and Slovenian (sl), and adds special language models for historical German. Load them with the language code, for example:

```python
from flair.embeddings import FlairEmbeddings

# load Flair embeddings for Italian
embeddings = FlairEmbeddings('it-forward')
```


One-hot encoded embeddings (747)

Some classification baselines work astonishingly well with simple learnable word embeddings. To support testing these baselines, we've added learnable word embeddings that start from a one-hot encoding of words. To initialize them, you need to pass a corpus, which is used to build the vocabulary.

```python
import flair.datasets
from flair.embeddings import OneHotEmbeddings

# load corpus
corpus = flair.datasets.UD_ENGLISH()

# init learnable word embeddings with corpus
embeddings = OneHotEmbeddings(corpus)
```
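
To see why a corpus is needed at construction time, here is a minimal sketch in plain Python. It is an illustration under assumptions, not Flair's code: `build_vocabulary`, `one_hot` and the `<unk>` token are hypothetical. Each distinct word receives an index, and the one-hot vector for a word has a single 1 at that index; a learnable embedding then amounts to looking up one trainable row per index.

```python
def build_vocabulary(sentences, unknown_token="<unk>"):
    """Map every distinct word in the corpus to an integer index."""
    vocabulary = {unknown_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            # setdefault only assigns a new index the first time a word is seen
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def one_hot(word, vocabulary, unknown_token="<unk>"):
    """Return the one-hot vector for a word; unseen words use the unknown slot."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.get(word, vocabulary[unknown_token])] = 1
    return vector


toy_corpus = ["the grass is green", "the sky is blue"]
vocabulary = build_vocabulary(toy_corpus)
grass_vector = one_hot("grass", vocabulary)
unseen_vector = one_hot("purple", vocabulary)
```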


More options in `DocumentPoolEmbeddings` (747)

We now allow users to specify a fine-tuning option that is applied before the pooling operation in document pool embeddings. The options are 'none' (no fine-tuning), 'linear' (linear remapping of word embeddings) and 'nonlinear' (nonlinear remapping of word embeddings). 'nonlinear' should be used together with `WordEmbeddings`, while 'none' should be used with `OneHotEmbeddings` (fine-tuning is not necessary there, since these embeddings are already learnt on the data). So, to replicate FastText classification you can either do:

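```python
from flair.embeddings import DocumentPoolEmbeddings, OneHotEmbeddings

# instantiate one-hot encoded word embeddings (corpus loaded as shown earlier)
embeddings = OneHotEmbeddings(corpus)

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
```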


or

```python
from flair.embeddings import DocumentPoolEmbeddings, WordEmbeddings

# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
```


OpenAI GPT Embeddings (624)

We now support embeddings from the OpenAI GPT model. We use the excellent pytorch-pretrained-BERT library to download the GPT model, tokenize the input and extract embeddings from the subtokens.

Initialize with:

```python
from flair.embeddings import OpenAIGPTEmbeddings

embeddings = OpenAIGPTEmbeddings()
```

Portuguese embeddings from NILC (576)

Extensibility to new downstream tasks (681)

Previously, we had the `SequenceTagger` and `TextClassifier` as the two downstream tasks supported by Flair. The `ModelTrainer` had specific methods to train these two models, making it difficult for users to add new types of tasks (such as text regression) to Flair.

This release refactors the `flair.nn.Model` and `ModelTrainer` functionality to make it uniform across tagging models and to enable users to add new tasks to Flair. By implementing the 5 methods in the `flair.nn.Model` interface, a custom model immediately becomes trainable with the `ModelTrainer`. Three types of downstream tasks currently implement this interface:

- the `SequenceTagger`,
- the `TextClassifier`,
- and the beta `TextRegressor`.

The code refactor removes a lot of code redundancies and slims down the interfaces of the downstream tasks classes. As the sole breaking change, it removes the `load_from_file()` methods, which are now part of the `load()` method, i.e. if previously you loaded a self-trained model like this:

```python
from flair.models import SequenceTagger
tagger = SequenceTagger.load_from_file('/path/to/model.pt')
```


You now do it like this:

```python
from flair.models import SequenceTagger
tagger = SequenceTagger.load('/path/to/model.pt')
```


New features

- New beta support for text regression (564)
- Return confidence scores for single-label classification (664)
- Add method to find probability for each class in case of multi-class classification (693)
- Capability to change threshold during multi-label classification (707)
- Support for customized ELMo embeddings (661)
- Detect multi-label problems automatically: Previously, users always had to specify whether their text classification problem was multi_label or not. Now, this is detected automatically if users do not specify. So now you can init like this:

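```python
from flair.datasets import TREC_6
from flair.models import TextClassifier

# corpus
corpus = TREC_6()

# make label_dictionary
label_dictionary = corpus.make_label_dictionary()

# init text classifier (document_embeddings is any document embedding, e.g. DocumentPoolEmbeddings)
classifier = TextClassifier(document_embeddings, label_dictionary)
```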


- We added better module descriptions to embeddings and dropout so that more parameters get printed by default for models for better logging. (747)
- Make 'cache_root' a global variable so that different directories can be chosen for caching (667)
- Both string and Token objects can now be passed to the add_token method (628)
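
Two items in the list above, automatic multi-label detection and the adjustable threshold, can be sketched in a few lines of plain Python. The rule shown here (a task counts as multi-label if any document carries more than one label) is an assumption about how such detection could work, and `is_multi_label` and `predict_labels` are hypothetical helpers rather than Flair functions.

```python
def is_multi_label(label_lists):
    """Treat the task as multi-label if any document has more than one label."""
    return any(len(labels) > 1 for labels in label_lists)


def predict_labels(scores, threshold=0.5):
    """Keep every class whose score exceeds the threshold (multi-label decision rule)."""
    return sorted(label for label, score in scores.items() if score > threshold)


single_label_data = [["LOC"], ["HUM"], ["NUM"]]
multi_label_data = [["sports"], ["politics", "economy"]]

class_scores = {"sports": 0.9, "politics": 0.4, "economy": 0.6}
strict = predict_labels(class_scores, threshold=0.5)
lenient = predict_labels(class_scores, threshold=0.3)
```

Lowering the threshold trades precision for recall: more classes pass the cut, including weaker ones.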

New datasets
- Added IMDB classification corpus to `flair.datasets` (749)
- Added TREC_6 classification corpus to `flair.datasets` (749)
- Added 20 newsgroups classification corpus to `flair.datasets` (NEWSGROUPS object)
- WASSA-17 emotion intensity text regression tasks (695)

Bug fixes

- We normalized the training loss across modules so that train / test loss are consistent. (670)
- Permission error on Windows preventing model download (557)
- Handling of empty sentences (566 758)
- Fix text generation on CUDA (666)
- others ...

0.4.1

Updated documentation (https://github.com/zalandoresearch/flair/issues/138, https://github.com/zalandoresearch/flair/issues/89)
Expanded documentation and tutorials.

0.4.0

Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.

New Features

Support for new languages

Flair embeddings
We now include new language models for:
* [Swedish](https://github.com/zalandoresearch/flair/issues/3)
* [Polish](https://github.com/zalandoresearch/flair/issues/187)
* [Bulgarian](https://github.com/zalandoresearch/flair/issues/188)
* [Slovenian](https://github.com/zalandoresearch/flair/issues/202)
* [Dutch](https://github.com/zalandoresearch/flair/issues/224)

These are in addition to the existing models for English and German. You can load FlairEmbeddings for Dutch, for instance, with:

```python
from flair.embeddings import FlairEmbeddings
flair_embeddings = FlairEmbeddings('dutch-forward')
```


Word Embeddings

We now include pre-trained [FastText Embeddings for 30 languages](https://github.com/zalandoresearch/flair/issues/234): English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.

Each language has embeddings trained over Wikipedia or web crawls. Instantiate them with:

```python
from flair.embeddings import WordEmbeddings

# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')

# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')
```


Named Entity Recognition

Thanks to the Flair community, we now include NER models for:
* [French](https://github.com/zalandoresearch/flair/issues/238)
* [Dutch](https://github.com/zalandoresearch/flair/issues/224)

These are in addition to the previous models for English and German.

Part-of-Speech Tagging

Thanks to the Flair community, we now include PoS models for:
* [German tweets](https://github.com/zalandoresearch/flair/issues/51)


Multilingual models

As a major new feature, we now include models that can tag text in various languages.

12-language Part-of-Speech Tagging

We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('pos-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')

# predict PoS tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
```


4-language Named Entity Recognition

We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('ner-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
```


This model also works to some degree on other languages, such as French.

Pre-trained classification models ([issue 70](https://github.com/zalandoresearch/flair/issues/70))

Flair now also includes two pre-trained classification models:
* de-offensive-language: detecting offensive language in German text ([GermEval 2018 Task 1](https://projects.fzai.h-da.de/iggsa/projekt/))
* en-sentiment: detecting positive and negative sentiment in English text ([IMDB](http://ai.stanford.edu/~amaas/data/sentiment/))

Simply load the `TextClassifier` using the preferred model, such as
```python
from flair.models import TextClassifier
classifier = TextClassifier.load('en-sentiment')
```


BERT and ELMo embeddings

We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.

BERT Embeddings ([issue 251](https://github.com/zalandoresearch/flair/issues/251))

We added [BERT embeddings](https://arxiv.org/pdf/1810.04805.pdf) to Flair. We are using the implementation of [huggingface](https://github.com/huggingface/pytorch-pretrained-BERT). The embeddings can be used as any other embedding type in Flair:

```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```


ELMo Embeddings ([issue 260](https://github.com/zalandoresearch/flair/issues/260))

Flair now also includes [ELMo embeddings](http://www.aclweb.org/anthology/N18-1202). We use the implementation of [AllenNLP](https://allennlp.org/elmo). As this implementation comes with a lot of sub-dependencies, you need to first install the library via `pip install allennlp` before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:
```python
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init embedding
embedding = ELMoEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```



Multi-Dataset Training ([issue 232](https://github.com/zalandoresearch/flair/issues/232))

You can now train a model on multiple datasets with the `MultiCorpus` object. We use this to train our multilingual models.

Just create multiple corpora and put them into `MultiCorpus`:

```python
from flair.data import MultiCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)

multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
```

The `multi_corpus` can now be used for training, just as any other corpus before. Check [the tutorial](TUTORIAL_6_TRAINING_A_MODEL.md) for more details.

Parameter Selection using Hyperopt ([issue 242](https://github.com/zalandoresearch/flair/issues/242))

We built a wrapper around [hyperopt](http://hyperopt.github.io/hyperopt/) to allow you to search for the best hyperparameters for your downstream task.

Define your search space and start training using several different parameter settings. The results are written to a specific file called `param_selection.txt` in the result directory. Check [the tutorial](TUTORIAL_7_HYPER_PARAMETER.md) for more details.

NLP Dataset Downloader ([issue 243](https://github.com/zalandoresearch/flair/issues/243))

To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:

```python
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
```


you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in [the tutorial](TUTORIAL_5_CORPUS.md).


Model training features

We added various other features to model training.

Saving training log ([issue 212](https://github.com/zalandoresearch/flair/issues/212))

The training log output will from now on be automatically saved in the result directory you provide for training.
The log will be saved in `training.log`.

Resuming training ([issue 217](https://github.com/zalandoresearch/flair/issues/217))

It is now possible to stop training at any point in time and to resume it later by training with `checkpoint` set to `True`. Check [the tutorial](TUTORIAL_6_TRAINING_A_MODEL.md) for more details.
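
The idea behind resumable training can be sketched independently of Flair. In the toy loop below, progress is written to disk after every epoch and picked up again on the next call. The `train` function, the JSON checkpoint format and the placeholder loss are all assumptions made for illustration; Flair's own checkpoint mechanism is described in the tutorial.

```python
import json
import os
import tempfile


def train(total_epochs, checkpoint_path, stop_after=None):
    """Toy training loop that saves its progress after every epoch."""
    state = {"epoch": 0, "loss_history": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as handle:
            state = json.load(handle)  # resume from the last saved epoch

    while state["epoch"] < total_epochs:
        if stop_after is not None and state["epoch"] >= stop_after:
            break  # simulate an interruption
        state["epoch"] += 1
        state["loss_history"].append(1.0 / state["epoch"])  # placeholder loss
        with open(checkpoint_path, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
    return state


checkpoint_file = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
interrupted = train(total_epochs=5, checkpoint_path=checkpoint_file, stop_after=2)
resumed = train(total_epochs=5, checkpoint_path=checkpoint_file)
```

The second call does not repeat the first two epochs: it loads the saved state and continues from epoch 3.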

Custom Optimizers ([issue 220](https://github.com/zalandoresearch/flair/issues/220))

You can now choose optimizers other than SGD: any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.
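
The "W" variants are generally understood to decouple weight decay from the gradient update. The sketch below contrasts the two on a single scalar weight with momentum. It shows one common formulation (decay scaled by the learning rate) and is an assumption, not a copy of Flair's SGDW code; the function names are hypothetical.

```python
def sgd_l2_step(weight, velocity, gradient, lr, momentum, weight_decay):
    """Classic SGD with momentum: the decay is added to the gradient as an L2 penalty."""
    gradient = gradient + weight_decay * weight
    velocity = momentum * velocity + lr * gradient
    return weight - velocity, velocity


def sgdw_step(weight, velocity, gradient, lr, momentum, weight_decay):
    """Decoupled variant: the decay shrinks the weight directly and bypasses the momentum buffer."""
    velocity = momentum * velocity + lr * gradient
    return weight - velocity - lr * weight_decay * weight, velocity


# Two steps with a zero task gradient, so only the decay term acts on the weight.
l2_weight, l2_velocity = 1.0, 0.0
w_weight, w_velocity = 1.0, 0.0
for _ in range(2):
    l2_weight, l2_velocity = sgd_l2_step(l2_weight, l2_velocity, 0.0, 0.1, 0.9, 0.5)
    w_weight, w_velocity = sgdw_step(w_weight, w_velocity, 0.0, 0.1, 0.9, 0.5)
```

With the L2 formulation the decay accumulates in the momentum buffer and compounds over steps; in the decoupled form the buffer only ever sees the task gradient.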

Learning Rate Finder ([issue 228](https://github.com/zalandoresearch/flair/issues/228))

A new helper method to assist you in finding a [good learning rate for model training](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_8_MODEL_OPTIMIZATION.md#finding-the-best-learning-rate).
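
A learning rate finder typically sweeps the learning rate upwards while recording the loss, then suggests a rate from the region where the loss falls fastest. The following toy version runs such a sweep on f(w) = w². It is a sketch of the general technique under that assumption, not Flair's helper; `learning_rate_sweep` and `steepest_drop` are hypothetical names.

```python
def learning_rate_sweep(start_lr=1e-4, end_lr=10.0, steps=50, divergence=4.0):
    """Take one gradient step on f(w) = w**2 per learning rate, growing the rate exponentially."""
    growth = (end_lr / start_lr) ** (1.0 / (steps - 1))
    weight, lr, history = 1.0, start_lr, []
    for _ in range(steps):
        weight -= lr * 2.0 * weight  # the gradient of w**2 is 2w
        loss = weight ** 2
        history.append((lr, loss))
        if loss > divergence:  # stop once the loss has clearly blown up
            break
        lr *= growth
    return history


def steepest_drop(history):
    """Return the learning rate at which the loss fell the most between consecutive steps."""
    best_lr, best_drop = history[0][0], 0.0
    for (_, previous_loss), (lr, loss) in zip(history, history[1:]):
        if previous_loss - loss > best_drop:
            best_lr, best_drop = lr, previous_loss - loss
    return best_lr


history = learning_rate_sweep()
suggested_lr = steepest_drop(history)
```

In practice the recorded curve is usually plotted and inspected by eye rather than reduced to a single number.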


Breaking Changes

This release introduces breaking changes. The most important are:

Unified Model Trainer ([issue 189](https://github.com/zalandoresearch/flair/issues/189))

Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely `ModelTrainer`. This replaces the earlier classes `SequenceTaggerTrainer` and `TextClassifierTrainer`.

Downstream task models now implement the new `flair.nn.Model` interface. So, both the `SequenceTagger` and `TextClassifier` now inherit from `flair.nn.Model`. This allows both models to be trained with the `ModelTrainer`, like this:

```python
from flair.models import SequenceTagger, TextClassifier
from flair.trainers import ModelTrainer

# Training sequence tagger
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')

# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')
```


The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes them easier to understand.

Metric class

The metric class is now refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum `EvaluationMetric` which you can pass to the ModelTrainer to tell it what to use for evaluation.
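
The difference between micro and macro averaging is easiest to see on per-class counts. The sketch below is plain Python, not the `Metric` class itself; the function names and the toy counts are assumptions. Micro averaging pools the counts before computing F1, while macro averaging computes F1 per class and then takes the mean.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(counts):
    """Pool the counts of all classes, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per class, then average so that every class weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# class -> (true positives, false positives, false negatives); toy numbers
counts = {"PER": (8, 2, 2), "LOC": (1, 1, 3)}
micro = micro_f1(counts)
macro = macro_f1(counts)
```

Here the frequent, well-predicted class dominates the micro average, while the macro average is pulled down by the weak rare class.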

Updates and Bug Fixes

Torch 1.0 ([issue 299](https://github.com/zalandoresearch/flair/issues/299))

Flair now builds on torch 1.0.

Use Pathlib ([issue 176](https://github.com/zalandoresearch/flair/issues/176))

Flair now uses `Path` wherever possible to allow easier operations on files/directories. However, our interfaces still allow you to pass a string, which Flair will then transform into a `Path`.
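
The pattern of accepting either type looks roughly like this. It is a minimal sketch: `resolve_model_path` is a hypothetical helper and the file name is only illustrative.

```python
from pathlib import Path
from typing import Union


def resolve_model_path(base_path: Union[str, Path]) -> Path:
    """Accept a string or a Path and always work with a Path internally."""
    if isinstance(base_path, str):
        base_path = Path(base_path)
    return base_path / "final-model.pt"  # illustrative file name


from_string = resolve_model_path("results")
from_path = resolve_model_path(Path("results"))
```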

Bug Fixes

* Fix: Non-whitespaced tokenized text results in an infinite loop ([issue 226](https://github.com/zalandoresearch/flair/issues/226))
* Fix: Getting IndexError: list index out of range error ([issue 233](https://github.com/zalandoresearch/flair/issues/233))
* Do not reset cache directory always to None ([issue 249](https://github.com/zalandoresearch/flair/issues/249))
* Filter sentences with zero tokens ([issue 266](https://github.com/zalandoresearch/flair/issues/266))

0.3.2

This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.

New Features

Embeddings

More word embeddings (194 )

We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'), load using the two-letter language code, like this:

```python
from flair.embeddings import WordEmbeddings
french_embedding = WordEmbeddings('fr')
```


More character LM embeddings (204 187 )

Thanks to contribution by [stefan-it](https://github.com/stefan-it/flair-lms), we added CharLMEmbeddings for Bulgarian and Slovenian. Load like this:

```python
from flair.embeddings import CharLMEmbeddings
flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
```


Custom embeddings (170 )

Add explanation on how to use your own custom word embeddings. Simply convert to gensim.KeyedVectors and point embedding class there:

```python
from flair.embeddings import WordEmbeddings
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
```


New embeddings type: `DocumentPoolEmbeddings` (191 )

Add a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:

```python
from flair.embeddings import DocumentPoolEmbeddings, WordEmbeddings
word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
```
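
What the pooling options compute can be shown on plain lists. The sketch below is not Flair's implementation: the `pool` function is hypothetical, and the mode name `'mean'` for averaging is an assumption. It reduces a list of word vectors to one document vector, dimension by dimension.

```python
def pool(word_vectors, mode="mean"):
    """Reduce a list of equal-length word vectors to one vector, dimension by dimension."""
    columns = list(zip(*word_vectors))  # one tuple per embedding dimension
    if mode == "mean":
        return [sum(column) / len(column) for column in columns]
    if mode == "min":
        return [min(column) for column in columns]
    if mode == "max":
        return [max(column) for column in columns]
    raise ValueError("unknown pooling mode: {}".format(mode))


toy_vectors = [[1.0, 4.0], [3.0, 2.0]]  # two words, two dimensions
mean_vector = pool(toy_vectors, "mean")
min_vector = pool(toy_vectors, "min")
max_vector = pool(toy_vectors, "max")
```

All three options yield a document vector of the same size as a word vector, regardless of document length.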


Language model

New method: `generate_text()` (167 )

The `LanguageModel` class now has an in-built `generate_text()` method to sample the LM. Run code like this:

```python
from flair.models import LanguageModel

# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20000 characters
text = model.generate_text(20000)
print(text)
```


Metrics

Class-based metrics in `Metric` class (164 )

Refactored Metric class to provide class-based metrics, as well as micro and macro averaged F1 scores.

Bug Fixes

Fix serialization error for MacOS and Windows (174 )

On these setups, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so it works on those systems. Added bonus is that models are smaller now.

"Frozen" dropout (184 )

Potentially big issue in which dropout was frozen in the first epoch in embeddings produced from the character LM, meaning that throughout training the same dimensions stayed dropped. Fixed this.

Testing step in language model trainer (178 )

Previously, the language model was never applied to test data during training. A final testing step has been added in (again).

Testing

Distinguish between unit and integration tests (183)

Instructions on how to run tests with pipenv (161 )

Optimizations

Disable autograd during testing and prediction (175)

Since autograd is unused here this gives us minor speedups.

0.3.1

This is a stability update over release 0.3.0 with small optimizations, refactorings and bug fixes. For a list of new features, refer to 0.3.0.

Optimizations

Retain Token embeddings in memory by default (146 )

Allow for faster training of text classifier on large datasets by keeping token embeddings in memory.

Always clear embeddings after prediction (149 )

After prediction, remove embeddings from memory to avoid filling up memory.


Refactorings

Aligned TextClassifierTrainer and SequenceTaggerTrainer (148 )

Align signatures and features of the two training classes to make it easier to understand training options.

Updated DocumentLSTMEmbeddings (150 )

Remove unused flag and code from DocumentLSTMEmbeddings

Removed unneeded AWS and Jinja2 dependencies (158 )

Some dependencies are no longer required.


Bug Fixes

Fixed error when predicting over empty sentences. (157)

Serialization: reset cache settings when saving a model. (153 )

0.3.0

Breaking Changes

New `Label` class with confidence score (https://github.com/zalandoresearch/flair/issues/38)

A tag prediction is not a simple string anymore but a `Label`, which holds a value and a confidence score.
To obtain the tag name you need to call `tag.value`. To get the score call `tag.score`. This can help you build
applications in which you only want to use predictions that lie above a specific confidence threshold.
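
Filtering by confidence could look like the sketch below. The `Label` dataclass here is a hypothetical stand-in that mirrors only the `value` and `score` attributes described above; the threshold and the sample predictions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical stand-in: a predicted value plus a confidence score."""
    value: str
    score: float


def confident(labels, threshold=0.8):
    """Keep only predictions whose confidence reaches the threshold."""
    return [label for label in labels if label.score >= threshold]


predictions = [Label("PER", 0.97), Label("LOC", 0.55), Label("ORG", 0.81)]
kept_values = [label.value for label in confident(predictions)]
```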

`LockedDropout` moved to the new `flair.nn` module (https://github.com/zalandoresearch/flair/issues/48)


New Features

Multi-token spans (https://github.com/zalandoresearch/flair/issues/54, https://github.com/zalandoresearch/flair/issues/97)
Entities can now be wrapped into multi-token spans (type: `Span`). This is helpful for entities that span multiple words, such as "George Washington". A `Span` contains the position of the entity in the original text, the tag, a confidence score, and its text. You can get spans from a sentence by using the `get_spans()` method, like so:
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load and run NER
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

# get span entities, together with tag and confidence score
for entity in sentence.get_spans('ner'):
    print('{} {} {}'.format(entity.text, entity.tag, entity.score))
```


Predictions with confidence score (https://github.com/zalandoresearch/flair/issues/38)
Predicted tags are no longer simple strings, but objects of type `Label` that contain a value and a confidence score. These scores are extracted during prediction from the sequence tagger or text classifier and indicate how confident the model is of a prediction. Print confidence scores of tags like this:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load the POS tagger
tagger = SequenceTagger.load('pos')

# run POS over sentence
tagger.predict(sentence)

# print token, predicted POS tag and confidence score
for token in sentence:
    print('{} {} {}'.format(token.text, token.get_tag('pos').value, token.get_tag('pos').score))
```


Visualization routines (https://github.com/zalandoresearch/flair/issues/61)
`flair` now includes visualizations for plotting training curves and weights when training a sequence tagger or text classifier. We also added visualization routines for plotting embeddings and highlighting tags in a sentence. For instance, to visualize contextual string embeddings, do this:

```python
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import CharLMEmbeddings
from flair.visual import Visualizer

# get a list of Sentence objects
corpus = NLPTaskDataFetcher.fetch_data(NLPTask.CONLL_03).downsample(0.1)
sentences = corpus.train + corpus.test + corpus.dev

# init embeddings (can also be a StackedEmbedding)
embeddings = CharLMEmbeddings('news-forward-fast')

# embed corpus batch-wise
batches = [sentences[x:x + 8] for x in range(0, len(sentences), 8)]
for batch in batches:
    embeddings.embed(batch)

# visualize
visualizer = Visualizer()
visualizer.visualize_word_emeddings(embeddings, sentences, 'data/visual/embeddings.html')
```


Implementation of different dropouts (https://github.com/zalandoresearch/flair/issues/48)
Different dropout possibilities (Locked Dropout and Word Dropout) were added and can be used during training.
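
The two variants differ in what gets dropped. As commonly defined, word dropout zeroes entire word vectors, while locked dropout samples one mask over the embedding dimensions and reuses it at every position in the sequence. The sketch below illustrates these definitions on plain lists; it is not Flair's `flair.nn` code, and the function names are hypothetical.

```python
import random


def word_dropout(sequence, rate, rng):
    """Zero out entire word vectors, each with probability `rate`."""
    return [
        [0.0] * len(vector) if rng.random() < rate else list(vector)
        for vector in sequence
    ]


def locked_dropout(sequence, rate, rng):
    """Sample one mask over the dimensions and apply it at every position (rate < 1)."""
    dimensions = len(sequence[0])
    scale = 1.0 / (1.0 - rate)  # rescale kept values so the expected activation is unchanged
    mask = [0.0 if rng.random() < rate else scale for _ in range(dimensions)]
    return [[value * keep for value, keep in zip(vector, mask)] for vector in sequence]


rng = random.Random(0)
toy_sequence = [[1.0] * 4 for _ in range(3)]  # three words, four dimensions
dropped_words = word_dropout(toy_sequence, 0.5, rng)
locked = locked_dropout(toy_sequence, 0.5, rng)
```

Because the locked mask is shared, the same dimensions are missing for every word in the sequence, whereas standard dropout would pick different ones at each position.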

Memory management for training on large data sets (https://github.com/zalandoresearch/flair/issues/137)
`flair` now stores contextual string embeddings on disk to speed up training and allow for training on larger datasets.

Pre-trained language models for Polish
Added pre-trained language models for Polish, donated by [(Borchmann et al., 2018)](https://github.com/applicaai/poleval-2018). Load the Polish embeddings like this:

```python
from flair.embeddings import CharLMEmbeddings
flm_embeddings = CharLMEmbeddings('polish-forward')
blm_embeddings = CharLMEmbeddings('polish-backward')
```


Bug Fixes

Fix evaluation of sequence tagger (https://github.com/zalandoresearch/flair/issues/79, https://github.com/zalandoresearch/flair/issues/75)
The script `eval.pl` for sequence tagger contained bugs. `flair` now uses its own evaluation methods.

Fix bugs in text classifier (https://github.com/zalandoresearch/flair/issues/108)
Fixed bugs in single label training and out-of-memory errors during evaluation.

Others

Standardize logging output (https://github.com/zalandoresearch/flair/issues/16)
Logging output for sequence tagger and text classifier is improved and standardized.

Update torch version (https://github.com/zalandoresearch/flair/issues/34, https://github.com/zalandoresearch/flair/issues/106)
