EstNLTK

Latest version: v1.7.2

1.6.0beta

* Text class has been redesigned.
Text annotations are now decomposed into Span-s, SpanList-s and Layer-s;
* A common class for text annotators -- Tagger class -- has been introduced;
* Word segmentation has been redesigned.
It is now a three-step process, which includes basic tokenization (layer 'tokens'), creation of compound tokens (layer 'compound_tokens'), and creation of words (layer 'words') based on 'tokens' and 'compound_tokens'.
Token compounding rules that are aware of text units containing punctuation (such as abbreviations, emoticons, web addresses) have been implemented (see the sketch after this list);
* The segmentation order has been changed: word segmentation now precedes sentence segmentation, and paragraph segmentation follows sentence segmentation;
* Sentence segmentation has been redesigned.
Sentence segmenter is now aware of the compound tokens (fixing compound tokens can improve sentence segmentation results), and special post-correction steps are applied to improve quality of sentence segmentation;
* Morphological analysis interface has been redesigned.
Morphological analyses are no longer attached to the layer 'words' (although they can be easily accessed through the words, if needed), but are contained in a separate layer named 'morph_analysis'.
* The morphological analysis process can now be more easily decomposed into analysis and disambiguation (using the special taggers VabamorfAnalyzer and VabamorfDisambiguator).
Also, a tagger responsible for post-corrections of morphological analysis (PostMorphAnalysisTagger) has been introduced, with post-corrections that improve the quality of part-of-speech tags and of the analysis of numbers and pronouns;
* Rules for converting morphological analysis categories from Vabamorf's format to GT (giellatekno) format have been ported from the previous version of EstNLTK.
Note, however, that the porting is not complete: full functionality requires 'clauses' annotation (which is currently not available);
* ...
* Other components of EstNLTK (such as the temporal expression tagger and the named entity recognizer) are yet to be ported to the new version;
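
A minimal sketch of the three-step segmentation pipeline described above, using the current `Text.tag_layer` interface (the example string is chosen to contain an abbreviation, a web address and an emoticon):

```python
from estnltk import Text

text = Text('Vt. https://estnltk.github.io :)')
# Requesting 'words' creates the prerequisite layers
# 'tokens' and 'compound_tokens' along the way.
text.tag_layer('words')

print(text['tokens'].text)           # basic tokens
print(text['compound_tokens'].text)  # e.g. abbreviation, web address, emoticon
print(text['words'].text)            # final word segmentation
```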

Added
* SyntaxIgnoreTagger, which can be used for detecting parts of text that should be ignored by the syntactic analyser.
Note: it is yet to be integrated with the pre-processing module of syntactic analysis;

1.7.2

Changed

* Renamed `PgCollection.meta` -> `meta_columns`;
* Deprecated `PgCollection.create()`. Use `PostgresStorage.add_collection` method to create new collections;
* Deprecated `PgCollection.delete()`, `PostgresStorage.delete(collection)` and `PostgresStorage.__delitem__(collection)`. Use `PostgresStorage.delete_collection` method to remove collections (see the migration sketch after this list);
* Deprecated `PgCollection.select_fragment_raw()` (no longer relevant) and `continue_creating_layer` (use `create_layer(..., mode="append")` instead);
* Deprecated `PgCollection.has_fragment()`, `get_fragment_names()`, `get_fragment_tables()`. Use `collection.has_layer(name, layer_type="fragmented")` and `collection.get_layer_names_by_type(layer_type="fragmented")` instead;
* Merged `PgCollection.create_fragment` into `PgCollection.create_fragmented_layer`;
* Merged `PgCollection._create_layer_table` into `PgCollection.add_layer`;
* `StorageCollections.load()`: removed legacy auto-insert behaviour;
* Refactored `PostgresStorage`: upon connecting to database, a new schema is now automatically created if the flag `create_schema_if_missing` has been set and the user has enough privileges. No need to manually call `create_schema` anymore;
* Refactored `StorageCollections` & `PostgresStorage`: relocated `storage_collections` table insertion and deletion logic to `PostgresStorage`;
* Refactored `PgCollection.add_layer`: added `layer_type` parameter, deprecated the `fragmented_layer` parameter and added `'multi'` to layer types;
* Replaced function `conll_to_str` with `converters.conll.layer_to_conll`. For usage, see [this tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/C_syntax/01_syntax_preprocessing.ipynb) ("CONLL exporter");
* Refactored `DateTagger`, `AddressPartTagger`, `SyntaxIgnoreTagger`, `CompoundTokenTagger`: use new RegexTagger instead of the legacy one;
* Refactored `AdjectivePhrasePartTagger`: use `rule_taggers` instead of legacy `dict_taggers`;
* Updated `StanzaSyntax(Ensemble)Tagger`: random picking of ambiguous analyses is no longer deterministic; you'll get different results on each run if the input is morphologically ambiguous. However, if needed, you can use seed values to ensure repeatability. For details, see [stanza parser tutorials](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
* Updated `StanzaSyntaxEnsembleTagger`:
    * if the user attempts to process sentences longer than 1000 words with GPU / CUDA, a guarding exception will be thrown. Pass parameter `gpu_max_words_in_sentence=None` to the tagger to disable the exception;
    * added `aggregation_algorithm` parameter, which defaults to `'las_coherence'` (the same algorithm that was used in previous versions);
    * added a new aggregation algorithm: `'majority_voting'`. With majority voting, the input is processed token-wise, and for each token the head & deprel that get the most votes from the models are picked. Note, however, that this method can produce invalid tree structures, as there is no mechanism to ensure that the majority-voted tokens make up a valid tree.
* Renamed `SoftmaxEmbTagSumWebTagger` to `NeuralMorphDisambWebTagger` and made the following updates:
    * `NeuralMorphDisambWebTagger` is now a `BatchProcessingWebTagger`;
    * `NeuralMorphDisambWebTagger` is now also a `Retagger` and can be used to disambiguate an ambiguous `morph_analysis` layer. In the same vein, `NeuralMorphTagger` was also made a `Retagger` and can be used for disambiguation. For details on the usage, see [the neural morph tutorial](https://github.com/estnltk/estnltk/blob/b67ca34ef0702bb7d7fbe1b55639327dfda55830/tutorials/nlp_pipeline/B_morphology/08_neural_morph_tagger_py37.ipynb);
* `estnltk_neural` package requirements: removed the explicit `tensorflow` requirement.
    * Note, however, that `tensorflow <= 1.15.5` (along with Python `3.7`) is still required if you want to use `NeuralMorphTagger`;
* `Wordnet`:
    * the default database is no longer distributed with the package; wordnet now downloads the database automatically via `estnltk_resources`;
    * alternatively, a local database can now be imported via the parameter `local_dir`;
    * updated wordnet database version to **2.6.0**;
* `HfstClMorphAnalyser`:
    * the model is no longer distributed with the package; the analyser now downloads the model automatically via `estnltk_resources`;
* Refactored `BatchProcessingWebTagger`:
    * renamed parameter `batch_layer_max_size` -> `batch_max_size`;
    * the tagger now has 2 working modes: a) batch splitting guided by a text size limit, b) batch splitting guided by a layer size limit (the old behaviour);
* Updated `vabamorf`'s function `syllabify_word` (see the example after this list):
    * made the compound word splitting heuristic more tolerant to mismatches; as a result, words whose root tokens do not match the surface form exactly can now be syllabified more properly. Examples: `kolmekümne` (`kol-me-küm-ne`), `paarisada` (`paa-ri-sa-da`), `ühesainsas` (`ü-hes-ain-sas`). However, if you need the old syllabification behaviour, pass parameter `tolerance=0` to the function, e.g. `syllabify_word('ühesainsas', tolerance=0)`.
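
A minimal migration sketch for the deprecated collection-management calls, assuming a running PostgreSQL instance; the connection parameters below are placeholders:

```python
from estnltk.storage.postgres import PostgresStorage

# Placeholder connection parameters -- adjust to your own setup.
storage = PostgresStorage(host='localhost', port=5432, dbname='estnltk_db',
                          user='user', password='password',
                          create_schema_if_missing=True)

# Old (deprecated): storage['my_collection'].create()
# New: create collections through the storage object:
collection = storage.add_collection('my_collection')

# Old (deprecated): collection.delete() / storage.delete('my_collection')
# New: remove collections through the storage object:
storage.delete_collection('my_collection')

storage.close()
```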
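
And the new syllabification behaviour itself, assuming `syllabify_word` is importable from `estnltk.vabamorf.morf` (the import path is an assumption):

```python
from estnltk.vabamorf.morf import syllabify_word

# New, mismatch-tolerant behaviour: expected syllables ü-hes-ain-sas.
print(syllabify_word('ühesainsas'))

# The old, strict behaviour can be restored with tolerance=0:
print(syllabify_word('ühesainsas', tolerance=0))
```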

Added

* `MultiLayerTagger` -- interface for taggers that create multiple layers at once;
* `NerWebTagger` that tags NER layers via [tartuNLP NER webservice](https://ner.tartunlp.ai/api) (uses EstBERTNER v1 model). See [this tutorial](https://github.com/estnltk/estnltk/blob/b970ea98532921a4e06022fff2cd3755fc181edf/tutorials/nlp_pipeline/D_information_extraction/02_named_entities.ipynb) for details;
* `EstBERTNERTagger` that tags NER layers using huggingface EstBERTNER models. See [this tutorial](https://github.com/estnltk/estnltk/blob/b970ea98532921a4e06022fff2cd3755fc181edf/tutorials/nlp_pipeline/D_information_extraction/02_named_entities.ipynb) for details (a usage sketch also follows this list);
* `RelationLayer` -- a new type of layer for storing information about relations between entities mentioned in text, such as coreference relations between names and pronouns, or semantic roles/argument structures of verbs. However, `RelationLayer` has not yet been completely integrated with EstNLTK's tools, and it has the following limitations:
    * you cannot access attributes of foreign layers (such as `lemmas` from `morph_analysis`) via spans of a relation layer;
    * `estnltk_core.layer_operations` do not support `RelationLayer`;
    * `estnltk.storage.postgres` does not support `RelationLayer`;
    * `estnltk.visualisation` does not handle `RelationLayer`;

For usage examples, see the [RelationLayer's tutorial](https://github.com/estnltk/estnltk/blob/b8ad0932a852daedb1e3eddeb02c944dd1f292ee/tutorials/system/relation_layer.ipynb).

* `RelationTagger` -- interface for taggers creating `RelationLayer`-s. Instructions on how to create a `RelationTagger` can be found in [this tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/taggers/base_tagger.ipynb);
* `WebRelationTagger` & `BatchProcessingWebRelationTagger`, which allow creating web-based `RelationTagger`-s;
* `CoreferenceTagger`, which detects pronominal coreference relations. The tool is based on [Estonian Coreference System v1.0.0](https://github.com/SoimulPatriei/EstonianCoreferenceSystem) and currently relies on stanza 'et' models for pre-processing the input text. In the future, the tool will also become available via a web service (through `CoreferenceV1WebTagger`). For details, see the [coreference tutorial](https://github.com/estnltk/estnltk/blob/28814a3fa9ff869cd4cfc88308f6ce7e29157889/tutorials/nlp_pipeline/D_information_extraction/04_pronominal_coreference.ipynb);
* Updated `VabamorfDisambiguator`, `VabamorfTagger` & `VabamorfCorpusTagger`: added possibility to preserve phonetic mark-up (even with disambiguation);
* `UDMorphConverter` -- tagger that converts Vabamorf's morphology categories to Universal Dependencies morphological categories. Note that the conversion can introduce additional ambiguities as there is no disambiguation included, and roughly 3% to 9% of words do not obtain correct UD labels with this conversion. More details in [tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/B_morphology/06_morph_analysis_with_ud_categories.ipynb);
* `RobertaTagger` for tagging `EMBEDDIA/est-roberta` embeddings. The interface is analogous to that of `BertTagger`. [Tutorial](https://github.com/estnltk/estnltk/blob/e223a7e6245d29a6b1838335bfa3872a0aa92840/tutorials/nlp_pipeline/E_embeddings/bert_embeddings_tagger.ipynb).
* `BertTokens2WordsRewriter` -- tagger that rewrites BERT tokens layer to a layer enveloping EstNLTK's words layer. Can be useful for mapping Bert's output to EstNLTK's tokenization (currently used by `EstBERTNERTagger`).
* `PhraseExtractor` -- tagger for removing phrases and specific dependency relations based on UD-syntax;
* `ConsistencyDecorator` -- decorator for PhraseExtractor. Calculates syntax conservation scores after removing phrase from text;
* `StanzaSyntaxTaggerWithIgnore` -- entity ignore tagger. Retags text with StanzaSyntaxTagger and excludes phrases found by PhraseExtractor;
* `estnltk.resource_utils.delete_all_resources`. Apply it before uninstalling EstNLTK to remove all resources;
* `clauses_and_syntax_consistency` module, which allows one to 1) detect potential errors in the clauses layer using information from the syntax layer, and 2) fix clause errors with the help of syntactic information. [Tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/F_annotation_consistency/clauses_and_syntax_consistency.ipynb);
* `PostgresStorage` methods:
    * `add_collection`
    * `refresh`
    * `delete_collection`
* `PgCollection` methods:
    * `refresh`
    * `get_layer_names_by_type`
* `PgCollectionMeta` (provides views to `PgCollection`'s metadata, and allows to query metadata) and `PgCollectionMetaSelection` (read-only iterable selection over `PgCollection`'s metadata values);
* Parameter `remove_empty_nodes` to `conll_to_text` importer -- if switched on (default), then empty / null nodes (ellipsis in the enhanced representation) will be discarded (left out from textual content and also from annotations) while importing from conllu files;
* Added a simplified example about how to get whitespace tokenization for words to tutorial [`restoring_pretokenized_text.ipynb`](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/corpus_processing/restoring_pretokenized_text.ipynb);
* `pg_operations.drop_all`;
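
A hedged usage sketch for `EstBERTNERTagger`, assuming it follows EstNLTK's standard `Tagger` interface and resolves its model automatically; the import path, constructor arguments and the attribute name below are assumptions:

```python
from estnltk import Text
from estnltk.taggers import EstBERTNERTagger  # import path is an assumption

text = Text('Jüri Mets töötab Tartu Ülikoolis.').tag_layer('sentences')

# Assumption: the default constructor downloads the huggingface model on first use.
ner_tagger = EstBERTNERTagger()
ner_tagger.tag(text)

for span in text[ner_tagger.output_layer]:
    print(span.text, span.nertag)  # attribute name 'nertag' is an assumption
```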

Fixed

* `extract_(discontinuous_)sections`: should now also work on non-ambiguous layer that has a parent;
* `BaseText.topological_sort`: should now also work on layers with malformed/unknown dependencies;
* `CompoundTokenTagger`: 2nd level compounding rules now also work on detached layers;
* `TimexTagger`'s rules: disabled extraction of overly long year values (which could break _Joda-Time_ integer limits);
* Bug that caused collection metadata to disappear when using `PgCollection.insert` (related to `PgCollection.column_names` not automatically returning correct metadata column names on a loaded collection; the newly introduced `PgCollectionMeta` solved that problem);
* `StanzaSyntax(Ensemble)Tagger`: should now also work on detached layers;
* Fixed `BaseLayer.diff`: now it also takes into account differences in `secondary_attributes`;
* Fixed `downloader._download_and_install_hf_resource`: disabled default behaviour and the `use_symlinks` option, because it fails under Windows;
* Fixed `download`: made it more flexible on parsing (idiosyncratic) 'Content-Type' values;
* Fixed `BertTagger` tokenization: `BertTagger` can now better handle misalignments between bert tokens and word spans caused by emojis, letters with diacritics, and the invisible token `\xad`;

1.7.1

Changed

* Structure and organization of [EstNLTK's tutorials](https://github.com/estnltk/estnltk/tree/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials), including:
    * Relocated introductory tutorials into the folder ['basics'](https://github.com/estnltk/estnltk/tree/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/basics);
    * Relocated 'estner_training' tutorials to 'nlp_pipeline/D_information_extraction';
    * Updated syntax tutorials and split them into parser-wise sub-tutorials:
        * [maltparser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_maltparser.ipynb);
        * [stanza's parser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
        * [udpipe's parser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_udpipe.ipynb);
        * [vislcg3 tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_vislcg3.ipynb);
* Updated `parse_enc` -- it can now be used for parsing [ENC 2021](https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2021-vert/f176ccc0d05511eca6e4fa163e9d454794df2849e11048bb9fa104f1fec2d03f/). See the details from [the tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/corpus_processing/importing_text_objects_from_corpora.ipynb) (a short sketch also follows this list);
* The function `parse_enc_file_iterator` now attempts to _automatically fix malformed paragraph annotations_. As a result, more words and sentences can be imported from corpora, but the side effect is that there will be artificially created paragraph annotations -- even for documents that do not have paragraph annotations originally. The setting can be turned off, if needed;
* Updated `get_resource_paths` function: added EstNLTK version checking. A resource description can now contain version specifiers, which declare estnltk or estnltk_neural version required for using the resource. Using version constraints is optional, but if they are used and constraints are not satisfied, then `get_resource_paths` won't download the resource nor return its path;
* Relocated `estnltk.transformers` (`MorphAnalysisWebPipeline`) into `estnltk.web_taggers`;
* Refactoring: moved functions `_get_word_texts` & `_get_word_text` to `estnltk.common`;
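
A minimal sketch of importing ENC 2021 documents, assuming the import path `estnltk.corpus_processing.parse_enc` and a local vert file (the file name is a placeholder):

```python
from estnltk.corpus_processing.parse_enc import parse_enc_file_iterator

# Iterate over documents of an ENC 2021 *.vert file as EstNLTK Text objects.
for text_obj in parse_enc_file_iterator('enc_2021_excerpt.vert'):
    # 'id' is an assumed metadata key; available keys depend on the corpus.
    print(text_obj.meta.get('id'), len(text_obj.text))
```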

Added

* `ResourceView` class, which lists EstNLTK's resources as a table, and shows their download status. See the [resources tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/basics/estnltk_resources.ipynb) for details.
* `SyntaxIgnoreCutter` class, which cuts the input Text object into a smaller Text by leaving out all spans of the syntax_ignore layer (produced by `SyntaxIgnoreTagger`). The resulting Text can then be analysed syntactically while skipping parts of the text that may be difficult to analyse. For details, see the [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/02_syntax_preprocessing_with_ignoretagger.ipynb);
* function `add_syntax_layer_from_cut_text`, which can be used to carry over the syntactic analysis layer from the cut text (created by `SyntaxIgnoreCutter`) to the original text. See the same [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/02_syntax_preprocessing_with_ignoretagger.ipynb), and the sketch below;
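
A sketch of the intended cut-analyse-restore workflow; the import locations, the `cut` method name and the function signature are assumptions, so consult the tutorial above for the actual API:

```python
from estnltk import Text
# Import locations below are assumptions:
from estnltk.taggers import SyntaxIgnoreTagger, SyntaxIgnoreCutter
from estnltk.taggers.standard.syntax import add_syntax_layer_from_cut_text

text = Text('Tartu on Eesti suuruselt teine linn.').tag_layer('morph_analysis')
SyntaxIgnoreTagger().tag(text)  # adds the 'syntax_ignore' layer

# Cut out the ignored spans before parsing (method name assumed):
cut_text = SyntaxIgnoreCutter().cut(text)
cut_text.tag_layer('maltparser_syntax')

# Carry the syntax layer back to the original text (signature assumed):
add_syntax_layer_from_cut_text(text, cut_text, 'maltparser_syntax')
```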

Fixed

* Syntax preprocessing [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/01_syntax_preprocessing.ipynb): updated to describe the current state of preprocessing;

1.7.0

Changed

* EstNLTK's tools that require large resources (e.g. syntactic parsers and neural analysers) can now download the resources automatically upon initialization. The download stops the program flow with an interactive prompt asking for the user's permission. However, you can predownload the resource in order to avoid the interruption; see this [tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/estnltk_resources.ipynb) for details.

* Structure and organization of [EstNLTK's tutorials](https://github.com/estnltk/estnltk/tree/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials). However, the work on updating tutorials is still not complete.

* `PgCollection`: now uses `CollectionStructure.v30` by default.

* Disambiguator (a system tagger): it's now a Retagger, but can work either as a retagger or a tagger, depending on the inputs. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/taggers/system/disambiguator.ipynb).

Added

* `downloader` & `resources_utils` for downloading additional resources and handling paths of downloaded resources. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/estnltk_resources.ipynb)

* Collocation net -- allows finding different connections between words based on the collocations each word appears in. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/collocation_net/tutorial.ipynb).

* `PgCollection`: added `CollectionStructure.v30`, which allows creating sparse layer tables. Sparse layer tables do not store empty layers, which can save storage space and allow faster queries over tables & collections. The [main db tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/storage/storing_text_objects_in_postgres.ipynb) exemplifies the creation and usage of sparse layers.

* `PgCollection.create_layer` & `PgCollection.add_layer` now take parameter `sparse=True` which turns layer into a sparse layer;

* `PgCollection.select` now has a boolean parameter `keep_all_texts`: turning the parameter off yields only texts with non-empty sparse layers (see the sketch after this list);

* `PgSubCollection` now has methods `create_layer` and `create_layer_block`, which can be used to create a sparse layer from a specific subcollection;
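
A sketch of creating and querying a sparse layer; `collection` is assumed to be an existing `PgCollection` and `my_tagger` a `Tagger` instance (both are placeholders):

```python
# Create a sparse layer: documents for which the tagger produces an
# empty layer get no table row.
collection.create_layer(tagger=my_tagger, sparse=True)

# By default, selections yield all texts; with keep_all_texts=False,
# only texts with a non-empty sparse layer are returned.
for text_id, text in collection.select(layers=[my_tagger.output_layer],
                                       keep_all_texts=False):
    print(text_id, text[my_tagger.output_layer])
```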

Fixed

* `BaseText.__repr__` method using wrong variable name;

* `NeuralMorphTagger`'s configuration reading and handling: model locations can now be freely customized;

* `TimexTagger`'s rules on detecting dates with roman numeral months & dates with slashes.

1.7.0rc0

EstNLTK has gone through a major package restructuring and refactoring process.

Package restructuring

EstNLTK has been split into 3 Python packages:

* `estnltk-core` -- package containing core datastructures, interfaces and data conversion functions of the EstNLTK library;
* `estnltk` -- the standard package, which contains basic linguistic analysis (including Vabamorf morphological analysis, syntactic parsing and information extraction models), system taggers and Postgres database tools;
* `estnltk-neural` -- package containing linguistic analysis based on neural models (Bert embeddings tagger, Stanza syntax taggers and neural morphological tagger);

Normally, end users only need to install `estnltk` (as `estnltk-core` will be installed automatically).

Tools in `estnltk-neural` require installation of deep learning frameworks (`tensorflow`, `pytorch`) and demand considerable computational resources; they also rely on large models (which need to be downloaded separately).

Changed

* `Text`:

    * method `text.analyse` is deprecated and no longer functional. Use `text.tag_layer` to create layers (see the migration sketch after this list). Calling `text.analyse` will display an error message with additional information on migrating from `analyse` to `tag_layer`;
    * added instance variable `text.layer_resolver`, which uses EstNLTK's default pipeline to create layers. The following new layers were added to the pipeline: `'timexes'`, `'address_parts'`, `'addresses'`, `'ner'`, `'maltparser_conll_morph'`, `'gt_morph_analysis'`, `'maltparser_syntax'`, `'verb_chains'`, `'np_chunks'`;
    * Shallow copying of a `Text` is no longer allowed. Only `deepcopy` can be used;
    * Renamed method: `text.list_layers` -> `text.sorted_layers`;
    * Renamed property: `text.attributes` -> `text.layer_attributes`;
    * `Text` is now a subclass of `BaseText` (from `estnltk-core`). `BaseText` stores raw text, metadata and layers, has methods for adding and removing layers, and provides layer access via indexing (square brackets). `Text` provides an alternative access to layers (layers as attributes), and allows calling text analysers / the NLP pipeline (`tag_layer`)
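
A minimal migration sketch from `analyse` to `tag_layer` (the old `analyse` argument shown in the comment is indicative):

```python
from estnltk import Text

text = Text('Tere, maailm!')

# Old (no longer functional): text.analyse('morphology')
# New: request the layers you need; prerequisite layers are created automatically.
text.tag_layer('morph_analysis')

print(text['morph_analysis'].lemma)
```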

* `Layer`:
    * Removed `to_dict()` and `from_dict()` methods. Use `layer_to_dict` and `dict_to_layer` from `estnltk.converters` instead;
    * Shallow copying of a `Layer` is no longer allowed. Only `deepcopy` can be used;
    * Renamed `Layer.attribute_list()` to `Layer.attribute_values()`;
        * indexing attributes (`start`, `end`, `text`) should now be passed to the method via the keyword argument `index_attributes`. They will be prepended to the selection of normal attributes;
    * Renamed `Layer.metadata()` to `Layer.get_overview_dataframe()`;
    * Method `Layer.add_annotation(base_span, annotations)` (see the sketch after this list):
        * now allows passing `annotations` as a dictionary (formerly, `annotations` could be passed only as keyword arguments);
        * an `Annotation` object cannot be passed as a `base_span`;
    * HTML representation: the maximum length of a column is 100 characters and longer strings will be truncated; however, you can change the maximum length via `OUTPUT_CONFIG['html_str_max_len']` (a configuration dictionary in `estnltk_core.common`);
    * `Layer` is now a subclass of `BaseLayer` (from `estnltk-core`). `BaseLayer` stores text's annotations, attributes of annotations and metadata, has methods for adding and removing annotations, and provides span/attribute access via indexing (square brackets). `Layer` adds layer operations (such as finding descendant and ancestor layers, and grouping spans or annotations of the layer), provides an alternative access to local attributes (via the dot operator), and adds the possibility to access foreign attributes (e.g. attributes of a parent layer).
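
A small sketch of the dictionary-based `add_annotation` call on an illustrative layer; the layer and attribute names are made up for the example:

```python
from estnltk import Text, Layer

text = Text('Tere, maailm!')
layer = Layer(name='demo', attributes=['label', 'score'], text_object=text)

# Annotations can now be passed as a dictionary ...
layer.add_annotation((0, 4), {'label': 'greeting', 'score': 1.0})

# ... or in the older keyword-argument style:
layer.add_annotation((6, 12), label='noun', score=0.5)

text.add_layer(layer)
print(text['demo'])
```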

* `SpanList`/`EnvelopingSpan`/`Span`/`Annotation`:
    * Removed `to_records()`/`to_record()` methods. The same functionality is provided by the function `span_to_records` (from `estnltk_core.converters`), but note that the conversion to records does not support all of EstNLTK's data structures and may result in information loss. Therefore, we recommend converting via the functions `layer_to_dict`/`text_to_dict` instead;
    * Method `Span.add_annotation(annotation)` now allows passing `annotation` as a dictionary (formerly, `annotation` could be passed only as keyword arguments);
    * Constructor `Annotation(span, attributes)` now allows passing `attributes` as a dictionary (formerly, `attributes` could be passed only as keyword arguments);

* `Tagger`:
    * trying to `copy` or `deepcopy` a tagger now raises `NotImplementedError`. Copying a tagger is a specific operation that requires handling of the tagger's resources, and therefore no copying should be attempted by default. Instead, you should create a new tagger instance;

* `PgCollection`: Removed obsolete `create_layer_table` method. Use `add_layer` method instead.

* `estnltk.layer_operations`:
    * moved obsolete functions `compute_layer_intersection`, `apply_simple_filter`, `count_by_document`, `dict_to_df`, `group_by_spans`, `conflicts`, `iterate_conflicting_spans`, `combine`, `count_by`, `unique_texts`, `get_enclosing_spans`, `apply_filter`, `drop_annotations`, `keep_annotations`, `copy_layer` (former `Layer.copy()`) to `estnltk_core.legacy`;

* Renamed `Resolver` -> `LayerResolver` and changed:
    * `default_layers` (used by `Text.tag_layer`) are held by the `LayerResolver` and can be changed;
    * `DEFAULT_RESOLVER` is now available from `estnltk.default_resolver`. The former location `estnltk.resolve_layer_dag` was preserved for legacy purposes, but will be removed in the future;
    * Renamed property `list_layers` -> `layers`;
    * HTML/string representations now display `default_layers` and a table which lists the names of creatable layers, their prerequisite layers, the names of taggers responsible for creating the layers, and descriptions of the corresponding taggers;
    * Trying to `copy` or `deepcopy` a layer resolver results in an exception. You should only create new instances of `LayerResolver` -- use the function `make_resolver()` from `estnltk.default_resolver` to create a new default resolver;

* Renamed `Taggers` -> `TaggersRegistry` and changed:
    * retaggers can now also be added to the registry. For every tagger creating a layer, there can be 1 or more retaggers modifying the layer. Also, the retaggers of a layer can be removed via `clear_retaggers`;
    * taggers and retaggers can now be added as `TaggerLoader` objects: they declare the input layers, output layer and importing path of a tagger, but do not load the tagger until explicitly demanded (_lazy loading_);

* Refactored `AnnotationRewriter`:
    * the tagger should now clearly define whether it only changes attribute values (default) or modifies the set of attributes in the layer;
    * the tagger should not add or delete annotations (this is a job for `SpanAnnotationsRewriter`);

* Restructured `estnltk.taggers` into 3 submodules:
    * `standard` -- tools for standard NLP tasks in Estonian, such as text segmentation, morphological processing, syntactic parsing, named entity recognition and temporal expression tagging;
    * `system` -- system-level taggers for finding layer differences, flattening and merging layers, but also taggers for rule-based information extraction, such as the phrase tagger and the grammar parsing tagger;
    * `miscellaneous` -- taggers made for very specific analysis purposes (such as date extraction from medical records), and experimental taggers (verb chain detection, noun phrase chunking);
    * _Note_: this should not affect importing taggers: you can still import most of the taggers from `estnltk.taggers` (except neural ones, which are now in the separate package `estnltk-neural`);

* `serialisation_map` (in `estnltk.converters`) was replaced with `SERIALISATION_REGISTRY`:
    * `SERIALISATION_REGISTRY` is a common registry used by all serialisation functions (such as `text_to_json` and `json_to_text` in `estnltk_core.converters`). The registry is defined in the package `estnltk_core` (which contains only the `default` serialization module), and augmented in the `estnltk` package (with the `legacy_v0` and `syntax_v0` serialization modules);

* Renamed `estnltk.taggers.dict_taggers` -> `estnltk.taggers.system.rule_taggers` and changed:
    * the `Vocabulary` class is replaced by the `Ruleset` and `AmbiguousRuleset` classes;
    * all taggers now follow a common structure based on a pipeline of static rules, dynamic rules and a global decorator;
    * added new tagger `SubstringTagger` for tagging occurrences of substrings in text (see the sketch after this list);
    * old versions of the taggers were moved to `estnltk.legacy` for backward compatibility;
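
A hedged sketch of `SubstringTagger`; the import location, the rule class name and the constructor arguments are assumptions based on the rule pipeline described above:

```python
from estnltk import Text
from estnltk.taggers.system.rule_taggers import (  # names are assumptions
    Ruleset, StaticExtractionRule, SubstringTagger)

# Static rules: every matched substring receives the given attributes.
ruleset = Ruleset([
    StaticExtractionRule(pattern='Tartu', attributes={'type': 'city'}),
    StaticExtractionRule(pattern='Eesti', attributes={'type': 'country'}),
])

tagger = SubstringTagger(ruleset=ruleset, output_layer='places',
                         output_attributes=['type'])

text = Text('Tartu on Eesti suuruselt teine linn.')
tagger.tag(text)
print(text['places'])
```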

* Relocated TCF, CONLL and CG3 conversion utils to submodules in `estnltk.converters`;

* Relocated `estnltk.layer` to `estnltk_core.layer`;

* Relocated `estnltk.layer_operations` to `estnltk_core.layer_operations`;

* Moved functionality of `layer_operations.group_by_layer` into `GroupBy` class;

* Relocated `TextaExporter` to `estnltk.legacy` (not actively developed);

* Renamed `TextSegmentsTagger` -> `HeaderBasedSegmenter`;

* Renamed `DisambiguatingTagger` -> `Disambiguator`;

* Renamed `AttributeComparisonTagger` -> `AttributeComparator`;

* Relocated Vabamorf's default parameters from `estnltk.taggers.standard.morph_analysis.morf_common` to `estnltk.common`;

* Merged `EnvelopingGapTagger` into `GapTagger`:
    * `GapTagger` now has 2 working modes:
        * Default mode: look for sequences of consecutive characters not covered by input layers;
        * EnvelopingGap mode: look for sequences of the enveloped layer's spans not enveloped by the input enveloping layers;

* Refactored `TimexTagger`:
    * removed `TIMEXES_RESOLVER` and moved all necessary preprocessing (text segmentation and morphological analysis) inside `TimexTagger`;
    * `'timexes'` is now a flat layer by default. It can be made enveloping around `'words'`, but this can result in broken timex phrases due to differences between `TimexTagger`'s tokenization and EstNLTK's default tokenization;

* `Vabamorf`'s optimization:
    * Disabled [Swig proxy classes](http://www.swig.org/Doc3.0/Python.html#Python_builtin_types). As a result, morphological analysis is faster. However, this update is under testing and may not be permanent, because disabled proxy classes are known to cause conflicts with other Python Swig extensions compiled under different settings (for more details, see [here](https://stackoverflow.com/q/21103242) and [here](https://github.com/estnltk/estnltk/blob/b0d0ba6d943fb42b923fa6999c752fead927c992/dev_documentation/hfst_integration_problems/solving_stringvector_segfault.md));

* Dropped Python 3.6 support;


Added

* `Layer.secondary_attributes`: a list of the layer's attributes which will be skipped while comparing two layers. Usually this means that these attributes contain redundant information. Another reason for marking an attribute as _secondary_ is the attribute being recursive; skipping such an attribute avoids infinite recursion in comparison;

* `Layer.span_level` property: an integer conveying depth of enveloping structure of this layer; `span_level=0` indicates no enveloping structure: spans of the layer mark raw text positions `(start, end)`, and `span_level` > 0 indicates that spans of the layer envelop around smaller level spans (for details, see the `BaseSpan` docstring in `estnltk_core.layer.base_span`);

* `Layer.clear_spans()` method that removes all spans (and annotations) from the layer. Note that clearing does not change the `span_level` of the layer, so spans added after the clearing must have the same level as before clearing;

* `find_layer_dependencies` function to `estnltk_core.layer_operations` -- finds all layers that the given layer depends on. Can also be used for reverse search: find all layers depending on the given layer (e.g. enveloping layers and child layers);

* `SpanAnnotationsRewriter` (a replacement for the legacy `SpanRewriter`) -- a tagger that applies a modifying function on each span's annotations. The function takes the span's annotations (a list of `Annotation` objects) as input and is allowed to change, delete and add new annotations to the list. The function must return a list with the modified annotations. Removing all annotations of a span is forbidden. A sketch follows below.
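
A sketch of such a modifying function; the `SpanAnnotationsRewriter` import location and constructor parameter names are assumptions:

```python
from estnltk import Text
from estnltk.taggers import SpanAnnotationsRewriter  # import path is an assumption

def uppercase_lemmas(annotations):
    # Receives the list of Annotation objects of one span and must return
    # a non-empty list of (possibly modified) annotations.
    for annotation in annotations:
        annotation['lemma'] = annotation['lemma'].upper()
    return annotations

# Parameter names below are assumptions:
rewriter = SpanAnnotationsRewriter(layer_name='morph_analysis',
                                   function=uppercase_lemmas)

text = Text('Tere!').tag_layer('morph_analysis')
rewriter.retag(text)
```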

Fixed

* Property `Layer.end` giving wrong ending index;
* `Text` HTML representation: Fixed "FutureWarning: The frame.append method is deprecated /.../ Use pandas.concat instead";
* `Layer.ancestor_layers` and `Layer.descendant_layers` had their functionalities swapped (`ancestor_layers` returned descendants instead of ancestors); now they return what their names indicate;
* `Span.__repr__` now avoids overly long representations and renders fully only values of basic data types (such as `str`, `int`, `list`);
* `SyntaxDependencyRetagger` now marks `parent_span` and `children` as `secondary_attributes` in order to avoid infinite recursion in syntax layer comparison;
* `PgCollection`: `collection.layers` now returns `[]` in case of an empty collection;
* `PgCollection`: added proper exception throwing for cases where user wants to modify an empty collection;

1.6.9.1beta

This is an intermediate release of PyPI packages. Versions 1.6.9b0 and 1.6.9.1b0 are equal with regard to the main functionalities, so no conda packages will be generated. The list of changes will be documented in the next release.
