Stanza


1.6.0

Multiple model levels

The `package` parameter for building the `Pipeline` now has three default settings:

- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- `default-fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- `default-accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into `-fast` and `-accurate` versions for each UD dataset.
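
As a rough illustration, selecting a model level happens through the `package` parameter when building the `Pipeline`. This is a minimal sketch; the exact package strings (for example hyphen vs. underscore in `default-fast`) may differ by release:

```python
import stanza

# Sketch of choosing a model level via the `package` parameter.
# The exact package strings are illustrative and may vary by release.
nlp_fast = stanza.Pipeline("en", package="default-fast")          # no charlm for POS/depparse
nlp_accurate = stanza.Pipeline("en", package="default-accurate")  # charlm/transformer models

doc = nlp_accurate("Stanza now ships multiple model levels.")
print(doc.sentences[0].words[0].lemma)
```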

PR: https://github.com/stanfordnlp/stanza/pull/1287

addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models can now learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data uses only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:


| model | OntoNotes | WorldWide |
|---|---|---|
| original | 88.71 | 69.29 |
| simplify-separate | 88.24 | 75.75 |
| simplify-connected | 88.32 | 75.47 |



We also produced combined models with no charlm and with Electra as the input encoding. The new English NER models are the packages `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.
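
For illustration, loading one of the new combined packages might look like the following sketch. The dict-valued `package` argument (per-processor packages, described in the 1.4.0 notes below) is an assumption here, so treat this as illustrative rather than the definitive invocation:

```python
import stanza

# Sketch: request the combined OntoNotes+WorldWide NER model by package name.
# The per-processor dict form of `package` is assumed here.
nlp = stanza.Pipeline("en",
                      processors="tokenize,ner",
                      package={"ner": "ontonotes-combined_charlm"})

doc = nlp("Stanford University is in Palo Alto.")
print([(ent.text, ent.type) for ent in doc.ents])
```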

Future plans include using multiple NER datasets for other models as well.

Other features

- Postprocessing of proposed tokenization is now possible with dependency injection on the Pipeline (thank you Jemoka). When creating a `Pipeline`, you can now provide a `callable` via the `tokenize_postprocessor` parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline` (see the sketch after this list) https://github.com/stanfordnlp/stanza/pull/1290

- Finetuning for transformers in the NER models: we have not yet found settings that help, though https://github.com/stanfordnlp/stanza/commit/45ef5445f44222df862ed48c1b3743dc09f3d3fd

- SE and SME should both represent Northern Sami, an unusual case where UD did not use the standard two-letter code https://github.com/stanfordnlp/stanza/issues/1279 https://github.com/stanfordnlp/stanza/commit/88cd0df5da94664cb04453536212812dc97339bb

- charlm for PT (improves accuracy on non-transformer models): https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a

- build models with transformers for a few additional languages: MR, AR, PT, JA https://github.com/stanfordnlp/stanza/commit/45b387531c67bafa9bc41ee4d37ba0948daa9742 https://github.com/stanfordnlp/stanza/commit/0f3761ee63c57f66630a8e94ba6276900c190a74 https://github.com/stanfordnlp/stanza/commit/c55472acbd32aa0e55d923612589d6c45dc569cc https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a
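
Below is a minimal sketch of the tokenizer postprocessing hook mentioned above. The exact structure of the candidate tokenization passed to the callable is an assumption here, so treat this as illustrative rather than the definitive signature:

```python
import stanza

def passthrough_postprocessor(tokens):
    # Illustrative postprocessor: receives the proposed tokenization
    # (the exact structure of `tokens` is an assumption) and may return
    # an adjusted candidate list.  This sketch changes nothing.
    return tokens

nlp = stanza.Pipeline("en",
                      processors="tokenize",
                      tokenize_postprocessor=passthrough_postprocessor)
doc = nlp("A quick tokenizer postprocessing demo.")
print([token.text for sentence in doc.sentences for token in sentence.tokens])
```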


Bugfixes

- Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 https://github.com/stanfordnlp/stanza/commit/c71bf3fdac8b782a61454c090763e8885d0e3824

- `run_ete.py` was not correctly processing the charlm, meaning the end-to-end test would not actually run https://github.com/stanfordnlp/stanza/commit/16f29f3dcf160f0d10a47fec501ab717adf0d4d7

- Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 https://github.com/stanfordnlp/stanza/commit/82a02151da17630eb515792a508a967ef70a6cef

1.5.1

Features

depparse can have transformer as an embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a

Lemmatizer can remember (word, pos) pairs it has seen before, via a flag https://github.com/stanfordnlp/stanza/issues/1263 https://github.com/stanfordnlp/stanza/commit/a87ffd0a4f43262457cf7eecf5555a621c6dc24e

Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398

SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285

Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1

Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9

Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0

Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74

Bugfixes

Forgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 https://github.com/stanfordnlp/stanza/commit/4dda14bd585893044708c70e30c1c3efec509863 https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013

`prepare_ner_dataset` was always creating an Armenian pipeline, even for non-Armenian languages https://github.com/stanfordnlp/stanza/commit/78ff85ce7eed596ad195a3f26474065717ad63b3

Fix an empty `bulk_process` throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278

Unroll the recursion in the Tarjan part of the Chu-Liu/Edmonds algorithm - should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb

Minor updates

Put NER and POS scores on one line to make it easier to grep for: https://github.com/stanfordnlp/stanza/commit/da2ae33e8ef9e48842685dfed88896b646dba8c4 https://github.com/stanfordnlp/stanza/commit/8c4cb04d38c1101318755270f3aa75c54236e3fe

Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others

Pipeline uses `torch.no_grad()` for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3

Generalize save names, which eventually allows for putting `transformer`, `charlm` or `nocharlm` in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models

Add the model's flags to the `--help` for the `run` scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600

Remove the dependency on `six` https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you BLKSerene)

New Models

VLSP constituency https://github.com/stanfordnlp/stanza/commit/500435d3ec1b484b0f1152a613716565022257f2

VLSP constituency -> tagging https://github.com/stanfordnlp/stanza/commit/cb0f22d7be25af0b3b2790e3ce1b9dbc277c13a7

CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31

Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f

Added an Indonesian charlm

Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218

All languages with pretrained charlms now have an option to use that charlm for dependency parsing

French combined models out of `GSD`, `ParisStories`, `Rhapsodie`, and `Sequoia` https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b

UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058

1.5.0

Ssurgeon interface

Headlining this release is the first version of Ssurgeon, a rule-based dependency graph editing tool. Along with the existing Semgrex integration with CoreNLP, Ssurgeon allows for rewriting of dependencies, such as those in the UD datasets. More information is in the GURT 2023 paper, https://aclanthology.org/2023.tlt-1.7/

Beyond Ssurgeon, this release includes two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments which landed somewhere between "ineffective" and "small improvements" and are available for people to experiment with.

CoreNLP integration:
- Ssurgeon interface! New interface allows for editing of dependency graphs using Semgrex patterns and Ssurgeon rules. https://github.com/stanfordnlp/stanza/pull/1205 https://aclanthology.org/2023.tlt-1.7/
- English Morphology class (deterministic English lemmatizer) https://github.com/stanfordnlp/stanza/commit/6aed177731e883ce92057be7e78abdce3141a862
- English constituency -> dependency converter https://github.com/stanfordnlp/stanza/commit/0987794c9e960b32ed75d5804dd5c586466ae061

Bugfixes:
- Bugfix for older versions of torch: https://github.com/stanfordnlp/stanza/commit/376d7ea76248131a96d23e236ab165e7d5a544bb
- Bugfix for training (integration with new scoring script) https://github.com/stanfordnlp/stanza/issues/1167 https://github.com/stanfordnlp/stanza/commit/9c39636c438cbeb00ab7a7e8d9caa0bcd31ccc44
- Demo was showing constituency parser along with dependency parsing, even with conparse off: https://github.com/stanfordnlp/stanza/commit/cbc13b0219281f2c27e89ccf2914e13f8aa2bb1b
- Replace absurdly long characters with UNK (thank you khughitt) https://github.com/stanfordnlp/stanza/issues/1137 https://github.com/stanfordnlp/stanza/pull/1140
- Package all relevant pretrains into default.zip - otherwise pretrains used by NER models which are not the default pretrain were being missed. https://github.com/stanfordnlp/stanza/commit/435685f875766e0b9b2b9b1d4792db1c452f9722
- stanza-train NER training bugfix (wrong pretrain): https://github.com/stanfordnlp/stanza/commit/2757cb40edf7a4bf9f62e31eec4b3632ac5ebcb9
- Pass the device around everywhere instead of calling cuda(). This should fix models occasionally being split over multiple devices. It would also allow for use of MPS, but the current torch implementation for MPS is buggy https://github.com/stanfordnlp/stanza/issues/1209 https://github.com/stanfordnlp/stanza/pull/1159
- Fix error in preparing tokenizer datasets (thanks dvzubarev): https://github.com/stanfordnlp/stanza/pull/1161
- Fix unnecessary slowness in preparing tokenizer datasets (again, thanks dvzubarev): https://github.com/stanfordnlp/stanza/pull/1162
- Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks dvzubarev): https://github.com/stanfordnlp/stanza/pull/1170
- When using the tregex interface to CoreNLP, add parse if it isn't already there (again, depparse was being confused with parse): https://github.com/stanfordnlp/stanza/commit/b118473604d50d678c2857c0f39f59ba0cd9c2a3
- Update use of emoji to match latest releases: https://github.com/stanfordnlp/stanza/issues/1195 https://github.com/stanfordnlp/stanza/commit/ea345a88f8916c2ab2cd2e6260caa7831dfe2f23

Features:
- Mechanism for resplitting tokens into MWT https://github.com/stanfordnlp/stanza/issues/95 https://github.com/stanfordnlp/stanza/commit/8fac17f625173b2c2bf1cecf611deecb37399322
- CLI for tokenizing text into one paragraph per line, whitespace separated (useful for Glove, for example) https://github.com/stanfordnlp/stanza/commit/cfd44d17f806703b7ed6719993501366a52afbb1
- `detach().cpu()` speeds things up significantly in some cases https://github.com/stanfordnlp/stanza/commit/ccfbc56b3b312fdde1350104a0d0d5645c9c80cc
- Potentially use a constituency model as a classifier - WIP research project https://github.com/stanfordnlp/stanza/pull/1190
- Add an output format `"{:C}"` for document objects which prints out documents as CoNLL (see the sketch after this list): https://github.com/stanfordnlp/stanza/pull/1169
- If a constituency tree is available, include it when outputting conll format for documents: https://github.com/stanfordnlp/stanza/pull/1171
- Same with sentiment: https://github.com/stanfordnlp/stanza/commit/abb581945a70fec335dbfadd71bf8c457fa908eb
- Additional language code coverage (thank you juanro49) https://github.com/stanfordnlp/stanza/commit/5802b10882026c4694a4d966e4200c48c5469b1b https://github.com/stanfordnlp/stanza/commit/f06bf86b566772ea6551c663835ddb9a6f5584ff https://github.com/stanfordnlp/stanza/commit/32f83fa2f2333f42925323c4ac9da059dffdf1dc https://github.com/stanfordnlp/stanza/commit/34505758c9d8de4ca70bfbe5418448ad54af088f
- Allow loading a pipeline for new languages (useful when developing a new suite of models) https://github.com/stanfordnlp/stanza/commit/e7fcd262a6c5f3f71b339fe989bcaa177fb378f1
- Script to count the work done by annotators on aws sagemaker private workforce: https://github.com/stanfordnlp/stanza/pull/1186
- Streaming interface which batch processes items in the stream: https://github.com/stanfordnlp/stanza/commit/2c9fe3dad434b271fa23c20a9cf8ccaf63991f16 https://github.com/stanfordnlp/stanza/issues/550
- Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for each language at once: https://github.com/stanfordnlp/stanza/commit/70fd2fdc94575dec79c4994ea2dc66a719768ab0 https://github.com/stanfordnlp/stanza/issues/1199
- Transformer at bottom layer of POS - currently only available in English as the `en_combined_bert` model, others to come https://github.com/stanfordnlp/stanza/pull/1132
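
As a small usage sketch of the `"{:C}"` format noted above (the surrounding pipeline call is just an example):

```python
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("Stanza can print CoNLL output directly.")

# The "{:C}" format spec renders the Document in CoNLL format.
print("{:C}".format(doc))
```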

New models:
- Armenian NER model using an NER labeling of armtdp (thanks to ShakeHakobyan): https://github.com/myavrum/ArmTDP-NER https://github.com/stanfordnlp/stanza/issues/1206 https://github.com/stanfordnlp/stanza/pull/1212
- Sindhi tokenization from ISRA https://github.com/stanfordnlp/stanza/pull/1117
- Sindhi NER from SiNER: https://github.com/stanfordnlp/stanza/commit/2a8ded4b0c327761b047caf433128f13b1ad14bf
- Erzya from UD 2.11 https://github.com/stanfordnlp/stanza/commit/0344ac34b5df602a49da25d58655a24a0ffcd208

Conparser experiments:
- Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 https://github.com/stanfordnlp/stanza/commit/110031e29259b34be6f958fd6d67d4774d6b084a
- TREE_LSTM constituent composition method (didn't beat MAX) https://github.com/stanfordnlp/stanza/commit/2f722c828fa1364131b670da5b925082e9aa336a
- Learned weighting between bert layers (this did help a little) https://github.com/stanfordnlp/stanza/commit/2d0c69ee449501155225efc2afb53b4ba6eeefe7
- Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models. Helps smaller treebanks such as IT and VI, but no effect on EN https://github.com/stanfordnlp/stanza/pull/1148
- New in_order_compound transition scheme: no improvement https://github.com/stanfordnlp/stanza/commit/f560b08902cf9f9e20656697c367500389115057
- Multistage training with madgrad or adamw: definite improvement. madgrad included as optional dependency https://github.com/stanfordnlp/stanza/commit/2706c4b100285e50f3d9a69e51ca5955e15ba41d https://github.com/stanfordnlp/stanza/commit/f500936b5ca4ba2305a028241996e5d198afd94b
- Report the scores of tags when retagging (does not affect the conparser training) https://github.com/stanfordnlp/stanza/commit/766341942962e5a5a0aa0cda3dd170ac098ac6f9
- FocalLoss on the transitions using optional dependency: didn't help https://arxiv.org/abs/1708.02002 https://github.com/stanfordnlp/stanza/commit/90a8337083f0dc057ea2a9ee794595a6b292850f
- LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax https://github.com/stanfordnlp/stanza/commit/5edd7242073720aff94f07904009ce0cad47b7ff
- Maxout layer: didn't help https://arxiv.org/abs/1302.4389 https://github.com/stanfordnlp/stanza/commit/c708ce7736ffb021f9a0065f2bedaa8b73de52ba
- Reverse parsing: not expected to help, potentially can be useful when building silver treebanks. May also be useful as a two step parser in the future. https://github.com/stanfordnlp/stanza/commit/4954845ba4b16240e6acf8d45d83161a0dec8d33

1.4.2

- Pipeline cache in Multilingual is a single OrderedDict
https://github.com/stanfordnlp/stanza/issues/1115#issuecomment-1239759362
https://github.com/stanfordnlp/stanza/commit/ba3f64d5f571b1dc70121551364fc89d103ca1cd

- Don't require `pytest` for all installations unless needed for testing
https://github.com/stanfordnlp/stanza/issues/1120
https://github.com/stanfordnlp/stanza/commit/8c1d9d80e2e12729f60f05b81e88e113fbdd3482

- Hide SiLU and Mish imports if the version of torch installed doesn't have those nonlinearities
https://github.com/stanfordnlp/stanza/issues/1120
https://github.com/stanfordnlp/stanza/commit/6a90ad4bacf923c88438da53219c48355b847ed3

- Reorder & normalize installations in setup.py
https://github.com/stanfordnlp/stanza/pull/1124

1.4.1

Overview

We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

New NER models

- New Polish NER model based on NKJP from Karol Saputa and ryszardtuora
https://github.com/stanfordnlp/stanza/issues/1070
https://github.com/stanfordnlp/stanza/pull/1110

- Make GermEval2014 the default German NER model, including an optional Bert version
https://github.com/stanfordnlp/stanza/issues/1018
https://github.com/stanfordnlp/stanza/pull/1022

- Japanese conversion of GSD by Megagon
https://github.com/stanfordnlp/stanza/pull/1038

- Marathi NER dataset from L3Cube. Includes a Sentiment model as well
https://github.com/stanfordnlp/stanza/pull/1043

- Thai conversion of LST20
https://github.com/stanfordnlp/stanza/commit/555fc0342decad70f36f501a7ea1e29fa0c5b317

- Kazakh conversion of KazNERD
https://github.com/stanfordnlp/stanza/pull/1091/commits/de6cd25c2e5b936bc4ad2764b7b67751d0b862d7

Other new models

- Sentiment conversion of Tass2020 for Spanish
https://github.com/stanfordnlp/stanza/pull/1104

- VIT constituency dataset for Italian
https://github.com/stanfordnlp/stanza/pull/1091/commits/149f1440dc32d47fbabcc498cfcd316e53aca0c6
... and many subsequent updates

- Combined UD models for Hebrew
https://github.com/stanfordnlp/stanza/issues/1109
https://github.com/stanfordnlp/stanza/commit/e4fcf003feb984f535371fb91c9e380dd187fd12

- For UD models with small train dataset & larger test dataset, flip the datasets
UD_Buryat-BDT UD_Kazakh-KTB UD_Kurmanji-MG UD_Ligurian-GLT UD_Upper_Sorbian-UFAL
https://github.com/stanfordnlp/stanza/issues/1030
https://github.com/stanfordnlp/stanza/commit/9618d60d63c49ec1bfff7416e3f1ad87300c7073

- Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF
https://github.com/stanfordnlp/stanza/commit/47740c6252a6717f12ef1fde875cf19fa1cd67cc

Model improvements

- Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost
https://github.com/stanfordnlp/stanza/pull/1086

- Pretrained charlm integrated into Sentiment. Improves English, others not so much
https://github.com/stanfordnlp/stanza/pull/1025

- LSTM, 2d maxpool as optional items in the Sentiment
from the paper `Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling`
https://github.com/stanfordnlp/stanza/pull/1098

- First learn with AdaDelta, then with another optimizer in conparse training. Very helpful
https://github.com/stanfordnlp/stanza/commit/b1d10d3bdd892c7f68d2da7f4ba68a6ae3087f52

- Grad clipping in conparse training
https://github.com/stanfordnlp/stanza/commit/365066add019096332bcba0da4a626f68b70d303

Pipeline interface improvements

- GPU memory savings: charlm reused between different processors in the same pipeline
https://github.com/stanfordnlp/stanza/pull/1028

- Word vectors not saved in the NER models. Saves bandwidth & disk space
https://github.com/stanfordnlp/stanza/pull/1033

- Functions to return tagsets for NER and conparse models
https://github.com/stanfordnlp/stanza/issues/1066
https://github.com/stanfordnlp/stanza/pull/1073
https://github.com/stanfordnlp/stanza/commit/36b84db71f19e37b36119e2ec63f89d1e509acb0
https://github.com/stanfordnlp/stanza/commit/2db43c834bc8adbb8b096cf135f0fab8b8d886cb

- displaCy integration with NER and dependency trees
https://github.com/stanfordnlp/stanza/commit/20714137d81e5e63d2bcee420b22c4fd2a871306

Bugfixes

- Fix that it takes forever to tokenize a single long token (catastrophic backtracking in regex)
TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook)
https://github.com/stanfordnlp/stanza/pull/1056

- Starting a new CoreNLP client without a server shouldn't wait for the server to be available
TY to Mariano Crosetti
https://github.com/stanfordnlp/stanza/issues/1059
https://github.com/stanfordnlp/stanza/pull/1061

- Read raw GloVe word vectors (they have no header information)
https://github.com/stanfordnlp/stanza/pull/1074

- Ensure that illegal languages are not chosen by the LangID model
https://github.com/stanfordnlp/stanza/issues/1076
https://github.com/stanfordnlp/stanza/pull/1077

- Fix cache in Multilingual pipeline
https://github.com/stanfordnlp/stanza/issues/1115
https://github.com/stanfordnlp/stanza/commit/cdf18d8b19c92b0cfbbf987e82b0080ea7b4db32

- Fix loading of previously unseen languages in Multilingual pipeline
https://github.com/stanfordnlp/stanza/issues/1101
https://github.com/stanfordnlp/stanza/commit/e551ebe60a4d818bc5ba8880dda741cc8bd1aed7

- Fix that conparse would occasionally train to NaN early in the training
https://github.com/stanfordnlp/stanza/commit/c4d785729e42ac90f298e0ef4ab487d14fa35591

Improved training tools

- W&B integration for all models: can be activated with --wandb flag in the training scripts
https://github.com/stanfordnlp/stanza/pull/1040

- New webpages for building charlm, NER, and Sentiment
https://stanfordnlp.github.io/stanza/new_language_charlm.html
https://stanfordnlp.github.io/stanza/new_language_ner.html
https://stanfordnlp.github.io/stanza/new_language_sentiment.html

- Script to download Oscar 2019 data for charlm from HF (requires `datasets` module)
https://github.com/stanfordnlp/stanza/pull/1014

- Unify sentiment training into a Python script, replacing the old shell script
https://github.com/stanfordnlp/stanza/pull/1021
https://github.com/stanfordnlp/stanza/pull/1023

- Convert sentiment to use .json inputs. In particular, this helps with languages with spaces in words such as Vietnamese
https://github.com/stanfordnlp/stanza/pull/1024

- Slightly faster charlm training
https://github.com/stanfordnlp/stanza/pull/1026

- Data conversion of WikiNER generalized for retraining / add new WikiNER models
https://github.com/stanfordnlp/stanza/pull/1039

- XPOS factory now determined at start of POS training. Makes addition of new languages easier
https://github.com/stanfordnlp/stanza/pull/1082

- Checkpointing and continued training for charlm, conparse, sentiment
https://github.com/stanfordnlp/stanza/pull/1090
https://github.com/stanfordnlp/stanza/commit/0e6de808eacf14cd64622415eeaeeac2d60faab2
https://github.com/stanfordnlp/stanza/commit/e5793c9dd5359f7e8f4fe82bf318a2f8fd190f54

- Option to write the results of a NER model to a file
https://github.com/stanfordnlp/stanza/pull/1108

- Add fake dependencies to a conllu formatted dataset for better integration with evaluation tools
https://github.com/stanfordnlp/stanza/commit/6544ef3fa5e4f1b7f06dbcc5521fbf9b1264197a

- Convert an AMT NER result to Stanza .json
https://github.com/stanfordnlp/stanza/commit/cfa7e496ca7c7662478e03c5565e1b2b2c026fad

- Add a ton of language codes, including 3 letter codes for languages we generally treat as 2 letters
https://github.com/stanfordnlp/stanza/commit/5a5e9187f81bd76fcd84ad713b51215b64234986
https://github.com/stanfordnlp/stanza/commit/b32a98e477e9972737ad64deea0bda8d6cebb4ec and others

1.4.0

Overview

As part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules. In addition, we now support several additional languages for NER and conparse.

Pipeline interface improvements

- Download resources.json and models into temp dirs first to avoid race conditions between multiple processors
https://github.com/stanfordnlp/stanza/issues/213
https://github.com/stanfordnlp/stanza/pull/1001

- Download models for Pipelines automatically, without needing to call `stanza.download(...)`
https://github.com/stanfordnlp/stanza/issues/486
https://github.com/stanfordnlp/stanza/pull/943

- Add ability to turn off downloads
https://github.com/stanfordnlp/stanza/commit/68455d895986357a2c1f496e52c4e59ee0feb165

- Add a new interface where both processors and package can be set (see the sketch after this list)
https://github.com/stanfordnlp/stanza/issues/917
https://github.com/stanfordnlp/stanza/commit/f37042924b7665bbaf006b02dcbf8904d71931a1

- When using pretokenized tokens, get character offsets from text if available
https://github.com/stanfordnlp/stanza/issues/967
https://github.com/stanfordnlp/stanza/pull/975

- If Bert or other transformers are used, cache the models rather than loading multiple times
https://github.com/stanfordnlp/stanza/pull/980

- Allow for disabling processors on individual runs of a pipeline
https://github.com/stanfordnlp/stanza/issues/945
https://github.com/stanfordnlp/stanza/pull/947
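
A brief sketch of the combined processors/package interface mentioned above; the specific package names ("gsd", "germeval2014") are illustrative assumptions rather than guaranteed identifiers:

```python
import stanza

# Sketch: choose processors and per-processor packages in one call.
# Package names below are assumed for illustration only.
nlp = stanza.Pipeline("de",
                      processors="tokenize,pos,ner",
                      package={"pos": "gsd", "ner": "germeval2014"})
doc = nlp("Die Universität Stuttgart liegt in Baden-Württemberg.")
print([(ent.text, ent.type) for ent in doc.ents])
```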

Other general improvements

- Add text and sent_id to conll output
https://github.com/stanfordnlp/stanza/discussions/918
https://github.com/stanfordnlp/stanza/pull/983
https://github.com/stanfordnlp/stanza/pull/995

- Add ner to the token conll output
https://github.com/stanfordnlp/stanza/discussions/993
https://github.com/stanfordnlp/stanza/pull/996

- Fix missing Slovak MWT model
https://github.com/stanfordnlp/stanza/issues/971
https://github.com/stanfordnlp/stanza/commit/5aa19ec2e6bc610576bc12d226d6f247a21dbd75

- Upgrades to EN, IT, and Indonesian models
https://github.com/stanfordnlp/stanza/issues/1003
https://github.com/stanfordnlp/stanza/pull/1008
IT improvements with the help of attardi and msimi

- Fix improper tokenization of Chinese text with leading whitespace
https://github.com/stanfordnlp/stanza/issues/920
https://github.com/stanfordnlp/stanza/pull/924

- Check if a CoreNLP model exists before downloading it (thank you interNULL)
https://github.com/stanfordnlp/stanza/pull/965

- Convert the run_charlm script to python
https://github.com/stanfordnlp/stanza/pull/942

- Typing and lint fixes (thank you asears)
https://github.com/stanfordnlp/stanza/pull/833
https://github.com/stanfordnlp/stanza/pull/856

- stanza-train examples now compatible with the python training scripts
https://github.com/stanfordnlp/stanza/issues/896

NER features

- Bert integration (not by default, thank you vythaihn)
https://github.com/stanfordnlp/stanza/pull/976

- Swedish model (thank you EmilStenstrom)
https://github.com/stanfordnlp/stanza/issues/912
https://github.com/stanfordnlp/stanza/pull/857

- Persian model
https://github.com/stanfordnlp/stanza/issues/797

- Danish model
https://github.com/stanfordnlp/stanza/pull/910/commits/3783cc494ee8c6b6d062c4d652a428a04a4ee839

- Norwegian model (both NB and NN)
https://github.com/stanfordnlp/stanza/pull/910/commits/31fa23e5239b10edca8ecea46e2114f9cc7b031d

- Use updated Ukrainian data (thank you gawy)
https://github.com/stanfordnlp/stanza/pull/873

- Myanmar model (thank you UCSY)
https://github.com/stanfordnlp/stanza/pull/845

- Training improvements for finetuning models
https://github.com/stanfordnlp/stanza/issues/788
https://github.com/stanfordnlp/stanza/pull/791

- Fix inconsistencies in B/S/I/E tags
https://github.com/stanfordnlp/stanza/issues/928#issuecomment-1027987531
https://github.com/stanfordnlp/stanza/pull/961

- Add an option for multiple NER models at the same time, merging the results together (see the sketch after this list)
https://github.com/stanfordnlp/stanza/issues/928
https://github.com/stanfordnlp/stanza/pull/955
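
A sketch of running two NER models in one pipeline, as described above; the package names used here ("ontonotes", "ncbi_disease") are illustrative assumptions:

```python
import stanza

# Sketch: a list-valued "ner" package loads two NER models whose predictions
# are merged into a single set of entities on the document.
nlp = stanza.Pipeline("en",
                      processors="tokenize,ner",
                      package={"ner": ["ontonotes", "ncbi_disease"]})
doc = nlp("John Doe was treated for asthma at Stanford Hospital.")
print([(ent.text, ent.type) for ent in doc.ents])
```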

Constituency parser

- Dynamic oracle (improves accuracy a bit)
https://github.com/stanfordnlp/stanza/pull/866

- Missing tags now okay in the parser
https://github.com/stanfordnlp/stanza/issues/862
https://github.com/stanfordnlp/stanza/commit/04dbf4f65e417a2ceb19897ab62c4cf293187c0b

- bugfix of () not being escaped when output in a tree
https://github.com/stanfordnlp/stanza/commit/eaf134ca699aca158dc6e706878037a20bc8cbd4

- charlm integration by default
https://github.com/stanfordnlp/stanza/pull/799

- Bert integration (not the default model) (thank you vythaihn and hungbui0411)
https://github.com/stanfordnlp/stanza/commit/05a0b04ee6dd701ca1c7c60197be62d4c13b17b6
https://github.com/stanfordnlp/stanza/commit/0bbe8d10f895560a2bf16f542d2e3586d5d45b7e

- Preemptive bugfix for incompatible devices from zhaochaocs
https://github.com/stanfordnlp/stanza/issues/989
https://github.com/stanfordnlp/stanza/pull/1002

- New models:
  - DA, based on [Arboretum](http://catalog.elra.info/en-us/repository/browse/ELRA-W0084/)
  - IT, based on the [Turin treebank](http://www.di.unito.it/~tutreeb/treebanks.html)
  - JA, based on [ALT](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/)
  - PT, based on [Cintil](https://catalogue.elra.info/en-us/repository/browse/ELRA-W0055/)
  - TR, based on [Starlang](https://www.researchgate.net/publication/344829282_Creating_A_Syntactically_Felicitous_Constituency_Treebank_For_Turkish)
  - ZH, based on CTB7
