Stanza

Latest version: v1.8.2

Conjunction-Aware Word-Level Coreference Resolution
https://arxiv.org/abs/2310.06165

original implementation: https://github.com/KarelDO/wl-coref/tree/master

Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref

If you use Stanza's coref module in your work, please be sure to cite both of the above papers.

Special thanks to [vdobrovolskii](https://github.com/vdobrovolskii), who graciously agreed to let us integrate his work into Stanza; to KarelDO for his support of the training enhancement; and to Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.

Currently there is one model provided: a transformer-based English model trained on OntoNotes. It is based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. Once we have LoRA integration with POS, depparse, and the other processors, we will revisit which transformer is most appropriate for English.

Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.

https://github.com/stanfordnlp/stanza/pull/1309

Interface change: English MWT

English now has an MWT model by default. Text such as `won't` is now marked as a single **token**, split into two **words**, `will` and `not`. Previously it was expected to be tokenized into two pieces, but the `Sentence` object containing that text would not have a single `Token` object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce **one** object for MWTs such as `won't`, `cannot`, `Stanza's`, etc.

Pipeline creation does not change, as MWT is automatically (but not silently) added at `Pipeline` creation time if the language and package include MWT.

https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1

Other updates

- NetworkX representation of enhanced dependencies, allowing easier usage of Semgrex on them. Searching over enhanced dependencies requires CoreNLP >= 4.5.6 https://github.com/stanfordnlp/stanza/pull/1295 https://github.com/stanfordnlp/stanza/pull/1298
- Sentence ending punct tags improved for English to avoid labeling non-punct as punct (and POS is switched to using a DataLoader) https://github.com/stanfordnlp/stanza/issues/1000 https://github.com/stanfordnlp/stanza/pull/1303
- Optional rewriting of MWTs after the MWT processing step, giving the user more control over fixing common errors. We still encourage posting issues on GitHub so we can fix them for everyone! https://github.com/stanfordnlp/stanza/pull/1302
- Remove deprecated output methods such as `conll_as_string` and `doc2conll_text`. Use `"{:C}".format(doc)` instead https://github.com/stanfordnlp/stanza/commit/e01650f9c56382495082a9a24fa0310414c46651
- Mixed OntoNotes and WorldWide NER model for English is now the default. Future versions may include CoNLL 2003 and CoNLL++ data as well.
- Sentences now have a `doc_id` field if the document they are created from has a `doc_id`. https://github.com/stanfordnlp/stanza/pull/1314/commits/8e2201f42cb99a5a3d8358ce59501c1d88f2585e
- Optional processors added in cases where the user may not want the model we have run by default. For example, conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model) https://github.com/stanfordnlp/stanza/pull/1314/commits/3d90d2b8a82048c5cea549b654e52544ed241833

Updated requirements

- Support dropped for Python 3.6 and 3.7. The `peft` module used for finetuning the transformer in the coref processor does not support those versions.
- Added `peft` as an optional dependency for transformer-based installations
- Added `networkx` as a dependency for reading enhanced dependencies. Added `toml` as a dependency for reading the coref config.

1.8.2

Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.

Old English

- Add Old English (ANG) annotation! Thank you to dmetola https://github.com/stanfordnlp/stanza/issues/1365

MWT improvements

- Fix words ending in `-nna` being incorrectly split into MWTs https://github.com/stanfordnlp/handparsed-treebank/commit/2c48d4093daddc790bf89d7b35c47ee4d7d272d1 https://github.com/stanfordnlp/stanza/issues/1366

- Fix English MWTs splitting into spurious words by enforcing that the pieces add up to the whole token (which is always the case in the English treebanks) https://github.com/stanfordnlp/stanza/issues/1371 https://github.com/stanfordnlp/stanza/pull/1378

- Mark `start_char` and `end_char` on an MWT if it is composed of exactly its subwords https://github.com/stanfordnlp/stanza/commit/23840891c37d54a5cf491ea58b0702987dd4a6d7 https://github.com/stanfordnlp/stanza/issues/1361

Peft memory management

- Previous versions loaded multiple copies of the transformer in order to use adapters. To save memory, we now use PEFT's ability to attach multiple adapters, each under a different name, to the same transformer. This allows loading just one copy of the transformer when using a Pipeline with several finetuned models. https://github.com/huggingface/peft/issues/1523 https://github.com/stanfordnlp/stanza/pull/1381 https://github.com/stanfordnlp/stanza/pull/1384

Other bugfixes and minor upgrades

- Fix crash when trying to load previously unknown language https://github.com/stanfordnlp/stanza/issues/1360 https://github.com/stanfordnlp/stanza/commit/381736f8fb9b60a929002cc750bd0df3d7dad03a

- Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: https://github.com/stanfordnlp/stanza/commit/d180ae02b278dd09dff53bc910e7aa43656e944d https://github.com/stanfordnlp/stanza/issues/1367

- Try to avoid OOM in the POS processor in the Pipeline by reducing its max batch length https://github.com/stanfordnlp/stanza/commit/42718135e2ab4b145bbb5861d55bb9424ca3549f

- Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to Jemoka) https://github.com/stanfordnlp/stanza/commit/597d48f1ead89fa9a0cca86cf9f0b530ed249792

Other upgrades

- Add \* to the list of functional tags to drop in the constituency parser, helping Icelandic annotation https://github.com/stanfordnlp/stanza/commit/57bfa8bbd8d3d42d4ee29d4a406640b126ce0f46 https://github.com/stanfordnlp/stanza/issues/1356#issuecomment-1981216912

- Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: https://github.com/stanfordnlp/stanza/commit/4048caed1b89030082d23b8f71d23bae6c9c54f1 https://github.com/stanfordnlp/stanza/commit/15b136bb30dda272d318a61a5f602e7fc81e7a31

- Add a constituency model for German https://github.com/stanfordnlp/stanza/commit/7a4f48c738f0db8923aa5da88d0a9743eaee4c6a https://github.com/stanfordnlp/stanza/commit/86ddaab31c73a7d0a389d0557f3696c29d441657 https://github.com/stanfordnlp/stanza/issues/1368

1.8.1

Additional 1.8.1 Bugfixes

- Older POS models not loaded correctly... need to use `.get()` https://github.com/stanfordnlp/stanza/commit/13ee3d5cbc2c9174c3e0c67ca75b580e4fe683b1 https://github.com/stanfordnlp/stanza/issues/1357
- Debug logging for the Constituency retag pipeline to better support someone working on Icelandic https://github.com/stanfordnlp/stanza/commit/6e2520f24d63fa8af4136f10137e57b195fda20a https://github.com/stanfordnlp/stanza/issues/1356
- `device` arg in `MultilingualPipeline` would crash if `device` was passed for an individual `Pipeline`: https://github.com/stanfordnlp/stanza/commit/44058a0ec296c6da5997bfaf8911a26d425d2cec

1.8.0

Integrating PEFT into several different annotators

We integrate [PEFT](https://github.com/huggingface/peft) into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the `default_accurate` model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the `default_accurate` package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

- POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results https://github.com/stanfordnlp/stanza/pull/1320
- Sentiment trained with peft on the transformer: noticeably improves results for each model. SST scores go from 68 F1 w/ charlm, to 70 F1 w/ transformer, to 74-75 F1 with finetuned or Peft finetuned transformer. https://github.com/stanfordnlp/stanza/pull/1335
- NER also trained with peft: unfortunately, no consistent improvements to scores https://github.com/stanfordnlp/stanza/pull/1336
- depparse includes peft: no consistent improvements yet https://github.com/stanfordnlp/stanza/pull/1337 https://github.com/stanfordnlp/stanza/pull/1344
- Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser https://github.com/stanfordnlp/stanza/pull/1341
- Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. https://github.com/stanfordnlp/stanza/pull/1347
- Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. https://github.com/stanfordnlp/stanza/pull/1348
- Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. https://github.com/stanfordnlp/stanza/pull/1346 https://github.com/stanfordnlp/stanza/issues/1345

Features

- Include SpacesAfter annotations on words in the CoNLL output of documents: https://github.com/stanfordnlp/stanza/issues/1315 https://github.com/stanfordnlp/stanza/pull/1322
- Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. https://github.com/stanfordnlp/stanza/pull/1331 https://github.com/stanfordnlp/stanza/issues/1330
- wandb support for coref https://github.com/stanfordnlp/stanza/pull/1338
- Coref annotator breaks length ties using POS if available https://github.com/stanfordnlp/stanza/issues/1326 https://github.com/stanfordnlp/stanza/commit/c4c3de5803f27843a5050e10ccae71b3fd9c45e9

Bugfixes

- Using a proxy with `download_resources_json` was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you ider-zh
- Fix deprecation warnings for escape sequences: https://github.com/stanfordnlp/stanza/pull/1321 https://github.com/stanfordnlp/stanza/issues/1293 Thank you sterliakov
- Coref training rounding error https://github.com/stanfordnlp/stanza/pull/1342
- Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice https://github.com/stanfordnlp/stanza/pull/1354
- V1 of chopping longer texts into shorter pieces so the transformers can get around their length limits. It is not yet clear whether this produces reasonable results for words after the token limit. https://github.com/stanfordnlp/stanza/pull/1350 https://github.com/stanfordnlp/stanza/issues/1294
- Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: https://github.com/stanfordnlp/stanza/issues/1333 https://github.com/stanfordnlp/stanza/issues/1339 https://github.com/stanfordnlp/stanza/commit/f1fbaaad983e58dc3fcf318200d685663fb90737
- Clarify error when a language is only partially handled: https://github.com/stanfordnlp/stanza/commit/da01644b4ba5ba477c36e5d2736012b81bcd00d4 https://github.com/stanfordnlp/stanza/issues/1310

1.7.0

Neural coref processor added!

Conjunction-Aware Word-Level Coreference Resolution

1.6.1

V1.6.1 is a patch release fixing a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the `setup.py` classifiers. **This will be the last release that supports Python 3.6**

Multiple model levels

The `package` parameter for building the `Pipeline` now has three default settings:

- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- `default_fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- `default_accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into `-fast` and `-accurate` versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287

addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models can now learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could save a bit of encoder time when tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:


| model | on OntoNotes | on WorldWide |
|---|---|---|
| original | 88.71 | 69.29 |
| simplify-separate | 88.24 | 75.75 |
| simplify-connected | 88.32 | 75.47 |



We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.

Future plans include using multiple NER datasets for other models as well.
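The shared-encoder idea described above can be sketched as follows. This is a conceptual illustration only, not Stanza's implementation; the layer sizes and head names are hypothetical, and BIO tag encoding is ignored:

```python
import torch
from torch import nn

class MultiHeadTagger(nn.Module):
    """One shared encoder feeding a separate output layer per tagset."""
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Both heads reuse the same encoder weights
        self.heads = nn.ModuleDict({
            "ontonotes": nn.Linear(2 * hidden, 18),  # 18-class tagset
            "worldwide": nn.Linear(2 * hidden, 8),   # 8-class tagset
        })

    def forward(self, embedded, head):
        encoded, _ = self.encoder(embedded)
        return self.heads[head](encoded)

tagger = MultiHeadTagger()
x = torch.randn(1, 5, 32)             # one sentence, 5 words, toy embeddings
onto_scores = tagger(x, "ontonotes")  # per-word scores over 18 classes
ww_scores = tagger(x, "worldwide")    # per-word scores over 8 classes
```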

Other features

- Postprocessing of proposed tokenization possible with dependency injection on the Pipeline (thanks to Jemoka). When creating a `Pipeline`, you can now provide a `callable` via the `tokenize_postprocessor` parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline` https://github.com/stanfordnlp/stanza/pull/1290

- Finetuning for transformers in the NER models: have not yet found helpful settings, though https://github.com/stanfordnlp/stanza/commit/45ef5445f44222df862ed48c1b3743dc09f3d3fd

- SE and SME should both represent Northern Sami, an unusual case where UD did not use the standard two-letter code https://github.com/stanfordnlp/stanza/issues/1279 https://github.com/stanfordnlp/stanza/commit/88cd0df5da94664cb04453536212812dc97339bb

- charlm for PT (improves accuracy on non-transformer models): https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a

- build models with transformers for a few additional languages: MR, AR, PT, JA https://github.com/stanfordnlp/stanza/commit/45b387531c67bafa9bc41ee4d37ba0948daa9742 https://github.com/stanfordnlp/stanza/commit/0f3761ee63c57f66630a8e94ba6276900c190a74 https://github.com/stanfordnlp/stanza/commit/c55472acbd32aa0e55d923612589d6c45dc569cc https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a
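The `tokenize_postprocessor` callable mentioned above can be sketched as follows. The exact structure Stanza passes to the callback is defined in PR 1290; purely for illustration, this sketch ASSUMES each sentence arrives as a list of token strings:

```python
# ASSUMPTION for illustration: the postprocessor receives and returns
# one list of token strings per sentence (see PR 1290 for the real shape).
def merge_apostrophes(sentences):
    """Example postprocessor: re-join tokens split on a leading apostrophe."""
    fixed = []
    for sentence in sentences:
        tokens = []
        for tok in sentence:
            if tok.startswith("'") and tokens:
                tokens[-1] += tok  # glue "'s" back onto the previous token
            else:
                tokens.append(tok)
        fixed.append(tokens)
    return fixed

# Passed at Pipeline creation time (requires stanza):
# nlp = stanza.Pipeline("en", tokenize_postprocessor=merge_apostrophes)

print(merge_apostrophes([["Stanza", "'s", "coref", "module"]]))
```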


Bugfixes

- V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: https://github.com/stanfordnlp/stanza/commit/b56f442d4d179c07411a44a342c224408eb6a6a9

- Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 https://github.com/stanfordnlp/stanza/commit/c71bf3fdac8b782a61454c090763e8885d0e3824

- `run_ete.py` was not correctly processing the charlm, meaning the whole thing wouldn't actually run https://github.com/stanfordnlp/stanza/commit/16f29f3dcf160f0d10a47fec501ab717adf0d4d7

- Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 https://github.com/stanfordnlp/stanza/commit/82a02151da17630eb515792a508a967ef70a6cef
