spark-nlp

Latest version: v5.3.3


========
4.2.8
========
----------------
Bug Fixes & Enhancements
----------------
* Fix the issue with optional keys (labels) in metadata when using `XXXForSequenceClassification` annotators. The output `Some(neg) -> 0.13602075` is now returned as `neg -> 0.13602075`, in harmony with all the other classifiers. https://github.com/JohnSnowLabs/spark-nlp/pull/13396
* Introduce a config to skip `LightPipeline` validation of `inputCols` on the Python side for projects depending on Spark NLP. This toggle should only be used for specific annotators that do not follow the convention of predefined `inputAnnotatorTypes` and `outputAnnotatorType` (see the sketch below)
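A minimal sketch of what this validation guards against, using the public `LightPipeline` API. The name of the skip toggle below is an assumption for illustration only; check the release's PR for the real config name:

```python
from sparknlp.base import LightPipeline

# `fitted_model` is assumed to be a PipelineModel that contains a custom
# annotator which does not declare the conventional
# inputAnnotatorTypes/outputAnnotatorType attributes.
light = LightPipeline(fitted_model)

# Hypothetical toggle; the real config name introduced in 4.2.8 may differ.
# With validation on (the default), the custom stage would be rejected.
light.skip_pipeline_validation = True  # assumed name, for illustration only

print(light.annotate("Some text the custom annotator should process"))
```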


========
4.2.7
========
----------------
Bug Fixes & Enhancements
----------------
* Fix an `outputAnnotatorType` issue in pipelines with the `Finisher` annotator. This change adds `outputAnnotatorType` to `AnnotatorTransformer` to avoid loading the `outputAnnotatorType` attribute when a stage in a pipeline does not use it
* Fix the wrong sentence index calculation in metadata by annotators in the pipeline when the `explodeSentences` param was set to `true` in the `SentenceDetector` annotator (see the sketch after this list)
* Fix an issue in `Tokenizer` when a custom pattern is used with lookaheads/lookbehinds and it has zero-width matches, which led to indexes not being calculated correctly
* Fix missing embeddings in the output of the `.fullAnnotate()` method when the `parseEmbeddings` param was set to `True/true`
* Fix broken links to the Python API pages, as the generation of the PyDocs was slightly changed in a previous release. This makes the Python APIs accessible from the Annotators and Transformers pages like before
* Change the default values of the `explodeEntities` and `mergeEntities` parameters to `true`
* Better error handling when there are empty paths/relations in the `GraphExtraction` annotator. The new message better guides the user on how to configure `GraphExtraction` to output meaningful relationships
* Remove the duplicated definition of the method `setWeightedDistPath` from `ContextSpellCheckerApproach`
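As context for the `explodeSentences` fix above, a minimal sketch of the setting involved (standard Spark NLP API; column names are arbitrary):

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# With explodeSentences=True, every detected sentence becomes its own row;
# the fix ensures the sentence index stored in metadata stays correct for
# annotators further down the pipeline.
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setExplodeSentences(True)
```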


========
4.2.6
========
----------------
Enhancements
----------------
* Update Spark & PySpark dependencies from 3.2.1 to 3.2.3 in the provided scripts and in all the documentation

----------------
Bug Fixes
----------------
* Fix the broken `TypedDependencyParserApproach` and `TypedDependencyParserModel` annotators used in Python (this bug was introduced in the 4.2.5 release)
* Fix the broken Python API documentation


========
4.2.5
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the **CamemBertForSequenceClassification** annotator in Spark NLP 🚀. `CamemBertForSequenceClassification` can load CamemBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForSequenceClassification` for PyTorch or `TFCamembertForSequenceClassification` for TensorFlow in HuggingFace 🤗 (see the example after this list)
* **NEW:** Add `AnnotatorType` validation in Spark NLP `LightPipeline`. Currently, a misconfiguration of `inputCols` in an annotator in a pipeline raises an exception when using the `transform` method, but in `LightPipeline` it only outputs empty values. This behavior can confuse users, so this change introduces a validation that raises an exception in `LightPipeline` too
* Add `outputAnnotatorType` for all annotators in Python
* Add `inputAnnotatorTypes` and `outputAnnotatorType` requirement validation for all subclasses derived from `AnnotatorApproach` and `AnnotatorModel`
* Add `AnnotatorType` validation in `LightPipeline`
* Add validation for the number and type of columns set in the `TFNerDLGraphBuilder` annotator, in an effort to avoid wrong column definitions when using Spark NLP annotators in Python
* Add more details to Alphabet error message in `EntityRuler` annotator to better guide users
* Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
* Refactor and implement better error handling in `ResourceDownloader`. This change removes `getObjectFromS3`, allowing the AWS SDK to raise the corresponding error. In addition, this change also refactors `ResourceDownloader` to reflect the intention of each credential type in the downloader
* Implement a full build and test of all unit tests based on the Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
* Upgrade `sbt-assembly` to `1.2.0`, which comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
* Update `sbt` to `1.8.0` with improvements and bug fixes, but mostly CVE fixes:
  * Updates to Coursier 2.1.0-RC1 to address https://github.com/advisories/GHSA-wv7w-rj2x-556x
  * Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address https://github.com/advisories/GHSA-wv7w-rj2x-556x
  * Use the new `withIncludeScala` in `assemblyOption` instead of `value`
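A minimal sketch of the new `CamemBertForSequenceClassification` annotator in a pipeline (referenced from the first item above). Calling `pretrained()` with no arguments downloads the default model for the annotator; the sample text is arbitrary:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, CamemBertForSequenceClassification

spark = sparknlp.start()

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Loads the default pretrained sequence-classification model for CamemBERT
classifier = CamemBertForSequenceClassification.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

data = spark.createDataFrame([["J'ai adoré ce film !"]], ["text"])
result = Pipeline(stages=[document, tokenizer, classifier]).fit(data).transform(data)
result.select("class.result").show(truncate=False)
```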


----------------
Bug Fixes
----------------
* Fix an issue with the `BigTextMatcher` annotator, where it would not match entities with overlapping definitions. For example, if both `lung` and `lung cancer` are defined, `lung` would not be matched in a given text. This was due to an abstraction error in one of the subclasses of `BigTextMatcher` during construction of the underlying data structure
* Fix an indexing issue in the `RegexTokenizer` annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other annotators
* Refactor the `Resolvers` object in Spark NLP's dependencies to avoid the conflict with the resolvers inside the new `sbt`


========
4.2.4 (Not secure)
========
----------------
New Features & Enhancements
----------------
* Introduce support for GCP storage to be allowed as the `cache_pretrained` directory for keeping all downloaded models and pipelines (see the config sketch after this list)
* Update to TensorFlow 2.7.4 with bug and CVEs fixes
* Update documentation on how to use `testDataset` param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach
* Update installation instructions for Apple M1 chip
* Improve error handling while importing external TensorFlow models into Spark NLP
* Improve error messages when importing external models from remote storages like DBFS, S3, and HDFS
* Add support for future decoder-encoder models (2 separate models)
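A sketch of pointing `cache_pretrained` at a GCS bucket when starting the session. The bucket and project values are placeholders, and the GCP project-id key plus the GCS Hadoop connector setup are assumptions to verify against the installation docs for your version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP with a GCP cache folder") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4") \
    .config("spark.jsl.settings.pretrained.cache_folder", "gs://my-bucket/cache_pretrained") \
    .config("spark.jsl.settings.gcp.project_id", "my-gcp-project") \
    .getOrCreate()  # assumes GCS connector and credentials are configured on the cluster
```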

----------------
Bug Fixes
----------------
* Add missing `setPreservePosition` in `NerConverter` (see the sketch after this list)
* Add missing `inputAnnotatorTypes` to the `BigTextMatcher`, `ViveknSentimentModel`, and `NerConverter` annotators
* Fix all the wrong example code provided for `LemmatizerModel` in Models Hub
* Fix the provided notebook to import Longformer models from HF: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20Longformer.ipynb
* Fix the `t5_grammar_error_corrector` model to be compatible with Spark NLP 4.0+
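A sketch of the now-exposed `setPreservePosition` on the Python side; the upstream NER stages that produce the `sentence`, `token`, and `ner` columns are assumed:

```python
from sparknlp.annotator import NerConverter

# preservePosition controls whether the chunk begin/end indexes follow the
# original document positions of the tokens that form each chunk.
converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setPreservePosition(True)
```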


========
4.2.3 (Not secure)
========
----------------
New Features & Enhancements
----------------
* Implement a new control over the number of accepted columns in Python. This syncs the behavior between Scala and Python when a user sets more columns than allowed inside `setInputCols`
* Add a metadata sentence key parameter to select which metadata field to use as the sentence in the `CoNLLGenerator` annotator
* Include escaping in the `CoNLLGenerator` annotator when writing to CSV and preserve special-char tokens
* Add documentation for the new `IAnnotation` feature for Scala users
* Add `rules` and `delimiter` parameters to the `RegexMatcher` annotator to support string input in addition to a file:
```python
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")
```
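For context, a run of the snippet above might look like this, assuming `document` and `sentence` stages (a DocumentAssembler and a SentenceDetector) are defined as usual and `df` has a `text` column:

```python
from pyspark.ml import Pipeline

# RegexMatcher is an AnnotatorApproach, so the pipeline needs a fit() step
model = Pipeline(stages=[document, sentence, regexMatcher]).fit(df)
model.transform(df).selectExpr("explode(regex.result) as match").show(truncate=False)
```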


----------------
Bug Fixes
----------------
* Fix `NotSerializableException` when `WordEmbeddings` is used over a K8s cluster while `setEnableInMemoryStorage` is set to `true` (see the sketch after this list)
* Fix a bug in the `RegexTokenizer` annotator where it output the wrong indexes if the pattern includes splits that are not followed by a space
* Fix the training module failing on EMR due to a bad Apache Spark version detection. The following classes were fixed: `CoNLL()`, `CoNLLU()`, `POS()`, and `PubTator()`
* Fix a bug in the `CoNLLGenerator` annotator when a token has non-integer metadata
* Fix the wrong SentencePiece model name required for `DeBertaForQuestionAnswering` and `DeBertaEmbeddings` when importing models
* Fix `NaN` results in some `ViTForImageClassification` models/pipelines
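Related to the first fix above, a sketch of the setting involved; the storage path, dimension, and storage ref are placeholders:

```python
from sparknlp.annotator import WordEmbeddings

# enableInMemoryStorage loads the whole indexed storage into memory instead
# of serving lookups from the on-disk RocksDB index.
embeddings = WordEmbeddings() \
    .setStoragePath("custom_glove.txt", "TEXT") \
    .setDimension(100) \
    .setStorageRef("custom_glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setEnableInMemoryStorage(True)
```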
