Spark-nlp

Latest version: v5.3.3

5.3.3

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introduce UAEEmbeddings for sentence embeddings using Universal AnglE Embedding, aimed at improving semantic textual similarity tasks
* Introduce critical enhancements and optimizations to the processing of the CoNLL-U format for Dependency Parsers training, including enhanced multiword token handling and improved handling of missing uPos values
* Add example notebook for `DocumentCharacterTextSplitter`
* Add example notebook for `DeBertaForZeroShotClassification`
* Add example notebooks for `BGEEmbeddings` and `MPNetEmbeddings`
* Add example notebook for `MPNetForQuestionAnswering`
* Add example notebook for `MPNetForSequenceClassification`
* Implement cache mechanism for `metadata.json`, enhancing efficiency by avoiding unnecessary downloads
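
As a rough illustration of the caching idea in the last item, the sketch below reuses a local copy of `metadata.json` while it is still fresh and calls the downloader only otherwise. It is not Spark NLP's implementation: the `load_metadata` helper, the `fetch` callback, and the one-hour `max_age_seconds` default are all assumptions made for this example.

```python
import json
import os
import tempfile
import time
from typing import Callable


def load_metadata(cache_path: str,
                  fetch: Callable[[], str],
                  max_age_seconds: float = 3600.0) -> list:
    """Return parsed metadata, downloading only if the cached copy is missing or stale.

    Hypothetical helper: `fetch` stands in for whatever actually downloads
    metadata.json and must return its contents as a JSON string.
    """
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, encoding="utf-8") as fh:
                return json.load(fh)  # cache hit: no download

    raw = fetch()  # cache miss or stale copy: download once
    with open(cache_path, "w", encoding="utf-8") as fh:
        fh.write(raw)
    return json.loads(raw)


if __name__ == "__main__":
    calls = []

    def fake_fetch() -> str:
        # Stand-in for a real download; records how often it is invoked.
        calls.append(1)
        return json.dumps([{"name": "example_model", "version": "5.3.3"}])

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metadata.json")
        load_metadata(path, fake_fetch)
        load_metadata(path, fake_fetch)  # second call is served from the cache
        print("downloads performed:", len(calls))
```

Keying freshness on the file's modification time keeps the sketch small; a real cache might instead compare a remote checksum or version field.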

----------------
Bug Fixes
----------------
* Address a bug with serializing ONNX models that lack a `.onnx_data` file, ensuring better reliability in model serialization processes
* Delete redundant `Multilingual_Translation_with_M2M100.ipynb` notebook entries
* Fix Colab link for the M2M100 notebook
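
The `.onnx_data` fix above concerns models that ship without an external-data file. A minimal sketch of treating that file as optional when collecting what to serialize might look like the following. The `files_to_serialize` helper and the assumption that the external data sits next to the model as `<model file name>_data` are illustrative only and do not describe Spark NLP's actual code.

```python
import tempfile
from pathlib import Path


def files_to_serialize(model_path: Path) -> list:
    """Return the model file plus its external-data file, if one exists.

    Hypothetical helper. Assumes the optional external data is stored next
    to the model under the name "<model file name>_data",
    e.g. model.onnx -> model.onnx_data.
    """
    if not model_path.is_file():
        raise FileNotFoundError(f"ONNX model not found: {model_path}")

    files = [model_path]
    external = model_path.with_name(model_path.name + "_data")
    if external.is_file():  # small models may have no external data at all
        files.append(external)
    return files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "model.onnx"
        model.write_bytes(b"placeholder")  # not a real ONNX model
        print([p.name for p in files_to_serialize(model)])
```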

========

5.3.2

========
----------------
Bug Fixes
----------------
* Fix and add notebooks to import models from Hugging Face
* Add ONNX and TensorFlow notebooks
* Fix XlnetForSequenceClassification and add XlnetForTokenClassification
* Rename DistilBertForZeroShotClassification
* Add missing notebooks
* Add MPNetEmbeddings to annotator
* Fix XLMRoBertaForQuestionAnswering, XLMRoBertaForTokenClassification, and XLMRoBertaForSequenceClassification by reverting the change in tfFile naming that caused exceptions while loading and saving the models
* Fix documentation for sparknlp.start()

========

5.3.1

========
----------------
Bug Fixes
----------------
* Fix M2M100 not working on the second run (closing the ONNX Session by mistake)
* Fix ONNX models failing in clusters like Databricks
* Fix `ZeroShotNerClassification` issue with NerConverter
* Add Colab notebook for M2M100

========

5.3.0

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing Llama-2 and all the models fine-tuned based on this architecture. This is our very first CausalLM annotator in ONNX, and it comes with support for INT4 and INT8 quantization on CPUs.
* **NEW:** Introducing `MPNetForSequenceClassification` annotator for sequence classification tasks. This annotator is based on the MPNet architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Introducing `MPNetForQuestionAnswering` annotator for question answering tasks. This annotator is based on the MPNet architecture and is designed to answer questions based on a given context.
* **NEW:** Introducing `M2M100` state-of-the-art multilingual translation. M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
* **NEW:** Introducing a new `DeBertaForZeroShotClassification` annotator for zero-shot classification tasks. This annotator is based on the DeBERTa architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Implement a retrieval feature in our `DocumentSimilarity` annotator. The new DocumentSimilarity ranker ranks documents by their similarity to a given query document. It is designed to be efficient and scalable, making it suitable for a variety of RAG applications.
* Add ONNX support for `BertForZeroShotClassification` annotator.
* Add support for in-memory use of the `WordEmbeddingsModel` annotator in serverless clusters. We initially introduced the in-memory feature for users running inside a Kubernetes cluster without any `HDFS`; it now also runs without issue locally and on Google Colab, Kaggle, Databricks, AWS EMR, GCP, and AWS Glue.
* New Whisper Large and Distil models.
* Update ONNX Runtime to 1.17.0
* Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, 14.3 GPU
* Support new EMR 6.15.0 and 7.0.0 versions
* Add notebook to fine-tune a BERT for Sentence Embeddings in Hugging Face and import it to Spark NLP
* Add notebook to import BERT for Zero-Shot classification from Hugging Face
* Add notebook to import DeBERTa for Zero-Shot classification from Hugging Face
* Update EntityRuler documentation
* Improve SBT project and resolve warnings (almost!)
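
The "9,900 directions of 100 languages" figure in the `M2M100` item above follows from counting ordered (source, target) pairs in which the two languages differ. The short check below works that arithmetic out; the `translation_directions` helper and the four-language sample are made up for illustration.

```python
from itertools import permutations


def translation_directions(num_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


# Cross-check the formula by enumerating pairs for a small, made-up language set.
sample = ["en", "fr", "de", "zh"]
pairs = list(permutations(sample, 2))
assert len(pairs) == translation_directions(len(sample))  # 4 * 3 = 12 pairs

print(translation_directions(100))  # 100 * 99 = 9900
```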

----------------
Bug Fixes
----------------
* Fix Spark NLP configuration so that `cluster_tmp_dir` can be set on Databricks' DBFS via `spark.jsl.settings.storage.cluster_tmp_dir` https://github.com/JohnSnowLabs/spark-nlp/issues/14129
* Fix score calculation in `RoBertaForQuestionAnswering` annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
* Fix optional input col validations https://github.com/JohnSnowLabs/spark-nlp/pull/14153
* Fix notebooks for importing DeBERTa classifiers https://github.com/JohnSnowLabs/spark-nlp/pull/14154
* Fix GPT2 deserialization over the cluster (Databricks) https://github.com/JohnSnowLabs/spark-nlp/pull/14177

========

5.2.3

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForQuestionAnswering annotator
* Refactor AWS SDK use in Spark NLP to reduce the overall size of the library. We have dropped the use of `bundle` and now use the `S3` SDK directly. This also minimizes incompatibilities with other libraries that use AWS SDKs
* Add new notebooks to import DeBertaForQuestionAnswering, DeBertaForSequenceClassification, and DeBertaForTokenClassification models from HuggingFace
* Add a new `DocumentTokenSplitter` notebook
* Add a new NER training notebook using DeBerta Embeddings
* Add a new text classification training notebook using INSTRUCTOR Embeddings
* Update `RoBertaForTokenClassification` notebook
* Update `RoBertaForSequenceClassification` notebook
* Update `OpenAICompletion` notebook with new `gpt-3.5-turbo-instruct` model

----------------
Bug Fixes
----------------
* Fix `BGEEmbeddings` not downloading in Python

========

5.2.2

========
----------------
Enhancements
----------------
* Update `aws-java-sdk-bundle` dependency to a version without any CVEs

----------------
Bug Fixes
----------------
* Fix the missing `BGEEmbeddings` from annotator in Python
* Add a new BGE notebook to import models into Spark NLP
* Upload the new true `BGE` models to Spark NLP for text embeddings

========
