Spark-nlp

Latest version: v5.3.3

5.2.1

========
----------------
New Features & Enhancements
----------------
* Add support for Spark and PySpark 3.5 major release
* Support Databricks Runtimes 14.0, 14.1, 14.2, 14.0 ML, 14.1 ML, 14.2 ML, 14.0 GPU, 14.1 GPU, and 14.2 GPU
* **NEW:** Introducing the `BGEEmbeddings` annotator for Spark NLP. This annotator integrates `BGE` models, which are based on the BERT architecture, into Spark NLP. `BGEEmbeddings` generates dense vectors suitable for a variety of applications, including `retrieval`, `classification`, `clustering`, and `semantic search`, and it is compatible with the `vector databases` used with `Large Language Models (LLMs)` (see the sketch after this list)
* **NEW:** Introducing support for ONNX Runtime in DeBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForQuestionAnswering annotator
* Add a new notebook showing how to import any model from the `T5` family into Spark NLP in TensorFlow format
* Add a new notebook showing how to import any model from the `T5` family into Spark NLP in ONNX format
* Add a new notebook showing how to import any model from the `MarianNMT` family into Spark NLP in ONNX format
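
A minimal PySpark sketch of how the new `BGEEmbeddings` annotator could be used in a pipeline. The pretrained model name `bge_base` and the column names are illustrative assumptions, not taken from this release note.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BGEEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Turn raw text into Spark NLP DOCUMENT annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# BGE dense-vector embeddings; "bge_base" is an assumed pretrained name
embeddings = BGEEmbeddings.pretrained("bge_base", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("bge_embeddings")

pipeline = Pipeline(stages=[document_assembler, embeddings])

data = spark.createDataFrame(
    [["Dense vectors are useful for retrieval and semantic search."]]
).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(bge_embeddings.embeddings) AS embedding").show(truncate=False)
```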

----------------
Bug Fixes
----------------
* Fix a serialization issue that prevented the `DocumentTokenSplitter` annotator from being saved and loaded in a Pipeline
* Fix a serialization issue that prevented the `DocumentCharacterTextSplitter` annotator from being saved and loaded in a Pipeline


========

5.2.0

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `CLIPForZeroShotClassification` annotator for zero-shot image classification using OpenAI's CLIP models
* **NEW:** Introducing the `DocumentTokenSplitter`, which allows users to split large documents into smaller chunks for use in retrieval-augmented generation (RAG) with LLMs (see the sketch after this list)
* **NEW:** Introducing support for ONNX Runtime in T5Transformer annotator
* **NEW:** Introducing support for ONNX Runtime in MarianTransformer annotator
* **NEW:** Introducing support for ONNX Runtime in BertSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in XlmRoBertaSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in CamemBertForQuestionAnswering, CamemBertForTokenClassification, and CamemBertForSequenceClassification annotators
* Add caching support for newly imported T5 models in TensorFlow format, bringing their performance closer to the ONNX version
* Improve ZIP util and add tests for both ZipArchiveUtil and OnnxWrapper
* Refactor ONNX and add OnnxSession to broadcast
* Update ONNX Runtime to 1.16.3
* Add a new notebook for Structured Streaming
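
A minimal sketch of how `DocumentTokenSplitter` could be used to chunk long documents for RAG. The setter names `setNumTokens` and `setTokenOverlap` reflect the token-count and overlap options described above and should be verified against the API reference.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentTokenSplitter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into overlapping chunks measured in tokens;
# the setter names are assumptions based on the feature description.
splitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(128) \
    .setTokenOverlap(16)

pipeline = Pipeline(stages=[document_assembler, splitter])

data = spark.createDataFrame([[" ".join(["token"] * 500)]]).toDF("text")
chunks = pipeline.fit(data).transform(data)
chunks.selectExpr("explode(splits.result) AS chunk").show(truncate=80)
```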

----------------
Bug Fixes
----------------
* Fix a random dimension mismatch in E5Embeddings and MPNetEmbeddings caused by a missing `average_pool` step after `last_hidden_state` in the output
* Fix a batching exception in the E5 and MPNet embeddings annotators that failed when `sentence` was used as input instead of `document`
* Fix chunk construction when an entity is found
* Fix a bug in the library's version number in Scala
* Fix Whisper models not downloading due to a wrong library version
* Fix and refactor saving the best model based on the given metrics during NerDL training


========

5.1.4

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `DocumentCharacterTextSplitter`, which allows users to split large documents into smaller chunks. `DocumentCharacterTextSplitter` tries a list of separators in order and splits subtexts that exceed the chunk length, with optional overlap between chunks (see the sketch after this list).
* **NEW:** Introducing support for ONNX Runtime in RobertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForQuestionAnswering annotator
* Add an example of loading a model directly from Azure using the `.load()` method, showing how to configure Spark NLP to load models from Azure
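
A minimal sketch of the `DocumentCharacterTextSplitter` described above. The setter names (`setChunkSize`, `setChunkOverlap`, `setSplitPatterns`) mirror the separator-list and chunk-length behavior in the release note and are assumptions to confirm against the API reference.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentCharacterTextSplitter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Try separators in order (paragraph break, newline, space) and split any
# subtext that exceeds the chunk size, with a small overlap between chunks.
# Setter names are assumptions based on the feature description.
splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(40) \
    .setChunkOverlap(5) \
    .setSplitPatterns(["\n\n", "\n", " ", ""])

pipeline = Pipeline(stages=[document_assembler, splitter])

data = spark.createDataFrame(
    [["First paragraph.\n\nA second, much longer paragraph that will be split into chunks."]]
).toDF("text")
pipeline.fit(data).transform(data).selectExpr("explode(splits.result) AS chunk").show(truncate=80)
```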

----------------
Bug Fixes
----------------
* Fix a bug in the `Whisper` annotator that prevented some models from being imported
* Fix the BPE Tokenizer by adding a flag that controls whether a space is always prepended before words (the previous behavior for embeddings)
* Fix the BPE Tokenizer to correctly convert and tokenize non-Latin and other special characters/words
* Fix `RobertaForQuestionAnswering` to produce the same logits and indexes as the implementation in the Transformers library
* Fix the return order of logits in `BertForQuestionAnswering` and `DistilBertForQuestionAnswering` annotators


========

5.1.3

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in BertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in BertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in BertForQuestionAnswering annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForQuestionAnswering annotator
* **NEW:** Allow setting ONNX configuration, such as the GPU device ID and execution mode, via Spark NLP configs (see the sketch after this list)
* Update Whisper documentation with minimum required version of Spark/PySpark (3.4)
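
A hedged sketch of passing ONNX Runtime settings as Spark configs when the session is created. The `spark.jsl.settings.onnx.*` keys below are assumptions based on this release note; check the Spark NLP configuration documentation for the exact names and accepted values.

```python
from pyspark.sql import SparkSession

# ONNX Runtime options supplied as Spark configs at session start.
# The config keys are assumptions; verify them in the Spark NLP docs.
spark = (
    SparkSession.builder
    .appName("spark-nlp-onnx-config")
    .master("local[*]")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")
    .config("spark.jsl.settings.onnx.gpuDeviceId", "0")        # assumed key: GPU device id
    .config("spark.jsl.settings.onnx.intraOpNumThreads", "4")  # assumed key: CPU threads
    .getOrCreate()
)
```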

----------------
Bug Fixes
----------------
* Fix `module 'sparknlp.annotator' has no attribute 'Token2Chunk'` error in Python when using `Token2Chunk` annotator inside loaded PipelineModel



========

5.1.2

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the VisionEncoderDecoder annotator to generate captions from images (see the sketch after this list)
* Add missing entries in the docs and update them with the new features
* Improve beam search results in BART Transformer
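
A sketch of generating image captions with the new annotator. The `VisionEncoderDecoderForImageCaptioning` class name and the default pretrained model are assumptions; the release note only names the VisionEncoderDecoder annotator.

```python
import sparknlp
from sparknlp.base import ImageAssembler
from sparknlp.annotator import VisionEncoderDecoderForImageCaptioning
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Load images with Spark's built-in image data source
images = spark.read.format("image").option("dropInvalid", True).load("path/to/images")

image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Class and default pretrained model are assumptions for this sketch
captioner = VisionEncoderDecoderForImageCaptioning.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("caption")

pipeline = Pipeline(stages=[image_assembler, captioner])
pipeline.fit(images).transform(images) \
    .selectExpr("explode(caption.result) AS caption").show(truncate=False)
```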


========

5.1.1

========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in MPNet embedding annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForQuestionAnswering annotator
* Implement the `getVectors` feature in the Word2VecModel, Doc2VecModel, and WordEmbeddingsModel annotators. This new feature gives access to all tokens and their vectors in the loaded model (see the sketch after this list).
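
A small sketch of the new `getVectors` accessor. The pretrained name `glove_100d` is a long-standing Spark NLP model; the concrete return type of `getVectors()` is not stated in the release note, so it is only inspected here.

```python
import sparknlp
from sparknlp.annotator import WordEmbeddingsModel

spark = sparknlp.start()

# Load a pretrained static-embeddings model
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")

# New in 5.1.1: access every token and its vector from the loaded model.
# The return type is an assumption, so only its type is printed here.
vectors = embeddings.getVectors()
print(type(vectors))
# vectors.show(5)  # if getVectors() returns a Spark DataFrame
```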

----------------
Bug Fixes
----------------
* Fix how to save and load `Whisper` models
* Fix saving ONNX model on Windows operating system


========
