Spark-nlp

Latest version: v5.3.3


4.0.0

Not secure
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **AlbertForQuestionAnswering** annotator in Spark NLP 🚀. `AlbertForQuestionAnswering` can load `ALBERT` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for **PyTorch** or `TFAlbertForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **BertForQuestionAnswering** annotator in Spark NLP 🚀. `BertForQuestionAnswering` can load `BERT` & `ELECTRA` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for **PyTorch** or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **DeBertaForQuestionAnswering** annotator in Spark NLP 🚀. `DeBertaForQuestionAnswering` can load `DeBERTa` v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for **PyTorch** or `TFDebertaV2ForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **DistilBertForQuestionAnswering** annotator in Spark NLP 🚀. `DistilBertForQuestionAnswering` can load `DistilBERT` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for **PyTorch** or `TFDistilBertForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForQuestionAnswering** annotator in Spark NLP 🚀. `LongformerForQuestionAnswering` can load `Longformer` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for **PyTorch** or `TFLongformerForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **RoBertaForQuestionAnswering** annotator in Spark NLP 🚀. `RoBertaForQuestionAnswering` can load `RoBERTa` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for **PyTorch** or `TFRobertaForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForQuestionAnswering** annotator in Spark NLP 🚀. `XlmRoBertaForQuestionAnswering` can load `XLM-RoBERTa` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for **PyTorch** or `TFXLMRobertaForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **MultiDocumentAssembler** annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in the XXXForQuestionAnswering annotators (see the sketch after this list)
* Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations result in performance improvements from +50% to +700% (more details in the Benchmarks section)
* **NEW:** Introducing **SpanBertCorefModel** annotator for coreference resolution on BERT and SpanBERT models, an implementation based on the [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091) paper.
* Support for 2 inputs in LightPipeline when used with MultiDocumentAssembler
* Migrate T5Transformer to the TensorFlow v2 architecture and re-upload all existing models
* Official support for Apple silicon M1 on macOS devices. From Spark NLP 4.0.0 you can use `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine
* Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped for Spark 3.2.x by default and additionally supports Spark/PySpark 3.0.x and 3.1.x
* Unifying all supported Apache Spark packages on Maven into `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for the new Apple silicon M1 on macOS. The need for Apache Spark-specific packages like `spark-nlp-spark32` has been removed.
* Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (`m1=True`)
* Update Colab, Kaggle, and SageMaker scripts
* Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
* Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
* Allow changing case sensitivity. Previously, users could not set the setCaseSensitive param; this allows users to change the value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
* Keep accuracy in ClassifierDL and SentimentDL between 0.0 and 1.0 during training
* Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
* Refactor the entire Python module in Spark NLP to make the development and maintenance easier
* Refactor unit tests in Python and migrate to pytest
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.4 LTS
* Databricks 10.4 LTS ML
* Databricks 10.4 LTS ML GPU
* Databricks 10.5
* Databricks 10.5 ML
* Databricks 10.5 ML GPU
* Welcoming a new EMR 6.x series to our Spark NLP family:
* EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
* Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
* Upgrade RocksDB with new enhancements and support for Apple silicon M1
* Upgrade SentencePiece tokenizer TF ops to 2.7.1
* Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
* Upgrade to Scala 2.12.15
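
Below is a minimal, hedged sketch of how the new question-answering annotators, MultiDocumentAssembler, and the two-input LightPipeline fit together (and how `m1=True` can be passed to `sparknlp.start()`). The default pretrained model, column names, and sample data are assumptions for illustration, not taken from the release notes.

```python
import sparknlp
from sparknlp.base import MultiDocumentAssembler, LightPipeline
from sparknlp.annotator import BertForQuestionAnswering
from pyspark.ml import Pipeline

# Start a session; on an Apple silicon machine this could be sparknlp.start(m1=True)
spark = sparknlp.start()

# Two inputs (question and context) are converted to DOCUMENT columns
assembler = (
    MultiDocumentAssembler()
    .setInputCols(["question", "context"])
    .setOutputCols(["document_question", "document_context"])
)

# Loads a default pretrained span-classification model for extractive QA
qa = (
    BertForQuestionAnswering.pretrained()
    .setInputCols(["document_question", "document_context"])
    .setOutputCol("answer")
)

pipeline = Pipeline(stages=[assembler, qa])

data = spark.createDataFrame(
    [["What is Spark NLP?", "Spark NLP is an NLP library built on Apache Spark."]]
).toDF("question", "context")

model = pipeline.fit(data)
model.transform(data).select("answer.result").show(truncate=False)

# LightPipeline accepts two inputs when MultiDocumentAssembler is used
light = LightPipeline(model)
print(light.fullAnnotate("What is Spark NLP?", "Spark NLP is an NLP library built on Apache Spark."))
```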

----------------
Bug Fixes
----------------
* Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
* Remove a requirement in DocumentNormalizer so that consecutive stage processing can produce empty text annotations without breaking the pipeline
* Fix WordSegmenterModel outputting wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
* Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings
* Fix encoding sentences by using SentencePiece to calculate the correct token indexing
* Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
* Remove non-existing parameters from DocumentAssembler in Python

----------------
Backward Compatibility
----------------
* Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 https://github.com/JohnSnowLabs/spark-nlp/pull/8319
* The start() functions in Python and Scala will no longer have `spark23`, `spark24`, and `spark32` parameters. The default `sparknlp.start()` works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags
* Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
* Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build

========

3.4.4

Not secure
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForTokenClassification** annotator in Spark NLP 🚀. `DeBertaForTokenClassification` can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForTokenClassification` for **PyTorch** or `TFDebertaV2ForTokenClassification` for **TensorFlow** models in HuggingFace
* **NEW:** Introducing **CamemBertEmbeddings** annotator in Spark NLP 🚀 (see the sketch after this list)
* Add support for BatchAnnotate to UniversalSentenceEncoder
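
A minimal sketch of the new CamemBertEmbeddings annotator in a pipeline; the default pretrained model and column names are assumptions for illustration.

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, CamemBertEmbeddings
from pyspark.ml import Pipeline

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Downloads the default CamemBERT model; pass a name/lang to pretrained() for a specific one
embeddings = (
    CamemBertEmbeddings.pretrained()
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

pipeline = Pipeline(stages=[document, tokenizer, embeddings])
```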

----------------
Bug Fixes & Enhancements
----------------
* Optimizing Tokenizer performance by up to 400% when there is an exceptions list
* Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts
* Removing trove4j dependency
* Fix bug that caused get input/output/LazyAnnotator to return None
* Fix DeBertaForSequenceClassification in Python failing to load pretrained models

========

3.4.3

Not secure
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForSequenceClassification** annotator in Spark NLP 🚀. `DeBertaForSequenceClassification` can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaForSequenceClassification` for **PyTorch** or `TFDebertaForSequenceClassification` for **TensorFlow** models in HuggingFace
* New multi-label feature in all SequenceForClassification. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification (see the sketch after this list)
* New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL
* New impossiblePenultimates feature in SentenceDetectorDLModel
* New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol
* New formCol and lemmaCol parameters in Lemmatizer annotator
* Add new functionality to download and extract models from S3 via direct link
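
A sketch of the new multi-label option, assuming the sigmoid/softmax switch is exposed through a `setActivation` setter (the setter name is an assumption; check the API reference for the exact param):

```python
from sparknlp.annotator import DeBertaForSequenceClassification

# Load a default pretrained sequence classifier and switch the output layer
# from softmax (single-label) to sigmoid (multi-label)
classifier = (
    DeBertaForSequenceClassification.pretrained()
    .setInputCols(["document", "token"])
    .setOutputCol("class")
    .setActivation("sigmoid")  # assumed setter for the new multi-label option
)
```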

----------------
Bug Fixes & Enhancements
----------------
* Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
* Update SentenceDetector documentation
* Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation


========

3.4.2

Not secure
========
----------------
New Features
----------------
* Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%).
This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2Model` for **PyTorch** or `TFDebertaV2Model` for **TensorFlow** models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
* Introducing a new param enableCaching in Doc2VecApproach and Word2VecApproach which, if enabled, speeds up training (see the sketch after this list)
* Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
* Support EMR emr-5.34.0 and emr-6.5.0
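
A sketch of the new enableCaching param, assuming it is exposed via a `setEnableCaching` setter following Spark NLP's usual naming convention (an assumption, not confirmed by the release notes):

```python
from sparknlp.annotator import Word2VecApproach

# Cache the intermediate training data to speed up training when enableCaching is on
word2vec = (
    Word2VecApproach()
    .setInputCols(["token"])
    .setOutputCol("embeddings")
    .setEnableCaching(True)  # assumed setter name for the new enableCaching param
)
```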

----------------
Bug Fixes
----------------
* Fix bestModelMetric param so that the set value is no longer ignored https://github.com/JohnSnowLabs/spark-nlp/pull/6978


========

3.4.1

Not secure
========
----------------
New Features & Enhancements
----------------
* Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; now, with the warmup session, every inference, including the first, takes about the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6773
* Add bestModelMetric param to choose between Micro-average or Macro-average for best model https://github.com/JohnSnowLabs/spark-nlp/pull/6749
* Add trimWhitespace and preservePosition params to RegexTokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6806
* Add a new `setSentenceMatchAdd` param to EntityRuler to match entities across documents/sentences and not just tokens https://github.com/JohnSnowLabs/spark-nlp/pull/6841
* Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time (as shown below) https://github.com/JohnSnowLabs/spark-nlp/pull/6822
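
A sketch of the 3.4.1 additions above; the RegexTokenizer setter names follow the usual param-to-setter convention and are assumptions:

```python
import sparknlp
from sparknlp.annotator import RegexTokenizer

# spark32 and real_time_output can now be used together in a single start() call
spark = sparknlp.start(spark32=True, real_time_output=True)

regex_tokenizer = (
    RegexTokenizer()
    .setInputCols(["document"])
    .setOutputCol("token")
    .setPattern("\\s+")
    .setTrimWhitespace(True)    # assumed setter for the new trimWhitespace param
    .setPreservePosition(True)  # assumed setter for the new preservePosition param
)
```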

----------------
Bug Fixes
----------------
* Fix random NullPointerException when using TensorFlow models without Kryo serialization https://github.com/JohnSnowLabs/spark-nlp/pull/6741
* Fix RecursiveTokenizerModel not being readable in a saved Pipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6748
* Fix ContextSpellCheckerApproach not being trained on Databricks https://github.com/JohnSnowLabs/spark-nlp/pull/6750
* Fix ContextSpellCheckerModel returning the wrong order of tokens when it's used with sentence detectors https://github.com/JohnSnowLabs/spark-nlp/pull/6799
* Fix GraphExtraction when fullAnnotate and document are used at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6845
* Fix Word2VecModel being cast to Doc2VecModel by mistake https://github.com/JohnSnowLabs/spark-nlp/pull/6849
* Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification https://github.com/JohnSnowLabs/spark-nlp/pull/6867
* Fix missing setExceptionsPath param in Tokenizer when it's used in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6868
* Fix the wrong metric being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1, but in fact it was Macro-averaged F1 (an option is now available to choose which metric to track)
* Update broken slow unit tests https://github.com/JohnSnowLabs/spark-nlp/pull/6767


========

3.4.0

Not secure
========
----------------
Major features and improvements
----------------
* **NEW:** Introducing **GPT2Transformer** annotator in Spark NLP 🚀. OpenAI GPT2 - huggingface `TFGPT2LMHeadModel`
* **NEW:** Introducing **RoBertaForSequenceClassification** annotator in Spark NLP 🚀. `RoBertaForSequenceClassification` can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for **PyTorch** or `TFRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForSequenceClassification** annotator in Spark NLP 🚀. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for **PyTorch** or `TFXLMRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForSequenceClassification** annotator in Spark NLP 🚀. `LongformerForSequenceClassification` can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for **PyTorch** or `TFLongformerForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **AlbertForSequenceClassification** annotator in Spark NLP 🚀. `AlbertForSequenceClassification` can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for **PyTorch** or `TFAlbertForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlnetForSequenceClassification** annotator in Spark NLP 🚀. `XlnetForSequenceClassification` can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for **PyTorch** or `TFXLNetForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML
* Support for Apache Spark and PySpark 3.2.x on Scala 2.12
* Introducing `useBestModel` param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.0
* Databricks 10.0 ML GPU
* Databricks 10.1
* Databricks 10.1 ML GPU
* Databricks 10.2
* Databricks 10.2 ML GPU
* Welcoming 3x new EMR releases to our Spark NLP family:
* EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
* EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
* EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
* Add new scripts/notebook to generate custom TensorFlow graphs for the `ContextSpellCheckerApproach` annotator
* Add a new `graphFolder` param to the `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
* Support DBFS file system in the `graphFolder` param. Starting with Spark NLP 3.4.0, you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
* Add new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
* Add `inputFormats` param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search in the text; the `outputFormat` param then defines the single pattern used for the output (see the sketch after this list)
* Enable batch processing in T5Transformer and MarianTransformer annotators
* Add Schema to `readDataset` in CoNLL() class
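
A sketch of the new `inputFormats` param on DateMatcher (the `outputFormat` param replaces the old `dateFormat`, see Backward Compatibility below); the setter names are assumed from the param names:

```python
from sparknlp.annotator import DateMatcher

date_matcher = (
    DateMatcher()
    .setInputCols(["document"])
    .setOutputCol("date")
    .setInputFormats(["yyyy/MM/dd", "MM/dd/yyyy"])  # acceptable input patterns to search for
    .setOutputFormat("yyyy/MM/dd")                  # single pattern used for the output
)
```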

----------------
Bug Fixes
----------------
* Fix a race condition in cluster mode when, for the very first time, the TF session is accessed as many times as the number of available cores on the Driver machine. Loading a model multiple times results in disk activity, and IO becomes a bottleneck for larger models, especially on machines with slower disks https://github.com/JohnSnowLabs/spark-nlp/pull/6575
* Fix a performance issue introduced in the 3.3.3 release for T5Transformer and MarianTransformer annotators. While adding support for ignored tokens, we accidentally introduced a bug that degraded the performance of these two annotators (sometimes making them twice as slow). Please do update to 3.4.0 if you are using either of these annotators. https://github.com/JohnSnowLabs/spark-nlp/pull/6605
* Fix a bug in model resolution by not filtering based on the timestamp
* Fix configProtoBytes param type in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6549
* Fix missing DefaultParamsReadable in RegexTokenizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6653
* Fix missing models `lemma_antbnc`, `sentiment_vivekn`, and `spellcheck_norvig` for Spark 3.x
* Fix missing pipelines `clean_slang`, `check_spelling`, `match_chunks`, and `match_datetime` for Spark 3.x
* Fix `saveModel` in TrainingHelper
* Fix Keyword/Yake module naming in Scala https://github.com/JohnSnowLabs/spark-nlp/pull/6562

----------------
Backward Compatibility
----------------

* The parameter `dateFormat` in DateMatcher and MultiDateMatcher annotators has been renamed to `outputFormat`:

```python
# previously
.setDateFormat("yyyy/MM/dd")

# after the 3.4.0 release
.setOutputFormat("yyyy/MM/dd")
```

* Deprecating xling TF Hub models for UniversalSentenceEncoder annotator (there are `CMLM` models available which outperform xling models with support for more languages)
* Deprecating Finnish old BERT models (there are newer models available now)

========
