Catboost

Latest version: v1.2.5

1.0.1

Not secure

> :warning: **PySpark support is broken in this release.** Please use release 1.0.3 instead.

CatBoost for Apache Spark
* More robust handling of CatBoost Master and Workers failures, avoid freezes.
* Fix for empty partitions. 1687
* Fix use-after-free (1759) and other random errors.
* Support Spark 3.1.

Python package
* Support python 3.10. 1575

Breaking changes
* Use group weight for generated pairs in pairwise losses

Bugfixes
* Switch to mimalloc allocator on Linux and macOS to avoid problems with static TLS.
* Fix SEGFAULTs on macOS. 1877
* Fix: Distributed training: do not fail if worker contains only learn or test data
* Fix SEGFAULT on CPU with Depthwise training and `rsm` < 1.
* Fix `calc_feature_statistics` for cat features. 1882
* Fix the result of `cv` when `metric_period` is set
* Fix `eval_metric` for multitarget training

1.0.0

Not secure

In this release we decided to increment the major version because we think that CatBoost is ready for production usage. We know that CatBoost is used a lot in many different companies and individual projects, and now it's not only a "psychological" maturity: we think that all the features we added in the last year and in the current release are worth a major version update. And of course, like many programmers, we love the magic of binary numbers, and we want to celebrate the 100₂ anniversary of CatBoost's first release on GitHub :)

New losses
* We've implemented a multi-label multiclass loss function that allows predicting multiple labels for each object. 1420
* Added LogCosh loss implementation 844
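
For readers unfamiliar with the loss named above, here is a minimal sketch of the textbook log-cosh formula, `log(cosh(prediction - target))`. The function names are hypothetical, and any weighting or normalization that CatBoost applies internally is not shown in these notes, so treat this as an illustration of the formula only.

```python
import math


def logcosh_loss(prediction: float, target: float) -> float:
    """Return log(cosh(prediction - target)) without overflowing for large residuals."""
    r = abs(prediction - target)
    # Identity: log(cosh(r)) = r + log(1 + exp(-2r)) - log(2).
    # exp(-2r) underflows harmlessly to 0.0 for large r, unlike cosh(r), which overflows.
    return r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)


def mean_logcosh(predictions, targets) -> float:
    """Unweighted mean of the per-object log-cosh losses (an assumption of this sketch)."""
    pairs = list(zip(predictions, targets))
    return sum(logcosh_loss(p, t) for p, t in pairs) / len(pairs)
```

The loss behaves like a squared error for small residuals and like an absolute error for large ones, which is why it is often used as a smooth, outlier-tolerant alternative to RMSE.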

Fully distributed CatBoost for Apache Spark
* In this release our Apache Spark package became truly distributed: in the previous version CatBoost stored test datasets in the controller process memory, and now test datasets are split evenly across workers.

Major speedup on CPU
* Speedup training on numeric datasets (480K rows, 60 features, 100 trees, binclass, 20% speedup on 16 cores Intel CPU 3.7s -> 2.9s)

R package
* Update C++ handles by reference to avoid redundant copies by david-cortes
* Avoid calculating groupwise feature importance: do not calculate feature importance for groupwise metrics by default
* R tests clear environment after runs so they won't find temporary data from previous runs
* Fixed ignored features in R failing when a single feature was ignored
* Fix feature_count attribute with ignored_features

CV improvements
* Added support for text features and embeddings in cross-validation mode
* We've changed the way cross-validation works. Previously, CatBoost trained a small batch of trees on each fold and then switched to the next fold or the next batch of trees. In 1.0.0 CatBoost trains the full model on each fold. This reduces the memory and time overhead of starting a new batch: only one CPU-to-GPU memory copy is needed per fold, not one per batch of trees. As a consequence, the interactive mean metric plot is unavailable until training has finished on all folds.
* **Important change** From now on, `use_best_model` and early stopping work independently on each fold, as we are trying to make single-fold training as close to regular training as possible. If one model stops at iteration `i`, we use its last value in the mean score plot for points in `[i+1; last iteration)`.
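
The padding rule for the mean score plot can be illustrated with a small sketch. This is not CatBoost's code: the function name and the sample metric values are made up, and the only behaviour taken from the notes above is that a fold which stopped early contributes its last recorded value to all later iterations.

```python
def mean_curve_with_early_stopping(fold_curves):
    """Average per-fold metric curves of unequal length.

    A fold that stopped early is extended with its last recorded value,
    so every iteration of the mean curve averages over all folds.
    """
    if not fold_curves or any(len(curve) == 0 for curve in fold_curves):
        raise ValueError("every fold needs at least one metric value")
    longest = max(len(curve) for curve in fold_curves)
    padded = [
        list(curve) + [curve[-1]] * (longest - len(curve))
        for curve in fold_curves
    ]
    # zip(*padded) yields, for each iteration, the values of all folds.
    return [sum(values) / len(padded) for values in zip(*padded)]


# Hypothetical per-iteration losses for two folds.
folds = [
    [0.70, 0.60, 0.55],              # early stopping after 3 iterations
    [0.72, 0.63, 0.58, 0.56, 0.55],  # ran all 5 iterations
]
mean_curve = mean_curve_with_early_stopping(folds)
```

In this example the first fold contributes `0.55` to iterations 4 and 5, so the mean curve has five points even though only one fold trained that long.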

GPU improvements
* Fixed distributed training performance on Ethernet networks: ~2x training time speedup. For 2 hosts, 8 V100 per host, 10 Gigabit Ethernet, 300 factors, 150M samples, 200 trees: 3300s -> 1700s
* We've found a bug in the model-size-reg implementation on GPU that led to worse quality of the resulting model, especially in comparison to a model trained on CPU with equal parameters

Rust
* Enabled loading a model from a buffer in Rust, by manavsah

Bugfixes
* Fix for model predictions with text and embedding features
* Switch to TBB local executor to limit TLS size and avoid memory leakage 1835
* Switch to tcmalloc under linux x86_64 to avoid memory fragmentation bug in LFAlloc
* Fix for case of ignored text feature
* Fixed application of baseline in C++ code. The baseline is now added before activation functions are applied and object labels are determined.
* Fixes for scikit-learn compatibility validation 1783 and 1785
* Fix for thread_count = -1 in set_params(). Issue 1800
* Fix potential sigsegv in evaluator. Fixes 1809
* Fix slow (u)int8 & (u)int16 parsing as catfeatures. Fixes 718
* Adjust boost from average option before auto learning rate
* Fix embeddings with CrossEntropy mode 1654
* Fix object importance 1820
* Fix data provider without target 1827

0.26.1

Not secure

R package
* Supported text features in R package, thanks to
* Supported virtual Ensembles in R

New features
* Thanks to gmrandazzo for adding multiregression with missing values on targets: the `MultiRMSEWithMissingValues` loss function
* Supported multiclass prediction in C++ wrapper for model inference C API

Bugfixes
* Renamed keyword parameter in `predict_proba` function from `X` to `data`, fixes 1785
* R feature importances: remove pool argument, fix 1438 and 1772
* Fix CUDA training on Windows (multiple issues; the main issue with details is 1735)
* Issue 1728: don't dereference pointers when there are no features
* Fixed empty tree processing in feature strength calculation
* Fixed missing loss graph points in select_features, 1775
* Sort csr matrix indices, fixes 1749
* Fix error "active CatBoost worker is already present in the current process" after previous training interruption or failure. 1795.
* Fixed erroneous warnings from model validation after training with a custom loss or custom error function. Fixes 873 and 1169.

0.26

Not secure

New features
* 972. Add model evaluation on GPU. Thanks to rakalexandra.
* Support Langevin on GPU
* Save class labels to models in cross validation
* 1524. Return models after CV. Thanks to vklyukin
* [Python] 766. Add CatBoostRanker & pool.get_group_id_hash() for ranking. Thanks to AnnaAraslanova
* 262. Make CatBoost widget work in jupyter lab. Thanks to Dm17r1y
* [GPU only] Allow to add exponent to score aggregation function
* Allow to specify threshold parameter for binary classification model. Thanks to Keksozavr.
* [C Model API] 503. Allow to specify prediction type.
* [C Model API] 1201. Get predictions for a specific class.

Breaking changes
* Use CUDA 11 by default. CatBoost GPU now requires driver version >= 450.51.06 on Linux x86_64 and >= 451.82 on Windows x86_64.
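
A deployment script could check these minimums before enabling GPU training. The sketch below only compares version strings; the dictionary keys and function names are hypothetical, and obtaining the installed driver version (for example from your system tooling) is left out.

```python
# Minimum driver versions quoted in the release note above, as integer tuples.
MIN_DRIVER = {
    "linux": (450, 51, 6),   # 450.51.06
    "windows": (451, 82),    # 451.82
}


def parse_version(text):
    """Turn a dotted version string such as '450.51.06' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_supported(platform_name, driver_version):
    """Return True if driver_version meets the minimum for platform_name."""
    required = MIN_DRIVER[platform_name]
    # Tuples compare element by element, so (450, 80, 2) >= (450, 51, 6) is True.
    return parse_version(driver_version) >= required
```

Comparing integer tuples rather than raw strings avoids lexicographic mistakes such as treating `"450.9"` as newer than `"450.51"`.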

Losses and metrics
* Add MRR and ERR metrics on CPU.
* Add [LambdaMart](https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/) loss.
* 1557. Add survivalAFT base logic. Thanks to blatr.
* 1286. Add Cox Proportional Hazards Loss. Thanks to fibersel.
* 1595. Provide object-oriented interface for setting up metric parameters. Thanks to ks-korovina.
* Change default YetiRank decay to 0.85 for better quality.

Python package
* 1372. Custom logging stream in python package. Thanks to DianaArapova.
* 1304. Callback after iteration functionality. Thanks to qoter.

R package
* 251. Train parameter synonyms. Thanks to ebalukova.
* 252. Add `eval_metrics`. Thanks to ebalukova.

Speedups
* [Python] Speed up custom metrics and objectives with `numba` (if available)
* [Python] 1710. Large speedup for cv dataset splitting by sklearn splitter

Other
* Use Exact leaves estimation method as default on GPU
* [Spark] 1632. Update version of Scala 2.11 for security reasons.
* [Python] 1695. Explicitly specify WHEEL 'Root-Is-Purelib' value

Bugfixes
* Fix default projection dimension for embeddings
* Fix `use_weights` for some eval_metrics on GPU - `use_weights=False` is always respected now
* [Spark] 1649. The earlyStoppingRounds parameter is not recognized
* [Spark] 1650. Error when using the autoClassWeights parameter
* [Spark] 1651. Error about "Auto-stop PValue" when using odType "Iter" and odWait
* Fix usage of pairlogit weights for CPU fallback metrics when training on GPU

0.25.1

Not secure

Speedup
* CatBoost now uses non-owning NumPy arrays for passing C++ data to user-defined metric and loss functions in Python. This opens up lots of speedup possibilities: using those vectors in numba-jitted code, in Cython code, or just using NumPy vector functions. Thanks micyril!

Bugfixes
* Fix 1620 - retrieval of R pointers by david-cortes
* Fix `EvalMetricsResult.get_metric()` by Roffild
* Fix multiclass AUC calculation 1615

0.25

Not secure

CatBoost for Apache Spark
This release includes the CatBoost for Apache Spark package, which supports training, model application and feature evaluation on the Apache Spark platform. We've prepared [CatBoost for Apache Spark introduction](https://www.youtube.com/watch?v=47-mAVms-b8) and [CatBoost for Apache Spark Architecture](https://www.youtube.com/watch?v=nrGt5VKZpzc) videos. More details are available at the [CatBoost for Apache Spark home page](https://github.com/catboost/catboost/tree/master/catboost/spark/catboost4j-spark).

Feature selection
CatBoost supports a recursive feature elimination procedure, for when you have lots of feature candidates and want to select only the most influential ones by training models and keeping only the strongest features by feature importance. For details, see our [tutorial](https://github.com/catboost/catboost/blob/master/catboost/tutorials/feature_selection/select_features_tutorial.ipynb)
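
The general idea of recursive feature elimination can be sketched in a few lines. This is a generic illustration, not CatBoost's implementation or API: `importance_fn` is a hypothetical callback standing in for "train a model on these features and return their importances", and the toy importances are invented.

```python
def recursive_feature_elimination(features, importance_fn, num_to_select, steps=1):
    """Drop the weakest features over `steps` rounds until num_to_select remain.

    importance_fn(current_features) is a hypothetical callback that retrains a
    model on the current features and returns {feature_name: importance}.
    """
    if not 1 <= num_to_select <= len(features):
        raise ValueError("num_to_select must be between 1 and len(features)")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    selected = list(features)
    to_remove = len(selected) - num_to_select
    for step in range(steps):
        if to_remove == 0:
            break
        # Spread the remaining removals evenly over the remaining rounds.
        drop_now = -(-to_remove // (steps - step))  # ceiling division
        importances = importance_fn(selected)
        weakest = set(sorted(selected, key=lambda f: importances[f])[:drop_now])
        selected = [f for f in selected if f not in weakest]
        to_remove -= drop_now
    return selected


# Toy stand-in for model training: fixed, made-up importances.
toy_importance = {"age": 0.40, "income": 0.35, "zip": 0.05, "id": 0.01, "noise": 0.02}


def toy_importance_fn(current):
    return {name: toy_importance[name] for name in current}


kept = recursive_feature_elimination(list(toy_importance), toy_importance_fn, 2, steps=2)
```

Retraining between rounds matters in practice because importances shift once weak or redundant features are removed; the fixed toy importances here do not show that effect.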

New features
* Supported exact leaves estimation method for quantile, MAE and MAPE losses on GPU. You can enable it by setting `leaf_estimation_method=Exact` explicitly; in upcoming releases we plan to make it the default.
* Supported uncertainty prediction for multiclassification models
* 1568 Added support for SHAP values calculation for MultiRMSE models
* 1520 Added support for pathlib.Path in python package
* 1456 Added prehashed categorical features and text features to C API for model inference.

Losses and metrics
* Supported Huber and Tweedie losses in GPU training
* QueryAUC metric implemented by fibersel

Breaking changes
* We changed the NDCG calculation principle for groups without relevant docs to make our NDCG score fully compatible with the XGBoost and LightGBM implementations. Now we calculate `dcg==1` when there are no relevant objects in a group (when the ideal DCG equals zero); previously we used `score==0` in that case.
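
The edge case described above can be sketched as follows. The sketch assumes that the group's resulting NDCG score is what gets set to 1, and it uses the common exponential-gain DCG formula; neither the gain formula nor the function names come from these notes.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with exponential gain (an assumed convention)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )


def ndcg(relevances_in_predicted_order, empty_group_score=1.0):
    """NDCG for one group; groups with no relevant objects get empty_group_score."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        # No relevant objects: the ratio is 0/0, so a convention is needed.
        # 1.0 mirrors the new behaviour described above, 0.0 the old one.
        return empty_group_score
    return dcg(relevances_in_predicted_order) / ideal
```

Because the mean NDCG is averaged over groups, switching the convention from 0 to 1 raises the reported score on datasets that contain many groups without relevant objects, so scores before and after this release are not directly comparable.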

Speedups
* With the help of the Intel developers team, we switched our threading model implementation to Intel Threading Building Blocks. That gives us up to a 20% speedup on 28 threads and around a 2x speedup when training with 120 threads, and largely improves scalability.
* Speed up rendering fstat plots.
* Slightly speed up string casting in python package during pool creation.

R package
* Added path expansion when saving/loading files in R by david-cortes
* Added functionality to restore R handle after deserializing model by david-cortes
* Retrieve R pointers outside loops to speed up scalar access by david-cortes
* Multiple R documentation edits from david-cortes and jameslamb
* 1588 Added precision for converting params to json

Bugfixes
* 1525 Problem with missing exported functions in Windows R package dll
* 1315 Low CPU utilization in CPU cross-validation
* 785 Predict on single item with iloc fixed by feeeper
* Segfaults due to null pointer in pool in R package fixed by david-cortes
* 1553 Added check for baseline dimensions count in apply
* 1606 Allow to use CatBoost in AWS Lambda environment: fix bug with setting thread names
* 1609 and 1309 Print proper error message if all params in grid were invalid
* Ability to use docstrings in estimators added by pawelopiela
* Allow extra space at the end of line for libsvm format

Thanks!
* We would like to recognize the Intel software engineering team's contributions to the CatBoost project.
* Many thanks to our individual contributors: david-cortes jameslamb pawelopiela feeeper fibersel
