CatBoost

Latest version: v1.2.5


0.24.4

Speedup
* Major speedup of asymmetric tree training time on CPU (2x speedup on Epsilon with 16 threads). We would like to recognize the Intel software engineering team's contributions to the CatBoost project.

New features
* Now we publish Python 3.9 wheels. Related issues: 1491, 1509, 1510
* Allow `boost_from_average` for `MultiRMSE` loss.
* Added the `pairwise=False` tag for sklearn compatibility. Fixes issue 1518

Bugfixes:
* Allow fstr calculation for datasets with embeddings
* Fix `feature_importances_` for fstr with texts
* Virtual ensembles fix: use proper unshrinkage coefficients
* Fixed constants in the `RMSEWithUncertainty` loss function calculation to match the values from the original paper
* Allow SHAP values calculation for models with zero-weight and non-zero-value leaves. We now use the sum of leaf weights on the train and current datasets to guarantee non-zero weights for leaves reachable on the current dataset. Fixes issues 1512, 1284

0.24.3

New functionality
* Support fstr for text features and embeddings. Issue 1293

Bugfixes:
* Fix model apply speed regression from 0.24.1 & 0.24.2
* Various fixes in embeddings support: fixed apply and model serialization, fixed apply on texts and embeddings
* Fixed virtual ensembles prediction: use proper scaling, fixed apply (issue 1462)
* Fix `score()` method for `RMSEWithUncertainty`. Issue 1482
* Automatically use correct `prediction_type` in `score()`

0.24.2

Uncertainty prediction
* Supported uncertainty prediction for classification models.
* Fixed `RMSEWithUncertainty` data uncertainty prediction: now it predicts variance, not standard deviation.

New functionality
* Allow categorical feature counters for `MultiRMSE` loss function.
* Added a `group_weight` parameter to the `catboost.utils.eval_metric` method to allow passing weights for object groups, so that weighted ranking metrics can be computed consistently when group weights are present (see the sketch after this list).
* Faster non-owning deserialization from memory with less memory overhead: some dynamically computed data was moved to the model file, and other data is computed lazily only when needed.
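
A minimal sketch of the new `group_weight` parameter, assuming (as with `Pool`) one weight per object that is constant within a group; the labels, predictions and group ids are toy values:

```python
from catboost.utils import eval_metric

labels = [3, 2, 1, 0, 2, 1]                     # relevance labels
approxes = [0.9, 0.7, 0.3, 0.1, 0.8, 0.2]       # model predictions
group_ids = [0, 0, 0, 1, 1, 1]                  # query/group id per object
group_weights = [2.0, 2.0, 2.0, 0.5, 0.5, 0.5]  # weight per object, same within a group

# Returns a list with one value per requested metric.
print(eval_metric(labels, approxes, 'NDCG',
                  group_id=group_ids, group_weight=group_weights))
```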

Experimental functionality
* Supported embedding features as input and linear discriminant analysis for embedding preprocessing. Try adding your embeddings as new columns with embedding value arrays in a pandas DataFrame and passing the corresponding column names to the `Pool` constructor or the `fit` function with the `embedding_features=['EmbeddingFeaturesColumnName1', ...]` parameter (see the sketch below). Another way to add your embedding vectors is the new Column Description file column type `NumVector` together with a semicolon-separated embeddings column in your XSV file: `ClassLabel\t0.1;0.2;0.3\t...`.
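
A minimal sketch of the DataFrame route, with made-up column names and toy data:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

df = pd.DataFrame({
    'NumericFeature': [0.1, 0.4, 0.7, 0.2],
    'EmbeddingFeaturesColumnName1': [np.random.rand(8) for _ in range(4)],  # one vector per row
})
labels = [0, 1, 1, 0]

# Mark the embedding column by name when building the Pool.
train_pool = Pool(df, label=labels,
                  embedding_features=['EmbeddingFeaturesColumnName1'])
model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(train_pool)
```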

Educational materials
* Published new [tutorial](https://github.com/catboost/catboost/blob/master/catboost/tutorials/uncertainty/uncertainty_regression.ipynb) on uncertainty prediction.


Bugfixes:
* Reduced GPU memory usage in multi-GPU training when there is no need to compute categorical feature counters.
* CatBoost now allows specifying `use_weights` for metrics when the `auto_class_weights` parameter is set.
* Correctly handle NaN values in `plot_predictions` function.
* Fixed bugs related to floating point precision drop during multiclass training with many objects; in our case the bug was triggered while training on 25 million objects on a single GPU card.
* The `average` parameter is now passed to the TotalF1 metric while training on GPU.
* Added class label checks
* Disallow feature remapping in model predict when there are empty feature names in the model.

0.24.1

Uncertainty prediction
The main feature of this release is support for total uncertainty prediction via virtual ensembles.
You can read the theoretical background in the preprint [Uncertainty in Gradient Boosting via Ensembles](https://arxiv.org/pdf/2006.10562v2.pdf) from our research team.
We introduced a new training parameter, `posterior_sampling`, which allows estimating total uncertainty.
Setting `posterior_sampling=True` implies enabling Langevin boosting, setting `model_shrink_rate` to `1/(2*N)` and setting `diffusion_temperature` to `N`, where `N` is the dataset size.
The CatBoost object method `virtual_ensembles_predict` splits the model into `virtual_ensembles_count` submodels.
Calling `model.virtual_ensembles_predict(.., prediction_type='TotalUncertainty')` returns the mean prediction, variance (and knowledge uncertainty for models trained with the `RMSEWithUncertainty` loss function).
Calling `model.virtual_ensembles_predict(.., prediction_type='VirtEnsembles')` returns `virtual_ensembles_count` predictions of virtual submodels for each object.
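
A minimal sketch of this workflow on toy data (the iteration count and ensemble count are illustrative only):

```python
import numpy as np
from catboost import CatBoostRegressor

X = np.random.rand(1000, 5)
y = X.sum(axis=1) + np.random.normal(scale=0.1, size=1000)

model = CatBoostRegressor(
    iterations=200,
    loss_function='RMSEWithUncertainty',
    posterior_sampling=True,  # implies Langevin boosting, model_shrink_rate=1/(2*N),
                              # diffusion_temperature=N, where N is the dataset size
    verbose=False,
)
model.fit(X, y)

# Mean prediction and uncertainty estimates per object.
total_uncertainty = model.virtual_ensembles_predict(
    X, prediction_type='TotalUncertainty', virtual_ensembles_count=10)

# Raw predictions of each of the 10 virtual submodels per object.
per_submodel = model.virtual_ensembles_predict(
    X, prediction_type='VirtEnsembles', virtual_ensembles_count=10)
```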

New functionality
* Supported non-owning model deserialization for models with categorical feature counters

Speedups
* We've made a lot of speedups for sparse data loading. For example, on the Bosch sparse dataset, preprocessing got a 4.5x speedup when running with 28 threads.

Bugfixes:
* Fixed target check for PairLogitPairwise on GPU. Issue 1217
* Supported `n_features_in_` attribute required for using CatBoost in sklearn pipelines. Issue 1363

0.24

New functionality
* We've finally implemented MVS sampling for GPU training. Switched default bootstrap algorithm to MVS for RMSE loss function while training on GPU
* Implemented near-zero-cost model deserialization from a memory blob. Currently, if your model doesn't use categorical feature CTR counters or text features, you can deserialize the model from, for example, a memory-mapped file.
* Added the ability to load trained models from a binary string or a file-like stream. To load a model from a bytes string use `load_model(blob=b'....')`; to deserialize from a file-like stream use `load_model(stream=gzip.open('model.cbm.gz', 'rb'))` (see the sketch after this list).
* Fixed auto-learning rate estimation params for GPU
* Supported beta parameter for QuerySoftMax function on CPU and GPU
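
A minimal sketch of both loading paths; the file names are illustrative and assume a model saved earlier with `save_model`:

```python
import gzip
from catboost import CatBoost

# Deserialize from an in-memory bytes blob.
with open('model.cbm', 'rb') as f:
    model_from_blob = CatBoost().load_model(blob=f.read())

# Deserialize from a file-like stream, e.g. a gzip-compressed model file.
model_from_stream = CatBoost().load_model(stream=gzip.open('model.cbm.gz', 'rb'))
```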

New losses and metrics
* New loss function `RMSEWithUncertainty`: it allows estimating data uncertainty for trained regression models. The trained model will give you a two-element vector for each object, with the first element being the regression prediction and the second element an estimate of the data uncertainty for that prediction (see the sketch below).
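
A minimal sketch on toy data; per the note above, each prediction is a two-element vector:

```python
import numpy as np
from catboost import CatBoostRegressor

X = np.random.rand(500, 4)
y = X.sum(axis=1) + np.random.normal(scale=0.2, size=500)

model = CatBoostRegressor(iterations=100,
                          loss_function='RMSEWithUncertainty',
                          verbose=False)
model.fit(X, y)

preds = model.predict(X)        # shape (n_objects, 2)
mean_prediction = preds[:, 0]   # first element: regression prediction
data_uncertainty = preds[:, 1]  # second element: data uncertainty estimate
```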

Speedups
* Major speedups for CPU training: kdd98 -9%, higgs -18%, msrank -28%. We would like to recognize the Intel software engineering team's contributions to the CatBoost project. This was a mutually beneficial activity, and we look forward to continued cooperation.

Bugfixes:
* Fixed CatBoost model export as Python code
* Fixed AUC metric creation
* Add text features to `model.feature_names_`. Issue 1314
* Allow models trained on datasets with NaN values (Min treatment) and without NaNs in `model_sum()` or as the base model in `init_model=`. Issue 1271

Educational materials
* Published new [tutorial](https://github.com/catboost/catboost/blob/master/catboost/tutorials/categorical_features/categorical_features_parameters.ipynb) on categorical features parameters. Thanks garkavem

0.23.2

New functionality
* Added `plot_partial_dependence` method in the python-package (currently it works for models with symmetric trees trained on datasets with numerical features only). Implemented by felixandrer.
* Allowed using the `boost_from_average` option together with the `model_shrink_rate` option. In this case shrinkage is applied to the starting value.
* Added a new `auto_class_weights` option in the python-package, R-package and CLI with possible values `Balanced` and `SqrtBalanced`. For `Balanced`, every class is weighted `maxSumWeightInClass / sumWeightInClass`, where `sumWeightInClass` is the sum of weights of all samples in this class (if no weights are present, each sample weight is 1) and `maxSumWeightInClass` is the maximum such sum among all classes. For `SqrtBalanced` the formula is `sqrt(maxSumWeightInClass / sumWeightInClass)`. This option is supported in binclass and multiclass tasks (a small worked example follows this list). Implemented by egiby.
* Supported the `model_size_reg` option on GPU. Set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU, the cost of a combination is equal to the number of different values of this combination that are present in the training dataset. On GPU, the cost of a combination is equal to the number of all possible different values of this combination. For example, if the combination contains two categorical features c1 and c2, then the cost will be the number of categories in c1 multiplied by the number of categories in c2, even though many values of this combination might not be present in the dataset.
* Added calculation of exact Shapley values (see formula (2) from https://arxiv.org/pdf/1802.03888.pdf). By default the estimation from this paper (Algorithm 2) is calculated, which is much faster. To use the exact mode, set the `shap_calc_type` parameter of the `CatBoost.get_feature_importance` function to `"Exact"` (a usage sketch follows this list). Implemented by LordProtoss.
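
A small worked illustration (not from the release notes) of the `Balanced` formula: with unit sample weights and class counts {0: 90, 1: 10}, the per-class weight sums are 90 and 10, `maxSumWeightInClass` is 90, so the resulting class weights are 90/90 = 1.0 for class 0 and 90/10 = 9.0 for class 1. Enabling the option looks like this:

```python
from catboost import CatBoostClassifier

# auto_class_weights='Balanced' applies maxSumWeightInClass / sumWeightInClass per class;
# 'SqrtBalanced' applies the square root of that ratio.
model = CatBoostClassifier(iterations=50,
                           auto_class_weights='Balanced',
                           verbose=False)
```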
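
And a minimal sketch, on toy data, of requesting exact Shapley values via `shap_calc_type`:

```python
import numpy as np
from catboost import CatBoostRegressor, Pool

X = np.random.rand(200, 6)
y = 2 * X[:, 0] + X[:, 1]

pool = Pool(X, y)
model = CatBoostRegressor(iterations=50, verbose=False).fit(pool)

# The default mode uses the faster approximation; 'Exact' computes formula (2).
shap_values = model.get_feature_importance(pool,
                                           type='ShapValues',
                                           shap_calc_type='Exact')
# shap_values has shape (n_objects, n_features + 1); the last column is the expected value.
```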

Bugfixes:
* Fixed onnx converter for old onnx versions.
