CatBoost

0.10.0

Breaking changes
R package
- In the R package, the parameter `target` was renamed to `label` in the method [`save_pool()`](https://tech.yandex.com/catboost/doc/dg/concepts/r-reference_catboost-save_pool-docpage/)
Python package
- We don't support Python 3.4 anymore
- The CatBoostClassifier and CatBoostRegressor [`get_params()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_get_params-docpage/) method now returns only the parameters that were explicitly set when constructing the object. That means `get_params()` will not contain `loss_function` if it was not specified.
This also means that this code:
```python
model1 = CatBoostClassifier()
params = model1.get_params()
model2 = CatBoost(params)
```
will create `model2` with the default loss function RMSE, not with Logloss.
This breaking change was made to support the sklearn interface, so that sklearn GridSearchCV can work.
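If you relied on the old behavior, a minimal sketch of a workaround is to set `loss_function` explicitly so that `get_params()` carries it over:
```python
from catboost import CatBoost, CatBoostClassifier

model1 = CatBoostClassifier(loss_function='Logloss')  # set explicitly
params = model1.get_params()  # now includes 'loss_function'
model2 = CatBoost(params)     # will train with Logloss, as before
```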
- We've replaced several attributes with methods. This was needed to avoid sklearn warnings:
`is_fitted_` => [`is_fitted()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_is_fitted-docpage/)
`metadata_` => [`get_metadata()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_metadata-docpage/)
- We removed the model file argument from the estimator constructor. This was also done to avoid sklearn warnings.
Educational materials
- We added [tutorial](https://github.com/catboost/tutorials/blob/master/ranking/ranking_tutorial.ipynb) for our ranking modes.
- We published our [slides](https://github.com/catboost/catboost/tree/master/slides), you are very welcome to use them.
Improvements
All
- Now it is possible to save the model in JSON format (see the sketch after this list).
- We have added a Java interface for the CatBoost model.
- We now have static linkage with CUDA, so you don't have to install any particular version of CUDA to get CatBoost working on GPU.
- We implemented both multiclass modes on GPU; they are very fast.
- It is now possible to use multiclass with string labels; they will be inferred from the data.
- Added the `use_weights` parameter to [metrics](https://tech.yandex.com/catboost/doc/dg/concepts/loss-functions-docpage/). By default all metrics except AUC use weights, but you can disable this: to calculate a metric value without weights, set the parameter to false, for example `Accuracy:use_weights=false`. This can be done only for `custom_metric` or `eval_metric`, not for the objective function; the objective function always uses weights if they are present in the dataset (see the sketch after this list).
- Snapshots are now saved on a time interval. Saving a snapshot every 5 or 10 minutes is much faster than saving one on every iteration.
- Reduced memory consumption by ranking modes.
- Added automatic feature importance evaluation after completion of GPU training.
- Allow nonexistent indices in the ignored features list.
- Added [new metrics](https://tech.yandex.com/catboost/doc/dg/concepts/loss-functions-docpage/): `LogLikelihoodOfPrediction`, `RecallAt:top=k`, `PrecisionAt:top=k` and `MAP:top=k`.
- Improved quality for multiclass with weighted datasets.
- Pairwise modes now support automatic pairs generation (see [tutorial](https://github.com/catboost/tutorials/blob/master/ranking/ranking_tutorial.ipynb) for that).
- The metric `QueryAverage` is renamed to the clearer `AverageGain`. This is a very important ranking metric: it shows the average target value in the top k documents of a group.
- Introduced the parameter `best_model_min_trees`, the minimal number of trees the best model should have.
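As a hedged illustration of the `use_weights` syntax and the JSON export mentioned above (the toy dataset and parameter values here are made up):
```python
import numpy as np
from catboost import CatBoostClassifier, Pool

# hypothetical weighted toy dataset
rng = np.random.RandomState(0)
train_pool = Pool(rng.rand(100, 4), label=rng.randint(0, 2, 100),
                  weight=rng.rand(100))

model = CatBoostClassifier(
    iterations=50,
    # evaluate Accuracy without the dataset weights; the objective
    # (Logloss here) still uses them
    custom_metric=['Accuracy:use_weights=false'],
)
model.fit(train_pool, verbose=False)

# export the trained model in JSON format
model.save_model('model.json', format='json')
```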
Python
- We now support sklearn GridSearchCV: you can pass categorical feature indices when constructing the estimator and then use it in GridSearchCV (see the sketch after this list).
- We added a new method for building a ROC curve to utils: [`get_roc_curve`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_utils_get_roc_curve-docpage/).
- Added the [`get_gpu_device_count()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_utils_get_gpu_device_count-docpage/) method to the Python package as a way to check whether your CUDA devices are available.
- We implemented automatic selection of the decision boundary using the ROC curve. You can select the best classification boundary given the maximum FPR or FNR that you allow the model. Take a look at [`catboost.select_threshold(self, data=None, curve=None, FPR=None, FNR=None, thread_count=-1)`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_utils_select_threshold-docpage/). You can also calculate FPR and FNR for each boundary value.
- We have added pool slicing: [`pool.slice(doc_indices)`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool_slice-docpage/)
- Allow GroupId and SubgroupId specified as strings.
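A sketch tying several of these Python additions together (the toy data, parameter values, and grid are hypothetical; consult the linked docpages for exact signatures):
```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_gpu_device_count, get_roc_curve, select_threshold

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = rng.randint(0, 2, 500)
pool = Pool(X, label=y)

# categorical feature indices (none in this toy data) go to the
# constructor, so GridSearchCV can clone the estimator
model = CatBoostClassifier(iterations=50, cat_features=[], verbose=False)
grid = GridSearchCV(model, param_grid={'depth': [4, 6]}, cv=3)
grid.fit(X, y)

model.fit(pool, verbose=False)

# build a ROC curve from the fitted model and a pool
fpr, tpr, thresholds = get_roc_curve(model, pool)

# pick a decision boundary allowing at most 10% false negatives
boundary = select_threshold(model, data=pool, FNR=0.1)

# slice the pool down to its first 100 documents
subset = pool.slice(list(range(100)))

print(get_gpu_device_count())  # number of available CUDA devices
```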
R package
- GPU support in the R package. Use the parameter `task_type='GPU'` to enable GPU training.
- Models in R can be saved and restored using R's save/load or saveRDS/readRDS.
Speedups
- A new way of loading data in Python using the [FeaturesData structure](https://tech.yandex.com/catboost/doc/dg/concepts/python-features-data__desc-docpage/). Using FeaturesData speeds up data loading for both training and prediction. It is especially important for prediction, where it gives around a 10-20x speedup in Python (see the sketch after this list).
- Training multiclass on CPU ~ 60% speedup
- Training of ranking modes on CPU ~ 50% speedup
- Training of ranking modes on GPU ~ 50% speedup for datasets with many features and not very many objects
- Speedups of metric calculation on GPU. Example on our internal dataset: training with the AUC eval metric on a test dataset with 2 million objects went from 7 seconds to 0.2 seconds per iteration.
- Speedups for all training modes on CPU.
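A minimal sketch of training and prediction via FeaturesData (the feature values are made up; numeric features must be float32 arrays and categorical features object arrays of UTF-8 encoded bytes):
```python
import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool

def make_features(num_rows, cat_rows):
    # num_rows: rows of numeric features; cat_rows: rows of bytes values
    return FeaturesData(
        num_feature_data=np.array(num_rows, dtype=np.float32),
        cat_feature_data=np.array(cat_rows, dtype=object),
    )

train = Pool(
    make_features([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
                  [[b'red'], [b'blue'], [b'red'], [b'blue']]),
    label=[0, 1, 0, 1],
)

model = CatBoostClassifier(iterations=10)
model.fit(train, verbose=False)

# predicting from FeaturesData avoids the generic conversion path,
# which is where most of the prediction speedup comes from
preds = model.predict(make_features([[2.0, 3.0]], [[b'red']]))
```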

We also made many stability improvements, improved the usability of the library, added new parameter synonyms, and improved input data validation.

Thanks a lot to everyone who created issues on GitHub, and thanks to our contributor pukhlyakova, who implemented many new useful metrics!

0.9.1.1

Bugfixes
- Fixed bug 403 in the CUDA train submodule (training crashed without an evaluation set)
- Fixed exception propagation at the pool parsing stage
- Added support for string `GroupId` and `SubgroupId` in the Python package
- Print real class names instead of their labels in eval output

0.9

Breaking Changes
- We removed the `calc_feature_importance` parameter from Python and R.
Feature importance calculation is now almost free, so we always calculate feature importances. Previously you could disable it if it was slowing down your training.
- We removed the `Doc` type for feature importances. Use Shap values instead.
- We moved the `thread_count` parameter in the Python `get_feature_importance` method to the end of the argument list.

Ranking
In this release we added several very powerful ranking objectives:
- PairLogitPairwise
- YetiRankPairwise
- QueryCrossEntropy (GPU only)

Other ranking improvements:
- We have made improvements to our existing ranking objectives QuerySoftMax and PairLogit.
- We have added group weights support.

Accuracy improvements
- Improvements for datasets with weights
- We now automatically calculate a good learning rate at the start of training, so you don't have to specify it. After training has finished, you can look at the training curve on the evaluation dataset and adjust the selected learning rate, but it will already be a good value.

Speedups
- Several speedups for GPU training.
- 1.5x speedup for applying the model.
- Sped up multiclass classification training.
- 2x speedup for AUC calculation in eval_metrics.
- Several speedups for eval_metrics for other metrics.
- 100x speedup for Shap values calculation.
- Speedup for feature importance calculation. It used to be a bottleneck for GPU training; now it's not.
- We added the possibility to skip metric calculation on the train dataset using `MetricName:hints=skip_train~true` (it might speed up your training if metric calculation is a bottleneck, for example if you calculate many metrics or calculate metrics on GPU).
- We added the possibility to calculate metrics only periodically, not on every iteration. Use `metric_period` for that (previously it only disabled verbose output on each iteration).
- We now disable calculation of expensive metrics on the train dataset by default: AUC and PFound are not calculated on the train dataset unless you request them with `MetricName:hints=skip_train~false`. You can also disable calculation of other metrics on the train dataset using `MetricName:hints=skip_train~true`.
- If you calculate metrics using `eval_metrics` or during training, you can now use `metric_period` to skip some iterations. It will speed up `eval_metrics` and it might speed up training, especially GPU training. Note that the most expensive metric to calculate is AUC; for this metric and large datasets it makes sense to use `metric_period`. If you only want less verbose output while still writing metric values for every iteration to a file, use the `verbose=n` parameter (see the sketch after this list).
- Parallelized calculation of most metrics during training.
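For instance, a hedged sketch of combining these options (the metric choice and period are illustrative):
```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,
    # skip Logloss calculation on the train dataset
    custom_metric=['Logloss:hints=skip_train~true'],
    # evaluate and log metrics every 50 iterations only
    metric_period=50,
)
```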

Improved GPU experience
- It is now possible to calculate and visualize `custom_metric` during training on GPU.
You can use our Jupyter visualization, CatBoost viewer, or TensorBoard the same way you used them for CPU training. Metric calculation might be a bottleneck, so if it slows down your training, use `metric_period=something` and `MetricName:hints=skip_train~true`.
- We switched to CUDA 9.1. Starting from this release, CUDA 8.0 is no longer supported.
- Support for external borders on GPU in the cmdline version.

Improved tools for model analysis
- We added support for feature combinations to our Shap values implementation.
- Added Shap values for MultiClass and added an example of their usage to our [Shap tutorial](https://github.com/catboost/tutorials/blob/master/model_analysis/shap_values_tutorial.ipynb).
- Added the `prettified` parameter to `get_feature_importance()`. With `prettified=True` the function returns a list of features with names, sorted in descending order by importance (see the sketch after this list).
- Improved interfaces for the eval-feature functionality.
- Shap values support in the R package.
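A hedged sketch of the analysis tools above, assuming `model` is a fitted CatBoost model and `pool` is a Pool (argument spellings follow the current Python API and may differ in older releases):
```python
# feature names sorted by importance, most important first
top_features = model.get_feature_importance(pool, prettified=True)

# per-document, per-feature Shap values; the last column holds the
# expected value of the model prediction
shap_values = model.get_feature_importance(pool, type='ShapValues')
```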

New features
- It is now possible to save arbitrary metadata to the model.
- Empty values support
- Better support of sklearn
- feature_names_ for CatBoost class
- Added silent parameter
- Better stdout
- Better diagnostic for invalid inputs
- Better documentation
- Added a flag to allow constant labels

New metrics
We added many new metrics that can be used for visualization, overfitting detection, selecting the best training iteration, or for cross-validation:
- BrierScore
- HingeLoss
- HammingLoss
- ZeroOneLoss
- MSLE
- MAE
- BalancedAccuracy
- BalancedErrorRate
- Kappa
- WKappa
- QueryCrossEntropy
- NDCG

New ways to apply the model
- Saving model as C++ code
- Saving model with categorical features as Python code
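A minimal sketch of exporting a fitted model as source code (assuming `model` is a trained CatBoost model; the `format` values follow the Python `save_model` API):
```python
# standalone Python applier (categorical features supported)
model.save_model('model_applier.py', format='python')

# C++ applier (numeric features only)
model.save_model('model_applier.cpp', format='cpp')
```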

New ways to build the code
Added makefiles for the binary with CUDA and for the Python package.

Tutorials
We created a new [repo with tutorials](https://github.com/catboost/tutorials/), so now you don't have to clone the whole CatBoost repo to run a Jupyter notebook with a tutorial.

Bugfixes
This release also includes a set of bugfixes, and we are grateful to everyone who filed a bug report, helping us make the library better.

Thanks to our Contributors
This release contains contributions from the CatBoost team.
We want to especially mention pukhlyakova who implemented lots of useful metrics.

0.8.1

Bug Fixes and Other Changes
- New model method `get_cat_feature_indices()` in the Python wrapper.
- Minor fixes and stability improvements.

0.8

Breaking changes
- We fixed a bug in CatBoost: Pool initialization from `numpy.ndarray` and `pandas.DataFrame` with string values could cause slight inconsistency when using a trained model from older versions. Around 1% of categorical feature hashes were treated incorrectly. If you experience a quality drop after the update, you should consider retraining your model.

Major Features And Improvements
- The algorithm for finding the most influential training samples for a given object from the paper ['Finding Influential Training Samples for Gradient Boosted Decision Trees'](https://arxiv.org/pdf/1802.06640.pdf) is implemented. For every object from the input pool, this mode calculates scores for every object from the train pool. A positive score means that the given train object has made a negative contribution to the given test object's prediction, and vice versa for negative scores; the higher the absolute value of the score, the higher the contribution.
See the `get_object_importance` model method in the Python package and the `ostr` mode in the CLI version (a sketch follows this list). A tutorial for Python is available [here](https://github.com/catboost/tutorials/blob/master/model_analysis/object_importance_tutorial.ipynb).
More details and examples will be published in the documentation soon.
- We have implemented a new way of exploring feature importance: Shap values from this [paper](https://arxiv.org/pdf/1706.06060.pdf). This allows you to understand which features are most influential for a given object. You can also get more insight into your model; see details in a [tutorial](https://github.com/catboost/tutorials/blob/master/model_analysis/shap_values_tutorial.ipynb).
- Save-model-as-code functionality is published. For now you can save a model as Python code (with categorical features) or as C++ code (without categorical features).
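A hedged sketch of object importance (assuming a fitted `model`, a `train_pool`, and a `test_pool`; see the linked tutorial for real usage):
```python
# scores[i] reflects how strongly the i-th returned train object
# influenced the error on the test pool
indices, scores = model.get_object_importance(
    test_pool,
    train_pool,
    top_size=10,  # keep only the 10 most influential train objects
)
```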

Bug Fixes and Other Changes
- Fixed `_catboost` reinitialization issues 268 and 269.
- The Python module `catboost.utils` was extended with `create_cd`, which creates a column description file.
- It's now possible to load the Titanic and Amazon (Kaggle Amazon Employee Access Challenge) datasets from Python code. Use `catboost.datasets` (see the sketch after this list).
- The GPU parameter `use_cpu_ram_for_cat_features` was renamed to `gpu_cat_features_storage` with possible values `CpuPinnedMemory` and `GpuRam`. The default is `GpuRam`.
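For example (a sketch; each loader returns a train/test pair of pandas DataFrames):
```python
from catboost import datasets

titanic_train, titanic_test = datasets.titanic()
amazon_train, amazon_test = datasets.amazon()
```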

Thanks to our Contributors
This release contains contributions from the CatBoost team.

As usual we are grateful to all who filed issues or helped resolve them, asked and answered questions.

0.7.2

Major Features And Improvements
- GPU: New `DocParallel` mode for tasks without categorical features, or with categorical features and `--max-ctr-complexity 1`. Provides the best performance for pools with a large number of documents.
- GPU: Distributed training on several GPU hosts via MPI. See instructions on how to build the binary [here](https://tech.yandex.com/catboost/doc/dg/concepts/cli-installation-docpage/#multi-node-installation).
- GPU: Up to 30% learning speedup for Maxwell and later GPUs with binarization level > 32.

Bug Fixes and Other Changes
- Hotfixes for the GPU version of the Python wrapper.
