miceforest Changelog




* New main classes (`ImputationKernel`, `ImputedData`) replace the old classes (`KernelDataSet`, `MultipleImputedKernel`, `ImputedDataSet`, `MultipleImputedDataSet`).
  * Data can now be referenced and imputed in place. This saves a lot of memory allocation and is much faster.
  * Data can now be completed in place. This allows for only a single copy of the dataset to be in memory at any given time, even if performing multiple imputation.
  * The `mean_match_subset` parameter has been replaced with `data_subset`. This subsets the data used to build the model as well as the candidates.
  * More performance improvements around when data is copied and where it is stored.
  * Raw data is now stored in its original format; both pandas DataFrames and numpy ndarrays are supported.
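The in-place completion described above can be illustrated with a minimal numpy sketch (hypothetical code, not miceforest's implementation): a single raw array is shared, each imputed dataset stores only the values for the missing cells, and `inplace=True` fills the shared array directly so only one copy of the data exists in memory.

```python
import numpy as np

# Hypothetical sketch: one raw array shared by all imputed datasets.
raw = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
na_where = np.flatnonzero(np.isnan(raw))  # indices of the missing cells

# Each "dataset" stores only the values imputed for the missing cells.
imputed_values = {
    0: np.array([2.0, 4.0]),   # dataset 0
    1: np.array([2.5, 3.5]),   # dataset 1
}

def complete_data(dataset, inplace=False):
    """Fill the missing cells with a dataset's imputed values.

    With inplace=True the shared raw array itself is modified, so only
    a single copy of the data is in memory at any given time.
    """
    target = raw if inplace else raw.copy()
    target[na_where] = imputed_values[dataset]
    return target

completed = complete_data(0)    # returns a new array; raw is untouched
assert np.isnan(raw[1])
complete_data(1, inplace=True)  # raw itself is now completed
assert raw[1] == 2.5
```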


This release improved a number of areas:
  * Huge performance improvements, especially when imputing categorical variables. These come from not predicting candidate data when it isn't needed, a much faster neighbors search, using numpy instead of pandas for internal indexing, and other changes.
  * Model parameters can now be tuned, and the best parameters used for mice.
  * Improved code layout; `ImputationSchema` was removed.
  * Raw data is now stored as a numpy array to save space and improve indexing.
  * Numpy arrays can be imputed, if you want to avoid pandas.
  * Multiple built-in mean matching functions to choose from.
  * Mean matching functions can handle most lightgbm objectives.
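The mean matching mentioned above can be sketched with numpy alone (a simplified, hypothetical version for a numeric variable, not the package's actual function): predict for both candidates (rows with observed values) and targets, then impute each target with the observed value of a randomly chosen nearest-prediction candidate.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_match(candidate_preds, candidate_values, target_preds, k=3):
    """Simplified mean matching for a numeric variable.

    For each target prediction, find the k candidates whose predictions
    are closest, then draw one of their *observed* values at random.
    """
    imputed = np.empty(len(target_preds))
    for i, pred in enumerate(target_preds):
        # indices of the k nearest candidate predictions
        nearest = np.argsort(np.abs(candidate_preds - pred))[:k]
        imputed[i] = candidate_values[rng.choice(nearest)]
    return imputed

candidate_preds = np.array([1.0, 2.0, 3.0, 10.0])
candidate_values = np.array([1.1, 2.1, 2.9, 9.8])
result = mean_match(candidate_preds, candidate_values, np.array([2.2]), k=2)
assert result[0] in (2.1, 2.9)  # drawn from the two nearest candidates
```

Drawing from nearest neighbors rather than using the raw prediction keeps imputed values realistic, since every imputation is an actually observed value.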


This is a major release, with breaking API changes:
  * The random forest package is now lightgbm:
    - Much more lightweight (serialized kernels tend to be 5x smaller or more).
    - Much faster on big datasets (for comparable parameters).
    - More flexible; gbdt boosting can now be used, and lightgbm is more flexible in general.
  * Added a mean_match_subset parameter. This will help greatly speed up many processes.
  * `mean_match_candidates` now lazily accepts dicts, as long as the keys are a subset of the variables in `variable_schema`.
  * Model parameters can be specified by variable, or globally.
  * The mean matching function can be overridden by the user.
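The lazy dict acceptance described above can be sketched as follows (a hypothetical helper, not the package's code): keys only need to cover a subset of the variables, and unmentioned variables fall back to a default value.

```python
# Hypothetical sketch of "lazy" dict resolution for mean_match_candidates.
variable_schema = ["age", "income", "height"]
default_candidates = 5

def resolve_candidates(mean_match_candidates):
    """Validate keys against the schema and fill in defaults."""
    unknown = set(mean_match_candidates) - set(variable_schema)
    if unknown:
        raise ValueError(f"Unknown variables: {sorted(unknown)}")
    # Variables not mentioned in the dict fall back to the default.
    return {v: mean_match_candidates.get(v, default_candidates)
            for v in variable_schema}

resolved = resolve_candidates({"age": 3})
assert resolved == {"age": 3, "income": 5, "height": 5}
```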


* Models from all iterations can be saved by setting `save_models=2`.
  * Kernel classes inherit from base imputed classes - allows for methods to be called on imputed datasets obtained from `impute_new_data()`.
  * A time log was added.
  * MultipleImputedDataset is now a collection of ImputedDataSets with methods for comparing them. Subscripting gives the desired dataset.
  * Tests updated to be much more comprehensive
  * Datasets can now be added and removed from a MultipleImputedDataSet/MultipleImputedKernel.
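The collection behavior described above — subscripting, adding, and removing datasets — could look roughly like this toy sketch (hypothetical code, not miceforest's implementation):

```python
# Toy sketch of a subscriptable collection of imputed datasets.
class MultipleImputedDataSet:
    def __init__(self):
        self._datasets = {}

    def append(self, imputed_dataset):
        """Add a dataset under the next integer key."""
        self._datasets[len(self._datasets)] = imputed_dataset

    def remove(self, key):
        """Remove a dataset from the collection."""
        del self._datasets[key]

    def __getitem__(self, key):
        # Subscripting returns the desired dataset.
        return self._datasets[key]

    def __len__(self):
        return len(self._datasets)

mids = MultipleImputedDataSet()
mids.append({"x": [1, 2, 3]})
mids.append({"x": [1, 2, 4]})
assert mids[1] == {"x": [1, 2, 4]}
mids.remove(0)
assert len(mids) == 1
```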


Automatic testing, coverage, and formatting have been implemented. Code is (reasonably) bug free.