Rsmtool

Latest version: v12.0.0

7.0.0

This is a major release which includes changes to several key evaluation metrics computed by RSMTool.

What's new

Evaluation metrics

The exact definitions of all evaluation metrics and their method of computation are now available in the RSMTool documentation under [evaluation metrics](https://rsmtool.readthedocs.io/en/stable/evaluation.html).

Changes to evaluation metrics

* [Quadratic weighted kappa (QWK)](https://rsmtool.readthedocs.io/en/stable/evaluation.html#qwk) for `raw`, `raw_trim`, `scale` and `scale_trim` scores is now computed on *continuous* score values using the formula suggested by [Haberman (2019)](https://onlinelibrary.wiley.com/doi/full/10.1002/ets2.12258). In previous versions of RSMTool, such continuous score values were rounded before computing QWK.

* Subgroup differences are now evaluated using a new metric, ["Difference in standardized means"](https://rsmtool.readthedocs.io/en/stable/evaluation.html#differences-between-standardized-means-for-subgroups-dsm). This metric was designed to be more robust to differences in scale between human and machine scores.

* SMD for human-human agreement is now computed using the pooled standard deviation of H1 and H2 for the *double-scored sample* in the denominator.

* The default `tolerance` for [score postprocessing](https://rsmtool.readthedocs.io/en/stable/pipeline.html#score-post-processing) is now set to 0.4998 (instead of 0.49998). This may result in small changes to the values of all evaluation metrics for `raw_trim` and `scale_trim` scores. See the new configuration settings below if you need to define a custom tolerance.
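
The continuous form of QWK mentioned above can be illustrated with a minimal sketch. It assumes the formula is twice the covariance of human and system scores divided by the sum of their variances plus the squared difference of their means; the function name `qwk_continuous` and the `ddof` default are illustrative and are not RSMTool's own API.

```python
import numpy as np


def qwk_continuous(human, system, ddof=0):
    """Sketch of QWK on continuous scores (assumed formula, see lead-in)."""
    human = np.asarray(human, dtype=float)
    system = np.asarray(system, dtype=float)
    # Covariance between the two sets of scores.
    covariance = np.cov(human, system, ddof=ddof)[0, 1]
    # Sum of variances plus the squared difference in means.
    denominator = (
        human.var(ddof=ddof)
        + system.var(ddof=ddof)
        + (human.mean() - system.mean()) ** 2
    )
    return 2 * covariance / denominator


# No rounding is applied to the system scores before computing agreement.
human_scores = [1, 2, 3, 4, 5]
system_scores = [1.2, 2.1, 2.7, 4.4, 4.9]
print(qwk_continuous(human_scores, system_scores))
```

Under this formula, identical score vectors give a value of 1, and a constant shift between human and system scores lowers the value even though the correlation is perfect.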

New evaluation metrics

* [Test-theory based evaluations](https://rsmtool.readthedocs.io/en/stable/evaluation.html#accuracy-metrics-true-score): RSMTool and RSMEval now compute proportional reduction in mean squared error when using system scores to predict *true* scores.

* RSMTool and RSMEval now compute various additional [metrics of model fairness](https://rsmtool.readthedocs.io/en/stable/evaluation.html#additional-fairness-evaluations) suggested in [Loukina et al. 2019](https://www.aclweb.org/anthology/W19-4401/).


New configuration settings

* A new configuration setting [`experiment_names`](https://rsmtool.readthedocs.io/en/stable/advanced_usage.html#experiment-names-optional) for RSMSummarize allows specifying custom names for each experiment. These will be used to refer to the experiments in intermediate output files and in the report.

* A new configuration setting [`trim_tolerance`](https://rsmtool.readthedocs.io/en/stable/usage_rsmtool.html#trim-tolerance-optional) allows specifying custom tolerance when trimming scores to ceiling and floor values in RSMTool and RSMEval.

* A new configuration setting [`min_n_per_group`](https://rsmtool.readthedocs.io/en/stable/usage_rsmtool.html#min-n-per-group-optional) allows defining a threshold so that only groups with more than a certain number of members are included in the report. All groups are still included in the intermediate output files.
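
To make the role of the trimming tolerance concrete, here is a minimal sketch that assumes trimming clips scores to the interval from the floor minus the tolerance to the ceiling plus the tolerance. The function `trim_scores` is a hypothetical helper written for this illustration, not RSMTool's implementation.

```python
import numpy as np


def trim_scores(scores, trim_min, trim_max, tolerance=0.4998):
    """Clip scores to [trim_min - tolerance, trim_max + tolerance]."""
    scores = np.asarray(scores, dtype=float)
    return np.clip(scores, trim_min - tolerance, trim_max + tolerance)


# Scores on an assumed 1-6 scale; values outside the range are pulled back.
trimmed = trim_scores([0.2, 3.7, 6.9], trim_min=1, trim_max=6)
print(trimmed)
```

Because the tolerance is below 0.5, a trimmed score that is later rounded still falls inside the original score range.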

Other new functionality

* `.jsonlines` format is now one of the supported [input file formats](https://rsmtool.readthedocs.io/en/stable/pipeline.html#input-file-format).
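
For readers unfamiliar with the format, a `.jsonlines` file holds one JSON object per line. The sketch below reads such data into a data frame with pandas; the column names are made up for illustration, and this is not necessarily how RSMTool parses these files internally.

```python
import io

import pandas as pd

# Two responses, one JSON object per line (hypothetical column names).
jsonlines_text = (
    '{"id": "resp_1", "score": 3, "feature1": 0.5}\n'
    '{"id": "resp_2", "score": 4, "feature1": 0.9}\n'
)

# lines=True tells pandas to treat every line as a separate record.
df = pd.read_json(io.StringIO(jsonlines_text), lines=True)
print(df)
```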

API changes

- Several additional methods for computing standardized mean difference (SMD) are now available via [`rsmtool.utils.standardized_mean_difference`](https://rsmtool.readthedocs.io/en/stable/api.html#rsmtool.utils.standardized_mean_difference).

- The new routine for computing QWK is available via [`rsmtool.utils.quadratic_weighted_kappa`](https://rsmtool.readthedocs.io/en/stable/api.html#rsmtool.utils.quadratic_weighted_kappa).

- The new metric, difference in standardized means (DSM), is available via [`rsmtool.utils.difference_of_standardized_means`](https://rsmtool.readthedocs.io/en/stable/api.html#rsmtool.utils.difference_of_standardized_means).

- Functions for computing fairness analyses are now available via [`rsmtool.fairness_utils.get_fairness_analyses`](https://rsmtool.readthedocs.io/en/stable/api.html#from-fairness-utils-module).
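
As a rough illustration of the DSM idea, the sketch below assumes that human and system scores are each standardized using the mean and standard deviation of the whole population, and that DSM for a subgroup is the difference between the subgroup means of the two sets of z-scores. The helper `dsm_sketch` is hypothetical and does not reproduce the signature of the RSMTool function.

```python
import numpy as np


def dsm_sketch(human, system, in_group, ddof=1):
    """Difference of standardized means for one subgroup (assumed definition)."""
    human = np.asarray(human, dtype=float)
    system = np.asarray(system, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    # Standardize each set of scores with its own population statistics.
    z_human = (human - human.mean()) / human.std(ddof=ddof)
    z_system = (system - system.mean()) / system.std(ddof=ddof)
    # Compare subgroup means on the standardized scale.
    return z_system[in_group].mean() - z_human[in_group].mean()


human = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# System scores on a different scale: a linear transformation of human scores.
system = 2 * human + 3
group_mask = [True, True, False, False, True, False]
print(dsm_sketch(human, system, group_mask))
```

Because both sets of scores are standardized first, a pure change of scale between human and system scores does not by itself produce a subgroup difference.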

Bugfixes

- The [`partial_correlations()`](https://rsmtool.readthedocs.io/en/stable/api.html#rsmtool.utils.partial_correlations) function has been updated to return a correctly formatted matrix when the covariance matrix is very close to zero.

- The reports have been updated to correctly display plots for features with very long names.

6.1.0

This is a major release which includes a number of improvements primarily aimed at increasing the flexibility of the RSMTool API.

What's New

New functionality

* RSMTool now supports input files in SAS `SAS7BDAT` format.

* New learner `NNLRIterative`. This is a new [built-in linear regression model](https://rsmtool.readthedocs.io/en/latest/usage_rsmtool.html#builtin-models) that learns empirical OLS regression weights with feature selection using an iterative implementation of non-negative least squares regression.

* Custom truncation thresholds. The user can now remove outliers using pre-existing truncation thresholds specified in the `features` file by using the field [`use_truncation_thresholds`](https://rsmtool.readthedocs.io/en/latest/usage_rsmtool.html#use-truncation-thresholds-optional).

* Users can now run the `.ipynb` notebook generated from the experiment interactively, without having to set any environment variables. Each experiment now generates a (hidden) environment JSON file, which the notebook will automatically read.

API changes

* There is now a separate function `utils.standardized_mean_difference()` that can be used to compute SMD.

* A new function `reader.try_to_load_file()` allows API users to specify what should happen if a file cannot be loaded. The function can be set to return `None`, to raise a warning, or to raise an error.

* `DataContainer` class now includes additional helper methods. These methods allow users to `drop()` and `rename()` data frames in the DataContainer, and to select data frames using a specified prefix or suffix with the `get_frames()` method.

* The `Configuration` class now includes two additional helper methods, `pop()` and `copy()`.

* `utils.get_thumbnail_as_html()` now accepts an optional argument `path_to_thumbnail` which allows using two different paths for thumbnails and full-size images.


Other

* Support for `seaborn 0.9.0` and `statsmodels 0.9.0`.

* Support for `numpy 1.14.0`, `scipy 1.1.0`, and `pandas 0.23.0+`.

* Support for `ipython 6.5.0` and `notebook 5.7.2`.

* Fixed the documentation, which misstated the order of operations in the processing pipeline: the change of feature sign (if applicable) happens after standardization.

* If the user specifies a list of features and one of such features has zero variance, the tool now displays the correct error message.

* The logging messages displayed by `check_flag_column` now indicate the partition if different flag columns were used for training and evaluating the model.

* Miscellaneous minor bug fixes in the notebooks.

6.0.1

This is a bugfix release.

- The "System Information" section of the reports now uses `pkg_resources` instead of `pip` to get the list of installed packages since `pip` disallows the use of its internal API starting with v10.
- Fix incorrect formatting in the documentation.
- Update `ipython` and `notebook` package versions in order to address an incompatibility issue with the latest version of the `tornado` web server that affects interactive use of `ipython notebook` but not the report generation itself.
- Updated the description of the marginal/partial correlation plot in the report.

6.0

What's new?

This is a major release. The entire code base has been fully refactored to use a much more object-oriented design. This should make it much easier to make improvements and to add extensions. As a result, there have been significant changes to the RSMTool API (see the documentation link below for more details).

New features

New learners

* New regressors from the latest SKLL release (v1.5.1) have been added to ``rsmtool``.
* `rsmtool` can now be used with both regressors and classifiers from SKLL, including classifiers that produce probabilistic output which can be used to produce expected values as predictions.

See the [SKLL documentation](http://skll.readthedocs.io/en/latest/run_experiment.html?highlight=learners#learners) for the full list of learners.
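
The idea of turning probabilistic classifier output into an expected-value prediction can be sketched as a probability-weighted average of the score labels. The helper `expected_scores` and the example labels are assumptions made for illustration and are not part of the RSMTool or SKLL API.

```python
import numpy as np


def expected_scores(probabilities, labels):
    """Probability-weighted average of the score labels for each response."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Each row of probabilities is dotted with the vector of labels.
    return probabilities @ labels


# Two responses, with class probabilities over the score labels 1, 2 and 3.
class_probabilities = [[0.1, 0.6, 0.3], [0.0, 0.2, 0.8]]
predictions = expected_scores(class_probabilities, labels=[1, 2, 3])
print(predictions)
```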

Enhanced outputs

* Users can now specify the ``file_format`` configuration option to save intermediate files in either ``tsv``, ``csv``, or ``xlsx`` format.
* Users can specify a ``use_thumbnails`` configuration option that will embed clickable thumbnails in the HTML report, rather than full-sized images. Upon clicking the thumbnails, full-sized images will be displayed in a new window. This is particularly useful for larger reports with many images, improving both the readability and the loading speed of such reports.
* Reports for `rsmtool`, `rsmeval`, and `rsmsummarize` now contain a new section containing links to intermediate files (``intermediate_file_paths.ipynb``) so that users can now easily inspect these files from the report itself.

New configuration options

* Users can now specify ``features`` in the configuration file as a ``list``. When providing a list of features, signs or transformations cannot be specified. This makes creating configuration files for simple experiments much easier and faster.
* Users can now specify a ``skll_objective`` for tuning the SKLL learners used in their experiments.
* Users can now specify a ``flag_column_test`` configuration option to use different flags for the test file and the training file.
* Users can now set the boolean option ``standardize_features`` to ``false`` if they do not want the feature values standardized. Standardization remains the default.

New evaluations

* `rsmtool` and `rsmeval` now compute disattenuated correlations if the data includes two human scores.
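
A disattenuated correlation corrects the human-machine correlation for the unreliability of the human scores. The sketch below assumes the usual form, human-machine correlation divided by the square root of the human-human correlation; the function is a hypothetical helper and not RSMTool's implementation.

```python
import numpy as np


def disattenuated_correlation(human1, human2, system):
    """Human-machine correlation divided by sqrt of human-human correlation."""
    r_human_machine = np.corrcoef(human1, system)[0, 1]
    r_human_human = np.corrcoef(human1, human2)[0, 1]
    # Only meaningful when the human-human correlation is positive.
    return r_human_machine / np.sqrt(r_human_human)


h1 = [1, 2, 3, 4, 5]
h2 = [1, 3, 3, 4, 4]
machine = [1.4, 2.2, 2.9, 4.5, 4.1]
print(disattenuated_correlation(h1, h2, machine))
```

Since the human-human correlation is at most 1, the disattenuated value is never smaller in magnitude than the raw human-machine correlation.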

Code changes

* New helper classes have been added to ``rsmtool``, which allow easy reading, writing, and manipulation of multiple ``pandas`` data frames.
  - ``container.DataContainer()``: A class to encapsulate multiple data frames.
  - ``reader.DataReader()``: A class to read multiple tabular files into a ``DataContainer()`` object.
  - ``writer.DataWriter()``: A class to write all data frames contained in a ``DataContainer()`` object to separate files, with a specified file extension.
* The ``rsmtool`` module is now installable via ``pip``, in addition to being installable with ``conda``.
* `preprocessor.trim()` can now take both numpy arrays and lists as inputs.

Bugfixes

* Fixed warning in ``rsmcompare`` when computing summary evaluations.
* Previously confusion matrices forced human scores to integers, while score distributions used the value "as is". Now both analyses use rounded human scores.
* Length columns are now forced to numeric, if they are non-numeric.

Documentation

* Added documentation for refactored [API](http://rsmtool.readthedocs.io/en/latest/api.html).
* Added detailed documentation about [how to write RSMTool tests](http://rsmtool.readthedocs.io/en/latest/contributing.html#rsmtool-tests).

5.7

What's new?

* Update Python to v3.6, pandas to v0.22.0 and SKLL to v1.5. This required minor changes to the code and updates to some of the test files.
* The conda installation command has changed. See the new command [here](http://rsmtool.readthedocs.io/en/latest/getting_started.html).

Improvements

* In addition to bar plots, the `evaluation_by_group` notebook now includes a table showing the main metrics for each subgroup.
* When using the RSMTool API, it is now possible to specify a `tolerance` keyword argument for the `trim` method. Read more [here](http://rsmtool.readthedocs.io/en/latest/api.html#from-preprocess-module).

Bugfixes
* The differential feature functioning (DFF) plots are now correctly generated using preprocessed feature values. In the previous version, they incorrectly used raw feature values.
* In v0.19.0 of scikit-learn, the computation of `explained_variance_` in the PCA implementation underwent some bugfixes. Due to this, the results of PCA analyses had to be changed and no longer match those produced by previous versions of RSMTool.

Other minor changes
* Updated the utility script `update_skll_model.py` to make it compatible with SKLL v1.5.
* Minor updates for the documentation.

5.6

This is an important release that has a critical bugfix as well as useful improvements.

Bugfixes
- Fixed critical bug in computation of standardized mean differences. The denominator for SMDs should be using population standard deviations, not the ones computed over the subgroups themselves.
- Added converters to the notebook header to allow correct treatment of candidate IDs with leading zeros.
- Modified the test utility functions to catch discrepancies caused by missing leading zeros.
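
The leading-zero problem is easy to reproduce with pandas: by default, an ID such as `00123` is parsed as the integer 123. The sketch below shows how a converter keeps the IDs as strings; the column name `candidate` and the sample data are assumptions made for illustration.

```python
import io

import pandas as pd

csv_text = "candidate,score\n00123,3\n04567,4\n"

# Default parsing turns the IDs into integers and drops the leading zeros.
default_df = pd.read_csv(io.StringIO(csv_text))

# A converter forces the ID column to be read as strings.
converted_df = pd.read_csv(io.StringIO(csv_text), converters={"candidate": str})

print(default_df["candidate"].tolist())
print(converted_df["candidate"].tolist())
```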

Improvements
- The tables generated by `rsmsummarize` are now saved in the same way as for other tools.
- `rsmsummarize` now shows a table with standardized coefficients for all models.
- The predictions for the post-processed training set are now also saved.
- Added a new notebook that shows differential feature functioning (DFF) plots by subgroup. To use it, add `dff_by_group` to the `general_sections` configuration option. Read more [here](http://rsmtool.readthedocs.io/en/latest/usage_rsmtool.html#general-sections-rsmtool).
- The features that have not been used in the model are now excluded from the datasets before they are sent to SKLL for prediction. This makes the prediction step much faster for large datasets.
- When testing whether the feature std. dev. in the training set is zero, the tolerance was previously set to 1e-06. This is too loose for features with very low values (these can result from an inverse transform of acoustic likelihoods, which are logs of very small values). The tolerance has now been tightened to 1e-07.
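
The effect of the tolerance change in the last item can be seen in a small sketch. It assumes the check compares the sample standard deviation to zero with an absolute tolerance; the helper `has_zero_variance` is hypothetical and the exact check used by RSMTool may differ.

```python
import numpy as np


def has_zero_variance(values, tolerance=1e-07):
    """Return True if the sample std. dev. is within `tolerance` of zero."""
    std_dev = np.std(values, ddof=1)
    return bool(np.isclose(std_dev, 0, atol=tolerance))


# A feature with very small but genuinely varying values (std. dev. of 5e-07).
tiny_feature = [1e-6, 2e-6, 1.5e-6]

print(has_zero_variance(tiny_feature, tolerance=1e-06))
print(has_zero_variance(tiny_feature, tolerance=1e-07))
```

With the looser tolerance the feature is wrongly treated as constant, while the tighter tolerance recognizes that it varies.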

Other Minor Changes
- Update the utility script `update_skll_model.py` to allow it to be used with other tools.
- Update tests and documentation.
