💥 Incompatible Changes 💥

- Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue 497, PR 506).
- Configuration field `objective` has been deprecated and replaced with `objectives`, which [allows](https://skll.readthedocs.io/en/latest/run_experiment.html#objectives-optional) specifying multiple tuning objectives for grid search (Issue 381, PR 458).
- Grid search is now enabled by default in both the API as well as while using a configuration file (Issue 463, PR 465).
- The `Predictor` class previously provided by the `generate_predictions` utility script is no longer available. If you were relying on this class, you should just load the model file and call `Learner.predict()` instead (Issue 562, PR 566).
- There are no longer any default grid search objectives since the choice of objective is best left to the user. Note that since grid search is enabled by default, you must either choose an objective or explicitly disable grid search (Issue 381, PR 458).
- `mean_squared_error` is no longer supported as a metric. Use `neg_mean_squared_error` instead (Issue 382, PR 470).
- The `cv_folds_file` configuration file field is now just called `folds_file` (Issue 382, PR 470).
- Running an experiment with the `learning_curve` task now requires specifying [`metrics`](https://skll.readthedocs.io/en/latest/run_experiment.html#metrics-optional) in the `Output` section instead of `objectives` in the `Tuning` section (Issue 382, PR 470).
- Previously when reading in CSV/TSV files, missing data was automatically imputed as zeros. This is not appropriate in all cases. This is no longer the case and blanks are retained as is. Missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue 364, PRs 475 & 518).
- `pandas` and `seaborn` are now direct dependencies of SKLL, and not optional (Issues 455 & 364, PRs 475 & 508).
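Because blanks in CSV/TSV files are no longer silently turned into zeros, a file with missing cells has to be cleaned up deliberately. The following is only a rough sketch of the two choices (drop or replace) using the standard-library `csv` module; it does not use SKLL's own options for this, and the column names, sample data, and helper functions are all made up for illustration.

```python
import csv
import io

# Hypothetical feature file with blank cells; the column names are invented.
RAW = "id,f1,f2,y\nEX1,1.5,,a\nEX2,,2.0,b\nEX3,0.5,3.0,a\n"


def drop_blanks(rows, columns):
    """Keep only the rows in which every listed column is non-empty."""
    return [row for row in rows if all(row[col].strip() for col in columns)]


def replace_blanks(rows, columns, value):
    """Return copies of the rows with empty cells in `columns` set to `value`."""
    return [
        {**row, **{col: (row[col] if row[col].strip() else value) for col in columns}}
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(RAW)))
feature_columns = ["f1", "f2"]

# Option 1: discard incomplete examples.
dropped = drop_blanks(rows, feature_columns)

# Option 2: fill blanks explicitly; "0" mimics the old implicit behavior.
filled = replace_blanks(rows, feature_columns, "0")

print([row["id"] for row in dropped])
print(filled[0])
```

Either result can then be written back out with `csv.DictWriter` before the file is handed to SKLL. Which option is appropriate depends on the data; filling with zeros is shown only because it reproduces what earlier versions did automatically.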
💡 New features 💡

- `CSVReader`/`CSVWriter` & `TSVReader`/`TSVWriter` now use `pandas` as the backend rather than custom code that relied on the `csv` module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of a moderate increase in memory consumption. See detailed benchmarks [here](https://github.com/EducationalTestingService/skll/files/3637196/test_skll.pdf) (Issue 364, PRs 475 & 518).
- SKLL models now have a new [`pipeline` attribute](https://skll.readthedocs.io/en/latest/run_experiment.html#pipeline-optional) which makes it easy to manipulate and use them in `scikit-learn`, if needed (Issue 451, PR 474).
- `scikit-learn` updated to 0.21.3 (Issue 457, PR 559).
- The SKLL conda package is now a [generic Python package](https://www.anaconda.com/condas-new-noarch-packages/) which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public [ETS anaconda channel](https://anaconda.org/ets).
- SKLL learner hyperparameters have been updated to match the new `scikit-learn` defaults and those upcoming in 0.22.0 (Issue 438, PR 533).
- Intermediate results for the grid search process are now available in the [`results.json`](https://skll.readthedocs.io/en/latest/run_experiment.html#results-files) files (Issue 431, 471).
- The K models trained for each split of a K-fold cross-validation experiment can now be [saved](https://skll.readthedocs.io/en/latest/run_experiment.html#save-cv-models-optional) to disk (Issue 501, PR 505).
- Missing values in CSV/TSV files can be dropped/replaced both via the [command line](https://skll.readthedocs.io/en/latest/utilities.html#cmdoption-filter-features-db) and the [API](https://skll.readthedocs.io/en/latest/api/data.html#skll.data.readers.CSVReader) (Issue 540, PR 542).
- Warnings from `scikit-learn` are now captured in SKLL log files (issue 441, PR 480).
- `Learner.model_params()` and, consequently, the [`print_model_weights`](https://skll.readthedocs.io/en/latest/utilities.html#print-model-weights) utility script now work with models trained on hashed features (issue 444, PR 466).
- The [`print_model_weights`](https://skll.readthedocs.io/en/latest/utilities.html#print-model-weights) utility script can now output feature weights sorted by class labels to improve readability (Issue 442, PR 468).
- The [`skll_convert`](https://skll.readthedocs.io/en/latest/utilities.html#skll-convert) utility script can now convert feature files that do not contain labels (Issue 426, PR 453).

🛠 Bugfixes & Improvements 🛠

- Fix several bugs in how various tuning objectives and output metrics were computed (Issues 545 & 548, PR 551).
- Fix how [`pos_label_str`](https://skll.readthedocs.io/en/latest/run_experiment.html#pos-label-str-optional) is documented, read in, and used for classification tasks (Issues 550 & 570, PRs 566 & 571).
- Fix several bugs in the `generate_predictions` utility script and streamline its implementation to _not_ rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues 484 & 562, PR 566).
- Fix bug due to overlap between tuning objectives and metrics that could prevent metric computation (Issue 564, PR 567).
- Using an externally specified `folds_file` for grid search now works for `evaluate` and `predict` tasks, not just `train` (Issue 536, PR 538).
- Fix incorrect application of sampling _before_ feature scaling in `Learner.predict()` (Issue 472, PR 474).
- Disable feature sampling for `MultinomialNB` learner since it cannot handle negative values (Issue 473, PR 474).
- Add missing logger attribute to `Learner.FilteredLeaveOneGroupOut` (Issue 541, PR 543).
- Fix `FeatureSet.has_labels` to recognize list of `None` objects which is what happens when you read in an unlabeled data set and pass `label_col=None` (Issue 426, PR 453).
- Fix bug in `ARFFWriter` that adds/removes `label_col` from the field names even if it's `None` to begin with (Issue 452, PR 453).
- Do not produce unnecessary warnings for learning curves (Issue 410, PR 458).
- Show a warning when applying feature hashing to multiple feature files (Issue 461, PR 479).
- Fix loading issue for saved `MultinomialNB` models (Issue 573, PR 574).
- Reduce memory usage for learning curve experiments by explicitly closing `matplotlib` figure instances after they are saved.
- Improve SKLL’s cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the `newline` parameter when writing files.

📖 Documentation Updates 📖

- Reorganize documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the `Output` section (Issue 459, PR 568).
- Add new interactive tutorial that uses a Jupyter notebook hosted on binder (Issue 448, PRs 547 & 552).
- Add a new page to official documentation explaining how the SKLL code is organized for new developers (Issue 511, PR 519).
- Update SKLL contribution guidelines and link to them from official documentation (Issues 498 & 514, PRs 503 & 519).
- Update documentation to indicate that `pandas` and `seaborn` are now direct dependencies and not optional (Issue 553, PR 563).
- Update `LogisticRegression` learner documentation to talk explicitly about penalties and solvers (Issue 490, PR 500).
- Properly [document](https://skll.readthedocs.io/en/latest/api/data.html#notes-about-ids-label-conversion) the internal conversion of string labels to ints/floats and possible edge cases (Issue 436, PR 476).
- Add feature scaling to Boston regression example (Issue 469, PR 478).
- Several other additions/updates to documentation (Issue 459, PR 568).

✔️ Tests ✔️

- Make `tests` into a package so that we can do something like `from skll.tests.utils import X` etc. (Issue 530, PR 531).
- Add new tests based on SKLL examples so that we would know if examples ever break with any SKLL updates (Issues 529 & 544, PR 546).
- Tweak tests to make test suite runnable on Windows (and pass!).
- Add [Azure Pipelines](https://azure.microsoft.com/en-us/services/devops/pipelines/) integration for automated test builds on Windows.
- Added several new comprehensive tests for all new features and bugfixes. Also, removed older, unnecessary tests. See various PRs above for details.
- Current code coverage for SKLL tests is at 95%, the highest it has ever been!

🔍 Other changes 🔍

- Replace `prettytable` with the more actively maintained `tabulate` (Issue 356, PR 467).
- Make sure entire codebase complies with PEP8 (Issue 460, PR 568).
- Update the year to 2019 everywhere (Issue 447, PRs 456 & 568).
- Update TravisCI configuration to use `conda_requirements.txt` for building environment (PR 515).

👩🔬 Contributors 👨🔬

(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Supreeth Baliga (SupreethBaliga), Jeremy Biggs (jbiggsets), Aoife Cahill (aoifecahill), Ananya Ganesh (ananyaganesh), R. Gokul (rgokul), Binod Gyawali (bndgyawali), Nitin Madnani (desilinguist), Matt Mulholland (mulhod), Robert Pugh (Lguyogiro), Maxwell Schwartz (maxwell-schwartz), Eugene Tsuprun (etsuprun), Avijit Vajpayee (AVajpayeeJr), Mengxuan Zhao (chaomenghsuan)
This is a minor release of SKLL with the most notable change being compatibility with the latest version of scikit-learn (v0.20.1).

What's new

- SKLL is now compatible with scikit-learn v0.20.1 (Issue 432, PR 439).
- `GradientBoostingClassifier` and `GradientBoostingRegressor` now accept sparse matrices as input (Issue 428, PR 429).
- The `model_params` property now works for SVC learners with a linear kernel (Issue 425, PR 443).
- Improved documentation (Issue 423, PR 437).
- Update `generate_predictions` to output the probabilities for _all_ classes instead of just the first class (Issue 430, PR 433). **Note**: this change breaks backward compatibility with previous SKLL versions since the output file now _always_ includes a column header.

Bugfixes

- Fixed broken links in documentation (Issues 421 and 422, PR 437).
- Fixed data type conversion in `NDJWriter` (Issue 416, PR 440).
- Properly handle the possible combinations of trained model and prediction set vectorizers in `Learner.predict` (Issue 414, PR 445).

Other changes

- Make the tests for `MLPClassifier` and `MLPRegressor` go faster (by turning off grid search) to prevent Travis CI from timing out (issue 434, PR 435).
This is a hot fix release that addresses a single issue. `Learner` instances created via `from_file()` method did not get loggers associated with them. This meant that any and all warnings generated for such learner instances would have led to `AttributeError` exceptions.
This is primarily a bug fix release.

Bugfixes

- Generate the "folds_file" warnings only when "folds_file" is specified (issue 404, PR 405).
- Modify `Learner.save()` to deal properly with reading in and re-saving older models (issue 406, PR 407).
- Fix regression that caused the output directories to not be automatically created (issue 408, PR 409).
This is a major new release of SKLL.

What's new

- Several new scikit-learn learners included along with reasonable default parameter grids for tuning, where appropriate (issues 256 & 375, PR 377).
  - `BayesianRidge`
  - `DummyRegressor`
  - `HuberRegressor`
  - `Lars`
  - `MLPRegressor`
  - `RANSACRegressor`
  - `TheilSenRegressor`
  - `DummyClassifier`
  - `MLPClassifier`
  - `RidgeClassifier`
- Allow computing any number of additional evaluation metrics in addition to the tuning objective (issue 350, PR 384).
- Rename `cv_folds_file` configuration option to `folds_file`. The former is still supported with a deprecation warning but will be removed in the next release (PR 367).
- Add a new configuration option [`use_folds_file_for_grid_search`](http://skll.readthedocs.io/en/latest/run_experiment.html#use-folds-file-for-grid-search-optional) which controls whether the inner-loop grid-search in a cross-validation experiment with a custom folds file also uses the folds from the file. It's set to True by default. Setting it to False means that the inner loop uses regular 3-fold cross-validation and ignores the file (PR 367).
- Also add a keyword argument called `use_custom_folds_for_grid_search` to the `Learner.cross_validate()` method (PR 367).
- Learning curves can now be plotted from existing summary files using the new [`plot_learning_curves`](http://skll.readthedocs.io/en/latest/utilities.html#plot-learning-curves) command line utility (issue 346, PR 396).
- Overhaul logging in SKLL. All messages are now logged both to the console (if running interactively) and to log files. Read more about the SKLL log files in the [Output Files section](http://skll.readthedocs.io/en/latest/run_experiment.html#output-files) of the documentation (issue 369, PR 380).
- `neg_log_loss` is now available as an objective function for classification (issue 327, PR 392).

Changes

- SKLL now supports Python 3.6. Although Python 3.4 and 3.5 will still work, 3.6 is now the officially supported Python 3 version. Python 2.7 is still supported (issue 355, PR 360).
- The required version of scikit-learn has been bumped up to 0.19.1 (issue 328, PR 330).
- The learning curve y-limits are now computed a bit more intelligently (issue 389, PR 390).
- Raise a warning if ablation flag is used for an experiment that uses `train_file`/`test_file`, since this is not supported (issue 313, PR 392).
- Raise a warning if both `fixed_parameters` and `param_grids` are specified (issue 185, PR 297).
- Disable grid search if no default parameter grids are available in SKLL and the user doesn't provide parameter grids either (issue 376, PR 378).
- SKLL has a copy of scikit-learn's `DictVectorizer` because it needs some custom functionality. _Most_ (but not all) of our modifications have now been merged into scikit-learn so our custom version is now significantly condensed down to just a single method (issue 263, PR 374).
- Improved outputs for cross-validation tasks (issues 349 & 371, PRs 365 & 372)
  - When a folds file is specified, the log erroneously showed the full dictionary.
  - Show number of cross-validation folds in results to be `<n> via folds file` if a folds file is specified.
  - Show grid search folds in results to be `<n> via folds file` if the grid search ends up using the folds file.
  - Do not show the stratified folds information in results when a folds file is specified.
  - Show the value of `use_folds_file_for_grid_search` in results when appropriate.
  - Show grid search related information in results only when we are actually doing grid search.
- The Travis CI plan was broken up into multiple jobs in order to get around the 50 minute limit (issue 385, PR 387).
- For the conda package, some of the dependencies are now sourced from the `conda-forge` channel.

Bugfixes

- Fix the bug that was causing the inner grid-search loop of a cross-validation experiment to use a single job instead of the number specified via `grid_search_jobs` (issue 363, PR 367).
- Fix unbound variable in `readers.py` (issue 340, PR 392).
- Fix bug when running a learning curve experiment via `gridmap` (issue 386, PR 390).
- Fix a mismatch between the default number of grid search folds and the default number of slots requested via `gridmap` (issue 342, PR 367).

Documentation

- Update documentation and tests for all of the above changes and new features.
- Update tutorial and installation instructions (issues 383 and 394, PR 399).
- Standardize all of the function and method docstrings to be NumPy style. Add docstrings where missing (issue 373, PR 397).
This is a major new release of SKLL.

New features

- You can now generate learning curves for multiple learners, multiple feature sets, and multiple objectives in a single experiment by using `task=learning_curve` in the configuration file. See [documentation](http://skll.readthedocs.io/en/latest/run_experiment.html#learning-curve) for more details (issue 221, PR 332).

Changes

- The required version of scikit-learn has been bumped up to 0.18.1 (issue 328, PR 330).
- SKLL now uses the MKL backend on macOS/Linux instead of OpenBLAS when used as a `conda` package.

Bugfixes

- Fix deprecation warning when using `Learner.model_params()` (issue 325, PR 329).
- Update the definitions of SKLL F1 metrics as a result of scikit-learn upgrade (issue 325, PR 330).
- Bring documentation for SVC parameter grids up to date with the code (issue 334, PR 337).
- Update documentation to make it clear that the SKLL `conda` package is only available for Python 3.4. For other Python versions, users should use `pip`.
This is primarily a bug fix release but also adds a major new API feature.

New API Feature:

- If you use the SKLL API, you can now create `FeatureSet` instances _directly_ from `pandas` data frames (issue 261, PR 292).

Bugfixes:

- Correctly parse floats in scientific notation, e.g., when specifying parameter grids and/or fixed parameters (issue 318, PR 320)
- `print_model_weights` now correctly handles models trained with `fit_intercept=False` (issue 322, PR 323).
This release includes major changes as well as a number of bugfixes.

Changes:

- The required version of scikit-learn has been bumped up to 0.17.1 (issue 273, PRs 288 and 308)
- You can now optionally save cross-validation folds to a file for later analysis (issue 259, PR 262)
- Update documentation to be clear about when two `FeatureSet` instances are deemed equal (issue 272, PR 294)
- You can now specify multiple objective functions for parameter tuning (issue 115, PR 291)

Bugfixes:

- Use a fixed random state when doing non-stratified k-fold cross-validation (issue 247, PR 286)
- Fix errors when reusing relative paths in output section (issue 252, PR 287)
- `print_model_weights` now works correctly for multi-class logistic regression models (issue 274, PR 267)
- Correctly raise an `IOError` if the config file is not correctly specified (issue 275, PR 281)
- The `evaluate` task does not crash when the test data has labels that were not seen in training data (issue 279, PR 290)
- The `fit()` method for rescaled versions of learners now works correctly when not doing grid search (issue 304, PR 306)
- Fix minor typos in the documentation and tutorial.
This is a minor bugfix release. It fixes:

- Issue where a `FileExistsError` would be raised when processing many configs (PR 260)
- Instance of `cv_folds` instead of `num_cv_folds` in the documentation (PR 248).
- Crash with `print_model_weights` and Logistic Regression models without intercepts (issue 250, PR 251)
- Division by zero error when there was only one example (issue 253, PR 254)
The biggest changes in this release are that the required version of scikit-learn has been bumped up to 0.16.1 and config file parsing is much more robust and gives much better error messages when users make mistakes.

Implemented enhancements

- Base estimators other than the defaults are now supported for `AdaBoost` classifiers and regressors (238)
- User can now specify number of cross-validation folds to use in the config file (222)
- Decision Trees and Random Forests no longer need dense inputs (207)
- Stratification during cross-validation is now optional (160)

Fixed bugs

- Bug when checking if `hasher_features` is a valid option (234)
- Invalid/missing/duplicate options in configuration are now detected (223)
- Stop modifying global numpy random seed (220)
- Relative paths specified in the config file are now relative to the config file location instead of to the current directory (213)

Closed issues

- Incompatibility with the latest version of scikit-learn (v0.16.1) (235, 241, 233)
- Learner.model_params will return weights with the wrong sign if sklearn is fixed (111)

Merged pull requests

- Overhaul configuration file parsing (desilinguist, 246)
- Several minor bugfixes (desilinguist, 245)
- Compatibility with scikit-learn v0.16.1 (desilinguist, 243)
- Expose cv_folds and stratified (aoifecahill, 240)
- Adding Report tests (brianray, 237)

[Full Changelog](https://github.com/EducationalTestingService/skll/compare/v1.0.1...v1.1.0)
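The relative-path fix listed above (213) amounts to anchoring paths at the directory that holds the configuration file rather than at the process's working directory. The sketch below illustrates that idea with `pathlib`; the helper name `resolve_config_path` is hypothetical and this is not SKLL's actual implementation.

```python
from pathlib import Path


def resolve_config_path(config_file, value):
    """Resolve `value` against the directory containing `config_file`.

    Hypothetical helper: absolute paths are returned unchanged, relative
    paths are interpreted relative to the config file's own directory.
    """
    path = Path(value)
    if path.is_absolute():
        return path
    return Path(config_file).parent / path


# A relative entry in experiments/iris/evaluate.cfg resolves next to the
# config file, no matter which directory the experiment is launched from.
print(resolve_config_path("experiments/iris/evaluate.cfg", "train"))
```

With this behavior, a config file and its data can be moved or shared as a unit without editing the paths inside it.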
This is a fairly minor bugfix release. Changes include:

- Update links in README.
- Fix crash when trying to run experiments with integer labels (Issue 225, PR 219)
- Update documentation about ablation to note that there will always be a run with all features (Issue 224, PR 226)
- Update documentation about format of `cv_folds_file` (Issue 225, PR 228)
- Remove duplicate words in documentation (PR 218)
- Fixed `KeyError` when trying to build conda recipe.
- Update outdated parameter grids in `run_experiment` documentation (commit 80d78e4)
The 1.0 release is finally here! It's been a little over a year since our first public release, and we're ready to say that SKLL is 1.0. Read our massive release notes:

:warning: We did make some API- and config-file-breaking changes. They are listed at the end of the release notes. They should all be addressable by a quick find-and-replace.

Bug fixes

- Fixed path problems in iris example (issue 103, PR 171)
- Fixed bug where `ablated_features` field was incorrect when config file contained multiple feature sets (issue 125)
- Fixed bug where CV would crash with rare classes (issue 109, PR 165)
- Fixed issue where warning about extremely large feature values was being issued before rescaling
- Fixed issue where some warning messages used a mix of new-style and old-style replacement strings with old-style formatting.
- Fixed a number of bugs with filtering `FeatureSet` objects and writing filtered sets to files.
- Fixed bug in `FeatureSet.__sub__` where feature names were being passed instead of indices.
- Fixed issue where `MegaMWriter` could not print numbers in Python 2.7.

New features

- SKLL releases are now for specific versions of scikit-learn. 1.0.0 requires scikit-learn 0.15.2 (issue 138, PR 170)
- Added [tutorial](https://skll.readthedocs.org/en/master/tutorial.html) to documentation that walks new users through using SKLL in much the same way as our PyData talks (issue 153).
- Added support for custom learners (issue 92, PR 183)
- Added two command-line utilities, `join_features` and `filter_features`, for joining and filtering feature files. These replace `join_megam` and `filter_megam` (issue 79, PR 198)
- Added support for specifying the field in ARFF, CSV, or TSV files that contains the IDs for each instance (issue 204, PR 206)
- Added train/test set sizes to result files (issue 150, PR 161)
- Added intercept to `print_model_weights` output (issue 155, PR 163)
- Added total time and end time-stamp to experiment results (issue 91, PR 167)
- Added exception when `featureset_name` is longer than 210 characters (issue 121, PR 168)
- Added regression example data, `boston` (issue 162)
- Added ability to specify number of grid search folds (issue 122, PR 175)
- Added warning message when the number of features in the trained model differs from that of the `FeatureSet` passed to `Learner.predict()` (issue 145)
- Added `conda.yaml` file to repository to make conda package creation simpler (issue 159, PR 173)
- Added loads more unit tests, greatly increased unit test coverage, and generally cleaned up test modules (issues 97, 148, 157, 188, and 202; PRs 176, 184, 196, 203, and 205)
- Added `train_file` and `test_file` fields to config files, which can be used to specify single file feature sets. This greatly simplifies running simple experiments (issue 12, PR 197)
- Added support for merging feature sets with IDs in different orders (issue 149, PR 177)
- Added `ValueError` when invalid tuning objective is specified (issues 117 and 179; PRs 174 and 181)
- Added `shuffle` option to config files to decide whether training data should be shuffled before training. By default this is `False`, but if `grid_search` is `True`, we will automatically `shuffle`. Previously, the default was `True`, and there was no option in the config files. (issue 189, PR 190)
- Updated documentation to indicate that we're using `StratifiedKFold` (issue 160)
- Added `FeatureSet.__eq__` and `FeatureSet.__getitem__` methods.

Minor changes without issues

- Overhauled and cleaned up all documentation. [Look](https://skll.readthedocs.org) how pretty it is!
- Updated docstrings all over the place to be more accurate.
- Updated `generate_predictions` to use new `Reader` API.
- Added `argv` optional argument to all utility script `main` functions to simplify testing.
- Added `mock` tests, so SKLL now requires `mock` to work with Python 2.7.
- Added prettier SVG badges to README.
- Added link to Data Science at the Command Line to README.
- `LibSVMReader` now converts UTF-8 replacement characters that are used by `LibSVMWriter` when a feature name contains an `=`, `|`, `#`, `:`, or ` ` back to the original ASCII characters.

:warning: API breaking changes :warning:

- `FeatureSetWriter` :arrow_right: `Writer`
- `load_examples(path)` :arrow_right: `Reader.for_path(path).read()`
- `write_feature_file(...)` :arrow_right: `Writer.for_path(FeatureSet(...)).write()`
- `FeatureSet.classes` :arrow_right: `FeatureSet.labels`
- All other instances of word "classes" changed to "labels" (166)
- `FeatureSet.feat_vectorizer` :arrow_right: `FeatureSet.vectorizer`
- `run_ablation(all_combos=True)` :arrow_right: `run_configuration(ablation=None)`
- `run_ablation()` :arrow_right: `run_configuration(ablation=1)`
- `ExamplesTuple(ids, classes, features, vectorizer)` :arrow_right: `FeatureSet(name, ids, classes, features, vectorizer)`
- Removed `feature_hasher` argument to all `Learner` methods, because it's unnecessary
- `Learner.model_type` is now the actual type of the underlying model instead of just a string.
- `FeatureSet.__len__` now returns the number of examples instead of the number of features.
- Removed `skll.learner._REGRESSION_MODELS` and now we check for regression by seeing if model is subclass of `RegressorMixin`.
:warning: Config file breaking changes :warning:

- Removed all short names for learners (PR 199)
- Can no longer use `classifiers` instead of `learners`
- `train_location` :arrow_right: `train_directory`
- `test_location` :arrow_right: `test_directory`
- `cv_folds_location` :arrow_right: `cv_folds_file`
Bug fix release that fixes an issue where `python setup.py install` would not work because the `skll.data` package wasn't included in the list of packages.
This release has some big behind-the-scenes changes. First, we split the `data.py` module up into a sub-package (147). There is also a new `FeatureSet` class that replaces the old `namedtuple`-based `ExamplesTuple` (81), so `ExamplesTuple` is now deprecated and will be removed in SKLL 1.0.0. Speaking of which, we're having an all-day SKLL sprint on October 17th where we hope to resolve all the remaining issues preventing the 1.0 release.

Other changes include:

- Fixed a bunch of minor problems with loading/writing LibSVM files
- Added file reading/writing progress indicators
- Fixed crash with `generate_predictions` when the model was not trained with `probability` set to True (144).
- Deprecated `write_feature_file` function in favor of using a `FeatureSetWriter` object.
- Deprecated `load_examples` function in favor of using a `Reader` object.
- Temporarily added replacement version of scikit-learn `DictVectorizer` class until the scikit-learn/scikit-learn#3683 version is included in a release. This allows us to make file loading substantially more memory efficient.
The main new feature in this release is that `.libsvm` files are now fully supported by `skll_convert` and `run_experiment`. Because of this change, we've removed `megam_to_libsvm`.

Other changes include:

- Integer keys are now allowed in `fixed_parameters` and `param_grids` (134). Therefore, SKLL now requires PyYAML to function properly.
- Added documentation about using `class_weights` to manage imbalanced datasets (132)
- Added information about pre-specified folds (via `cv_folds_location`) to results JSON and plain-text files. (108)
- Added warning when encountering classes that are not in `class_map`. (114)
- Fixed issue where sampler `random_state` parameter would be overridden.
- Fixed license headers in CLI package. They were still GPL for some reason.
- Fixed issue 112 by switching to `joblib.pool.MemmappingPool` for handling parallel file loading. SKLL now requires joblib 0.8 to function properly.
- Fixed issue 104 by making result formatting more consistent.
- `compute_eval_from_predictions` now supports string-valued classes, as it should have. (135)
- We now raise an exception instead of allowing you to overwrite your results by including the same learner in the `learners` list in your config file twice (140).
- Fixed warning about files being left open in Python 3.4 (by not leaving them open anymore).
- Short names for learners have been deprecated and will be removed in SKLL 1.0.
- Added AdaBoost and KNeighbors classifiers and regressors (finally closing 7).
- Added support for [kernel approximation samplers](http://scikit-learn.org/stable/modules/kernel_approximation.html). (Thanks nineil)
- All linear models are now supported by `print_model_weights` (issue 119).
- Added `f1_score_weighted` metric so that weighted F1 will be calculated even for binary classification tasks.
- Modified `f1_score_micro` and `f1_score_macro` to also always return average for binary classification tasks (instead of previous behavior where only performance on positive class was returned).
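To make the difference between "positive class only" and averaged F1 concrete, here is a small plain-Python sketch. It is not SKLL's or scikit-learn's code; the function names and the toy labels are invented for illustration, and the averaging follows the usual definitions of macro, weighted, and micro F1.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, average):
    """Average per-label F1 over all labels, including in the binary case."""
    labels = sorted(set(y_true) | set(y_pred))
    if average == "micro":
        # Pool the counts over labels; each error is one FP and one FN.
        tp = sum(t == p for t, p in zip(y_true, y_pred))
        errors = len(y_true) - tp
        return 2 * tp / (2 * tp + 2 * errors) if y_true else 0.0
    per_label = {label: f1_for_label(y_true, y_pred, label) for label in labels}
    if average == "macro":
        return sum(per_label.values()) / len(labels)
    if average == "weighted":
        support = Counter(y_true)  # weight each label by its true frequency
        return sum(per_label[lab] * support[lab] for lab in labels) / len(y_true)
    raise ValueError(f"unknown average: {average!r}")


y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]

print("positive class only:", f1_for_label(y_true, y_pred, 1))
for kind in ("macro", "weighted", "micro"):
    print(kind, averaged_f1(y_true, y_pred, kind))
```

On this toy data the positive-class F1 is 0.8, while the macro, weighted, and micro averages come out lower because they also account for the weaker performance on class 0.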
This release includes a long-standing request being finally fulfilled (part of 7). We now support Stochastic Gradient Descent!

Full changelog:

- Added support for `SGDClassifier` and `SGDRegressor`
- Added option to use FeatureHasher instead of DictVectorizer to make learning with feature sets that have millions of features possible.
- Minor documentation fix for `generate_predictions`.

All the credit for this release goes to nineil. Thanks Nils!
- Added `compute_eval_from_predictions` utility for computing evaluation metrics after experiments have been run.
- Made rounding in the `kappa` code consistent across Python versions: Python 2 now uses banker's rounding, just like Python 3 does.
- Added support for printing model weights for linear SVR (110)
- Made `print_model_weights` only print all negative or all positive weights (105)
- Little PEP8 and documentation tweaks.
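For readers unfamiliar with the rounding difference mentioned above: Python 3's built-in `round()` rounds exact halves to the nearest even integer (banker's rounding), whereas Python 2's rounded halves away from zero. The sketch below reproduces both behaviors with the standard `decimal` module; the helper names are made up and this is not the code used in SKLL's `kappa` implementation.

```python
from decimal import ROUND_HALF_EVEN, ROUND_HALF_UP, Decimal


def bankers_round(value):
    """Round to the nearest integer, sending ties to the even neighbor."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


def half_away_from_zero_round(value):
    """Round to the nearest integer, sending ties away from zero (Python 2 style)."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


for x in (0.5, 1.5, 2.5, 3.5):
    print(x, bankers_round(x), half_away_from_zero_round(x), round(x))
```

The two strategies only disagree on exact halves such as 0.5 and 2.5, which is why the difference matters for metrics like kappa that round predictions to integer ratings.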
Fixed issue where some models would be different depending on order of feature files specified in config file (101).
- Add `--resume` option to `run_experiment` for resuming large experiments in the event of a crash.
- Fix issue where `grid_scores` was undefined when using `--keep-models`.
- Automatically generated feature set names now have sorted features to ensure they will always be generated in the same fashion.
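The last item is about making generated names independent of the order in which feature files are listed. A minimal sketch of that idea follows; the helper name and the `+` separator are assumptions made for illustration, not a description of SKLL's internals.

```python
def featureset_name(feature_files, separator="+"):
    """Build a name that does not depend on the order of `feature_files`.

    Hypothetical helper; the separator is an assumption.
    """
    return separator.join(sorted(feature_files))


# Both orderings produce the same name.
print(featureset_name(["lexical", "syntactic", "char_ngrams"]))
print(featureset_name(["syntactic", "char_ngrams", "lexical"]))
```

Sorting first means that output files and saved models keyed on the name can be found again regardless of how the feature list was written in the configuration.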
Fixed a bug where command-line scripts didn't work after previous release. (This should hopefully be the last of these rapid fire releases. We will add unit tests for these in the future.)
Fix missing `import sys` in `run_experiment.py`
Very minor bug fix release. Changes are:

- `main` functions for all utility scripts now take optional argument lists to make unit testing simpler (and not require subprocesses).
- Fix another bug that was causing missing "ablated features" lists in summary files.
Fix crash with `filter_megam` and `join_megam` due to references to old API.
Minor bug fix release. Changes are:

- Switch to joblib.dump and joblib.load for serialization (should fix 94)
- Switch to using official drmaa-python release now that it's updated on PyPI
- Fix issue where training examples were being loaded for pre-trained models (95)
- Change to using `entry_points` to generate scripts instead of `scripts` in `setup.py`, and utilities are now in a sub-package.
This release features mostly bug fixes, but also includes a few minor features:

- Change license to BSD 3 clause. Now any of our code could be added back into scikit-learn without licensing issues.
- Add gamma to default parameter search grid for SVC (84).
- Add `--verbose` flag to `run_experiment` to simplify debugging.
- Add support for wheel packaging.
- Fixed bug in `_write_summary_file` that prevented writing of summary files for `--ablation_all` experiments.
- Fixed SVR kernel string type issue (87).
- Fixed `fit_intercept` default value issue (88).
- Fixed incorrect error message (86)
- Tweaked `.travis.yml` to make builds a little faster.
- Added support for `ElasticNet`, `Lasso`, and `LinearRegression` learners.
- Reorganized examples, and created new example based on the Kaggle [Titanic](http://www.kaggle.com/c/titanic-gettingStarted) data set.
- Added ability to easily create multiple files at once when using `write_feature_file`. (80)
- Added support for the `.ndj` file extension for new-line delimited JSON files. It's the same format as `.jsonlines`, just with a different name.
- Added support for comments and skipping blank lines in `.jsonlines` files.
- Made some efficiency tweaks when creating logging messages.
- Made labels in `.results` files a little clearer for objective function scores.
- Fixed some misleading error messages.
- Fixed issue with backward-compatibility unit test in Python 2.7.
- Fixed issue where predict mode required data to already be labelled.
- Refactored `experiments` module to remove unnecessary child processes, and greatly simplify ablation code. This should fix issues 73 and 49.
- Deprecated `run_ablation` function, as its functionality has been folded into `run_configuration`.
- Removed ability to run multiple configuration files in parallel, since this led to too many processes being created most of the time.
- Added ability to run multiple ablation experiments from the same configuration file by adding support for multiple featuresets.
- Added `min_feature_count` value to results files, which fixes 62.
- Added more informative error messages when we run out of memory while converting things to dense. They now say why something was converted to dense in the first place.
- Added option to `skll_convert` for creating ARFF files that can be used for regression in Weka. Previously, files would always contain non-numeric labels, which would not work with Weka.
- Added ability to name relation in output ARFF files with `skll_convert`.
- Added `class_map` setting for collapsing multiple classes into one (or just renaming them). See the [run_experiment documentation](http://skll.readthedocs.org/en/latest/run_experiment.html#input) for details.
- Added warning when using `SVC` with `probability` flag set (2).
- Made logging much less verbose by default and switched to using `QueueHandler` and `QueueListener` instances when dealing with multiple processes/threads to prevent deadlocks (75).
- Added simple no-crash unit test for all learners. We check results with some, but not all. (63)
- Added support for running ablation experiments with _all_ combinations of features (instead of just holding out one feature at a time) via `run_experiment --ablation_all`. As a result, we've also changed the name of the `ablated_feature` column in result summary files to `ablated_features`.
- Added ARFF and CSV file support across the board. As a result, all instances of the parameter `tsv_label` have now been replaced with `label_col`.
- Fixed issue 71.
- Fixed process leak that was causing sporadic issues.
- Removed `arff_to_megam`, `csv_to_megam`, `megam_to_arff`, and `megam_to_csv` because they are all superseded by ARFF and CSV support in `skll_convert`.
- Switched to using Anaconda for installing Atlas.
- Switched back to http://skll.readthedocs.org URLs for documentation, now that rtfd/readthedocs.org#456 has been fixed.
- Fixed crash when `modelpath` is blank and `task` is not `cross_validate`.
- Fixed crash with `convert_examples` when given a generator.
- Refactored `skll.data`'s private `_*_dict_iter` functions to be classes to reduce code duplication.