All trainers now have a default logger, early-stopping callback, and checkpoint object. To modify this behavior, pass in your own versions of those objects.
- Removed collisions with logger versions by tying it to job id.
- Added a new DDP implementation. It uses DP within a node but allows multiple nodes. Useful for models which need negative samples, etc.
- Added support for LBFGS. If you pass in the LBFGS optimizer, Lightning handles the closure for you automatically.
- No longer need to set the master port; Lightning sets it for you using the job id.
- training_step and validation_end now return two separate dicts: one for the progress bar and one for logging.
- Added options for memory printing: 'min_max' logs only the min/max memory use; 'all' logs memory for all GPUs on the root node.
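The LBFGS point above hinges on the closure pattern: LBFGS-style optimizers may re-evaluate the loss several times per step, so they must be handed a callable rather than a precomputed loss. A minimal pure-Python sketch of that pattern (the classes and names here are illustrative stand-ins, not Lightning internals or `torch.optim.LBFGS` itself):

```python
# Illustrative sketch of why LBFGS needs a closure; not Lightning internals.
# The optimizer calls the closure multiple times within a single step,
# which is the bookkeeping Lightning now automates for you.

class ToyLBFGS:
    """Stand-in for an LBFGS-style optimizer that re-evaluates the loss."""
    def __init__(self):
        self.evals = 0

    def step(self, closure):
        # A real LBFGS performs a line search, re-evaluating the loss
        # (and gradients) several times per optimizer step.
        for _ in range(3):
            loss = closure()
            self.evals += 1
        return loss

def training_closure():
    # In a real trainer this would zero grads, run forward, and backward.
    return 0.5  # dummy loss

optimizer = ToyLBFGS()
final_loss = optimizer.step(training_closure)
print(optimizer.evals)  # 3 evaluations for one optimizer step
```

With a plain optimizer (SGD, Adam) the trainer calls `step()` once per batch; with LBFGS it must instead pass the whole forward/backward as a closure, which is why special handling is needed.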
This release has breaking API changes. See 124 for all details.
Syntax changes:
- Trainer options now use: train, test, val
- Dataloader methods: train_dataloader, val_dataloader, test_dataloader
- data_batch -> batch
- prog -> progress
- gradient_clip -> gradient_clip_val
- add_log_row_interval -> row_log_interval
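The argument renames above can be applied mechanically. A hypothetical migration helper (not part of Lightning) built from the rename table:

```python
# Hypothetical helper for the renamed Trainer arguments listed above;
# not part of Lightning itself, just a mechanical key rename.
RENAMED_ARGS = {
    "data_batch": "batch",
    "prog": "progress",
    "gradient_clip": "gradient_clip_val",
    "add_log_row_interval": "row_log_interval",
}

def migrate_kwargs(kwargs):
    """Return a copy of kwargs with old argument names mapped to new ones."""
    return {RENAMED_ARGS.get(key, key): value for key, value in kwargs.items()}

old = {"gradient_clip": 0.5, "add_log_row_interval": 10, "max_epochs": 3}
print(migrate_kwargs(old))
# {'gradient_clip_val': 0.5, 'row_log_interval': 10, 'max_epochs': 3}
```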
This release does the following:
- Moved SLURM auto-resubmit from test-tube into PL, which removes the need for the cluster parameter.
- Cluster checkpointing is now done by Lightning (not test-tube). A checkpoint object is also no longer required to restore weights on a cluster.
- Loads all models onto the CPU when restoring weights to avoid OOM issues in PyTorch. Users now need to move models to the GPU manually; however, when Lightning manages training, it moves them to the correct GPUs automatically.
- Fixes various subtle bugs in DDP implementation.
- Documentation updates.
- validation_step and val_dataloader are now optional.
- Enabled multiple dataloaders for validation.
- Support for the latest test-tube logger, optimized for PyTorch 1.2.0.
- lr_scheduler now steps after each epoch.
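The multiple-validation-dataloaders feature above implies a loop over loaders with an index passed to the step. A sketch of how such a loop could look; the `dataloader_idx` argument and loop structure here are assumptions for illustration, not Lightning's actual validation loop:

```python
# Illustrative sketch of validating over multiple dataloaders; the
# dataloader_idx argument and loop structure are assumptions, not
# Lightning's actual implementation.

def validation_step(batch, batch_idx, dataloader_idx):
    # A real step would compute metrics on the batch.
    return {"dataloader_idx": dataloader_idx, "batch": batch}

def run_validation(val_dataloaders):
    outputs = []
    for dataloader_idx, loader in enumerate(val_dataloaders):
        for batch_idx, batch in enumerate(loader):
            outputs.append(validation_step(batch, batch_idx, dataloader_idx))
    return outputs

# Two toy "dataloaders" (plain lists stand in for DataLoader objects).
results = run_validation([[1, 2], [10]])
print(len(results))  # 3 validation steps across both loaders
```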
0.4.0 is the first public release after a short testing period with public users. Thanks for all the help ironing out bugs to get Lightning running on everything from notebooks to local machines to servers.
This release includes:
- Extensively tested code.
- Cleaner API to accommodate the various research use cases
- No need for an experiment object in the trainer.
- Training continuation (not just weights, but also epoch, global step, etc...)
- If the folder used by the checkpoint callback contains weights, the last weights are loaded automatically.
- training_step and validation_step no longer reduce outputs automatically. This fixes issues with reducing generated outputs, for example images or text.
- 16-bit precision can now be used with a single GPU (no DP or DDP in this case). This bypasses a compatibility issue between NVIDIA Apex and PyTorch for DP + 16-bit training.
- Extra tests for CPU models.
- Experiment object is process-safe; it only writes from process_rank=0.
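Training continuation above means restoring more than weights: the checkpoint also carries training progress. A hypothetical sketch of that idea; the key names are illustrative, not Lightning's actual checkpoint format:

```python
# Hypothetical sketch of full training continuation: the checkpoint carries
# training progress (epoch, global step) alongside the weights. Key names
# are illustrative, not Lightning's actual checkpoint format.

def save_checkpoint(model_state, epoch, global_step):
    return {"state_dict": model_state, "epoch": epoch, "global_step": global_step}

def restore_checkpoint(checkpoint):
    # Resume from the *next* epoch so training continues where it left off.
    model_state = checkpoint["state_dict"]
    start_epoch = checkpoint["epoch"] + 1
    global_step = checkpoint["global_step"]
    return model_state, start_epoch, global_step

ckpt = save_checkpoint({"w": [0.1, 0.2]}, epoch=4, global_step=1200)
state, start_epoch, step = restore_checkpoint(ckpt)
print(start_epoch, step)  # 5 1200
```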
- Simplified data loader.
- Added a decorator to do lazy loading internally:

```python
if self._tng_dataloader is None:
    self._tng_dataloader = DataLoader(...)
```
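The if-None pattern above can be factored into a decorator. A minimal sketch of such a lazy-loading decorator, under the assumption that it caches the result on the instance; this is not Lightning's actual implementation:

```python
import functools

# Minimal sketch of a lazy-loading decorator like the one described above;
# not Lightning's actual implementation. The wrapped method runs once and
# its result is cached on the instance under a leading-underscore name.
def lazy(fn):
    attr = "_" + fn.__name__

    @functools.wraps(fn)
    def wrapper(self):
        if getattr(self, attr, None) is None:
            setattr(self, attr, fn(self))
        return getattr(self, attr)

    return wrapper

class Trainer:
    calls = 0

    @lazy
    def tng_dataloader(self):
        Trainer.calls += 1
        return ["batch_1", "batch_2"]  # stands in for DataLoader(...)

t = Trainer()
t.tng_dataloader()
t.tng_dataloader()
print(Trainer.calls)  # 1: the loader is built only on first access
```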
- Code coverage (99%)
- Full tests that run multiple models in different configurations.
- Full tests for specific trainer functionality.