Aws-parallelcluster-node

Latest version: v3.9.1

2.8.0 (not secure)
-----

**ENHANCEMENTS**
- Dynamically generate the architecture-specific portion of the path to the SGE binaries directory.

2.7.0 (not secure)
-----

**ENHANCEMENTS**
- `sqswatcher`: The daemon is now compatible with VPC Endpoints so that SQS messages can be passed without traversing
the public internet.
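
A minimal sketch of what "compatible with VPC Endpoints" enables, assuming boto3 and a hypothetical endpoint URL (an illustration, not sqswatcher's actual wiring):

```python
import boto3

# Hypothetical VPC interface endpoint for SQS; the real URL comes from the
# endpoint provisioned in the cluster's VPC.
SQS_VPC_ENDPOINT = "https://vpce-0abc123-example.sqs.us-east-1.vpce.amazonaws.com"

# Pointing the client at the endpoint keeps queue traffic inside the VPC;
# the SQS API calls themselves (receive_message, delete_message, ...) are unchanged.
sqs = boto3.client("sqs", region_name="us-east-1", endpoint_url=SQS_VPC_ENDPOINT)

queue_url = sqs.get_queue_url(QueueName="parallelcluster-example-queue")["QueueUrl"]
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
```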

2.6.1 (not secure)
-----

**ENHANCEMENTS**
- Improved the management of SQS messages and retries to speed up recovery times when failures occur.

**CHANGES**
- Do not launch a replacement for an unhealthy or unresponsive node until that node has been terminated. This makes the cluster slower at provisioning new nodes when failures occur, but it prevents any temporary over-scaling with respect to the expected capacity.
- Increase parallelism when starting `slurmd` on compute nodes that join the cluster from 10 to 30.
- Reduce the verbosity of messages logged by the node daemons.
- Do not dump logs to `/home/logs` when nodewatcher encounters a failure and terminates the node. CloudWatch can be
used to debug such failures.
- Reduce the number of retries for failed REMOVE events in sqswatcher.

**BUG FIXES**
- Fix a bug in the ordering and retrying of SQS messages that, under heavy load, could leave the scheduler configuration in an inconsistent state.
- Delete from the queue any REMOVE events that are discarded due to a hostname collision with another event fetched as part of the same `sqswatcher` iteration.
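
A simplified illustration of the collision handling described above, using hypothetical event dictionaries rather than sqswatcher's real message format:

```python
from collections import defaultdict

def split_colliding_events(events):
    """Keep one event per hostname from a single polling iteration.

    `events` is a list of hypothetical dicts such as
    {"action": "ADD" | "REMOVE", "hostname": "...", "receipt_handle": "..."}.
    Returns (to_process, to_delete): REMOVE events discarded because of a
    hostname collision are still returned for deletion, so they are removed
    from the queue instead of being fetched and discarded again later.
    """
    by_host = defaultdict(list)
    for event in events:
        by_host[event["hostname"]].append(event)

    to_process, to_delete = [], []
    for host, host_events in by_host.items():
        to_process.append(host_events[0])  # first event for the host wins
        for discarded in host_events[1:]:
            if discarded["action"] == "REMOVE":
                to_delete.append(discarded["receipt_handle"])
    return to_process, to_delete
```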

2.6.0 (not secure)
-----

**CHANGES**
- Remove the logic that was adding compute node identities to the known_hosts file for all OSs except CentOS 6.

**BUG FIXES**
- Fix Torque issue that was limiting the max number of running jobs to the max size of the cluster.

2.5.1 (not secure)
-----

**BUG FIXES**
- Fix a bug in sqswatcher that caused the daemon to crash when more than 100 DynamoDB tables were present in the cluster region.

2.5.0 (not secure)
-----

**ENHANCEMENTS**
- Slurm:
  - Add support for scheduling with GPU options. The following GPU-related options are currently supported: `-G/--gpus, --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu`.
  - Add `gres.conf` and `slurm_parallelcluster_gres.conf` in order to enable GPU options. `slurm_parallelcluster_gres.conf` is automatically generated by the node daemon and contains GPU information gathered from the compute instances. If you need to specify additional GRES options manually, modify `gres.conf` and avoid changing `slurm_parallelcluster_gres.conf` when possible.
  - Integrate GPU requirements into the scaling logic: the cluster scales automatically to satisfy the GPU/CPU requirements of pending jobs. When submitting GPU jobs, CPU/node/task information is not required but is preferred in order to avoid ambiguity. If only GPU requirements are specified, the cluster scales up to the minimum number of nodes required to satisfy all GPU requirements.
  - Slurm daemons now keep running when the cluster is stopped, for better stability. However, submitting jobs while the cluster is stopped is not recommended.
- Change the jobwatcher logic to consider both GPU and CPU requirements when making scaling decisions for Slurm jobs. In general, the cluster scales up to the minimum number of nodes needed to satisfy all GPU/CPU requirements (see the sketch after this list).
- Reduce the number of calls to the ASG in nodewatcher to avoid throttling, especially at cluster scale-down.
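
A simplified sketch of this GPU/CPU-aware sizing decision; the function, job representation, and per-node capacities below are hypothetical, and the real jobwatcher logic is more involved:

```python
import math

def nodes_needed(pending_jobs, gpus_per_node, vcpus_per_node):
    """Minimum number of compute nodes satisfying both the total GPU and the
    total CPU requirements of the pending jobs.

    `pending_jobs` is a list of hypothetical (required_gpus, required_cpus) tuples.
    """
    total_gpus = sum(gpus for gpus, _ in pending_jobs)
    total_cpus = sum(cpus for _, cpus in pending_jobs)

    nodes_for_gpus = math.ceil(total_gpus / gpus_per_node) if gpus_per_node else 0
    nodes_for_cpus = math.ceil(total_cpus / vcpus_per_node)

    # Both constraints have to hold, so the larger of the two wins.
    return max(nodes_for_gpus, nodes_for_cpus)

# Example: three pending jobs on instances with 4 GPUs and 32 vCPUs each.
print(nodes_needed([(8, 4), (2, 16), (0, 64)], gpus_per_node=4, vcpus_per_node=32))  # -> 3
```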

**CHANGES**
- Increase the maximum number of SQS messages that sqswatcher can process in a single batch from 50 to 200. This improves scaling time, especially with increased ASG launch rates (see the sketch after this list).
- Increase the faulty-node termination timeout from 1 minute to 5 minutes in order to give the scheduler some additional time to recover when under heavy load.
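
Since a single SQS `receive_message` call returns at most 10 messages, assembling a larger batch means polling in a loop. A rough sketch of that idea (not sqswatcher's actual code; names and parameters are illustrative):

```python
import boto3

MAX_BATCH_SIZE = 200  # matches the new batch limit mentioned above

def collect_batch(sqs, queue_url, max_batch_size=MAX_BATCH_SIZE):
    """Poll the queue until `max_batch_size` messages have been collected
    or the queue is (momentarily) empty, then return them for processing."""
    batch = []
    while len(batch) < max_batch_size:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=min(10, max_batch_size - len(batch)),
            WaitTimeSeconds=2,
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing left to fetch right now
        batch.extend(messages)
    return batch

# Usage (queue_url is a placeholder for the queue created with the cluster):
# sqs = boto3.client("sqs")
# batch = collect_batch(sqs, queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/example")
```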

**BUG FIXES**
- Fix jobwatcher behaviour that was marking nodes locked by nodewatcher as busy even when they had already been removed from the ASG desired count. In rare circumstances this was causing the cluster to over-scale.
- Improve the handling of errors that occur when adding/removing nodes from the scheduler config.
- Fix a bug that was causing failures in sqswatcher when ADD and REMOVE events for the same host were fetched together.
