Aws-parallelcluster

Latest version: v3.9.1


Page 8 of 15

2.5.1

Not secure

-----

**ENHANCEMENTS**

- Add ``--show-url`` flag to ``pcluster dcv connect`` command in order to generate a one-time URL that can be used to
start a DCV session. This unblocks the usage of DCV when the browser cannot be launched automatically.

**CHANGES**

- Upgrade NVIDIA driver to Tesla version 440.33.01.
- Upgrade CUDA library to version 10.2.
- Using a Placement Group is not required anymore but highly recommended when enabling EFA.
- Increase default root volume size in Centos 6 AMI to 25GB.
- Increase the retention of CloudWatch logs produced when generating AWS Batch Docker images from 1 to 14 days.
- Increase the total time allowed to build Docker images from 20 minutes to 30 minutes. This is done to better deal
with slow networking in China regions.
- Upgrade EFA installer to version 1.7.1:
  - Kernel module: ``efa-1.4.1``
  - RDMA core: ``rdma-core-25.0``
  - Libfabric: ``libfabric-aws-1.8.1amzn1.3``
  - Open MPI: ``openmpi40-aws-4.0.2``

**BUG FIXES**

- Fix installation of NVIDIA drivers on Ubuntu 18.
- Fix installation of CUDA toolkit on Centos 6.
- Fix invalid default value for ``spot_price``.
- Fix issue that was preventing the cluster from being created in VPCs configured with multiple CIDR blocks.
- Correctly handle failures when retrieving ASG in ``pcluster instances`` command.
- Fix the default mount dir when a single EBS volume is specified through a dedicated ebs configuration section.
- Correctly handle failures when there is an invalid parameter in the ``aws`` config section.
- Fix a bug in ``pcluster delete`` that was causing the cli to exit with error when the cluster is successfully deleted.
- Exit with status code 1 if ``pcluster create`` fails to create a stack.
- Better handle the case of multiple or no network interfaces on FSX filesystems.
- Fix ``pcluster configure`` to retain default values from old config file.
- Fix bug in sqswatcher that was causing the daemon to fail when more than 100 DynamoDB tables are present in the
cluster region.
- Fix installation of Munge on Amazon Linux, Centos 6, Centos 7 and Ubuntu 16.
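
The sqswatcher fix above concerns regions with more than 100 DynamoDB tables. The DynamoDB ``ListTables`` call returns at most 100 names per response and signals further pages through ``LastEvaluatedTableName``. The following is a minimal sketch of following that cursor; it is not the sqswatcher code. ``FakeDynamoDB`` and ``list_all_tables`` are hypothetical names, and the fake client only mimics the response shape so the example runs without AWS credentials.

```python
class FakeDynamoDB:
    """Stand-in for a DynamoDB client (hypothetical, for illustration only)."""

    def __init__(self, names, page_size=100):
        self._names = sorted(names)
        self._page_size = page_size

    def list_tables(self, ExclusiveStartTableName=None):
        # Resume right after the table name given as the cursor, if any.
        start = 0
        if ExclusiveStartTableName is not None:
            start = self._names.index(ExclusiveStartTableName) + 1
        page = self._names[start:start + self._page_size]
        response = {"TableNames": page}
        # Only advertise a cursor when more names remain after this page.
        if start + self._page_size < len(self._names):
            response["LastEvaluatedTableName"] = page[-1]
        return response


def list_all_tables(client):
    """Collect every table name by following the pagination cursor."""
    names = []
    kwargs = {}
    while True:
        response = client.list_tables(**kwargs)
        names.extend(response["TableNames"])
        last = response.get("LastEvaluatedTableName")
        if last is None:
            return names
        kwargs = {"ExclusiveStartTableName": last}
```

Code that reads only the first response sees at most 100 tables, which is the kind of truncation a loop like this avoids.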

2.5.0

Not secure

-----

**ENHANCEMENTS**

- Add support for new OS: Ubuntu 18.04
- Add support for AWS Batch scheduler in China partition and in ``eu-north-1``.
- Revamped ``pcluster configure`` command which now supports automated networking configuration.
- Add support for NICE DCV on Centos 7 to set up a graphical remote desktop session on the Master node.
- Add support for new EFA supported instances: ``c5n.metal``, ``m5dn.24xlarge``, ``m5n.24xlarge``, ``r5dn.24xlarge``,
``r5n.24xlarge``
- Add support for scheduling with GPU options in Slurm. Currently supports the following GPU-related options:
``-G/--gpus``, ``--gpus-per-task``, ``--gpus-per-node``, ``--gres=gpu``, ``--cpus-per-gpu``.
GPU requirements are now integrated into the scaling logic: the cluster scales automatically to satisfy GPU/CPU
requirements for pending jobs. When submitting GPU jobs, CPU/node/task information is not required but preferred in
order to avoid ambiguity. If only GPU requirements are specified, the cluster will scale up to the minimum number of
nodes required to satisfy all GPU requirements.
- Add new cluster configuration option to automatically disable Hyperthreading (``disable_hyperthreading = true``)
- Install Intel Parallel Studio 2019.5 Runtime in Centos 7 when ``enable_intel_hpc_platform = true`` and share /opt/intel over NFS
- Additional EC2 IAM Policies can now be added to the role ParallelCluster automatically creates for cluster nodes by
simply specifying ``additional_iam_policies`` in the cluster config.
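
The GPU scaling rule in the Slurm entry above (scale to the minimum number of nodes when only GPU requirements are given) amounts to a ceiling division for a single job. The sketch below illustrates that arithmetic only; ``min_nodes_for_gpus`` is a hypothetical helper, and it ignores CPU constraints and the packing of several jobs onto one node, which the real scaling daemons may handle differently.

```python
import math


def min_nodes_for_gpus(requested_gpus, gpus_per_node):
    """Smallest node count whose combined GPUs cover the request."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if requested_gpus < 0:
        raise ValueError("requested_gpus must not be negative")
    return math.ceil(requested_gpus / gpus_per_node)
```

For example, a job asking for 10 GPUs on instances with 4 GPUs each would need ``min_nodes_for_gpus(10, 4)`` nodes, which is 3.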

**CHANGES**

- Ubuntu 14.04 is no longer supported
- Upgrade Intel MPI to version U5.
- Upgrade EFA Installer to version 1.7.0, which also upgrades Open MPI to 4.0.2.
- Upgrade NVIDIA driver to Tesla version 418.87.
- Upgrade CUDA library to version 10.1.
- Upgrade Slurm to version 19.05.3-2.
- Install EFA in China AMIs.
- Increase default EBS volume size from 17GB to 25GB
- FSx Lustre now supports new storage_capacity options 1,200 and 2,400 GiB
- Enable ``flock user_xattr noatime`` Lustre mount options by default everywhere and
``x-systemd.automount x-systemd.requires=lnet.service`` for systemd based systems.
- Increase the number of hosts that can be processed by scaling daemons in a single batch from 50 to 200. This
improves the scaling time especially with increased ASG launch rates.
- Change default sshd config in order to disable X11 forwarding and update the list of supported ciphers.
- Increase faulty node termination timeout from 1 minute to 5 in order to give some additional time to the scheduler
to recover when under heavy load.
- Extended ``pcluster createami`` command to specify the VPC and network settings when building the AMI.
- Support inline comments in config file
- Support Python 3.8 in pcluster CLI.
- Deprecate Python 2.6 support
- Add ``ClusterName`` tag to EC2 instances.
- Search for new available version only at ``pcluster create`` action.
- Enable ``sanity_check`` by default.
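
As a rough illustration of the options named in this release, the sketch below parses a cluster section that uses ``disable_hyperthreading``, ``enable_intel_hpc_platform`` and ``additional_iam_policies`` together with an inline comment. Several things here are assumptions rather than facts from these notes: the ``[cluster default]`` section name, ``#`` and ``;`` as the inline comment prefixes, the ``base_os`` value, and the example policy ARN. Python's standard ``configparser`` is used only to show how an inline comment is stripped; it is not claimed to match the CLI's own parser settings.

```python
import configparser

# Hypothetical config fragment; option names come from the 2.5.0 notes.
CONFIG_TEXT = """
[cluster default]
base_os = centos7
disable_hyperthreading = true   # inline comment, ignored by the parser
enable_intel_hpc_platform = true
additional_iam_policies = arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
"""

# Inline comments are off by default in configparser and must be enabled.
parser = configparser.ConfigParser(inline_comment_prefixes=("#", ";"))
parser.read_string(CONFIG_TEXT)
cluster = parser["cluster default"]

hyperthreading_disabled = cluster.getboolean("disable_hyperthreading")
extra_policies = [p.strip() for p in cluster["additional_iam_policies"].split(",")]
```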

**BUG FIXES**

- Fix sanity check for custom ec2 role. Fixes [1241](https://github.com/aws/aws-parallelcluster/issues/1241).
- Fix bug when using same subnet for both master and compute.
- Fix bug when ganglia is enabled ganglia urls are shown. Fixes [1322](https://github.com/aws/aws-parallelcluster/issues/1322).
- Fix bug with ``awsbatch`` scheduler that prevented Multi-node jobs from running.
- Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had been removed
already from the ASG Desired count. This was causing, in rare circumstances, a cluster overscaling.
- Fix bug that was causing failures in sqswatcher when ADD and REMOVE event for the same host are fetched together.
- Fix bug that was preventing nodes from mounting partitioned EBS volumes.
- Implement paginated calls in ``pcluster list``.
- Fix bug when creating ``awsbatch`` cluster with name longer than 31 chars
- Fix a bug that led to ssh not working after ssh'ing into a compute node by IP address.

2.4.1

Not secure

-----

**ENHANCEMENTS**

- Add support for ap-east-1 region (Hong Kong)
- Add possibility to specify instance type to use when building custom AMIs with ``pcluster createami``
- Speed up cluster creation by having compute nodes start together with the master node. **Note** this requires one new IAM permission in the [ParallelClusterInstancePolicy](https://docs.aws.amazon.com/en_us/parallelcluster/latest/ug/iam.html#parallelclusterinstancepolicy), ``cloudformation:DescribeStackResource``
- Enable ASG CloudWatch metrics for the ASG managing compute nodes. **Note** this requires two new IAM permissions in the [ParallelClusterUserPolicy](https://docs.aws.amazon.com/parallelcluster/latest/ug/iam.html#parallelclusteruserpolicy), ``autoscaling:DisableMetricsCollection`` and ``autoscaling:EnableMetricsCollection``
- Install Intel MPI 2019u4 on Amazon Linux, Centos 7 and Ubuntu 1604
- Upgrade Elastic Fabric Adapter (EFA) to version 1.4.1 that supports Intel MPI
- Run all node daemons and cookbook recipes in isolated Python virtualenvs. This allows our code to always run with the
required Python dependencies and solves all conflicts and runtime failures that were being caused by user packages
installed in the system Python
- Torque:
  - Process nodes added to or removed from the cluster in batches in order to speed up cluster scaling
  - Scale up only if required CPU/nodes can be satisfied
  - Scale down if pending jobs have unsatisfiable CPU/nodes requirements
  - Add support for jobs in hold/suspended state (this includes job dependencies)
  - Automatically terminate and replace faulty or unresponsive compute nodes
  - Add retries in case of failures when adding or removing nodes
  - Add support for ncpus reservation and multi nodes resource allocation (e.g. -l nodes=2:ppn=3+3:ppn=6)
  - Optimize Torque global configuration to react faster to dynamic cluster scaling
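
To make the multi-node resource request in the Torque list above concrete, the sketch below parses a spec such as ``2:ppn=3+3:ppn=6`` (the value passed to ``-l nodes=``) into node and CPU totals. ``parse_nodes_spec`` and ``total_resources`` are hypothetical helpers, not ParallelCluster code, and they handle only numeric node counts with an optional ``ppn`` field; host names and node properties are out of scope.

```python
def parse_nodes_spec(spec):
    """Split a Torque nodes spec into (node_count, ppn) pairs."""
    chunks = []
    for chunk in spec.split("+"):
        fields = chunk.split(":")
        count = int(fields[0])
        ppn = 1  # Torque assumes one processor per node when ppn is omitted
        for field in fields[1:]:
            if field.startswith("ppn="):
                ppn = int(field[len("ppn="):])
        chunks.append((count, ppn))
    return chunks


def total_resources(spec):
    """Return (total_nodes, total_cpus) requested by the spec."""
    chunks = parse_nodes_spec(spec)
    nodes = sum(count for count, _ in chunks)
    cpus = sum(count * ppn for count, ppn in chunks)
    return nodes, cpus
```

For the example in the notes, ``total_resources("2:ppn=3+3:ppn=6")`` gives 5 nodes and 24 CPUs (2 x 3 plus 3 x 6).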

**CHANGES**

- Update EFA installer to a new version. Note that this changes the location of ``mpicc`` and ``mpirun``.
To avoid breaking existing code, we recommend loading the modulefile with ``module load openmpi`` and using
``which mpicc`` for anything that requires the full path
- Eliminate Launch Configuration and use Launch Templates in all the regions
- Torque: upgrade to version 6.1.2
- Run all ParallelCluster daemons with Python 3.6 in a virtualenv. Daemons code now supports Python >= 3.5
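
Because the EFA installer update moves ``mpicc`` and ``mpirun``, scripts that hard-code their paths can break. One possible approach, sketched here, is to resolve the tool on ``PATH`` at run time, which is what ``which mpicc`` does in a shell after ``module load openmpi``. ``find_tool`` is a hypothetical helper; it assumes the module has already been loaded in the environment that launches the script.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or raise."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            "%s not found on PATH; has the openmpi module been loaded?" % name
        )
    return path
```

A build script could then call ``find_tool("mpicc")`` instead of embedding an installation directory.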

**BUG FIXES**

- Fix issue with sanity check at creation time that was preventing clusters from being created in private subnets
- Fix pcluster configure when relative config path is used
- Make FSx Substack depend on ComputeSecurityGroupIngress to keep FSx from trying to create prior to the SG
allowing traffic within itself
- Restore correct value for ``filehandle_limit`` that was getting reset when setting ``memory_limit`` for EFA
- Torque: fix compute nodes locking mechanism to prevent job scheduling on nodes being terminated
- Restore logic that was automatically adding compute nodes identity to SSH ``known_hosts`` file
- Slurm: fix issue that was causing the ParallelCluster daemons to fail when the cluster is stopped and an empty compute nodes file
is imported in Slurm config

2.4.0

Not secure

-----

**ENHANCEMENTS**

- Add support for EFA on Centos 7, Amazon Linux and Ubuntu 1604
- Add support for Ubuntu in China region ``cn-northwest-1``
- SGE:
  - process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
  - scale up only if required slots/nodes can be satisfied
  - scale down if pending jobs have unsatisfiable CPU/nodes requirements
  - add support for jobs in hold/suspended state (this includes job dependencies)
  - automatically terminate and replace faulty or unresponsive compute nodes
  - add retries in case of failures when adding or removing nodes
  - configure scheduler to handle rescheduling and cancellation of jobs running on failing or terminated nodes
- Slurm:
  - scale up only if required slots/nodes can be satisfied
  - scale down if pending jobs have unsatisfiable CPU/nodes requirements
  - automatically terminate and replace faulty or unresponsive compute nodes
  - decrease SlurmdTimeout to 120 seconds to speed up replacement of faulty nodes
- Automatically replace compute instances that fail initialization and dump logs to shared home directory.
- Dynamically fetch compute instance type and cluster size in order to support updates in scaling daemons
- Always use full master FQDN when mounting NFS on compute nodes. This solves some issues occurring with some networking
setups and custom DNS configurations
- List the version and status during ``pcluster list``
- Remove double quoting of the post_install args
- ``awsbsub``: use override option to set the number of nodes rather than creating multiple JobDefinitions
- Add support for AWS_PCLUSTER_CONFIG_FILE env variable to specify pcluster config file
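
The ``AWS_PCLUSTER_CONFIG_FILE`` variable mentioned above lets the config file location come from the environment. The sketch below shows one plausible lookup order; the precedence (explicit argument, then environment variable, then default) and the default path ``~/.parallelcluster/config`` are assumptions for illustration, and ``resolve_config_path`` is a hypothetical helper rather than the CLI's own function.

```python
import os

# Assumed default location of the config file.
DEFAULT_CONFIG = os.path.expanduser("~/.parallelcluster/config")


def resolve_config_path(cli_arg=None, environ=None):
    """Pick the config path: explicit argument, then env var, then default."""
    environ = os.environ if environ is None else environ
    if cli_arg:
        return cli_arg
    return environ.get("AWS_PCLUSTER_CONFIG_FILE") or DEFAULT_CONFIG
```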

**CHANGES**

- Update openmpi library to version 3.1.4 on Centos 7, Amazon Linux and Ubuntu 1604. This also changes the default
openmpi path to ``/opt/amazon/efa/bin/`` and the openmpi module name to ``openmpi/3.1.4``
- Set soft and hard ulimit on open files to 10000 for all supported OSs
- For a better security posture, we're removing AWS credentials from the ``parallelcluster`` config file.
Credentials can now be set up following the canonical procedure used for the aws cli
- When using FSx or EFS do not enforce in sanity check that the compute security group is open to 0.0.0.0/0
- When updating an existing cluster, the same template version is now used, no matter the pcluster cli version
- SQS messages that fail to be processed in ``sqswatcher`` are now re-queued only 3 times and not forever
- Reset ``nodewatcher`` idletime to 0 when the host becomes essential for the cluster (because of min size of ASG or
because there are pending jobs in the scheduler queue)
- SGE: a node is considered as busy when in one of the following states "u", "C", "s", "d", "D", "E", "P", "o".
This allows a quick replacement of the node without waiting for the ``nodewatcher`` to terminate it.
- Do not update DynamoDB table on cluster updates in order to avoid hitting strict API limits (1 update per day).
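
The ``sqswatcher`` change above replaces unbounded re-queuing with a cap of 3. The following sketch shows the general idea of a bounded re-queue using an in-memory queue; it is not the daemon's implementation. ``process_queue``, ``MAX_REQUEUES`` and the handler are hypothetical, and a real SQS consumer would track attempts differently (for instance through message attributes).

```python
from collections import deque

MAX_REQUEUES = 3  # cap taken from the release note above


def process_queue(messages, handler, max_requeues=MAX_REQUEUES):
    """Run handler on each message, re-queuing failures a bounded number of times.

    Returns the messages that were dropped after exhausting their re-queues.
    """
    queue = deque((message, 0) for message in messages)
    dropped = []
    while queue:
        message, requeues = queue.popleft()
        try:
            handler(message)
        except Exception:
            if requeues < max_requeues:
                # Put the message back at the end of the queue for another try.
                queue.append((message, requeues + 1))
            else:
                dropped.append(message)
    return dropped
```

With this cap, a message that always fails is attempted four times in total (the first attempt plus three re-queues) and then dropped, so it can no longer circulate forever.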

**BUG FIXES**

- Fix issue that was preventing Torque from being used on Centos 7
- Start node daemons at the end of instance initialization. The time spent for post-install script and node
initialization is not counted as part of node idletime anymore.
- Fix issue which was causing an additional and invalid EBS mount point to be added in case of multiple EBS
- Install Slurm libpmi/libpmi2 that is distributed in a separate package since Slurm 17
- ``pcluster ssh`` command now works for clusters with ``use_public_ips = false``
- Slurm: add "BeginTime", "NodeDown", "Priority" and "ReqNodeNotAvail" to the pending reasons that trigger
a cluster scaling
- Add a timeout on remote commands execution so that the daemons are not stuck if the compute node is unresponsive
- Fix an edge case that was causing the ``nodewatcher`` to hang forever in case the node had become essential to the
cluster during a call to ``self_terminate``.
- Fix ``pcluster start/stop`` commands when used with an ``awsbatch`` cluster

2.3.1

Not secure

-----

**ENHANCEMENTS**

- Add support for FSx Lustre with Amazon Linux. In case of a custom AMI,
the kernel will need to be ``>= 4.14.104-78.84.amzn1.x86_64``
- Slurm:
  - set compute nodes to DRAIN state before removing them from cluster. This prevents the scheduler from submitting a job to a node that is being terminated.
  - dynamically adjust max cluster size based on ASG settings
  - dynamically change the number of configured FUTURE nodes based on the actual nodes that join the cluster. The max size of the cluster seen by the scheduler always matches the max capacity of the ASG.
  - process nodes added to or removed from the cluster in batches. This speeds up cluster scaling which is able to react with a delay of less than 1 minute to variations in the ASG capacity.
  - add support for job dependencies and pending reasons. The cluster won't scale up if the job cannot start due to an unsatisfied dependency.
  - set ``ReturnToService=1`` in scheduler config in order to recover instances that were initially marked as down due to a transient issue.
- Validate FSx parameters. Fixes [896](https://github.com/aws/aws-parallelcluster/issues/896).

**CHANGES**

- Slurm - Upgrade version to 18.08.6.2
- NVIDIA - update drivers to version 418.56
- CUDA - update toolkit to version 10.0
- Increase default EBS volume size from 15GB to 17GB
- Disabled updates to FSx File Systems, because updates to most parameters would cause the filesystem, and all its data, to be deleted

**BUG FIXES**

- Cookbook wasn't fetched when ``custom_ami`` parameter specified in the config
- Cfn-init is now fetched from us-east-1. This bug affected non-alinux custom AMIs in regions other than us-east-1.
- Account limit check not done for SPOT or AWS Batch Clusters
- Account limit check fall back to master subnet. Fixes [910](https://github.com/aws/aws-parallelcluster/issues/910).
- Boto3 upperbound removed

2.2.1

Not secure

-----

**ENHANCEMENTS**

- Add support for FSx Lustre in Centos 7. In case of custom AMI, FSx Lustre is
only supported with Centos 7.5 and Centos 7.6.
- Check AWS EC2 instance account limits before starting cluster creation
- Allow users to force job deletion with ``SGE`` scheduler

**CHANGES**

- Set default value to ``compute`` for ``placement_group`` option
- ``pcluster ssh``: use private IP when the public one is not available
- ``pcluster ssh``: now works also when stack is not completed as long as the master IP is available
- Remove unused dependency on ``awscli`` from ParallelCluster package

**BUG FIXES**

- ``awsbsub``: fix file upload with absolute path
- ``pcluster ssh``: fix issue that was preventing the command from working correctly when stack status is
``UPDATE_ROLLBACK_COMPLETE``
- Fix block device conversion to correctly attach EBS nvme volumes
- Wait for Torque scheduler initialization before completing master node setup
- ``pcluster version``: now works also when no ParallelCluster config is present
- Improve ``nodewatcher`` daemon logic to detect if a SGE compute node has running jobs

**DOCS**

- Add documentation on how to use FSx Lustre
- Add tutorial for encrypted EBS with a Custom KMS Key
- Add ``ebs_kms_key_id`` to Configuration section

**TESTING**

- Define a new framework to write and run ParallelCluster integration tests
- Improve scaling integration tests to detect over-scaling
- Implement integration tests for awsbatch scheduler
- Implement integration tests for FSx Lustre file system
