Why?

Turing's hardware, management software, networking, and storage are being upgraded.

Impact

The Turing research cluster will be unavailable during the upgrade. Running jobs must be stopped.

Benefits

Users will have better, faster, more reliable hardware for research.

Action Needed

SLURM will not schedule any job that cannot complete its allocated time before the deadline (9AM). If your jobs are not being scheduled and they do not need their entire allocated time, reducing the time limit should allow them to run.
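As a rough illustration of the time-limit arithmetic, the sketch below computes the longest limit that still ends before the deadline and formats it as days-hours:minutes:seconds, one of the forms accepted by SLURM's --time option. The function name, the five-minute safety margin, and the example dates are assumptions made for illustration; only the 9AM deadline comes from this notice.

```python
from datetime import datetime, timedelta
from typing import Optional


def max_time_limit(now: datetime, deadline: datetime,
                   margin: timedelta = timedelta(minutes=5)) -> Optional[str]:
    """Return the longest time limit (D-HH:MM:SS) that ends before the
    deadline, or None if no usable time is left.

    The margin is an assumed safety buffer, not a SLURM requirement.
    """
    total = int((deadline - now - margin).total_seconds())
    if total <= 0:
        return None
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical example: it is 8PM the evening before a 9AM deadline.
limit = max_time_limit(datetime(2024, 1, 1, 20, 0), datetime(2024, 1, 2, 9, 0))
print(limit)
```

The resulting string could then be passed when submitting, for example as sbatch --time=<limit> followed by your job script.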

Details

Some major improvements to Turing:

  • New all-flash storage for home and scratch directories located inside the cluster network.
  • New 100 gigabit backend network providing access to storage, as well as providing an RDMA fabric for MPI between compute nodes. Along with the new storage, this has shown as much as 25x better performance in synthetic IO workloads.
  • Major software upgrade of Bright Cluster Manager from 8.1 to 9.2. Along with this, the base operating system is changing from RHEL 7.6 to Ubuntu 20.04, which is newer and should be familiar to more people.
  • New hardware to support a virtual head node and login nodes. If someone overloads a login node, it will no longer incapacitate the whole system for all users. Additionally, most hardware maintenance can be performed by IT without causing interruption or downtime.

Impact of the architectural changes:

  • Because of the new operating system version, new modules have been built. Please report any missing modules to IT. For Python, IT will automatically repair all root-level venvs to use an equivalent new Python module.
  • Turing home directories are no longer shared with Ace.
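If you want to check one of your own venvs after the upgrade, the sketch below shows one possible approach. It reads the "home" entry that Python's venv module writes to pyvenv.cfg and reports whether that base interpreter directory still exists. The function names are hypothetical, and this is not the procedure IT uses for the automatic repair.

```python
from pathlib import Path
from typing import Optional


def venv_home(venv_dir: str) -> Optional[str]:
    """Return the 'home' entry from a venv's pyvenv.cfg, or None if absent."""
    cfg = Path(venv_dir) / "pyvenv.cfg"
    if not cfg.is_file():
        return None
    for line in cfg.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "home":
            return value.strip()
    return None


def venv_needs_repair(venv_dir: str) -> bool:
    """True if the venv records a base interpreter directory that is missing."""
    home = venv_home(venv_dir)
    return home is not None and not Path(home).is_dir()
```

A venv for which venv_needs_repair returns True was probably built against an interpreter from the old operating system and may need to be repaired or recreated with one of the new Python modules.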

Additional changes/notes:

  • Renamed /work to /scratch to better reflect its intended purpose: a non-backed-up temporary/shared scratch space. There is a symbolic link from /work to /scratch so existing references to /work should still function correctly.
  • Most hostnames will be in the new .turing.wpi.edu subdomain.
  • Host keys were carried over from the previous Turing configuration, so SSH should not be impacted. If any SSH key errors occur, please report them to IT.
  • With the completely new install, job numbers are resetting, and will start counting from 1 again.
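Regarding the /work to /scratch rename noted above: the symbolic link keeps old paths working, but if you prefer to update paths in your own scripts, a minimal sketch of the rewrite might look like the following. The function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def to_scratch(path: str) -> str:
    """Rewrite a path under /work to the equivalent path under /scratch.

    Paths that are not under /work are returned unchanged.
    """
    try:
        rel = PurePosixPath(path).relative_to("/work")
    except ValueError:
        return path  # not under /work (e.g. /workspace/...), leave as-is
    return str(PurePosixPath("/scratch") / rel)


print(to_scratch("/work/alice/results.csv"))
```

Matching on whole path components rather than a string prefix avoids accidentally rewriting unrelated paths such as /workspace.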