If you are an existing user of either Pacman or Fish, it is important that you know what to expect when you begin using Chinook. Below is a comparison of some characteristics across the different HPC clusters currently operated by RCS:
|Characteristic||Chinook||Fish||Pacman|
|Operating System||CentOS 6 (CentOS 7 upgrade planned)||Cray Linux Environment (CLE) 4||RHEL 6|
|Workload manager||Slurm||PBS/Torque (Cray)||PBS/Torque|
|Usernames||UA usernames||Legacy ARSC usernames||Legacy ARSC usernames|
|Login nodes||2 (with more coming)||2||12 + 1 high memory|
|Compute nodes||Intel Xeon, 24/28 cores per node||AMD Istanbul/Interlagos, 12/16 cores per node, nVidia Tesla GPUs||AMD Opteron, 16/32 cores per node|
|Interconnect||QLogic QDR InfiniBand (EDR upgrade planned)||Cray Gemini||QLogic QDR and Voltaire SDR InfiniBand|
|Default compiler suite||Intel||PGI||PGI|
|$HOME||Yes (only available on cluster)||Yes (only available on cluster)||Yes|
|/usr/local/unsupported||Yes (only available on cluster)||Yes (only available on cluster)||Yes|
All software previously compiled on Pacman or Fish will need to be recompiled for Chinook, because Chinook's hardware and Linux kernel differ from those on Pacman and Fish.
Software stack differences
Compiler toolchain modules
On Pacman and Fish, the environment modules responsible for loading a compiler and related set of core HPC libraries follow the "PrgEnv" naming convention common to many HPC facilities. Chinook's equivalent modules are called compiler toolchains, or just toolchains. See Compiler Toolchains for more information on what is available.
At this time, the "PrgEnv" modules on Chinook are deprecated and have been replaced by toolchain modules. Depending on feedback, we may provide "PrgEnv"-style symlinks to the toolchain modules in the future.
Dependency loading behavior
The modules on Chinook now each specify and load a (mostly) complete module dependency tree. To illustrate, consider loading an Intel-compiled netCDF library. Here is what happens on Pacman:
$ module purge
$ module load netcdf/4.3.0.intel-2013_sp1
$ module list --terse
Currently Loaded Modulefiles:
netcdf/4.3.0.intel-2013_sp1
And an equivalent action on Chinook:
$ module purge
$ module load data/netCDF/4.4.1-pic-intel-2016b
$ module list --terse
Currently Loaded Modulefiles:
compiler/GCCcore/5.4.0
tools/binutils/2.26-GCCcore-5.4.0
compiler/icc/2016.3.210-GCC-5.4.0-2.26
compiler/ifort/2016.3.210-GCC-5.4.0-2.26
openmpi/intel/1.10.2
toolchain/pic-iompi/2016b
numlib/imkl/11.3.3.210-pic-iompi-2016b
toolchain/pic-intel/2016b
lib/zlib/1.2.8-pic-intel-2016b
tools/Szip/2.1-pic-intel-2016b
data/HDF5/1.8.17-pic-intel-2016b
tools/cURL/7.49.1-pic-intel-2016b
data/netCDF/4.4.1-pic-intel-2016b
On Pacman and Fish, you get exactly the module you requested and no more (with a few exceptions). This has advantages and disadvantages:
- advantage: It is easy to experiment with different software builds by swapping library modules
- disadvantage: It is not immediately obvious which libraries were used during any given software build when multiple versions of those libraries exist
- disadvantage: It is trivial to introduce a fatal error in an application by inadvertently loading an incompatible library module or omitting a needed one
On Chinook, standardizing and loading all module dependencies results in consistency and reproducibility. When you load the Intel-compiled netCDF module on Chinook, for example, you get modules loaded for the following:
- The netCDF library
- Its immediate dependencies (HDF5, zlib, curl)
- The dependencies for the dependencies (and so on, recursively)
- The exact Intel compiler, MPI library, and Intel Math Kernel Library (MKL) used to build netCDF
- An upgraded version of GCC to supersede the ever-present system version
This takes the guesswork out of manually piecing together a software stack module by module. Every successive dependency modifies LD_LIBRARY_PATH and other variables appropriately, so that the desired application or library dynamically links to the proper supporting libraries instead of accidentally picking up an inappropriate library with a matching name.
One ramification of loading a full dependency tree is that trying to load software compiled with different compiler toolchains will likely result in module conflicts, even if the tools you are trying to load provide only binaries and nothing else. This is because combining two or more dependency trees will likely load two different builds of a core compiler runtime or library, leading to unintended and harmful dynamic linking. The dynamic linker searches the directories in LD_LIBRARY_PATH in order, and the first matching library it finds is used to satisfy every dependency on that library. This causes no problems for the software packages that expect that build, but can wreak havoc for packages that expect a different one.
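To make the "first match wins" behavior concrete, here is a minimal sketch that mimics only the LD_LIBRARY_PATH part of the dynamic linker's search. The function name `first_match_in_ld_path` is hypothetical and not a system tool; the real loader also consults RPATH/RUNPATH, `/etc/ld.so.cache`, and the default system directories, so use `ldd` on an actual binary to see how its libraries really resolve.

```shell
#!/usr/bin/env bash
# first_match_in_ld_path: HYPOTHETICAL helper, for illustration only.
# Mimics just the LD_LIBRARY_PATH step of the dynamic linker's search:
# directories are checked left to right, and the first one containing the
# requested library file wins. The real loader also checks RPATH/RUNPATH,
# /etc/ld.so.cache and default system paths; run `ldd <binary>` for the truth.
first_match_in_ld_path() {
  local lib=$1 dir
  local IFS=:                       # split LD_LIBRARY_PATH on colons
  for dir in $LD_LIBRARY_PATH; do
    if [[ -n $dir && -e $dir/$lib ]]; then
      printf '%s\n' "$dir/$lib"     # first match is what gets used
      return 0
    fi
  done
  return 1                          # not found via LD_LIBRARY_PATH
}
```

For example, if LD_LIBRARY_PATH were `/opt/hdf5-gcc/lib:/opt/hdf5-intel/lib` (hypothetical paths) and both directories contained `libhdf5.so.10`, the function would report the GCC build, which is the copy an Intel-built netCDF would also pick up. This is exactly the kind of mismatch that loading a single, complete dependency tree is meant to prevent.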
Slurm Translation Guide
One of the most immediately evident changes with Chinook is that it uses Slurm for job scheduling rather than PBS/Torque. The workflow for submitting jobs has not changed significantly, but the syntax and commands have. Below is an excerpt from SchedMD's "Rosetta Stone of Workload Managers" relevant to PBS/Torque.
For more information on Slurm, please see Using the Batch System.
Source: http://slurm.schedmd.com/rosetta.pdf, 28-Apr-2013
|Task||PBS/Torque||Slurm|
|Job submission||qsub [script_file]||sbatch [script_file]|
|Job deletion||qdel [job_id]||scancel [job_id]|
|Job status (by job)||qstat [job_id]||squeue [job_id]|
|Job status (by user)||qstat -u [user_name]||squeue -u [user_name]|
|Job hold||qhold [job_id]||scontrol hold [job_id]|
|Job release||qrls [job_id]||scontrol release [job_id]|
|Queue list||qstat -Q||squeue|
|Node list||pbsnodes -l||sinfo -N OR scontrol show nodes|
|Cluster status||qstat -a||sinfo|
|Job Array Index||$PBS_ARRAYID||$SLURM_ARRAY_TASK_ID|
|Queue||-q [queue]||-p [queue]|
|Node Count||-l nodes=[count]||-N [min[-max]]|
|CPU Count||-l ppn=[count] OR -l mppwidth=[PE_count]||-n [count]|
|Wall Clock Limit||-l walltime=[hh:mm:ss]||-t [min] OR -t [days-hh:mm:ss]|
|Standard Output File||-o [file_name]||-o [file_name]|
|Standard Error File||-e [file_name]||-e [file_name]|
|Combine stdout/err||-j oe (both to stdout) OR -j eo (both to stderr)||(use -o without -e)|
|Copy Environment||-V||--export=[ALL | NONE | variables]|
|Event Notification||-m abe||--mail-type=[events]|
|Email Address||-M [address]||--mail-user=[address]|
|Job Name||-N [name]||--job-name=[name]|
|Job Restart||-r [y|n]||--requeue OR --no-requeue (NOTE: configurable default)|
|Resource Sharing||-l naccesspolicy=singlejob||--exclusive OR --shared|
|Memory Size||-l mem=[MB]||--mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T]|
|Account to Charge||-W group_list=[account]||--account=[account]|
|Tasks Per Node||-l mppnppn [PEs_per_node]||--tasks-per-node=[count]|
|CPUs Per Task|| ||--cpus-per-task=[count]|
|Job Dependency||-d [job_id]||--depend=[state:job_id]|
|Job host preference|| ||--nodelist=[nodes] AND/OR --exclude=[nodes]|
|Quality of Service||-l qos=[name]||--qos=[name]|
|Job Arrays||-t [array_spec]||--array=[array_spec] (Slurm version 2.6+)|
|Generic Resources||-l other=[resource_spec]||--gres=[resource_spec]|
|Begin Time||-A "YYYY-MM-DD HH:MM:SS"||--begin=YYYY-MM-DD[THH:MM[:SS]]|
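As a starting point for porting, the sketch below applies a handful of rows from the table above to the `#PBS` lines of an existing batch script. The function `pbs2slurm_sketch` is hypothetical (it is not an RCS or SchedMD tool), handles only simple one-option-per-line directives, and leaves everything else as a `# TODO` comment for manual translation. Note in particular that `-N` means job name in PBS but node count in Slurm.

```shell
#!/usr/bin/env bash
# pbs2slurm_sketch: HYPOTHETICAL helper, not an RCS or SchedMD tool.
# Reads a batch script on stdin and rewrites a few common #PBS directives
# into #SBATCH equivalents, following the translation table above.
# Anything it does not recognize is kept as a "# TODO" comment to port by hand.
pbs2slurm_sketch() {
  local line opts
  local pbs_re='^#PBS[[:space:]]+(.*)$'
  local nodes_re='^-l nodes=([0-9]+)(:ppn=([0-9]+))?$'
  while IFS= read -r line || [[ -n $line ]]; do
    if ! [[ $line =~ $pbs_re ]]; then
      printf '%s\n' "$line"          # shell commands and non-PBS lines pass through
      continue
    fi
    opts=${BASH_REMATCH[1]}
    case $opts in
      "-l "*,*)        echo "# TODO translate by hand: $line" ;;  # several resources on one -l
      "-N "*)          echo "#SBATCH --job-name=${opts#-N }" ;;   # Slurm -N is node count!
      "-q "*)          echo "#SBATCH --partition=${opts#-q }" ;;
      "-o "*)          echo "#SBATCH --output=${opts#-o }" ;;
      "-e "*)          echo "#SBATCH --error=${opts#-e }" ;;
      "-M "*)          echo "#SBATCH --mail-user=${opts#-M }" ;;
      "-j oe")         : ;;  # Slurm combines stdout/stderr when --error is omitted
      "-l walltime="*) echo "#SBATCH --time=${opts#-l walltime=}" ;;
      "-l nodes="*)
        if [[ $opts =~ $nodes_re ]]; then
          echo "#SBATCH --nodes=${BASH_REMATCH[1]}"
          if [[ -n ${BASH_REMATCH[3]} ]]; then
            echo "#SBATCH --ntasks-per-node=${BASH_REMATCH[3]}"  # ppn -> tasks per node
          fi
        else
          echo "# TODO translate by hand: $line"
        fi ;;
      *)               echo "# TODO translate by hand: $line" ;;
    esac
  done
}
```

A typical use would be `pbs2slurm_sketch < job.pbs > job.slurm`, followed by reviewing every `# TODO` line and the job body (for example, `$PBS_ARRAYID` becomes `$SLURM_ARRAY_TASK_ID`) before submitting with `sbatch job.slurm`. Mapping `ppn` to `--ntasks-per-node` assumes one MPI task per core, which may not suit hybrid MPI/OpenMP jobs that also need `--cpus-per-task`.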
Frequently Asked Questions
Will I need to copy my files from Pacman/Fish to Chinook?
$ARCHIVE and $CENTER are mounted on Chinook, so your existing files in those locations are available on Chinook. However, your home directory is new, and we will not be automatically copying any Pacman/Fish home directory contents to Chinook. If you would like to transfer any files from your Pacman/Fish home directory, you can do so using sftp between Pacman/Fish and Chinook.
I used the PGI compiler on Pacman/Fish. What are my options on Chinook?
Support for the PGI compiler suite will expire in FY17. If possible, please look into compiling your code using the Intel or GNU compiler suites. If not, the latest version of the PGI compilers available when support lapses will remain installed on Chinook.
I have a PBS/Torque batch script. Can I use it on Chinook?
Possibly. Slurm does provide compatibility scripts for various PBS/Torque commands, including qsub. The compatibility is not perfect, however, and you will likely need to debug why your batch script isn't doing what you expect. It is worth putting that time towards porting the PBS script to Slurm syntax and using