My sys admin uses all these terms! What do they mean?

The software stack driving HPC systems is intricate and often appears convoluted. Below we define and briefly discuss many of the terms and concepts you may encounter on your journey into the realm of High Performance Computing. You may notice that some software packages are mentioned in multiple sections. This is because a single package can perform multiple functions. Also be aware that just because package X can perform functions I, J, and K, this does not preclude an admin from choosing package Y for one of those functions because package Y does it better.

Cluster Provisioning
This is the process that determines how compute nodes are configured/loaded with software. It provides such things as the OS, software sets, IP addresses, etc. Cluster provisioners are software packages that can detect new nodes that appear on the network and auto-provision them with pre-determined configurations.
Example provisioners include:

Stateful Provisioning
In most computers, the OS is installed on the local disk. Stateful provisioning is the process of obtaining an OS image from a server and installing that OS to the local disk of the client node. Because a node reboots quickly from its local disk, this approach is useful when bandwidth is extremely limited or when changes to the OS image are expected to be infrequent.

Stateless Provisioning
This is the opposite of stateful provisioning. Instead of installing the OS on the local disk, the provided image is loaded into RAM, which allows the hard disk space to be used for something else (such as swap). The advantages of this approach are that the OS image is refreshed at each reboot and that any changes to the server-hosted image are propagated to each node. It also saves some disk space on each node (the nodes could be completely diskless, as long as the workloads do not require fast swap).

Statelite Provisioning
Statelite is an intermediate step between stateful and stateless provisioning. Some files can persist across reboots, while the admin keeps the advantage of only needing to manage a single image. Statelite also provides a configurable list of directories and files that can be read-write. These read-write directories and files can either be persistent across reboots, or volatile (reset to their original state after reboot).
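The persistent/volatile distinction can be illustrated with a small model. This is only a sketch under stated assumptions: the class StateliteNode, the file paths, and the IMAGE, PERSISTENT, and VOLATILE tables are hypothetical names invented for illustration and do not correspond to any real provisioner's configuration format.

```python
# Hypothetical model of a statelite node. All names and paths are illustrative.
IMAGE = {                      # the single shared, server-hosted image
    "/etc/hostname": "node-template",
    "/etc/motd": "welcome",
    "/var/log/messages": "",
}
PERSISTENT = {"/etc/hostname"}     # read-write, kept across reboots
VOLATILE = {"/var/log/messages"}   # read-write, reset at reboot


class StateliteNode:
    def __init__(self):
        self.persistent_store = {}  # stands in for storage that outlives a reboot
        self.files = {}
        self.reboot()

    def write(self, path, content):
        if path in PERSISTENT:
            self.persistent_store[path] = content
        elif path not in VOLATILE:
            raise PermissionError(f"{path} is read-only")
        self.files[path] = content

    def reboot(self):
        # Start again from the shared image, then overlay the persistent files.
        self.files = dict(IMAGE)
        self.files.update(self.persistent_store)


node = StateliteNode()
node.write("/etc/hostname", "node01")        # persistent change
node.write("/var/log/messages", "boot ok")   # volatile change
node.reboot()
print(node.files["/etc/hostname"])           # survives the reboot
print(repr(node.files["/var/log/messages"])) # back to the image's content
```

Everything not listed as read-write comes straight from the shared image, which is why only one image has to be maintained.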

File systems
In short, a file system controls how data is stored on and retrieved from storage media such as spinning disks (traditional hard drives), SSDs (flash memory), tape, etc. Example desktop filesystems include:

  • XFS
  • Ext4
  • NTFS

Network, or distributed, filesystems use protocols such as NFS or CIFS to add a networking component so that a server can give multiple clients remote access to its data at the same time.
A clustered file system expands on the above by virtue of being mounted across multiple servers via a network rather than directly attached. Multiple clients can access multiple disks or sets of disks without necessarily being aware that they are doing so.
A parallel file system writes data across multiple storage nodes to provide redundancy or improved performance. Bighorn uses a parallel file system known as GPFS.
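One common way to spread data across storage nodes is round-robin striping. The sketch below illustrates the general idea only and is not a description of how GPFS works internally; the function names stripe and reassemble, and the use of in-memory lists to stand in for storage nodes, are assumptions made for illustration.

```python
def stripe(data, num_nodes, stripe_size):
    """Split data into fixed-size chunks and deal them out round-robin."""
    nodes = [[] for _ in range(num_nodes)]
    for offset in range(0, len(data), stripe_size):
        chunk_index = offset // stripe_size
        nodes[chunk_index % num_nodes].append(data[offset:offset + stripe_size])
    return nodes


def reassemble(nodes):
    """Read the chunks back in the same round-robin order."""
    chunks = []
    longest = max(len(node) for node in nodes)
    for row in range(longest):
        for node in nodes:
            if row < len(node):
                chunks.append(node[row])
    return b"".join(chunks)


layout = stripe(b"abcdefghij", num_nodes=3, stripe_size=2)
print(layout)              # [[b'ab', b'gh'], [b'cd', b'ij'], [b'ef']]
print(reassemble(layout))  # b'abcdefghij'
```

Because consecutive chunks live on different nodes, a large read or write can be served by several nodes at once, which is where the performance gain comes from. Redundancy would need extra copies or parity, which this sketch leaves out.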

A fairly exhaustive listing of filesystems can be found here.

Job Scheduler
In the context of HPC, a job (or batch) scheduler manages the job queue for the cluster(s). When jobs are submitted to the queue, the scheduler stores them in order of submission, then releases them to run as requisite conditions are met and adequate resources become available.
Several software packages are capable of job scheduling, and many of them are derivatives of the Portable Batch System (PBS) originally developed for NASA.
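The queueing behaviour described above can be sketched as a strict first-in, first-out scheduler. This is a simplified illustration, not the algorithm used by PBS or any other real scheduler; the class FifoScheduler and its methods are hypothetical, and "resources" are reduced to a single count of free nodes.

```python
from collections import deque


class FifoScheduler:
    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.queue = deque()   # jobs waiting, in order of submission
        self.running = {}      # job name -> nodes held

    def submit(self, name, nodes_needed):
        self.queue.append((name, nodes_needed))

    def release_ready(self):
        """Start jobs from the head of the queue while resources allow."""
        started = []
        while self.queue and self.queue[0][1] <= self.free_nodes:
            name, needed = self.queue.popleft()
            self.free_nodes -= needed
            self.running[name] = needed
            started.append(name)
        return started

    def finish(self, name):
        """Return a finished job's nodes to the free pool."""
        self.free_nodes += self.running.pop(name)


sched = FifoScheduler(total_nodes=4)
sched.submit("a", 2)
sched.submit("b", 3)
sched.submit("c", 1)
print(sched.release_ready())  # ['a']  (b must wait, and c waits behind b)
sched.finish("a")
print(sched.release_ready())  # ['b', 'c']
```

Note that job c could have run immediately on the idle nodes but was held behind job b. Real schedulers and resource managers typically add priorities and other policies to avoid leaving resources idle in this way.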

There are software packages that allow remote monitoring of jobs from handheld devices:

Resource or Workload Managers
The resource manager is the primary management entity responsible for coordinating all non-local interactions, and providing control over batch jobs and distributed compute nodes.
Resource managers are often also job schedulers, but offer more intricate methods of controlling how and when jobs are released from the queue.
Some examples include: