Deploying Slurm on Rocky Linux
Reading time: 30 minutes
AI usage disclosure
To generate the playbooks for Rocky Linux 8, 9, and 10, the human author used AI. The human author manually tested these procedures to ensure that they work correctly. The author takes full responsibility for the correctness of this document. Please report any errors you encounter to the Rocky Linux documentation team.
Introduction
Slurm is an integral technology in the HPC world. It is the backbone of scientific experimentation, ranging from space exploration to weather forecasting. Slurm allows for the easy deployment of workloads across clusters of hundreds or even thousands of nodes.
By the end of this guide, you will better understand what Slurm is, how to deploy it in a basic controller-compute node configuration, and how to run a basic job on a compute node.
Background
Slurm is a cluster management and job scheduling system for Linux clusters. You can use it to run workloads on nodes: anything from memory usage checks to full aerodynamic simulations.
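As a quick illustration of the model, once a cluster is running, a single command hands a workload to Slurm, which schedules it onto a free compute node. The command below is a generic example rather than part of this guide's deployment:
srun -n 1 /bin/hostname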
Prerequisites
- Two Rocky Linux VMs or two physical Rocky Linux nodes.
- An Ansible host to configure the two VMs or nodes.
Slurm setup on Rocky Linux 8
Rocky Linux 8 still uses cgroups v1 by default, so as part of the installation process you must enable cgroups v2 on the host system.
The deployment process uses an Ansible playbook.
Ansible hosts file setup:
cat << "EOF" | sudo tee /etc/ansible/hosts
[rocky-linux8-slurm]
rocky-linux8-slurm-controller-node ansible_ssh_host=192.168.1.120
rocky-linux8-slurm-compute-node ansible_ssh_host=192.168.1.121
EOF
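Before continuing, you can verify that the Ansible host can reach both nodes with an ad-hoc ping (this assumes SSH access to the nodes is already in place):
ansible rocky-linux8-slurm -m ping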
Ansible group_vars setup:
cat << "EOF" | sudo tee /etc/ansible/group_vars/all
---
rocky_linux8_slurm_controller_node_ip: 192.168.1.120
rocky_linux8_slurm_compute_node_ip: 192.168.1.121
EOF
Ansible requirements.yaml:
cat << "EOF" | tee ~/requirements.yaml
---
collections:
- name: community.crypto
- name: community.general
- name: community.mysql
- name: ansible.posix
EOF
ansible-galaxy collection install -r ~/requirements.yaml
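To confirm the collections installed correctly, you can list them (the grep filter is just a convenience):
ansible-galaxy collection list | grep -E 'community\.(crypto|general|mysql)|ansible\.posix'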
Rocky Linux 8 Slurm Deployment Playbook
---
- name: Slurm setup on a Rocky Linux 8 controller and compute Node
hosts: rocky-linux8-slurm
tasks:
- name: Upgrade all packages on both hosts
become: true
ansible.builtin.dnf:
name: "*"
state: latest
- name: Enable cgroupsv2 in the kernel cmdline parameters
become: true
ansible.builtin.shell: grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
- name: Reboot the machines
become: true
ansible.builtin.reboot:
reboot_timeout: 3600
- name: Disable cloud-init management of /etc/hosts so entries persist across reboots
become: true
ansible.builtin.lineinfile:
path: /etc/cloud/cloud.cfg
regexp: '^\s*manage_etc_hosts'
line: 'manage_etc_hosts: false'
- name: Add the IPs and hostnames from each node to /etc/hosts
become: true
ansible.builtin.lineinfile:
path: /etc/hosts
line: '{{ item }}'
with_items:
- '{{ rocky_linux8_slurm_controller_node_ip }} rocky-linux8-slurm-controller-node'
- '{{ rocky_linux8_slurm_compute_node_ip }} rocky-linux8-slurm-compute-node'
- name: Install Python 3.8 for compatibility with community.crypto.openssh_keypair
become: true
ansible.builtin.dnf:
name: python38
state: latest
- name: Set Python 3.8 as the Ansible interpreter
ansible.builtin.set_fact:
ansible_python_interpreter: /usr/bin/python3.8
- name: Generate an OpenSSH keypair of 4096 rsa so that the keys can be exchanged between the nodes
become: true
community.crypto.openssh_keypair:
path: /root/.ssh/id_rsa
- name: Fetch the public ssh key from the Controller Node
become: true
ansible.builtin.slurp:
src: /root/.ssh/id_rsa.pub
register: rocky_linux8_slurm_controller_node_public_key
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Set the Controller Node's public key as a fact for all hosts
ansible.builtin.set_fact:
rocky_linux8_slurm_controller_node_ssh_public_key: "{{ hostvars['rocky-linux8-slurm-controller-node']['rocky_linux8_slurm_controller_node_public_key']['content'] | b64decode | trim }}"
- name: Add the Controller Node's public key to the Compute Node's authorized_keys file
become: true
ansible.posix.authorized_key:
user: root
state: present
key: "{{ rocky_linux8_slurm_controller_node_ssh_public_key }}"
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Add the Compute Node to the Controller Node's `known_hosts` file
become: true
ansible.builtin.known_hosts:
path: /root/.ssh/known_hosts
name: "{{ rocky_linux8_slurm_compute_node_ip }}"
key: "{{ lookup('pipe', 'ssh-keyscan ' + rocky_linux8_slurm_compute_node_ip) }}"
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Fetch the public ssh key from the Compute Node
become: true
ansible.builtin.slurp:
src: /root/.ssh/id_rsa.pub
register: rocky_linux8_slurm_compute_node_public_key
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Set the Compute Node's public key as a fact for all hosts
ansible.builtin.set_fact:
rocky_linux8_slurm_compute_node_ssh_public_key: "{{ hostvars['rocky-linux8-slurm-compute-node']['rocky_linux8_slurm_compute_node_public_key']['content'] | b64decode | trim }}"
- name: Add the Compute Node's public key to the Controller Node's authorized_keys file
become: true
ansible.posix.authorized_key:
user: root
state: present
key: "{{ rocky_linux8_slurm_compute_node_ssh_public_key }}"
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Add the Controller Node to the Compute Node's known_hosts file
become: true
ansible.builtin.known_hosts:
path: /root/.ssh/known_hosts
name: "{{ rocky_linux8_slurm_controller_node_ip }}"
key: "{{ lookup('pipe', 'ssh-keyscan ' + rocky_linux8_slurm_controller_node_ip) }}"
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Pull down the slurm 25.05 tarball
ansible.builtin.get_url:
url: https://download.schedmd.com/slurm/slurm-25.05.5.tar.bz2
dest: /root/slurm-25.05.5.tar.bz2
- name: Install autofs, munge, nfs and the openmpi utilities
become: true
ansible.builtin.dnf:
name:
- autofs
- munge
- openmpi
- openmpi-devel
- nfs-utils
- rpcbind
state: latest
    - name: Enable and start the nfs-server service
      become: true
      ansible.builtin.systemd:
        name: nfs-server
        enabled: yes
        state: started
    - name: Enable and start the rpcbind service
      become: true
      ansible.builtin.systemd:
        name: rpcbind
        enabled: yes
        state: started
- name: Add the source IP address from the Controller Node to the Compute Node's firewall
become: true
ansible.builtin.command: firewall-cmd --add-rich-rule='rule family="ipv4" source address="{{ rocky_linux8_slurm_controller_node_ip }}" accept' --permanent
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Add the source IP address from the Compute Node to the Controller Node's firewall
become: true
ansible.builtin.command: firewall-cmd --add-rich-rule='rule family="ipv4" source address="{{ rocky_linux8_slurm_compute_node_ip }}" accept' --permanent
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Reload the firewall
become: true
ansible.builtin.command: firewall-cmd --reload
- name: Create the nfs directory in the root user's home directory
ansible.builtin.file:
path: /root/nfs
state: directory
    - name: Set up the Controller Node's exports file with the NFS share for the Compute Node
      become: true
      ansible.builtin.lineinfile:
        path: /etc/exports
        line: /root/nfs {{ rocky_linux8_slurm_compute_node_ip }}(rw,sync,no_subtree_check,no_root_squash)
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Export the NFS shares on the Controller Node
      become: true
      ansible.builtin.command: exportfs -a
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Set the NFS share to be mounted on the Compute Node via autofs
become: true
ansible.builtin.lineinfile:
path: /etc/auto.master
line: /root/nfs /etc/auto.nfs
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Create the map file for the NFS share on the Compute Node
become: true
ansible.builtin.file:
path: /etc/auto.nfs
state: touch
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Add the Controller Node's NFS share to the auto.nfs map on the Compute Node
become: true
ansible.builtin.lineinfile:
path: /etc/auto.nfs
line: rocky-linux8-slurm-controller-node:/root/nfs
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
    - name: Restart the autofs service
      become: true
      ansible.builtin.systemd:
        name: autofs
        state: restarted
    - name: Restart the rpcbind service
      become: true
      ansible.builtin.systemd:
        name: rpcbind
        state: restarted
- name: Add the following openmpi PATH lines in .bashrc for the root user
ansible.builtin.lineinfile:
path: /root/.bashrc
line: '{{ item }}'
with_items:
- 'PATH=$PATH:/usr/lib64/openmpi/bin'
- '# LD_LIBRARY_PATH=/usr/lib64/openmpi/lib'
- name: Source .bashrc
ansible.builtin.shell: source /root/.bashrc
- name: Create a machinefile in the root user's NFS directory on the Controller Node
become: true
ansible.builtin.file:
path: /root/nfs/machinefile
state: touch
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Add the Compute Node's hostname to the Controller Node's machinefile
ansible.builtin.lineinfile:
path: /root/nfs/machinefile
line: rocky-linux8-slurm-compute-node
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Install the mariadb packages on the Controller Node
become: true
ansible.builtin.dnf:
name:
- mariadb
- mariadb-devel
- mariadb-server
- python38-devel
state: latest
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Enable and start the mariadb service on the Controller Node
      become: true
      ansible.builtin.systemd:
        name: mariadb
        enabled: yes
        state: started
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Install the pexpect Python library, so that the mysql_secure_installation script can complete without manual intervention
become: true
ansible.builtin.pip:
name: pexpect
executable: pip3.8
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Complete the mysql_secure_installation script on the Controller Node
become: yes
ansible.builtin.expect:
command: mysql_secure_installation
responses:
'Enter current password for root': ''
'Set root password': 'n'
'Remove anonymous users': 'y'
'Disallow root login remotely': 'y'
'Remove test database': 'y'
'Reload privilege tables now': 'y'
        timeout: 30
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Install the python2-PyMySQL package on the Controller Node
become: yes
ansible.builtin.dnf:
name: python2-PyMySQL
state: latest
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Install the mysqlclient Python library, so that the slurm user can be added to the mariadb database
become: true
ansible.builtin.pip:
name: mysqlclient
executable: pip3.8
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Create the slurm_acct_db database needed for slurm
      become: true
      community.mysql.mysql_db:
name: slurm_acct_db
state: present
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Create the slurm MySQL user and grant privileges on the Controller Node
become: yes
community.mysql.mysql_user:
name: slurm
password: '1234'
host: localhost
priv: 'slurm_acct_db.*:ALL,GRANT'
state: present
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Enable the devel repository for access to the munge-devel package
become: true
community.general.dnf_config_manager:
name: devel
state: enabled
- name: Install packages to build the slurm RPMs on each node
become: true
ansible.builtin.dnf:
name:
- autoconf
- automake
- dbus-devel
- libbpf-devel
- make
- mariadb-devel
- munge-devel
- pam-devel
- perl-devel
- perl-ExtUtils-MakeMaker
- python3
- readline-devel
- rpm-build
state: latest
- name: Check if the rpmbuild directory already exists
ansible.builtin.stat:
path: /root/rpmbuild
register: rpmbuild_directory
    - name: Build the slurm RPMs on each node
      ansible.builtin.shell: rpmbuild -ta slurm-25.05.5.tar.bz2
      args:
        chdir: /root
      when: not rpmbuild_directory.stat.exists
    - name: Install the slurm RPMs for the Controller Node
      become: true
      ansible.builtin.dnf:
name:
- /root/rpmbuild/RPMS/x86_64/slurm-25.05.5-1.el8.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-perlapi-25.05.5-1.el8.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-25.05.5-1.el8.x86_64.rpm
state: present
disable_gpg_check: yes
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Install the slurm RPMs for the Compute Node
      become: true
      ansible.builtin.dnf:
name:
- /root/rpmbuild/RPMS/x86_64/slurm-25.05.5-1.el8.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-perlapi-25.05.5-1.el8.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-slurmd-25.05.5-1.el8.x86_64.rpm
state: present
disable_gpg_check: yes
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
    - name: Capture the output of the date command from each node into a variable
      ansible.builtin.shell: date
      register: nodes_date_command_output
    - name: Print the output of the variable to std_out to show the nodes are in sync
      ansible.builtin.debug:
        msg: "The current time and date of the node is: {{ nodes_date_command_output.stdout }}"
- name: Create a systemd directory for slurmd on the Compute Node
become: true
ansible.builtin.file:
path: /etc/systemd/system/slurmd.service.d
state: directory
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Add Delegate=yes to the slurmd service for cgroups v2 on the Compute Node
become: true
ansible.builtin.copy:
dest: /etc/systemd/system/slurmd.service.d/delegate.conf
content: |
[Service]
Delegate=yes
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Reload systemd daemon
become: true
ansible.builtin.command: systemctl daemon-reload
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Create the munge.key using dd (the /usr/sbin/mungekey command is only available for munge version 5.14+) on the Controller Node
become: true
ansible.builtin.shell: dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Change the owner of the munge.key to the user munge and the group munge
      become: true
      ansible.builtin.file:
        path: /etc/munge/munge.key
        owner: munge
        group: munge
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Set only read permissions for the owner (the munge user)
      become: true
      ansible.builtin.file:
        path: /etc/munge/munge.key
        mode: '0400'
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
- name: Copy the munge.key from the Controller Node to the Compute Node
become: true
ansible.posix.synchronize:
src: /etc/munge/munge.key
dest: /etc/munge/munge.key
mode: push
delegate_to: rocky-linux8-slurm-controller-node
when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
    - name: Set the user and group ownership of the munge.key to the munge user on the Compute Node
      become: true
      ansible.builtin.file:
        path: /etc/munge/munge.key
        owner: munge
        group: munge
      when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
    - name: Set only read permissions for the owner (the munge user) on the Compute Node
      become: true
      ansible.builtin.file:
        path: /etc/munge/munge.key
        mode: '0400'
      when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Create the /etc/slurm directory on both nodes
become: true
ansible.builtin.file:
path: /etc/slurm
state: directory
- name: Create the slurm.conf configuration file on each node
become: true
ansible.builtin.copy:
dest: /etc/slurm/slurm.conf
content: |
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=rocky-linux8-slurm-cluster
SlurmctldHost=rocky-linux8-slurm-controller-node
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
#MpiDefault=
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePort=
#AccountingStorageType=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
#JobAcctGatherType=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rocky-linux8-slurm-compute-node NodeAddr={{ rocky_linux8_slurm_compute_node_ip }} State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
    - name: Add the munge group with a GID of 2001
      become: true
      ansible.builtin.group:
        name: munge
        gid: 2001
        state: present
    - name: Add the munge user with a UID of 2001
      become: true
      ansible.builtin.user:
        name: munge
        uid: 2001
        group: munge
        comment: "MUNGE Uid 'N' Gid Emporium"
        home: /var/lib/munge
        shell: /sbin/nologin
        create_home: yes
        state: present
    - name: Add the slurm group with a GID of 2002
      become: true
      ansible.builtin.group:
        name: slurm
        gid: 2002
        state: present
    - name: Add the slurm user with a UID of 2002
      become: true
      ansible.builtin.user:
        name: slurm
        uid: 2002
        group: slurm
        comment: "SLURM Workload Manager"
        home: /var/lib/slurm
        shell: /sbin/nologin
        create_home: yes
        state: present
- name: Create the required munge and slurm directories
become: true
ansible.builtin.file:
path: "{{ item }}"
state: directory
loop:
- /run/munge
- /etc/slurm
- /run/slurm
- /var/lib/slurm
- /var/log/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
- name: Change ownership of the munge directories to the munge user
become: true
ansible.builtin.shell: "chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/"
- name: Change ownership of the slurm directories to the slurm user
become: true
ansible.builtin.shell: "chown -R slurm: /etc/slurm/ /var/log/slurm/ /var/lib/slurm/ /run/slurm/ /var/spool/slurmd/ /var/spool/slurmctld"
- name: Set permissions of 0755 on the munge and slurm directories (due to munge version 5.13 being installed)
become: true
ansible.builtin.shell: "chmod 0755 /run/munge /etc/slurm/ /var/log/slurm/ /var/lib/slurm/ /run/slurm/ /var/spool/slurmd/ /var/spool/slurmctld"
- name: Create the /etc/slurm/cgroup.conf file on both nodes
become: true
ansible.builtin.file:
path: /etc/slurm/cgroup.conf
state: touch
- name: Set EnableControllers and cgroups v2 in /etc/slurm/cgroup.conf
become: true
ansible.builtin.lineinfile:
path: /etc/slurm/cgroup.conf
line: "{{ item }}"
loop:
- "CgroupPlugin=cgroup/v2"
- "EnableControllers=yes"
    - name: Enable and start the munge service on both nodes
      become: true
      ansible.builtin.systemd:
        name: munge
        enabled: yes
        state: started
    - name: Enable and start the slurmctld service on the Controller Node
      become: true
      ansible.builtin.systemd:
        name: slurmctld
        enabled: yes
        state: started
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Enable and start the slurmd service on the Compute Node
      become: true
      ansible.builtin.systemd:
        name: slurmd
        enabled: yes
        state: started
      when: inventory_hostname == 'rocky-linux8-slurm-compute-node'
- name: Run a test srun command on the Controller Node, which should return the hostname of the Compute Node
ansible.builtin.shell: 'srun -c 1 -n 1 -J crunchy "/bin/hostname"'
register: srun_hostname_output_from_compute_node
when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
    - name: Print the output of the variable to std_out
      ansible.builtin.debug:
        msg: "{{ srun_hostname_output_from_compute_node }}"
      when: inventory_hostname == 'rocky-linux8-slurm-controller-node'
Notes on the Slurm installation for Rocky Linux 8
The task "Enable the devel repository for access to the munge-devel package" requires the Devel repository for ease of deployment of munge.
Once the slurm RPM build is complete, you should disable the Devel repository again with dnf config-manager --set-disabled devel.
As a note, it is best practice to build munge from source.
The task "Create the slurm.conf configuration file on each node" uses a configuration generated by the Slurm Configurator Tool. The main lines of note are:
ClusterName=rocky-linux8-slurm-cluster
SlurmctldHost=rocky-linux8-slurm-controller-node
SlurmUser=slurm
NodeName=rocky-linux8-slurm-compute-node NodeAddr={{ rocky_linux8_slurm_compute_node_ip }} State=UNKNOWN
Alter each of the above values to suit your cluster.
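For example, a compute node with more resources might declare its CPU and memory explicitly. The figures below are illustrative only; running slurmd -C on a node prints its real values in slurm.conf format:
NodeName=rocky-linux8-slurm-compute-node NodeAddr=192.168.1.121 CPUs=4 RealMemory=7800 State=UNKNOWN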
Generally, it is best to leave SlurmUser as slurm, as you do not want to run your workloads as root for security reasons.
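After the playbook completes, a few manual checks on the Controller Node confirm that the pieces fit together. These commands assume the hostnames used throughout this guide:
# Validate a munge credential locally, then across to the Compute Node
munge -n | unmunge
munge -n | ssh rocky-linux8-slurm-compute-node unmunge
# Confirm the Compute Node registered with the controller and the partition is up
sinfo
scontrol show node rocky-linux8-slurm-compute-node
# Run a job on the Compute Node
srun -n 1 /bin/hostname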
Slurm setup on Rocky Linux 9
Ansible hosts file setup:
cat << "EOF" | sudo tee /etc/ansible/hosts
[rocky-linux9-slurm]
rocky-linux9-slurm-controller-node ansible_ssh_host=<CONTROLLER_IP>
rocky-linux9-slurm-compute-node ansible_ssh_host=<COMPUTE_IP>
[rocky-linux9-slurm:vars]
ansible_user=root
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
EOF
Ansible group_vars setup:
cat << "EOF" | sudo tee /etc/ansible/group_vars/all
---
rocky_linux9_slurm_controller_node_ip: <CONTROLLER_IP>
rocky_linux9_slurm_compute_node_ip: <COMPUTE_IP>
EOF
Ansible requirements.yaml:
cat << "EOF" | tee ~/requirements.yaml
---
collections:
- name: community.crypto
- name: community.general
- name: community.mysql
- name: ansible.posix
EOF
ansible-galaxy collection install -r ~/requirements.yaml
Rocky Linux 9 Slurm Deployment Playbook
---
# Slurm 25.05 Deployment Playbook for Rocky Linux 9.7
# Controller-Compute Node Configuration
- name: Slurm setup on a Rocky Linux 9 controller and compute Node
hosts: rocky-linux9-slurm
tasks:
- name: Upgrade all packages on both hosts
become: true
ansible.builtin.dnf:
name: "*"
state: latest
- name: Enable cgroupsv2 in the kernel cmdline parameters
become: true
ansible.builtin.shell: grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
- name: Reboot the machines
become: true
ansible.builtin.reboot:
reboot_timeout: 3600
- name: Disable cloud-init management of /etc/hosts so entries persist across reboots
become: true
ansible.builtin.lineinfile:
path: /etc/cloud/cloud.cfg
regexp: '^\s*manage_etc_hosts'
line: 'manage_etc_hosts: false'
- name: Add the IPs and hostnames from each node to /etc/hosts
become: true
ansible.builtin.lineinfile:
path: /etc/hosts
line: '{{ item }}'
with_items:
- '{{ rocky_linux9_slurm_controller_node_ip }} rocky-linux9-slurm-controller-node'
- '{{ rocky_linux9_slurm_compute_node_ip }} rocky-linux9-slurm-compute-node'
- name: Generate an OpenSSH keypair so that the keys can be exchanged between the nodes
become: true
community.crypto.openssh_keypair:
path: /root/.ssh/id_rsa
- name: Fetch the public ssh key from the Controller Node
become: true
ansible.builtin.slurp:
src: /root/.ssh/id_rsa.pub
register: rocky_linux9_slurm_controller_node_public_key
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Set the Controller Node's public key as a fact for all hosts
ansible.builtin.set_fact:
rocky_linux9_slurm_controller_node_ssh_public_key: "{{ hostvars['rocky-linux9-slurm-controller-node']['rocky_linux9_slurm_controller_node_public_key']['content'] | b64decode | trim }}"
- name: Add the Controller Node's public key to the Compute Node's authorized_keys file
become: true
ansible.posix.authorized_key:
user: root
state: present
key: "{{ rocky_linux9_slurm_controller_node_ssh_public_key }}"
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Add the Compute Node to the Controller Node's known_hosts file
become: true
ansible.builtin.known_hosts:
path: /root/.ssh/known_hosts
name: "{{ rocky_linux9_slurm_compute_node_ip }}"
key: "{{ lookup('pipe', 'ssh-keyscan ' + rocky_linux9_slurm_compute_node_ip) }}"
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Fetch the public ssh key from the Compute Node
become: true
ansible.builtin.slurp:
src: /root/.ssh/id_rsa.pub
register: rocky_linux9_slurm_compute_node_public_key
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Set the Compute Node's public key as a fact for all hosts
ansible.builtin.set_fact:
rocky_linux9_slurm_compute_node_ssh_public_key: "{{ hostvars['rocky-linux9-slurm-compute-node']['rocky_linux9_slurm_compute_node_public_key']['content'] | b64decode | trim }}"
- name: Add the Compute Node's public key to the Controller Node's authorized_keys file
become: true
ansible.posix.authorized_key:
user: root
state: present
key: "{{ rocky_linux9_slurm_compute_node_ssh_public_key }}"
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Add the Controller Node to the Compute Node's known_hosts file
become: true
ansible.builtin.known_hosts:
path: /root/.ssh/known_hosts
name: "{{ rocky_linux9_slurm_controller_node_ip }}"
key: "{{ lookup('pipe', 'ssh-keyscan ' + rocky_linux9_slurm_controller_node_ip) }}"
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Pull down the slurm 25.05 tarball
ansible.builtin.get_url:
url: https://download.schedmd.com/slurm/slurm-25.05.5.tar.bz2
dest: /root/slurm-25.05.5.tar.bz2
- name: Install autofs, munge, nfs and the openmpi utilities
become: true
ansible.builtin.dnf:
name:
- autofs
- munge
- openmpi
- openmpi-devel
- nfs-utils
- rpcbind
state: latest
- name: Enable and start the nfs-server service
become: true
ansible.builtin.systemd:
name: nfs-server
enabled: yes
state: started
- name: Enable and start the rpcbind service
become: true
ansible.builtin.systemd:
name: rpcbind
enabled: yes
state: started
- name: Add the source IP address from the Controller Node to the Compute Node's firewall
become: true
ansible.builtin.command: firewall-cmd --add-rich-rule='rule family="ipv4" source address="{{ rocky_linux9_slurm_controller_node_ip }}" accept' --permanent
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Add the source IP address from the Compute Node to the Controller Node's firewall
become: true
ansible.builtin.command: firewall-cmd --add-rich-rule='rule family="ipv4" source address="{{ rocky_linux9_slurm_compute_node_ip }}" accept' --permanent
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Reload the firewall
become: true
ansible.builtin.command: firewall-cmd --reload
- name: Create the nfs directory in the root user's home directory
ansible.builtin.file:
path: /root/nfs
state: directory
- name: Set up the Controller Node's exports file with the NFS share for the Compute Node
become: true
ansible.builtin.lineinfile:
path: /etc/exports
line: /root/nfs {{ rocky_linux9_slurm_compute_node_ip }}(rw,sync,no_subtree_check,no_root_squash)
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Export the NFS shares on the Controller Node
become: true
ansible.builtin.command: exportfs -a
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Set the NFS share to be mounted on the Compute Node via autofs
become: true
ansible.builtin.lineinfile:
path: /etc/auto.master
line: /root/nfs /etc/auto.nfs
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Create the map file for the NFS share on the Compute Node
become: true
ansible.builtin.file:
path: /etc/auto.nfs
state: touch
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Add the Controller Node's NFS share to the auto.nfs map on the Compute Node
become: true
ansible.builtin.lineinfile:
path: /etc/auto.nfs
line: rocky-linux9-slurm-controller-node:/root/nfs
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Restart the autofs service
become: true
ansible.builtin.systemd:
name: autofs
state: restarted
- name: Restart the rpcbind service
become: true
ansible.builtin.systemd:
name: rpcbind
state: restarted
- name: Add the following openmpi PATH lines in .bashrc for the root user
ansible.builtin.lineinfile:
path: /root/.bashrc
line: '{{ item }}'
with_items:
- 'PATH=$PATH:/usr/lib64/openmpi/bin'
- '# LD_LIBRARY_PATH=/usr/lib64/openmpi/lib'
- name: Source .bashrc
ansible.builtin.shell: source /root/.bashrc
- name: Create a machinefile in the root user's NFS directory on the Controller Node
become: true
ansible.builtin.file:
path: /root/nfs/machinefile
state: touch
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Add the Compute Node's hostname to the Controller Node's machinefile
ansible.builtin.lineinfile:
path: /root/nfs/machinefile
line: rocky-linux9-slurm-compute-node
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Reset the mariadb module stream to resolve modular filtering
become: true
ansible.builtin.command: dnf module reset mariadb -y
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
ignore_errors: yes
- name: Enable the mariadb module stream
become: true
ansible.builtin.command: dnf module enable mariadb -y
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
ignore_errors: yes
- name: Install the mariadb packages on the Controller Node
become: true
ansible.builtin.dnf:
name:
- mariadb
- mariadb-devel
- mariadb-server
- python3-devel
state: latest
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Enable and start the mariadb service on the Controller Node
become: true
ansible.builtin.systemd:
name: mariadb
enabled: yes
state: started
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Install the pexpect Python library for mysql_secure_installation automation
become: true
ansible.builtin.pip:
name: pexpect
executable: pip3
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Complete the mysql_secure_installation script on the Controller Node
become: yes
ansible.builtin.expect:
command: mysql_secure_installation
responses:
'Enter current password for root': ''
'Switch to unix_socket authentication': 'n'
'Change the root password': 'n'
'Set root password': 'n'
'Remove anonymous users': 'y'
'Disallow root login remotely': 'y'
'Remove test database': 'y'
'Reload privilege tables now': 'y'
timeout: 30
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
ignore_errors: yes
- name: Install the python3-PyMySQL package on the Controller Node
become: yes
ansible.builtin.dnf:
name: python3-PyMySQL
state: latest
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Create the slurm_acct_db database needed for slurm
become: true
community.mysql.mysql_db:
name: slurm_acct_db
state: present
login_unix_socket: /var/lib/mysql/mysql.sock
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Create the slurm MySQL user and grant privileges on the Controller Node
become: yes
community.mysql.mysql_user:
name: slurm
password: '1234'
host: localhost
priv: 'slurm_acct_db.*:ALL,GRANT'
state: present
login_unix_socket: /var/lib/mysql/mysql.sock
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Enable the CRB repository for access to the munge-devel package
become: true
ansible.builtin.command: dnf config-manager --set-enabled crb
- name: Install packages to build the slurm RPMs on each node
become: true
ansible.builtin.dnf:
name:
- autoconf
- automake
- dbus-devel
- libbpf-devel
- make
- mariadb-devel
- munge-devel
- pam-devel
- perl-devel
- perl-ExtUtils-MakeMaker
- python3
- readline-devel
- rpm-build
state: latest
- name: Check if the rpmbuild directory already exists
ansible.builtin.stat:
path: /root/rpmbuild
register: rpmbuild_directory
- name: Build the slurm RPMs on each node
ansible.builtin.shell: rpmbuild -ta slurm-25.05.5.tar.bz2
args:
chdir: /root
when: not rpmbuild_directory.stat.exists
- name: Install the slurm RPMs for the Controller Node
become: true
ansible.builtin.dnf:
name:
- /root/rpmbuild/RPMS/x86_64/slurm-25.05.5-1.el9.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-perlapi-25.05.5-1.el9.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-25.05.5-1.el9.x86_64.rpm
state: present
disable_gpg_check: yes
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Install the slurm RPMs for the Compute Node
become: true
ansible.builtin.dnf:
name:
- /root/rpmbuild/RPMS/x86_64/slurm-25.05.5-1.el9.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-perlapi-25.05.5-1.el9.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-slurmd-25.05.5-1.el9.x86_64.rpm
state: present
disable_gpg_check: yes
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
    - name: Capture the output of the date command from each node into a variable
ansible.builtin.shell: date
register: nodes_date_command_output
- name: Print the output of the variable to std_out to show the nodes are in sync
ansible.builtin.debug:
msg: "The current time and date of the node is: {{ nodes_date_command_output.stdout }}"
- name: Create a systemd directory for slurmd on the Compute Node
become: true
ansible.builtin.file:
path: /etc/systemd/system/slurmd.service.d
state: directory
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Add Delegate=yes to the slurmd service for cgroups v2 on the Compute Node
become: true
ansible.builtin.copy:
dest: /etc/systemd/system/slurmd.service.d/delegate.conf
content: |
[Service]
Delegate=yes
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Reload systemd daemon
become: true
ansible.builtin.command: systemctl daemon-reload
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Create the munge.key using dd on the Controller Node
become: true
ansible.builtin.shell: dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Change the owner of the munge.key to the user munge and the group munge
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
owner: munge
group: munge
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Set only read permissions for the owner (the munge user)
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
mode: '0400'
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Copy the munge.key from the Controller Node to the Compute Node
become: true
ansible.posix.synchronize:
src: /etc/munge/munge.key
dest: /etc/munge/munge.key
mode: push
delegate_to: rocky-linux9-slurm-controller-node
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Set the user and group ownership of the munge.key to the munge user on the Compute Node
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
owner: munge
group: munge
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Set only read permissions for the owner (the munge user) on the Compute Node
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
mode: '0400'
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Create the /etc/slurm directory on both nodes
become: true
ansible.builtin.file:
path: /etc/slurm
state: directory
- name: Create the slurm.conf configuration file on each node
become: true
ansible.builtin.copy:
dest: /etc/slurm/slurm.conf
content: |
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=rocky-linux9-slurm-cluster
SlurmctldHost=rocky-linux9-slurm-controller-node
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
#MpiDefault=
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePort=
#AccountingStorageType=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
#JobAcctGatherType=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rocky-linux9-slurm-compute-node NodeAddr={{ rocky_linux9_slurm_compute_node_ip }} State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
- name: Add the munge group with a GID of 2001
become: true
ansible.builtin.group:
name: munge
gid: 2001
state: present
- name: Add the munge user with a UID of 2001
become: true
ansible.builtin.user:
name: munge
uid: 2001
group: munge
comment: "MUNGE Uid 'N' Gid Emporium"
home: /var/lib/munge
shell: /sbin/nologin
create_home: yes
state: present
- name: Add the slurm group with a GID of 2002
become: true
ansible.builtin.group:
name: slurm
gid: 2002
state: present
- name: Add the slurm user with a UID of 2002
become: true
ansible.builtin.user:
name: slurm
uid: 2002
group: slurm
comment: "SLURM Workload Manager"
home: /var/lib/slurm
shell: /sbin/nologin
create_home: yes
state: present
- name: Create the required munge and slurm directories
become: true
ansible.builtin.file:
path: "{{ item }}"
state: directory
loop:
- /run/munge
- /etc/slurm
- /run/slurm
- /var/lib/slurm
- /var/log/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
- name: Change ownership of the munge directories to the munge user
become: true
ansible.builtin.file:
path: "{{ item }}"
owner: munge
group: munge
recurse: yes
loop:
- /etc/munge
- /var/log/munge
- /var/lib/munge
- /run/munge
- name: Change ownership of the slurm directories to the slurm user
become: true
ansible.builtin.file:
path: "{{ item }}"
owner: slurm
group: slurm
recurse: yes
loop:
- /etc/slurm
- /var/log/slurm
- /var/lib/slurm
- /run/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
- name: Set permissions of 0755 on the munge and slurm directories
become: true
ansible.builtin.file:
path: "{{ item }}"
mode: '0755'
loop:
- /run/munge
- /etc/slurm
- /var/log/slurm
- /var/lib/slurm
- /run/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
- name: Create the /etc/slurm/cgroup.conf file on both nodes
become: true
ansible.builtin.file:
path: /etc/slurm/cgroup.conf
state: touch
- name: Set EnableControllers and cgroups v2 in /etc/slurm/cgroup.conf
become: true
ansible.builtin.lineinfile:
path: /etc/slurm/cgroup.conf
line: "{{ item }}"
loop:
- "CgroupPlugin=cgroup/v2"
- "EnableControllers=yes"
- name: Enable and start the munge service on both nodes
become: true
ansible.builtin.systemd:
name: munge
enabled: yes
state: started
- name: Enable and start the slurmctld service on the Controller Node
become: true
ansible.builtin.systemd:
name: slurmctld
enabled: yes
state: started
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Enable and start the slurmd service on the Compute Node
become: true
ansible.builtin.systemd:
name: slurmd
enabled: yes
state: started
when: inventory_hostname == 'rocky-linux9-slurm-compute-node'
- name: Wait for slurmd to register with controller
ansible.builtin.pause:
seconds: 10
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Run a test srun command on the Controller Node, which should return the hostname of the Compute Node
ansible.builtin.shell: 'srun -c 1 -n 1 -J crunchy "/bin/hostname"'
register: srun_hostname_output_from_compute_node
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
- name: Print the output of the variable to std_out
ansible.builtin.debug:
msg: "{{ srun_hostname_output_from_compute_node }}"
when: inventory_hostname == 'rocky-linux9-slurm-controller-node'
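With the cluster up, batch jobs are a more common workflow than interactive srun calls. The following is a minimal sketch of a batch script targeting the debug partition defined in slurm.conf; the script name and contents are illustrative, not part of the tested playbooks:
cat << "EOF" > ~/hello.sbatch
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=debug
#SBATCH --ntasks=1
srun /bin/hostname
EOF
sbatch ~/hello.sbatch
squeue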
Slurm setup on Rocky Linux 10
Ansible hosts file setup:
cat << "EOF" | sudo tee /etc/ansible/hosts
[rocky-linux10-slurm]
rocky-linux10-slurm-controller-node ansible_ssh_host=<CONTROLLER_IP>
rocky-linux10-slurm-compute-node ansible_ssh_host=<COMPUTE_IP>
[rocky-linux10-slurm:vars]
ansible_user=root
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
EOF
Ansible group_vars setup:
cat << "EOF" | sudo tee /etc/ansible/group_vars/all
---
rocky_linux10_slurm_controller_node_ip: <CONTROLLER_IP>
rocky_linux10_slurm_compute_node_ip: <COMPUTE_IP>
EOF
Ansible requirements.yaml:
cat << "EOF" | tee ~/requirements.yaml
---
collections:
- name: community.crypto
- name: community.general
- name: community.mysql
- name: ansible.posix
EOF
ansible-galaxy collection install -r ~/requirements.yaml
Rocky Linux 10 Slurm Deployment Playbook
---
# Slurm 25.05 Deployment Playbook for Rocky Linux 10.1
# Controller-Compute Node Configuration
# Note: Rocky Linux 10 uses cgroups v2 by default, no kernel parameter modification required
- name: Slurm setup on a Rocky Linux 10 controller and compute Node
hosts: rocky-linux10-slurm
tasks:
- name: Upgrade all packages on both hosts
become: true
ansible.builtin.dnf:
name: "*"
state: latest
# Note: cgroups v2 is the default on Rocky Linux 10, no grubby command needed
- name: Reboot the machines to ensure all updates are applied
become: true
ansible.builtin.reboot:
reboot_timeout: 3600
- name: Disable cloud-init management of /etc/hosts so entries persist across reboots
become: true
ansible.builtin.lineinfile:
path: /etc/cloud/cloud.cfg
regexp: '^\s*manage_etc_hosts'
line: 'manage_etc_hosts: false'
- name: Add the IPs and hostnames from each node to /etc/hosts
become: true
ansible.builtin.lineinfile:
path: /etc/hosts
line: '{{ item }}'
with_items:
- '{{ rocky_linux10_slurm_controller_node_ip }} rocky-linux10-slurm-controller-node'
- '{{ rocky_linux10_slurm_compute_node_ip }} rocky-linux10-slurm-compute-node'
- name: Generate an OpenSSH keypair so that the keys can be exchanged between the nodes
become: true
community.crypto.openssh_keypair:
path: /root/.ssh/id_rsa
- name: Fetch the public ssh key from the Controller Node
become: true
ansible.builtin.slurp:
src: /root/.ssh/id_rsa.pub
register: rocky_linux10_slurm_controller_node_public_key
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Set the Controller Node's public key as a fact for all hosts
ansible.builtin.set_fact:
rocky_linux10_slurm_controller_node_ssh_public_key: "{{ hostvars['rocky-linux10-slurm-controller-node']['rocky_linux10_slurm_controller_node_public_key']['content'] | b64decode | trim }}"
- name: Add the Controller Node's public key to the Compute Node's authorized_keys file
become: true
ansible.posix.authorized_key:
user: root
state: present
key: "{{ rocky_linux10_slurm_controller_node_ssh_public_key }}"
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Add the Compute Node to the Controller Node's known_hosts file
become: true
ansible.builtin.known_hosts:
path: /root/.ssh/known_hosts
name: "{{ rocky_linux10_slurm_compute_node_ip }}"
key: "{{ lookup('pipe', 'ssh-keyscan ' + rocky_linux10_slurm_compute_node_ip) }}"
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Fetch the public ssh key from the Compute Node
become: true
ansible.builtin.slurp:
src: /root/.ssh/id_rsa.pub
register: rocky_linux10_slurm_compute_node_public_key
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Set the Compute Node's public key as a fact for all hosts
ansible.builtin.set_fact:
rocky_linux10_slurm_compute_node_ssh_public_key: "{{ hostvars['rocky-linux10-slurm-compute-node']['rocky_linux10_slurm_compute_node_public_key']['content'] | b64decode | trim }}"
- name: Add the Compute Node's public key to the Controller Node's authorized_keys file
become: true
ansible.posix.authorized_key:
user: root
state: present
key: "{{ rocky_linux10_slurm_compute_node_ssh_public_key }}"
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Add the Controller Node to the Compute Node's known_hosts file
become: true
ansible.builtin.known_hosts:
path: /root/.ssh/known_hosts
name: "{{ rocky_linux10_slurm_controller_node_ip }}"
key: "{{ lookup('pipe', 'ssh-keyscan ' + rocky_linux10_slurm_controller_node_ip) }}"
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Pull down the slurm 25.05 tarball
ansible.builtin.get_url:
url: https://download.schedmd.com/slurm/slurm-25.05.5.tar.bz2
dest: /root/slurm-25.05.5.tar.bz2
- name: Install autofs, munge, nfs and the openmpi utilities
become: true
ansible.builtin.dnf:
name:
- autofs
- munge
- openmpi
- openmpi-devel
- nfs-utils
- rpcbind
state: latest
- name: Enable and start the nfs-server service
become: true
ansible.builtin.systemd:
name: nfs-server
enabled: yes
state: started
- name: Enable and start the rpcbind service
become: true
ansible.builtin.systemd:
name: rpcbind
enabled: yes
state: started
- name: Add the source IP address from the Controller Node to the Compute Node's firewall
become: true
ansible.builtin.command: firewall-cmd --add-rich-rule='rule family="ipv4" source address="{{ rocky_linux10_slurm_controller_node_ip }}" accept' --permanent
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Add the source IP address from the Compute Node to the Controller Node's firewall
become: true
ansible.builtin.command: firewall-cmd --add-rich-rule='rule family="ipv4" source address="{{ rocky_linux10_slurm_compute_node_ip }}" accept' --permanent
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Reload the firewall
become: true
ansible.builtin.command: firewall-cmd --reload
- name: Create the nfs directory in the root user's home directory
ansible.builtin.file:
path: /root/nfs
state: directory
- name: Set up the Controller Node's exports file with the NFS share for the Compute Node
become: true
ansible.builtin.lineinfile:
path: /etc/exports
line: /root/nfs {{ rocky_linux10_slurm_compute_node_ip }}(rw,sync,no_subtree_check,no_root_squash)
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Export the NFS shares on the Controller Node
become: true
ansible.builtin.command: exportfs -a
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Set the NFS share to be mounted on the Compute Node via autofs
become: true
ansible.builtin.lineinfile:
path: /etc/auto.master
line: /root/nfs /etc/auto.nfs
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Create the map file for the NFS share on the Compute Node
become: true
ansible.builtin.file:
path: /etc/auto.nfs
state: touch
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Add the Controller Node's NFS share to the auto.nfs map on the Compute Node
become: true
ansible.builtin.lineinfile:
path: /etc/auto.nfs
line: rocky-linux10-slurm-controller-node:/root/nfs
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Restart the autofs service
become: true
ansible.builtin.systemd:
name: autofs
state: restarted
- name: Restart the rpcbind service
become: true
ansible.builtin.systemd:
name: rpcbind
state: restarted
- name: Add the following openmpi PATH lines in .bashrc for the root user
ansible.builtin.lineinfile:
path: /root/.bashrc
line: '{{ item }}'
with_items:
- 'PATH=$PATH:/usr/lib64/openmpi/bin'
- '# LD_LIBRARY_PATH=/usr/lib64/openmpi/lib'
- name: Source .bashrc
ansible.builtin.shell: source /root/.bashrc
- name: Create a machinefile in the root user's NFS directory on the Controller Node
become: true
ansible.builtin.file:
path: /root/nfs/machinefile
state: touch
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Add the Compute Node's hostname to the Controller Node's machinefile
ansible.builtin.lineinfile:
path: /root/nfs/machinefile
line: rocky-linux10-slurm-compute-node
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Install the mariadb packages on the Controller Node
become: true
ansible.builtin.dnf:
name:
- mariadb
- mariadb-devel
- mariadb-server
- python3-devel
state: latest
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Enable and start the mariadb service on the Controller Node
become: true
ansible.builtin.systemd:
name: mariadb
enabled: yes
state: started
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Install the pexpect Python library for mysql_secure_installation automation
become: true
ansible.builtin.pip:
name: pexpect
executable: pip3
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Complete the mysql_secure_installation script on the Controller Node
become: yes
ansible.builtin.expect:
command: mysql_secure_installation
responses:
'Enter current password for root': ''
'Switch to unix_socket authentication': 'n'
'Change the root password': 'n'
'Set root password': 'n'
'Remove anonymous users': 'y'
'Disallow root login remotely': 'y'
'Remove test database': 'y'
'Reload privilege tables now': 'y'
timeout: 30
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
ignore_errors: yes
- name: Install the python3-PyMySQL package on the Controller Node
become: yes
ansible.builtin.dnf:
name: python3-PyMySQL
state: latest
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Create the slurm_acct_db database needed for slurm
become: true
community.mysql.mysql_db:
name: slurm_acct_db
state: present
login_unix_socket: /var/lib/mysql/mysql.sock
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Create the slurm MySQL user and grant privileges on the Controller Node
become: yes
community.mysql.mysql_user:
name: slurm
password: '1234'
host: localhost
priv: 'slurm_acct_db.*:ALL,GRANT'
state: present
login_unix_socket: /var/lib/mysql/mysql.sock
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Enable the CRB repository for access to the munge-devel package
become: true
ansible.builtin.command: dnf config-manager --set-enabled crb
- name: Install packages to build the slurm RPMs on each node
become: true
ansible.builtin.dnf:
name:
- autoconf
- automake
- dbus-devel
- libbpf-devel
- make
- mariadb-devel
- munge-devel
- pam-devel
- perl-devel
- perl-ExtUtils-MakeMaker
- python3
- readline-devel
- rpm-build
state: latest
- name: Check if the rpmbuild directory already exists
ansible.builtin.stat:
path: /root/rpmbuild
register: rpmbuild_directory
- name: Build the slurm RPMs on each node
ansible.builtin.shell: rpmbuild -ta slurm-25.05.5.tar.bz2
args:
chdir: /root
when: not rpmbuild_directory.stat.exists
- name: Install the slurm RPMs for the Controller Node
become: true
ansible.builtin.dnf:
name:
- /root/rpmbuild/RPMS/x86_64/slurm-25.05.5-1.el10.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-perlapi-25.05.5-1.el10.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-25.05.5-1.el10.x86_64.rpm
state: present
disable_gpg_check: yes
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Install the slurm RPMs for the Compute Node
become: true
ansible.builtin.dnf:
name:
- /root/rpmbuild/RPMS/x86_64/slurm-25.05.5-1.el10.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-perlapi-25.05.5-1.el10.x86_64.rpm
- /root/rpmbuild/RPMS/x86_64/slurm-slurmd-25.05.5-1.el10.x86_64.rpm
state: present
disable_gpg_check: yes
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
    - name: Capture the output of the date command from each node into a variable
ansible.builtin.shell: date
register: nodes_date_command_output
- name: Print the output of the variable to std_out to show the nodes are in sync
ansible.builtin.debug:
msg: "The current time and date of the node is: {{ nodes_date_command_output.stdout }}"
- name: Create a systemd directory for slurmd on the Compute Node
become: true
ansible.builtin.file:
path: /etc/systemd/system/slurmd.service.d
state: directory
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Add Delegate=yes to the slurmd service for cgroups v2 on the Compute Node
become: true
ansible.builtin.copy:
dest: /etc/systemd/system/slurmd.service.d/delegate.conf
content: |
[Service]
Delegate=yes
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Reload systemd daemon
become: true
ansible.builtin.command: systemctl daemon-reload
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Create the munge.key using dd on the Controller Node
become: true
ansible.builtin.shell: dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Change the owner of the munge.key to the user munge and the group munge
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
owner: munge
group: munge
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Set only read permissions for the owner (the munge user)
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
mode: '0400'
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Copy the munge.key from the Controller Node to the Compute Node
become: true
ansible.posix.synchronize:
src: /etc/munge/munge.key
dest: /etc/munge/munge.key
mode: push
delegate_to: rocky-linux10-slurm-controller-node
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Set the user and group ownership of the munge.key to the munge user on the Compute Node
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
owner: munge
group: munge
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Set only read permissions for the owner (the munge user) on the Compute Node
become: true
ansible.builtin.file:
path: /etc/munge/munge.key
mode: '0400'
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
- name: Create the /etc/slurm directory on both nodes
become: true
ansible.builtin.file:
path: /etc/slurm
state: directory
- name: Create the slurm.conf configuration file on each node
become: true
ansible.builtin.copy:
dest: /etc/slurm/slurm.conf
content: |
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=rocky-linux10-slurm-cluster
SlurmctldHost=rocky-linux10-slurm-controller-node
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
#MpiDefault=
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePort=
#AccountingStorageType=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
#JobAcctGatherType=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
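# Note: declare the node's real hardware (for example CPUs=, Sockets=,
# CoresPerSocket=, RealMemory=) on the NodeName line below; without them,
# slurm assumes minimal defaults such as a single CPU.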
NodeName=rocky-linux10-slurm-compute-node NodeAddr={{ rocky_linux10_slurm_compute_node_ip }} State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
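# The munge and slurm accounts should use the same UID and GID on every node,
# which is why fixed values (2001 and 2002) are set instead of letting the
# system pick them.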
- name: Add the munge group with a GID of 2001
become: true
ansible.builtin.group:
name: munge
gid: 2001
state: present
- name: Add the munge user with a UID of 2001
become: true
ansible.builtin.user:
name: munge
uid: 2001
group: munge
comment: "MUNGE Uid 'N' Gid Emporium"
home: /var/lib/munge
shell: /sbin/nologin
create_home: yes
state: present
- name: Add the slurm group with a GID of 2002
become: true
ansible.builtin.group:
name: slurm
gid: 2002
state: present
- name: Add the slurm user with a UID of 2002
become: true
ansible.builtin.user:
name: slurm
uid: 2002
group: slurm
comment: "SLURM Workload Manager"
home: /var/lib/slurm
shell: /sbin/nologin
create_home: yes
state: present
- name: Create the required munge and slurm directories
become: true
ansible.builtin.file:
path: "{{ item }}"
state: directory
loop:
- /run/munge
- /etc/slurm
- /run/slurm
- /var/lib/slurm
- /var/log/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
- name: Change ownership of the munge directories to the munge user
become: true
ansible.builtin.file:
path: "{{ item }}"
owner: munge
group: munge
recurse: yes
loop:
- /etc/munge
- /var/log/munge
- /var/lib/munge
- /run/munge
- name: Change ownership of the slurm directories to the slurm user
become: true
ansible.builtin.file:
path: "{{ item }}"
owner: slurm
group: slurm
recurse: yes
loop:
- /etc/slurm
- /var/log/slurm
- /var/lib/slurm
- /run/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
- name: Set permissions of 0755 on the munge and slurm directories
become: true
ansible.builtin.file:
path: "{{ item }}"
mode: '0755'
loop:
- /run/munge
- /etc/slurm
- /var/log/slurm
- /var/lib/slurm
- /run/slurm
- /var/spool/slurmd
- /var/spool/slurmctld
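# cgroup.conf configures slurm's cgroup plugins: use the unified (v2)
# hierarchy and enable the available resource controllers for job tracking.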
- name: Create the /etc/slurm/cgroup.conf file on both nodes
become: true
ansible.builtin.file:
path: /etc/slurm/cgroup.conf
state: touch
- name: Set EnableControllers and cgroups v2 in /etc/slurm/cgroup.conf
become: true
ansible.builtin.lineinfile:
path: /etc/slurm/cgroup.conf
line: "{{ item }}"
loop:
- "CgroupPlugin=cgroup/v2"
- "EnableControllers=yes"
- name: Enable and start the munge service on both nodes
become: true
ansible.builtin.systemd:
name: munge
enabled: yes
state: started
- name: Enable and start the slurmctld service on the Controller Node
become: true
ansible.builtin.systemd:
name: slurmctld
enabled: yes
state: started
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Enable and start the slurmd service on the Compute Node
become: true
ansible.builtin.systemd:
name: slurmd
enabled: yes
state: started
when: inventory_hostname == 'rocky-linux10-slurm-compute-node'
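# slurmd can take a few seconds to register with slurmctld after starting;
# without this pause, the test srun below may fail because the node is not
# yet known to the controller.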
- name: Wait for slurmd to register with controller
ansible.builtin.pause:
seconds: 10
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Run a test srun command on the Controller Node, which should return the hostname of the Compute Node
ansible.builtin.shell: 'srun -c 1 -n 1 -J crunchy "/bin/hostname"'
register: srun_hostname_output_from_compute_node
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
- name: Print the output of the variable to stdout
ansible.builtin.debug:
msg: "{{ srun_hostname_output_from_compute_node }}"
when: inventory_hostname == 'rocky-linux10-slurm-controller-node'
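If the final debug task prints the compute node's hostname, the cluster works end to end. You can also verify the deployment manually from the controller node. The commands below are a minimal check, assuming you run them as a user permitted to submit jobs:

# List partitions and node states; the compute node should report "idle".
sinfo

# Show the compute node's registration details.
scontrol show node rocky-linux10-slurm-compute-node

# Launch a one-task job; it should print the compute node's hostname.
srun -N 1 -n 1 /bin/hostname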
Conclusion¶
Now that your slurm cluster is running on Rocky Linux, there are many directions you can take it. Here are a few practical options for your home lab:
- Set up slurm accounting[1], so you can track the submission of every job and the resources it consumed.
- Deploy Grafana[3] on your slurm cluster for visual confirmation of GPU utilization, memory usage, running jobs, and more.
- Use Apptainer[2] to pull container images and submit slurm jobs that run the containerized applications (a minimal batch script sketch follows this list).
- For an added challenge, try running a Minecraft or similar game server on your slurm cluster.
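As a first step toward real batch workloads, here is a minimal sketch of a batch script; the file name hello.sh, the job name, and the output pattern are illustrative assumptions, not part of the deployment above:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=hello-%j.out

# Run one task on the allocated node and record its hostname.
srun /bin/hostname

Submit it with "sbatch hello.sh", watch it with "squeue", and read the result from hello-<jobid>.out once the job completes.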
The possibilities are nearly endless: slurm is used throughout HPC and data science.
References¶
- "Accounting and Resource Limits" by SchedMD https://slurm.schedmd.com/accounting.html
- "Batch Scheduler / Slurm" by the Apptainer Team https://apptainer.org/docs/user/main/mpi.html#batch-scheduler-slurm
- "GPU Monitoring with Grafana" by Sean Smith https://swsmith.cc/posts/grafana-slurm.html
- "munge Installation Guide" by Chris Dunlap (Dun) https://github.com/dun/munge/wiki/Installation-Guide
- "Notes on: Devel" by the Rocky Linux Team https://wiki.rockylinux.org/rocky/repo/#notes-on-devel
- "Slurm Configuration Tool" by SchedMD https://slurm.schedmd.com/configurator.html
- "Slurm Documentation" by SchedMD https://slurm.schedmd.com/documentation.html
- "[slurm-users] Why SlurmUser is set to slurm by default?" from the SchedMD mailing list https://lists.schedmd.com/pipermail/slurm-users/2018-May/001443.html
Author: Howard Van Der Wal
Contributors: Steven Spencer