Quick Start User Guide

Overview

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Architecture

As depicted in Figure 1, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, sacctmgr, salloc, sattach, sbatch, sbcast, scancel, scontrol, scrontab, sdiag, sh5util, sinfo, sprio, squeue, sreport, srun, sshare, sstat, strigger and sview. All of the commands can run anywhere in the cluster.

The entities managed by these Slurm daemons, shown in Figure 2, include nodes (the compute resource in Slurm), partitions (which group nodes into logical, possibly overlapping, sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job). The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation.

Commands

Man pages exist for all Slurm daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive.

sacct is used to report job or job step accounting information about active or completed jobs.

salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to make effective use of diskless compute nodes or to provide improved performance relative to a shared file system.

scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

scontrol is the administrative tool used to view and/or modify Slurm state. Note that many scontrol commands can only be executed as user root.

sinfo reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.

sprio is used to display a detailed view of the components affecting a job's priority.

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (a given amount of memory or disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation.

sshare displays detailed information about fairshare usage on the cluster. Note that this is only viable when using the priority/multifactor plugin.

sstat is used to get information about the resources utilized by a running job or job step.

strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.

sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by Slurm.

Examples

First we determine what partitions exist on the system, what nodes they include, and general system state. This information is provided by the sinfo command. In the example below we find there are two partitions: debug and batch. The * following the name debug indicates this is the default partition for submitted jobs. We see that both partitions are in an UP state. Some configurations may include partitions for larger jobs that are DOWN except on weekends or at night. The information about each partition may be split over more than one line so that nodes in different states can be identified. In this case, the two nodes adev[1-2] are down. The * following the state down indicates that the nodes are not responding. Note the use of a concise expression for node name specification, with a common prefix adev and either numeric ranges or specific numbers identified. This format allows very large clusters to be easily managed. The sinfo command has many options to easily let you view the information of interest to you in whatever format you prefer. See the man page for more information.

adev0: sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug*       up      30:00     2  down* adev[1-2]
debug*       up      30:00     3   idle adev[3-5]
batch        up      30:00     3  down* adev[6,13,15]
batch        up      30:00     3  alloc adev[7-8,14]
batch        up      30:00     4   idle adev[9-12]

Next we determine what jobs exist on the system using the squeue command. The ST field is job state. Two jobs are in a running state (R is an abbreviation for Running) while one job is in a pending state (PD is an abbreviation for Pending). The TIME field shows how long the jobs have run, using the format days-hours:minutes:seconds. The NODELIST(REASON) field indicates where the job is running or the reason it is still pending. Typical reasons for pending jobs are Resources (waiting for resources to become available) and Priority (queued behind a higher priority job). The squeue command has many options to easily let you view the information of interest to you in whatever format you prefer. See the man page for more information.

adev0: squeue
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
65646     batch  chem  mike  R 24:19     2 adev[7-8]
65647     batch   bio  joan  R  0:09     1 adev14
65648     batch  math  phil PD  0:00     6 (Resources)

The scontrol command can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration. It can also be used by system administrators to make configuration changes. A couple of examples are shown below. See the man page for more information.

adev0: scontrol show partition
PartitionName=debug TotalNodes=5 TotalCPUs=40 RootOnly=NO
   Default=YES OverSubscribe=FORCE:4 PriorityTier=1 State=UP
   MaxTime=00:30:00 Hidden=NO
   MinNodes=1 MaxNodes=26 DisableRootJobs=NO AllowGroups=ALL
   Nodes=adev[1-5] NodeIndices=0-4

PartitionName=batch TotalNodes=10 TotalCPUs=80 RootOnly=NO
   Default=NO OverSubscribe=FORCE:4 PriorityTier=1 State=UP
   MaxTime=16:00:00 Hidden=NO
   MinNodes=1 MaxNodes=26 DisableRootJobs=NO AllowGroups=ALL
   Nodes=adev[6-15] NodeIndices=5-14


adev0: scontrol show node adev1
NodeName=adev1 State=DOWN* CPUs=8 AllocCPUs=0
   RealMemory=4000 TmpDisk=0
   Sockets=2 Cores=4 Threads=1 Weight=1 Features=intel
   Reason=Not responding [slurm@06/02-14:01:24]

adev0: scontrol show job
JobId=65672 UserId=phil(5136) GroupId=phil(5136)
   Name=math
   Priority=4294901603 Partition=batch BatchFlag=1
   AllocNode:Sid=adev0:16726 TimeLimit=00:10:00 ExitCode=0:0
   StartTime=06/02-15:27:11 EndTime=06/02-15:37:11
   JobState=PENDING NodeList=(null) NodeListIndices=
   NumCPUs=24 ReqNodes=1 ReqS:C:T=1-65535:1-65535:1-65535
   OverSubscribe=1 Contiguous=0 CPUs/task=0 Licenses=(null)
   MinCPUs=1 MinSockets=1 MinCores=1 MinThreads=1
   MinMemory=0 MinTmpDisk=0 Features=(null)
   Dependency=(null) Account=(null) Requeue=1
   Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=
   ExcNodeList=(null) ExcNodeListIndices=
   SubmitTime=06/02-15:27:11 SuspendTime=None PreSusTime=0
   Command=/home/phil/math
   WorkDir=/home/phil

It is possible to create a resource allocation and launch the tasks for a job step in a single command line using the srun command. Depending upon the MPI implementation used, MPI jobs may also be launched in this manner. See the MPI section for more MPI-specific information. In this example we execute /bin/hostname on three nodes (-N3) and include task numbers on the output (-l). The default partition will be used. One task per node will be used by default. Note that the srun command has many options available to control what resources are allocated and how tasks are distributed across those resources.

adev0: srun -N3 -l /bin/hostname
0: adev3
1: adev4
2: adev5

This variation on the previous example executes /bin/hostname in four tasks (-n4). One processor per task will be used by default (note that we don't specify a node count).

adev0: srun -n4 -l /bin/hostname
0: adev3
1: adev3
2: adev3
3: adev3

One common mode of operation is to submit a script for later execution. In this example the script name is my.script and we explicitly use the nodes adev9 and adev10 (-w "adev[9-10]", note the use of a node range expression). We also explicitly state that the subsequent job steps will spawn four tasks each, which will ensure that our allocation contains at least four processors (one processor per task to be launched). The output will appear in the file my.stdout ("-o my.stdout"). The script itself contains an embedded time limit for the job. Other options can be supplied as desired by using a prefix of "#SBATCH" followed by the option at the beginning of the script (before any commands to be executed in the script). Options supplied on the command line would override any options specified within the script. Note that my.script contains the command /bin/hostname, which is executed on the first node in the allocation (where the script runs), plus two job steps initiated using the srun command and executed sequentially.

adev0: cat my.script
#!/bin/sh
#SBATCH --time=1
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd

adev0: sbatch -n4 -w "adev[9-10]" -o my.stdout my.script
sbatch: Submitted batch job 469

adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette

The final mode of operation is to create a resource allocation and spawn job steps within that allocation. The salloc command is used to create a resource allocation and typically start a shell within that allocation. One or more job steps would typically be executed within that allocation using the srun command to launch the tasks (depending upon the type of MPI being used, the launch mechanism may differ; see MPI details below). Finally, the shell created by salloc would be terminated using the exit command. Slurm does not automatically migrate executable or data files to the nodes allocated to a job. The files must either exist on local disk or in some global file system (e.g. NFS or Lustre). We provide the tool sbcast to transfer files to local storage on allocated nodes using Slurm's hierarchical communications. In this example we use sbcast to transfer the executable program a.out to /tmp/joe.a.out on local storage of the allocated nodes. After executing the program, we delete it from local storage.

tux0: salloc -N1024 bash
salloc: Granted job allocation 471
$ sbcast a.out /tmp/joe.a.out
$ srun /tmp/joe.a.out
Result is 3.14159
$ srun rm /tmp/joe.a.out
$ exit
salloc: Relinquishing job allocation 471

In this example, we submit a batch job, get its status, and cancel it.

adev0: sbatch test
sbatch: Submitted batch job 473

adev0: squeue
JOBID PARTITION NAME USER ST TIME  NODES NODELIST(REASON)
  473 batch     test jill R  00:00 1     adev9

adev0: scancel 473

adev0: squeue
JOBID PARTITION NAME USER ST TIME  NODES NODELIST(REASON)

Best Practices, Large Job Counts

Consider putting related work into a single Slurm job with multiple job steps both for performance reasons and ease of management. Each Slurm job can contain a multitude of job steps and the overhead in Slurm for managing job steps is much lower than that of individual jobs.

Job arrays are an efficient mechanism of managing a collection of batch jobs with identical resource requirements. Most Slurm commands can manage job arrays either as individual elements (tasks) or as a single entity (e.g. delete an entire job array in a single command).

MPI

MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.

  1. Slurm directly launches the tasks and performs initialization of communications through the PMI2 or PMIx APIs. (Supported by most modern MPI implementations; a sketch follows this list.)

  2. Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (older versions of OpenMPI).

  3. Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH. These tasks are initiated outside of Slurm's monitoring or control. Slurm's epilog should be configured to purge these tasks when the job's allocation is relinquished. The use of pam_slurm_adopt is also strongly recommended.

Links to instructions for using several varieties of MPI with Slurm are provided below.

Last modified 29 June 2021
