CPU Management User and Administrator Guide
Overview

The purpose of this guide is to assist Slurm users and administrators in selecting configuration options and composing command lines to manage the use of CPU resources by jobs, steps and tasks. The document is divided into the following sections:

Overview
CPU Management Steps performed by Slurm
Getting Information about CPU usage by Jobs/Steps/Tasks
CPU Management and Slurm Accounting
CPU Management Examples

CPU Management through user commands is constrained by the configuration parameters chosen by the Slurm administrator. The interactions between different CPU management options are complex and often difficult to predict. Some experimentation may be required to discover the exact combination of options needed to produce a desired outcome. Users and administrators should refer to the man pages for slurm.conf, cgroup.conf, salloc, sbatch and srun for detailed explanations of each option. The following HTML documents may also be useful:

Consumable Resources in Slurm
Sharing Consumable Resources
Support for Multi-core/Multi-thread Architectures
Plane distribution

This document describes Slurm CPU management for conventional Linux clusters only. For information on Cray ALPS systems, please refer to the appropriate documents.
CPU Management Steps performed by Slurm

Slurm uses four basic steps to manage CPU resources for a job/step:

STEP 1: SELECTION OF NODES

In Step 1, Slurm selects the set of nodes from which CPU resources are to be allocated to a job or job step. Node selection is therefore influenced by many of the configuration and command line options that control the allocation of CPUs (Step 2 below). If SelectType=select/linear is configured, all resources on the selected nodes will be allocated to the job/step. If SelectType is configured to be select/cons_res or select/cons_tres, individual sockets, cores and threads may be allocated from the selected nodes as consumable resources. The consumable resource type is defined by SelectTypeParameters.

Step 1 is performed by slurmctld and the select plugin.
slurm.conf parameters:

NodeName
    Possible values: <name of the node>, plus additional parameters. See man page for details.
    Description: Defines a node. This includes the number and layout of boards, sockets, cores, threads and processors (logical CPUs) on the node.

PartitionName
    Possible values: <name of the partition>, plus additional parameters. See man page for details.
    Description: Defines a partition. Several parameters of the partition definition affect the selection of nodes (e.g., Nodes, OverSubscribe, MaxNodes).

SlurmdParameters
    Possible values: config_overrides
    Description: Controls how the information in a node definition is used.

SelectType
    Possible values: select/linear | select/cons_res | select/cons_tres
    Description: Controls whether CPU resources are allocated to jobs and job steps in units of whole nodes or as consumable resources (sockets, cores or threads).

SelectTypeParameters
    Possible values: CR_CPU | CR_CPU_Memory | CR_Core | CR_Core_Memory | CR_Socket | CR_Socket_Memory, plus additional options. See man page for details.
    Description: Defines the consumable resource type and controls other aspects of CPU resource allocation by the select plugin.
Command line options (srun/salloc/sbatch):

-B, --extra-node-info
    Possible values: <sockets[:cores[:threads]]>
    Description: Restricts node selection to nodes with a specified layout of sockets, cores and threads.

-C, --constraint
    Possible values: <list>
    Description: Restricts node selection to nodes with specified attributes.

--contiguous
    Possible values: N/A
    Description: Restricts node selection to contiguous nodes.

--cores-per-socket
    Possible values: <cores>
    Description: Restricts node selection to nodes with at least the specified number of cores per socket.

-c, --cpus-per-task
    Possible values: <ncpus>
    Description: Controls the number of CPUs allocated per task.

--exclusive
    Possible values: N/A
    Description: Prevents sharing of allocated nodes with other jobs. Suballocates CPUs to job steps.

-F, --nodefile
    Possible values: <node file>
    Description: File containing a list of specific nodes to be selected for the job (salloc and sbatch only).

--hint
    Possible values: compute_bound | memory_bound | [no]multithread
    Description: Additional controls on allocation of CPU resources.

--mincpus
    Possible values: <n>
    Description: Controls the minimum number of CPUs allocated per node.

-N, --nodes
    Possible values: <minnodes[-maxnodes]>
    Description: Controls the minimum/maximum number of nodes allocated to the job.

-n, --ntasks
    Possible values: <number>
    Description: Controls the number of tasks to be created for the job.

--ntasks-per-core
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated core.

--ntasks-per-socket
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated socket.

--ntasks-per-node
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated node.

-O, --overcommit
    Possible values: N/A
    Description: Allows fewer CPUs to be allocated than the number of tasks.

-p, --partition
    Possible values: <partition_names>
    Description: Controls which partition is used for the job.

-s, --oversubscribe
    Possible values: N/A
    Description: Allows sharing of allocated nodes with other jobs.

--sockets-per-node
    Possible values: <sockets>
    Description: Restricts node selection to nodes with at least the specified number of sockets.

--threads-per-core
    Possible values: <threads>
    Description: Restricts node selection to nodes with at least the specified number of threads per core.

-w, --nodelist
    Possible values: <host1,host2,... or filename>
    Description: List of specific nodes to be allocated to the job.

-x, --exclude
    Possible values: <host1,host2,... or filename>
    Description: List of specific nodes to be excluded from allocation to the job.

-Z, --no-allocate
    Possible values: N/A
    Description: Bypass normal allocation (privileged option available to users "SlurmUser" and "root" only).
STEP 2: ALLOCATION OF CPUS FROM THE SELECTED NODES

In Step 2, Slurm allocates CPU resources to a job/step from the set of nodes selected in Step 1. CPU allocation is therefore influenced by the configuration and command line options that relate to node selection. If SelectType=select/linear is configured, all resources on the selected nodes will be allocated to the job/step. If SelectType is configured to be select/cons_res or select/cons_tres, individual sockets, cores and threads may be allocated from the selected nodes as consumable resources. The consumable resource type is defined by SelectTypeParameters.

When using a SelectType of select/cons_res or select/cons_tres, the default allocation method across nodes is block allocation (allocate all available CPUs in a node before using another node). The default allocation method within a node is cyclic allocation (allocate available CPUs in a round-robin fashion across the sockets within a node). Users may override the default behavior using the appropriate command line options described below; a short example follows this paragraph. The choice of allocation methods may influence which specific CPUs are allocated to the job/step.

Step 2 is performed by slurmctld and the select plugin.
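For instance, a job that wants block allocation both across and within nodes can request it explicitly with the -m/--distribution option; the distribution given after the ":" is the one that overrides the within-node default (an illustrative command, assuming a consumable-resource SelectType):

srun --ntasks=8 --distribution=block:block ...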
slurm.conf parameters:

NodeName
    Possible values: <name of the node>, plus additional parameters. See man page for details.
    Description: Defines a node. This includes the number and layout of boards, sockets, cores, threads and processors (logical CPUs) on the node.

PartitionName
    Possible values: <name of the partition>, plus additional parameters. See man page for details.
    Description: Defines a partition. Several parameters of the partition definition affect the allocation of CPU resources to jobs (e.g., Nodes, OverSubscribe, MaxNodes).

SlurmdParameters
    Possible values: config_overrides
    Description: Controls how the information in a node definition is used.

SelectType
    Possible values: select/linear | select/cons_res | select/cons_tres
    Description: Controls whether CPU resources are allocated to jobs and job steps in units of whole nodes or as consumable resources (sockets, cores or threads).

SelectTypeParameters
    Possible values: CR_CPU | CR_CPU_Memory | CR_Core | CR_Core_Memory | CR_Socket | CR_Socket_Memory, plus additional options. See man page for details.
    Description: Defines the consumable resource type and controls other aspects of CPU resource allocation by the select plugin.
Command line options (srun/salloc/sbatch):

-B, --extra-node-info
    Possible values: <sockets[:cores[:threads]]>
    Description: Restricts node selection to nodes with a specified layout of sockets, cores and threads.

-C, --constraint
    Possible values: <list>
    Description: Restricts node selection to nodes with specified attributes.

--contiguous
    Possible values: N/A
    Description: Restricts node selection to contiguous nodes.

--cores-per-socket
    Possible values: <cores>
    Description: Restricts node selection to nodes with at least the specified number of cores per socket.

-c, --cpus-per-task
    Possible values: <ncpus>
    Description: Controls the number of CPUs allocated per task.

-m, --distribution
    Possible values: block|cyclic|arbitrary|plane=<options>[:block|cyclic]
    Description: The second specified distribution (after the ":") can be used to override the default allocation method within nodes.

--exclusive
    Possible values: N/A
    Description: Prevents sharing of allocated nodes with other jobs.

-F, --nodefile
    Possible values: <node file>
    Description: File containing a list of specific nodes to be selected for the job (salloc and sbatch only).

--hint
    Possible values: compute_bound | memory_bound | [no]multithread
    Description: Additional controls on allocation of CPU resources.

--mincpus
    Possible values: <n>
    Description: Controls the minimum number of CPUs allocated per node.

-N, --nodes
    Possible values: <minnodes[-maxnodes]>
    Description: Controls the minimum/maximum number of nodes allocated to the job.

-n, --ntasks
    Possible values: <number>
    Description: Controls the number of tasks to be created for the job.

--ntasks-per-core
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated core.

--ntasks-per-socket
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated socket.

--ntasks-per-node
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated node.

-O, --overcommit
    Possible values: N/A
    Description: Allows fewer CPUs to be allocated than the number of tasks.

-p, --partition
    Possible values: <partition_names>
    Description: Controls which partition is used for the job.

-s, --oversubscribe
    Possible values: N/A
    Description: Allows sharing of allocated nodes with other jobs.

--sockets-per-node
    Possible values: <sockets>
    Description: Restricts node selection to nodes with at least the specified number of sockets.

--threads-per-core
    Possible values: <threads>
    Description: Restricts node selection to nodes with at least the specified number of threads per core.

-w, --nodelist
    Possible values: <host1,host2,... or filename>
    Description: List of specific nodes to be allocated to the job.

-x, --exclude
    Possible values: <host1,host2,... or filename>
    Description: List of specific nodes to be excluded from allocation to the job.

-Z, --no-allocate
    Possible values: N/A
    Description: Bypass normal allocation (privileged option available to users "SlurmUser" and "root" only).
STEP 3: DISTRIBUTION OF TASKS TO THE SELECTED NODES

In Step 3, Slurm distributes tasks to the nodes that were selected for the job/step in Step 1. Each task is distributed to only one node, but more than one task may be distributed to each node. Unless overcommitment of CPUs to tasks is specified for the job, the number of tasks distributed to a node is constrained by the number of CPUs allocated on the node and the number of CPUs per task. If consumable resources is configured, or resource sharing is allowed, tasks from more than one job/step may run on the same node concurrently.

Step 3 is performed by slurmctld.
slurm.conf parameters:

MaxTasksPerNode
    Possible values: <number>
    Description: Controls the maximum number of tasks that a job step can spawn on a single node.
Command line options (srun/salloc/sbatch):

-m, --distribution
    Possible values: block|cyclic|arbitrary|plane=<options>[:block|cyclic]
    Description: The first specified distribution (before the ":") controls the sequence in which tasks are distributed to each of the selected nodes. Note that this option does not affect the number of tasks distributed to each node, but only the sequence of distribution.

--ntasks-per-core
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated core.

--ntasks-per-socket
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated socket.

--ntasks-per-node
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated node.

-r, --relative
    Possible values: N/A
    Description: Controls which node is used for a job step.
STEP 4: OPTIONAL DISTRIBUTION AND BINDING OF TASKS TO CPUS WITHIN A NODE

In optional Step 4, Slurm distributes and binds each task to a specified subset of the allocated CPUs on the node to which the task was distributed in Step 3. Different tasks distributed to the same node may be bound to the same subset of CPUs or to different subsets. This step is known as task affinity or task/CPU binding.

Step 4 is performed by slurmd and the task plugin.
slurm.conf parameters:

TaskPlugin
    Possible values: task/none | task/affinity | task/cgroup
    Description: Controls whether this step is enabled and which task plugin to use.
cgroup.conf parameters:

ConstrainCores
    Possible values: yes | no
    Description: Controls whether jobs are constrained to their allocated CPUs.
Command line options (srun/salloc/sbatch):

--cpu-bind
    Possible values: See man page.
    Description: Controls binding of tasks to CPUs (srun only).

--ntasks-per-core
    Possible values: <number>
    Description: Controls the maximum number of tasks per allocated core.

-m, --distribution
    Possible values: block|cyclic|arbitrary|plane=<options>[:block|cyclic]
    Description: The second specified distribution (after the ":") controls the sequence in which tasks are distributed to allocated CPUs within a node for binding of tasks to CPUs.
Additional Notes on CPU Management Steps

For consumable resources, it is important for users to understand the difference between cpu allocation (Step 2) and task affinity/binding (Step 4). Exclusive (unshared) allocation of CPUs as consumable resources limits the number of jobs/steps/tasks that can use a node concurrently. But it does not limit the set of CPUs on the node that each task distributed to the node can use. Unless some form of CPU/task binding is used (e.g., a task or spank plugin), all tasks distributed to a node can use all of the CPUs on the node, including CPUs not allocated to their job/step. This may have unexpected adverse effects on performance, since it allows one job to use CPUs allocated exclusively to another job. For this reason, it may not be advisable to configure consumable resources without also configuring task affinity. Note that task affinity can also be useful when select/linear (whole node allocation) is configured, to improve performance by restricting each task to a particular socket or other subset of CPU resources on a node.
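As a minimal sketch of that recommendation, a slurm.conf fragment pairing consumable resources with task affinity might look like this (the values shown are illustrative, not a definitive configuration):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity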
Getting Information about CPU usage by Jobs/Steps/Tasks

There is no easy way to generate a comprehensive set of CPU management information for a job/step (allocation, distribution and binding). However, several commands/options provide limited information about CPU usage.
Command/Option and the information it provides:

scontrol show job option: --details
    This option provides a list of the nodes selected for the job and the CPU ids allocated to the job on each node. Note that the CPU ids reported by this command are Slurm abstract CPU ids, not Linux/hardware CPU ids (as reported by, for example, /proc/cpuinfo).

Linux command: env
    Many Slurm environment variables provide information related to node and CPU usage:
    SLURM_JOB_CPUS_PER_NODE
    SLURM_CPUS_PER_TASK
    SLURM_CPU_BIND
    SLURM_DISTRIBUTION
    SLURM_JOB_NODELIST
    SLURM_TASKS_PER_NODE
    SLURM_STEP_NODELIST
    SLURM_STEP_NUM_NODES
    SLURM_STEP_NUM_TASKS
    SLURM_STEP_TASKS_PER_NODE
    SLURM_JOB_NUM_NODES
    SLURM_NTASKS
    SLURM_NPROCS
    SLURM_CPUS_ON_NODE
    SLURM_NODEID
    SLURMD_NODENAME

srun option: --cpu-bind=verbose
    This option provides a list of the CPU masks used by task affinity to bind tasks to CPUs. Note that the CPU ids represented by these masks are Linux/hardware CPU ids, not Slurm abstract CPU ids as reported by scontrol, etc.

srun/salloc/sbatch option: -l
    This option adds the task id as a prefix to each line of output from a task sent to stdout/stderr. This can be useful for distinguishing node-related and CPU-related information by task id for multi-task jobs/steps.

Linux command: cat /proc/<pid>/status | grep Cpus_allowed_list
    Given a task's pid (or "self" if the command is executed by the task itself), this command produces a list of the CPU ids bound to the task. This is the same information that is provided by --cpu-bind=verbose, but in a more readable format.
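For example, one way to combine these tools and print each task's binding from inside a job step (a sketch; the reported CPU ids depend on the actual allocation and on task affinity being configured):

srun --ntasks=2 --cpu-bind=verbose,cores -l grep Cpus_allowed_list /proc/self/status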
A NOTE ON CPU NUMBERING

The number and layout of logical CPUs known to Slurm is described in the node definitions in slurm.conf. This may differ from the physical CPU layout on the actual hardware. For this reason, Slurm generates its own internal, or "abstract", CPU numbers. These numbers may not match the physical, or "machine", CPU numbers known to Linux.
CPU Management and Slurm Accounting

CPU management by Slurm users is subject to limits imposed by Slurm Accounting. Accounting limits may be applied on CPU usage at the level of users, groups and clusters. For details, see the sacctmgr man page.
CPU Management Examples

The following examples illustrate some scenarios for managing CPU resources using Slurm. Many additional scenarios are possible. In each example, it is assumed that all CPUs on each node are available for allocation.
EXAMPLE NODE AND PARTITION CONFIGURATION

For these examples, the Slurm cluster contains the following nodes:
Nodename                             n0    n1    n2    n3
Number of Sockets                    2     2     2     2
Number of Cores per Socket           4     4     4     4
Total Number of Cores                8     8     8     8
Number of Threads (CPUs) per Core    1     1     1     2
Total Number of CPUs                 8     8     8     16
And the following partitions:

PartitionName    regnodes    hypernode
Nodes            n0 n1 n2    n3
Default          YES         -
These entities are defined in slurm.conf as follows:
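A minimal set of definitions consistent with the tables above (site-specific parameters omitted; the OverSubscribe setting is assumed here because Example 8 below relies on it):

NodeName=n[0-2] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1
NodeName=n3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
PartitionName=regnodes Nodes=n[0-2] Default=YES OverSubscribe=YES State=UP
PartitionName=hypernode Nodes=n3 State=UP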
These examples show the use of the cons_res select type plugin, but they could use the cons_tres plugin with the same effect.
EXAMPLE 1: ALLOCATION OF WHOLE NODES

Allocate a minimum of two whole nodes to a job.

slurm.conf options:
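SelectType=select/linear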
Command line:
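srun --nodes=2 ...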
Comments:

The SelectType=select/linear configuration option specifies allocation in units of whole nodes. The --nodes=2 srun option causes Slurm to allocate at least 2 nodes to the job.
EXAMPLE 2: SIMPLE ALLOCATION OF CORES AS CONSUMABLE RESOURCES

A job requires 6 CPUs (2 tasks and 3 CPUs per task with no overcommitment). Allocate the 6 CPUs as consumable resources from a single node in the default partition.

slurm.conf options:
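SelectType=select/cons_res
SelectTypeParameters=CR_Core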
Command line:
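srun --nodes=1-1 --ntasks=2 --cpus-per-task=3 ...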
Comments:

The SelectType configuration options define cores as consumable resources. The --nodes=1-1 srun option restricts the job to a single node. The following table shows a possible pattern of allocation for this job.
Nodename                    n0    n1    n2
Number of Allocated CPUs    6     0     0
Number of Tasks             2     0     0
EXAMPLE 3: CONSUMABLE RESOURCES WITH BALANCED ALLOCATION ACROSS NODES

A job requires 9 CPUs (3 tasks and 3 CPUs per task with no overcommitment). Allocate 3 CPUs from each of the 3 nodes in the default partition.

slurm.conf options:
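SelectType=select/cons_res
SelectTypeParameters=CR_Core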
Command line:
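srun --nodes=3-3 --ntasks=3 --cpus-per-task=3 ...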
Comments:

The options specify the following conditions for the job: 3 tasks, 3 unique CPUs per task, using exactly 3 nodes. To satisfy these conditions, Slurm must allocate 3 CPUs from each node. The following table shows the allocation for this job.
Nodename                    n0    n1    n2
Number of Allocated CPUs    3     3     3
Number of Tasks             1     1     1
EXAMPLE 4: CONSUMABLE RESOURCES WITH MINIMIZATION OF RESOURCE FRAGMENTATION

A job requires 12 CPUs (12 tasks and 1 CPU per task with no overcommitment). Allocate CPUs using the minimum number of nodes and the minimum number of sockets required for the job in order to minimize fragmentation of allocated/unallocated CPUs in the cluster.

slurm.conf options:
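SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK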
Command line:
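srun --ntasks=12 ...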
Comments:

The default allocation method across nodes is block. This minimizes the number of nodes used for the job. The configuration option CR_CORE_DEFAULT_DIST_BLOCK sets the default allocation method within a node to block. This minimizes the number of sockets used for the job within a node. The combination of these two methods causes Slurm to allocate the 12 CPUs using the minimum required number of nodes (2 nodes) and sockets (3 sockets). The following table shows a possible pattern of allocation for this job.
Nodename                    n0          n1          n2
Socket id                   0     1     0     1     0     1
Number of Allocated CPUs    4     4     4     0     0     0
Number of Tasks             8           4           0
EXAMPLE 5: CONSUMABLE RESOURCES WITH CYCLIC DISTRIBUTION OF TASKS TO NODES

A job requires 12 CPUs (6 tasks and 2 CPUs per task with no overcommitment). Allocate 6 CPUs each from 2 nodes in the default partition. Distribute tasks to nodes cyclically.

slurm.conf options:
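SelectType=select/cons_res
SelectTypeParameters=CR_Core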
Command line:
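srun --nodes=2-2 --ntasks=6 --cpus-per-task=2 --ntasks-per-node=3 --distribution=cyclic ...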
Comments:

The options specify the following conditions for the job: 6 tasks, 2 unique CPUs per task, using exactly 2 nodes, and with 3 tasks per node. To satisfy these conditions, Slurm must allocate 6 CPUs from each of the 2 nodes. The --distribution=cyclic option causes the tasks to be distributed to the nodes in a round-robin fashion. The following table shows a possible pattern of allocation and distribution for this job.
Nodename                                      n0       n1       n2
Number of Allocated CPUs                      6        6        0
Number of Tasks                               3        3        0
Distribution of Tasks to Nodes, by Task id    0 2 4    1 3 5    -
EXAMPLE 6: CONSUMABLE RESOURCES WITH DEFAULT ALLOCATION AND PLANE DISTRIBUTION OF TASKS TO NODES

A job requires 16 CPUs (8 tasks and 2 CPUs per task with no overcommitment). Use all 3 nodes in the default partition. Distribute tasks to each node in blocks of two in a round-robin fashion.

slurm.conf options:
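SelectType=select/cons_res
SelectTypeParameters=CR_Core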
Command line:
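srun --nodes=3-3 --ntasks=8 --cpus-per-task=2 --distribution=plane=2 ...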
Comments:

The options specify the following conditions for the job: 8 tasks, 2 unique CPUs per task, using all 3 nodes in the partition. To satisfy these conditions using the default allocation method across nodes (block), Slurm allocates 8 CPUs from the first node, 6 CPUs from the second node and 2 CPUs from the third node. The --distribution=plane=2 option causes Slurm to distribute tasks in blocks of two to each of the nodes in a round-robin fashion, subject to the number of CPUs allocated on each node. So, for example, only 1 task is distributed to the third node because only 2 CPUs were allocated on that node and each task requires 2 CPUs. The following table shows a possible pattern of allocation and distribution for this job.
Nodename                                      n0         n1       n2
Number of Allocated CPUs                      8          6        2
Number of Tasks                               4          3        1
Distribution of Tasks to Nodes, by Task id    0 1 5 6    2 3 7    4
EXAMPLE 7: CONSUMABLE RESOURCES WITH OVERCOMMITMENT OF CPUS TO TASKS

A job has 20 tasks. Run the job in a single node.

slurm.conf options:
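SelectType=select/cons_res
SelectTypeParameters=CR_Core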
Command line:
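srun --nodes=1-1 --ntasks=20 --overcommit ...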
Comments:

The --overcommit option allows the job to run in only one node by overcommitting CPUs to tasks. The following table shows a possible pattern of allocation and distribution for this job.
Nodename                                      n0        n1    n2
Number of Allocated CPUs                      8         0     0
Number of Tasks                               20        0     0
Distribution of Tasks to Nodes, by Task id    0 - 19    -     -
EXAMPLE 8: CONSUMABLE RESOURCES WITH RESOURCE SHARING BETWEEN JOBS

2 jobs each require 6 CPUs (6 tasks per job with no overcommitment). Run both jobs simultaneously in a single node.

slurm.conf options:
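SelectType=select/cons_res
SelectTypeParameters=CR_Core
PartitionName=regnodes Nodes=n[0-2] Default=YES OverSubscribe=YES State=UP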
Command line:
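srun --nodes=1-1 --nodelist=n0 --ntasks=6 --oversubscribe ...
(the same command line is used to submit both jobs)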
Comments:

The --nodes=1-1 and --nodelist=n0 srun options together restrict both jobs to node n0. The OverSubscribe=YES option in the partition definition plus the --oversubscribe srun option allows the two jobs to oversubscribe CPUs on the node.
EXAMPLE 9: CONSUMABLE RESOURCES ON MULTITHREADED NODE, ALLOCATING ONLY ONE THREAD PER CORE

A job requires 8 CPUs (8 tasks with no overcommitment). Run the job on node n3, allocating only one thread per core.

slurm.conf options:
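SelectType=select/cons_res
SelectTypeParameters=CR_CPU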
Command line:
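srun --partition=hypernode --ntasks=8 --hint=nomultithread ...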
Comments:

The CR_CPU configuration option enables the allocation of only one thread per core. The --hint=nomultithread srun option causes Slurm to allocate only one thread from each core to this job. The following table shows a possible pattern of allocation for this job.
Nodename                    n3
Socket id                   0                            1
Core id                     0      1      2      3       0      1      2      3
CPU ids                     0 1    2 3    4 5    6 7     8 9    10 11  12 13  14 15
Number of Allocated CPUs    4                            4
Allocated CPU ids           0 2 4 6                      8 10 12 14
EXAMPLE 10: CONSUMABLE RESOURCES WITH TASK AFFINITY AND CORE BINDING

A job requires 6 CPUs (6 tasks with no overcommitment). Run the job in a single node in the default partition. Apply core binding to each task.

slurm.conf options:
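SelectType=select/cons_res
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity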
Command line:
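srun --nodes=1-1 --ntasks=6 --cpu-bind=cores ...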
Comments:

Using the default allocation method within nodes (cyclic), Slurm allocates 3 CPUs on each socket of 1 node. Using the default distribution method within nodes (cyclic), Slurm distributes and binds each task to an allocated core in a round-robin fashion across the sockets. The following table shows a possible pattern of allocation, distribution and binding for this job. For example, task id 2 is bound to CPU id 1.
Nodename                    n0
Socket id                   0              1
Number of Allocated CPUs    3              3
Allocated CPU ids           0 1 2          4 5 6

Binding of Tasks to CPUs

CPU id     0    1    2    3    4    5    6    7
Task id    0    2    4    -    1    3    5    -
EXAMPLE 11: CONSUMABLE RESOURCES WITH TASK AFFINITY AND SOCKET BINDING, CASE 1

A job requires 6 CPUs (6 tasks with no overcommitment). Run the job in a single node in the default partition. Apply socket binding to each task.

slurm.conf options:
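SelectType=select/cons_res
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity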
Command line:
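srun --nodes=1-1 --ntasks=6 --cpu-bind=sockets ...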
Comments:

Using the default allocation method within nodes (cyclic), Slurm allocates 3 CPUs on each socket of 1 node. Using the default distribution method within nodes (cyclic), Slurm distributes and binds each task to all of the allocated CPUs in one socket in a round-robin fashion across the sockets. The following table shows a possible pattern of allocation, distribution and binding for this job. For example, task ids 1, 3 and 5 are all bound to CPU ids 4, 5 and 6.
Nodename                    n0
Socket id                   0            1
Number of Allocated CPUs    3            3
Allocated CPU ids           0 1 2        4 5 6

Binding of Tasks to CPUs

CPU ids     0 1 2    3    4 5 6    7
Task ids    0 2 4    -    1 3 5    -
EXAMPLE 12: CONSUMABLE RESOURCES WITH TASK AFFINITY AND SOCKET BINDING, CASE 2

A job requires 6 CPUs (2 tasks with 3 CPUs per task and no overcommitment). Run the job in a single node in the default partition. Allocate cores using the block allocation method. Distribute cores using the block distribution method. Apply socket binding to each task.

slurm.conf options:
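SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/affinity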
Command line:
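srun --nodes=1-1 --ntasks=2 --cpus-per-task=3 --distribution=block:block --cpu-bind=sockets ...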
Comments:

Using the block allocation method, Slurm allocates 4 CPUs on one socket and 2 CPUs on the other socket of one node. Using the block distribution method within nodes, Slurm distributes 3 CPUs to each task. Applying socket binding, Slurm binds each task to all allocated CPUs in all sockets in which the task has a distributed CPU. The following table shows a possible pattern of allocation, distribution and binding for this job. In this example, using the block allocation method, CPU ids 0-3 are allocated on socket id 0 and CPU ids 4-5 are allocated on socket id 1. Using the block distribution method, CPU ids 0-2 were distributed to task id 0, and CPU ids 3-5 were distributed to task id 1. Applying socket binding, task id 0 is therefore bound to the allocated CPUs on socket 0, and task id 1 is bound to the allocated CPUs on both sockets.
Nodename                    n0
Socket id                   0              1
Number of Allocated CPUs    4              2
Allocated CPU ids           0 1 2 3        4 5

Binding of Tasks to CPUs

CPU ids     0 1 2 3    4 5    6 7
Task ids    0 1        1      -
EXAMPLE 13: CONSUMABLE RESOURCES WITH TASK AFFINITY AND SOCKET BINDING, CASE 3

A job requires 6 CPUs (2 tasks with 3 CPUs per task and no overcommitment). Run the job in a single node in the default partition. Allocate cores using the block allocation method. Distribute cores using the cyclic distribution method. Apply socket binding to each task.

slurm.conf options:
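SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/affinity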
Command line:
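srun --nodes=1-1 --ntasks=2 --cpus-per-task=3 --distribution=block:cyclic --cpu-bind=sockets ...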
Comments:

Using the block allocation method, Slurm allocates 4 CPUs on one socket and 2 CPUs on the other socket of one node. Using the cyclic distribution method within nodes, Slurm distributes 3 CPUs to each task. Applying socket binding, Slurm binds each task to all allocated CPUs in all sockets in which the task has a distributed CPU. The following table shows a possible pattern of allocation, distribution and binding for this job. In this example, using the block allocation method, CPU ids 0-3 are allocated on socket id 0 and CPU ids 4-5 are allocated on socket id 1. Using the cyclic distribution method, CPU ids 0, 1 and 4 were distributed to task id 0, and CPU ids 2, 3 and 5 were distributed to task id 1. Applying socket binding, both tasks are therefore bound to the allocated CPUs on both sockets.
Nodename                    n0
Socket id                   0              1
Number of Allocated CPUs    4              2
Allocated CPU ids           0 1 2 3        4 5

Binding of Tasks to CPUs

CPU ids     0 1 2 3    4 5    6 7
Task ids    0 1        0 1    -
EXAMPLE 14: CONSUMABLE RESOURCES WITH TASK AFFINITY AND CUSTOMIZED ALLOCATION AND DISTRIBUTION

A job requires 18 CPUs (18 tasks with no overcommitment). Run the job in the default partition. Allocate 6 CPUs on each node using block allocation within nodes. Use cyclic distribution of tasks to nodes and block distribution of tasks for CPU binding.

slurm.conf options:
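SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/affinity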
Command line:
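srun --nodes=3-3 --ntasks=18 --ntasks-per-node=6 --distribution=cyclic:block ...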
Comments:

This example shows the use of task affinity with customized allocation of CPUs and distribution of tasks across nodes and within nodes for binding. The srun options specify the following conditions for the job: 18 tasks, 1 unique CPU per task, using all 3 nodes in the partition, with 6 tasks per node. The CR_CORE_DEFAULT_DIST_BLOCK configuration option specifies block allocation within nodes. To satisfy these conditions, Slurm allocates 6 CPUs on each node, with 4 CPUs allocated on one socket and 2 CPUs on the other socket. The --distribution=cyclic:block option specifies cyclic distribution of tasks to nodes and block distribution of tasks to CPUs within nodes for binding. The following table shows a possible pattern of allocation, distribution and binding for this job. For example, task id 10 is bound to CPU id 3 on node n1.
Nodename                                      n0                n1                n2
Socket id                                     0        1        0        1        0        1
Number of Allocated CPUs                      4        2        4        2        4        2
Allocated CPU ids                             0 1 2 3 4 5       0 1 2 3 4 5       0 1 2 3 4 5
Number of Tasks                               6                 6                 6
Distribution of Tasks to Nodes, by Task id    0 3 6 9 12 15     1 4 7 10 13 16    2 5 8 11 14 17

Binding of Tasks to CPUs

CPU id          0    1    2    3     4     5     6    7
Task id (n0)    0    3    6    9     12    15    -    -
Task id (n1)    1    4    7    10    13    16    -    -
Task id (n2)    2    5    8    11    14    17    -    -
EXAMPLE 15: CONSUMABLE RESOURCES WITH TASK AFFINITY TO OPTIMIZE THE PERFORMANCE OF A MULTI-TASK, MULTI-THREAD JOB

A job requires 9 CPUs (3 tasks and 3 CPUs per task with no overcommitment). Run the job in the default partition, managing the CPUs to optimize the performance of the job.

slurm.conf options:
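SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/affinity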
Command line:
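srun --ntasks=3 --cpus-per-task=3 --ntasks-per-node=1 --cpu-bind=cores ...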
Comments:

To optimize the performance of this job, the user wishes to allocate 3 CPUs from each of 3 sockets and bind each task to the 3 CPUs in a single socket. The SelectTypeParameters configuration option specifies a consumable resource type of cores and block allocation within nodes. The TaskPlugin configuration option enables task affinity. The srun options specify the following conditions for the job: 3 tasks, with 3 unique CPUs per task, with 1 task per node. To satisfy these conditions, Slurm allocates 3 CPUs from one socket in each of the 3 nodes in the default partition. The --cpu-bind=cores option causes Slurm to bind each task to the 3 allocated CPUs on the node to which it is distributed. The following table shows a possible pattern of allocation, distribution and binding for this job. For example, task id 2 is bound to CPU ids 0, 1 and 2 on socket id 0 of node n2.
Nodename                                      n0          n1          n2
Socket id                                     0     1     0     1     0     1
Number of Allocated CPUs                      3     0     3     0     3     0
Allocated CPU ids                             0 1 2       0 1 2       0 1 2
Number of Tasks                               1           1           1
Distribution of Tasks to Nodes, by Task id    0           1           2

Binding of Tasks to CPUs

CPU ids         0 1 2    3 4 5 6 7
Task id (n0)    0        -
Task id (n1)    1        -
Task id (n2)    2        -
EXAMPLE 16: CONSUMABLE RESOURCES WITH TASK CGROUP

A job requires 6 CPUs (6 tasks with no overcommitment). Run the job in a single node in the default partition.

slurm.conf options:
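SelectType=select/cons_res
SelectTypeParameters=CR_Core
TaskPlugin=task/cgroup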
cgroup.conf options:
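ConstrainCores=yes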
Command line:
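srun --nodes=1-1 --ntasks=6 ...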
Comments:

The task/cgroup plugin currently supports only the block method for allocating cores within nodes. Slurm distributes tasks to the cores, but without CPU binding each task has access to all of the allocated CPUs. The following table shows a possible pattern of allocation, distribution and binding for this job.
Nodename                    n0
Socket id                   0              1
Number of Allocated CPUs    4              2
Allocated CPU ids           0 1 2 3        4 5

Binding of Tasks to CPUs

CPU ids     0 1 2 3    4 5    6 7
Task ids    0-5        0-5    -
The task/cgroup plugin does not bind tasks to CPUs. To bind tasks to CPUs and for access to all task distribution options, the task/affinity plugin can be used with the task/cgroup plugin:
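TaskPlugin=task/affinity,task/cgroup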
Last modified 16 March 2022