INTERFACE DIFFERENCES BETWEEN V1 and V2
--------------------------------------------------------

Old Interface (v1)              New Interface (v2)
''''''''''''''''''              ''''''''''''''''''
memory.limit_in_bytes           memory.max
memory.soft_limit_in_bytes      memory.high
memory.memsw.limit_in_bytes     memory.swap.max (swap only, not memory+swap)
memory.swappiness               none
freezer.state                   cgroup.freeze
cpuset.cpus                     cpuset.cpus.effective and cpuset.cpus
cpuset.mems                     cpuset.mems.effective and cpuset.mems
cpuacct.stat                    cpu.stat
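
As a quick reference, the mapping above can be expressed as a small lookup. This
is only a sketch: the function name v2_interface is hypothetical and not part of
Slurm, and the v1 names are the ones used by the kernel.

```shell
# Hypothetical helper: print the v2 file that replaces a given v1 file.
# Returns 1 when there is no v2 counterpart (e.g. memory.swappiness).
v2_interface() {
    case "$1" in
        memory.limit_in_bytes)       echo "memory.max" ;;
        memory.soft_limit_in_bytes)  echo "memory.high" ;;
        memory.memsw.limit_in_bytes) echo "memory.swap.max" ;;
        freezer.state)               echo "cgroup.freeze" ;;
        cpuset.cpus)                 echo "cpuset.cpus" ;;
        cpuset.mems)                 echo "cpuset.mems" ;;
        cpuacct.stat)                echo "cpu.stat" ;;
        *)                           return 1 ;;
    esac
}
```

For example, "v2_interface freezer.state" prints cgroup.freeze.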

OTHER CHANGES
-------------
- There is no separate freezer controller; freezing is always available
  through the cgroup.freeze interface.
- The devices controller is now implemented as an eBPF program.
- Slurm now always works at task level, even if no jobacctgather/cgroup plugin
  is set. Adding a pid to a step adds it to a special task directory,
  "task_special".
- The memory.stat file has changed. We now compute the sum of
  anon+swapcached+anon_thp, which should be equivalent to the rss concept in
  v1. There are other possibilities that can be the subject of future study,
  like the use of memory.current or PSI metrics.
- The cpu.stat interface provides metrics in microseconds (usage_usec,
  user_usec, system_usec), while cpuacct.stat provided metrics in USER_HZ.
  New logic in commit:
  "Recognize different cpu accounting units"
- Each slurmstepd daemon is put into its own cgroup, but it is not constrained
  by the step limits. Its memory usage is accounted globally as part of the
  job, not as part of the step. A slurmstepd can sometimes grow in memory, for
  example during PMI initialization when the user starts many ranks in an MPI
  job and the MPI stack consumes memory. There are pros and cons to accounting
  the slurmstepd consumption as part of the step versus as part of the job.
  On one hand, it seems reasonable to account it as part of the job because the
  profiling will be more consistent regardless of changes in slurmstepd.
  On the other hand, an uncontrolled slurmstepd can terminate the whole job
  instead of only the step. In both cases, if user processes consume too much
  memory and cause the OOM killer to act in this cgroup, it can also kill
  slurmstepd, which would prevent it from doing the proper cleanup of the job.
  As in v1, this is worked around by setting oom_score_adj to -1000 to make
  slurmstepd unkillable. Using -999 instead, to make it killable but very
  unlikely to be chosen, is open for discussion.
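
The two accounting changes above (memory.stat and cpu.stat) can be illustrated
with a short sketch. The function names are hypothetical and this is not the
plugin's actual code; it assumes memory.stat uses the "key value" line format
and that cpu.stat times are in microseconds.

```shell
# Sum anon + swapcached + anon_thp (bytes) from a memory.stat file,
# as an approximation of the v1 rss concept.
# $1: path to a memory.stat file
rss_from_memory_stat() {
    awk '$1 == "anon" || $1 == "swapcached" || $1 == "anon_thp" { sum += $2 }
         END { printf "%.0f\n", sum }' "$1"
}

# Convert a cpu.stat time in microseconds to USER_HZ ticks, the unit
# cpuacct.stat used in v1.
# $1: time in microseconds, $2: optional USER_HZ value (default: system value)
usec_to_user_hz() {
    hz=${2:-$(getconf CLK_TCK)}
    echo $(( $1 * hz / 1000000 ))
}
```

With USER_HZ equal to 100, 2500000 microseconds correspond to 250 ticks.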

HOW TO START SLURM
------------------

You can start slurmctld and slurmdbd as usual.

slurmd needs to be started through systemd. The cgroup v2 API has changed and
follows a single-writer model, where control of the tree belongs to a single
writer, which is PID 1, the systemd pid. For other pids to be able to manage
parts of the tree, they must be started by systemd itself in a unit set to
Delegate=yes. systemd then won't touch that subtree.

A more important reason is that if we start slurmd from, e.g., a terminal,
slurmd itself will reside in the same cgroup as that terminal, for example
on Gnome:

/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-fbd45fd9-2c1f-4dce-b177-da71ad46be3b.scope

Then, if slurmd tries to modify the cgroup hierarchy under this tree, it will
find pids other than its own in cgroup.procs, and all changes will affect these
other pids and vice versa. For example, creating a subdirectory, attaching
itself to this new directory and then changing the subtree_control of the
parent will fail, because subtree_control cannot be set while there are
processes at that level. slurmd would then have to decide what to do with these
unrelated pids (gnome-terminal or whatever), which is not its responsibility.
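
A minimal sketch of how to spot this situation: list the pids that share a
cgroup with a given process. The function name foreign_pids is hypothetical;
the only assumption is that cgroup.procs holds one pid per line.

```shell
# Print the pids found in a cgroup.procs file that are not our own.
# Duplicates are removed, since cgroup.procs may list a pid more than once.
# $1: path to a cgroup.procs file, $2: our own pid
foreign_pids() {
    awk -v self="$2" '$1 != self { print $1 }' "$1" | sort -un
}
```

If this prints anything for slurmd's own cgroup, slurmd is sharing the cgroup
with unrelated processes and should not manage it.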

Another problem can arise if CoreSpec* or MemSpec* are set: slurmd will then
apply these constraints to its own cgroup, also affecting the unrelated pids.

One way to work around this would be for slurmd's cgroup v2 plugin to create a
new cgroup and attach itself there. For this it needs to talk to systemd to
unregister its pid from the current unit; otherwise systemd's accounting will
not match reality, with unforeseeable consequences such as systemd reclaiming
this pid.

It is also important to set proper limits for slurmd. Especially with cgroup
v2, a MEMLOCK limit that is too low can prevent the eBPF program for devices
from being loaded into the kernel.
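
To verify the limit of a running slurmd, one can inspect its limits file. The
sketch below assumes the usual /proc/<pid>/limits layout; the function name
memlock_soft_limit is hypothetical.

```shell
# Print the soft "Max locked memory" limit from a limits file.
# The result is a number of bytes or the word "unlimited".
# $1: a file in /proc/<pid>/limits format
memlock_soft_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}
```

For a slurmd started with LimitMEMLOCK=infinity, running it against
/proc/<slurmd pid>/limits should print "unlimited".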

Here is an example unit file for one node called "gamba1".

 ]$ systemctl cat slurmd-master-gamba1.service
 # /usr/lib/systemd/system/slurmd-master-gamba1.service
 [Unit]
 Description=Slurm node daemon
 After=munge.service network.target remote-fs.target

 [Service]
 Type=simple
 EnvironmentFile=-/etc/sysconfig/slurmd
 ExecStart=/home/lipi/slurm/master/inst/sbin/slurmd -D -s -N gamba1 $SLURMD_OPTIONS
 ExecReload=/bin/kill -HUP $MAINPID
 KillMode=process
 LimitNOFILE=131072
 LimitMEMLOCK=infinity
 LimitSTACK=infinity
 Delegate=yes
 TasksMax=infinity

 [Install]
 WantedBy=multi-user.target

.-.-.-.-
For testing one can create a template unit file:

]$ cat ../slurmd-master@.service
[Unit]
Description=Slurm node daemon %i
After=munge.service network.target remote-fs.target

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/home/lipi/slurm/master/inst/sbin/slurmd -D -s -N %i $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

.-.-.-
It can then be started with systemctl start slurmd-master@gamba1. The instance
name after the "@" replaces %i, so it is also the node name passed to slurmd.

A more convenient method is to start slurmd directly with systemd-run. The
interaction can look like this example:

]# systemd-run -G -p Delegate=yes -p LimitMEMLOCK=infinity .... -u sgamba1 slurmd -D -s -N gamba1
]# systemctl status sgamba1
]# systemctl stop sgamba1

DEVELOPER NOTES
---------------
- systemd supports three kinds of cgroup hierarchy setups, namely legacy,
  hybrid and unified.

- All controllers which support v2 and are not bound to a v1 hierarchy are
  automatically bound to the v2 hierarchy and show up at the root. Controllers
  which are not in active use in the v2 hierarchy can be bound to other
  hierarchies. This allows mixing the v2 hierarchy with the legacy v1 multiple
  hierarchies in a fully backward compatible way. In any case we are not going
  to support hybrid (mixed) hierarchies because they have no future, are not
  supported by much software, and are simply not convenient.

- cgroup.procs contains the list of pids for this cgroup, but is not 100%
  reliable for a pid count:

  The same PID may show up more than once if the process got moved to another
  cgroup and then back or the PID got recycled while reading.

- "/proc/$PID/cgroup" lists a process's cgroup membership. In v1 this contains
  multiple lines, one for each hierarchy. In v2 the entry always has the
  format "0::$PATH".

- In v2, controllers which support thread mode are called threaded controllers.
  The ones which don't are called domain controllers. Threaded controllers allow
  thread granularity. See the "cgroup.type" file.

- Each non-root cgroup has a "cgroup.events" file which contains a "populated"
  field indicating whether the cgroup's sub-hierarchy has live processes in it.
  poll and [id]notify events are triggered when the value changes.
  This can be used, for example, to start a clean-up operation after all
  processes of a given sub-hierarchy have exited.

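- Unlike v1, controllers are not selected through mount options of the cgroup2
  mountpoint. The few mount options that exist, like "nsdelegate", only change
  behavior.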

- No controller is enabled by default. Controllers can be enabled and disabled
  by writing to the "cgroup.subtree_control" file:

   echo "+cpu +memory -io" > cgroup.subtree_control

- Only controllers which are listed in "cgroup.controllers" can be enabled.

- Non-root cgroups can distribute domain resources to their children only when
  they don't have any processes of their own. So, in step_0->task_0, step_0
  cannot hold processes; only the leaf task_0 can.

  Note that the restriction doesn't get in the way if there is no enabled
  controller in the cgroup's "cgroup.subtree_control". This is important as
  otherwise it wouldn't be possible to create children of a populated cgroup.
  To control resource distribution of a cgroup, the cgroup must create children
  and transfer all its processes to the children before enabling controllers in
  its "cgroup.subtree_control" file.

  This also means that a cgroup which has controllers enabled in
  "cgroup.subtree_control" is not intended to be a *leaf*, so it cannot host
  processes. Trying to attach a pid to its cgroup.procs fails with EBUSY.

- A cgroup can be delegated in two ways.

  First, to a less privileged user by granting write access of the directory and
  its "cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" files.

  Second, if the "nsdelegate" mount option is set, automatically to a cgroup
  namespace on namespace creation.

  A delegated sub-hierarchy is contained in the sense that processes can't be
  moved into or out of the sub-hierarchy by the delegatee.

- cgroup v2 doesn't have a devices controller; device access control is
  implemented with eBPF programs instead. This needs privileges (root,
  privileged containers) and is available since kernel 4.15.

  eBPF stands for extended Berkeley Packet Filter.

  https://speakerdeck.com/kentatada/cgroup-v2-internals?slide=5
  https://medium.com/nttlabs/cgroup-v2-596d035be4d7

- https://systemd.io/CGROUP_DELEGATION/

- To have the OOM killer kill an entire step, set memory.oom.group = 1 in the
  step cgroup. This is an option for the future if we want to implement this
  possibility.

- Enabling the cpu controller can fail:

   echo "+cpu" > cgroup.subtree_control
   bash: cgroup.subtree_control: Access Denied.

  A possible cause is the presence of realtime processes, see
  https://www.kernel.org/doc/Documentation/cgroup-v2.txt:

    WARNING: cgroup2 doesn't yet support control of realtime processes and
    the cpu controller can only be enabled when all RT processes are in
    the root cgroup. Be aware that system management software may already
    have placed RT processes into nonroot cgroups during the system boot
    process, and these processes may need to be moved to the root cgroup
    before the cpu controller can be enabled.

    Use this to see the running RT tasks (RR is SCHED_RR, FF is SCHED_FIFO):
    ps ax -L -o 'pid tid cls rtprio comm' | grep -E 'RR|FF'
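
Two of the notes above, the "0::$PATH" format of /proc/$PID/cgroup and the
"populated" field of cgroup.events, can be sketched with the helpers below. The
function names are hypothetical and the code only parses the file formats
described in the notes.

```shell
# Print the v2 cgroup path of a process.
# $1: a file in /proc/<pid>/cgroup format; the v2 entry is "0::$PATH"
cgroup_v2_path() {
    sed -n 's/^0:://p' "$1"
}

# Print the "populated" value (0 or 1) of a cgroup.
# $1: path to a cgroup.events file
cgroup_populated() {
    awk '$1 == "populated" { print $2 }' "$1"
}
```

A clean-up routine could, for instance, wait until cgroup_populated prints 0
for a step cgroup before removing its directories.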


MANUAL EXAMPLES
----------------

1. Check the available controllers in the root:

 cat /sys/fs/cgroup/cgroup.controllers
 cpuset cpu io memory hugetlb pids

2. Check subtree_control in the root and enable the required controllers:

 cat /sys/fs/cgroup/cgroup.subtree_control
 memory pids
 echo "+cpuset +cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
 cat /sys/fs/cgroup/cgroup.subtree_control
 cpuset cpu memory pids

3. Create slurm dir and enable subtree_control too:

 mkdir /sys/fs/cgroup/slurm
 cat /sys/fs/cgroup/slurm/cgroup.subtree_control
 <empty>
 echo "+cpuset +cpu +memory +pids" > /sys/fs/cgroup/slurm/cgroup.subtree_control
 cat /sys/fs/cgroup/slurm/cgroup.subtree_control
 cpuset cpu memory pids
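
The manual steps above can be wrapped in a small helper. This is only a sketch
under the assumptions already stated in this document: a controller can be
enabled only if it is listed in cgroup.controllers, and enabling is done by
writing "+name" entries to cgroup.subtree_control. The function name
enable_controllers is hypothetical.

```shell
# Enable the given controllers for the children of a cgroup.
# $1: cgroup directory, remaining arguments: controller names
enable_controllers() {
    cg=$1
    shift
    req=""
    for c in "$@"; do
        # Only controllers listed in cgroup.controllers can be enabled.
        if ! tr ' ' '\n' < "$cg/cgroup.controllers" | grep -qxF "$c"; then
            echo "controller '$c' is not available in $cg" >&2
            return 1
        fi
        req="$req +$c"
    done
    # Write all requests at once, e.g. "+cpuset +cpu +memory +pids".
    echo "${req# }" > "$cg/cgroup.subtree_control"
}
```

As root, steps 2 and 3 would then become
"enable_controllers /sys/fs/cgroup cpuset cpu memory pids" followed by the same
call on /sys/fs/cgroup/slurm after creating that directory.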
