Optimizing HPC Clusters

Enabling On-Demand BIOS Configuration Changes in HPC Clusters

Since around 2000, most high-performance computing (HPC) systems have been set up as clusters based on commodity x86 hardware. These clusters consist of one- or two-socket servers to perform the actual computations, plus storage systems and administrative nodes.

Modern Intel® Xeon® processor-based systems, as well as the Linux* kernel, provide many ways to optimize both hardware and operating system (OS) for a specific application. It’s easy to do if your cluster is only used for a specific workflow―but for more complex usages, it’s beyond the ability of most cluster managers.

There is a way to perform complex optimizations on a per-job basis. Intel tested this idea in its benchmarking data center, which has approximately 500 compute nodes. The cluster, known as Endeavor, is rebuilt on a regular basis with the latest hardware and has been listed among the TOP500 SuperComputer Sites since 2006. Figure 1 shows the layout of Endeavor.

Figure 1 – Layout of Intel’s Endeavor cluster

Users typically connect to one of several login nodes, package up their workloads in the form of job scripts, and submit those jobs to a cluster manager. The cluster manager (e.g., Altair PBS Pro*, Bright Cluster Manager*, IBM LSF*, or Slurm*) is a scheduling tool that tries to allocate the cluster resources as efficiently as possible. For the cluster manager, each job is simply a request to use X compute nodes for an amount of time Y.

When a user submits a job, the system checks to see if all necessary parameters are within sensible limits, and then waits until enough resources of the required type become free (Figure 2). Once the cluster manager can assign enough nodes, it runs a special program called the prologue (Figure 3). This program is usually executed on the first node (also called the headnode) assigned to the job. The purpose of the prologue varies, but it might be used to check that all nodes assigned to a job are in good health. Once the prologue successfully completes, the cluster manager starts the actual job on the nodes. In most cases, this is a shell script executed on the headnode. Once this script terminates in any way, the cluster manager cleans up the nodes, running an epilogue program and preparing the nodes for the next job.

 

Figure 2 – LSF job flow chart on Intel’s Endeavor cluster

Figure 3 – Prologue program
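
What such a health check looks like varies from site to site. As a generic illustration (an assumption, not the actual prologue shown in Figure 3), a minimal version could verify over passwordless ssh that every node of the job answers and has its scratch file system mounted. Here, $LSB_HOSTS is LSF's list of hosts assigned to the job, and /scratch is a hypothetical mount point:

#!/bin/sh
# minimal health-check prologue sketch (illustrative only)
for NODE in `echo $LSB_HOSTS | tr ' ' '\n' | sort -u`
do
 ssh -o ConnectTimeout=5 $NODE true \
  || { echo "prologue: $NODE unreachable" >&2; exit 1; }
 ssh $NODE grep -q /scratch /proc/mounts \
  || { echo "prologue: /scratch missing on $NODE" >&2; exit 1; }
done
exit 0 # a zero exit status tells the cluster manager to start the job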

This workflow, implemented by most cluster managers, faces a problem when it becomes necessary to reboot nodes as part of the prologue process. Rebooting might be necessary because:

  • The Intel® Xeon Phi™ processor is equipped with fast on-socket MCDRAM memory, usable either as a standard memory block or as a fourth-level cache. Using it as a cache will speed up programs automatically, but using it as standard memory might be even faster. Switching between those modes requires changing a BIOS option and rebooting the system.
  • The Intel® Xeon® processor uses a mesh to communicate among the cores, cache, memory, and the PCIe controller. The Sub NUMA Cluster BIOS option allows a user to reconfigure the CPU and split it into virtual sockets.
  • Modern Linux* kernels know the concept of NO_HZ cores. Normally, cores are interrupted 100 to 1,000 times per second to do kernel work and task switching. For typical HPC workloads, this behavior is counter-productive. With the NO_HZ parameter, one can configure the kernel to schedule only one interrupt per second. This decreases OS noise, increases scalability, and can improve performance, but it requires kernel changes and a system reboot (an example command line follows this list).
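
As an illustration, the relevant kernel boot parameters are nohz_full and rcu_nocbs; the CPU list below is only an example and would be chosen per node type, leaving a few cores for housekeeping work:

# example kernel command-line fragment: disable the regular tick and move
# RCU callbacks away from cores 2-67, keeping cores 0-1 for housekeeping
nohz_full=2-67 rcu_nocbs=2-67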

With the standard approach of prologue scripts, the requirement of a reboot leads to a dilemma. The moment the headnode reboots, the cluster manager will assume the prologue has terminated, stop preprocessing of the job, and then reschedule it. A new set of nodes will be allocated for the job, a new headnode will be selected, and the prologue will execute―with exactly the same results.

Finding a Solution

Intel’s HPC benchmarking cluster, Endeavor, currently uses IBM LSF*. The only simple and portable solution we could find was to add a master control node (MCN) to each job (Figure 4). This MCN would automatically become the headnode of a job. It would execute the prologue, use syscfg to make any changes to the BIOS configuration, and reboot the compute nodes using the Intelligent Platform Management Interface (IPMI). It would then wait for the nodes to come back up, check that everything is correct, and finish the prologue. If the prologue completes successfully, the job will start.

Figure 4 – Master control

Users would now see a small difference. Normally, they would request a number of nodes of the same type including the headnode. But for a reconfig-job, the headnode will be of a different type. The user now has two options:

  1. Log in to the first node in the nodelist, then continue to work as usual.
  2. Use an option in modern MPI versions to exclude the headnode from the list of nodes used for job processing (a sketch follows the list of advantages below).

The latter approach has some advantages:

  • Typically, the headnode of an MPI program has to start up additional processes―for example, an ssh process for every MPI process in the program. Since this additional load remains on the MCN, all compute nodes will have an identical system load.
  • Specific to Intel Xeon Phi processor-based clusters, if the MCN is a standard Intel Xeon processor-based server, startup scripts and MPI initialization will process faster than on an Intel Xeon Phi processor-based system.
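
A minimal sketch of that second option for a job script, assuming an MPICH-derived launcher such as Intel® MPI (which accepts -machinefile) and LSF's $LSB_HOSTS variable; the application name and process count are placeholders:

#!/bin/sh
# build a host file that leaves out the master control node, which is the
# node this job script runs on, then launch MPI only on the compute nodes
# (assumes `hostname` matches the names listed in $LSB_HOSTS)
ME=`hostname`
HOSTFILE=/tmp/hosts.$$
for H in `echo $LSB_HOSTS | tr ' ' '\n' | sort -u`
do
 [ "$H" = "$ME" ] || echo "$H"
done > $HOSTFILE
mpirun -machinefile $HOSTFILE -ppn 64 ./my_app
/bin/rm -f $HOSTFILE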

Implementation

On Endeavor, the prologue and epilogue not only check node health. The system also allows users to change parameters requiring root privileges. Since most of the code for the prologue and epilogue is identical, we use the same script in both cases, switching codepaths when necessary.

Prerequisites

Job Submission

We wanted to stay close to standard LSF syntax, so we wrote a small wrapper script around bsub, extending the command with a -l option. All special requests are translated into unique environment variables, since LSF transfers the environment of the user not only into the job, but also to the prologue and epilogue.

If necessary, the wrapper would automatically extend the resource requirements to include the control nodes:

$ bsub -R '2*{select[ekf]span[ptile=1]}' -l KNL_MEMMODE=1 run.sh
Warning '-l KNL_MEMMODE=1' will reboot compute nodes
Resource_List_KNL_MEMMODE=1
bsub.orig -R '1*{select[rebootctrl]} + 2*{select[ekf] span[ptile=1]}' run.sh
...

The user requests two nodes of the type ekf. The script sets the corresponding environment variable. The selection string is expanded to include the control node type. With those two changes, the original LSF binary (renamed bsub.orig) is called.
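
The wrapper itself is not listed in full in this article. A simplified sketch (an assumption of how such a wrapper might look, not Intel's production script), handling only -l and -R options placed before the remaining bsub arguments, could be:

#!/bin/sh
# sketch of a bsub wrapper: turn -l NAME=VALUE into Resource_List_* environment
# variables (LSF carries the environment into the job, prologue, and epilogue)
# and prepend a control-node request to the -R string before calling bsub.orig
# (assumes reconfiguration jobs also pass an explicit -R string, as above)
RSTRING=""
RECONF=no
while [ $# -gt 0 ]
do
 case "$1" in
  -l) echo "Warning '-l $2' will reboot compute nodes"
      export "Resource_List_${2%%=*}=${2#*=}"
      echo "Resource_List_${2%%=*}=${2#*=}"
      RECONF=yes
      shift 2 ;;
  -R) RSTRING="$2"
      shift 2 ;;
  *)  break ;;   # remaining arguments: other bsub options and the job script
 esac
done
if [ "$RECONF" = "yes" ]
then
 RSTRING="1*{select[rebootctrl]} + $RSTRING"
fi
if [ -n "$RSTRING" ]
then
 exec bsub.orig -R "$RSTRING" "$@"
fi
exec bsub.orig "$@"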

Remotely Booting Nodes

To reboot nodes over the network, we use IPMI.
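
How the reset is issued depends on the baseboard management controller (BMC) setup; with the widely available ipmitool utility, a reset and a status check might look like this (host name and credentials are placeholders):

# power-cycle a compute node through its BMC and check that it is back on
ipmitool -I lanplus -H node001-bmc -U admin -P secret chassis power reset
ipmitool -I lanplus -H node001-bmc -U admin -P secret chassis power status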

Booting Cluster Nodes via Preboot Execution Environment (PXE)

PXE is already used in most clusters. For reconfiguration purposes, we made use of the way the pxelinux.0 binary queries the tftp server for boot configuration files (Figure 5).

Figure 5 – Preboot execution environment

The first query is for a file named after the MAC address (01-00-1e-67-94-a0-8f). The second query is for a file named after the IP address assigned by the DHCP server coded in hexadecimal (2465220A). We use the second query for default boots and can therefore―temporarily―create a suitable file of the first type to override boot configurations for a specific job.

For this to work, we depend on a systematic PXE configuration. For each node, a link named after the node points to a file named after its IP address coded in hexadecimal. This second file is itself a link pointing to the default configuration.

emhtest329 -> 2465220A
2465220A -> ol7u3_sda6
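
The hexadecimal name is simply the four octets of the node's IPv4 address printed as two hex digits each; 2465220A, for example, corresponds to 36.101.34.10. A one-liner (shown only for illustration) produces the name for a given address:

$ IP=36.101.34.10
$ printf '%02X%02X%02X%02X\n' `echo $IP | tr '.' ' '`
2465220A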

We allow users only specific combinations, prepared as special1, special2, and so on. Next to the default configuration:

ol7u3_sda6

the PXE boot directory contains files like:

ol7u3_sda6-k229sp0
ol7u3_sda6-k514sp1
ol7u3_sda6-k514sp2
...

Each represents a boot configuration. In this case, k229 and k514 indicate different kernel versions, while sp1, sp2, sp3, and so on carry special kernel options (e.g., NOHZ_full). If the user requests a specific configuration, the corresponding file for this node has to be present. So, in our example, the node emhtest329 could be rebooted into the configuration k514sp2 because the file ol7u3_sda6-k514sp2 is present. But asking for k514x5 would fail.

Changing BIOS Options

Intel provides the syscfg utility for Intel-manufactured motherboards, which allows reading and modifying BIOS parameters from Linux. Not all OEMs provide similar tools.
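
The calls used by the reconfiguration script later in this article follow this pattern (the empty string passed to /bcs is the BIOS administrator password):

$SYSCFG /s INI                          # save all current BIOS settings to syscfg.INI
$SYSCFG /d BIOSSETTINGS "Memory Mode"   # display a single BIOS setting
$SYSCFG /bcs "" "Memory Mode" 1         # change a BIOS setting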

Integration Into the Cluster Manager

The integration into LSF is now straightforward. The prologue is automatically executed by LSF on the master control node. Early in the prologue, before any other checking or setup is done, the reconfiguration script needs to be executed on all compute nodes of the job (not on the control node itself). If a node declares that a reboot is needed, the prologue can use IPMI to reset it. It then waits until the reboot is complete. A maximum wait time ensures that nodes failing to boot will not wreck this scheme. During the epilogue, a similar flow restores the node settings to their default values and, if necessary, reboots the nodes.

Anatomy of the Reconfiguration Script

The script is executed by the prologue on each node and reacts to a number of environment variables:

prologue                        # 1 if the prologue is executed
# user requirements from bsub
Resource_List_KNL_MEMMODE       # MCDRAM as cache (0) or flat memory (1)
Resource_List_KNL_CLUSTERMODE   # All2All (0), SNC-2 (1), SNC-4 (2), Hemisphere (3), Quadrant (4)
Resource_List_Sub_NUMA_Cluster  # Sub NUMA Clustering disabled (0) or enabled (1)
Resource_List_SPECIAL_KERNEL    # suffix of the requested PXE boot configuration

Initialization defines where the critical syscfg binary and the working files are located. We track whether the node needs a reboot in the REBOOT variable:

SYSCFG=/usr/local/bin/syscfg
SAFEDIR=/var/lib/icsmoke3/safe
CURRENTDIR=/var/lib/icsmoke3/current
PXEDIR=/admin/tftpboot/3.0/pxelinux/pxelinux.cfg
REBOOT=no
HOSTNAME=`hostname`

Output from the syscfg command is not always directly usable as input, so a small helper function converts it. The sed command below will transform a line from syscfg.INI in the form:

Cluster Mode=Quadrant;Options: All2All=00: SNC-2=01: SNC-4=02: Hemisphere=03: Quadrant=04: Auto=05

into its associated numerical value. It requires the variable $I to be set correctly.

convert_syscfg()
{
 echo "$1" | sed -e "s,${I}=Cache.*,0," -e "s,${I}=Flat.*,1," \
  -e "s,${I}=All2All.*,0," -e "s,${I}=SNC-2.*,1," -e "s,${I}=SNC-4.*,2," \
  -e "s,${I}=Hemisphere.*,3," -e "s,${I}=Quadrant.*,4," \
  -e "s,${I}=Disabled.*,0," -e "s,${I}=Enabled.*,1,"
}
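
For example, with $I set to "Cluster Mode" and the syscfg.INI dump residing in the current directory, the function reduces the line shown above to its numerical code:

I="Cluster Mode"
LINE=`egrep "$I" syscfg.INI`   # Cluster Mode=Quadrant;Options: All2All=00: ...
convert_syscfg "$LINE"         # prints 4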

Dump the current BIOS configuration to get all current settings:

cd $CURRENTDIR
/bin/rm syscfg.INI
$SYSCFG /s INI

After a job completes, the epilogue should run on all nodes.

if [ "$prologue" != "1" ]
then

We first check if a special PXE configuration already exists. If so, it should be removed. If it’s not possible to remove the file, the script will fail with an error:

# this is the PXE link used to boot to special kernel
ADDR="01-`sed -e 's,:,-,g' /sys/class/net/eth0/address`"
ADDRFILE="/admin/tftpboot/3.0/pxelinux/pxelinux.cfg/$ADDR"
if [ -e "$ADDRFILE" ]
then
 /bin/rm $ADDRFILE
 sleep 1 # wait for the cluster file system to catch up
 if [ -e "$ADDRFILE" ]
 then
  badmin hclose -C "wrong bootimage, fix $ADDRFILE" $HOSTNAME
  REBOOT=error
  exit 1
 fi
fi

Check the current values of the BIOS options against the expected values in $SAFEDIR/syscfg.INI and correct any differences. If a value had to be changed, set the variable $REBOOT=yes.

for I in "Memory Mode" "Cluster Mode" "Sub_NUMA Cluster" "IMC Interleaving"
do
 CURRENT=`egrep "$I" $CURRENTDIR/syscfg.INI`
 SAFE=`egrep "$I" $SAFEDIR/syscfg.INI`
 if [ "$CURRENT" != "$SAFE" ]
 then
  VAL=`convert_syscfg "$SAFE"`
  $SYSCFG /bcs "" "$I" "$VAL"
  REBOOT=yes
 fi
done

Our kernel command lines contain a hint as to whether the node is running the default kernel or something special in need of a reboot.

grep -q "CRTBOOT=default" /proc/cmdline || REBOOT=yes

The rest of the script is only processed during prologue:

else

For every BIOS option, we check whether the user supplied an allowable value. Then the current configuration is compared against the one requested by the user. If a configuration change is needed, we use syscfg to set the new value and set the $REBOOT variable to indicate that rebooting is necessary, e.g.:

  case "$Resource_List_KNL_MEMMODE" in 0|1)
   I="Memory Mode"
   CURRENT=`egrep "$I" $CURRENTDIR/syscfg.INI`
   SAFE=`egrep "$I" $SAFEDIR/syscfg.INI`
   VAL=`convert_syscfg "$CURRENT"`
   test -n "$SAFE" && test "$VAL" != "$Resource_List_KNL_MEMMODE" \
    && { $SYSCFG /bcs "" "$I" "$Resource_List_KNL_MEMMODE" ; REBOOT=yes; }
   ;;
  esac

For logging purposes, we display the current values of all settings:

# display current settings
$SYSCFG /d BIOSSETTINGS "Memory Mode"
$SYSCFG /d BIOSSETTINGS "Cluster Mode"
$SYSCFG /d BIOSSETTINGS "Sub_NUMA Cluster"

To configure a different kernel, the script first locates the standard PXE-config file for this node in $PXEDIR. That name is extended with $Resource_List_SPECIAL_KERNEL. If the requested configuration file exists, a link with the name 01-MACADDRESS is created and will take precedence on the next boot. The $REBOOT variable is set to "yes."

 if [ -n "$Resource_List_SPECIAL_KERNEL" ]
 then
  echo "configuring for Kernel $Resource_List_SPECIAL_KERNEL"
  if ! grep -q "CRTBOOT=${Resource_List_SPECIAL_KERNEL}" /proc/cmdline
  then
   DEFAULT=`readlink -f $PXEDIR/$HOSTNAME`
   DIR=`dirname $DEFAULT`
   BASE=`basename $DEFAULT`
   if [ -e "$DIR/${BASE}-${Resource_List_SPECIAL_KERNEL}" ]
   then
    cd $DIR
    ln -s ${BASE}-${Resource_List_SPECIAL_KERNEL} "$ADDR"
    echo "created $DIR/$ADDR"
    ls -l "$DIR/$ADDR"
    REBOOT=yes
   else
    echo "can not set kernel to $DIR/${BASE}-${Resource_List_SPECIAL_KERNEL}"
   fi
  fi
 fi
fi

The script ends, producing an output of either “REBOOT=yes” or “REBOOT=no”.

echo "REBOOT=$REBOOT"

The prologue script running on the control node will parse the output and, depending on this output, issue a reboot sequence via IPMI.
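
That control-node logic is not listed in this article. A simplified sketch, assuming ssh access to the compute nodes, ipmitool for the resets, $NODES holding the job's compute nodes, and a hypothetical install path for the reconfiguration script, might look like this:

# simplified control-node prologue logic (illustrative assumptions only)
MAXWAIT=900                              # give up after 15 minutes
REBOOTLIST=""
for NODE in $NODES
do
 OUT=`ssh $NODE /usr/local/sbin/reconfig.sh`        # hypothetical path
 echo "$OUT" | grep -q "REBOOT=yes" && REBOOTLIST="$REBOOTLIST $NODE"
done
for NODE in $REBOOTLIST
do
 ipmitool -I lanplus -H ${NODE}-bmc -U admin -P secret chassis power reset
done
# wait until every rebooted node answers again, or fail the prologue
for NODE in $REBOOTLIST
do
 WAITED=0
 until ssh -o ConnectTimeout=5 $NODE true 2>/dev/null
 do
  sleep 30
  WAITED=`expr $WAITED + 35`
  [ $WAITED -ge $MAXWAIT ] && { echo "$NODE did not come back" >&2; exit 1; }
 done
done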

Optimizing Your Cluster

Modern Intel Xeon processor-based systems, as well as the Linux kernel, provide many options for optimizing the hardware and OS for a specific application. We’ve outlined a way to perform complex optimizations on a per-job basis. There’s a price to pay in the form of added complexity and job startup times, but for Intel’s HPC benchmarking cluster, Endeavor, this feature became a very important way to boost performance over the last year. Your gains might be even higher.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

For more complete information about compiler optimizations, see our Optimization Notice.