Is Your Cluster Healthy?
Must-Have Cluster Diagnostics in Intel® Cluster Checker
Intel® Cluster Checker is a powerful tool for quickly identifying and solving issues in high-performance computing (HPC) clusters. Subtle and sometimes simple issues on a system can impact cluster performance and blunt the efforts of fine-tuning and parallelizing an application. Often, the first signs of a system issue appear when applications run too slowly, or simply stop running altogether. Intel Cluster Checker provides a methodical way to determine quickly whether an application’s problems actually stem from the cluster itself.
Cluster Systems Expertise in a Tool
Intel Cluster Checker captures best-known methods and system diagnostics in a single tool. It was introduced in 2007 as part of the Intel® Cluster Ready Program, with a key goal of assisting the broad range of people who design, deploy, and manage clusters.
Building and managing HPC clusters is far more complex than managing single systems. By their nature, the most powerful systems in the world, the TOP500, are custom-built systems with dedicated staff to operate them. Many of these environments grew organically over years, maintained by solution architects with deep knowledge in this space.
The problem is that the approach of large HPC data centers doesn’t necessarily scale down to smaller systems. The high level of expertise required can intimidate small and medium businesses. Even larger enterprises considering moving to HPC clusters must weigh the time and effort it takes to ramp capabilities. The learning curve can look like a mountain too steep to climb―even though the return on investment for using HPC is substantial. [Editor’s note: According to a study by Hyperion Research, every dollar invested in HPC yields $507 of revenue growth and $47 of profit. (Sources: IDC Economic Models Linking HPC and ROI and Hyperion (IDC) Paints a Bullish Picture of HPC Future)] Intel Cluster Checker provides expertise in a tool to lower the intimidation factor for people ramping HPC capabilities.
Intel Cluster Checker works like a clinical system. It looks for signs that an issue exists and then examines the signs holistically to diagnose issues―and potentially even suggest remedies. Data providers encapsulate common diagnostic tools and functions, and the tool uses these providers to collect information about the cluster. A rules-based expert system then analyzes the information to produce signs that may indicate issues. A combination of different signs can lead to a diagnosis, and the tool can often suggest a remedy. In this way, Intel Cluster Checker models an expert analysis of cluster functionality and makes it easier to resolve issues quickly.
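The flow described above (data providers, rules that produce signs, and signs combined into a diagnosis with a remedy) can be illustrated with a small sketch. This is not Intel Cluster Checker code: the `Sign` class, the function names, the node data, and the thresholds are all hypothetical and chosen only to show the shape of a rules-based analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sign:
    """One observation that may point to a problem on a node."""
    node: str
    name: str


def collect_facts():
    """Stand-in data provider. A real provider would wrap system tools;
    these values are invented for illustration."""
    return {
        "node01": {"link_speed_gbps": 100, "mpi_latency_us": 1.2},
        "node02": {"link_speed_gbps": 25, "mpi_latency_us": 4.8},
    }


def find_signs(facts, expected_speed=100, max_latency=2.0):
    """Rules: compare each collected fact with an assumed expected value."""
    signs = []
    for node, data in sorted(facts.items()):
        if data["link_speed_gbps"] < expected_speed:
            signs.append(Sign(node, "slow-link"))
        if data["mpi_latency_us"] > max_latency:
            signs.append(Sign(node, "high-latency"))
    return signs


def diagnose(signs):
    """Combine the signs seen on each node into (diagnosis, remedy)."""
    by_node = {}
    for sign in signs:
        by_node.setdefault(sign.node, set()).add(sign.name)
    diagnoses = {}
    for node, names in by_node.items():
        if {"slow-link", "high-latency"} <= names:
            diagnoses[node] = ("fabric link negotiated at reduced speed",
                               "check the cable and switch port")
        else:
            diagnoses[node] = ("unexplained signs: " + ", ".join(sorted(names)),
                               "inspect the node manually")
    return diagnoses


if __name__ == "__main__":
    results = diagnose(find_signs(collect_facts()))
    for node, (diagnosis, remedy) in results.items():
        print(f"{node}: {diagnosis}; suggested remedy: {remedy}")
```

The point of the sketch is the separation of concerns: collection knows nothing about rules, and a diagnosis is reached only when several signs agree.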
What Does It Check?
Intel Cluster Checker includes a broad range of data providers and rules that target common issues that cause system failures or performance degradations. At a high level, Intel Cluster Checker looks at elements of individual nodes and basic functionality, then steps up to cluster-wide functionality. It’s not feasible to list everything it checks, but here are some examples:
- It checks if the user running the tool has SSH keys set properly for executing message-passing interface (MPI) parallel applications.
- It verifies that the firmware version of the add-in network card is the same on each node of the system, and it examines library versions and software consistency across the cluster.
- It finds differences in processor steppings, memory, and hardware components.
- It uncovers differences in configurations of both hardware and software components.
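Several of these examples come down to a uniformity check: gather one value per node and report the nodes that disagree with the rest. The sketch below shows one simple way to do that, assuming the majority value is the intended one. The function name, node names, and firmware strings are hypothetical, and this is not how Intel Cluster Checker itself is implemented.

```python
from collections import Counter


def find_outliers(values_by_node):
    """Return (majority_value, {node: value}) for nodes that differ.

    Assumes the most common value across the cluster is the intended one.
    """
    if not values_by_node:
        return None, {}
    counts = Counter(values_by_node.values())
    majority, _ = counts.most_common(1)[0]
    outliers = {node: value
                for node, value in values_by_node.items()
                if value != majority}
    return majority, outliers


if __name__ == "__main__":
    # Invented firmware versions for an add-in network card on four nodes.
    firmware = {
        "node01": "10.9.0",
        "node02": "10.9.0",
        "node03": "10.7.1",
        "node04": "10.9.0",
    }
    majority, outliers = find_outliers(firmware)
    print(f"expected {majority}; mismatched nodes: {outliers}")
```

The same function works for any per-node attribute, such as a processor stepping, a library version, or a BIOS setting, as long as the cluster is meant to be uniform in that attribute.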
It also uses some common benchmarks to try to gauge how actual performance compares to expected performance. These functions are valuable at deployment time to declare a system ready for use. They also play a role in maintaining overall system health. There are hundreds of aspects of a cluster examined today, and the list of available checks keeps growing with update releases of the tool.
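Comparing actual with expected performance can be as simple as flagging nodes whose benchmark result falls more than a chosen tolerance below a reference value. The following is a minimal sketch under that assumption. The 10% tolerance, the reference figure, and the measured numbers are invented for illustration and are not taken from the tool.

```python
def underperformers(measured, expected, tolerance=0.10):
    """Return nodes whose result is more than `tolerance` below `expected`.

    `measured` maps node name to a benchmark result where higher is better.
    """
    floor = expected * (1.0 - tolerance)
    return {node: value for node, value in measured.items() if value < floor}


if __name__ == "__main__":
    # Invented memory-bandwidth results in GB/s against a 100 GB/s reference.
    results = {"node01": 98.5, "node02": 97.0, "node03": 81.2}
    slow = underperformers(results, expected=100.0)
    print(f"nodes below the acceptance floor: {slow}")
```

A tolerance band rather than an exact match matters here because benchmark results vary from run to run even on healthy hardware.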
Over the operational lifetime of a cluster, subtle changes may be introduced with replacement parts, expansion of nodes, and reconfigurations of software or hardware. For example, a replacement network card could be plugged into a different PCIe* slot than before. New nodes added to the cluster could have a different Intel® Xeon® Scalable processor than the other nodes. Someone could have accidentally skipped updating the new BIOS settings on one of the nodes.
Intel Cluster Checker helps find these issues and calls attention to them. None of them may actually be a problem for a particular system or application, but the tool highlights items to examine. Using Intel Cluster Checker can make running clusters less daunting for those who don’t have deep knowledge of cluster administration and management, and it can augment the toolset for those who do.
In addition to cluster health, Intel Cluster Checker can also verify that a cluster provides the application compatibility described in the Intel® Scalable System Framework (Intel® SSF) reference architecture. The Intel SSF reference architecture describes system requirements that define a minimum level of system characteristics. Some of these characteristics include elements of the system software for Linux*-based clusters as well as minimum requirements for system hardware. Clusters that comply with the specification provide a common platform interface that application developers can target. Applications that build on this common layer execute on any system that complies with the reference architecture. This pairing of applications and systems enables interoperability that also simplifies the ramp to using HPC clusters.
Extending and Embedding Functionality
The technologies and components that make up clusters are constantly evolving, which increases the potential for new types of problems. Because of this, extensibility is a key feature that lets Intel Cluster Checker keep pace with the scope of issues users face. Once a new type of problem is known, adding the mechanisms to detect and resolve it makes checking for it routine. Users can even create their own data providers and checks and include them in the same fashion. Intel Cluster Checker provides the capability to group data collection and analysis functions into frameworks. These frameworks allow flexibility in how the tool operates and provide a quick way to drop in new checks to extend capabilities.
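The idea of grouping checks so that new ones can be dropped in without touching existing code can be sketched with a small registry. This illustrates the general pattern only. It does not reflect the actual framework format or API of Intel Cluster Checker, and every name in it (`FRAMEWORKS`, `check`, `run`, the group names, and the sample facts) is hypothetical.

```python
# Maps a group name to the list of check functions registered under it.
FRAMEWORKS = {}


def check(framework):
    """Decorator that registers a check function under a named group."""
    def register(func):
        FRAMEWORKS.setdefault(framework, []).append(func)
        return func
    return register


@check("health")
def swap_disabled(facts):
    """Return a message if swap is enabled, otherwise None."""
    return None if facts.get("swap_mb", 0) == 0 else "swap is enabled"


@check("site_rules")
def scratch_mounted(facts):
    """A site-specific check added by a user (hypothetical requirement)."""
    if "/scratch" in facts.get("mounts", []):
        return None
    return "/scratch is not mounted"


def run(framework, facts):
    """Run every check in one group and collect the messages they return."""
    messages = []
    for func in FRAMEWORKS.get(framework, []):
        message = func(facts)
        if message is not None:
            messages.append(message)
    return messages


if __name__ == "__main__":
    node_facts = {"swap_mb": 2048, "mounts": ["/home"]}
    for name in FRAMEWORKS:
        print(name, run(name, node_facts))
```

Adding a new check is then a matter of writing one decorated function, and selecting which group to run changes the tool's behavior without changing any existing check.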
Application developers can also embed Intel Cluster Checker functionality directly into their applications using an API to control data collection and analysis. Embedding provides a range of options similar to those available when running Intel Cluster Checker from the command line. Applications can check general health or compliance with the Intel SSF architecture. Developers can also add customized rules that look for aspects of the system that are specific to the application’s needs. The API provides a programmatic mechanism to perform system checking and debugging from the application’s point of view, so the application can find underlying issues on a cluster and inform the user of potential problems. Examples of using the API are included in the online documentation for the tool.
Focus on Productivity
Variations in configurations, a mix of hardware and software components, or the state of system health can all manifest as problems for an HPC application. Using Intel Cluster Checker helps identify when systems are in a known, healthy state. That promotes a better out-of-the-box application experience for users. If problems do exist, the tool can quickly direct users to potential resolutions. Ultimately, this lowers the expertise barrier of running HPC clusters and opens doors for more users running cluster applications to achieve bigger and better results.
Intel Cluster Checker is currently available as part of Intel® oneAPI HPC Toolkit. It’s also provided on Intel® Select Solutions for Simulation & Modeling and may be included in solutions that comply with the Intel SSF reference architectures for classic HPC clusters.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.