Colorado academic researchers will soon have a new supercomputer at their fingertips.
The Summit supercomputer is funded by the National Science Foundation (NSF) Major Research Instrumentation (MRI) Program (award #1532235). The total award was $3.5M, split between Colorado State University (CSU, $850K, 25%) and the University of Colorado Boulder (CU, $2.7M, 75%). Summit resources will also be available to members of the Rocky Mountain Advanced Computing Consortium (RMACC). The system will be hosted in a CU Boulder data center, with system administration handled by CU IT staff. The new system was briefly described in a recent HPCwire press release.
Summit is scheduled to go into production in June 2016.
Summit is a heterogeneous computing system with traditional CPU compute nodes, GPU and MIC accelerator nodes, high-memory nodes, an OmniPath Fat-Tree interconnect and 1 PB of scratch disk storage. Summit’s peak performance should be around 450 teraflops (TF), which would place it at roughly #175 on the Top500 list. Actual performance metrics will be determined when the system goes live.
We’ll briefly describe the Summit system here.
System Architecture Overview
Summit is a heterogeneous parallel cluster compute server. It’s designed to serve a broad and diverse clientele in research computing with integrated state-of-the-art hardware and software components. A schematic of the system architecture is shown below.
Summit is designed around several major components including:
- Multicore Intel Haswell CPU compute nodes
- Nvidia K80 GPU accelerator compute nodes
- Intel Knights-Landing Phi MIC compute nodes
- High memory compute nodes
- 1 PB scratch storage
- GPFS parallel file system
- Intel OmniPath interconnect
- Fat-Tree interconnect topology
- Heterogeneous software stack
- Expert support staff
CPU Compute Nodes
Summit includes 376 CPU compute nodes with a total of 9,024 Intel Haswell CPU cores. The role of the CPU nodes is to provide a platform for general-purpose HPC applications: the bulk of our clients’ software is designed for CPU-only execution, and many applications were first developed decades ago when CPU cores were the only available execution platform. Each node pairs 24 current-generation Intel Haswell CPU cores with 128 GB of RAM, a ratio of 5.3 GB of RAM per core. The RAM/core ratio is an important metric for HPC applications; based on our experience with earlier systems, many applications suffer performance degradation when per-core memory is limited, so in Summit we boosted the RAM/core ratio to remove this bottleneck (a short sizing sketch follows the spec list below). Each node also includes a 200 GB SATA SSD for staging file I/O on a fast, intermediate storage device.
CPU compute nodes with Intel Haswell cores.
CPU compute node specs:
- 376 CPU nodes
- Dell PowerEdge C6320
- 9,024 Intel Haswell CPU cores total
- 2X Intel Xeon E5-2680v3; 2.5 GHz
- 24 CPU cores / node
- 128 GB DDR4 RAM / node
- 5.3 GB RAM / core
- 200 GB SATA SSD / node
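To make the RAM/core budget concrete, here is a minimal C/OpenMP sketch (illustrative only, not Summit code; the 80% headroom factor and the 64 MB demo cap are assumptions) showing how an application running on a full 24-core, 128 GB node might size its per-thread working set and first-touch it with the thread that will use it.

```c
/* ram_per_core.c -- illustrative sketch: sizing per-thread buffers to
 * the ~5.3 GB/core budget of a 24-core, 128 GB Summit CPU node.
 * Build (for example): gcc -fopenmp ram_per_core.c -o ram_per_core
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double node_ram_gb = 128.0;             /* RAM per CPU node     */
    int nthreads = omp_get_max_threads();         /* 24 on a full node    */
    double per_core_gb = node_ram_gb / nthreads;  /* ~5.3 GB per core     */

    /* Leave headroom for the OS and file cache (assume 80% usable) and
     * cap the demo allocation so the example runs anywhere.            */
    size_t budget = (size_t)(0.8 * per_core_gb * 1e9);
    size_t demo   = budget < (64UL << 20) ? budget : (64UL << 20);

    printf("threads=%d  budget=%.2f GB/core  demo buffer=%zu bytes\n",
           nthreads, per_core_gb, demo);

    #pragma omp parallel
    {
        /* Each thread allocates and first-touches its own working set,
         * so pages land in the memory closest to that core (NUMA).     */
        char *buf = malloc(demo);
        if (buf) {
            for (size_t i = 0; i < demo; i += 4096)
                buf[i] = 0;
            free(buf);
        }
    }
    return 0;
}
```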
GPU Compute Nodes
Summit includes 10 Nvidia GPU (graphics processing unit) nodes with a total of 99,840 GPU cores. Each node hosts dual Nvidia K80 GPU cards, each with 4,992 GPU cores and 24 GB of GDDR5 RAM. Each node also has the same CPU configuration as a standard CPU node: 24 Haswell CPU cores, 128 GB of DDR4 RAM and a 200 GB SATA SSD. The I/O expansion bus supports PCIe Gen3 x16 interface cards. In GPU computing, the CPU host environment is an important design element, as CPU-GPU data management and communication patterns strongly affect the overall performance of software applications (see the sketch after the spec list below).
Nvidia GPU compute nodes.
GPU compute node specs:
- 10 GPU nodes
- Dell PowerEdge C4130
- 99,840 GPU cores total
- 2X Nvidia K80 GPU cards / node
- 4,992 GPU cores / card (9,984 / node)
- 24 GB GDDR5 RAM / card (48 GB / node)
- PCIe Gen 3.0 x16
- 2X Intel Xeon E5-2680v3; 2.5 GHz
- 24 CPU cores / node
- 128 GB DDR4 RAM / node
- 5.3 GB RAM / CPU core
- 200 GB SATA SSD / node
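The CPU-GPU data path mentioned above is where many GPU applications win or lose their performance. Below is a minimal host-side C sketch using the CUDA runtime API (the buffer size and variable names are illustrative, not taken from any Summit application) that stages data in pinned host memory and copies it to the device asynchronously, so the CPU can keep working during the transfer.

```c
/* stage.c -- minimal sketch of CPU->GPU data staging with pinned host
 * memory and an asynchronous copy, using the CUDA runtime's C API.
 * Build (for example): nvcc stage.c -o stage
 */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n_bytes = 256UL * 1024 * 1024;   /* 256 MB example      */
    float *h_buf = NULL, *d_buf = NULL;
    cudaStream_t stream;

    /* Pinned (page-locked) host memory enables true async DMA copies.  */
    cudaMallocHost((void **)&h_buf, n_bytes);
    cudaMalloc((void **)&d_buf, n_bytes);
    cudaStreamCreate(&stream);

    /* ... fill h_buf with input data here ... */

    /* Start the copy; the CPU is free to do other work meanwhile.      */
    cudaMemcpyAsync(d_buf, h_buf, n_bytes, cudaMemcpyHostToDevice, stream);

    /* ... overlap CPU-side work, or enqueue kernels on 'stream' ...    */

    cudaStreamSynchronize(stream);   /* wait for the transfer to finish */
    printf("staged %zu bytes to the GPU\n", n_bytes);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

On a node with dual K80 cards behind a PCIe Gen3 x16 bus, keeping transfers few, large and asynchronous is usually the first optimization worth trying.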
Knights Landing Phi Nodes
Summit includes 20 Intel Knights Landing Phi (KnL-F) nodes with a total of 1,440 KnL-F cores. KnL-F processors are still under development by Intel, with a scheduled release date in Q4 2016. Our current understanding of the KnL-F design is that it will support 72 Silvermont/Atom cores per node and 16 GB of HBM (high-bandwidth memory) 3D-stacked MCDRAM (multi-channel DRAM) in a MIC (many integrated core) architecture, with an additional 384 GB of DDR4 platform RAM per node. Each KnL-F node includes a 200 GB SATA SSD for staging file I/O (see the allocation sketch after the spec list below).
Intel KnL-F compute nodes.
KnL-F compute node specs:
- 20 KnL-F nodes
- 1,440 KnL-F cores total
- 72 Silvermont/Atom cores / node; 1.3 GHz
- 16 GB HBM (high bandwidth memory) 3D stacked MCDRAM (multi-channel DRAM) / node
- 384 GB DDR4 platform RAM / node
- 200 GB SATA SSD / node
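Since KnL-F hardware is not yet in hand, the details of how applications will address the 16 GB of MCDRAM are still open; one likely route is a high-bandwidth memory allocator such as the open-source memkind library. The sketch below assumes that interface (hbw_malloc/hbw_free) and is illustrative only, not a description of Summit’s final KnL-F software environment.

```c
/* hbm.c -- illustrative sketch: placing a bandwidth-critical array in
 * KNL MCDRAM via the memkind library's hbwmalloc interface (assumed to
 * be available). Build (for example): gcc hbm.c -lmemkind -o hbm
 */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1UL << 27;            /* 128M doubles = 1 GiB       */
    int in_hbm = 1;

    /* Ask for high-bandwidth (MCDRAM) memory first ...                  */
    double *a = hbw_malloc(n * sizeof *a);
    if (!a) {                              /* ... fall back to DDR4      */
        in_hbm = 0;
        a = malloc(n * sizeof *a);
        if (!a) return 1;
    }

    for (size_t i = 0; i < n; i++)         /* stream through the array   */
        a[i] = (double)i;

    printf("array of %zu doubles placed in %s\n",
           n, in_hbm ? "MCDRAM" : "DDR4");

    if (in_hbm) hbw_free(a); else free(a);
    return 0;
}
```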
High Memory Nodes
Summit includes five high-memory (HiMem) nodes with 2 TB of DDR4 RAM per node. Some applications require large amounts of memory to run efficiently, or even to run at all, and the HiMem nodes provide sufficient RAM to host these special applications. The nodes are also designed with a high RAM/CPU core ratio of 42 GB of RAM per core, which helps minimize memory contention among cores, so applications that need a lot of memory get both capacity and headroom. Each HiMem node includes a 200 GB SAS SSD for staging file I/O.
2 TB RAM high-memory compute nodes.
HiMem compute node specs:
- 5 HiMem nodes
- Dell PowerEdge R930
- 4X Intel Xeon E7-4830v3; 2.1 GHz
- 48 CPU cores / node
- 2 TB DDR4 RAM / node
- 42 GB RAM / CPU core
- 200 GB SAS SSD / node
- 12 TB SAS HDD / node
Scratch Disk Storage
Summit provides 1 petabyte (PB) of scratch disk storage on state-of-the-art DDN SFA14KE block storage appliances. The SFA14KE supports up to 6M random 4K IOPS (input/output operations per second) and 21 GB/sec of sequential read/write bandwidth, among the highest I/O rates available today. The disk arrays are configured in a standard RAID6 format, which tolerates two simultaneous drive failures without data loss.
The SFA14KE also offers innovative software for high-speed I/O: an Infinite Memory Engine (IME) burst buffer that substantially accelerates file writes and minimizes file I/O and file-lock contention; a ReACT intelligent cache management system that optimizes I/O operations; DirectProtect software that detects and corrects silent data corruption on physical disks; and GRIDScaler software that integrates GPFS (described below) directly into the RAID disk arrays.
In short, the SFA14KE storage appliance combines high I/O bandwidth, large storage capacity, GPFS parallel file system integration, smart cache management and optimized I/O operations.
1 PB DDN SFA14KE scratch storage device.
Scratch storage specs:
- 1 Petabyte (PB) scratch storage
- DDN SFA14KE block storage device
- GRIDScaler (GPFS integration)
- Infinite Memory Engine (IME) burst buffer
- ReACT intelligent cache management; analyze & optimize I/O operations
- DirectProtect; data corruption detection & correction
- 21 GB / sec. sequential R/W
- 6M random 4K IOPS (input/output operations per second)
- Direct native connect to OmniPath
General Parallel File System (GPFS)
Summit uses IBM’s General Parallel File System v.4.1 for high-speed parallel I/O operations. I/O bottlenecks are often as significant as compute bottlenecks in the overall performance of HPC applications. GPFS allows HPC applications to take advantage of the parallel I/O capabilities of SFA14KE scratch storage. GPFS is well suited for mixed I/O workloads, which we expect to see in our diverse user community.
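To illustrate the kind of parallel I/O that GPFS and the SFA14KE are built for, here is a minimal MPI-IO sketch in C (the file name, block size and rank layout are placeholders, and this is generic MPI-IO rather than anything Summit-specific): each MPI rank writes its own contiguous block of one shared file with a collective call, letting the parallel file system spread the traffic across its storage servers.

```c
/* pio.c -- minimal MPI-IO sketch: every rank writes its own block of a
 * shared file with a collective call; on a GPFS scratch file system the
 * writes are serviced in parallel by the storage servers.
 * Build/run (for example): mpicc pio.c -o pio && mpirun -n 4 ./pio
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int block = 1 << 20;                    /* 1 MiB per rank      */
    char *buf = malloc(block);
    for (int i = 0; i < block; i++)
        buf[i] = (char)rank;                      /* rank-tagged payload */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "scratch_demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: rank r owns bytes [r*block, (r+1)*block).       */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Collective operations such as MPI_File_write_at_all give the MPI library and the file system enough information to aggregate and align writes, which generally performs far better on a parallel file system than many ranks independently writing small, interleaved records.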
OmniPath Architecture (OPA) Interconnect
Summit is one of the first large-scale systems to deploy Intel’s new OmniPath interconnect. OmniPath is an integrated system of high-speed network switches, host-fabric PCIe interface cards and OPA cabling. OPA is one of the fastest interconnects available today, delivering 100 Gbit/sec of fabric bandwidth per port.
Intel OmniPath switch-based interconnect.
- 100 Gbit / sec. bandwidth
- PowerEdge R430 OPA management servers
- OPA HFI (host-fabric interface) PCIe adapters
- OPA 48-port core & leaf switches
- OPA cables
The Summit interconnect employs a 2:1 oversubscribed Fat-Tree topology, one of the most common and best-tested interconnect topologies in the HPC community. On Summit it consists of 48-port Level-2 (Core) OPA switches, 48-port Level-1 (Leaf) OPA switches, and the compute nodes. The 2:1 oversubscription is a cost-performance tradeoff: it reduces the number of (relatively expensive) network switches in exchange for a small performance hit, since each 48-port Leaf switch devotes roughly two node-facing ports for every uplink into the Core layer (e.g., 32 downlinks and 16 uplinks). The Fat-Tree topology is a balanced design that favors data locality, provides the same aggregate bandwidth in each switching layer and identical bandwidth across any bisection of the interconnect. Summit’s interconnect is scalable and easily supports anticipated future expansions of the system.
Fat-Tree topology interconnect.
Ethernet Management Network
Summit includes a fairly standard top-of-rack (TOR) 1 Gbit/sec Ethernet network for system management activities. It includes a single core switch that aggregates connectivity from multiple edge switches and provides uplinks to a local campus science network. The network provides connectivity to all compute nodes, storage nodes, network switches, hardware controllers, chassis controllers and other system components.
Software Stack
The Summit software stack reflects the complexity of a heterogeneous cluster computing environment. The table below highlights a few of the more important software components in the stack (excluding lower-level drivers, libraries and protocols).
|SW Component|SW Name|
|---|---|
|OS|Red Hat Enterprise Linux v.7.1|
| |STL (standard template library)|
|Performance Analysis|Nvidia Nsight, CUPTI (CUDA profiling tools interface)|
Summit software stack.
The inherent complexity of the software stack should be self-evident. It reflects the fact that the Summit cluster supports software applications running on traditional Haswell processors, accelerated codes running on GPU and MIC cards, a profusion of parallelizing compilers and messaging protocols, debugging and performance analysis tools that cover CPUs, GPUs and MICs, and of course an extensive list of open-source and vendor applications.
File transfer to and from Summit is a mundane but extremely important consideration for end users. HPC systems can generate large files, and large numbers of files, that must be moved efficiently to and from the system. Globus will be a primary tool for file transfers; it provides a simple, intuitive drag-and-drop interface for data transfer. Early tests of Globus show transfer rates of around 700 GB / hr. (roughly 0.2 GB/sec) between the CSU and CU campuses, which is sufficient bandwidth for transferring large datasets in a reasonable timeframe. Of course, Summit supports all the usual Linux file transfer tools, including sftp, scp and rsync, and users can employ popular client applications such as FileZilla, PuTTY and others.
Managing a large-scale heterogeneous cluster like Summit is certainly a challenge, requiring close collaboration between system administrators and application support specialists.
On Summit the system administrators manage Intel Haswell CPUs, Nvidia K80 GPUs, Intel KnL-F MICs, HiMem compute nodes, an OmniPath interconnect network, an Ethernet management network, DDN scratch storage and gateway servers. Each component has its own set of drivers, software libraries and protocols, making for a complex management environment. And of course the data center’s HVAC and electrical power must be monitored and controlled for efficient, reliable operation. Technically competent sysadmin support is crucial for Summit.
Software application support is especially challenging, mainly due to the heterogeneous nature of the compute environment. General-purpose computing is provided by the Haswell CPU infrastructure, and a large fraction of existing code is designed for execution exclusively on CPU cores. In the past five years or so, however, GPU-accelerated and MIC-based algorithms have arrived on the scene in a big way. Application support staff must be knowledgeable about GPU and MIC hardware architectures, programming styles, debugging and performance analysis tools, and troubleshooting on both platforms. The GPU and MIC architectures are quite different from one another (e.g., fine-grained vs. coarse-grained parallelism), each with its own programming methods, its own debugging and performance analysis tools and, for the most part, a different code base. Software application support is more crucial than ever in this heterogeneous environment.
We’ll update this blog post periodically as Summit evolves over the next few years.