Last December, I wrote a blog post about big data. In it, I mentioned the central role the massive number-crunching capability of the new Power8 CPU plays in meeting the demands of today’s big computing tasks. In this post, I want to focus on one key component of the Power8 compute solution: its ability to make use of hardware-based computational accelerators.
Math Co-Processors
The idea of a co-processor that performs specific math computations is not new. From the very beginning of the x86 architecture, for example, Intel offered a companion x87 chip that could be installed to handle floating-point arithmetic offloaded from the main x86 CPU. In the early days, one could buy a PC with or without what was called a math co-processor. By the early 1990s, however, with the introduction of the 80486DX CPU, feature sizes had shrunk enough to incorporate the functions formerly performed by the co-processor directly on the main chip. The day of the separate co-processor, at least in the x86 family, had come to an end.
IBM’s Power architecture never had an optional external math co-processor in the way Intel’s did. IBM’s architects always focused on providing maximum arithmetic horsepower within the core CPU package. That compute capacity was generally available to any thread running on the CPU; it was not task-specific, and it ran in series with all other necessary CPU functions. This stood in contrast to the external co-processor, which had the potential to perform its specialized calculations in parallel with the main CPU.
This architecture changed with the introduction of the Power7+ chip. The Power7+ was implemented at a 32nm feature size, down from the Power7’s 45nm, but it retained the Power7’s die size, giving the chip architects almost double the number of transistors to work with. They used some of those extra transistors to target two specific math-intensive use cases: the computation of cryptographic algorithms, and the compression and decompression of real-memory pages required by the Active Memory Expansion (AME) feature introduced with Power7 servers. The solution was an on-chip co-processor built and coded specifically to compute a set of common cryptographic algorithms and AME’s memory compression and decompression algorithms.
The Power8 chip carried that design forward, adding on-chip accelerators for Hardware Transactional Memory (HTM), Virtual Memory Management and Partition Mobility. What’s interesting is that in addition to these on-chip accelerators, the Power8 adds a generic capability to support an x87-style external accelerator. This has significant implications for the future of the Power architecture, as it opens the possibility for third-party vendors to provide co-processors that extend the core capabilities of the Power8 chip in any specialized direction. Thus, Power8 becomes a platform upon which any number of specialized compute engines can be built.
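To make the HTM idea concrete, here is a minimal sketch of what a hardware transaction looks like from application code, using the HTM built-ins that GCC provides for Power (compiled with -mhtm); the shared-balance variables are invented purely for illustration.

    #include <htmintrin.h>   /* GCC's Power HTM support (compile with -mhtm) */

    long balance_a = 100, balance_b = 0;   /* invented shared state */

    void transfer(long amount)
    {
        if (__builtin_tbegin(0)) {
            /* Transaction started: these two updates commit atomically,
               with no lock taken.  If another thread touches the same
               memory first, the hardware aborts the transaction and
               control lands in the else branch.  */
            balance_a -= amount;
            balance_b += amount;
            __builtin_tend(0);               /* commit */
        } else {
            /* Transaction aborted: retry, or fall back to a
               conventional lock-based path.  */
        }
    }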
Coherent Attached Processor Interface (CAPI)
A key component needed to support this external co-processor is CAPI. One of the challenges in implementing a co-processor is integrating it into the architecture of its host machine. Speed is critical: data must reach the co-processor, and answers must come back, as fast as possible. In the past, co-processor implementations used a device-driver model to do this, but that adds layers of protocol between the main system’s memory address space and the data addressed by the co-processor. Ideally, the co-processor should be able to address the same memory space as the main CPU, allowing it to operate as a peer of, and in parallel with, the main CPU. This coherent memory access model is what CAPI provides, eliminating the device-driver bottleneck.
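To give a feel for what that looks like in practice, here is a rough sketch of a Linux application attaching to a CAPI accelerator through IBM’s open-source libcxl user library; the AFU device path and the job-block layout are assumptions for illustration, since each accelerator defines its own work-element format.

    #include <libcxl.h>    /* IBM's open-source CAPI user-space library */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* The AFU device path is an assumption; real names vary by card. */
        struct cxl_afu_h *afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
        if (!afu) { perror("cxl_afu_open_dev"); return 1; }

        /* The "work element descriptor" handed to the accelerator is just
           a pointer into ordinary process memory.  The accelerator reads
           and writes that memory coherently -- no device driver copies
           data back and forth.  */
        uint64_t job[8] = { 0 };                 /* invented job block */
        if (cxl_afu_attach(afu, (uint64_t)(uintptr_t)job) < 0) {
            perror("cxl_afu_attach");
            return 1;
        }

        /* ... hand work to the accelerator and poll job[] for results ... */

        cxl_afu_free(afu);
        return 0;
    }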
Access to CAPI technology is being offered by IBM through the OpenPOWER Foundation. While not open source, CAPI and its associated technologies are available to anyone willing to pay the license fee to become a member of the foundation.
The NVIDIA Tesla GPU
One of the early adopters of CAPI is NVIDIA, a leading developer of graphics processing units (GPUs). Originally developed to handle the large volumes of data needed by graphics-intensive applications such as CAD and gaming, GPUs are at heart mathematical computation engines; with appropriate coding, they can perform nearly any kind of mathematical calculation. Since 2007, NVIDIA has been doing just that, repurposing its industry-leading graphics processing technology for general-purpose number crunching. Today, the NVIDIA Tesla K40 GPU can be ordered as a CAPI-attached GPU for Power8 servers. To support the Tesla GPU, NVIDIA also supplies a programming model and associated instruction set, the Compute Unified Device Architecture (CUDA), that makes it possible for developers to easily and effectively harness the power of the Tesla GPU and bring it to bear on their computations.
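As a small taste of the CUDA programming model, here is a minimal kernel that adds two vectors on the GPU, one GPU thread per element; the function and array names are invented for illustration.

    #include <cuda_runtime.h>

    /* Each GPU thread computes one element of the result. */
    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    void add_on_gpu(const float *a, const float *b, float *c, int n)
    {
        float *da, *db, *dc;
        size_t bytes = n * sizeof(float);

        /* Copy the inputs to GPU memory, launch one thread per element
           (256 threads per block), then copy the sum back to the host. */
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

        vector_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        cudaFree(da);  cudaFree(db);  cudaFree(dc);
    }

The appeal of the model is that the kernel is written once, in a C-like dialect, and the hardware fans it out across thousands of threads.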
One area IBM has recently targeted for a Tesla-based solution is the acceleration of Java applications. The IBM-developed CUDA4J library provides application programming interfaces (APIs) that allow Java programs to control and direct work to the Tesla engine using the normal Java memory model. Early experiments with Tesla-accelerated Java applications have yielded speed improvements approaching, and in some cases exceeding, an order of magnitude, with the promise of better to come.
Reaching the Summit
The Power+NVIDIA combination has drawn the attention of one of the biggest supercomputer customers around: the U.S. Department of Energy (DoE). Responsible for both the Oak Ridge and Lawrence Livermore laboratories, the DoE has long operated some of the largest supercomputers in the world. Just last fall, the DoE announced that the next-generation flagship computers commissioned for these labs would be based on the IBM Power9 + NVIDIA Volta technology combination. The largest of these machines, codenamed Summit, is due for delivery in 2017. Taking over from Titan, an Opteron+Tesla-based system currently ranked as the second most powerful supercomputer in the world, Summit will be more than five times as powerful.
So when your big data starts demanding some big computing, you’ll know where to find it.
Related Courses
Power Systems for AIX I: LPAR Configuration and Planning (AN11G)
POWER8 Systems and AIX Enhancements (AN101G)