I hear a lot of talk about big data these days. What does that mean for the Power and AIX world? Let’s start by defining what big data means and then we’ll have a look at one solution IBM has built to deal with it.
Over time, we accumulate data: all sorts of data, and a lot of it. Over the past year, my discussions around storage solution design have come to routinely use the word petabyte, where just a year earlier the same discussion would likely have used the word terabyte. Here is a statistic to consider: the amount of data accumulated every 48 hours today is roughly equal to all of the information generated in human history up to the year 2003. Big data, indeed!
Like mining for gold, there is valuable information hidden in all that data, but, like gold ore, that data has to be processed for the value to be extracted. This has always been true, but only in the recent past has the infrastructure existed to make this extraction practical for the huge amounts of data we are now called on to deal with. Since no single machine can read and process petabytes of data in any reasonable time, the only practicable approach is to divide up the work and distribute it across multiple machines. For this, we need an infrastructure that distributes and manages large quantities of data across a set of machines. We also need to be able to write data mining code that runs in parallel across that same set of machines, preferably so that the portion of the code executing on any one machine operates, to the greatest extent possible, on the subset of the data stored locally on that machine. This is not a trivial task.
In the last decade, the increase in both data volumes and data processing capacity has fuelled a lot of great work in the area of parallel compute clusters. The IBM Data Engine for Analytics offering is a turnkey infrastructure solution that represents the current state of that art. Let’s look at the four key components of an IBM Data Engine:
1. The Power8 CPU
The heart of the IBM Data Engine is the Power8 processor. Introduced in April 2014, the Power8 is the latest in the long line of Power CPUs dating back to 1990. A single Power8 socket executes eight hardware threads on each of its 12 cores, for a total of 96 concurrent threads of execution. Benchmark testing indicates that the Power8 currently outperforms competing Intel and Sparc processors by a comfortable margin, but that is only the tip of the iceberg.
For the first time, IBM is opening up the Power architecture through the creation of the OpenPower Foundation, allowing third-party vendors to build Power-based devices, much like the licensing model ARM has used so successfully to establish itself in the mobile and embedded systems market. This creates the possibility for a non-IBM ecosystem to grow up around the architecture. Current OpenPower members, such as Tyan and Suzhou PowerCore, are working on Power8 motherboards and CPUs. This has the potential to bring Power8 technology to market at a competitive price point.
Perhaps more significantly, IBM has replaced the proprietary on-chip GX bus with the industry-standard PCIe bus and designed an interface called the Coherent Accelerator Processor Interface (CAPI), which allows PCIe-attached devices to access the processor's virtual address space directly. This opens a path to third-party PCIe-based coprocessor devices able to operate as task-specific extensions to a Power8 CPU. Nvidia and IBM recently announced the availability of the Nvidia Tesla GPU installed in a Power8 system, and Nvidia has contributed its NVLink fast interconnect technology to the OpenPower initiative, promising even tighter integration of Nvidia GPUs and Power8 CPUs in future systems. Other vendors are working on general-purpose coprocessors based on field-programmable gate arrays (FPGAs) and on CAPI-attached flash memory units. This ability to easily add task-specific hardware capacity can make a huge difference in execution times for big data parallel processing jobs. It could turn out to be a game changer, potentially adding enormously to the already significant horsepower of the Power8 chip.
2. Red Hat Linux
In the big data world, Linux has emerged as the operating system of choice, and rather than bucking that trend, IBM has embraced it, giving the nod to the Red Hat Enterprise Linux (RHEL) distribution as the core operating system in the data engine offering. Using Linux allows the data engine to leverage the substantial library of open source big data utilities currently available, as well as additional proprietary, value-added IBM tools.
3. InfoSphere BigInsights and InfoSphere Streams
The dominant big data software stack is the open source Apache Hadoop project. Originally written in 2005, Hadoop became a top-level Apache project in 2008 and, as such, is distributed under the Apache open source license. The core components of Hadoop include a distributed file system, a compute resource manager, and MapReduce, the heart of the Hadoop computing model. MapReduce is the element of the stack that divides the task at hand across multiple nodes and then aggregates the results at the end. A typical Hadoop installation also includes a variety of other supporting components collectively known as the Hadoop Ecosystem.
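To make the MapReduce model concrete, here is a minimal sketch of the classic word-count job written against the standard Apache Hadoop Java API. The map step emits a count of one for each word found in the input split stored locally on a node; the framework then shuffles all counts for the same word to a single reducer, which sums them. The class name and input/output paths are illustrative only and are not specific to the IBM Data Engine.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each node, ideally against the block of
  // input data stored locally, and emits (word, 1) for every word it sees.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework delivers all counts for a given word to one
  // reducer, which sums them into the final result.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in the cluster
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in the cluster
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Compiled into a jar and submitted to the cluster, this single program is automatically split across however many nodes hold blocks of the input, which is exactly the divide-and-aggregate pattern described earlier.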
For the IBM Data Engine, the core software component is InfoSphere BigInsights, which packages a complete Hadoop Ecosystem (specifically, Hadoop plus Pig, Jaql, Hive, HBase, Flume, Lucene, Avro, ZooKeeper and Oozie) along with a suite of IBM-developed value-added utilities. These include installation tools, a web console, a text processing engine for identifying items of interest in documents and messages, an Eclipse plugin to aid in developing custom text analytic functions, a spreadsheet-like data analysis tool, JDBC integration with Netezza and DB2, extensions to Hadoop's job scheduler, support for LDAP-based authentication mechanisms, performance improvements for processing text-based compressed data, and adaptive runtime control for Jaql jobs.
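Because the bundled Hive component exposes a standard JDBC interface, results held in the cluster can be queried from ordinary applications. The sketch below assumes a stock HiveServer2 endpoint and a hypothetical clicks table; the host name, port, credentials and table are placeholders, not values specific to BigInsights.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Standard Apache Hive JDBC driver; the connection details below are
    // illustrative placeholders, not defaults of any particular cluster.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://cluster-master.example.com:10000/default", "hadoopuser", "");
         Statement stmt = conn.createStatement();
         // Aggregate a hypothetical click-log table stored in the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```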
Also included in the data engine software stack is InfoSphere Streams. Streams is designed for developing parallel analysis code that can deal with live data streams, in contrast with a Hadoop cluster, where work is generally batch-oriented and operates on static data. Streams is an IBM product founded on work done at the IBM Watson Research Center. Streams clusters are capable of processing millions of events per second in streams of audio, video, geophysical, financial or medical data.
4. Platform Computing
Two key products from the Platform Computing portfolio are included in the IBM Data Engine. Platform Cluster Manager provides provisioning and management functions, allowing rapid deployment of servers and networks within the data engine infrastructure, along with support for multiple tenants and a user self-service portal.
Storage within the data engine is handled by the General Parallel File System (GPFS), which is now part of the Platform Computing portfolio and carries a new name, Platform Elastic Storage. This highly scalable and reliable cluster file system originated from Tiger Shark, a research file system developed in the early 1990s at the IBM Almaden Research Center. GPFS uses a distributed storage model in which the storage workload is spread across multiple nodes in the cluster, each of which has direct access to a subset of the total physical storage in the cluster. In this way, I/O operations are parallelized not only across physical storage devices but also at the storage server level, allowing very high throughput combined with near-linear scalability. Additions in recent releases of GPFS include a big-data-friendly capability for placing data optimally across Hadoop-style processing clusters (the GPFS File Placement Optimizer) and the ability to make optimal use of SSD storage within the cluster.
When you build a data engine, the basic building block is the Power8 S822L server. This is a 2U, dual-socket machine providing 24 Power8 cores running at 3.3 GHz, with 256GB of memory. Currently, data engines can be built with between one and sixteen S822L compute nodes. At least one Hardware Management Console (HMC) is needed to manage the virtualization environment, and each data engine includes a single Power8 S821L cluster management server.
Storage is provided by one or more Elastic Storage Server (ESS) appliances. Each ESS consists of two S822L Power8 machines and up to four DCS3700 storage drawers, capable of serving just under 1PB per appliance, including 3TB of SSD storage. The various networking needs of the IBM Data Engine are handled by three Ethernet switches, supporting a mixture of speeds ranging from 1Gb to 40Gb.
The end result is an enterprise-ready turnkey solution capable of supporting multiple concurrent Hadoop, Streams or other data analytics tasks. It arrives onsite, according to IBM, “as a complete, pre-assembled infrastructure with preinstalled, tested software.” Supplied with a wealth of tooling to enable rapid deployment of new analytic tasks, it scales close to linearly, and, if contributions from the OpenPower consortium members materialize as hoped, it promises good potential for cost-effective customization and workload-specific optimization to meet future demands. It's an impressive package and should keep IBM a major player in the big data arena. But considering that the first proposal for a parallel processing computer came from researchers at the IBM Poughkeepsie Lab almost 60 years ago, that probably shouldn't be a surprise.
Leverage IBM Data Engine for Analytics at Your Organization:
POWER8 and AIX Enhancements Workshop
InfoSphere BigInsights Foundation
IBM Platform Elastic Storage 4.1 Administration for Linux