Old school MapReduce (MR) has been in widespread production use for several years now, and it certainly has both raving fans as well as detractors. Two of the most common and valid complaints the critics have is that MR can be both difficult to use (it requires some programming expertise) and that MR jobs take too long to complete.
Although solutions like Pig and Hive have long helped make MR more accessible to non-developers, only recently has the community been able to achieve faster run times.
With Impala, Cloudera addressed both of these concerns in one product: a simple-to-use engine that sits on top of your existing Hadoop cluster that can now return query results up to 70x faster.
What is Impala?
If you’re not excited about that last statement, you may need to check for a pulse. Cloudera Impala may well be the biggest game changer that the Hadoop community has ever seen. Learning what it is, even at a high level, could open up entirely new and lucrative possibilities for your business.
At its core, Impala is a set of isolated daemons that run on top of an existing Hadoop cluster and provide query access to the underlying HDFS data via a SQL-like query language. By compromising on some of the considerations made in MR, such as fault tolerance, and instead optimizing for raw execution time, Impala can return results much more quickly—often in the range of 10-70x faster than standard MR. Impala, like many other Cloudera-developed products, is 100% open source, free to use, and has no Cloudera-specific dependencies whatsoever.
It’s a common mistake to assume that Impala will replace MR. It may, for some current workloads, but it’s more proper to think of it as a complementary product. The lack of fault tolerance, for example, means that should a node fail during an Impala query, the whole query would fail.
By comparing the SQL-like interface, it’s also a fairly common mistake to assume that Impala is a Hive replacement. However, where Hive really does nothing except provide a SQL-like interface to mask the complexity of MR jobs, Impala treats SQL as a first-class citizen, routing queries to a whole separate set of Impala-specific daemons on each slave. The data being operated on is still in HDFS (or HBase) and accessible by Impala and MR, both of which could be running side-by-side. Impala can even run against existing Hive tables.
Architecturally, Impala requires one or more daemons to perform the execution, and a Hive metastore where table definitions are stored. A single “impalad” daemon runs on each of the slave nodes where the HDFS data is stored, and a single, optional, “statestored” daemon should be running somewhere in the same cluster. Statestore checks the health of the impalad daemons, rolling dead ones out and live ones back in. Should statestored ever go down, Impala will continue to work, just in a less-robust (and, if nodes are going down a lot, degraded) fashion.
Because the Impala daemons are so hyper-focused on speed, Impala is missing some of the functionality that MR users are used to. Namely: data deletion, full text search, indexing, UDFs, custom serialization/deserialization or “SerDes” classes, and querying of streaming data, to name but a few. It should be noted that, with the very recent release of Sentry, both Impala and Hive can now provide the fine-grained level of access control that has long been common in the RDBMS realm, including authorization at the table/column level.
Both MR and Hive will continue to have a wide range of viable use cases, namely workloads which cannot compromise on the aforementioned functionality—tasks like complex analytics, ETL, and other types of transformations. That said, Impala will no doubt begin both cannibalizing some tasks currently served by Hive and/or MR, as well as serving new tasks such as running more ad-hoc, interactive, and exploratory analytics on data. Business analysts, in particular, will very much appreciate being able to get real-time query responses. For the right workload, one could even consider using Impala to serve real-time queries from external users.
If your users are already familiar with Hive, they’ll be happy to know that Impala supports a subset of the HiveQL query language. Any query that can run in Impala can also run in Hive. As I’ve mentioned, Impala and MR jobs can run nicely side-by-side, but because Impala is very memory intensive, you may want to bump up the memory in your slaves if you choose to use them both together.
Although there are several other products from other vendors in the works, Impala is the first “low-latency SQL-on-Hadoop” product released for production use. Cloudera competitors Hortonworks, MapR, IBM, and others have several competing products in the pipeline, and each is already touting their product benefits over Impala. The first of those products is scheduled for production for general availability in the third quarter of this year. By that time, Impala will have not only soaked up more market share, but will have probably already delivered many functionality improvements on its aggressive roadmap. It will be very difficult for others to catch up, and it’s doubtful that Cloudera will let them. In reality, the products are likely to differentiate much, making Impala the default choice due to both its maturity and industry-leader backing.
Only Impala can currently boast more than 40 enterprise users, and already companies like Expedia, 37Signals, Monsanto, and others are squeezing true business value out of this engine and this engine alone.
Impala delivers one of the most highly requested attributes that Hadoop has always been lacking: faster responses to queries. In exchange for a slightly smaller feature set, it enables us near real-time querying of our data. For businesses that have been on the sidelines waiting for that speed, or for organizations that have been struggling to make their MR jobs run faster, now may be the time to jump on the Hadoop bandwagon and begin realizing the many advantages of Impala.