If you’ve spent more than five minutes researching big data, you’ve already come across technologies like Hadoop, BigQuery, EMR and RedShift, and you may even know what OLAP and NoSQL mean and where they fit in. What you may not have been exposed to, however, is the next and very important level down the rabbit hole: technologies like Tableau, MicroStrategy, QlikView, Hive and Impala.
Let’s take a look at each of these technologies and see where they fit and how they compare to each other.
No introductory conversation about big data is complete until we’ve talked about the “Three Vs”:
- Volume – How much raw data you’re storing
- Velocity – How quickly data arrives and has to be processed; really, we’re talking about latency under load, but “Two Vs and one LuL” isn’t nearly as catchy
- Variety – The need to store and process widely disparate data (structured, semi-structured and unstructured) together in the same engine
Because the constraints are so different from those of ACID-compliant online transaction processing (OLTP) systems, most of the big data analysis systems out there (like Hadoop) are after-the-fact online analytical processing (OLAP) systems that operate on static data. The exception to this OLAP rule is the NoSQL family of engines, which handle the Three Vs quite nicely and can also be used for real-time OLTP writes and reads.
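To make that real-time side concrete, here’s a minimal sketch of single-record OLTP-style writes and reads against a NoSQL engine, using Amazon’s DynamoDB (which we’ll meet again in a moment) through the boto3 library; the “users” table and its attributes are hypothetical placeholders.

```python
# Minimal sketch: real-time NoSQL reads and writes with DynamoDB via boto3.
# Assumes a table named "users" (hypothetical) already exists with a
# string partition key called "user_id".
import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("users")

# Single-record write and read -- the kind of low-latency OLTP operation
# that batch-oriented OLAP engines like Hadoop aren't built for.
table.put_item(Item={"user_id": "42", "name": "Ada", "plan": "pro"})
response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```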
In the early 2000s, most of the large online companies like Google, Amazon and Yahoo hit hard limits with the existing data storage and processing technologies, and each created its own replacements. Google spun out GFS, MapReduce and (much more recently) BigQuery, while Amazon developed the NoSQL engine Dynamo (which went on to become DynamoDB), a managed “Hadoop as a Service” called Elastic MapReduce (EMR), and then the OLAP engine RedShift. It was Yahoo, however, that developed and open-sourced arguably the most successful and most widely deployed solution out there: Hadoop.
Many of these systems started out with their own complex interfaces, and nearly all quickly began implementing native SQL-like interfaces on top. BigQuery and RedShift, which are both insanely fast and scalable, use SQL as their primary interface, while Hive (which just runs the slower MapReduce under the hood) and Impala (which runs its own separate, much faster set of daemons) provide SQL interfaces to Hadoop, making the power of Hadoop accessible to much wider audiences. Hive, for example, is behind an estimated 99% of Facebook’s interactions with its Hadoop cluster.
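To give you a taste of what those SQL interfaces look like in practice, here’s a minimal sketch of running a Hive query from Python with the third-party PyHive package; the hostname, table and columns are hypothetical placeholders, and the same query could be pointed at Impala instead (typically via the impyla package) for much faster interactive response.

```python
# Minimal sketch: querying Hive from Python via the PyHive package.
# HiveServer2 usually listens on port 10000; the host, table and columns
# below are hypothetical. Hive compiles this SQL into MapReduce jobs.
from pyhive import hive

conn = hive.Connection(host="hadoop-master.example.com", port=10000,
                       username="analyst")
cursor = conn.cursor()
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    WHERE dt = '2014-06-01'
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```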
Although SQL-like interfaces were the first and easiest way to make these technologies accessible, many organizations quickly found value in point-and-click visualization tools such as those from Tableau, QlikView and MicroStrategy. Whereas most of the “core” solutions like BigQuery, Hadoop, EMR and RedShift are either free (in the case of Hadoop) or low-cost, pay-as-you-go services (for all the others mentioned), these bolt-on visualization tools are licensed products that you purchase and connect to your core systems. They become your users’ main interface to BigQuery, Hadoop and/or RedShift.
Your decision to use one of the “Big Data as a Service” technologies like BigQuery, EMR or RedShift will largely depend on your choice of provider and your familiarity with each tool’s interface and architecture. Among these, BigQuery is probably the easiest to get your head around and the lowest in overall cost. Google even offers a free test drive of the service against several public datasets so you can try before you buy.
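If you want to take that test drive programmatically, here’s a minimal sketch using the google-cloud-bigquery Python client against one of Google’s public sample datasets; it assumes you already have a Google Cloud project and application-default credentials configured.

```python
# Minimal sketch: querying a BigQuery public sample dataset with the
# google-cloud-bigquery client. Assumes a Google Cloud project and
# application-default credentials are already set up.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.word, row.total)
```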
If you’re building your own cluster on your own in-house hardware, core Hadoop can’t be beat and likely never will be; it’s hard to imagine anything more cost-effective than free, open-source software running on commodity hardware. If you have very sporadic Hadoop workloads, though, AWS EMR gives you the best of both worlds: access to an on-demand, custom-sized Hadoop cluster for an hourly rental price.
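For that sporadic-workload case, here’s a minimal sketch of launching a transient EMR cluster with boto3; the names, instance types and S3 paths are placeholders, and because KeepJobFlowAliveWhenNoSteps is false, the cluster terminates itself when its steps finish, so you only pay for the hours you actually use.

```python
# Minimal sketch: renting an on-demand, custom-sized Hadoop cluster from
# EMR via boto3. All names, sizes and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="sporadic-analysis",
    ReleaseLabel="emr-5.30.0",          # selects the Hadoop/Hive versions
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                    # 1 master + 3 workers
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when idle
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",
    # In practice you'd also pass Steps=[...] listing the jobs to run.
)
print("Launched cluster:", response["JobFlowId"])
```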
Once you’ve decided on an analysis engine core, you may quickly find huge value in one of the visualization tools that allow you to hire a couple of low-cost interns and let them do some productive data analysis and “pretty graph building.” Using these technologies, even your experienced data scientists can do some powerful, productive analyses. Among these tools, pretty much everyone says the same thing: Tableau stands out as the best but most expensive solution, although we’ve heard great things about both MicroStrategy and the upstart QlikView as well.
No short blog post will do justice to the complexities, corner cases and comparisons of these technologies, but I hope I’ve helped you understand what each is so that your future research can be more focused.
Related Courses
Cloudera Training for Data Analysts: Using Pig, Hive, and Impala with Hadoop
Big Data on AWS
Data Science and Big Data Analytics