If you’re like most companies, you’re probably sick of hearing about “big data.” However, like most companies, you’re probably also still years away from taking advantage of it. Regardless of company size or vertical, proper implementation of big data storage and analysis technologies can both save and make your organization a ton of cash. At the very least, these technologies can help you understand how best to serve your customers by crawling through their behavior logs and discovering what they want, as well as the best way to get it to them.
In the following article, we’ll walk you through the “what and why” of big data, and show you how your business can benefit.
People aren’t lemmings, but we do move in groups. In our digital age, those group movements leave very clear tracks in the form of logs — website logs, point-of-sale logs, chat logs, mobile app logs and purchase histories, just to name a few. The difficulty in extracting value out of those systems is what we refer to as the “Three Vs of Big Data” — velocity, volume and variety. To understand our customers, we first must store a great deal of information about them as cost-effectively as possible. We must next go back and analyze this information from all these disparate systems and come up with an actionable plan as quickly as possible. If we do everything properly, we gain valuable insights that really move the needle on customer satisfaction and raw profit.
The biggest part of “doing everything properly” is choosing the appropriate storage mechanism for this data. Ideally we want one that is cost-effective, fast and able to store any type of information, including structured (like row-column) and unstructured (like raw chat logs) data. Many companies have tried to force their go-to storage mechanism, the Relational Database Management System (RDBMS), into that role. The problem with this method is that virtually every RDBMS out there (from MySQL to Oracle) quickly hits one or more of the “three Vs” limits.
Because these engines were not designed as distributed systems from the start, there are hard and low limits on how much information they can store, as well as on the concurrent read and write volumes they can accommodate. To get around this, some companies try to apply sharding or aggregation techniques to these structured data stores, essentially turning them into something other than an RDBMS by breaking third normal form. This is like trying to turn a hammer into a screwdriver, when what you really need is another tool.
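To make the sharding workaround concrete, here is a minimal Python sketch of hash-based sharding done in application code. It is an illustration only: the shard count, the in-memory dictionaries standing in for separate database servers, and the function names (`shard_for`, `save_order`, `orders_for`, `total_revenue`) are all assumptions invented for this example, not part of any particular product.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration

# Each dict stands in for a separate database server (an assumption;
# a real deployment would hold one connection per server instead).
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(customer_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a customer ID to a shard index using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def save_order(customer_id: str, order: dict) -> None:
    """Store an order on the shard that owns this customer."""
    shard = shards[shard_for(customer_id)]
    shard.setdefault(customer_id, []).append(order)


def orders_for(customer_id: str) -> list:
    """Single-customer lookup: only one shard needs to be consulted."""
    return shards[shard_for(customer_id)].get(customer_id, [])


def total_revenue() -> float:
    """Cross-customer query: the application must visit every shard
    and combine the results itself, work a single RDBMS would have
    done with one SQL statement."""
    return sum(
        order["amount"]
        for shard in shards
        for orders in shard.values()
        for order in orders
    )
```

Note how the single-customer lookup stays cheap while the cross-customer total has to fan out to every shard. Joins, transactions and constraints that span shards become the application's responsibility, which is the sense in which the result is no longer a conventional RDBMS.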
If you’re still clinging to your RDBMS as your central “data hub,” then you’re probably long overdue for at least a partial move to a distributed storage mechanism. Systems like HDFS, NoSQL databases and object stores such as Riak CS (for private cloud) or AWS’ S3 and Google’s Cloud Storage (for public cloud) are all excellent choices. In reality, you’ll probably want to combine multiple options, an approach commonly known as polyglot persistence, in order to make the most of the strengths of each.
One of the big areas where these distributed systems can help is with managing costs. When your division is up for a hardware refresh or you’ll soon need to purchase more licenses for your high-end file servers or databases, you’ll find that these open-source alternatives can save you 90 percent or more on raw, per-terabyte storage costs. If your IT department is simply shoveling more money at your hardware and software vendors, it may not have fully investigated ways to use these lower-cost systems.
Another advantage of deploying a large-scale, distributed storage engine like HDFS or an object store is the teardown of data silos and fiefdoms. In any organization of decent size, independent departments control independent slices of the company’s collected data, and pulling all the parts together to form a whole can be a serious logistical challenge. If delays and complications in collecting all of the “data trees” are preventing you from drawing a coherent “data forest,” then you probably need a big-data-focused re-architecture of your hierarchies, your storage systems, or most likely both.
But probably the biggest flashing alarm that you’re overdue for a big data pivot is the type of output you’re deriving from analysis of all this collected data. At its essence, this analysis falls into two buckets: reactive reports and proactive reports. Although it’s useful to know what happened last week or last quarter, the business that is only looking in the rear-view mirror is eventually going to drive off a cliff. The exciting part of big data, and the part that has the most value, is predictive analytics: helping the business decide what to do next week and next quarter.
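The difference between the two kinds of output can be shown with a small Python sketch. The reactive function summarizes what already happened; the proactive one fits a straight-line trend by ordinary least squares and projects one period ahead. The straight-line model is a deliberately simple stand-in for real predictive analytics, and the function names and the sales figures in the usage note are hypothetical, made up for this example.

```python
from statistics import mean


def reactive_report(weekly_sales):
    """Rear-view mirror: summarize what has already happened."""
    return {"total": sum(weekly_sales), "average": mean(weekly_sales)}


def forecast_next(weekly_sales):
    """Look ahead: fit a least-squares line through the weekly
    figures and extrapolate it to the next week."""
    n = len(weekly_sales)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")

    weeks = range(n)  # week indices 0, 1, ..., n-1
    week_avg = mean(weeks)
    sales_avg = mean(weekly_sales)

    # Ordinary least-squares slope and intercept.
    slope = sum(
        (w - week_avg) * (s - sales_avg) for w, s in zip(weeks, weekly_sales)
    ) / sum((w - week_avg) ** 2 for w in weeks)
    intercept = sales_avg - slope * week_avg

    # The next week has index n.
    return intercept + slope * n
```

Given four hypothetical weeks of sales such as `[100, 110, 120, 130]`, `reactive_report` describes the past (a total of 460 and an average of 115), while `forecast_next` projects the following week from the trend. Real predictive systems use far richer models and data, but the split between describing and projecting is the same.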
Only a well-oiled big data machine that is highly responsive, fully informed, cost-effective and ever-evolving will be able to give business leadership the precise, actionable insights that keep a company ahead of its competition year after year.