In Part 1, we discussed the challenges Hadoop administration poses for newcomers, and reviewed the rationale and advantages of leveraging commercial Hadoop distributions to manage an evolving Hadoop platform. Today, we'll look at best practices for HDFS, MapReduce, and security.
Best Practices: HDFS and MapReduce
Possibly the best (and worst) facet of open source software is the speed at which things change and evolve. This is especially true right now in the Hadoop community, where the very recent release of CDH4 (based on Hadoop 2) marks significant advances in several areas.
With CDH4, we now have NameNode high availability and Federation (eliminating the NameNode's previous single point of failure (SPOF) and making the namespace scale better), MapReduce version 2 (which attempts to better scale and decentralize many of the MR processes), and a new API (which attempts to both unify and simplify development).
As an administrator, you need to know which of these options to use now, and which to keep abreast of and perhaps deploy later.
As of just a couple of months ago, the first order of business is setting up NameNode high availability, where a second physical machine acts as a hot standby, allowing quick manual (or automated) failover to the standby in the event of a primary NameNode failure. It's only a wee bit more difficult to set up than the standard "old school" Secondary NameNode, and it eliminates the NameNode's legacy SPOF.
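To make that concrete, here's a minimal sketch of a Java client addressing an HA-enabled cluster through a logical nameservice rather than a single NameNode host. The nameservice name ("mycluster") and hostnames are hypothetical, and in a real deployment these properties would live in hdfs-site.xml rather than being set in code:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "mycluster" is a hypothetical logical nameservice; nn1 and nn2 are
        // the two NameNodes (active and hot standby). Shown in code purely
        // for illustration -- normally this lives in hdfs-site.xml.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // The failover proxy provider lets the client discover which
        // NameNode is active, so a failover is transparent to applications.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Clients talk to the logical nameservice, not a specific host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("Home directory: " + fs.getHomeDirectory());
    }
}
```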
Even if you are using NameNode high availability, the sheer amount of data you're storing in HDFS can (and at some point will) blow out the resource capacity (RAM) of your NameNode server. To get around that, CDH4 introduced NameNode Federation, which allows administrators to break up their namespace and delegate management of individual directories to other machines running the NameNode daemon. Unless your cluster grows very large, it's unlikely that you'll need Federation, but it's nice to know it's there if you do. High availability and Federation can be used in concert.
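If you do end up federating, clients typically stitch the per-NameNode namespaces back into a single view with a ViewFS mount table. A rough sketch, again with hypothetical hostnames and mount points:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationViewSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical mount table: /user is served by one NameNode,
        // /data by another. Each link points at a separate namespace.
        conf.set("fs.viewfs.mounttable.default.link./user",
                "hdfs://namenode1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./data",
                "hdfs://namenode2.example.com:8020/data");

        // The client sees one unified namespace through the viewfs:// scheme.
        FileSystem fs = FileSystem.get(URI.create("viewfs://default/"), conf);
        System.out.println(fs.exists(new Path("/user")));
    }
}
```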
The MapReduce side gets a ground-up re-architecture as well, in MRv2. Unfortunately, it is not yet completely ready for production (though that should change in the coming months). With MRv2, the load on the former JobTracker is reduced (by delegating oversight of individual MR jobs to other daemons on the slaves), and the architecture becomes more flexible (admins no longer need to statically allocate individual Map and Reduce slots in the cluster). As the Hadoop community typically gets a little ahead of itself, it's probably a good idea to wait until MRv2 has been marked production-ready before deploying it to your cluster.
If your role as admin also includes dictating which version of the API your developers use, you should make the new API (API v2) the standard. It not only simplifies the coding process (largely through more compact syntax and more descriptive naming), it also unifies the calls to objects up and down the hierarchy.
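To give a feel for it, here's a minimal mapper written against the new API (the org.apache.hadoop.mapreduce package, which replaces the old org.apache.hadoop.mapred). Note the single Context object, which takes over from the old API's separate OutputCollector and Reporter:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A classic word-count mapper, new-API style: extend the Mapper base
// class directly instead of implementing the old Mapper interface.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // One Context object handles output, counters, and status.
                context.write(word, ONE);
            }
        }
    }
}
```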
Suggested Security
Hadoop was not designed with security in mind and, unfortunately, it only deals with the "authorization" aspect of a full "authentication and authorization" stack. What this means is that, out of the box, Hadoop trusts that users are who they say they are. If I create a "dcutting" user on my client machine and run jobs as that user, Hadoop gives me access to everything "dcutting" has access to, on both the HDFS and MapReduce sides. In short, it's meant to keep honest people from doing something stupid, not malicious people from doing something evil.
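To see how thin that trust model is, consider this sketch of a client that simply claims to be another user; under Hadoop's default "simple" authentication, the cluster takes the claim at face value (the username and path here are illustrative):

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class IdentitySpoofDemo {
    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        // With simple auth (the default), the NameNode believes whatever
        // identity the client presents -- no credentials are checked.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("dcutting");
        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
            // Everything in this block runs as "dcutting" as far as
            // the cluster is concerned.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/user/dcutting"))) {
                System.out.println(status.getPath());
            }
            return null;
        });
    }
}
```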
Although there are advanced ways to enable strong security in Hadoop, most notably Kerberos, one of the easiest ways to attain a reasonable level of security is to put your entire cluster on a separate network subnet and limit access to that subnet via a bastion host, VPN, or other password-protected gateway.
Should a compliance requirement drive the need for true authentication, it's worth mentioning that Kerberos integration and enhanced monitoring and alerting are two of the big reasons to move to a paid product like Cloudera Enterprise, which makes installation and maintenance of these features crazy simple.
Related Posts:
Become a Rock Star Hadoop Developer
Using Hadoop Like a Boss
Beyond the Buzz: Big Data and Apache Hadoop
Related Courses:
Cloudera Essentials for Apache Hadoop
Cloudera Administrator Training for Apache Hadoop
Cloudera Training for Apache Hive and Pig