Using Hadoop Like a Boss
Once you’re doing real development, you’ll want to get into the habit of using smaller, test datasets on your local machine and running your code iteratively in Local Jobrunner Mode (which lets you locally test and debug your Map and Reduce code), then Pseudo-Distributed Mode (which more closely mimics the production environment), then finally Fully-Distributed Mode (your real production cluster). By doing this iterative development, you’ll be able to get bugs worked out on smaller subsets of the data so that when you run on your full dataset with real production resources, you’ll have all the kinks worked out, and your job won’t crash three-quarters of the way in.
Read more