In one of our earlier posts I tried to point out how much scaling out can improve MapReduce performance on MongoDB.
Another topic I didn't cover then is sharded output for MapReduce results. This means you can tell MongoDB to split MR result collections across several nodes right away. Doing so has one big advantage: if you run cascaded MR operations, each later step reads from a sharded collection and benefits from the same effect I explained earlier. Unfortunately, 10gen advises against using sharded output in MongoDB versions prior to 2.2.
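To make the option concrete, here is a minimal sketch of a `mapReduce` command document that requests sharded output through the `sharded` flag of the `out` option. The collection names (`events`, `events_per_user`), the field `user_id`, and the helper `build_sharded_mr_command` are made up for illustration; they are not taken from our project.

```python
from collections import OrderedDict

# Hypothetical map/reduce functions, given as JavaScript source strings.
MAP_JS = "function () { emit(this.user_id, 1); }"
REDUCE_JS = "function (key, values) { return Array.sum(values); }"


def build_sharded_mr_command(source, target, map_js, reduce_js):
    """Build a mapReduce command document that writes to a sharded collection.

    The command name must be the first key, hence the ordered mapping.
    """
    return OrderedDict([
        ("mapReduce", source),   # input collection
        ("map", map_js),
        ("reduce", reduce_js),
        # "sharded": True asks MongoDB to shard the output collection.
        ("out", {"replace": target, "sharded": True}),
    ])


cmd = build_sharded_mr_command("events", "events_per_user", MAP_JS, REDUCE_JS)
print(cmd["out"])
```

Assuming pymongo and a connection to a `mongos` of a cluster where sharding is enabled for the target database, the document could then be sent with something like `client["mydb"].command(cmd)`. A follow-up MR step would simply use `events_per_user` as its input collection.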
In one of my current projects we run calculations in several steps, each based on prior MR result collections. To give you an idea of how much time you can save by sharding the MR output, I ran some benchmarks (on five m1.xlarge instances):
And just to tease you I also have some numbers when using Hadoop on EMR (using 15 m2.4xlarge instances):
One could argue that using bigger instances for Hadoop is not a fair comparison, but the performance of EMR is so much better that the difference in instance size hardly matters!
Watch out for an upcoming blog post where I will describe exactly how we configured the MongoDB-Hadoop connector!