Follow up on distributed MapReduce with MongoDB

In one of our earlier posts I tried to point out how much scaling out can help with improving MapReduce performance on MongoDB.

Another topic which I didn’t cover then is sharded output for MapReduce results. This means that you can tell MongoDB to immediately split MR result collections across several nodes. Doing so has one big advantage: If you do cascaded MR operations you benefit from the same effect I explained earlier. Unfortunately sharded output in MongoDB version prior to 2.2 is not advised to be used by 10gen.

In one of my current projects we do some calculations in several steps based on prior MR result collections. To give you an idea of how much time you can save by sharding the MR output I made some benchmarks (I used five m1.xlarge instances for these):

comparison of sharded vs non-sharded output

And just to tease you I also have some numbers when using Hadoop on EMR (using 15 m2.4xlarge instances):

comparison of MR with mongodb and hadoop

One could argue that using bigger instances for Hadoop is not a fair comparison but on the other hand the performance of EMR is so much better that this doesn’t matter significantly!

Watch out for an upcoming blog post where I will describe details on how we exactly configured the MongoDB-Hadoop connector!

Advertisements

One thought on “Follow up on distributed MapReduce with MongoDB

  1. Tomislav Zorc

    Another relevant fact: EMR/Hadoop is designed specifically for batch processing. It can scale up very quickly for heavy loads and shut down after the job is completed. This brings technical and financial advantages: you don’t have to change your MongoDB configuration optimized for online/low-latency access just to get your batch jobs faster and you pay only instance hours used during the job. Crunching several terabytes of data for just 50 $ is a nice thought! 😉

    So yes, the comparison in this case is not quite fair and there definitely are cases where you can get away with MongoDB M-R Jobs only. Let’s not forget about the new Aggregation Framework in 2.2.

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s