
How do I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest emitting the records based on a common key and, in the reducer, adding them to, say, a HashMap and then taking a cross product (e.g. Join of two datasets in Mapreduce/Hadoop).
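
For context, a minimal sketch of the reduce-side join pattern that answer describes, with mappers tagging each record by source; the "A:"/"B:" tag convention here is illustrative, not taken from the linked answer:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side join: mappers emit (joinKey, "A:"+record) or (joinKey, "B:"+record);
    // the reducer buffers both sides per key and emits their cross product.
    public class JoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> aSide = new ArrayList<>();
            List<String> bSide = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A:")) {
                    aSide.add(s.substring(2));
                } else {
                    bSide.add(s.substring(2));
                }
            }
            // The cross product: this is the step that exhausts the heap when
            // either list holds millions of records for a single key.
            for (String a : aSide) {
                for (String b : bSide) {
                    context.write(key, new Text(a + "\t" + b));
                }
            }
        }
    }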

This solution is very good and works for the majority of cases, but my issue is rather different. I am dealing with data that has billions of records, and taking a cross product of the two sets is impossible, because in many cases the HashMap would end up holding a few million objects. So I run into a heap space error.

I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know whether there is any solution that can help me avoid this issue.

  • You're doing something wrong. That answer actually gives you the only way to do a join in MR (short of an in-memory join through the distributed cache and some other sorcery). If you run out of heap, you're obviously keeping too much stuff in memory or your heap size is too small; try raising it with -XmxSIZE. Does each row of your data contain billions of records?
    – TC1
    Commented May 19, 2013 at 9:46
  • If you read that answer, the author suggests keeping two lists in memory. In my case these lists are insanely large, not because of the size of each record but because of the number of items, which surely exceeds millions if not billions. Commented May 19, 2013 at 10:07

2 Answers


I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:

When running on a line of type A, look up the key in Cassandra. If it exists, merge the A records into the existing value (the B elements). If not, create the key and add the A elements as the value.

When running on a line of type B, look up the key in Cassandra. If it exists, merge the B records into the existing value (the A elements). If not, create the key and add the B elements as the value.
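
A minimal sketch of this lookup-and-merge step. The KeyValueStore interface below is hypothetical (a stand-in for whatever Cassandra client you use), and the string-tagging of records is illustrative:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical key-value client; in practice this would wrap a Cassandra driver.
    interface KeyValueStore {
        List<String> get(String key);              // returns null if the key is absent
        void put(String key, List<String> value);
    }

    public class KvStoreJoin {

        private final KeyValueStore store;

        public KvStoreJoin(KeyValueStore store) {
            this.store = store;
        }

        // Called once per input line; 'side' is "A" or "B".
        public void process(String key, String side, String record) {
            List<String> existing = store.get(key);
            if (existing == null) {
                // Key not seen yet: create it with this record as the first element.
                existing = new ArrayList<>();
            }
            // Merge this record into the value, which may already hold the
            // other side's elements.
            existing.add(side + ":" + record);
            store.put(key, existing);
        }
    }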

This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.


You should look into how Pig does skewed joins. The idea is that if your data contains too many values with the same key (even if there is no overall data skew), you can create artificial keys to spread out the key distribution. This ensures that each reducer gets fewer records than it otherwise would. For example, if you were to suffix "1" to 50% of the occurrences of key "K1" and "2" to the other 50%, half the records would go to the reducer handling "K11" and the other half to the one handling "K12" (see the sketch below).
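
A minimal sketch of this salting idea, assuming the hot keys are already known; the HOT_KEYS set, the salt count, and the "-" separator are all illustrative choices, not Pig internals:

    import java.util.Random;
    import java.util.Set;

    public class SaltedKeys {

        // Assumed known in advance, e.g. from a sampling pass over the input.
        private static final Set<String> HOT_KEYS = Set.of("K1");
        private static final int NUM_SALTS = 2;   // reducers to spread each hot key over
        private final Random random = new Random();

        // Large side of the join: send each record to one salted key at random.
        public String saltLargeSide(String key) {
            if (!HOT_KEYS.contains(key)) {
                return key;
            }
            return key + "-" + random.nextInt(NUM_SALTS);
        }

        // Small side of the join: replicate each record to every salted key,
        // so every reducer still sees all the small-side rows it can match.
        public String[] saltSmallSide(String key) {
            if (!HOT_KEYS.contains(key)) {
                return new String[] { key };
            }
            String[] salted = new String[NUM_SALTS];
            for (int i = 0; i < NUM_SALTS; i++) {
                salted[i] = key + "-" + i;
            }
            return salted;
        }
    }

Replicating the small side is what keeps the join correct after salting: every salted partition of a hot key still receives the full set of matching rows from the other input.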

If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm to estimate it first, e.g. as sketched below.
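
A rough sketch of such a sampling pass; the sample rate and hot-key threshold are arbitrary illustrative values:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    public class KeySampler {

        private static final double SAMPLE_RATE = 0.01;   // look at roughly 1% of records
        private static final long HOT_THRESHOLD = 10_000; // sampled count that marks a key as hot

        private final Map<String, Long> counts = new HashMap<>();
        private final Random random = new Random();

        // Feed every record's key through this in a cheap pre-pass before the join job.
        public void observe(String key) {
            if (random.nextDouble() < SAMPLE_RATE) {
                counts.merge(key, 1L, Long::sum);
            }
        }

        // Keys whose sampled frequency suggests they would overload a single reducer.
        public Set<String> hotKeys() {
            Set<String> hot = new HashSet<>();
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() >= HOT_THRESHOLD) {
                    hot.add(e.getKey());
                }
            }
            return hot;
        }
    }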
