
How do I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest emitting the records based on a common key and, in the reducer, adding them to, say, a HashMap and then taking a cross product (e.g. Join of two datasets in Mapreduce/Hadoop).
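
For context, a minimal sketch of the reduce-side join pattern that answer describes, with mappers tagging each record by source; the "A:"/"B:" tag convention here is illustrative, not taken from the linked answer:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side join: mappers emit (joinKey, "A:"+record) or (joinKey, "B:"+record);
    // the reducer buffers both sides per key and emits their cross product.
    public class JoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> aSide = new ArrayList<>();
            List<String> bSide = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A:")) {
                    aSide.add(s.substring(2));
                } else {
                    bSide.add(s.substring(2));
                }
            }
            // The cross product: this is the step that exhausts the heap when
            // either list holds millions of records for a single key.
            for (String a : aSide) {
                for (String b : bSide) {
                    context.write(key, new Text(a + "\t" + b));
                }
            }
        }
    }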

This solution is very good and works for the majority of cases, but my issue is rather different. I am dealing with data that has billions of records, and taking a cross product of the two sets is impossible, because in many cases the HashMap would end up holding a few million objects. So I run into a heap space error.

I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know whether there is any solution that can help me avoid this issue.

  • You're doing something wrong. That answer actually gives you the only way to do a join in MR (short of an in-memory join through the distributed cache and some other sorcery). If you run out of heap, you're obviously keeping too much stuff in memory or your heap size is too small; try raising it with -XmxSIZE. Does each row of your data contain billions of records?
    – TC1
    Commented May 19, 2013 at 9:46
  • If you read that answer, the author suggests keeping two lists in memory. In my case these lists are insanely large, not because of the size of each record but because of the number of items, which surely exceeds millions if not billions. Commented May 19, 2013 at 10:07

2 Answers


I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:

When running on a line of type A, look up the key in Cassandra. If it exists, merge the A records into the existing value (the B elements). If not, create the key and add the A elements as the value.

When running on a line of type B, look up the key in Cassandra. If it exists, merge the B records into the existing value (the A elements). If not, create the key and add the B elements as the value.
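
A minimal sketch of this lookup-and-merge step. The KeyValueStore interface below is hypothetical (a stand-in for whatever Cassandra client you use), and the string-tagging of records is illustrative:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical key-value client; in practice this would wrap a Cassandra driver.
    interface KeyValueStore {
        List<String> get(String key);              // returns null if the key is absent
        void put(String key, List<String> value);
    }

    public class KvStoreJoin {

        private final KeyValueStore store;

        public KvStoreJoin(KeyValueStore store) {
            this.store = store;
        }

        // Called once per input line; 'side' is "A" or "B".
        public void process(String key, String side, String record) {
            List<String> existing = store.get(key);
            if (existing == null) {
                // Key not seen yet: create it with this record as the first element.
                existing = new ArrayList<>();
            }
            // Merge this record into the value, which may already hold the
            // other side's elements.
            existing.add(side + ":" + record);
            store.put(key, existing);
        }
    }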

This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.


You should look into how Pig does skewed joins. The idea is that if your data contains too many values with the same key (even if there is no overall data skew), you can create artificial keys to spread out the key distribution. This ensures that each reducer gets fewer records than it otherwise would. For example, if you were to suffix "1" to 50% of the occurrences of key "K1" and "2" to the other 50%, half the records would go to the reducer handling "K11" and the other half to the one handling "K12" (see the sketch below).
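
A minimal sketch of this salting idea, assuming the hot keys are already known; the HOT_KEYS set, the salt count, and the "-" separator are all illustrative choices, not Pig internals:

    import java.util.Random;
    import java.util.Set;

    public class SaltedKeys {

        // Assumed known in advance, e.g. from a sampling pass over the input.
        private static final Set<String> HOT_KEYS = Set.of("K1");
        private static final int NUM_SALTS = 2;   // reducers to spread each hot key over
        private final Random random = new Random();

        // Large side of the join: send each record to one salted key at random.
        public String saltLargeSide(String key) {
            if (!HOT_KEYS.contains(key)) {
                return key;
            }
            return key + "-" + random.nextInt(NUM_SALTS);
        }

        // Small side of the join: replicate each record to every salted key,
        // so every reducer still sees all the small-side rows it can match.
        public String[] saltSmallSide(String key) {
            if (!HOT_KEYS.contains(key)) {
                return new String[] { key };
            }
            String[] salted = new String[NUM_SALTS];
            for (int i = 0; i < NUM_SALTS; i++) {
                salted[i] = key + "-" + i;
            }
            return salted;
        }
    }

Replicating the small side is what keeps the join correct after salting: every salted partition of a hot key still receives the full set of matching rows from the other input.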

If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm to estimate it first, e.g. as sketched below.
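
A rough sketch of such a sampling pass; the sample rate and hot-key threshold are arbitrary illustrative values:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    public class KeySampler {

        private static final double SAMPLE_RATE = 0.01;   // look at roughly 1% of records
        private static final long HOT_THRESHOLD = 10_000; // sampled count that marks a key as hot

        private final Map<String, Long> counts = new HashMap<>();
        private final Random random = new Random();

        // Feed every record's key through this in a cheap pre-pass before the join job.
        public void observe(String key) {
            if (random.nextDouble() < SAMPLE_RATE) {
                counts.merge(key, 1L, Long::sum);
            }
        }

        // Keys whose sampled frequency suggests they would overload a single reducer.
        public Set<String> hotKeys() {
            Set<String> hot = new HashSet<>();
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() >= HOT_THRESHOLD) {
                    hot.add(e.getKey());
                }
            }
            return hot;
        }
    }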
