I have three different systems that store documents with document_ids: a search engine, a NoSQL database, and a self-developed semantic indexing application.

I run queries against all of these systems and would like to merge the results with something similar to a SQL JOIN. This means I can sometimes have three or more different datasets that I need to join on document_id.

Is MapReduce on Hadoop (or something similar) the best way to solve this? These datasets can contain anywhere from one document_id to 100,000.

Thanks for your time!

2 Answers

Hadoop is a good fit if you need to apply a lot of CPU to the documents before joining them. In the same job that processes the documents (in the map function), you can use the shuffle as a join engine fairly easily: emit document_id as the key, and every record with the same key ends up in the same reducer.
At the same time, a simple join of 100K items should not require more than a modest RDBMS.
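To make the shuffle-as-join idea concrete, here is a minimal reduce-side join sketch against the Hadoop MapReduce API. It assumes each source system is exported as a text file of tab-separated document_id/payload lines; the class names and input layout are illustrative assumptions, not from the question.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DocumentJoin {

    // Assumes each input line is "document_id<TAB>payload". Emitting
    // document_id as the key lets the shuffle group matching records.
    public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // All records sharing a document_id arrive together here;
    // concatenating them produces the joined row.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text v : values) {
                if (joined.length() > 0) joined.append('\t');
                joined.append(v.toString());
            }
            context.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "document join");
        job.setJarByClass(DocumentJoin.class);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // One input path per source: search engine export, NoSQL export, etc.
        for (int i = 0; i < args.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the framework groups map output by key, every record sharing a document_id arrives in a single reduce() call, which is exactly where the join happens.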

For small datasets like these, almost anything would work. In particular, I would recommend an in-memory system, since all of your data can easily fit in memory. GridGain is one such solution (full in-memory MapReduce, SQL support, and more, among many other things).
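To show how small this problem really is, here is a plain-Java hash-join sketch that merges any number of result sets in memory. It is independent of GridGain, and all names, as well as the one-record-per-id layout of each result set, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryJoin {

    // Groups records from any number of result sets by document_id.
    // A few hundred thousand entries fit comfortably in a HashMap.
    public static Map<String, List<String>> join(List<Map<String, String>> resultSets) {
        Map<String, List<String>> joined = new HashMap<>();
        for (Map<String, String> resultSet : resultSets) {
            for (Map.Entry<String, String> entry : resultSet.entrySet()) {
                joined.computeIfAbsent(entry.getKey(), id -> new ArrayList<>())
                      .add(entry.getValue());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> searchEngine = Map.of("doc-1", "score=0.9", "doc-2", "score=0.4");
        Map<String, String> nosqlStore   = Map.of("doc-1", "status=active");
        Map<String, List<String>> result = join(List.of(searchEngine, nosqlStore));
        // doc-1 has hits from both sources; doc-2 only from the search engine.
        // (An inner join would keep only ids present in every result set.)
        result.forEach((id, values) -> System.out.println(id + " -> " + values));
    }
}
```

This outer-join behavior keeps every document_id seen anywhere; filtering to ids present in all result sets would give you the inner-join semantics of a SQL JOIN.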
