I have three different systems that store documents with document_ids: a search engine, a NoSQL database, and a self-developed semantic indexing application.

I run queries against all of these systems and would like to merge the results with something similar to a SQL JOIN. This means I can sometimes have three or more different datasets that I need to join on document_id.

Is MapReduce on Hadoop (or something similar) the best way to solve this? These datasets can contain anywhere from one document_id to 100,000.

Thanks for your time!

2 Answers

Hadoop is a good fit if you need to apply a lot of CPU to the documents before joining them. In the same job that processes the documents (in the map function), you can use the shuffle as a join engine fairly easily: emit document_id as the key, and every record with the same key ends up in the same reducer.
At the same time, a simple join of 100K items should not require more than a modest RDBMS.
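To make the shuffle-as-join idea concrete, here is a minimal reduce-side join sketch against the Hadoop MapReduce API. It assumes each source system is exported as a text file of tab-separated document_id/payload lines; the class names and input layout are illustrative assumptions, not from the question.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DocumentJoin {

    // Assumes each input line is "document_id<TAB>payload". Emitting
    // document_id as the key lets the shuffle group matching records.
    public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // All records sharing a document_id arrive together here;
    // concatenating them produces the joined row.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text v : values) {
                if (joined.length() > 0) joined.append('\t');
                joined.append(v.toString());
            }
            context.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "document join");
        job.setJarByClass(DocumentJoin.class);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // One input path per source: search engine export, NoSQL export, etc.
        for (int i = 0; i < args.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the framework groups map output by key, every record sharing a document_id arrives in a single reduce() call, which is exactly where the join happens.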

For small datasets like these, almost anything would work. In particular, I would recommend an in-memory system, since all of your data can easily fit in memory. GridGain is one such solution (full in-memory MapReduce, SQL support, and more, among many other things).
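To show how small this problem really is, here is a plain-Java hash-join sketch that merges any number of result sets in memory. It is independent of GridGain, and all names, as well as the one-record-per-id layout of each result set, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryJoin {

    // Groups records from any number of result sets by document_id.
    // A few hundred thousand entries fit comfortably in a HashMap.
    public static Map<String, List<String>> join(List<Map<String, String>> resultSets) {
        Map<String, List<String>> joined = new HashMap<>();
        for (Map<String, String> resultSet : resultSets) {
            for (Map.Entry<String, String> entry : resultSet.entrySet()) {
                joined.computeIfAbsent(entry.getKey(), id -> new ArrayList<>())
                      .add(entry.getValue());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> searchEngine = Map.of("doc-1", "score=0.9", "doc-2", "score=0.4");
        Map<String, String> nosqlStore   = Map.of("doc-1", "status=active");
        Map<String, List<String>> result = join(List.of(searchEngine, nosqlStore));
        // doc-1 has hits from both sources; doc-2 only from the search engine.
        // (An inner join would keep only ids present in every result set.)
        result.forEach((id, values) -> System.out.println(id + " -> " + values));
    }
}
```

This outer-join behavior keeps every document_id seen anywhere; filtering to ids present in all result sets would give you the inner-join semantics of a SQL JOIN.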
