Loading Data in GraphDB: Best Practices and Tools
Data is just the first step on the path towards knowledge. That may sound like an empty proverb, so let's add some context to the claim. Imagine having the bytes that correspond to a photograph stored on your system. Viewed on their own, those bytes mean nothing. Only by putting them in the right context – an image viewing program – do we obtain the knowledge the picture conveys. And if we want to interpret those bytes differently, we need yet another program.
This works well when the data can be read by multiple tools. You may think images are great in this respect – after all, there are only half a dozen major image formats. However, some tools work only with vector graphics, others refuse to visualize them, and still others use their own proprietary formats. If that is the case with images, imagine how bad it must be for structured data stored in databases. In fact, if you are reading this, you probably don't need to imagine.
There are hundreds, if not thousands, of data formats, and getting data into just the right one – the format that helps you uncover its hidden knowledge – is a challenge. RDF is one of the best formats for building knowledge, but it is not mainstream yet and most data is not RDF-native. So how do you get that data into the right format?
Enter Converters
The traditional solution to the problem of converting data is to use a third-party converter. This has a few key benefits: it is platform-independent, reduces vendor lock-in and often follows an official standard.
The most popular conversion language for RDF is R2RML, the W3C standard for mapping relational data to RDF. R2RML covers all the bases: you describe how tables, columns and joins become resources, properties and values, and any compliant processor can execute the mapping.
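As a rough illustration, here is a minimal R2RML mapping sketch. The table, columns and vocabulary are hypothetical; a real mapping would follow your own schema.

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

# One triples map per logical table; EMPLOYEE is a hypothetical table.
<#EmployeeMapping> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "EMPLOYEE" ] ;
    # Each row becomes an IRI built from its primary key and gets a class.
    rr:subjectMap [
        rr:template "http://example.com/employee/{ID}" ;
        rr:class ex:Employee
    ] ;
    # Column values become literal properties.
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "NAME" ]
    ] .
```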
TARQL is another option when you are working directly with tabular files, e.g. CSV. With it, you don't need to write a mapping at all: you prepare a SPARQL CONSTRUCT query and execute it directly against the table, with each row binding variables named after the columns. The inputs are a table and a SPARQL query; the output is triples. It can't get much easier than that.
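For example, given a hypothetical people.csv with id and name columns, a TARQL query of roughly this shape emits one resource per row (the IRIs and properties are made up for illustration):

```sparql
PREFIX ex: <http://example.com/ns#>

# Each CSV row binds ?id and ?name from the columns of the same name.
CONSTRUCT {
  ?person a ex:Person ;
          ex:name ?name .
}
WHERE {
  # Build a stable IRI for the row from its id column.
  BIND (IRI(CONCAT("http://example.com/person/", ?id)) AS ?person)
}
```

Running something like `tarql mapping.sparql people.csv` then prints the resulting triples to standard output.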
If direct mappers aren't good enough because you need some custom logic, you can always implement your own converter. If you are writing code specifically with GraphDB in mind, you can use the optimized Java API or the JS client. Many data scientists are fans of Python; we ourselves often use Pandas and Rdflib, as in the sketch below. This can be combined with a Kafka pipeline to make sure that the messages relayed to GraphDB are repeatable and traceable. Finally, if you are using C#, we have tested integrations with dotNetRdf.
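Here is a minimal sketch of such a converter. It assumes a hypothetical people.csv with id and name columns, and a local GraphDB instance at http://localhost:7200 with a repository called mydata; adjust these for your own environment.

```python
import pandas as pd
import requests
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.com/ns#")

# Hypothetical input file and GraphDB endpoint - adjust for your environment.
CSV_FILE = "people.csv"
GRAPHDB_STATEMENTS = "http://localhost:7200/repositories/mydata/statements"

# Build an in-memory RDF graph from the tabular data.
df = pd.read_csv(CSV_FILE)
g = Graph()
for _, row in df.iterrows():
    person = URIRef(f"http://example.com/person/{row['id']}")
    g.add((person, RDF.type, EX.Person))
    g.add((person, EX.name, Literal(row["name"])))

# POST the serialized graph to the repository's statements endpoint.
response = requests.post(
    GRAPHDB_STATEMENTS,
    data=g.serialize(format="turtle").encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
)
response.raise_for_status()
```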
Mapping in Style with OntoRefine
TARQL and R2RML are great tools, but with both you have to write the mapping largely by hand in a text file, and that can be laborious. We have a solution that helps with that hurdle. GraphDB OntoRefine is based on the popular OpenRefine tool, initiated by Google and now maintained as a community-governed open-source project. This helps avoid vendor lock-in, as transformation logic can be scripted in GREL (Google Refine Expression Language). OntoRefine is integrated with GraphDB and allows quick ingestion of structured data in multiple formats (TSV, CSV, *SV, XLS, XLSX, JSON, XML, RDF as XML and Google Sheets).
The main strengths of OntoRefine are its powerful UI and flexible transformation capabilities. It starts with a data preview, similar to the import facility of Microsoft Excel. You can manipulate a sample file, modify the input and create a mapping. The mapping can be translated to JSON or to SPARQL. All operations can be exported as JSON and applied to any other input, so if you reuse the same patterns in other files and map them to the same RDF ontology, you can use the OntoRefine client or API to carry out the same transformations.
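For a flavor of what that exported operation history looks like, here is a small, hypothetical snippet in the OpenRefine operation format (the column names are made up):

```json
[
  {
    "op": "core/column-rename",
    "oldColumnName": "Full Name",
    "newColumnName": "name",
    "description": "Rename column Full Name to name"
  }
]
```

Saving such a recipe and replaying it against the next batch of files keeps the transformation reproducible.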
OntoRefine also goes a step further than TARQL and R2RML: it can ingest data directly into GraphDB, with no intermediate step. This adds a layer of flexibility: you can use the data from OntoRefine in all sorts of SPARQL queries and easily combine it with data already persisted in the database or in another OntoRefine project.
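As a rough sketch, pulling rows from an OntoRefine project and persisting them can look something like the update below. The project identifier, the predicates and the vocabulary are all placeholders; the exact SERVICE IRI and column predicates to use are generated by OntoRefine and shown in the project's SPARQL tab.

```sparql
PREFIX ex: <http://example.com/ns#>

# Persist data from an OntoRefine project into the repository.
INSERT {
  ?person a ex:Person ;
          ex:name ?name .
}
WHERE {
  # Placeholder service IRI - copy the real one from the OntoRefine project.
  SERVICE <rdf-mapper:ontorefine:PROJECT_ID> {
    # Illustrative column bindings; the real predicates come from the project.
    ?row ex:id ?id ;
         ex:name ?name .
  }
  BIND (IRI(CONCAT("http://example.com/person/", ?id)) AS ?person)
}
```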
Streaming Data with Kafka
A common question is how to keep data fresh. The converters described above work very well for bulk ingestions, but they operate in memory. They are not great for discerning what’s already in the database. Even OntoRefine has to do that with a SPARQL FILTER clause, which isn’t efficient.
However, when updates arrive piecemeal, we can employ GraphDB's smart updates functionality. You can create a SPARQL template, which gets executed against a specific Kafka message conveyed by our Kafka sink add-on. The data in the incoming Kafka message is then stored as RDF.
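To give an idea of what such a template expresses, here is a minimal sketch in plain SPARQL with a made-up vocabulary; the exact template syntax and the way ?id and ?newValue get bound from the incoming message are described in the GraphDB documentation.

```sparql
PREFIX ex: <http://example.com/ns#>

# Replace the previous reading of the entity identified by ?id with the
# value delivered in the incoming message (?id and ?newValue are bound
# by the smart update mechanism, not by this query itself).
DELETE {
  ?id ex:temperature ?oldValue .
}
INSERT {
  ?id ex:temperature ?newValue .
}
WHERE {
  OPTIONAL { ?id ex:temperature ?oldValue }
}
```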
In many cases, though, that won’t be enough as the Kafka message has to be RDF-formatted. Fortunately, starting in GraphDB 9.11, we offer something more – arbitrary SPARQL execution against JSON messages.
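As a purely illustrative sketch (the field names, vocabulary and binding of JSON fields to SPARQL variables are all assumptions; the exact configuration is described in the GraphDB Kafka sink documentation), imagine a JSON message like this arriving on the topic:

```json
{
  "id": "sensor-42",
  "temperature": 23.7,
  "timestamp": "2024-05-01T12:00:00Z"
}
```

A registered SPARQL update could then turn those fields into triples along these lines:

```sparql
PREFIX ex: <http://example.com/ns#>

# ?id, ?temperature and ?timestamp are assumed to be bound from the
# fields of the incoming JSON message by the Kafka sink configuration.
INSERT {
  ?sensor ex:temperature ?temperature ;
          ex:measuredAt ?timestamp .
}
WHERE {
  BIND (IRI(CONCAT("http://example.com/sensor/", ?id)) AS ?sensor)
}
```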
With this approach, provided you have JSON data coming in and don't need to call third-party services, you can greatly simplify your whole ingestion pipeline and skip the need for a custom ETL step of any sort.
Keeping the Data Where It Is with Virtualization
The most radical option is to just keep the data where it is. Maybe you don’t need all the benefits of RDF, like visualizations, inference and secondary indexing. Perhaps all you need to do is the occasional query. GraphDB has the right tool ready for this case: Ontop.
Ontop is, in essence, a mapping interface. You provide it with a mapping file and some configuration for the location and credentials of an SQL database, and it translates SPARQL queries to SQL. The data stays in the remote SQL database, and the full expressivity of SPARQL is supported. The data is not cached, except for any caching that happens on the SQL layer, which means SQL updates are served seamlessly, provided that the mapping is kept up to date. You can use R2RML or Ontop's bespoke mapping language, OBDA.
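For orientation, an OBDA mapping pairs a SQL query with a triple template. A minimal, hypothetical example (table, columns and vocabulary are made up) might look roughly like this:

```
[PrefixDeclaration]
:       http://example.com/ns#

[MappingDeclaration] @collection [[
mappingId   employee-basic
target      :employee/{id} a :Employee ; :name {name} .
source      SELECT id, name FROM employee
]]
```

Each mapping pairs a source SQL query with a target triple template, and the {column} placeholders are filled in from the query's result columns.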
If you change your mind and decide you do want to store the data, Ontop can also be used to ingest data from SQL into GraphDB. This greatly improves the performance of non-trivial queries over that data. It is also useful when you want to use some of GraphDB's advanced capabilities, such as the semantic similarity index, autocomplete, RDF rank, secondary indices via FTS connectors, etc.
Conclusion
The world doesn't run on RDF – yet. But maybe you want a part of it to do so. With the tools described here – R2RML, TARQL, OntoRefine, smart updates and Ontop – you are one big step closer to that goal.
You don’t need to search for the golden needle in the haystack, nor spend time developing an ETL process from scratch. We have already done the research and are more than ready to give you a hand on your RDF journey.
Do you want to have a smooth RDF journey?
Radostin Nanov, Solution Architect at Ontotext