Crypto 2.0 Musings - Decentralised Query Network
How do you ask the open world a question? It could be as simple as: what do you know about Alex Batlin, say for Know Your Customer (KYC) purposes? And when you get an answer, how do you know whether you can trust it?
In the old days, you would ask your friends, who may in turn ask their friends. This network of friends would typically either share information directly or recommend books and news articles that you would need to read. You could also go to one or more libraries, look up an index of books with useful sounding headings, fetch and read them. All very time consuming.
Then came the internet. Instead of relying on libraries, folks can simply share information via websites. But how do you know which website to go to? Well, you build an index, e.g. Google Search, just like you have one in a library, but instead of indexing only the headline and location of each book, you also index all of the words. Now you can find information by searching for keywords in the entire body of text, instead of manually scanning just the headlines yourself. Much faster.
Whilst it's now much easier to find information about something or someone using the internet, it's still up to you to interpret the found data and answer a question. Let's say you want to find out if Alex Batlin is on a Politically Exposed Persons (PEP) list in any country in the world. You could search for 'Alex Batlin PEP list', find a bunch of links, fetch each link's contents, read it and determine the answer. But wouldn't it be much easier if you could query the internet, just like you query a database, with some sort of query language? Well, it turns out you can, if you use semantic web technology.
Semantic web technology allows you to publish structured data either as raw serialisations, using markup languages like XML or JSON, or embedded in a web page by marking up HTML elements with something like the RDFa standard. This way, when a human reads the HTML via a web browser they see nicely laid out information, but when a machine parses the same web page it extracts marked up structured data. SPARQL query tools can then be fed a query which specifies the web pages to fetch; they fetch the referenced content, parse out the marked up data, merge it if multiple sources were supplied, load it into memory, execute the query and return the results. Instead of always fetching serialised data, you can also directly query a database that supports SPARQL. The reason you can merge results is that the returned data is linked, i.e. it is a graph, and each piece of data has a unique identifier, so there is no ambiguity about whether it is the same as any other instance. Without such identifiers, can you be sure that entity = 'Batlin' in one database is the same as entity = 'Batlin' in another? One may be talking about the human being called Alex Batlin, while the other refers to a company that Alex founded, in which case the data cannot be merged.
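To make this concrete, here is a minimal sketch of loading some linked data and answering the PEP question with SPARQL, using the Python rdflib library. The ex:onPEPList vocabulary and the sample data are made up purely for illustration; in practice the data would be extracted from RDFa-marked-up pages or fetched from SPARQL-capable stores.

```python
# Minimal sketch: load linked data and run a SPARQL ASK query with rdflib.
# The ex: vocabulary and the data below are illustrative assumptions only.
from rdflib import Graph

DATA = """
@prefix ex:   <http://example.org/vocab#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/people/alex-batlin>
    a foaf:Person ;
    foaf:name "Alex Batlin" ;
    ex:onPEPList false .
"""

graph = Graph()
graph.parse(data=DATA, format="turtle")

# ASK returns a simple boolean: is this person flagged on a PEP list?
QUERY = """
PREFIX ex:   <http://example.org/vocab#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

ASK {
    ?person foaf:name "Alex Batlin" ;
            ex:onPEPList true .
}
"""

result = graph.query(QUERY)
print(result.askAnswer)  # False for the sample data above
```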
Using semantic web technology brings us one step closer to being able to ask the world a question - you can publish marked up data on the web, and query that data using SPARQL tools to answer a specific question. However, you still need to specify a list of either web pages to fetch or SPARQL-capable databases to query. In other words, you still need someone, like a search engine, to at the very least keep an index of all data sources to query. That puts them in a very powerful position, especially as you come to rely on them for business-critical activities. If they decide to exclude you or an important data source, start to charge too much, get hacked because they are a critical single point of failure, or simply decide to switch to a different business model, your business is in jeopardy. See my A Question of Trust blog.
One way to overcome these centralisation risks is to take the same approach as blockchain technology, i.e. use peer-to-peer (p2p) networks. It's possible to create a p2p network of query nodes, let's call it a Decentralised Query Network (DQN), that anyone can join without being censored, through the use of pseudonymous addresses, again just like blockchain networks. If you don't know who is behind an address, you can't censor them, and that's important from a freedom of speech point of view, but also from a business process point of view - you don't want to operate without knowing all the facts.
Every peer node in such a network would be able to submit a SPARQL query to the network, without needing to specify a set of sources. Each peer can simultaneously execute a query against its own internal database or set of known web resources, and pass the query on to its peer set. Before returning the results, it may wait for its peers to return their results, merge the data to save on network traffic, possibly cache the data - as long as there is a means to later invalidate caches - and return it up the call stack. In such a network there is no need for a central registry, i.e. it is decentralised from a data registry point of view, and federated from a data store point of view.
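The following is a toy sketch of that fan-out-and-merge behaviour, not the DQN protocol itself: each node runs a CONSTRUCT query locally with rdflib, forwards the query to its peers, and merges the returned graphs. Networking, caching and cache invalidation are omitted.

```python
# Toy sketch of query propagation and result merging in a DQN-style network.
# In-process objects stand in for remote peers; real nodes would talk over p2p.
from rdflib import Graph


class QueryNode:
    def __init__(self, local_graph: Graph, peers=None):
        self.local_graph = local_graph   # the node's own triple store
        self.peers = peers or []         # directly connected peer nodes

    def handle_query(self, sparql: str, seen=None) -> Graph:
        seen = seen if seen is not None else set()
        if id(self) in seen:             # naive loop protection
            return Graph()
        seen.add(id(self))

        merged = Graph()
        result = self.local_graph.query(sparql)
        if result.graph is not None:     # CONSTRUCT/DESCRIBE queries return graphs
            merged += result.graph
        for peer in self.peers:
            # Graphs merge cleanly because resources carry globally unique URIs.
            merged += peer.handle_query(sparql, seen)
        return merged


# Usage: issue a CONSTRUCT query at any node and get back the merged graph
# from that node plus everything reachable through its peers.
```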
Since RDF graphs are decomposable down to simple triple facts, i.e. subject, predicate, object, and subject and predicate are URIs, i.e. unique (blank nodes aside), with only the object being either data or a URI, it's trivial for nodes to create Bloom filters in order to quickly determine whether they hold any data relevant to the query. If data confidentiality is an issue, data can be encrypted with the query issuer's public key.
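Here is a minimal Bloom filter sketch over triples, so a node can cheaply check whether it is likely to hold anything relevant before doing real work. The bit-array size and hash count are illustrative, not tuned.

```python
# Minimal Bloom filter over RDF triples (illustrative sizes, SHA-256 hashing).
import hashlib


class TripleBloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def _set(self, pos: int):
        self.bits[pos // 8] |= 1 << (pos % 8)

    def _get(self, pos: int) -> bool:
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))

    def add_triple(self, subject: str, predicate: str, obj: str):
        # Index individual terms as well as the full triple, so simple
        # query patterns can be tested without scanning the store.
        for key in (subject, predicate, obj, f"{subject}|{predicate}|{obj}"):
            for pos in self._positions(key):
                self._set(pos)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self._get(pos) for pos in self._positions(key))
```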
A possible improvement over the standard SPARQL specification is to dynamically dereference discovered resources to run further sub-queries, see SPARQL-LD: A SPARQL Extension for Fetching and Querying Linked Data. Another idea is to create a URI scheme for the content-addressable web, see IPLD. This way the SPARQL 1.1 Graph Store HTTP Protocol can be adapted to fetch a graph based on a hash key, not a graph IRI.
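As a toy illustration of content addressing (the idea behind IPLD, not IPLD itself), a graph can be stored and retrieved under the hash of its canonical serialisation rather than under a graph IRI; rdflib's canonicalisation is assumed here, and an in-memory dict stands in for the network.

```python
# Toy content-addressed graph store: key = SHA-256 of canonical N-Triples.
import hashlib
from rdflib import Graph
from rdflib.compare import to_canonical_graph

_store = {}  # hash key -> serialised graph bytes (stand-in for the network)


def put_graph(graph: Graph) -> str:
    """Store a graph and return its content address (hex SHA-256)."""
    canonical = to_canonical_graph(graph).serialize(format="nt")
    data = canonical.encode() if isinstance(canonical, str) else canonical
    key = hashlib.sha256(data).hexdigest()
    _store[key] = data
    return key


def get_graph(key: str) -> Graph:
    """Fetch a graph by hash and verify the content matches the address."""
    data = _store[key]
    assert hashlib.sha256(data).hexdigest() == key, "content/address mismatch"
    graph = Graph()
    graph.parse(data=data.decode(), format="nt")
    return graph
```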
Now that you've got the data, how do you know you can trust it? You want to make sure its integrity has not been compromised in transit, and that its provenance is non-repudiable, since you may need to rely on this data to make business decisions. Again we can borrow ideas from blockchain technology and use digital signatures. So how do you sign a linked data graph?
One easy way to do so is by serialising the data to a file, hashing the file and signing the hash. To do so, everyone would need to agree on a single serialisation markup language for linked data, e.g. RDF triples serialised as RDF/XML. You also need to agree on a single way to order the graph serialisation, since two isomorphic, or content-identical, graphs can have the same RDF facts serialised in different triple orders, i.e. same content but different hashes. RDF graphs also support a notion of blank nodes, i.e. nodes that do not have a globally unique URI, so you would need a way to uniquely identify them in a merged graph. Finally, once you have established a hash for a graph, and signed it, you need to somehow attach the signature to the graph.
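A sketch of that serialise-canonicalise-hash-sign flow follows, assuming rdflib's canonicalisation (which relabels blank nodes deterministically) and an Ed25519 key from the Python cryptography package. A real deployment should use one of the standardised algorithms in the papers listed below.

```python
# Sketch: canonicalise an RDF graph, hash it, and sign/verify the hash.
# Assumes rdflib and the 'cryptography' package; not a standardised scheme.
import hashlib
from rdflib import Graph
from rdflib.compare import to_canonical_graph
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def graph_digest(graph: Graph) -> bytes:
    """Canonicalise the graph, serialise it deterministically and hash it."""
    canonical = to_canonical_graph(graph).serialize(format="nt")
    data = canonical.encode() if isinstance(canonical, str) else canonical
    # Sorting the N-Triples lines makes the serialisation order-independent.
    lines = sorted(line for line in data.decode().splitlines() if line.strip())
    return hashlib.sha256("\n".join(lines).encode()).digest()


def sign_graph(graph: Graph, key: Ed25519PrivateKey) -> bytes:
    return key.sign(graph_digest(graph))


def verify_graph(graph: Graph, signature: bytes, public_key) -> bool:
    try:
        public_key.verify(signature, graph_digest(graph))
        return True
    except InvalidSignature:
        return False


# Usage sketch:
# key = Ed25519PrivateKey.generate()
# sig = sign_graph(my_graph, key)
# assert verify_graph(my_graph, sig, key.public_key())
```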
Luckily, there are a number of research papers and specifications that address most of these challenges. The following address graph hashing:
- Hashing of RDF Graphs and a Solution to the Blank Node Problem - The ability to calculate hash values is fundamental for using cryptographic tools, such as digital signatures, with RDF data. Without hashing it is difficult to implement tamper-resistant attribution or provenance tracking, both important for establishing trust with open data. We propose a novel hash function for RDF graphs, which does not require altering the contents of the graph, does not need to record additional information, and does not depend on a concrete RDF syntax. We are also presenting a solution to the deterministic blank node labeling problem.
- Hashing and canonicalizing Notation 3 graphs - This paper presents a hash and a canonicalization algorithm for Notation 3 (N3) and Resource Description Framework (RDF) graphs. The hash algorithm produces, given a graph, a hash value such that the same value would be obtained from any other equivalent graph. Contrary to previous related work, it is well-suited for graphs with blank nodes, variables and subgraphs. The canonicalization algorithm outputs a canonical serialization of a given graph (i.e. a canonical representative of the set of all the graphs that are equivalent to it). Potential applications of these algorithms include, among others, checking graphs for identity, computing differences between graphs and graph synchronization. The former could be especially useful for crawlers that gather RDF/N3 data from the Web, to avoid processing several times graphs that are equivalent. Both algorithms have been evaluated on a big dataset, with more than 29 million triples and several millions of subgraphs and variables.
- Computing the digest of an RDF graph - Efficient algorithms are needed for computing the digest of a Resource Description Framework (RDF) graph. These may be used to assign unique content-dependent identifiers and for use in digital signatures, allowing a recipient to verify that RDF was generated by a particular individual and/or has not been altered in transit. Ideally the digest algorithms should still permit verification of the RDF graph even when it has been transported via intermediaries who may serialize in different formats and alter the blank node identifiers.
- RDF Dataset Normalization - RDF describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.
- Signing RDF Graphs - Assuming P < GI < NP, the creation and verification of a digital signature of an arbitrary RDF graph cannot be done in polynomial time. However, it is possible to define a large class of canonicalizable RDF graphs, such that digital signatures for graphs in this class can be created and verified in O(n log(n)). Without changing its meaning, an arbitrary RDF graph can be nondeterministically pre-canonicalized into a graph of this class, before signing. The techniques in this paper are key enablers for the use of digital signature technology in the Semantic Web.
One way to attach signatures to the graph is to use N-Quads, i.e. N-Triples plus a graph IRI, to scope graph facts, and then make statements about the graph IRI, such as its signed hash. The other option is to use reification, as per the Signing Individual Fragments of an RDF Graph paper.
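A small sketch of the N-Quads approach: the facts sit in a named graph, and statements about that graph's IRI, such as its hash and signer, sit alongside it, so the signature travels with the data when serialised as N-Quads. The ex:sha256Hash and ex:signedBy terms are a hypothetical vocabulary used only for illustration.

```python
# Sketch: attach signature metadata to a named graph via statements about
# its IRI, serialised together as N-Quads. Vocabulary and values are made up.
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab#")

dataset = Dataset()
facts_iri = URIRef("http://example.org/graphs/alex-batlin-facts")

# The facts themselves, scoped to a named graph.
facts = dataset.graph(facts_iri)
facts.add((URIRef("http://example.org/people/alex-batlin"),
           EX.onPEPList,
           Literal(False)))

# Statements *about* the named graph, e.g. its signed hash, go alongside it.
meta = dataset.graph(URIRef("http://example.org/graphs/signatures"))
meta.add((facts_iri, EX.sha256Hash, Literal("0xabc123...")))  # placeholder value
meta.add((facts_iri, EX.signedBy, URIRef("http://example.org/keys/alex")))

print(dataset.serialize(format="nquads"))
```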
I have suggested borrowing so many blockchain ideas that, in fact, using Ethereum's DEVp2p wire protocol for DQN is not a bad idea. DEVp2p is designed to be a generic abstraction that is able to support specific sub-protocols, e.g. Ethereum (eth), Whisper (shh) and Swarm (bzz). By defining another DEVp2p sub-protocol to handle queries, let's call it Query (qna), we can piggyback off an established network and technology. Another advantage of using Ethereum is that it's possible to use the Ethereum Name Service, instead of DNS, to issue domain namespaces to form URIs without centralisation risks. Furthermore, Ethereum itself is a giant database: as long as smart contract getters can be set up to return JSON-LD style responses, it can act as a data provider in its own right, and it can be used as an accounting system to incentivise peers, the next important topic.
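Purely as an illustration, the qna sub-protocol might carry message shapes along these lines; none of this is part of any existing Ethereum or DEVp2p specification, it is just one way the request/response payloads could be structured.

```python
# Illustrative payload shapes for a hypothetical 'qna' DEVp2p sub-protocol.
# These are assumptions for discussion, not an existing specification.
from dataclasses import dataclass


@dataclass
class QnaQuery:
    request_id: bytes              # lets responses be matched to requests
    sparql: str                    # the SPARQL query being propagated
    ttl: int = 8                   # hop limit to stop endless flooding
    requester_pubkey: bytes = b""  # used to encrypt confidential results


@dataclass
class QnaResponse:
    request_id: bytes
    payload: bytes                 # serialised (optionally encrypted) RDF graph
    content_hash: bytes = b""      # hash of the canonical graph, for integrity
    signature: bytes = b""         # provider's signature over content_hash
```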
In my opinion, Bitcoin's genius was not just in figuring out a new technology, but also in inventing a new business model. In a similar way, in order to make the Decentralised Query Network (DQN) a success, a way to monetise participation must be designed into the network.
Google uses advertising to drive revenues on the back of its platform. It keeps all of the network effect value created by its contributors. This drives competitors to set up alternative search engines, rather than collaborate, because essentially a search engine is a distribution channel. This creates data silos, and forces users to either aggregate results or only access a subset of known data, both of which are inefficient. In principle, DQN disintermediates the likes of Google as a distribution channel, and instead allows data providers to collaborate on the same distribution channel, but compete on data quality and quantity. End users should get better data at a cheaper price, and spend less time aggregating and reconciling it to answer questions.
So one option for monetisation is to create a new Query coin; if using Ethereum this could be an ERC20 token. When data providers respond to queries, they earn coins. When advertisers wish to place ads on DQN, they need to buy those coins at an exchange, similar to Lunyr. To protect against fixed-supply currency issues, I would suggest incorporating the ideas I developed in my Coin Monetary and Fiscal Policies blog. Some data may be public and available without paywalls, but paid-for data can be controlled by either checking if the query issuer holds a data license with the serving node, via a lookup against a registry smart contract (sketched below) and then encrypting the results, or possibly paid for in Ether on demand, i.e. pay-per-result. This works well for paywalled data and queries issued by humans, as ads can be displayed, but not so well for machine-to-machine queries. Perhaps those queries would need to be pay-per-query. More thought is required here, any suggestions most welcome.
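Here is a hedged sketch of that license check: before serving paid-for data, a node asks a hypothetical license registry smart contract whether the query issuer holds a license from the data provider. The contract address, ABI and hasLicense getter are assumptions for illustration; web3.py is used for the Ethereum call.

```python
# Sketch: check a (hypothetical) on-chain license registry before serving data.
# The registry contract, its ABI and hasLicense() are illustrative assumptions.
from web3 import Web3

REGISTRY_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
REGISTRY_ABI = [{
    "name": "hasLicense",
    "type": "function",
    "stateMutability": "view",
    "inputs": [{"name": "issuer", "type": "address"},
               {"name": "provider", "type": "address"}],
    "outputs": [{"name": "", "type": "bool"}],
}]

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
registry = w3.eth.contract(address=REGISTRY_ADDRESS, abi=REGISTRY_ABI)


def may_serve(issuer_address: str, provider_address: str) -> bool:
    """Return True if the issuer is licensed to receive this provider's data."""
    return registry.functions.hasLicense(issuer_address, provider_address).call()
```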
Once monetisation is solved, the next question is how do you go to market? Well, a lot of linked data is already out there but hidden. For example, search engines like Google encourage retailers to mark up their sites with schema.org based taxonomy. They then parse the web pages, extract the data, index it and return it when you search for the best priced goods and services. Whilst manually marking up data is hard, e-commerce platforms do so in a transparent way - the user generates the content through forms, and the resulting web pages are marked up. The same principle can be applied to blogging (Wordpress), social media, news and other platforms. In fact, by building Ethereum + DQN into content management systems, you create a network of websites, all collaborating to answer people's queries. Since SPARQL queries are too complex for most people, I would suggest using NLP to generate queries from natural language questions, something already being done for SQL. Now that's what I call web 3.0!