Freeing the data buried in journals and patents
Research chemists working in areas such as medicinal chemistry, pharmacology and chemical biology depend on access to an array of documents and data. A recent webinar presentation titled “Expanding connectivity between documents, structures and bioactivity” (part of Elsevier’s “Dear data: What compound to make next?” webinar series) featured presenter Christopher Southan, Ph.D., FBPharmacolS, FRSC, an expert in bioinformatics, cheminformatics, proteomics, genomics, drug discovery and protein chemistry. To illustrate an important data relationship chain, he offered the shorthand D-A-R-C-P: Document > Assay > Result > Compound > Protein. Unfortunately, a researcher working in drug development, for example, might struggle to find good sources for DARCP.
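For readers who think in data structures, the D-A-R-C-P chain can be pictured as a set of linked records. The following is only a minimal sketch: the class names mirror the five letters of the shorthand, but every field name, the `darcp_chains` helper and the sample values are illustrative assumptions, not a schema presented in the webinar or used by any of the databases mentioned.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Protein:
    name: str  # the biological target (field name is an assumption)


@dataclass(frozen=True)
class Compound:
    name: str
    target: Protein  # the protein the compound was tested against


@dataclass(frozen=True)
class Result:
    measure: str  # type of measurement, e.g. "IC50"
    value: float
    units: str
    compound: Compound


@dataclass(frozen=True)
class Assay:
    description: str
    results: Tuple[Result, ...]


@dataclass(frozen=True)
class Document:
    identifier: str  # e.g. a journal article or patent identifier
    assays: Tuple[Assay, ...]


def darcp_chains(doc: Document) -> Iterator[Tuple[str, str, str, str, str]]:
    """Yield one (document, assay, result, compound, protein) tuple per result."""
    for assay in doc.assays:
        for result in assay.results:
            yield (
                doc.identifier,
                assay.description,
                f"{result.measure}={result.value} {result.units}",
                result.compound.name,
                result.compound.target.name,
            )


# Invented placeholder data, used only to show the shape of the chain.
protein = Protein(name="Example protein X")
compound = Compound(name="Example compound A", target=protein)
result = Result(measure="IC50", value=12.0, units="nM", compound=compound)
assay = Assay(description="Example binding assay", results=(result,))
doc = Document(identifier="example-doc-1", assays=(assay,))

chains = list(darcp_chains(doc))
```

Each tuple in `chains` is one complete Document > Assay > Result > Compound > Protein path. The "lost connectivity" described in this article corresponds to cases where one or more of these links cannot be recovered from the published record.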
“There is a legacy problem of lost connectivity,” explains Southan in the webinar, which you can watch here. “Most of the quantitative bioactivity data that we want is entombed in journal and patent PDFs. Most of the papers are behind subscription firewalls. There’s a big push for open access publications these days, and it is making a difference, [but] it’s not really penetrating into medicinal chemistry that much.”
“Patents, paradoxically, are more open than the literature,” he notes. “As soon as they’re published, they’re there. But they are particularly challenging for curating chemistry and relationship extraction, and only a small proportion are, at the moment, openly curated.”
In his presentation, Southan goes on to highlight his own “personal choices of quality, curated, open sources of DARCP.” These include resources such as IUPHAR/BPS Guide to Pharmacology (an expert-curated resource of pharmacological targets and the substances that act on them); ChEMBL; PubMed; and BindingDB (the first public molecular recognition database).
“Publicly-curated DARCP resources are to be, in my view, congratulated for their continuous expansion,” he says, adding that they do need more resourcing, though. “As a community, we still have our collective feet nailed to the floor from decades of DARCP value entombed in PDFs, behind firewalls and buried in patents. So far, automated extraction remains way behind the specificity and quality judgement of a biocurator – a human expert. AI is trying to close this gap, but we’ll see.”
One information solution that has overcome many of the difficulties of this legacy of “lost connectivity” is the commercial database Reaxys. Anindya Ghosh Roy, Product Manager at Elsevier, also participated in the webinar, sharing how Reaxys meets researcher needs and addresses use cases such as novelty checks, structure-activity relationships, toxicological profiles and pharmacokinetics, to name a few.
“There is a big gap between the public and the commercial databases. The gap is indeed very broad. Just from the sheer numbers, Reaxys has a lot more data compared to ChEMBL, for example.” And of course a lot of work has gone into bringing together that data, which includes over 8 million substances with bioactivity data, more than 6 million assays, 36,000+ biological targets and much more.
“One of the key challenges we face on a regular basis is striking a balance between content coverage, the quality of the extracted data, the time to customer (which is how fast the data is made available to end users), and the discoverability (which is how easily the extracted information can be found by the users in a usable form),” he reveals.
“This balance is hard to maintain as the requirement changes with different use cases and therefore there is no single approach that can counter this. Therefore, to extract the data, we employ a hybrid approach, where we harvest the power of both technology and human intervention.”
This provides the best of both worlds. Every year Reaxys optimizes its processes and develops new technology required by researchers. “From a user’s perspective, all the data we extract can be easily discovered and accessed on our extremely user-friendly UI,” says Roy.
You can watch the full webinar here to learn more.
Interested in finding out more about AI and drug discovery in life sciences research? Also check out the second webinar in this series, Drug Discovery with AI at AstraZeneca - from Generative Models to Reaction Prediction.