Role of deep learning in biology
A popular artificial-intelligence method provides a powerful tool for surveying and classifying biological data. But for the uninitiated, the technology poses significant difficulties.
Four years ago, Google scientists showed up at neuroscientist Steve Finkbeiner's door. The researchers were based at Google Accelerated Science, a research division in Mountain View, California, that aims to use Google technologies to speed scientific discovery. They were interested in applying a 'deep-learning' approach to the mountains of imaging data generated by Finkbeiner's team at the Gladstone Institute of Neurological Disease in San Francisco, California.
Deep-learning algorithms take raw features from very large, annotated data sets, such as collections of images or genomes, and use them to build a predictive tool based on the patterns buried inside. Once trained, the algorithms can apply that training to analyze other data, sometimes from wildly different sources.
The technique can be used to tackle "really difficult, complex problems, and be able to see the structure in data - amounts of data that are too large and too complex for the human brain to comprehend," says Finkbeiner.
He and his team produce reams of data using a high-throughput imaging strategy known as robotic microscopy, which they developed for studying brain cells. But the team couldn't analyze its data as fast as it acquired them, so Finkbeiner welcomed the opportunity to collaborate.
"I honestly couldn't say at the time that I had a clear understanding of what questions could be addressed with deep learning, but I knew we could analyze it at about two to three times the rate. We're generating data," he says. ,
Today those efforts are starting to pay off. Finkbeiner's team, along with scientists from Google, trained a deep-learning algorithm on two sets of cell images: one artificially labeled to highlight features that scientists can't normally see, the other unlabeled. When they later exposed the algorithm to images of unlabeled cells that it had never seen before, Finkbeiner says, "it was surprisingly good at predicting what the labels should be for those images". A publication detailing that work is now in press.
Finkbeiner's success highlights how deep learning, one of the most promising branches of artificial intelligence (AI), is making inroads in biology. The algorithms are already infiltrating modern life in smartphones, smart speakers, and self-driving cars. In biology, deep-learning algorithms dive into data in ways that humans cannot, detecting features that might otherwise be impossible to capture. Researchers are using the algorithms to classify cellular images, make genomic connections, advance drug discovery, and even find links across different data types, from genomics and imaging to electronic medical records.
More than 440 articles on the bioRxiv preprint server discuss deep learning, and PubMed lists more than 700 references from 2017. And the tools are on the verge of becoming widely available to biologists and clinical researchers. But researchers face challenges in understanding just what these algorithms are doing, and in ensuring that they don't mislead users.
Smart algorithm training

Deep-learning algorithms (see 'Deep thoughts') rely on neural networks, a computational model first proposed in the 1940s, in which layers of neuron-like nodes mimic how the human brain analyzes information. Until about five years ago, machine-learning algorithms based on neural networks relied on researchers to process the raw information into a more meaningful form before feeding it into computational models, says Casey Green, a computational biologist at the University of Pennsylvania in Philadelphia. But the explosion in the size of data sets – from sources such as smartphone snapshots or large-scale genomic sequencing – and algorithmic innovations have now made it possible for humans to take a step back. This advance in machine learning - the 'deep' part - forces computers, not their human programmers, to find the meaningful relationships embedded in pixels and bases. And as the layers in a neural network filter and sort information, they also communicate with one another, allowing each layer to refine the output from the previous one.
Eventually, this process allows a trained algorithm to analyze a new image and correctly identify it as, say, Charles Darwin or a diseased cell. But as researchers distance themselves from the classification process, they can no longer control it or even explain precisely what the software is doing. Although these deep-learning networks can be surprisingly accurate at making predictions, Finkbeiner says, "it is still sometimes challenging to figure out exactly what this network sees that enables it to make such good predictions".
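To make the idea of stacked, information-refining layers concrete, here is a minimal sketch, written in TensorFlow (an open-source platform mentioned later in this article), of the kind of layered network that could classify cell images. The image size, the layer sizes and the two-class 'healthy versus diseased' task are illustrative assumptions, not details of Finkbeiner's model.

```python
# A minimal sketch, not Finkbeiner's model: a small convolutional network that
# classifies hypothetical 64x64 grayscale cell images into two classes.
# Each layer refines the output passed on by the previous one.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),           # raw pixels go in
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # early layers detect edges and textures
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),   # deeper layers combine them into shapes
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),     # e.g. 'healthy' versus 'diseased'
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(images, labels, epochs=10) would then learn the mapping from annotated examples.
```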
Nevertheless, many of biology's subdisciplines, including imaging, are reaping the rewards of those predictions. A decade ago, software for automated biological image analysis focused on measuring single parameters in a set of images. For example, in 2005, Anne Carpenter, a computational biologist at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, released an open-source software package called CellProfiler to help biologists quantitatively measure individual features: the number of fluorescent cells in a microscopy field, for example, or the length of a zebrafish.
But deep learning is giving her team the chance to go further. "We're moving toward measuring things that biologists don't realize they want to measure from images," she says. Recording and combining visual details such as DNA staining, organelle texture, and the quality of empty spaces in a cell can produce thousands of 'features', any of which may reveal new insights. The current version of CellProfiler includes some deep-learning elements, and her team hopes to add more sophisticated deep-learning tools within the next year.
"Most people have a hard time wrapping their heads around this, but an image of cells actually contains as much information as there is in a transcriptomic analysis of a cell population," says Carpenter.
That type of processing allows Carpenter's team to take a less-supervised approach to translating cell images into disease-relevant phenotypes, and to capitalize on it. Carpenter is a scientific advisor to Recursion Pharmaceuticals in Salt Lake City, Utah, which is using its deep-learning tools to target rare, single-gene disorders for drug development.
Mining genomic data
When it comes to deep learning, not just any data will do. The method often requires massive, well-annotated data sets. Imaging data provide a natural fit, but so do genomic data.
One biotech firm using such data is Verily Life Sciences (formerly Google Life Sciences) in San Francisco, California. Researchers at Verily, a subsidiary of Alphabet, Google's parent company, along with Google scientists, have developed a deep-learning tool that identifies a common type of genetic variation, called single-nucleotide polymorphisms, more accurately than conventional tools. Called DeepVariant, the software translates genomic information into an image-like representation, which is then analyzed as an image (see 'Tools for deep diving'). Mark DePristo, who heads deep-learning-based genomic research at Google, expects DeepVariant to be particularly useful for researchers studying organisms outside the mainstream, those with low-quality reference genomes and high error rates in identifying genetic variants. Working with DeepVariant in plants, his colleague Ryan Poplin has achieved error rates closer to 2%, compared with the more typical 20% of other approaches.
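DeepVariant's real encoding is considerably more elaborate, but the core idea, recasting the sequencing reads that overlap a candidate variant site as an image-like array that a convolutional network can then classify, can be sketched in a few lines. The window size, the two channels and the dummy quality value below are simplifying assumptions for illustration only.

```python
# Illustrative only; this is not DeepVariant's actual encoding. The idea is to turn a
# read pileup into an image-like tensor: rows are reads, columns are positions, and
# channels hold per-base features that a convolutional network can learn from.
import numpy as np

BASES = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def pileup_to_tensor(reads, window=15):
    """Encode equal-length read strings as a (num_reads, window, 2) array:
    channel 0 = base identity, channel 1 = a placeholder base-quality value."""
    tensor = np.zeros((len(reads), window, 2), dtype=np.float32)
    for i, read in enumerate(reads):
        for j, base in enumerate(read[:window]):
            tensor[i, j, 0] = BASES.get(base, 0.0)  # encode the base
            tensor[i, j, 1] = 0.9                   # dummy quality score
    return tensor

example = pileup_to_tensor(["ACGTACGTACGTACG",
                            "ACGTACGTACGTACG",
                            "ACGTTCGTACGTACG"])   # third read carries a candidate variant
print(example.shape)  # (3, 15, 2)
```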
Tools for deep diving

Deep-learning tools are evolving rapidly, and laboratories will need dedicated computational expertise, collaborations, or both to take advantage of them.
First, take a colleague with deep-learning expertise to lunch and ask whether the strategy might be useful, advises Steve Finkbeiner, a neuroscientist at the Gladstone Institutes in San Francisco, California. With some data sets, such as imaging data, an off-the-shelf program may work; for more complex projects, consider a collaborator, he says. Workshops and meetings can provide training opportunities.
Access to cloud-computing resources means researchers may not need an on-site computer cluster to do deep learning; they can run their computations elsewhere. Google's TensorFlow, an open-source platform for building deep-learning algorithms, is available on the software-sharing site GitHub, as is an open-source version of DeepVariant, a tool for accurately identifying genetic variation.
Google Accelerated Science, a Google research division based in Mountain View, California, collaborates with a number of scientists, including biologists, says Michelle Dimon, one of its research scientists. Projects require a compelling biological question, large amounts of high-quality, labeled data, and a challenge that will allow the company's machine-learning experts to make unparalleled computational contributions, says Dimon.
Those ready to dive into deep learning should check out 'Deep Review', a comprehensive, crowdsourced review led by computational biologist Casey Green of the University of Pennsylvania in Philadelphia.
Brendan Frey, chief executive of the Canadian company Deep Genomics in Toronto, also focuses on genomic data, but with the goal of predicting and treating disease. Frey's academic team at the University of Toronto developed algorithms trained on genomic and transcriptomic data from healthy cells. Those algorithms built predictive models of RNA-processing events within those data, such as splicing, transcription, and polyadenylation. When applied to clinical data, the algorithms were able to identify mutations and flag them as pathogenic, Frey says, even though they had never seen clinical data. At Deep Genomics, Frey's team is using the same tools to identify and target the disease mechanisms uncovered by the software, and to develop therapies based on short nucleic-acid sequences.
Drug discovery is another discipline with massive data sets that is well suited to deep learning. Here, deep-learning algorithms are helping to solve classification challenges, sifting through molecular characteristics such as size and hydrogen bonding to identify criteria on which to rank potential drugs. For example, Atomwise, a biotech company based in San Francisco, has developed an algorithm that converts molecules into grids of 3D pixels, called voxels. This representation allows the algorithm to model features such as the 3D structure of proteins and the geometry of small molecules with atomic precision. Those features are translated into mathematical vectors that the algorithm can use to predict which small molecules are likely to interact with a given protein, says company CEO Abraham Heifets. "We can do targets [proteins] with no known binders," he says.
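As an illustration of the voxel idea, and emphatically not Atomwise's implementation, a molecule's atomic coordinates can be binned into a 3D grid with one channel per element type, so that a 3D convolutional network can treat the structure much like a volumetric image. The grid size, resolution and element list below are arbitrary choices for the sketch.

```python
# A toy voxelization sketch (not Atomwise's code): map atomic coordinates onto a
# 3D grid, one channel per element, so the molecule can be fed to a 3D conv net.
import numpy as np

ELEMENTS = {"C": 0, "N": 1, "O": 2}  # illustrative element channels

def voxelize(atoms, grid_size=24, resolution=1.0):
    """atoms: list of (element, x, y, z) tuples, coordinates in angstroms near the origin."""
    grid = np.zeros((grid_size, grid_size, grid_size, len(ELEMENTS)), dtype=np.float32)
    offset = grid_size // 2
    for element, x, y, z in atoms:
        if element not in ELEMENTS:
            continue  # skip element types we are not modelling
        i, j, k = (int(round(c / resolution)) + offset for c in (x, y, z))
        if 0 <= i < grid_size and 0 <= j < grid_size and 0 <= k < grid_size:
            grid[i, j, k, ELEMENTS[element]] = 1.0
    return grid

# A made-up three-atom fragment, just to show the output shape
voxels = voxelize([("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.0, 0.0), ("N", -1.3, 0.4, 0.0)])
print(voxels.shape)  # (24, 24, 24, 3)
```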
Atomwise is using this strategy to power its new AI-driven molecular-screening program, which scans a library of 10 million compounds to provide academic researchers with 72 potential small-molecule binders for their protein of interest.
Deep-learning tools can also help researchers classify disease types, understand disease subpopulations, find new treatments, and match them with suitable patients for clinical testing and treatment. Finkbeiner, for example, is part of a consortium called North ALS, an effort to combine a range of data – genomics, transcriptomics, epigenomics, proteomics, imaging, and even pluripotent stem-cell biology – from 1,000 people with the neurodegenerative disease amyotrophic lateral sclerosis (also called motor neuron disease). "For the first time, we'll have a data set where we can apply deep learning and see whether deep learning can uncover the relationships between the things we can measure in a dish around a cell and what's happening with that patient," he says.
Challenges and Precautions
For all its promise, deep learning poses significant challenges, researchers warn. As with any computational-biology technique, the results that emerge from an algorithm are only as good as the data that go in. Overfitting a model to its training data is also a concern. Furthermore, for deep learning, the criteria for data quantity and quality are often more stringent than some experimental biologists expect.
Deep-learning algorithms require very large data sets that are well annotated, so that the algorithms can learn to distinguish features and classify patterns. Large, clearly labeled data sets – with millions of data points representing different experimental and physiological conditions – give researchers the most flexibility when training an algorithm. Finkbeiner notes that algorithm training in his work improved markedly once about 15,000 examples were available. Those high-quality 'ground truth' data can be extraordinarily difficult to obtain, says Carpenter.
To get around this challenge, researchers are working on ways to train algorithms with less data. Advances in the underlying algorithms are allowing neural networks to use data much more efficiently, says Carpenter, enabling training on only a handful of images for some applications. Scientists can also exploit transfer learning, the ability of neural networks to apply classification skills acquired from one data type to another. For example, Finkbeiner's team developed an algorithm that was initially trained to predict cell death on the basis of morphological changes. Although researchers had trained it on images of rodent cells, it achieved 90% accuracy the first time it was exposed to images of human cells, improving to 99% as it gained experience.
For some of its biological image-recognition work, Google Accelerated Science uses algorithms that were initially trained on hundreds of millions of consumer images mined from the Internet. The researchers then refine the training, using several hundred biological images similar to the ones they want to study.
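The transfer-learning recipe that both examples describe can be sketched in a few lines of TensorFlow. Here a network pretrained on everyday photographs (ImageNet weights, a stand-in for the consumer-image pretraining described above) is frozen and topped with a small classifier that is then fine-tuned on a much smaller set of labeled biological images; the specific architecture, image size and class names are assumptions for illustration.

```python
# A minimal transfer-learning sketch (hypothetical; not Google's or Finkbeiner's code):
# reuse a network pretrained on consumer images and fine-tune only a small new head.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the general-purpose visual features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. 'alive' versus 'dead' cells
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(biological_images, labels, epochs=5)  # a few hundred labeled examples may suffice
```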
Another challenge with deep learning is that computers are both unwieldy and lazy, says Michelle Dimon, a research scientist at Google Accelerated Science. They lack the judgement to distinguish biologically relevant differences from normal variation. "Computers are shockingly good at finding batch variance," she notes. As a result, generating data to feed into a deep-learning algorithm often means applying a high bar for experimental design and controls. Google Accelerated Science requires that researchers randomize the layout of their cell-culture plates to account for microenvironmental factors, such as incubator temperature, and that they use twice as many controls as a biologist might otherwise. "We make it harder to pipette," Dimon quips.
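What such experimental rigour looks like in practice can be illustrated with a toy example: randomly assigning treatments, with doubled controls, across the wells of a 96-well plate so that position-related batch effects are not confounded with the biology. The plate format, treatments and counts below are hypothetical, not Google's actual protocol.

```python
# Hypothetical sketch of a randomized plate layout with doubled controls;
# shuffling breaks any link between well position and treatment.
import random

treatments = ["drug_A"] * 24 + ["drug_B"] * 24 + ["control"] * 48  # twice as many controls
wells = [f"{row}{col:02d}" for row in "ABCDEFGH" for col in range(1, 13)]  # 96-well plate

random.seed(0)          # fixed seed so the layout is reproducible
random.shuffle(treatments)
layout = dict(zip(wells, treatments))
print(layout["A01"], layout["H12"])
```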
Dimon says this demand underscores the importance of pairing biologists with computer scientists when designing experiments that involve deep learning. And that careful design becomes even more important with one of Google's latest projects: Contour, a strategy for clustering cellular-imaging data in ways that highlight trends (such as dose responses) rather than sorting them into specific categories (such as alive or dead).
Although deep-learning algorithms can evaluate data without human preconceptions and filters, Green cautions, that does not mean they are unbiased. Training data can be skewed - for example, when only genomic data from northern Europeans are used. Deep-learning algorithms trained on such data will pick up the embedded biases and reflect them in their predictions, which could in turn lead to unequal patient care. If humans help to validate these predictions, that provides a possible check on the problem. But such concerns are troubling if computers are left to make important decisions on their own. "Thinking of these methods as a way of enhancing humans is better than thinking of them as a way of replacing humans," Green says.
And then there is the challenge of understanding how these algorithms construct the parameters, or features, that they use to classify data in the first place. Computer scientists are attacking this question by altering individual features in a model and then examining how those changes affect the accuracy of its predictions, says Polina Mamoshina, a research scientist at Insilico Medicine in Baltimore, Maryland, which uses deep learning to improve drug discovery. But different neural networks working on the same problem will not approach it in the same way, warns Green. Researchers are increasingly focusing on algorithms that make predictions that are both accurate and explainable, says Green, but for now these systems remain very much black boxes.