Analyst's take on Coronavirus Genome Sequence
I’m no Biologist.
But as an Investment Analyst and an AI guy who has built a biological product in finance in the past, I had a strong urge to do my tiny bit to accelerate the research and help put an end to the Coronavirus (Disease Name: COVID-19, Virus Name: SARS-CoV-2). So instead of wearing a lab coat and taking control of a microscope (hats off to those on the frontline), I chose to analyse the virus genome sequence through bioinformatics/computational biology... wearing a Bambu hoodie.
To begin with, I needed a genome sequence of the virus. Quick refresher: what’s a genome? You can think of it as a species’ recipe book (its genetic instructions). It includes all of the hereditary information needed to build that organism, allowing it to grow and reproduce. Basically, it is the thing that makes you, YOU.
A. Getting Genome Sequences
From the Gene and National Center for Biotechnology Information (NCBI) databases we can obtain the nucleotide (building block of DNA and RNA) and genome sequences of a vast range of species (even extinct dinosaurs! ID: U41319.1) found on planet Earth.
For the Coronavirus (Covid-19), I collected almost all of the 45 genome sequences submitted by various countries in the past two months. Always make sure you get the ‘complete’ genome and not a ‘partial’ one.
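For readers who want to script the download step, here is a minimal sketch using NCBI's public E-utilities `efetch` endpoint with only the Python standard library. It is not necessarily how the sequences for this article were collected. The helper names (`build_efetch_url`, `is_complete_genome`, `fetch_fasta`) are made up for illustration, and the 'complete genome' check is a simple assumption based on the record title.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Build a URL that returns one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # the nucleotide database
        "id": accession,      # e.g. "U41319.1", the ID quoted above
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def is_complete_genome(fasta_header):
    """Assumption: complete records say 'complete genome' in their title."""
    return "complete genome" in fasta_header.lower()


def fetch_fasta(accession):
    """Download the FASTA text for one accession (needs network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_fasta("U41319.1")` would download the record mentioned above. The first line of the returned text is the header that `is_complete_genome` inspects.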
Let’s pick one from Yunnan, submitted by the ‘Yunnan Center for Disease Control Prevention’ and the ‘Institute for Acute Communicable Disease Prevention & Control’:
1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct
61 gttctctaaa cgaaatttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact
121 cacgcagtat aattaataac taattactgt cgttgacagg acacgagtaa ctcgtctatc
181 ttctgcaggc tgcttacggt ttcgtccgtg ttgcagccga tcatcagcac atctaggttt
(Trimmed 495 lines in between)
29761 acagtgaaca atgctaggga gagctgccta tatggaagag ccctaatgtg taaaattaat
29821 tttagtagtg ctatccccat gtgattttaa tagcttctta ggagaatgac aaaaaaaaaa
29881 aaaaaaaaaa aaaaaaaaaa aaa
What may seem like gibberish strings are what infected 120,000 people worldwide. You are now seeing ‘The Coronavirus’ in string format. The weird A, C, G and T are nothing but (A) adenine, (C) cytosine, (G) guanine and (T) thymine, the fundamental units of the genetic code. (Strictly speaking, the virus genome is RNA, which uses uracil (U) in place of thymine; sequence databases conventionally write it with T.) I won't bore you with the details, but think of them as musical notation on paper.
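The position numbers and spaces in the listing are only layout. Here is a small sketch, assuming the GenBank-style layout shown above, of how to turn such lines into one continuous string and count the letters. The function names are illustrative.

```python
from collections import Counter

BASES = set("acgtn")  # 'n' marks an unknown base in some records


def parse_sequence_lines(text):
    """Drop position numbers and notes, keep only the chunks of bases."""
    chunks = text.lower().split()
    return "".join(chunk for chunk in chunks if set(chunk) <= BASES)


def gc_content(sequence):
    """Fraction of the sequence made up of G and C."""
    counts = Counter(sequence)
    return (counts["g"] + counts["c"]) / len(sequence)


# First line of the listing above.
first_line = "1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct"
sequence = parse_sequence_lines(first_line)
print(len(sequence), Counter(sequence))
```

Feeding the full record through `parse_sequence_lines` gives the single long string that the alignment steps work on.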
B. Matching Time! Finding Similarity among Virus Genome from different countries
How do we do it? Multiple Sequence Alignment: 'The Lego Block Way'
Since its initial outbreak, the virus has hopped borders into 94 countries! Now I’m curious as to whether it underwent any mutations during the trip. Specifically, nonsynonymous substitutions (mutations that change the encoded amino acid) can alter biological traits, allowing the virus to adapt to different environments. Which we don't want!
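To make 'nonsynonymous' concrete, here is a minimal sketch that compares two coding sequences codon by codon using the standard genetic code. It assumes the two sequences are already aligned, gap-free and in the same reading frame, and it uses T rather than U, matching the database notation. The example sequences are toy data, not taken from any virus genome.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions(reference, variant):
    """Return (codon index, codon change, amino acid change, kind) tuples."""
    reference, variant = reference.upper(), variant.upper()
    if len(reference) != len(variant) or len(reference) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    changes = []
    for start in range(0, len(reference), 3):
        ref_codon = reference[start:start + 3]
        var_codon = variant[start:start + 3]
        if ref_codon == var_codon:
            continue
        ref_aa, var_aa = CODON_TABLE[ref_codon], CODON_TABLE[var_codon]
        kind = "synonymous" if ref_aa == var_aa else "nonsynonymous"
        changes.append((start // 3, ref_codon + ">" + var_codon,
                        ref_aa + ">" + var_aa, kind))
    return changes


# Toy example: GCT>GCC keeps alanine (A), AAA>AGA swaps lysine (K) for arginine (R).
print(classify_substitutions("ATGGCTAAA", "ATGGCCAGA"))
```

Only the second change alters the protein, which is why nonsynonymous substitutions are the ones worth watching.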
Using BioPython, a package for computational biology, I read in the sequences and combine them into a single joint FASTA file (a text-based format in which I stack them up like Lego blocks on top of each other). Now I'll align the virus sequences to get the maximum matching among them! For pairwise alignment (2 sequences), one can use the Needleman-Wunsch algorithm (1970; global, aligns the whole sequences end to end) or the Smith-Waterman algorithm (1981; local, finds the best-matching sub-regions).
These are used in biology to find how two DNA, RNA or protein sequences align. In our case, we have multiple sequences, so we'll have to use heuristic methods, as finding the optimal alignment of many sequences is computationally expensive.
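For intuition on the pairwise case, here is a compact pure-Python sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration, not the settings used for the results in this article. The table needs memory proportional to the product of the two lengths, so this version only suits short sequences. For full genomes a library implementation is the practical choice.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align two bases
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[n][m]


print(needleman_wunsch("ACGT", "ACT"))  # ('ACGT', 'AC-T', 1)
```

Smith-Waterman uses the same table but floors every cell at zero and starts the traceback from the highest-scoring cell, which is what makes it local rather than global.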
C. Phylogenetic Analysis: Evolutionary Check
Using the above aligned file, we'll now conduct a phylogenetic analysis to understand the evolution of the coronavirus and see whether its biological traits vary between countries. Basically, to see if the genomes are all similar or whether there were any notable mutations.
With genome sequences from China, Taiwan, Japan, US, Australia, Nepal, Sweden, Finland and South Korea I constructed the phylogenetic tree.
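The article does not state which tree-building method was used. As one simple illustration, the sketch below builds a tree from aligned sequences with UPGMA (average-linkage clustering) on p-distances, meaning the fraction of differing positions. The labels and sequences are toy placeholders, not real genomes, and the output is a bare Newick-style topology without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of compared columns that differ (columns with gaps skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """aligned: dict of label -> aligned sequence. Returns a Newick-style string."""
    sizes = {label: 1 for label in aligned}  # cluster label -> number of leaves
    dist = {frozenset(pair): p_distance(aligned[pair[0]], aligned[pair[1]])
            for pair in combinations(aligned, 2)}
    while len(sizes) > 1:
        a, b = sorted(min(dist, key=dist.get))  # closest pair of clusters
        merged = "(" + a + "," + b + ")"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            dist[frozenset((merged, other))] = (
                (d_a * size_a + d_b * size_b) / (size_a + size_b))
        for key in [k for k in dist if a in k or b in k]:
            del dist[key]
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Toy aligned sequences (placeholders, not real virus data).
toy = {
    "A": "AAAAAAAAAA",
    "B": "AAAAAAAAAT",
    "C": "AAAAAATTTT",
    "D": "AAAAATTTTG",
}
print(upgma(toy))  # ((A,B),(C,D));
```

UPGMA assumes all lineages mutate at the same rate, which is a strong assumption. Neighbor-joining or maximum-likelihood methods are the more common choices for real analyses.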
The preliminary analysis tells us that the virus genomes obtained from different countries aren't that diverse or split apart, though one can observe slight variations among them. Notable observations can be found in Cluster 1 (Finland, South Korea, US) and Cluster 6 (China, US, Japan, Taiwan), where the virus genomes carry late Jan 2020 timestamps with no trace of a single Dec 2019 sequence, unlike all other clusters.
{Research Outcome: There exists a possibility of Mutation}
This could also potentially explain the sudden massive spike in coronavirus cases in South Korea & Japan.
Another point to note: in Cluster 5, where all of China's early December 2019 genomes reside, one can see recent February 2020 sequences obtained from the US sneaking in. Does it mean that the aggressive virus prevalent in the early stages of the spread is active in the US as well? If so, the US needs to be more cautious and act proactively towards containment!
D. Genome Comparison Study: Covid-19(2020) vs SARS(2002)
Coronaviruses are a larger category of viruses that includes the latest Covid-19 (2020), MERS (2012) and SARS (2002). But Covid-19 killed 3 times as many people in 8 weeks as SARS did in 8 months!
So let's analyse how similar Covid-19 is to SARS.
Using the 'Coronavirus Genome' from Yunnan and the 'SARS Genome' from Sciences Centre, we'll do a Pairwise Sequence Alignment with the Needleman-Wunsch algorithm.
{ Research Outcome: Covid-19 shares 79.4% of its Genetic Code with SARS! }
I repeated the same analysis for MERS(Middle East respiratory syndrome) as well.
Unlike the stronger genetic overlap between Covid-19 and SARS, MERS is more weakly related to Covid-19.
{ Analysis Outcome: Covid-19 shares just 55.4% of its Genetic Code with MERS }
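A similarity percentage like the ones above is typically read off a pairwise alignment as percent identity. The sketch below shows one common convention: matching columns divided by all aligned columns, with a gap opposite a base counted as a mismatch. Other conventions use a different denominator, and the article does not say which one produced its figures, so treat this as an assumption. The example strings are toy data.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both rows are gaps (possible after a multiple alignment).
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("nothing to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Toy alignment: three of four columns match.
print(percent_identity("ACGT", "AC-T"))  # 75.0
```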
E. Detecting the Zoonotic Source
The coronavirus family is zoonotic in nature, that is, its members can be transmitted between species, from animals to people. In fact, the earlier SARS was transmitted from civet cats to humans and MERS from camels to humans. So analysing this could give us a greater understanding of the transmission.
From a similarity analysis across the genomes of all species, I tried to identify the genome sequence with a significant match to the coronavirus. Bats appear to be the hub of the virus!
{ Analysis outcome: Closest match to Human coronavirus is found in a bat coronavirus with a startling 96% similarity }
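The article does not name the tool behind this search (database searches of this kind are usually run with BLAST). As a rough, alignment-free stand-in, the sketch below ranks candidate genomes by how many short substrings (k-mers) they share with the query, using Jaccard similarity. This is a different measure from percent identity, so its scores are not comparable to the 96% figure. The candidate names and sequences are hypothetical toy data.

```python
def kmers(sequence, k):
    """Set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(a, b, k):
    """Shared k-mers divided by all distinct k-mers in either sequence."""
    set_a, set_b = kmers(a, k), kmers(b, k)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_candidates(query, candidates, k=8):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, jaccard_similarity(query, seq, k))
              for name, seq in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one near-identical candidate, one unrelated.
query = "ACGTACGTGACC"
candidates = {"near": "ACGTACGTGACA", "far": "TTTTGGGGCCCC"}
print(rank_candidates(query, candidates, k=4))
```

On real genomes a larger k (the default here is 8) keeps chance matches rare, and the top-ranked candidates can then be aligned properly to get a true identity figure.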
F. Coronavirus Family
When we say Coronavirus, we mean a group of viruses of which Covid-19 is one. I wanted to analyse Covid-19 further and see how it gels with its family members and ancestors.
There are seven strains of Coronavirus. HCoV-229E, HCoV-OC43, SARS-CoV, HCoV-NL63, HKU1, MERS-CoV and SARS-CoV-2.
I read in the nucleotides and stacked them up to prep for the alignment.
My machine is no supercomputer, so I had to leave it running the alignment for an hour. I then generated a radial tree to view the evolutionary relationships of the coronavirus family.
We can now see how Covid-19's genetics are intermingled with SARS, and despite its weaker relation to MERS, all three fall in the same broader cluster. One can also observe the closeness between OC43 and HKU1, both being dissimilar to the 229E and NL63 coronavirus strains. So a modified version of the remdesivir drug could possibly be the key to attacking the resistant strain.
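The stacking step amounts to writing every sequence into one multi-record FASTA file: a '>' header line per genome followed by its bases. Here is a minimal standard-library sketch. The strain names come from the list above, but the sequences are short made-up placeholders, not real genomes.

```python
def to_fasta(records, width=70):
    """records: dict of name -> sequence. Returns multi-record FASTA text."""
    lines = []
    for name, sequence in records.items():
        lines.append(">" + name)
        # Wrap long sequences onto fixed-width lines, as FASTA files usually do.
        lines.extend(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Inverse of to_fasta: returns dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Placeholder sequences only, for illustration.
stack = {"SARS-CoV-2": "ATTAAAGGTT", "SARS-CoV": "ATATTAGGTT"}
print(to_fasta(stack, width=5))
```

A file in this shape is what multiple sequence alignment tools generally take as input.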
G. Covid-19: Going Forward
As per the analysis conducted, we observed that Covid-19 and SARS possess 79.4% genetic similarity, which gives me a lot of comfort. If the halted R&D work on SARS is rekindled, all the insight and the vaccine development process could be of direct use in Covid's case, potentially resulting in an accelerated breakthrough.
What could be even more helpful is to get hold of the ReFRAME Collection and try targeted drug repurposing. Basically, it is a collection of thousands of compounds that have already been tested for safety on humans. One could evaluate these for an attack on Covid-19.
To shed some light, think of it as a universe of stocks (compounds) from the bank's Global Investment Committee that has been compliance approved (tested on humans). So you aren't making a fresh pick from scratch out of 100 thousand listed global stocks; rather, from within the approved investable universe you are evaluating the top ten stocks (combinations of compounds) on a relative basis to achieve the required risk-return mandate (vaccine).
This way, instead of spending years discovering a compound from scratch, we swiftly evaluate the existing compounds using machine learning to target Covid-19.
So now is the time to act fast, for the people and the economy. We all need to understand that any measures taken by policy makers will only be short-lived. One has to appreciate the Fed for sending a reassuring signal, but an incremental rate cut is not going to fix the supply chain disruptions. Either way, with only two more 50bps cuts before rates hit the floor, we have to preserve ammunition for the post-fallout scenario to revive the economy.
Spending the last couple of weekends on computational techniques over virus genome sequences has given me a lot of insight and confidence in the vaccine journey. I believe it is a matter of time before FAANG embarks on its usual multiple-expansion trip, defying fundamentals.
It is the era of BioInformatics. Bolstered with machine learning, we can soon make Covid-19 a thing of the past.
Credits: Thanks to U.S. Department of Health & Human Services(HHS), NIH, National Center for Biotechnology Information(NCBI) and all the institutes across the world for providing access to the virus genome sequences.
Special thanks to Olga Blinkova(Viral Genome Curator, NIH) for rolling back the constrained sequence extraction functionality across the NCBI's global website to facilitate the completion of my research.
Disclaimer: The research and analysis expressed in this article are my own and are not intended to take the place of a scientific finding or replace a systematic and critical review of the broader scientific literature, and may not be accurate. This is a summary of personal analysis work conducted in the hope of distilling the evolution of the virus.
P.S. I'm no Biologist. Just an Investment and AI guy on a quest to solve global problems with critical thinking and computational techniques. Remember, Investments is the only field that demands the individual to have an understanding of all the fields.