What have we learned about COVID-19 from the genome sequencing research?

By Susan Hu

Based on nearly 20,000 reported genome sequences of the COVID-19 virus (SARS-CoV-2) from labs around the world, Graphen has been monitoring and analyzing the genome evolution and propagation pathway of the virus since March 2020. The research has identified about 500 strains of viruses so far.

Most labs reported the whole genome sequences of the viruses they collected from confirmed patients. Because genomic variation happens when the virus reproduces, the locations of the variants can be treated as evidence of how the viruses evolve. Graphen uses next-gen AI to classify the locations of the mutations and to analyze and visualize the evolution of the virus.
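To make the idea of "mutation locations" concrete, here is a minimal sketch in Python. It is not Graphen's pipeline: the function name, the toy sequences, and the assumption that the two sequences are already aligned to the same length are all illustrative.

```python
def mutation_positions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each substitution.

    Assumes both sequences are already aligned and equal in length.
    Positions with an ambiguous base ('N') or a gap ('-') are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"N", "-"}
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()))
        if r != s and r not in skip and s not in skip
    ]


# Toy sequences, not real SARS-CoV-2 data.
reference = "ATGGTTCAGC"
sample = "ATGATTCNGT"
print(mutation_positions(reference, sample))
```

Samples that share the same set of mutation positions can then be grouped together, which is the intuition behind clustering strains into families.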

Based on our research with daily updates, here's what we've learned about COVID-19 and its eight major strains.

[Figure: The overall visualization, with the eight clusters labelled, as of April 20, 2020.]

When did the first COVID-19 case happen?

COVID-19 had its first publicly known outbreak in Wuhan, China, in late December 2019. However, based on our research, we have reason to believe that the virus may have appeared even earlier.

We've noticed that the viruses widely spread as of late April 2020 carry at most 12 to 14 mutations. These genomes have been gaining about one heritable mutation point every week. From our analysis, the virus strain from December 24, 2019 already had three to five mutation points. Counting back at roughly one mutation per week, it's possible that the earliest COVID-19 cases appeared in mid to late November 2019.
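The back-calculation above can be sketched as simple date arithmetic. This is an illustrative sketch only: the function name is hypothetical, and it assumes a constant rate of exactly one mutation every seven days, which is a simplification of the "one per week" observation.

```python
from datetime import date, timedelta


def estimate_origin_window(sample_date: date,
                           min_mutations: int,
                           max_mutations: int,
                           days_per_mutation: int = 7) -> tuple[date, date]:
    """Return (earliest, latest) estimated origin dates.

    Assumes mutations accumulate at a constant rate of one per
    `days_per_mutation` days (an assumption, not a measured rate).
    """
    earliest = sample_date - timedelta(days=max_mutations * days_per_mutation)
    latest = sample_date - timedelta(days=min_mutations * days_per_mutation)
    return earliest, latest


earliest, latest = estimate_origin_window(date(2019, 12, 24), 3, 5)
print(earliest.isoformat(), latest.isoformat())
```

Under these assumptions, the window runs from November 19 to December 3, 2019. The four- and five-mutation cases fall in mid to late November, in line with the estimate in the text.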

Different variants of the virus and their current location

From the data Graphen has accessed and the analysis we've conducted, the current COVID-19 viruses can be divided into eight major families according to their distribution and location.

The first ancestor of the virus, which is the starting point of all mutations and the earliest widely spread strain, is what we identify as the two A virus strains. The A family then evolved into two groups of offspring: one group is the E and F families, and the other is the B, C, and D families. The C family then mutated into G and H.

The A family appeared as early as late December 2019, while the H family was not discovered until February 19, 2020. It took about 60 days for the original virus A to mutate into H.

Among these eight virus families, B swept across China; C spread to Europe; D is concentrated in the U.K. and the Netherlands; E is concentrated on the West Coast of the U.S.; F spread to Spain, South Korea, Austria and China; G is mainly found in Europe; and H crossed the ocean to the East Coast of the U.S.
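The family relationships described here can be written down as a small parent map. This is a sketch of one reading of the text, assuming A is the direct parent of B, C, D, E and F, and C is the direct parent of G and H. The names `PARENT` and `lineage` are illustrative.

```python
# Parent of each family under the assumed reading of the text.
PARENT = {
    "A": None,  # root: the two earliest A strains
    "B": "A", "C": "A", "D": "A",  # one group of A's offspring
    "E": "A", "F": "A",            # the other group of A's offspring
    "G": "C", "H": "C",            # later mutations of C
}


def lineage(family: str) -> list[str]:
    """Return the path from the root family down to `family`."""
    path = []
    node = family
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


print(lineage("H"))  # ['A', 'C', 'H']
```

Walking the map upward from any family always ends at A, which is the sense in which every strain can be traced back to the earliest two.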

You can see a clearer breakdown of the viruses from the table below:

From the graph that visualizes the pathway of the viruses, we believe that the major virus strains in some regions mutated only before they became widespread in the local area, and it is impossible to prove the virus originated in China.

However, each strain of the new coronavirus can be traced back to the two earliest virus strains, collected in China at the end of December 2019 and the beginning of January 2020. Therefore, we cannot find any reason to believe that the virus did not spread from China.

[Figure: Graph of the COVID-19 virus A family]

Where did the US COVID-19 viruses come from?

The US cases fall into three strain families: B, E and H.

Family B: Originated and broke out in China

Since February 2020, very few virus strains have been uploaded from China. Of the more than 20,000 virus strains in the database, only around 400 are from China, with B accounting for the vast majority. More than half of the cases in Wuhan came from the B family, which then spread to other parts of China and around Asia.

Unexpectedly, COVID-19 was recently confirmed as the cause of death of a female patient who died on February 6, 2020 in Santa Clara, California. The New York Times reported that the patient had no recent history of traveling abroad. However, she was an auditor at a large semiconductor company with offices around the world, including Wuhan, and she often had contact with colleagues from all over the world. Was that how she came into contact with COVID-19?

When Graphen traced the virus genealogy, it found that Santa Clara had uploaded only two virus strains, both of which belong to the B family. But is B the original virus causing community infection? More virus strains need to be uploaded from there to confirm.

Family E: Widespread in Canada and on the U.S. West Coast

The first COVID-19 virus in the U.S. was collected in Seattle, Washington on January 19, 2020 from a 35-year-old man who had just returned from visiting his family in Wuhan. After four days of fever and cough, he put on a mask, went to a small local clinic and was confirmed to have COVID-19.

California, another large state on the West Coast, had its first case at the end of January 2020. After that first case appeared, however, the virus seemed to disappear for three weeks, and the outbreak did not begin until around February 20, 2020.

However, according to the figures from April 2020, Washington State has 155 infections per 100,000 people and California has only 80, both far lower than New York State's 1,248. Does this mean that the E family is less contagious?

We found no genetic evidence of that. It's possible that commuting habits on the West Coast slowed the spread of the virus: on the East Coast, especially in New York City, more people take public transportation, while on the West Coast most people drive their own cars and are more likely to maintain social distance.
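To put these per-capita figures side by side, the short sketch below computes how many times higher New York's rate is than each West Coast state's. The numbers are the ones quoted above; the variable and function names are illustrative.

```python
# Infections per 100,000 people, as quoted in the text (April 2020).
rates_per_100k = {"Washington": 155, "California": 80, "New York": 1248}


def rate_ratio(state: str, baseline: str, rates: dict[str, int]) -> float:
    """Return how many times higher `state`'s rate is than `baseline`'s."""
    return rates[state] / rates[baseline]


for west_coast_state in ("Washington", "California"):
    ratio = rate_ratio("New York", west_coast_state, rates_per_100k)
    print(f"New York vs {west_coast_state}: {ratio:.1f}x")
```

On these figures, New York's rate is roughly 8 times Washington's and roughly 16 times California's.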

Family H: Spread from France to the U.S. East Coast

The first confirmed case in New York appeared on March 1, 2020: a 39-year-old woman who had just returned to Manhattan from Iran. Two days later, a 50-year-old lawyer who had returned from Miami was also diagnosed.

Although the U.S. government closed the border to travelers from China at the end of January 2020 in an effort to block the virus from China, it did not recognize that the virus had invaded Europe until March 11, 2020, when the US completely banned entry from Europe.

By then, the virus had quietly crossed the Atlantic and landed on the East Coast.

Graphen's research shows that the largest virus strain on the East Coast first appeared in northern France on February 21, 2020 and landed in New York 10 days later. The virus on the East Coast is highly self-evolving, with up to 86% of the viruses in New York State being its offspring. It later swept across the East Coast, which is also why the number of diagnoses in New York is much higher than on the West Coast.

The likely origins of the US COVID-19 viruses

When one virus strain dominates a region, it often means that the virus is highly active and spread quickly as soon as it arrived. Many US media outlets believe that the US COVID-19 virus came from Europe rather than from China, as President Trump claimed.

After comparing the gene sequences, our analysis shows that the US virus came from both Europe and China: the former accounts for 53% of all US viruses (H family), and the latter accounts for 28% (E family).

Graphen has created two visualizations of the current virus strains in the US. To see them, follow the link and choose North America / USA: https://www.graphen.ai/covid/types.html

Acknowledgement: We thank GISAID for hosting the EpiCoV database, and the labs worldwide that shared the sequenced virus data that made this research possible.

If you want to learn more about Graphen's research on COVID-19, please reach out to me at susanhu@graphen.ai.