Artificial intelligence has been used to discover 161,979 new species of RNA virus, providing unprecedented insight into a diverse and fundamental branch of life present worldwide. The study, published in Cell and conducted by an international team of researchers, is the largest virus species discovery paper to date.
“We have been offered a window into an otherwise hidden part of life on Earth, revealing remarkable biodiversity,” said Professor Edward Holmes from the University of Sydney. “This is the largest number of new virus species discovered in a single study, massively expanding our knowledge of the viruses that live among us.” He added, “To find this many new viruses in one fell swoop is mind-blowing, and it just scratches the surface, opening up a world of discovery. There are millions more to be discovered, and we can apply this same approach to identifying bacteria and parasites.”
Although RNA viruses are commonly linked to human diseases, they are also found in extreme environments across the world and may play important roles in global ecosystems. The newly discovered viruses were found in environments such as the atmosphere, hot springs, and hydrothermal vents. “That extreme environments carry so many types of viruses is just another example of their phenomenal diversity and tenacity to live in the harshest settings, potentially giving us clues on how viruses and other elemental life-forms came to be,” said Professor Holmes.
The research team developed a deep learning algorithm called LucaProt to analyse vast amounts of genetic sequence data, including virus genomes up to 47,250 nucleotides long. “The vast majority of these viruses had been sequenced already and were on public databases, but they were so divergent that no one knew what they were,” said Professor Holmes. “They comprised what is often referred to as sequence ‘dark matter’. Our AI method was able to organise and categorise all this disparate information, shedding light on the meaning of this dark matter for the first time.”
LucaProt was trained to identify viruses based on genetic sequences and the secondary structures of the protein that all RNA viruses use for replication, the RNA-dependent RNA polymerase (RdRP). This approach significantly sped up virus discovery, which would have been far more time-intensive using traditional techniques.
Co-author Professor Mang Shi from Sun Yat-sen University said, “We used to rely on tedious bioinformatics pipelines for virus discovery, which limited the diversity we could explore. Now, we have a much more effective AI-based model that offers exceptional sensitivity and specificity, and at the same time allows us to delve much deeper into viral diversity. We plan to apply this model across various applications.”
Dr Zhao-Rong Li, from the Apsara Lab of Alibaba Cloud Intelligence, added, “LucaProt represents a significant integration of cutting-edge AI technology and virology, demonstrating that AI can effectively accomplish tasks in biological exploration. This integration provides valuable insights and encouragement for further decoding of biological sequences and the deconstruction of biological systems from a new perspective.”
Professor Holmes concluded, “The obvious next step is to train our method to find even more of this amazing diversity, and who knows what extra surprises are in store.”