A formidable artificial intelligence model called Evo can design DNA sequences to manipulate cell functions, create new genes and protein sequences, and even develop an entirely new CRISPR gene-editing system, allowing it to act as a potentially powerful tool for disease diagnosis and therapeutics.
The multimodal machine learning model, reported in Science, has been trained on 2.7 million evolutionarily diverse microbial genomes in order to decode and design DNA, RNA, and protein sequences from the molecular to genomic scale with unparalleled accuracy.
It is the first foundation model trained at scale on DNA and has been described by the Arc Institute in Palo Alto, where it was developed, as the “Rosetta Stone” of biology.
Evo’s ability to predict, interpret, and engineer genomic sequences on a vast scale represents major progress in understanding and engineering biology across multiple modalities and scales of complexity.
It unlocks deeper insights into biological processes and represents a leap for biotechnology that could fundamentally change how synthetic biology is performed.
“The ability to predict the effects of mutations across all layers of regulation in the cell and to design DNA sequences to manipulate cell function would have tremendous diagnostic and therapeutic implications for disease,” predicted Christina Theodoris, PhD, from the Gladstone Institute of Data Science and Biotechnology in San Francisco, in a related Perspective article.
The large language of life model (LLLM), a genomic foundation model, was able to design an entirely synthetic CRISPR gene-editing tool, including a guide RNA that improves the cutting ability of the Cas9 enzyme, often described as genetic scissors.
It could also design DNA sequences over one million base pairs in length, which reaches the size of many real genomes.
Evo uses deep learning techniques to efficiently process long sequences of genetic data. This has allowed it to develop a comprehensive understanding of the interplay within the genetic code.
Eric Nguyen, PhD, from the Arc Institute, and co-workers trained the large-scale biological sequence model on billions of DNA nucleotides across a range of prokaryotic and phage genomes to learn the grammar of DNA in a way not possible through the study of a single organism.
Through this, it can predict how small DNA changes affect the evolutionary fitness of an organism, and it can generate realistic, genome-length sequences of more than one megabase, greatly surpassing what prior models could produce.
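To make the idea of scoring a small DNA change concrete, the following is a minimal sketch under stated assumptions, not Evo's method. It assumes one common approach for sequence models, comparing the log-likelihood of a mutant sequence with that of the original, and it swaps in a toy first-order Markov chain for the trained model. The training sequences and the names fit_markov, log_likelihood, and mutation_score are hypothetical.

```python
import math

BASES = "ACGT"

def fit_markov(seqs):
    """Fit first-order transition probabilities with add-one smoothing.

    This toy model is a stand-in for a trained sequence model (an assumption).
    """
    counts = {a: {b: 1 for b in BASES} for a in BASES}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }

def log_likelihood(model, seq):
    """Log-probability of seq, with a uniform prior on the first base."""
    ll = math.log(1 / len(BASES))
    for a, b in zip(seq, seq[1:]):
        ll += math.log(model[a][b])
    return ll

def mutation_score(model, wild_type, position, new_base):
    """Log-likelihood of the mutant minus that of the wild type.

    A negative score means the toy model finds the mutant less plausible.
    """
    mutant = wild_type[:position] + new_base + wild_type[position + 1:]
    return log_likelihood(model, mutant) - log_likelihood(model, wild_type)

# Made-up training sequences, for illustration only.
training = ["ATGATGATG", "ATGATGCTG"]
model = fit_markov(training)
score = mutation_score(model, "ATGATG", 2, "C")
print(f"score for G->C at position 2: {score:.3f}")
```

In this sketch a substitution that breaks a pattern seen often in the training data receives a negative score. Whether such a score tracks real evolutionary fitness depends entirely on the model and the data behind it.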
The model operates at a single-nucleotide resolution to preserve the ability to learn from the complex informational landscape present in DNA. It is designed to capture two key aspects of biology, combining the multi-faceted nature of the central dogma of biology with the multiscale nature of evolution.
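As a rough illustration of what single-nucleotide resolution means for a sequence model, the sketch below maps every base to its own token, so a one-base substitution changes exactly one token. The four-letter vocabulary and the encode and decode helpers are assumptions made for this example and are not taken from Evo's implementation.

```python
# Hypothetical vocabulary: one integer ID per nucleotide (an assumption).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}
ID_TO_BASE = {i: b for b, i in VOCAB.items()}

def encode(seq: str) -> list[int]:
    """Map each nucleotide to its own token, so no bases are merged."""
    return [VOCAB[base] for base in seq.upper()]

def decode(ids: list[int]) -> str:
    """Invert encode: turn token IDs back into a DNA string."""
    return "".join(ID_TO_BASE[i] for i in ids)

wild_type = "ATGGCC"
mutant = "ATGGTC"  # differs from wild_type by a single substitution

wt_ids, mut_ids = encode(wild_type), encode(mutant)

# Positions (0-based) where the two token sequences differ.
changed = [i for i, (a, b) in enumerate(zip(wt_ids, mut_ids)) if a != b]
print(changed)
```

Tokenizers that merge several bases into one token would spread the same substitution across a larger unit, which is one reason a per-nucleotide encoding suits the study of small sequence changes.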
The central dogma integrates the interplay of DNA, RNA, and proteins into a unified code that results in a predictable flow of information, while the multiscale nature of evolution incorporates the vastly different length scales of biological function represented by molecules, pathways, cells, and organisms.
Evo has seven billion parameters and uses a frontier deep-learning architecture to model biological sequences at single-nucleotide resolution.
Through this, it can unravel the intricate coevolution between coding and noncoding sequences and design sophisticated biological systems like CRISPR-Cas complexes and transposable elements.
“Further development of large-scale biological sequence models like Evo, combined with advances in DNA synthesis and genome engineering, will accelerate our ability to engineer life,” the researchers maintained.