DNA is often compared to a written language. The metaphor leaps out: Like letters of the alphabet, molecules (the nucleotide bases A, T, C and G, for adenine, thymine, cytosine and guanine) are arranged into sequences — words, paragraphs, chapters, perhaps — in every organism, from bacteria to humans. Like a language, these sequences encode information. But humans can’t easily read or interpret these instructions for life. We cannot, at a glance, tell the difference between a DNA sequence that functions in an organism and a random string of A’s, T’s, C’s and G’s.
“It’s really hard for humans to understand biological sequence,” said the computer scientist Brian Hie, who heads the Laboratory of Evolutionary Design at Stanford University, based at the nonprofit Arc Institute. This was the impetus behind his new invention, named Evo: a genomic large language model (LLM), which he describes as ChatGPT for DNA.
ChatGPT was trained on large volumes of written English text, from which the algorithm learned patterns that let it read and write original sentences. Similarly, Evo was trained on large volumes of DNA — 300 billion base pairs from 2.7 million bacterial, archaeal and viral genomes — to glean functional information from stretches of DNA that a user inputs as prompts. A more complete understanding of the code for life, Hie said, could accelerate biological design: the creation of better biological tools to improve medicine and the environment.
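To make the analogy concrete: A genomic language model can treat each nucleotide as a token, much as a text model treats characters or word pieces, and generate sequences one base at a time. The Python sketch below is a hypothetical illustration of that idea, not Evo’s actual code. The tokenizer and the `next_base_probs` stand-in are invented for this example; in a real model, a trained neural network would supply the probabilities.

```python
import random

# Single-nucleotide "alphabet": each base gets its own token id,
# analogous to characters in a character-level text model.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = list(VOCAB)

def tokenize(seq: str) -> list[int]:
    """Map a DNA string to token ids (hypothetical tokenizer for illustration)."""
    return [VOCAB[base] for base in seq.upper()]

def next_base_probs(context: list[int]) -> list[float]:
    """Stand-in for a trained model: return one probability per base.
    A real genomic LLM would condition these on the whole context;
    here we return a uniform distribution purely for illustration."""
    return [0.25, 0.25, 0.25, 0.25]

def generate(prompt: str, n_new_bases: int) -> str:
    """Autoregressive generation: sample one base at a time,
    append it to the context, and repeat."""
    context = tokenize(prompt)
    out = list(prompt.upper())
    for _ in range(n_new_bases):
        probs = next_base_probs(context)
        token = random.choices(range(len(BASES)), weights=probs)[0]
        context.append(token)
        out.append(BASES[token])
    return "".join(out)

print(generate("ATGGCC", 12))  # "ATGGCC" followed by 12 sampled bases
```

In a system like Evo, the uniform stand-in would be replaced by a deep network whose next-base probabilities reflect the patterns learned from those billions of base pairs, so the sampled continuations resemble functional genomic sequence rather than random letters.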