JPM23: Not just for chat generators, Nvidia turns AI language models toward genomic, protein data

Artificial intelligence-powered generators such as ChatGPT may be all the rage—with the instant gratification of typing in a prompt and getting content in reply—but companies like Nvidia hope that these types of large language models can be made to read the codes that make up the building blocks of the human body.

Instead of training a program to mimic human conversation, Nvidia teamed up with the software developer InstaDeep and the Technical University of Munich to feed AI models genetic data: the Gs, Ts, As, Cs and more that eventually get translated into our proteins.

Using the DNA of hundreds of people, as well as Cambridge-1, the most powerful supercomputer in the U.K., the researchers found it was feasible to develop a generalizable program—a “genomic language model”—that could be applied to a variety of different tasks, instead of requiring scientists to build fit-for-purpose AIs to chase answers for each major biological question.

Cambridge-1 came online in 2021, with the help of AstraZeneca, GSK, the NHS, King’s College London and Oxford Nanopore. Designed to deliver 400 petaflops of performance, it’s one of the 50 fastest supercomputers on the planet.

At the same time, Nvidia has been working with the synthetic biology company Evozyne to build a large language model focused on constructing never-before-seen proteins. 

Built on Nvidia’s BioNeMo framework—part of its cloud-based offerings for drug discovery—the pair showed they could use the program to add dozens of amino acid mutations to a human metabolic protein known as PAH, changing its shape into a more efficient form. 

For example, by adding 167 mutations, leaving only half of the protein’s original sequence in place, researchers found a shape that could potentially enhance its function by 15%. Another iteration with 51 mutations retained 85% sequence similarity but boosted function by two-and-a-half times, the company said.
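To make the mutation counts and similarity figures above concrete, here is a minimal Python sketch of how percent identity between a protein and a mutated variant can be computed. It rests on assumptions the article does not confirm: that every mutation is a single-residue substitution (no insertions or deletions) and that "sequence similarity" means the share of identical positions. The function names and the 20-residue toy sequences are invented for illustration and are not the real PAH sequence.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (substitutions only)")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def implied_length(num_mutations: int, identity_percent: float) -> float:
    """Sequence length implied by a substitution count and a percent identity."""
    return num_mutations / (1.0 - identity_percent / 100.0)


# Toy sequences, invented for illustration only.
wild_type = "MSTAVLENPGLGRKLSDFGQ"
variant = "MSTAVLENPGLGRKLSDAAA"  # last three residues substituted

print(percent_identity(wild_type, variant))   # 17 of 20 positions match
print(round(implied_length(51, 85.0)))        # length implied by 51 mutations at 85%
print(round(implied_length(167, 50.0)))       # length implied by 167 mutations at about 50%
```

Under those assumptions, both reported variants imply a designed sequence of roughly 334 to 340 residues, so the two figures are at least consistent with each other. Since "only half" is an approximation in the article, treat the implied lengths as rough estimates.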

The work builds on computer models examining protein folding that have recently made major breakthroughs. DeepMind’s AlphaFold AI helped solve a decades-old biological puzzle in 2020, showing it could predict the final shape of a protein from a string of amino acids down to the width of a single atom. 

And last year DeepMind expanded its work to span nearly every protein known to science, by providing visual models of nearly 200 million proteins cataloged from animals, plants, bacteria and more.

Nvidia and Evozyne think their program can set the stage for designing proteins that could one day be tuned to treat congenital diseases, or used to sequester carbon dioxide from the atmosphere.