Humanity's guide to the human genome just got a major upgrade

Scientists have released the first draft of the human pangenome reference, a new blueprint of human DNA built on data from a much more diverse cohort than the original.

“We’re going to be able to understand forms of genetic variation in genes that we never could characterize properly before,” senior author Evan Eichler, Ph.D., based at the University of Washington School of Medicine in Seattle, said during a press briefing. Eichler and the other scientists behind the effort are members of the Human Pangenome Reference Consortium, an international group backed by the National Institutes of Health’s National Human Genome Research Institute. 

Several papers describing the developments were published May 11 in Nature. The initiative builds on past attempts to make the human genome reference more accurate. When the results of the Human Genome Project were published in 2003, the sequence was about 90% done. While many updates and corrections have been made over the years as technology has improved, only in March 2022 was it officially completed, thanks to an undertaking by researchers with the Telomere-to-Telomere Consortium. (Many of those same scientists are also part of the Human Pangenome Reference Consortium.)

The latest updates are a significant step toward making scientists’ guide to the human genome better represent the global population. While the original map ushered in a wave of biomedical progress, it was built on genomic data from just 20 people, with most of it coming from one individual. The latest draft incorporates two genome sequences from each of 47 people from genetically diverse backgrounds. The number of people is slated to grow to 350 by mid-2024.

“We refer to [the original human genome sequence] constantly when we talk about genes,” senior author Benedict Paten, Ph.D., of the UC Santa Cruz Genomics Institute, said. “The problem with the current reference is that it is both incomplete and it lacks diversity. Concretely, it lacks the things that make us different genetically, and thus the interesting things.”


If you took the genetic sequences of two individuals and laid them side by side, 99.6% of the data would match up. But the differences are what’s important. They explain why one person might be more likely than the other to develop a heart condition or cancer, for instance. If both have the same disease, the differences could predict whether one might benefit more from a specific treatment. Having data from too few individuals creates blind spots. If the data used to build the reference are too narrow, a person’s particular gene variant may never have been documented as a risk, and a clinician might miss that the person is vulnerable to a genetic disease.

Take the complex gene for lipoprotein(a), a type of low-density lipoprotein that carries cholesterol in the blood. Clinical studies have shown that higher levels of lipoprotein(a) correspond with a greater risk of having a heart attack. Mutations in the gene for lipoprotein(a) are associated with the biggest genetic risk for coronary heart disease in African Americans. But because the gene had not been completely sequenced, many cases lacked a clear cause.

“Now that we can actually sequence that gene in its entirety, and we can understand the variation in that gene, we can start to go back to unexplained cases of patients with coronary heart disease and risk and associate it with variation that now comes out of the pangenome,” Eichler explained.

Another example is complement factor 4A and 4B, or C4A and C4B, which are involved in the immune response. Structural variations in the genes for C4A and C4B that result from a process called gene conversion have been closely linked to schizophrenia. The pangenome reference gives insight into those variations, so clinicians and scientists can draw more accurate associations between the variations and disease in patients, Eichler said.

“We know the signal is there for these few examples, but there’s a lot more left to be discovered,” he added. “We now have the framework to actually do that discovery.”

DNA sequencing technology has progressed significantly in the 20 years since the Human Genome Project published the original human genome map. One major improvement making the new reference possible is long-read DNA sequencing, which reads long stretches of DNA at a time. Others include advanced computational techniques that enable researchers to align sequences from many different individuals and assemble them into a single guide.

“[These tools] are allowing us to create genomes that are essentially complete, and when you have the complete sequence, you have the complete genetic variation,” Eichler said.