The idea of a universal translator is nothing new. As the age of communication pursues its Star Trek future, with apps and devices translating speech in real time, what matters is the endgame: sharing information, experiences and stories irrespective of language, upbringing or culture, across divides of any kind. Why should data be any different?
The goal of breaking down research silos and pooling knowledge is nothing new, either. But the National Center for Advancing Translational Sciences, the youngest of the National Institutes of Health, is tackling a project much broader in scope.
It aims to bring together disparate data sets from across the healthcare enterprise: genetics, proteins, imaging and cell processes, along with health and economic records, diagnostics, environmental data and clinical trial outcomes.
Each of these is telling the same story—the patient’s story—in a different way. While a doctor may describe a person’s disease based on the presenting signs and symptoms, a molecular biologist could characterize the same ailment through a genetic rearrangement.
“The concept here is that there are about 20 different languages that the biomedical research world uses,” NCATS Director Chris Austin, M.D., said at an event hosted by the Center for Data Innovation. “Everything from the geneticists to the cell biologists to the pathway people, to the pathologists, pharmacologists, physicians and the drug development people—they all have these idiosyncratic languages. And it's impossible to go directly from one to the other, so many translational projects just fall into the crevices and never get out.”
The ultimate goal of the project is to drive a wholesale regrouping of patients based on the entirety of data available. This would refine our current definitions of disease and provide doctors with clearer routes for helping patients who may or may not respond to different treatments.
Its basic architecture comprises dozens of annotated sources of knowledge gathered from collaborating institutions across the U.S., including regulatory-compliant clinical data and patient summaries.
These individual sources each propose responses to a user’s inquiry, which are gathered and linked within a virtual knowledge graph. Then, reasoning engines focused on different aspects of the problem—such as the disease model, genetic variants or cell pathways—work to apply their own analyses and chart potential insights through all the sources within the graph.
Though the sources themselves do not communicate with one another, the user can evaluate the engines' ranked responses and their supporting evidence, then deliver feedback that helps train the system's reasoning engines.
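The pattern described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration, not NCATS's actual system: all class names, relations and data here are invented stand-ins for the real knowledge sources, the virtual knowledge graph that links their proposals, and the reasoning engines that rank paths through it.

```python
# A toy sketch of the federated pattern described above: independent
# knowledge sources each propose answers to a query, the proposals are
# linked into one graph, and a simple "reasoning engine" scores paths.
# All names and data are hypothetical.

from collections import defaultdict

class KnowledgeSource:
    """One annotated source that proposes (subject, relation, object) edges."""
    def __init__(self, name, triples):
        self.name = name
        self.triples = triples

    def query(self, term):
        # Propose every edge whose subject matches the query term.
        return [t for t in self.triples if t[0] == term]

def build_graph(sources, terms):
    """Gather each source's proposals into one virtual knowledge graph."""
    graph = defaultdict(list)
    for term in terms:
        for src in sources:
            for s, rel, o in src.query(term):
                graph[s].append((rel, o, src.name))
    return graph

def rank_paths(graph, start, weights):
    """A toy reasoning engine: score two-hop paths from the start node.
    The per-relation weights stand in for learned user feedback."""
    paths = []
    for rel1, mid, _ in graph.get(start, []):
        for rel2, end, _ in graph.get(mid, []):
            paths.append((weights[rel1] * weights[rel2], [start, mid, end]))
    return sorted(paths, reverse=True)

# Two hypothetical sources speaking different "languages".
genetics = KnowledgeSource("genetics", [("asthma", "associated_gene", "GENE_X")])
pharma = KnowledgeSource("pharmacology", [("GENE_X", "targeted_by", "DRUG_Y")])

graph = build_graph([genetics, pharma], ["asthma", "GENE_X"])
ranked = rank_paths(graph, "asthma", {"associated_gene": 0.9, "targeted_by": 0.8})
print(ranked[0][1])  # the top-ranked disease -> gene -> drug path
```

The point of the shape, as Austin describes it, is that neither source alone connects the disease to the drug; the insight only appears once their proposals are linked in the shared graph.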
“We've been working on this for about three years, and we have just finished the feasibility stage to show that this is actually possible,” Austin said. “It's a 20-dimensional problem—20 dimensions—which really gets the data people’s blood flowing. I can't even conceptualize what that means, but they get very excited when you say that. And it has shown a remarkable ability to give insights.”
It’s much different from simply interpreting a question in a format that a source can easily understand and fulfill, such as asking for directions to the library. (“¿Dónde está la biblioteca?”, or “Where is the library?”)
It’s more about coming to simultaneous realizations that no one participant could reach on their own, multiplied by hundreds of players at once. (“Which children with asthma respond well to which treatments, even when exposed to high levels of air pollution, and why?”)
Pediatric asthma and the rare disease Fanconi anemia, each of which can be triggered by a specific mix of narrow genetic defects and environmental exposures, were chosen as the project’s early test cases; each requires complex reasoning across multiple data sources to yield deeper answers.
But the key feature of any translator is that it works both ways, and this is no exception. Potentially, you could also start with the chemical features and mechanisms of a particular drug, and rationally compare them across all the known diseases in the system for possible successes, Austin said.
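The reverse direction Austin describes can be sketched the same way: keep the same kind of edges, but walk them backward from a drug toward diseases. Again, the data and function names here are hypothetical illustrations, not the project's real vocabulary.

```python
# Hypothetical sketch of the translator working in reverse: start from a
# drug and follow the same edges backward toward candidate diseases.

triples = [
    ("asthma", "associated_gene", "GENE_X"),
    ("fanconi_anemia", "associated_gene", "GENE_Z"),
    ("GENE_X", "targeted_by", "DRUG_Y"),
]

def diseases_for_drug(drug, triples):
    # First find the genes the drug targets, then the diseases
    # associated with those genes.
    genes = [s for s, rel, o in triples if rel == "targeted_by" and o == drug]
    return [s for s, rel, o in triples
            if rel == "associated_gene" and o in genes]

print(diseases_for_drug("DRUG_Y", triples))  # ['asthma']
```

Because the edges are symmetric facts rather than one-way answers, the same graph that explains a disease can also surface repurposing candidates for a drug.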
“If you ask how we did this, it is a massive team effort,” he said. “It's about 100 people from about 20 different universities and research institutions all over the country. And it is so exciting to the data scientists and the physicians and the geneticists and the others working on this that now, fully half of the people on this project we don't even pay. They're so excited they work on it for free. These are academic researchers, who—well, I'm not familiar with that behavior most of the time. And yet, it is happening here.”
Still, NCATS announced a new funding opportunity last month, totaling about $13.5 million for up to 15 projects, to help establish a translator consortium that will set community standards for reusing data, integrate new knowledge sources, and build tools that help tell the story of disease and how it unfolds.