We’ve all heard of Big Data, capital B, capital D, and the myriad ways the life sciences industry can exploit it. Algorithms can make sense of vast datasets, helping pathologists analyze tissue samples to diagnose disease or crunching genomic data to predict a person’s risk of aneurysm. But what about “small data”? Can the tools built for big data be applied in situations where little or no data are available?
First things first: Given the utility of big data, why are biotech companies even thinking about “small data”?
When big data was “all the rage,” many players started out by “exploiting the wealth of large databases, public repositories, patent data, [scientific] literature data”—essentially big datasets that were already out there. This work was important in getting the field started, Andrew Hopkins, CEO of Exscientia, a company using an artificial intelligence-based platform in drug design, told FierceBiotech.
However, working with existing datasets—however large—is not a direct path to innovation: “We quickly understood that if you are using machine-learning models dependent on having a large amount of data, the downside is most projects tend to have been projects that have already been well worked on. The cutting edge in drug discovery is in first-in-class, novel targets, where there is actually very little data,” he said.
Small data, he said, are “the next frontier of problems we really need to solve.”
“Ultimately, every drug discovery project starts off as a small-data project,” he said. “And of course, that is where the commercial imperative is as well, to develop innovative medicines.”
He’s talking about coming up with new compounds and then optimizing them—figuring out which compound to make next and which experiment to conduct next to eventually arrive at a drug candidate. Because after all, an active compound does not a drug make. It has to tick a number of other boxes—selectivity to its target, solubility so it actually gets absorbed into the body, and so on—before it can be moved forward as a drug candidate.
“At the start of a drug discovery project, you might have five compounds that are active, but you don’t know their selectivity or their solubility—you don’t know a lot of things,” said Willem van Hoorn, chief decision scientist at Exscientia.
One way to learn all those things is to conduct “a gazillion of experiments,” van Hoorn said. “You will get there, even if you do experiments at random. But it’s not a very efficient way.”
Or, “people often try to force the use of a sophisticated model like deep learning... that works in a big data environment, in a lower data environment,” said Therence Bois, co-founder and director of operations at InVivoAI, a Montreal-based startup focusing on deep learning algorithms for low-data situations.
“The key thing is understanding what the problem is and designing the appropriate technology to solve it,” said Hopkins and van Hoorn.
“The real power of algorithms comes from exploiting very large datasets—the larger it is, the more powerful it is,” Hopkins said. Those types of datasets aren’t available in drug discovery, so a different approach is needed in low-data settings.
Both Exscientia and InVivoAI work with active-learning models, which—as the name suggests—don’t just make predictions based on existing data.
“With deep learning, you generally get models that perform very, very well for the data they were trained on. But give them a new set of compounds or a new set of samples, they perform quite poorly,” said Daniel Cohen, co-founder and CEO of InVivoAI.
Active learning models can take in new data, learn from them and become better models: “We can get better predictive models, and ultimately, better compounds, by actively interfacing with medicinal chemistry teams,” Cohen said. The models can also generate new compounds, new structures that perform better than the ones used to train them.
“Using active learning is kind of putting a human into the loop. After generating molecules in silico, we can use active learning to test the molecules in a real wet lab and get more data to put back into the initial model,” Bois added. This can help avoid hurdles, such as having a model generate a compound that, in theory, would work very well, but is too complicated or too costly to synthesize in real life.
Exscientia is using active learning to identify and prioritize the compounds its models predict will yield the most information, so that optimization moves faster.
“Which compound should we make next? Which one will give me the greatest learning, the greatest information to optimize my project faster?” Hopkins said. “We are asking the question: how can you learn as fast as possible?”
The next compound the model predicts may not yet be a drug candidate, but it might be one that yields data to make the model better—a step in the right direction, van Hoorn said.
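For readers who want to see the idea in code, the "which compound next?" question can be sketched as a toy pool-based active learning loop. This is only an illustration under stated assumptions, not Exscientia's or InVivoAI's method: each compound is reduced to a single made-up descriptor value, the `assay` function is a hypothetical stand-in for a wet-lab experiment, and "most informative" is approximated by how much a small bootstrap ensemble of nearest-neighbor models disagrees about a candidate.

```python
import random
import statistics


def assay(x):
    """Hypothetical stand-in for a wet-lab measurement; not a real assay."""
    return -(x - 0.7) ** 2  # toy activity that peaks at descriptor value 0.7


def knn_predict(tested, x, k=2):
    """Average the measured values of the k tested compounds closest to x."""
    nearest = sorted(tested, key=lambda pair: abs(pair[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def disagreement(tested, x, rng, n_models=20):
    """Variance of predictions from models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        resample = [rng.choice(tested) for _ in tested]
        preds.append(knn_predict(resample, x))
    return statistics.pvariance(preds)


def active_learning(n_rounds=5, seed=0):
    """Run a toy make-test-learn loop and return (tested, untested)."""
    rng = random.Random(seed)
    pool = [i / 20 for i in range(21)]  # 21 hypothetical candidate compounds
    start = [pool[i] for i in (0, 5, 10, 15, 20)]  # five already measured
    tested = [(x, assay(x)) for x in start]
    untested = [x for x in pool if x not in start]
    for _ in range(n_rounds):
        # Choose the candidate the ensemble is least sure about.
        pick = max(untested, key=lambda x: disagreement(tested, x, rng))
        tested.append((pick, assay(pick)))  # "run the experiment"
        untested.remove(pick)
    return tested, untested


tested, untested = active_learning()
print(len(tested), "compounds measured;", len(untested), "still untested")
```

The point of the sketch is the selection rule: each round picks the compound whose result the current models can least agree on, rather than the one predicted to be most active, so every experiment is chosen for what it teaches the model.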
The British company counts Celgene, GlaxoSmithKline, Sanofi, Roche and Evotec among its partners.
InVivoAI, too, is working with partners to address different challenges in a small-data environment. And when it says small data, it doesn’t just mean small datasets. It also means noisy data, heterogeneous data, or a dataset in which not many compounds were sampled. Basically, a dataset characteristic of early-stage drug discovery, Cohen said.
One of its projects involved working with cells taken directly from 20 patients, which meant the team could only screen so many compounds. Using a virtual library of only 1,200 compounds, InVivoAI developed a model that would predict how each of the compounds would work in each of the 20 cell lines.
“It was a challenge, but it worked quite well,” Cohen said. “The next step is to generate entirely new compounds optimized for activity on a patient-by-patient basis.”
Eventually, InVivoAI plans to test the compounds it generates in vitro and then in vivo.
“A lot of people propose interesting computational models, but they don’t actually test that out. If you create a model, can you prove you can go synthesize them and prove they’re behaving the way you think they’re behaving?” Cohen said. Because that’s what the biopharma industry wants to see: computational approaches leading to real outcomes, a.k.a. drug candidates.
“At the end of the day, to create something new, we need to go beyond current models with a handful of millions of data points,” he said. “We need to go beyond the known universe to get new stuff.”