Big Data sheds light on pharma's 'Small Data' problems

Lee Feigenbaum, VP of Marketing, demoing his company's software

By Lee Feigenbaum, Cambridge Semantics

The best thing about the Big Data hype in pharma is how effectively it's shed light on all of the Small Data problems the industry is facing. The roots of the Big Data movement in pharma were innocent enough: challenges in storage, data access, and data analytics that organizations started seeing with shifts toward high-throughput screening and massive genomics data sets. But as Big Data became more and more mainstream, the business challenges that got slapped with the "Big Data" label drifted further and further afield. Industry analysts noticed this quickly, redefining Big Data in terms of the three (or four) Vs--not just volume but also variety, velocity, and variability. Others have been quick to follow. At a recent conference on data-driven drug development, speaker after speaker stood up to talk about their approach to Big Data, and each immediately clarified that they meant the variety of data rather than its volume.

There's a good reason for this. While it's true that voluminous Big Data problems are sexy and grab headlines easily with exotic talk of petabytes and exabytes, the number of people across a pharma company who actually deal with these volumes of information as part of their day-to-day job is vanishingly small. Put another way, while Big Data is a real problem, it's not a Big Problem. What is a Big Problem, on the other hand, is the challenge of dealing with the diverse variety of (small) data that's needed for decision-making throughout the drug discovery, development, and commercialization life cycles.

You might see analysts refer to this as the variety axis of Big Data, but the challenge is really around getting unified information access.

One aspect of this challenge that every pharma organization faces is harmonizing data as it is aggregated. For example, any references to ALS, Lou Gehrig's disease, or amyotrophic lateral sclerosis need to be recognized as the same disease so that data about the disease from one source (e.g., pathway data) can then be integrated with information from another source (e.g., affected population data).
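To make the harmonization step concrete, here is a minimal sketch in Python. It assumes a small hand-written synonym table and two tiny record lists; the names SYNONYMS, canonical_disease, and integrate, along with the sample records, are hypothetical and purely illustrative. A real deployment would more likely draw on a curated vocabulary than on a hard-coded dictionary.

```python
# Hypothetical synonym table mapping each known spelling to one canonical name.
SYNONYMS = {
    "als": "amyotrophic lateral sclerosis",
    "lou gehrig's disease": "amyotrophic lateral sclerosis",
    "amyotrophic lateral sclerosis": "amyotrophic lateral sclerosis",
}


def canonical_disease(name: str) -> str:
    """Normalize case and whitespace, then look up the canonical name.

    Names not in the table are returned in normalized form.
    """
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, key)


def integrate(pathway_records, population_records):
    """Group records from both sources under their canonical disease name."""
    merged = {}
    sources = (("pathway", pathway_records), ("population", population_records))
    for source, records in sources:
        for record in records:
            disease = canonical_disease(record["disease"])
            entry = merged.setdefault(disease, {"pathway": [], "population": []})
            entry[source].append(record)
    return merged


# Illustrative placeholder records, not real data.
pathway = [{"disease": "ALS", "detail": "pathway record (placeholder)"}]
population = [{"disease": "Lou Gehrig's Disease", "detail": "population record (placeholder)"}]

combined = integrate(pathway, population)
for disease, sources in combined.items():
    print(disease, len(sources["pathway"]), len(sources["population"]))
```

Once both sources are keyed on the same canonical name, records that started out under different labels end up in a single entry, which is the precondition for integrating them.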

Another aspect of this challenge is the extent of data diversity that faces pharma today. Any unified approach to data must take as broad an interpretation of relevant information as possible. That means information needs to include traditional structured data (e.g., pathway, target, and genomics databases, CDRs and CTMSs, or manufacturing, finance, and CRM systems), completely unstructured text content (e.g., trial protocol documents, in vivo assay write-ups, clinical case reports, or product perception in social media sites), and all sorts of semistructured sources in between (e.g., CRO-generated spreadsheet data or public NCBI XML data).

That kind of broad and deep view of data grants scientists, business analysts, safety officers, managers, directors, and executives access to the critical data that informs their decision-making, wherever the data may be. The process of harmonizing data may be internal, but the data itself may come from just about anywhere--CROs and CMOs, content vendors, even public data--and access needs to be timely. Decision makers can't afford to wait three months for an IT project to gain access to data needed for a decision due this week.

By allowing business users to get immediate and integrated access to all data relevant to critical business decisions, regardless of its location and format, pharma companies can gain a significant competitive advantage. For example, to maintain robust pipelines, Big Pharma continues to look for earlier- and earlier-stage drug candidates to license. But the earlier in development a compound is, the riskier a licensing deal can be. Mitigating this risk requires knowing as much about the candidate drug as possible: about its indication, about its mechanism of action, about competing products and development programs, about the IP landscape, about leading researchers in the area, about expected safety and efficacy targets, about relevant manufacturing or reimbursement concerns, etc. This is a typical "small data" problem. The information needed to form a complete understanding of the drug-development landscape is scattered across journal articles, grant and IP databases, regulatory filings, clinical trial results, and research presentations. Requirements also vary from one licensing opportunity to the next, meaning that there's no way to build a one-size-fits-all solution. The total data involved in this sort of competitive intelligence analysis may be relatively small--certainly no more than a few GB of data--but both the diversity of the data and the value of solving this Small Data problem are enormous.

The most important data-related challenge facing pharma is to use data--any data--to make more and more critical business decisions. Most of these decisions don't need Big Data: they need the right data--whether Big or Small--and they need it at the right time.

Lee Feigenbaum is a leading expert in Semantic Web technologies and their applicability to enterprise IT challenges. As VP of Marketing at Cambridge Semantics, Lee helps ensure that the Anzo product suite continues to address customers' ever-changing and diverse data challenges. Prior to co-founding Cambridge Semantics, Lee spent over 5 years as an engineer with IBM's Advanced Internet Technology Group.