Industry Voices: In Pursuit of Scientific Hive Mind

Apr 17, 2012 9:04am

By David Steinberg

A vocal and active scientific community is clamoring for unrestricted access to scholarly works, a movement known as "open access", while journal publishers largely resist the call. For a reasoned and nuanced commentary, check out EMBO Director Maria Leptin's 3/16 editorial in the journal Science. Oh wait, you can't--it's behind AAAS' pay wall. But why is open access important? And even if it becomes a reality, what will we do with all of that information once we get it? While perceptions of fairness, equity and morality drive much of the open access groundswell, perhaps the greatest imperative is that of maximizing Public Knowledge. Fueled by this wealth of information, new social platforms and semantic search/mining technologies are poised to change the way that science is conducted, communicated and understood.

University of Bristol Professor of Theoretical Physics John Ziman wrote in 1968: "Science is Public Knowledge … Its facts and theories must survive a period of critical study and testing by other competent and disinterested individuals, and must have been found so persuasive that they are almost universally accepted … [science's] goal is a consensus of rational opinion over the widest possible field." As the scope of human scientific endeavor explodes, Ziman's mandate becomes increasingly difficult to carry out. Since the first scientific journal, Journal des sçavans, was published in 1665, researchers have penned over 50 million scientific papers (1.5 million in 2009 alone) and there are now over 23,000 scientific journals in circulation.

The vast majority of this work remains locked behind pay walls, where it's left to a small group of editors and referees to manually review, curate and error-check the output--a virtually impossible task. In a sweeping 2012 study, Amgen scientists demonstrated that only 6 out of 53 preclinical cancer studies from notable journals could be replicated, according to a recent Nature article. Of 49 highly cited clinical studies evaluated in 2004, only 44% were reproduced, according to findings published in the Journal of the American Medical Association in 2005. And outright mistakes are surging: The Wall Street Journal recently showed that the rate of retractions in scholarly publications has increased 15-fold since 2001.

Publishers don't seem to get it, though, as a recent blame-deflecting editorial by Nature Publishing Group makes clear: "What can journal editors and referees do? Sloppiness is sometimes caught, but so much must be taken on trust." We need dramatic changes if we are to avoid getting lost and overwhelmed in a torrent of disjointed and sometimes incorrect scientific data and insight. Certainly the first step is to open up all scholarly publications so that the scientific community can participate as exactly that--a community--in the vetting and dissemination of new research. But even if all of this work were completely transparent, broadly accessible, and 100% open, the technical challenges involved in bringing the right information to the right people at the right time are tremendous.

Imagine, though, a vast, collective "hive mind" that consumed, processed and disseminated all scientific knowledge instantaneously and ubiquitously, incorporating not only published works, but also raw data, ancillary analysis, and digital signatures of collaborators, resources and other critical elements. Like the proverbial 100th monkey phenomenon, once the hive mind understood something, we'd all understand it. In 2012 Internet terms, the hive mind would combine the best of human intelligence (insight, opinion and crowd wisdom) and machine intelligence (big data, machine learning and semantic data mining) in a massive, automated, self-organizing, dynamic, machine- and crowd-sourced wiki of all scientific knowledge organized into topics and relationships, rigorously vetted and reviewed. It would be readable by machine for querying and analytics, and knowledge would be pushed to interested researchers, physicians, journalists, students and consumers when and where they needed it.

While this may seem far-fetched, the building blocks are already being assembled. Google has publicly acknowledged the limits of its PageRank and other algorithms, and is aggressively pursuing a "knowledge graph"-based approach of creating a massive semantic database of entities and their relationships (e.g., storing information like the geography, depth, and surface area of every lake on the planet). Wikipedia itself is already starting to adopt elements of scientific peer review and dynamic scholarly publishing, while Microsoft co-founder Paul Allen and others are "teaching Wikipedia to write itself", funding the Wikidata project to standardize Wikipedia's structure and make it machine-readable.

In the meantime, researchers like Andrew McCallum of the University of Massachusetts and David Blei of Princeton are applying an approach called Probabilistic Topic Modeling to sift through millions of scientific publications to discover relationships, themes and emerging fields that humans may not yet even have recognized. Novel, alternative, scientific impact ranking systems based on Tweets, Facebook likes, hyperlinks, downloads, etc. (altmetrics) are redefining reputation-based metrics.

Specialized applications of these approaches will fundamentally change the way science is conducted, communicated and understood across many fields. In healthcare, the hive mind will automatically match patients with the most relevant medical professionals, clinical trials and drug research. Knowledge-empowered patients will arrive at their physicians' offices armed with up-to-the-minute disease knowledge provided by a system that "knows" their personal demographics and health history. In education, dynamic, living repositories of key research fields will enable new Ph.D. students to collapse years of background work into weeks, and static textbooks will give way to self-organizing, multi-dimensional wikis that update in real time. In research, industry and academic scientists seeking outside collaborators will have direct access to state-of-the-art information in any area and be instantaneously connected with the world's leading experts in that field. The hive mind will democratize science for a new generation of researchers operating out of small colleges, garages and high schools, and level the playing field in the developing world as up-to-date scientific information becomes rapidly and broadly available. Investors, philanthropists and patient groups will see the real story behind a disease, not just the sensationalized headline. Eventually, the hive mind will become a reality, and deep scientific knowledge will pervade everything we do. But all of this hinges on availability of the data, so first things first--publishers, it's time for open access.

David Steinberg is a partner at PureTech Ventures, a Boston-based venture creation company. Mr. Steinberg has co-founded 6 companies, most recently Vedanta Biosciences, which is developing therapies based on the biology of the human microbiome, and Knode, an automated platform for identifying and collaborating with experts in biomedicine.