Computer scientists warn about the coming tsunami of genomics data

Study co-authors Saurabh Sinha and Gene Robinson

Over the last decade, the amount of genomic sequencing data available has doubled every 7 months, rivaling Twitter ($TWTR) and YouTube for data storage demands and analysis capacity. But the deluge of DNA sequencing data is just beginning, and the exabyte-sized tsunami to come will require a new approach to the technology used to store the data--and learn from it, according to experts from the University of Illinois and Cold Spring Harbor Laboratory.

Twitter and YouTube offer excellent comparisons, according to the team of computational experts. Each brings together massive amounts of data from a huge and diverse network of sources--just like the genomics field. But they've done it by using a common format, something the sequencing field lacks and should learn from the social media leaders.

"The sequence data have to be analyzed through sophisticated and often computationally intensive algorithms, which find patterns in the data and make connections between those data and various other types of biological information, before they can lead to biologically or clinically important insights," says co-author Saurabh Sinha, a professor of computer science at Illinois. "All of this makes the goal much more challenging than just sequencing DNA and storing that information."

Processing the data with various methods as they are generated could also prove invaluable as investigators learn more and more.

"In the future, we may have to take the hard decision of storing only the processed form and not the original, and that, too, in heavily compressed forms, to drastically reduce the storage needs," Sinha adds.

Just how big will DNA sequencing data get? The computer experts say that by 2025, 10 years from now, we'll have exabytes of data, counting the gigabytes by the billion.
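To get a feel for what a 7-month doubling time implies, here is a rough back-of-the-envelope sketch. It assumes the historical doubling rate simply continues unchanged, which the researchers do not claim, and the function names and the 1-petabyte starting size are hypothetical, chosen only for illustration.

```python
import math

DOUBLING_MONTHS = 7      # historical doubling time cited in the article
GB_PER_EB = 10**9        # 1 exabyte = one billion gigabytes (decimal units)


def growth_factor(years, doubling_months=DOUBLING_MONTHS):
    """How many times larger the data become after `years`,
    assuming a constant doubling time (an assumption, not a forecast)."""
    return 2 ** (years * 12 / doubling_months)


def years_to_reach(start_gb, target_gb, doubling_months=DOUBLING_MONTHS):
    """Years needed to grow from start_gb to target_gb at a constant doubling time."""
    return math.log2(target_gb / start_gb) * doubling_months / 12


if __name__ == "__main__":
    # Ten years at one doubling every 7 months is about 17 doublings.
    print(f"Growth over 10 years: about {growth_factor(10):,.0f}x")

    # Hypothetical starting point of 1 petabyte (10**6 GB), for illustration only.
    hypothetical_start_gb = 10**6
    years = years_to_reach(hypothetical_start_gb, GB_PER_EB)
    print(f"Years from 1 PB to 1 EB: about {years:.1f}")
```

The point is only that exponential growth at this pace crosses several orders of magnitude within a decade; the actual trajectory depends on how sequencing technology and adoption evolve.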

"Genomics will soon pose some of the most severe computational challenges that we have ever experienced," says Gene Robinson, a professor of entomology and the director of the Carl R. Woese Institute for Genomic Biology. "If genomics is to realize the promise of having a transformative positive impact on medicine, agriculture, energy production and our understanding of life itself, there must be dramatic innovations in computing. Now is the time to start."

Their views have been published in PLOS Biology.
