Genomics boom puts strain on data storage, electricity grid

By Nick Paul Taylor, Nov 2, 2015, 12:13am

A widely seen and much-celebrated chart maintained by the National Institutes of Health shows how the fall in the cost of sequencing has outstripped the steady decline in computing costs described by Moore's law in recent years. The chart is a visual representation of the race toward the $1,000 genome, but now the industry is feeling the flip side of the graph: It can no longer rely on Moore's law to meet its storage needs.

This is a reemergence of an old problem. In the past, constraints forced IT operations to eke as much power and storage as possible out of their systems, but the unstoppable decline in costs rendered such tricks unnecessary. Now, Computer Weekly is reporting that the sudden boom in genome data has forced IT chiefs to dig into their old toolkits in search of techniques that can improve efficiency. The repackaging of job requests, rethinking of file format choices and heavier reliance on RAM are all being considered as research centers try to close the gap between Moore's law and sequencing cost.

One option is to rewrite code so that it is better suited to the demands of population-scale genomics, but this carries its own risks. "The problem with recoding is that it's slow and risky. If you change code you may not be sure you get the same results unless you rerun all your old research. Changing the statistical guts of these codes is often not feasible," Robert Esnouf, head of the Research Computing Core at the Wellcome Trust Centre for Human Genetics at the University of Oxford, said. Esnouf sees changes such as adjusting cache sizes to improve efficiency as more practical.

Population-scale sequencing has forced the rethink. The figures for individual research centers, let alone the industry as a whole, are staggering. Output at the Broad Genomics group is now up to an estimated one human genome every 10 minutes at 30X coverage, meaning each base is sequenced 30 times, on average. And the German Cancer Research Center, which is another buyer of Illumina's ($ILMN) HiSeq X system, expects its data output to increase to 12TB a day. By comparison, Twitter's ($TWTR) global operation reportedly generates 11TB a day. The surge in output is putting a strain on computer systems designed originally for lower-volume operations.
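
To put those throughput figures in perspective, the arithmetic can be sketched in a few lines of Python. This is a rough back-of-the-envelope illustration, not a calculation from the Computer Weekly feature: the human genome size of roughly 3.1 billion bases is an assumption added here, and the function names are hypothetical. The result counts raw sequenced bases, not bytes on disk, since file sizes depend on format and compression.

```python
# Back-of-the-envelope sketch of sequencing throughput.
# ASSUMPTION: a human genome is roughly 3.1 billion bases (approximate figure,
# not taken from the article). Function names are illustrative only.

MINUTES_PER_DAY = 24 * 60
ASSUMED_GENOME_SIZE_BASES = 3.1e9  # approximate, labeled assumption


def genomes_per_day(minutes_per_genome: float) -> float:
    """Genomes completed in 24 hours at a steady rate."""
    return MINUTES_PER_DAY / minutes_per_genome


def bases_per_genome(coverage: float,
                     genome_size_bases: float = ASSUMED_GENOME_SIZE_BASES) -> float:
    """Raw bases sequenced for one genome; 30X coverage means each
    base is read 30 times on average."""
    return coverage * genome_size_bases


if __name__ == "__main__":
    per_day = genomes_per_day(10)    # one genome every 10 minutes
    per_genome = bases_per_genome(30)  # 30X coverage
    print(f"Genomes per day: {per_day:.0f}")
    print(f"Raw bases per genome: {per_genome:.2e}")
    print(f"Raw bases per day: {per_day * per_genome:.2e}")
```

Under these assumptions, a rate of one genome every 10 minutes works out to 144 genomes a day, each representing tens of billions of sequenced bases before any downstream analysis files are counted.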

Infrastructure outside of research centers is struggling, too. Esnouf is currently working to link his operation to the University of Oxford's Big Data Institute ahead of its opening late next year, but this has created a new problem.

"There's not enough electricity in Oxford," he said.

- read Computer Weekly's feature