AstraZeneca sets out to build 5-petabyte, 2M-genome database


AstraZeneca ($AZN) is redefining what big looks like in genome research. While a range of initiatives have been branded population-scale sequencing programs, none wear the label quite as comfortably as AstraZeneca’s scheme. At 2 million whole genomes, the database will house more people than the population of the 50 smallest countries.

If the database were a country, it would be the 144th most populous on earth, wedged just between Macedonia and Latvia. And, with AstraZeneca performing whole genome sequencing on everyone in the database, it will have a sizeable digital footprint. AstraZeneca expects to gather 5 petabytes of data in total.

To put that figure in context, it is 25% of the capacity of all the hard drives produced in 1995. Or, as AstraZeneca EVP Mene Pangalos put it: “If you put 5 petabytes on DVDs, it would be four times the height of [310-metre London skyscraper] the Shard.” To generate all that data, AstraZeneca is putting “hundreds of millions of dollars” into genome research, Pangalos said at a press conference attended by Nature News.
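For readers who want to check the DVD comparison, the arithmetic can be sketched in a few lines of Python. This is a rough back-of-envelope estimate, not AstraZeneca's own calculation: it assumes decimal petabytes, single-layer 4.7 GB discs and a bare disc thickness of 1.2 mm with no cases, none of which are stated in the quote. The constant and function names are illustrative.

```python
import math

PETABYTE = 10**15            # decimal petabyte in bytes (assumption)
DVD_CAPACITY_BYTES = 4.7e9   # single-layer DVD (assumption)
DVD_THICKNESS_M = 1.2e-3     # bare disc, no case (assumption)
SHARD_HEIGHT_M = 310         # height quoted in the article
GENOMES = 2_000_000          # database size quoted in the article


def dvd_stack_height_m(total_bytes: float) -> float:
    """Height in metres of the stack of DVDs needed to hold total_bytes."""
    discs = math.ceil(total_bytes / DVD_CAPACITY_BYTES)
    return discs * DVD_THICKNESS_M


total_bytes = 5 * PETABYTE
height_m = dvd_stack_height_m(total_bytes)
shard_ratio = height_m / SHARD_HEIGHT_M
bytes_per_genome = total_bytes / GENOMES

print(f"Stack height: {height_m:,.0f} m")
print(f"Multiples of the Shard: {shard_ratio:.1f}")
print(f"Average storage per genome: {bytes_per_genome / 1e9:.1f} GB")
```

Under those assumptions the stack comes out at roughly 1.3 kilometres, a little over four times the height of the Shard, which is consistent with the quote. The same figures imply an average of about 2.5 GB of stored data per genome.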

Some of the cash will land in the bank account of Human Longevity, Inc., the J. Craig Venter-founded sequencing shop that is one of a limited number of organizations with the capacity to carry out such a project. AstraZeneca plans to send 500,000 samples to Human Longevity for sequencing over the course of the collaboration, giving Venter’s suite of Illumina ($ILMN) HiSeq X sequencers plenty of material to process. The Big Pharma will also gain access to Human Longevity’s fast-growing database.

AstraZeneca is betting that data generated through these initiatives will help it identify rare genetic variants, something that should become easier to do as the size of the genome repository increases. The expectation is that the identification of the variants and unearthing of other insights will improve multiple aspects of AstraZeneca’s R&D operation, from the discovery of novel targets to the selection of participants for clinical trials.

