1000 Genomes Project data available on Amazon Cloud
Project is Exemplar of New White House Big Data Initiative
The world's largest set of data on human genetic variation — produced by the international 1000 Genomes Project — is now publicly available on the Amazon Web Services (AWS) cloud, the National Institutes of Health and AWS jointly announced today.
The public-private collaboration demonstrates the kind of solutions that may emerge from the Big Data Research and Development Initiative announced today by the White House Office of Science and Technology Policy (OSTP) during an event at the American Association for the Advancement of Science in Washington, D.C.
"The explosion of biomedical data has already significantly advanced our understanding of health and disease. Now we want to find new and better ways to make the most of these data to speed discovery, innovation and improvements in the nation's health and economy," said NIH Director Francis S. Collins, M.D., Ph.D. Dr. Collins is among agency leaders speaking in support of the initiative at the launch event.
The Big Data initiative will initially engage at least six federal science agencies (including the NIH, the National Science Foundation, the Department of Defense and the Department of Energy), committing more than $200 million to a collaborative effort to develop core technologies and other resources needed by researchers to manage and analyze enormous data sets.
Among the NIH components participating in the Big Data initiative are the National Human Genome Research Institute (NHGRI) and the NIH National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine. NHGRI played a lead role in organizing and funding the international 1000 Genomes Project. NCBI, along with the European Bioinformatics Institute, Hinxton, England, began making 1000 Genomes Project data freely available to researchers in 2008.
Since the project's launch, the data set has grown enormously: At 200 terabytes (the equivalent of 16 million file cabinets filled with text, or more than 30,000 standard DVDs), the current 1000 Genomes Project records are a prime example of a data set that has become so massive that few researchers have the computing power to use it.
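As a rough check of the DVD comparison above, the arithmetic can be sketched as follows. The 4.7-gigabyte capacity of a single-layer DVD and the decimal definition of a terabyte are assumptions not stated in this release, and the function name is purely illustrative.

```python
# Rough check of the "more than 30,000 standard DVDs" comparison.
# Assumptions (not from the release): decimal units, 4.7 GB per single-layer DVD.
TERABYTE = 10**12
DVD_BYTES = 4_700_000_000


def dvds_needed(dataset_bytes: int, dvd_bytes: int = DVD_BYTES) -> int:
    """Return how many whole DVDs the data set would fill completely."""
    return dataset_bytes // dvd_bytes


if __name__ == "__main__":
    print(dvds_needed(200 * TERABYTE))
```

Under these assumptions the result is on the order of 40,000 discs, consistent with the "more than 30,000" figure.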
To help solve that problem, AWS has just posted the 1000 Genomes Project data for free as a public data set, providing a centralized repository on the Amazon Simple Storage Service. The data can be seamlessly accessed through services such as Amazon Elastic Compute Cloud and Amazon Elastic MapReduce, which provide organizations with the highly scalable resources needed to power big data and high performance computing applications often needed in research. Researchers pay only for the additional AWS resources they need to further process or analyze the data.
The public-private collaboration to store the data in the AWS cloud allows any researcher to access and analyze the data at a fraction of the cost it would take for their institution to acquire the needed internet bandwidth, data storage and analytical computing capacity.
"Improving access to data from this important project will accelerate the ability of researchers to understand human genetic variation and its contribution to health and disease," said NHGRI director Eric D. Green, M.D., Ph.D. NHGRI is a major funder of the 1000 Genomes Project, along with Wellcome Trust of London and BGI-Shenzhen of China.
Cloud access also enables users to analyze the data much more quickly, as it eliminates the time-consuming download of data and because users can run their analyses over many servers at once. "Putting the data in the cloud provides a tremendous opportunity for researchers around the world who want to study large-scale human genetic variation but lack the computer capability to do so," said Richard Durbin, Ph.D., co-director of the 1000 Genomes Project and joint head of human genetics at the Wellcome Trust Sanger Institute, Hinxton, England.
Initiated in 2008, the 1000 Genomes Project is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of more than 2,600 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at a frequency of 1 percent or greater in the populations studied. Data from the pilot studies were released on AWS in 2010. The data now being released in the cloud include results from sequencing the DNA of some 1,700 people; the remaining 900 samples will be sequenced in 2012, and those data will be released to researchers as soon as possible. The new results identify genetic variation that occurs in less than 1 percent of the study populations and that may make important genetic contributions to common diseases, such as cancer or diabetes.
"It took more than 10 years and billions of dollars to sequence the first human genome. Recent advances in genome sequencing technology have enabled researchers to tackle studies like the 1000 Genomes Project by collecting far more data faster. This has created a growing need for powerful and instantly available technology infrastructure to analyze that data," said Deepak Singh, Ph.D., principal product manager, Amazon Web Services. "We're excited to help scientists gain access to this important data set by making it available to anyone with access to the Internet. This means researchers and labs of all sizes and budgets have access to the complete 1000 Genomes Project data and can immediately start analyzing and crunching the data without the investment it would normally require in hardware, facilities and personnel. Researchers can focus on advancing science, not obtaining the resources required for their research."
The 1000 Genomes Project welcomes working with other cloud computing providers who are interested in hosting the data. Cloud access to the 1000 Genomes Project data through AWS is at http://s3.amazonaws.com/1000genomes/.
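For readers who want to explore the public data set programmatically, the following is a minimal sketch of listing objects at the address above using only the Python standard library. It assumes the bucket still permits anonymous listing over the standard Amazon S3 REST interface at this path-style URL, which may no longer hold; the helper names and the example prefix are hypothetical and are not part of any official project tooling.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket address as given in the release.
BUCKET_URL = "http://s3.amazonaws.com/1000genomes/"
# XML namespace used by S3 bucket-listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(prefix="", max_keys=10):
    """Build a listing URL restricted to a key prefix (hypothetical helper)."""
    query = urllib.parse.urlencode({"prefix": prefix, "max-keys": max_keys})
    return BUCKET_URL + "?" + query


def parse_listing(xml_text):
    """Extract (key, size in bytes) pairs from a bucket-listing XML document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(S3_NS + "Key"), int(item.findtext(S3_NS + "Size")))
        for item in root.iter(S3_NS + "Contents")
    ]


def list_keys(prefix="", max_keys=10):
    """Fetch one page of the listing; assumes anonymous read access is allowed."""
    url = build_list_url(prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_listing(response.read())
```

Calling `list_keys("example-prefix/")` (an illustrative prefix, not a known directory) would return up to ten keys and their sizes if the bucket allows anonymous listing. Running the same code from an Amazon EC2 instance keeps the analysis next to the data, which is the access pattern the release describes.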
"Providing cloud access will expand the universe of researchers who have access to the data, which fulfills a central goal of the 1000 Genomes Project to make the data as widely available as possible to accelerate medical discoveries," said Paul Flicek, D.Sci., team leader for vertebrate genomics at EBI-Hinxton and co-leader of the 1000 Genomes Project Data Coordination Center (DCC). "Cloud availability will also enable other uses with constraints on computing power, such as for bioinformatics education."
The 1000 Genomes Project data are also freely available through the 1000 Genomes website, at www.1000genomes.org, and from each of the two institutions that work together as the project DCC: the NCBI at ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes, and EBI, with DCC support from the Wellcome Trust, at ftp://ftp.1000genomes.ebi.ac.uk.
The availability of 1000 Genomes Project data in the AWS cloud represents the fruition of a lengthy collaborative effort between NCBI and AWS, in which their joint expertise enabled the development of systems that would meet the unique needs of the science community in relation to sequence data.
"The resulting systems accommodate the types and sizes of files necessary for transferring, storing and accessing massive amounts of sequence data," said Stephen Sherry, Ph.D., chief of the NCBI reference collections section and co-leader of the 1000 Genomes Project DCC. "They also provide a framework that has allowed software providers to add tools that improve the scientific community's ability to use these data to make discoveries."
In addition to funding data-generating projects, NIH funds many projects to develop new computational tools for analyzing genomic data. For example, NHGRI just provided approximately $1.5 million to fund the development of Galaxy into a community resource. Galaxy is an open source software suite for data analysis in the life sciences developed at The Pennsylvania State University, University Park, Pa., and Emory University, Atlanta. Tools such as Galaxy may be uploaded into the AWS cloud to analyze 1000 Genomes Project data.
As part of the Big Data initiative, NIH will join with the National Science Foundation to fund the development of core technologies for data collection, management, analysis and extraction. NIH is particularly interested in imaging, molecular, cellular, electrophysiological, chemical, behavioral, epidemiological, clinical and other data sets related to health and disease. Participating NIH components include NHGRI, the National Cancer Institute, National Institute of Biomedical Imaging and Bioengineering, National Institute on Drug Abuse, National Institute of General Medical Sciences, National Institute of Neurological Disorders and Stroke, and National Library of Medicine.
"This timely initiative will generate tools and approaches for maximizing the return on our national investments in large-scale data collection," said Karin Remington, Ph.D., director of the Division of Biomedical Technology, Bioinformatics, and Computational Biology at NIH's National Institute of General Medical Sciences and co-chair of the initiative's senior steering group. "It will also spur creation of the educational and infrastructure resources needed to enable broader use of such data, including in new areas of inquiry."
About the National Human Genome Research Institute: NHGRI is one of the 27 institutes and centers at the NIH, an agency of the Department of Health and Human Services. The NHGRI Division of Extramural Research supports grants for research and training and career development at sites nationwide. Additional information about NHGRI can be found at its website, www.genome.gov.
About the National Center for Biotechnology Information: NCBI creates public databases in molecular biology, conducts research in computational biology, develops software tools for analyzing molecular and genomic data, and disseminates biomedical information, all for the better understanding of processes affecting human health and disease. NCBI is a division of the National Library of Medicine, the world's largest library of the health sciences.
About Amazon Web Services: Launched in 2006, Amazon Web Services (AWS) began exposing key infrastructure services to businesses in the form of web services, now widely known as cloud computing. The ultimate benefit of cloud computing, and AWS, is the ability to leverage a new business model and turn capital infrastructure expenses into variable costs. Businesses no longer need to plan and procure servers and other IT resources weeks or months in advance. Using AWS, businesses can take advantage of Amazon's expertise and economies of scale to access resources when their business needs them, delivering results faster and at a lower cost. Today, Amazon Web Services provides a highly reliable, scalable, low-cost infrastructure platform in the cloud that powers hundreds of thousands of enterprise, government and startup customers in 190 countries around the world. AWS offers over 28 different services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3) and Amazon Relational Database Service (Amazon RDS). AWS services are available to customers from data center locations in the U.S., Brazil, Europe, Japan and Singapore.
About the 1000 Genomes Project Collaborators: Organizations that have committed major support to the 1000 Genomes Project are 454 Life Sciences, a Roche company, Branford, Conn.; Affymetrix, Inc., Santa Clara, Calif.; BGI-Shenzhen, Shenzhen, China; Complete Genomics, Inc., Mountain View, Calif.; Illumina Inc., San Diego, Calif.; Life Technologies Corp., Carlsbad, Calif.; the Max Planck Institute for Molecular Genetics, Berlin, Germany; the Wellcome Trust, London, U.K.; the Wellcome Trust Sanger Institute, Hinxton, Cambridge, U.K.; NCBI; and the NHGRI, which supports the work being done by the Baylor College of Medicine, Houston; the Broad Institute, Cambridge, Mass.; and Washington University, St. Louis. Researchers at many other institutions are also participating in the project, including ones in Bangladesh, Barbados, Canada, China, Colombia, Denmark, Finland, Germany, the Gambia, Nigeria, Pakistan, Peru, Puerto Rico, Spain, Switzerland, the U.K., the U.S., and Vietnam.
About the National Institutes of Health (NIH): NIH, the nation's medical research agency, includes 27 Institutes and Centers and is a component of the U.S. Department of Health and Human Services. NIH is the primary federal agency conducting and supporting basic, clinical, and translational medical research, and is investigating the causes, treatments, and cures for both common and rare diseases. For more information about NIH and its programs, visit www.nih.gov.