After scientists at St. Jude Children’s Research Hospital started making anonymized pediatric cancer patient data freely available to the public in 2010, they soon realized that the volume of data was simply too large for easy access. So they explored technical solutions and began working with Microsoft and DNAnexus. Now, a cloud-based platform created by the partnership is up and running.
Meet St. Jude Cloud, which the collaborators said is the world’s largest public repository of pediatric cancer genomics data. For now, it stores over 5,000 whole-genome, 5,000 whole-exome and 1,200 RNA-Seq datasets generated from three St. Jude-supported genomics initiatives.
It’s also more than just a data storehouse. The platform provides a suite of analysis tools and visualization capabilities that aim to help researchers develop new treatments for pediatric diseases.
“St. Jude Cloud is a powerful resource to drive global research and discovery forward,” said Jinghui Zhang, Ph.D., chair of the St. Jude Department of Computational Biology and co-leader of the St. Jude Cloud project. “Providing genomic sequencing data to the global research community and making complex computational analysis pipelines easily accessible will lead to progress in eradicating childhood cancer.”
Zhang and her team at St. Jude worked with Microsoft and DNAnexus to develop a genome alignment and variant calling pipeline, an analytical technique that can identify where genomes differ. That pipeline is the key component of the Microsoft Genomics service the tech giant recently launched for genomics research. Data analyzed through the pipeline also became the foundation for St. Jude Cloud.
All the data on St. Jude Cloud lives on Microsoft Azure, which can handle large-scale datasets such as population genomics information. On top of that, DNAnexus builds the interface: a secure online ecosystem where researchers can access the data and tools.
Besides three basic ways to browse existing data (by disease, by publication and by curated dataset), the platform also allows more advanced sample selection, such as by gene mutation or expression level. Researchers can also upload their own data and analyze it with the platform’s bioinformatics tools.
Because the data and the analysis both stay in the cloud, backed by fast computing resources and requiring no downloads, researchers can move their projects along much faster. The hospital said a St. Jude scientist was able to replicate in a few days the experimental findings of a B-cell leukemia study that had originally taken the team more than two years to produce.
The rationale is that the more genomic data researchers can access and compare, the more accurately they can rule out the biological noise and pinpoint the real genetic factors behind tough diseases like cancer. By 2019, St. Jude expects to have 10,000 whole-genome sequences on St. Jude Cloud.