The human genome consists of about 3 billion DNA base pairs. Sequenced as a string of text using the letters A, T, C and G that refer to the bases, it identifies each of us uniquely. Just storing one of those strings for research purposes requires around 100 gigabytes.
In recent years technological advances have made collecting genome data more affordable, and modern storage techniques make relatively light work of storing it.
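A rough back-of-the-envelope sketch shows why a sequenced genome takes far more space than its letter count alone suggests: sequencing reads cover each position many times, and each base is usually stored with a quality score. The function name, the 30x coverage, the two bytes per base and the compression ratio below are illustrative assumptions, not figures from the ANU project.

```python
def estimate_sequencing_storage_gb(genome_length_bp, coverage,
                                   bytes_per_base=2.0, compression_ratio=1.0):
    """Estimate storage for raw sequencing reads, in gigabytes (10**9 bytes).

    genome_length_bp  -- number of base pairs in the genome
    coverage          -- average number of times each position is read
    bytes_per_base    -- assumed: one byte for the base, one for its quality score
    compression_ratio -- assumed: 1.0 means uncompressed
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    raw_bytes = genome_length_bp * coverage * bytes_per_base
    return raw_bytes / compression_ratio / 1e9


if __name__ == "__main__":
    # Hypothetical inputs: a 3-billion-base genome read at 30x coverage.
    uncompressed = estimate_sequencing_storage_gb(3_000_000_000, 30)
    compressed = estimate_sequencing_storage_gb(3_000_000_000, 30,
                                                compression_ratio=2.0)
    print(f"uncompressed: {uncompressed:.0f} GB, compressed: {compressed:.0f} GB")
```

Under these assumptions the estimate lands in the same order of magnitude as the figure of around 100 gigabytes; real files vary with the sequencing depth and file format used.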
But analysis of the genome data and modern medical research ratchet up the computing demands significantly. Researchers are analysing the individual 100 GB genomes, but also exploring collections of data measured by the terabyte, looking for variations that are the key to understanding how genes work and what might be disease markers, and exploring the 3D structure of the genome in healthy and cancer cells – important for diagnosis and targeted treatment.
The computing demands of this exacting and complex work are immense and have previously been the province of specialised high performance computing (HPC) hardware – often leaving scientists queuing for access to the machines.
Ground-breaking work at the Australian National University (ANU) in partnership with Microsoft partner BizData has shown how cloud computing can be harnessed to tackle genome analysis, simulations and visualisations, potentially opening the door for widespread clinical application. It also reinforces ANU’s global reputation as a leading research university.
Dr Sebastian Kurscheid, a biological data science research fellow in the Department of Genome Science at ANU’s John Curtin School of Medical Research, explains that the cloud computing research project was designed to “use the advances in computer and communications technology to support the advances in life science research.”
In essence it promises to accelerate genomic research – and also liberate it from laboratories, potentially bringing it into the clinical realm.
This would accelerate the focus on the health aspects of genome research. “There are questions about how medical genomics has a more increasing relevance to clinical practice. It’s already important in the field of rare diseases but it’s becoming more and more relevant also in more common diseases,” says Dr Kurscheid.
By demonstrating how cloud computing and the right analytical framework can be used to interpret the genome as a precursor to diagnosis and treatment, he is hopeful that in a few years’ time clinicians will be able to make direct use of the data themselves. “Provided on a cloud platform so that you don’t really need local IT teams, local IT infrastructure that is costly to buy and to maintain. Azure again provides a secure environment that could be beneficial for small clinics that don’t have the resources but still need to offer their patients these services.”
Dr Kurscheid says that moving analysis to the cloud freed him from much of the technical administration previously needed when using the HPC to analyse the genome data sets, and he’s optimistic that the project heralds a new wave of open science with researchers around the world able to more easily share data and access research techniques – all of which have the potential to accelerate scientific breakthroughs.
Journey to cloud
Previously the Department of Genome Science had access to local servers (30-40 nodes) managed by the ANU Bioinformatics Consultancy unit, as well as a number of shared HPC environments, including HPC at the National Computational Infrastructure.
Researchers also had large workstations (16 cores, 10 TB storage, 128 GB RAM) provisioned to run tertiary analysis in R, while scientists conducting simulations and visualisation of the genome using virtual reality required significant graphical processing unit resources.
But still there was the need for greater power and speed that could be dialled up on demand. Genome research features intense peaks of computing demand punctuated by lengthy troughs – cloud computing’s elasticity and scalability make better economic sense than having an HPC idling on standby until it is required.
“Initially when we explored using the cloud for heavy compute research workloads we expected to achieve comparable performance. What we didn’t expect in some of our projects with research departments is that we could help achieve four times better performance for a quarter of the cost of managing your own hardware,” says Nadav Rayman, director at BizData.
“The reason for this is quite simple. The sheer variety that Azure offers allows us to help researchers pick the optimal resources for the specific needs of a particular project. It is this variety, available on-demand in Azure, that is impossible to mimic in even some of the biggest private data centres that have been set up for High Performance Computing in Australia.”
“I am coming more to the conclusion that it would be more efficient for us to use cloud computing – to only pay for what we need. We might only need 15 weeks out of the year, for really high-performance computing,” says Dr Kurscheid.
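The pay-for-what-you-need reasoning can be sketched as simple arithmetic. The weekly rates below are hypothetical placeholders rather than Azure or ANU prices; only the 15 busy weeks come from the quote above.

```python
WEEKS_PER_YEAR = 52


def dedicated_annual_cost(weekly_rate):
    """Cost of hardware that is paid for all year, busy or idle."""
    return weekly_rate * WEEKS_PER_YEAR


def on_demand_annual_cost(weekly_rate, busy_weeks):
    """Cost of capacity that is paid for only in the weeks it is used."""
    return weekly_rate * busy_weeks


def break_even_premium(busy_weeks):
    """How many times more expensive per week on-demand capacity could be
    before it costs as much as dedicated hardware over a year."""
    return WEEKS_PER_YEAR / busy_weeks


if __name__ == "__main__":
    hypothetical_weekly_rate = 1000  # assumed figure, same for both options
    busy_weeks = 15                  # from the quote above
    print("dedicated:", dedicated_annual_cost(hypothetical_weekly_rate))
    print("on demand:", on_demand_annual_cost(hypothetical_weekly_rate, busy_weeks))
    print("break-even premium: %.2f" % break_even_premium(busy_weeks))
```

With 15 busy weeks, on-demand capacity would need to cost well over three times as much per week as dedicated hardware before the two options cost the same over a year, ignoring staffing, storage and data transfer.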
At the same time Dr Kurscheid recognised there would be great benefit from establishing access to state-of-the-art compute infrastructure. Opting for an open environment allows ANU to open-source its analytics pipeline as well as make accessible its data in order to accelerate Australian and international research efforts. He sought a platform that could help open up the field of genome research in a controlled, self-contained environment.
“This platform being the cloud and in this example, the Microsoft Azure cloud,” he adds.
Dr Kurscheid explains: “The type of data that I predominantly work with is what’s called high throughput sequencing data and that is something that has become very common in genome science and many other disciplines in the life sciences that use these technologies to essentially profile and investigate DNA both on a qualitative level, when we look at mutations but also in terms of a quantitative level if we want to learn something about the underlying structure of the DNA in the cell and that’s what we focus on here in our department.”
BizData, a leading data analytics and big compute Microsoft partner, delivered a solution to take the Department’s existing genomics workflows, and provide access to powerful computing resources on demand, capable of handling significant data volumes.
According to Rayman, “Leveraging the cloud is an important aspect to helping our research community use more of the funding dollar to focus more on analysis and less on procuring hardware.”
Working with a previously analysed 2 terabyte data set (to ensure that the results matched when analysis was transferred to the cloud) the team took all the bioinformatics pipeline components and prepared them to run on Azure. “It was achieved in two weeks – and was a very successful story considering the technical challenges we had at the beginning.”
Dr Kurscheid is keenly aware of the technical overhead associated with complex scientific research. He says that during the three and a half years of his research program, he has had to devote one third of that time to getting the technical elements in place so that he could run the HPC analytics over the many data sets generated in the department.
He believes that moving to a cloud-based solution would have saved him nine months, freeing him up for more research in a field where “Every minute that I can spend on intellectually tackling problems will lead to some advance.”
“I think we’ve gone a long way already from the raw data to some analytical output. I would like to take that even further for essentially the whole workflow,” says Dr Kurscheid.
It’s not a trivial exercise for a research lab where every experiment and associated workflow is different – but the underlying approach demonstrates how cloud computing can deliver the platform for making genome-based analysis and diagnosis far more accessible, potentially opening up new clinical applications.
According to Rayman, “The nature of research is that you are always pushing the boundaries of techniques and are driven by the need to experiment. That’s where having a flexible environment in the cloud really shines, as you can draw on a variety of computing resources and open source tools with fewer limitations.
“Our focus at BizData has been to deliver a seamless experience for researchers using the Microsoft Cloud. For example, today we enable a researcher to take an existing pipeline (for example in Snakemake or Galaxy) that they have already built and allow them to run secondary analysis in the cloud with as much computing power as needed, without changing a line of code. We also make it easy to analyse and collaborate on the research outputs, without having to wait for large volumes of data to download again.”
There is also a benefit for ANU. According to Dr Kurscheid, “This demonstrator is a good example of how we at ANU, at John Curtin, could use Azure to do our research and help publish it which hopefully creates more impact by stimulating more scientific discourse and more reuse of the data and citations which is good for the university.”
Critically, it also has the potential to accelerate genome research and spur its clinical application.
As Dr Kurscheid notes, “The general infrastructure is available for going from raw data – as primary as it gets – to a highly analysed and visualised result and that would probably be used for some work that we are currently finalising that’s actually looking at the 3D structure of the genome in cancer cells. I’m envisaging that if we conduct all this analysis using Azure then also doing some really nice visualisation and exploratory analysis using the platform.”
Outside of the research sphere he suspects there will be the opportunity for new clinical applications as genome analytics platforms become more accessible and affordable.
“Part of the long-term vision is that in the medical field genomics becomes more widely available – it’s already important in rare diseases. As it becomes more common smaller hospitals or pathology services might see demand for this.
“I think that making these workflows and tools and analysis pipelines publicly available in a manner that is adaptable for others would support the broader uptake of genomics in the medical field.”