Genome-wide association studies (GWAS) play a crucial role in medical research. By examining millions of genetic variants across the entire genome in large populations, these studies can identify genetic variations that contribute to a particular disease or trait. GWAS have already led to breakthroughs in disease prevention, personalised medicine and drug development.
However, Dr Denis Bauer, who leads the Transformational Bioinformatics Group at the Commonwealth Scientific and Industrial Research Organisation (CSIRO), notes that traditional GWAS evaluate disease association for each genomic location individually, which can be limiting for complex diseases.
“These diseases, such as dementia, represent the largest burden on the healthcare system and involve genes that interact with each other to create disease risk,” she explains.
“Statistical models struggle to evaluate the joint contribution of variants across the genome. So other common approaches compromise by investigating interactions between locations that have already shown association with the disease. Unfortunately, this approach runs the risk of not including the real drivers of disease that may have no effect individually but jointly contribute to disease development.”
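The point Dr Bauer makes can be illustrated with a small toy example. The sketch below is not taken from VariantSpark or from any real cohort: the two locations, the 100 simulated individuals and the XOR-style disease rule are all invented for illustration. It shows how two variants can each appear unrelated to disease when tested one at a time, yet fully determine disease status when considered together.

```python
from itertools import product

# Hypothetical toy cohort: each genotype is coded as 0 or 1 (non-carrier or
# carrier of the alternate allele) at two made-up locations, A and B.
# Disease occurs only when exactly one of the two locations carries the
# allele: a purely interactive (XOR-style) effect invented for illustration.
cohort = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(25)]


def risk_difference(rows, index):
    """Disease rate among carriers minus the rate among non-carriers at one location."""
    carriers = [r[2] for r in rows if r[index] == 1]
    non_carriers = [r[2] for r in rows if r[index] == 0]
    return sum(carriers) / len(carriers) - sum(non_carriers) / len(non_carriers)


def joint_disease_rates(rows):
    """Disease rate for every combination of genotypes at A and B."""
    rates = {}
    for a, b in product([0, 1], repeat=2):
        group = [r[2] for r in rows if r[0] == a and r[1] == b]
        rates[(a, b)] = sum(group) / len(group)
    return rates


single_a = risk_difference(cohort, 0)  # location A tested on its own
single_b = risk_difference(cohort, 1)  # location B tested on its own
joint = joint_disease_rates(cohort)    # both locations considered together

print("Risk difference at A alone:", single_a)
print("Risk difference at B alone:", single_b)
print("Disease rate by (A, B) genotype:", joint)
```

In this toy cohort, carriers and non-carriers at either location have the same disease rate, so a location-by-location test would discard both. An approach that only tests interactions between locations that already passed such a filter would therefore never examine this pair.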
This limitation in traditional GWAS is one of the main reasons CSIRO created VariantSpark, a scalable machine learning framework that recently became available on Microsoft’s Azure Marketplace. VariantSpark enables researchers to quickly and accurately analyse high-dimensional genomic data – data sets with a large number of variables – to find novel disease genes or predictive biomarkers.
“In complex diseases, we are hunting very subtle signals, which means we need very large data sets to make robust statements,” says Dr Bauer. “VariantSpark can scale to mega-biobanks with millions of samples and is 90 per cent faster than traditional compute frameworks.”
“This puts researchers on the right track for finding evidence of epistasis, the non-additive gene-gene interactions that are postulated to drive complex diseases. It also boosts their ability to find predictive biomarkers that allow disease to be diagnosed early to potentially stop or delay progression.”
Another reason CSIRO created VariantSpark was to help its research collaborators analyse their increasingly large and complex genomic data sets.
“We were involved in analysing a cohort of several thousand individuals, and all the other tools failed on the size. So we either needed to compromise by analysing only a subset of the data, or innovate,” says Dr Bauer.
“We wanted to make VariantSpark publicly available because if we have problems processing large volumes of data or deeply interrogating complex cohorts, a lot of other researchers probably have that problem too.”
While VariantSpark can scale to handle large and complex data sets, Dr Bauer notes that the solution also caters to researchers with smaller volumes of data.
Bringing compute to the data
Dr Bauer was one of about 20 people who were involved in the development of VariantSpark, which was first released as a free open-source solution in 2017. It leverages Apache Spark, a unified analytics engine for big data and machine learning.
“Basically, we came up with VariantSpark when Apache Spark was released,” she says. “And we approached Microsoft way back when cloud was still in its infancy, and they were the first cloud provider to support us by giving us credits to play around on Azure.
“Originally, it only clustered genomes. Then we implemented random forests to enable it to point to individual genomic locations and explain how they work interactively.”
Dr Bauer says VariantSpark has also expanded from capturing binary disease status (case-control) to continuous disease status. Likewise, it can process continuous features like brain images, rather than only processing data on the three possible genetic states (that is, homozygous reference, heterozygous and homozygous alternate) at a specific genomic location or variant. Importantly, VariantSpark calculates p-values, which allows researchers to differentiate disease-associated variants from statistical noise.
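To make the ideas of genetic states and p-values concrete, here is a minimal sketch. It rests on several assumptions that are not from the article: the VCF-style genotype strings, the 0/1/2 alternate-allele coding, the 20 invented samples and the helper names are all hypothetical. The article does not describe how VariantSpark computes its p-values, so a generic permutation test is used here as a stand-in to show how a p-value separates a real association from noise.

```python
import random

# Assumed coding (common in genomics, not taken from the article):
# the number of alternate alleles a sample carries at one location.
GENOTYPE_CODES = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2}


def encode(genotypes):
    """Turn genotype strings into alternate-allele counts (0, 1 or 2)."""
    return [GENOTYPE_CODES[g] for g in genotypes]


def mean_difference(codes, labels):
    """Mean alternate-allele count in cases minus the mean in controls."""
    cases = [c for c, y in zip(codes, labels) if y == 1]
    controls = [c for c, y in zip(codes, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(codes, labels, n_permutations=2000, seed=0):
    """Generic permutation test: how often do shuffled case-control labels
    give a difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(mean_difference(codes, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(codes, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Invented data: 10 cases enriched for the alternate allele, 10 controls.
genotypes = ["1/1"] * 6 + ["0/1"] * 4 + ["0/0"] * 8 + ["0/1"] * 2
labels = [1] * 10 + [0] * 10  # 1 = case, 0 = control

codes = encode(genotypes)
p_value = permutation_p_value(codes, labels)
print("Encoded genotypes:", codes)
print("Permutation p-value:", p_value)
```

A small p-value here means that random relabelling of cases and controls rarely reproduces the observed difference, which is the sense in which a variant stands out from statistical noise.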
VariantSpark can run on a variety of setups, from an individual’s desktop to local, high-performance compute clusters and cloud-based distributed computing.
In March 2023, CSIRO expanded the accessibility of VariantSpark to Azure Marketplace. This free software-as-a-service version enables users to run the solution as efficiently as possible on Azure by using Databricks’ managed Spark cluster service with Azure Blob Storage and Azure Resource Manager (ARM).
The ARM template comes with baseline security settings using a virtual private network (VPN), which can be further customised based on the deploying organisation’s IP address. The entire architecture is tied to a resource group, making it easier to clean up the deployment once used.
“Researchers who are already set up on Azure can now use VariantSpark within the same secure VPN that they have set up for their data already,” explains Dr Bauer. “This allows encrypted data to be accessed through linking VariantSpark with the local key manager without having to exchange [data] with another system.”
This means researchers can avoid the security and privacy risks of moving genomic data sets, which contain sensitive personal information, between systems. The solution also facilitates reproducible research by enabling users to access data on Azure that has already been analysed by other like-minded users.
“Reproducible research is the main goal in science – something is not accepted until someone else has reproduced it – and that’s exactly what we’re empowering researchers to do,” says Dr Bauer.
Taking the next step with Microsoft
VariantSpark has already attracted interest on Azure Marketplace from researchers in Australia, the United States, the United Kingdom, the Netherlands, and China.
Over time, Dr Bauer and her team plan to work with Microsoft to introduce a new capability for VariantSpark that allows researchers to not only analyse data sets that are available on Azure, but also ingest data sets located on other cloud providers.
CSIRO is exploring collaboration with Microsoft – which uses the Fast Healthcare Interoperability Resources (FHIR) data standard for its Azure Health Data Services – to enable VariantSpark to analyse patient annotations directly out of FHIR-enabled electronic medical records.
“On top of that, CSIRO has developed the Ontoserver, an ecosystem of traversing complex medical terminologies, and making sense of information captured by different ontologies,” says Dr Bauer. “The next step in offering the full benefit of cloud for researchers will be to join VariantSpark with Ontoserver in the FHIR environment supported by cloud providers like Microsoft.”