Using GCP Genomics and BigQuery to Annotate Clinically Significant Single Nucleotide Polymorphisms (SNPs)
The past twenty-five years has seen a rapid decrease in the cost of genetic sequencing, from $2.7 billion dollars for the human genome project (completed 2003), to roughly $1000 dollars today. This decrease in cost has led to the development of the personal genomics testing via companies like 23andMe and AncestryDNA, who provide these testing services directly to the public, albeit, using a cheaper variant testing method rather than whole genome sequencing.
Historically, the volume of information provided by genetic testing presented a hurdle for any analyst, as only universities, private research institutes or computer science labs had the necessary equipment to process the large datasets in reasonable amounts of time. The advent of cloud computing, and in particular cloud-scale data warehouses such as Google BigQuery or Redshift, now puts this capability in the hands of small startups and individual users at home.
I recently obtained a genetic test which included access to the raw Single Nucleotide Polymorphism (SNP) data. The test itself was for heritage, however armed with the data and a GCP account, I wanted to see what I could glean from the results.
Brief Biology Explanation
For those of you who didn’t take high school biology (or are a bit rusty!), I’ve provided a little bit of a primer to get you familiar with the underlying science.
Your genome is the genetic material that makes you, you, minus all the environmental and social effects. It includes not only your DNA, the twisted helix you see everywhere, but also the genetic material of mitochondria (of Star Wars fame) and chloroplasts. For the sake of brevity and this exercise, we’ll focus on only the DNA part, though most heritage tests will provide mitochondria tests for matrilineal ancestry.
DeoxyriboNucleic Acid (DNA, shown to the right) comprises two strands called polynucleotides, polynucleotide is just a fancy way of saying many-nucleotides. A nucleotide comprises of one molecule called a nucleobase (base for short) which can be any one of adenine (A), cytosine (C), guanine (G) or thymine (T) as well as a sugar and phosphate group. In DNA the bases pair selectively, A will only bind with T and C will only bind with G. A pair of bound bases is called a base-pair (bp).
The selective pairing of bases means that a sequence on one strand, such as ACTG, will have a complementary pair on the other strand: TGAC. These pairs of nucleobases (with the sugar and phosphate groups) are what makes up DNA, and DNA makes up your chromosomes, which are tightly-packed bundles of DNA that exists within your cells. Human cells ordinarily contain 23 pairs of chromosomes (22 called autosomes and a pair of sex chromosomes – XX or XY).
Returning to the situation at hand, a single nucleotide polymorphism (greek for ‘many’-‘shape’ roughly) is when a single base-pair exhibits multiple variations at a particular site in the DNA strands – think A-T instead of G-C. This can be benign, pathogenic, beneficial or unknown. Each variation is called an allele, typically a particular allele is the more ‘normal’ common pairing and the other variations are rarer. It’s worth also noting that A-T is not the same as T-A, as the strands have direction and are anti-parallel (parallel but heading in different directions).
A genetic test such as those provided by the commercial testing companies will detect and identify these SNPs and use them to provide an indication of potential medical impacts and/or to trace lineage (as you inherit your DNA from your mother and father).