Posted by Attila Berces
Richard Holland raised some interesting points in his blog post “Sequence once, read often?”. I particularly like the argument that DNA itself is the best way to store DNA data. Given the plummeting cost of whole genome sequencing, there will be no need to store all sequencing data, and the conclusion supporting the “sequence once, read once” paradigm is compelling.
However, there are alternatives to the “sequence once, read often” and “sequence once, read once” paradigms. The validity of each depends on the time frame and the type of analysis we carry out. The “sequence once, read once” paradigm will take a few years, if not a decade or two, to reach. In contrast, the “sequence once, read often” paradigm is suitable for the level of technology we have right now. In some visionary clinics in the US, whole genome sequencing is becoming standard practice for everyone who has an appointment, whatever the reason. In this environment, having the sequence on file means that clinicians can retrieve the data and search for particular mutations whenever it is needed for personalized treatment. By comparison, the turnaround time for whole genome sequencing, including analysis, is still longer than the time frame in which the data would be useful, except for some chronic conditions.
As an alternative to these two paradigms, I envision a “sequence once, analyze often” paradigm. The major difference between “reading” and “analysis” is that “reading” assumes we already have the correct consensus sequence, while “analysis” means reanalyzing the underlying (short) read data. Why do we need such a distinction? Let me explain using the example of Human Leukocyte Antigen (HLA) typing, which is among the most common molecular diagnostic tests, with millions of samples tested annually.
The HLA region of the genome consists of large segmental duplications. These segments are significantly longer than the read length of current mainstream sequencing technologies. In addition, this region contains the most polymorphic exons in the genome combined with highly conserved stretches. The combination of these two effects renders not only reference-based alignment but also de novo assembly useless. For this reason, the HLA region is often excluded from association studies: the false variant rate is simply too high and the results are unreliable.
In contrast, there is an extensive genotype database for HLA, since over 20 million people have been HLA typed as bone marrow donor candidates. This creates an opportunity to carry out direct genotyping from next-generation sequencing data. Since the NGS reads are mapped directly to the genotype database and not to a reference genome, the problems associated with read alignment and assembly can be circumvented. Direct genotyping is significantly more accurate and precise than reference-based alignment or assembly for determining the genotype.
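To make the idea concrete, here is a minimal sketch of direct genotyping under simplifying assumptions: every candidate allele pair from a genotype database is scored by how many reads it explains through exact matching, and the best-supported pair is reported. The allele names, sequences, and reads below are invented placeholders rather than real IMGT/HLA data, and a real implementation would use far more sophisticated scoring.

```python
# Toy sketch of "direct genotyping": score known allele sequences by how many
# reads they explain, instead of aligning reads to a reference genome.
# Allele names, sequences, and reads are placeholders for illustration only.
from itertools import combinations_with_replacement

def read_supports_allele(read, allele_seq):
    """A read supports an allele if the read or its reverse complement occurs in it exactly."""
    rc = read.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return read in allele_seq or rc in allele_seq

def score_genotypes(reads, allele_db):
    """Score every unordered allele pair by how many reads either allele explains."""
    scores = {}
    for a1, a2 in combinations_with_replacement(sorted(allele_db), 2):
        explained = sum(
            1 for r in reads
            if read_supports_allele(r, allele_db[a1]) or read_supports_allele(r, allele_db[a2])
        )
        scores[(a1, a2)] = explained
    return max(scores.items(), key=lambda kv: kv[1])

# Hypothetical mini "genotype database" and reads.
allele_db = {
    "HLA-X*01:01": "ACGTACGTTTGACCTGAAGCTTACGGATCCA",
    "HLA-X*02:01": "ACGTACGTTTGACCTGTAGCTTACGGATCCA",
}
reads = ["TTGACCTGAAGCTT", "CTGTAGCTTACGGA", "ACGTACGTTTGACC"]
best_pair, support = score_genotypes(reads, allele_db)
print(best_pair, support)  # the heterozygous pair explains all three reads
```

The point of the sketch is that no consensus sequence or assembly is ever built; the raw reads are compared against known genotypes directly.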
HLA is not the only case where this concept can be applied. Killer Cell Ig-Like Receptors (KIR), metabolizing enzymes, and transmembrane proteins in general fall into the same category. These are all particularly important for drug metabolism and are likely to play a role in adverse reactions to drugs. Currently, 69 ongoing clinical trials investigate HLA associations alongside the safety and efficacy profiles of drugs.
In addition, direct genotyping is a way to search for known deleterious mutations, as described in Salzberg and Pertea’s Do-It-Yourself Genetic Testing article. To take advantage of these analyses, one needs to store the underlying (short) read data. For this reason, considering the current state of sequencing technology, sequencing once, storing the original read data, and analyzing often seems to be a viable alternative.
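For illustration, here is a minimal sketch of that kind of search, assuming the known mutation is a single-base variant with known flanking sequence: short probes for the reference and risk alleles are matched exactly against the stored reads, in the spirit of the do-it-yourself approach. The flanking sequence, alleles, and reads are hypothetical placeholders, not data from any real variant.

```python
# Toy sketch of scanning stored raw reads for a known point mutation:
# build probes for the reference and risk alleles and count exact hits.
# The flanking sequence, alleles, and reads are invented placeholders.

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_allele_hits(reads, left_flank, right_flank, alleles):
    """Count reads containing each probe (left flank + allele base + right flank), on either strand."""
    hits = {}
    for name, base in alleles.items():
        probe = left_flank + base + right_flank
        hits[name] = sum(1 for r in reads if probe in r or revcomp(probe) in r)
    return hits

# Hypothetical SNP: 10 bp flanks on each side, "A" reference allele vs "G" risk allele.
left, right = "GATTACACCT", "TTGCAGGTCA"
alleles = {"reference(A)": "A", "risk(G)": "G"}
reads = [
    "CCGATTACACCTGTTGCAGGTCAAT",  # carries the risk allele G
    "TTGATTACACCTATTGCAGGTCACG",  # carries the reference allele A
]
print(count_allele_hits(reads, left, right, alleles))  # {'reference(A)': 1, 'risk(G)': 1}
```

Such a check can be rerun on the stored reads whenever a newly reported mutation becomes clinically relevant, which is exactly what “analyze often” is meant to capture.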