Workflow Wednesdays - Part 3. Read preprocessing - Read quality control 2. - QC results - Omixon

Text output

Both FastQC and PRINSEQ can generate some basic statistics for each FASTQ file. For FastQC these are: “Total Sequences” (the number of reads), “Filtered Sequences”, “Sequence Length” (the read length range) and “%GC” (average GC content). It also generates tables for “Overrepresented sequences” and “Kmer content”. PRINSEQ calculates the following measures:

stats_info: number of bases, number of reads;
stats_len: minimum, maximum, range, mean, standard deviation, mode and mode value, and median for read length;
stats_dinuc: dinucleotide odds ratio for AA/TT, AC/GT, AG/CT, AT, CA/TG, CC/GG, CG, GA/TC, GC and TA;
stats_tag: probability of a tag sequence on both ends, number of predefined MIDs;
stats_dup: number of exact duplicates, 5’ and 3’ duplicates, reverse complement duplicates, total number of duplicates;
stats_ns: number of reads containing N, maximum number and percentage of N/read.
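
To make these text statistics more concrete, here is a minimal Python sketch that computes a few of them (read and base counts, read length summary, overall GC percentage, reads containing N, exact duplicates) from a FASTQ stream. It is not how FastQC or PRINSEQ are implemented, and their exact definitions may differ. The function names `read_fastq` and `basic_stats` and the dictionary keys are made up for this illustration, and the parser assumes plain four-line FASTQ records.

```python
import io
import statistics
from collections import Counter


def read_fastq(handle):
    """Yield (sequence, quality) pairs, assuming 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def basic_stats(handle):
    """Summarise a FASTQ stream; the keys are illustrative, not tool output."""
    lengths = []
    gc_bases = 0
    reads_with_n = 0
    seen = Counter()
    for seq, _qual in read_fastq(handle):
        seq = seq.upper()
        lengths.append(len(seq))
        gc_bases += seq.count("G") + seq.count("C")
        if "N" in seq:
            reads_with_n += 1
        seen[seq] += 1
    if not lengths:
        raise ValueError("no reads found")
    total_bases = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total_bases,
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": statistics.mean(lengths),
        "median_len": statistics.median(lengths),
        # GC percentage over all bases (an assumption of this sketch)
        "gc_percent": 100.0 * gc_bases / total_bases,
        "reads_with_n": reads_with_n,
        # every copy of a sequence beyond the first counts as a duplicate
        "exact_duplicates": sum(n - 1 for n in seen.values()),
    }


# Three made-up reads, two of them identical.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGNCCA\n+\nIIIIII\n"
stats = basic_stats(io.StringIO(demo))
print(stats)
```

For a real file, pass an open file handle instead of the `io.StringIO` demo object.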

Graphical output

FastQC generates the following graphs:

Per base sequence quality
Per sequence quality scores
Per base sequence content
Per base GC content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Kmer Content

The standalone version of PRINSEQ generates the following types of graphs:

Length Distribution
GC Content Distribution
Base Quality Distribution
Occurrence of N
Tag Sequence Check
Sequence Duplication
Sequence Complexity
Dinucleotide Odds Ratios

As an exhaustive user manual and example datasets are available for both tools, I won’t present every graph from every tool, just some examples.

Per base sequence quality graph of one of the Illumina read files, generated by FastQC. Note that this is a very early sequencing run (from 2008); the quality score of newer Illumina reads is usually around 33-35, or even higher. Read length has also improved significantly since 2008.
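
As a rough illustration of what a per base quality graph summarises, the sketch below decodes FASTQ quality strings into Phred scores and averages them per read position. The function name `per_base_mean_quality` is made up for this example, and the default ASCII offset of 33 is an assumption: older Illumina pipelines used an offset of 64, so check which encoding your files use. FastQC's own plot also shows medians and quantiles, which this sketch leaves out.

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    offset is the ASCII offset of the quality encoding (33 assumed here).
    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals = []
    counts = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# 'I' encodes Phred 40 and '5' encodes Phred 20 with offset 33.
means = per_base_mean_quality(["II5", "I5"])
print(means)
```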

Per base quality graph of the Ion Torrent FASTQ file, generated by PRINSEQ. Nowadays, Ion Torrent reads usually have a slightly lower quality than Illumina or 454 reads.

Per base sequence content of the 454 read file. You can easily spot that the first 8-10 bases of all the reads are exactly the same. This usually happens when the adaptors have not been trimmed. The last few bases seem a little biased too, but that is usually due to the low number of long reads. This is essentially a sampling error: the highest base positions are covered by far fewer reads than the positions at the beginning or in the middle.
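
The two effects described in this caption can be reproduced with a short Python sketch: it tabulates the base fractions at each read position together with the number of reads covering that position, and counts how many leading positions are identical in every read. The helper names `per_base_content` and `shared_prefix_length` and the toy reads are invented for this illustration; they are not part of FastQC or PRINSEQ.

```python
from collections import Counter


def per_base_content(sequences):
    """Return one (depth, fractions) pair per read position.

    depth is the number of reads covering the position and fractions
    maps each of A, C, G, T, N to its share of those reads.
    """
    columns = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(columns):
                columns.append(Counter())
            columns[i][base] += 1
    table = []
    for counts in columns:
        depth = sum(counts.values())
        fractions = {b: counts[b] / depth for b in "ACGTN"}
        table.append((depth, fractions))
    return table


def shared_prefix_length(table):
    """Count leading positions where every read carries the same base."""
    n = 0
    for _depth, fractions in table:
        if max(fractions.values()) < 1.0:
            break
        n += 1
    return n


# Toy reads of different lengths that all start with the same four bases.
reads = ["TCAGAAAC", "TCAGCG", "TCAGTTTTTT"]
table = per_base_content(reads)
print("shared prefix length:", shared_prefix_length(table))
for position, (depth, fractions) in enumerate(table, start=1):
    print(position, depth, fractions)
```

In the printed table the depth drops towards the end of the reads, so the fractions at the last positions are based on only one or two reads. That is the sampling effect mentioned above, and a long shared prefix is a hint that adaptor or key sequence is still attached.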
