Text output
Both FastQC and PRINSEQ can generate some basic statistics for each FASTQ file. For FastQC these are: “Total Sequences” (i.e. the number of reads), “Filtered Sequences”, “Sequence Length” (read length range) and “%GC” (average GC content). Additionally, tables for “Overrepresented sequences” and “Kmer content” are generated. PRINSEQ calculates the following measures:
- stats_info: number of bases, number of reads;
- stats_len: minimum, maximum, range, mean, standard deviation, mode and mode value, and median for read length;
- stats_dinuc: dinucleotide odds ratio for AA/TT, AC/GT, AG/CT, AT, CA/TG, CC/GG, CG, GA/TC, GC and TA;
- stats_tag: probability of a tag sequence on both ends, number of predefined MIDs;
- stats_dup: number of exact duplicates, 5’ and 3’ duplicates, reverse complement duplicates, total number of duplicates;
- stats_ns: number of reads containing N, maximum number and maximum percentage of Ns per read.
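To make a few of these measures concrete, here is a minimal Python sketch that computes some of them (read and base counts, read length summary, %GC, reads containing N, exact duplicates) from a FASTQ stream. It is not how FastQC or PRINSEQ implement them; the function names `read_fastq` and `basic_stats` are made up for this illustration, and it assumes a simple four-lines-per-record FASTQ file. Counting N in the %GC denominator is also an assumption here, and the tools may handle it differently.

```python
import io
import statistics
from collections import Counter


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def basic_stats(handle):
    """Compute a few summary statistics (hypothetical helper, not a tool API)."""
    lengths = []
    gc = 0
    reads_with_n = 0
    seq_counts = Counter()
    for _, seq, _ in read_fastq(handle):
        seq = seq.upper()
        lengths.append(len(seq))
        gc += seq.count("G") + seq.count("C")
        if "N" in seq:
            reads_with_n += 1
        seq_counts[seq] += 1
    if not lengths:
        raise ValueError("no reads found")
    bases = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": bases,
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": statistics.mean(lengths),
        "median_len": statistics.median(lengths),
        # Assumption: N counts towards the denominator of %GC.
        "gc_percent": 100.0 * gc / bases,
        "reads_with_n": reads_with_n,
        # Every copy beyond the first of an identical sequence is a duplicate.
        "exact_duplicates": sum(c - 1 for c in seq_counts.values()),
    }


if __name__ == "__main__":
    demo = io.StringIO(
        "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGNCC\n+\nIIIII\n"
    )
    for key, value in basic_stats(demo).items():
        print(key, value)
```

For a real file, replace the `io.StringIO` demo with `open("reads.fastq")` (a placeholder file name). The sketch keeps every distinct sequence in memory, so it is only suitable for small files.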
Graphical output
FastQC generates the following graphs:
- Per base sequence quality
- Per sequence quality scores
- Per base sequence content
- Per base GC content
- Per sequence GC content
- Per base N content
- Sequence Length Distribution
- Sequence Duplication Levels
- Kmer Content
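As a rough illustration of the data behind a graph such as “Per base sequence quality”, the following Python sketch computes the mean Phred score at each read position. It is a simplification, not FastQC's implementation: the function name `per_base_mean_quality` is made up for this example, only the mean is computed (no medians or quartiles), and Phred+33 quality encoding is assumed.

```python
from collections import defaultdict


def per_base_mean_quality(qualities, offset=33):
    """Return the mean Phred score per read position (index 0 = first base).

    `qualities` is an iterable of FASTQ quality strings; `offset` is the
    ASCII offset of the encoding (33 is an assumption, i.e. Phred+33).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for qual in qualities:
        for pos, ch in enumerate(qual):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    # Reads may differ in length, so each position is averaged only over
    # the reads that are long enough to cover it.
    return [totals[pos] / counts[pos] for pos in sorted(counts)]


if __name__ == "__main__":
    # 'I' encodes Q40, '5' encodes Q20 and '!' encodes Q0 under Phred+33.
    for pos, mean_q in enumerate(per_base_mean_quality(["IIII", "II5", "!"]), 1):
        print(pos, round(mean_q, 2))
```

Plotting the returned list against position gives a simple line version of the quality graph.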
The standalone version of PRINSEQ generates the following types of graphs:
- Length Distribution
- GC Content Distribution
- Base Quality Distribution
- Occurrence of N
- Tag Sequence Check
- Sequence Duplication
- Sequence Complexity
- Dinucleotide Odds Ratios
As exhaustive user manuals and example datasets are available for both tools, I won’t present every graph from each tool, just some examples.