It”s often a good idea (especially for large datasets) to use a subset of the reads for some preliminary analyses. A smaller dataset can be useful for figuring out the alignment parameters or checking for alignment problems.
The easiest way to get a subset of a fastq file is to take the reads in the beginning of the file. Of course, this is not a random subset of the reads, but it can still be useful and it”s very easy to do.
Just open a linux terminal and use the “head” command to get the first N lines of a file. Make sure to use a number that is divisible by four, otherwise you”ll end up with an incomplete record for one of the reads.
head -n 400 SRR515927.fastq > SRR515927_subset.fastq
Now, if you check the length of the new file (remember the word count [wc] command?), you can see that it contains 400 lines (i.e. 100 reads).
$ wc -l SRR515927_subset.fastq 400 SRR515927_subset.fastq
Remember, that this is often a bad representation of your dataset!
There are quite a few possible methods for getting a random subsample from fastq files. I will show you a few:
Subsampling using the HTSeq Python package (source SeqAnswers):
After installing HTSeq, you have to create a python script file containing the following lines:
import sys, random, itertools import HTSeq fraction = float( sys.argv[1] ) in1 = iter( HTSeq.FastqReader( sys.argv[2] ) ) in2 = iter( HTSeq.FastqReader( sys.argv[3] ) ) out1 = open( sys.argv[4], "w" ) out2 = open( sys.argv[5], "w" ) for read1, read2 in itertools.izip( in1, in2 ): if random.random() < fraction: read1.write_to_fastq_file( out1 ) read2.write_to_fastq_file( out2 ) out1.close() out2.close()
To get approximately 10% of the reads, you can use the following command:
python htseq_subsample.py 0.1 SRR022913_1.fastq SRR022913_2.fastq SRR022913_1_htseq.fastq SRR022913_2_htseq.fastq
Let”s check the length of the original and new files!
$wc -l *_htseq.fastq 2479600 SRR022913_1_htseq.fastq 2479600 SRR022913_2_htseq.fastq
$wc -l reads/SRR022913_1.fastq 24818636 reads/SRR022913_1.fastq
You can also use the R BioConductor ShortRead package (source):
library(ShortRead) f1 <- FastqSampler("SRR022913_1.fastq", n=1000) f2 <- FastqSampler("SRR022913_2.fastq", n=1000) set.seed(123L); p1 <- yield(f1) set.seed(123L); p2 <- yield(f2) writeFastq(p1,"f1.fastq") writeFastq(p2,"f2.fastq")
Other solutions:
If you”re interested in other solutions, check out this SeqAnswer and this BioStar threads. You can also try the Seqtk toolkit, there”s also a Perl solution available here.