See also
fastq_mergepairs command
fastq_mergepairs options
With next-generation amplicon sequencing,
the PCR reaction often creates artifacts which are not correctly constructed,
i.e., do not contain an intended full-length biological sequence. Such artifacts
can be due to primers binding to a different region of the genome, secondary
structure formation in the reaction, and so on. Usually, these amplicons are
shorter or longer than expected and can therefore be filtered by setting a
length range for the merged read. This is supported by the -fastq_minmergelen
and -fastq_maxmergelen options.
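As an illustration of what a merged-length filter does, here is a minimal Python sketch. It is not the implementation behind the options above: the function name filter_by_length, the example records, and the choice of inclusive bounds are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (read label, merged sequence)


def filter_by_length(records: Iterable[Record],
                     min_len: int,
                     max_len: int) -> Iterator[Record]:
    """Yield only merged reads whose length lies in [min_len, max_len].

    Inclusive bounds are an assumption of this sketch.
    """
    for label, seq in records:
        if min_len <= len(seq) <= max_len:
            yield label, seq


# Hypothetical merged reads: one plausible amplicon and two artifacts.
merged = [
    ("read1", "A" * 253),  # within the expected range
    ("read2", "A" * 120),  # too short, e.g. off-target priming
    ("read3", "A" * 410),  # too long
]

kept = list(filter_by_length(merged, min_len=230, max_len=270))
print([label for label, _ in kept])
```

Reads outside the range are simply dropped, so a range that is too narrow silently discards genuine sequences as well as artifacts.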
Length range for 16S V4
Currently, a popular method is 2 x 250 reads of the 16S V4 hypervariable
region. This region has a well-conserved length, unlike other 16S
hypervariable regions. You can therefore set a narrow length range to
exclude artifacts. With the typical primers V4F (GTGCCAGCMGCCGCGGTAA) and
V4R (GGACTACHVGGGTWTCTAAT), I set a length range of 230 to 270. These values
exceed the known variation in length to allow for novel outliers.
How to determine the length range for a primer pair
The search_pcr command can generate amplicon sequences given primer
sequences and a database of known genes (or genomes). The length range of
the predicted amplicons can be determined by the fastx_info command. Be
careful to include
or exclude the primer sequences in the total length depending on whether
they will appear in the reads (typically they do, but there are many
variations in the library preparation protocols). I recommend using a range
which exceeds the measured minimum and maximum because there may be novel
outliers which are not in the database.
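The arithmetic for widening a measured range can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function length_range, the margin of 20, and the list of predicted amplicon lengths are hypothetical, and the primer adjustment applies only when the predicted amplicons exclude primers that will appear in the reads.

```python
from typing import Sequence, Tuple

# Primer sequences as given above for 16S V4.
V4F = "GTGCCAGCMGCCGCGGTAA"
V4R = "GGACTACHVGGGTWTCTAAT"


def length_range(amplicon_lengths: Sequence[int],
                 margin: int = 20,
                 primer_bases: int = 0) -> Tuple[int, int]:
    """Return (min_len, max_len) widened by `margin` on each side.

    primer_bases is added to both bounds when the predicted amplicons
    exclude the primers but the reads will contain them.
    """
    if not amplicon_lengths:
        raise ValueError("no amplicon lengths given")
    low = min(amplicon_lengths) + primer_bases - margin
    high = max(amplicon_lengths) + primer_bases + margin
    return max(low, 1), high


# Hypothetical predicted amplicon lengths, not measured data.
predicted = [251, 252, 253, 253, 254]

# Case 1: predicted lengths already include the primers.
print(length_range(predicted, margin=20))

# Case 2: predicted lengths exclude primers that appear in the reads.
print(length_range(predicted, margin=20, primer_bases=len(V4F) + len(V4R)))
```

The size of the margin is a judgment call: a larger value retains more novel outliers but also lets more artifacts through.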