See also
Quality control for OTU sequences
fastq_mergepairs command
fastq_mergepairs options
Reviewing a fastq_mergepairs report to check for problems
Trouble-shooting problems with fastq_mergepairs
A paired read assembler makes two types of error: false negatives and false
positives. This page explains how you can check for these errors. See also
quality control for OTU sequences.
Merging errors may not matter
If you are using a UPARSE
or UNOISE pipeline, then errors by the merger may not matter. Typically, the
errors are close to random, so the large majority are
singletons and should therefore be discarded before making OTUs. Wrong
alignments often have many mismatches, causing low Q scores and hence
expected errors > 1, so the mis-aligned
reads will be discarded during quality filtering. Errors by the merger
matter only if they are reproduced frequently enough, and have high enough Q
scores, that they propagate through the complete pipeline and generate
spurious OTUs. You can verify this by doing quality
control on your OTU sequences.
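The expected-errors criterion mentioned above can be illustrated with a short sketch. It assumes Phred+33 quality encoding and the standard relation P = 10^(-Q/10) between a Q score and its error probability. The function names and the max_ee parameter are hypothetical, chosen for illustration; this is not the USEARCH implementation.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities implied by the Q scores.

    Assumes the quality string uses Phred+33 ASCII encoding
    unless another offset is given.
    """
    total = 0.0
    for ch in quality_string:
        q = ord(ch) - phred_offset      # ASCII character -> Q score
        total += 10 ** (-q / 10)        # Q score -> error probability
    return total


def passes_filter(quality_string, max_ee=1.0):
    """True if the read would survive a 'discard if expected errors > max_ee' filter."""
    return expected_errors(quality_string) <= max_ee
```

A mis-aligned merged read with a run of low-Q bases in the overlap accumulates expected errors quickly: twenty bases at Q10 alone contribute about 2.0, which exceeds a threshold of 1.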
1. False negatives
A false negative is a read pair that
has a valid alignment, but is discarded. If you have a low rate of merged
reads, and you believe the amplicons are short enough that the pairs should
overlap, then you may have a problem with false negatives. The way to check
this is to review the merge report to
identify the most common reason(s) that pairs are discarded. The simplest way to investigate is to use
the -fastqout_notmerged_fwd
and -fastqout_notmerged_rev options to get the pairs which did not merge,
then (if needed) use fastx_subsample to get a small subset for manual
investigation. If there are several different reasons, then you can
use the tabbedout file to find
some examples of pairs that failed for a given reason. Take the example pairs and experiment with different parameters. You
can use ublast as an independent check for
alignments, and / or align the unpaired reads to a reference database of
full-length sequences (see trouble-shooting problems with fastq_mergepairs
for more discussion). These tests should show if there are good alignments
that fastq_mergepairs is missing, and suggest changes to the parameters
which might improve the merge rate.
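For a handful of unmerged pairs, a rough independent check can also be scripted. The sketch below scans ungapped overlaps between the end of R1 and the start of the reverse-complemented R2 and reports the one with the lowest mismatch fraction. It is only an illustration: it is not the algorithm used by fastq_mergepairs or ublast, it ignores gaps, Q scores and staggered pairs, and the function names and the min_overlap parameter are hypothetical.

```python
# Translation table for complementing DNA letters (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_ungapped_overlap(r1, r2, min_overlap=10):
    """Find the ungapped overlap of the end of r1 with the start of revcomp(r2).

    Returns (overlap_length, mismatches) for the overlap with the lowest
    mismatch fraction (ties go to the longer overlap), or None if the
    reads are too short to overlap by min_overlap bases.
    """
    r2rc = reverse_complement(r2)
    best = None
    max_len = min(len(r1), len(r2rc))
    for length in range(min_overlap, max_len + 1):
        tail = r1[-length:]       # last `length` bases of R1
        head = r2rc[:length]      # first `length` bases of revcomp(R2)
        mismatches = sum(a != b for a, b in zip(tail, head))
        frac = mismatches / length
        if (best is None or frac < best[2]
                or (frac == best[2] and length > best[0])):
            best = (length, mismatches, frac)
    return None if best is None else best[:2]
```

If this kind of check finds a long, low-mismatch overlap for a pair that was discarded, that suggests a false negative and indicates which parameter (for example, a mismatch or overlap-length limit) may be too strict.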
2. False positives
A false positive is caused when the
alignment is wrong. There are two cases.
(A) The pair does not in
fact overlap, but a spurious alignment is found due to a spurious similarity
between short segments at the end of R1 and the start of the
reverse-complemented R2.
(B) The pair does overlap, but the
alignment is wrong (too long or too short because it is offset compared to
the correct position). This tends to happen more often when the true
overlaps are short.
The way to check for false positives is to align
the merged sequences to a reference database of full-length sequences. An
incorrect alignment is then visible as a gap. If the pair does not in fact
overlap, then the alignment will show a gap in the merged sequence. If the
alignment is offset compared to the correct alignment, then this can create
a gap in the merged sequence or in the full-length sequence, depending on
whether the alignment is too long or too short. Keep in mind that gaps
may be valid biological insertions or deletions. The clearest cases are
where the segments on both sides of the gap have high identity, especially
in a highly conserved gene such as 16S.
It is usually impractical to
manually review a large number of alignments to look for such gaps. You can
deal with this by spot-checking, or by writing scripts to search for
suspicious gaps which you can review manually. If you are running an OTU (or
denoising) pipeline, then you could spot-check the sequences for some
low-abundance OTUs, because these are where most problems are found.
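A script of the kind described above might look like the following sketch. It assumes you have already extracted each pairwise alignment as two equal-length rows (merged sequence and reference) with '-' marking gaps; how you obtain those rows depends on your aligner and is not shown. The function names and the thresholds (min_gap, flank, min_identity) are hypothetical defaults, not recommended values.

```python
import re


def flank_identity(a, b):
    """Fraction of identical letters over columns where neither row has a gap."""
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not cols:
        return 0.0
    return sum(x.upper() == y.upper() for x, y in cols) / len(cols)


def suspicious_gaps(merged_row, ref_row, min_gap=5, flank=20, min_identity=0.95):
    """List internal gaps whose flanking segments both have high identity.

    Returns tuples (row_with_gap, start_column, gap_length,
    left_identity, right_identity). Terminal gaps are skipped because a
    merged read is normally shorter than a full-length reference.
    """
    if len(merged_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    hits = []
    for name, row in (("merged", merged_row), ("reference", ref_row)):
        for m in re.finditer(r"-+", row):
            start, end = m.span()
            if start == 0 or end == len(row):
                continue                      # terminal gap, not informative
            if end - start < min_gap:
                continue                      # short gaps are more likely real indels
            lo = max(0, start - flank)
            left = flank_identity(merged_row[lo:start], ref_row[lo:start])
            right = flank_identity(merged_row[end:end + flank],
                                   ref_row[end:end + flank])
            if left >= min_identity and right >= min_identity:
                hits.append((name, start, end - start, left, right))
    return hits
```

Hits from such a script are only candidates for manual review, since a long gap with well-matched flanks can still be a valid biological insertion or deletion.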