See also
fastq_mergepairs command
fastq_mergepairs options
Reviewing a fastq_mergepairs report to check for
problems
Using the tabbedout
file to investigate merging problems
Validating merged reads to check for problems
The primary
tools for trouble-shooting merge problems are the report files generated by the
-report, -tabbedout and -alnout options.
The report file gives a summary
which will show if many pairs failed to merge, and if so will give the most
common reasons why they failed, e.g. because no alignment was found or there
were too many mismatches (differences) in the alignment.
To investigate
examples, take a small subset of pairs which failed to merge and examine the
tabbedout and alnout files. Example command lines:
usearch -fastq_mergepairs *R1*.fastq -fastqout_notmerged_fwd
fwd.fq -fastqout_notmerged_rev rev.fq
usearch -fastx_subsample fwd.fq
-reverse rev.fq -fastqout subf.fq -output2 subr.fq -sample_size 100
usearch -fastq_mergepairs subf.fq -reverse subr.fq -tabbedout out.txt
-report report.txt \
-alnout aln.txt
The format of the tabbedout file is not documented in
detail (and is subject to change in different usearch builds), but is fairly
self-explanatory. Each read pair is one line in the file. The read label is
the first field. Subsequent fields are separated by tabs. Each field reports
the results of one step in the merging process:
M00967:15:000000000-A2G1J:1:1101:18083:3926 aln=123-128-121 diffs=15 toomanydiffs result=notmerged
This shows that the pair failed to merge because there were too many (15) mismatches in the alignment.
The aln= field has three values separated by dashes: number of unaligned bases in the forward read, alignment length, and number of unaligned bases in the merged read. In the example above, there are 15 mismatches in the alignment (diffs=15) which exceeds the maximum (toomanydiffs) and the pair is therefore discarded (result=notmerged). You can find the alignment in the alnout file, or extract just this read pair (fastx_getseq) so that you get just one alignment.
Trouble-shooting pairs that do not align
If the reads do
not align, this may be because there are too many differences in the
alignment due to read erors, the alignment is very short, or there is no
overlap because the amplicon is too long. Long amplicons may be caused by
PhiX reads or by amplifying a different gene or region; you can check for
this by aligning the forward and reverse reads to a database of known genes
(e.g. SILVA for 16S).
You can check whether a plausible alignment exists by using ublast to align the reads. Use fastx_getseq to get the pair into FASTQ files containing just one read (fwd.fq and rev.fq). Use fastx_revcomp to get the reverse read on the same strand, and use the -strand plus option of ublast to exclude matches on the wrong strand. For example,
usearch -fastx_getseq R1s.fq M00967:15:0000 -fastqout fwd.fq
usearch -fastx_getseq R2s.fq M00967:15:0000 -fastqout
rev.fq
usearch -fastx_revcomp rev.fq -fastqout rev_rc.fq
usearch -ublast fwd.fq -db rev_rc.fq -strand plus -evalue 1e-2 -alnout
hits.txt