Reviewing a fastq_mergepairs report to check for problems
Using the tabbedout file to investigate merging problems
Validating merged reads to check for problems
The primary tools for trouble-shooting merge problems are the report files generated by the -report, -tabbedout and -alnout options.
The report file gives a summary which will show if many pairs failed to merge, and if so will give the most common reasons why they failed, e.g. because no alignment was found or there were too many mismatches (differences) in the alignment.
To investigate examples, take a small subset of pairs which failed to merge and examine the tabbedout and alnout files. Example command lines:
usearch -fastq_mergepairs *R1*.fastq -fastqout_notmerged_fwd fwd.fq -fastqout_notmerged_rev rev.fq
usearch -fastx_subsample fwd.fq -reverse rev.fq -fastqout subf.fq -output2 subr.fq -sample_size 100
usearch -fastq_mergepairs subf.fq -reverse subr.fq -tabbedout out.txt -report report.txt \
The format of the tabbedout file is not documented in detail (and is subject to change in different usearch builds), but is fairly self-explanatory. Each read pair is one line in the file. The read label is the first field. Subsequent fields are separated by tabs. Each field reports the results of one step in the merging process:
M00967:15:000000000-A2G1J:1:1101:18083:3926 aln=123-128-121 diffs=15 toomanydiffs result=notmerged
This shows that the pair failed to merge because there were too many (15) mismatches in the alignment.
The aln= field has three values separated by dashes: number of unaligned bases in the forward read, alignment length, and number of unaligned bases in the merged read. In the example above, there are 15 mismatches in the alignment (diffs=15) which exceeds the maximum (toomanydiffs) and the pair is therefore discarded (result=notmerged). You can find the alignment in the alnout file, or extract just this read pair (fastx_getseq) so that you get just one alignment.
Trouble-shooting pairs that do not align
If the reads do not align, this may be because there are too many differences in the alignment due to read erors, the alignment is very short, or there is no overlap because the amplicon is too long. Long amplicons may be caused by PhiX reads or by amplifying a different gene or region; you can check for this by aligning the forward and reverse reads to a database of known genes (e.g. SILVA for 16S).
You can check whether a plausible alignment exists by using ublast to align the reads. Use fastx_getseq to get the pair into FASTQ files containing just one read (fwd.fq and rev.fq). Use fastx_revcomp to get the reverse read on the same strand, and use the -strand plus option of ublast to exclude matches on the wrong strand. For example,
usearch -fastx_getseq R1s.fq M00967:15:0000 -fastqout fwd.fq
usearch -fastx_getseq R2s.fq M00967:15:0000 -fastqout rev.fq
usearch -fastx_revcomp rev.fq -fastqout rev_rc.fq
usearch -ublast fwd.fq -db rev_rc.fq -strand plus -evalue 1e-2 -alnout hits.txt