fastq_mergepairs command

Merges paired reads. (This is sometimes called 'assembly' of paired reads, but I find this term confusing because assembly usually refers to making longer contigs, so I prefer to call it merging.)

Typical usage:

usearch -fastq_mergepairs *_R1_*.fastq -fastqout merged.fq -relabel @

The -fastq_merge_maxee option can be used to set an expected errors threshold during merging, but for OTU clustering, quality filtering is generally performed as a separate post-processing step:

usearch -fastq_filter merged.fq -fastq_maxee 1.0 -fastqout filtered.fq

This is because filtered reads are used to construct OTUs but unfiltered reads are used to construct the OTU table, so merged reads before and after filtering are both needed. See UPARSE tutorials for some examples.
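As a rough illustration of what an expected errors threshold such as -fastq_maxee 1.0 measures, the sketch below sums the per-base error probabilities implied by the Phred scores of one read. It assumes Phred+33 quality encoding; the function names expected_errors and passes_maxee are hypothetical and are not part of usearch.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities implied by a FASTQ quality string.

    Assumes Phred+offset encoding (offset=33 is an assumption here), so that
    Q = ord(char) - offset and P(error) = 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_maxee(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read unless its expected errors exceed max_ee."""
    return expected_errors(quality) <= max_ee


if __name__ == "__main__":
    # 'I' encodes Q40 under Phred+33, i.e. an error probability of 0.0001 per base.
    high_quality = "I" * 250
    print(expected_errors(high_quality), passes_maxee(high_quality))
```

Because the value is a sum over all bases, a long read can exceed the threshold even when every individual base has a fairly high Q score.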

Merged reads are written to -fastqout (for FASTQ) and / or -fastaout (for FASTA). Reads that fail to merge are written to -fastqout_notmerged_fwd, -fastqout_notmerged_rev, -fastaout_notmerged_fwd and -fastaout_notmerged_rev.

The -tabbedout option reports one line per read pair giving information about how it was processed, e.g. whether an alignment was found and the expected errors value for the pair.

The -report filename option gives summary information. In one example report, there are several anomalously short pairs with merged lengths in the range 20-30, much shorter than the mean (330 nt), which suggests using -fastq_minmergelen to filter them out.

Several forward FASTQ filenames may be given following the -fastq_mergepairs option. This allows you to use shell wildcards to merge several pairs of files in a single step. If you use this feature, you will typically want to use the -relabel @ feature (see below) to label the merged reads with the sample name.

The FASTQ filename for the forward reads (R1s) is specified by the -fastq_mergepairs option, and the reverse read filename (R2s) is specified by the -reverse option. If the -reverse option is not given, the reverse read filename is constructed by replacing _R1 with _R2 in the forward filename.
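The default reverse filename rule can be sketched as a simple string substitution. This is only an illustration: the function name default_reverse_name is hypothetical, and replacing just the first occurrence of _R1 is an assumption, since the text does not say what happens when _R1 appears more than once.

```python
def default_reverse_name(forward_name: str) -> str:
    """Derive an R2 filename from an R1 filename by replacing _R1 with _R2.

    Assumption: only the first occurrence of "_R1" is replaced.
    """
    if "_R1" not in forward_name:
        raise ValueError(f"no _R1 in forward filename: {forward_name!r}")
    return forward_name.replace("_R1", "_R2", 1)


if __name__ == "__main__":
    print(default_reverse_name("Mock_S188_L001_R1_001.fastq"))
```

If your files do not follow this naming pattern, giving the reverse filename explicitly with -reverse avoids relying on the substitution.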

The -relabel string option specifies that the read labels should be changed in the output files. Labels are made by appending an integer 1, 2, 3... to the string. Only reads that are successfully merged are counted, so there are no gaps in the numbering. The special value @ indicates that the string should be constructed from the file name by truncating the file name at the first underscore or period and appending a period. With a typical Illumina FASTQ file name, this gives the sample name. So, for example, if the R1 file name is Mock_S188_L001_R1_001.fastq, then the string is Mock and the output labels will be Mock.1, Mock.2 etc.
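The labeling rule for -relabel @ can be sketched as follows. The helper names sample_prefix and relabel_merged are hypothetical, and stripping any directory part before truncating is an assumption; the rest follows the description above (truncate at the first underscore or period, append a period, then number the merged reads 1, 2, 3...).

```python
import os
import re


def sample_prefix(forward_path: str) -> str:
    """Truncate the file name at the first underscore or period, then append a period."""
    name = os.path.basename(forward_path)  # assumption: directory part is ignored
    stem = re.split(r"[_.]", name, maxsplit=1)[0]
    return stem + "."


def relabel_merged(forward_path: str, merged_count: int) -> list[str]:
    """Labels for merged reads: prefix plus 1, 2, 3... with no gaps."""
    prefix = sample_prefix(forward_path)
    return [f"{prefix}{i}" for i in range(1, merged_count + 1)]


if __name__ == "__main__":
    print(relabel_merged("Mock_S188_L001_R1_001.fastq", 3))
```

Only merged reads are passed to the counter in this sketch, which is why the numbering has no gaps.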

The -sample string option specifies that sample=string; should be added to the read label (supported in v.8.1.1800 and later).

Forward and reverse reads must be in 1:1 correspondence and must appear in the same order in both files. The labels for the forward and reverse read in a given pair must be identical, or identical except for a single position where a '1' appears in the forward read label and a '2' appears in the reverse read label.
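The label rule can be checked before merging with a short script. The sketch below is an assumption-laden illustration, not usearch's own code: labels_match is a hypothetical name, and it compares the labels exactly as given, without deciding how much of a FASTQ header line counts as the label.

```python
def labels_match(fwd_label: str, rev_label: str) -> bool:
    """True if the labels are identical, or differ at exactly one position
    where the forward label has '1' and the reverse label has '2'."""
    if fwd_label == rev_label:
        return True
    if len(fwd_label) != len(rev_label):
        return False
    # Collect the character pairs at every position where the labels differ.
    diffs = [(f, r) for f, r in zip(fwd_label, rev_label) if f != r]
    return diffs == [("1", "2")]


if __name__ == "__main__":
    print(labels_match("M01:7:000-A1B2C:1:1101:15000:1330/1",
                       "M01:7:000-A1B2C:1:1101:15000:1330/2"))
```

Running a check like this over both files in parallel is one way to detect pairs that are out of order or missing from one file.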

The -alnout option gives a file name for human-readable alignments of the overlap region.

Option		Description
fastq_minovlen k		Minimum length of the overlap. Default: 16. Note: overlaps shorter than the -minhsp value will fail to align, so -minhsp also has the effect of imposing a minimum overlap. You should therefore set -minhsp to the shortest overlap that you expect in your data. Values less than 8 may cause performance problems and spurious overlaps.
fastq_minmergelen L		Minimum length of the merged read. Default: no minimum.
fastq_maxmergelen L		Maximum length of the merged read. Default: no maximum.
fastq_maxdiffs n, fastq_maxdiffpct n		fastq_maxdiffs sets the maximum number of mismatches allowed in the overlap region. Default: 5. fastq_maxdiffpct sets the maximum fraction of mismatches allowed in the overlap region, expressed as an integer percentage. Default: 5.
fastq_maxgaps n		Maximum number of gaps allowed in the alignment of the overlapping region. Default: 0. Gaps are very rare with Illumina paired reads, so a gap is more likely to indicate a misalignment than a read error. Also, the merge is ambiguous in a gapped column: should the base be included or not? There is no Q score to decide which read is better in this case.
fastq_merge_maxee e		Discard pairs if the number of expected errors is > e after merging. By default, no expected error filtering is performed.
fastq_trunctail q		Truncates the forward and reverse reads at the first Q<=q, if present. This truncation is performed before aligning the pair. This option is provided for older Illumina reads where Q=2 ("#" tails) was used to indicate a bad tail. For such reads, it is recommended to use -fastq_trunctail 2 or higher, as low-quality tails often caused alignments to fail. Default: no quality truncation, as this is not needed with newer reads.
fastq_minlen L		Minimum length of the forward and reverse read, after truncating per -fastq_trunctail if applicable. Default: no minimum.
fastq_nostagger		Do not merge a pair where the alignment is "staggered", i.e. the forward read aligned as --FORWARD over the reverse read aligned as REVERSE--. Staggered alignments are generated when the template sequence is shorter than the read length, causing the read to extend into the opposite sequencing primer and adapter. See read layouts. By default, pairs with staggered alignments are merged and trimmed. Trimming removes letters in "overhangs" that align to terminal gaps; in the above example, RE would be trimmed from the reverse read and RD would be trimmed from the forward read.
fastq_eeout		Append "ee=xxx;" annotation to the read labels giving the expected errors after merging.
fastqout_notmerged_fwd, fastqout_notmerged_rev, fastaout_notmerged_fwd, fastaout_notmerged_rev		Filenames for forward and reverse reads that are not merged (FASTQ or FASTA format).
alnout		Filename for human-readable alignments of the overlap region.
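The trimming of staggered alignments described under fastq_nostagger can be sketched on the two gapped alignment rows. This is a toy illustration under stated assumptions: the rows are equal-length strings with "-" for terminal gaps, the function names is_staggered and trim_staggered are hypothetical, and the staggered test (forward row starts with a gap, reverse row ends with one) is an interpretation of the example layout rather than usearch's actual implementation.

```python
GAP = "-"


def is_staggered(fwd_row: str, rev_row: str) -> bool:
    """Assumed test: the forward row begins with a terminal gap and the
    reverse row ends with one, as in --FORWARD over REVERSE--."""
    return fwd_row.startswith(GAP) and rev_row.endswith(GAP)


def trim_staggered(fwd_row: str, rev_row: str) -> tuple[str, str]:
    """Remove overhang letters that align to terminal gaps in the other row."""
    if len(fwd_row) != len(rev_row):
        raise ValueError("alignment rows must have equal length")
    # First column covered by both rows: skip the longer run of leading gaps.
    start = max(len(fwd_row) - len(fwd_row.lstrip(GAP)),
                len(rev_row) - len(rev_row.lstrip(GAP)))
    # One past the last column covered by both rows.
    end = min(len(fwd_row.rstrip(GAP)), len(rev_row.rstrip(GAP)))
    return fwd_row[start:end], rev_row[start:end]


if __name__ == "__main__":
    fwd, rev = "--FORWARD", "REVERSE--"
    if is_staggered(fwd, rev):
        # RD is dropped from the forward row and RE from the reverse row.
        print(trim_staggered(fwd, rev))
```

After trimming, both rows cover the same columns, so the remaining letters can be merged column by column.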