Read preparation: strip primer-binding sequences

See also
OTU / denoising pipeline
Read preparation

Usually, the reads include the primer-binding sequences of your gene (e.g., 16S). I generally recommend stripping these sequences, because PCR tends to introduce substitutions in them (most often substitutions that remove mismatches with the primer).

With typical paired reads, the forward read (R1) starts with the forward primer, and the reverse read (R2) starts with the reverse primer.

You can check this with the search_oligodb command, e.g.:

usearch -search_oligodb reads.fq -db primers.fa -strand both \
-userout primer_hits.txt -userfields query+qlo+qhi+qstrand

You will probably see many hits to a primer starting at the first letter in the reads. If the files are big, you can use the fastx_subsample command to get a small subset so that the output files are easier to review.
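As a rough cross-check that does not need usearch, the sketch below counts how many reads in a FASTQ file begin with a given primer. It is only an illustration, not a replacement for search_oligodb: the function names primer_to_regex and count_primer_starts are hypothetical, the file is assumed to use four lines per record with no line wrapping, and only exact matches (allowing for IUPAC degenerate letters in the primer) are counted.

```shell
# Convert a primer containing IUPAC degenerate codes into a regular expression.
# The replacements contain only A, C, G, T and brackets, so applying the
# substitutions one after another is safe.
primer_to_regex() {
    printf '%s' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Count reads whose sequence line starts with the primer.
# $1 = primer sequence, $2 = FASTQ file (assumed: 4 lines per record).
count_primer_starts() {
    primer_re="^$(primer_to_regex "$1")"
    awk -v re="$primer_re" '
        NR % 4 == 2 { n++; if ($0 ~ re) hits++ }
        END { printf "%d of %d reads start with the primer\n", hits, n }
    ' "$2"
}

# Example call (file name is a placeholder):
#   count_primer_starts GTGCCAGCMGCCGCGGTAA reads_R1.fq
```

If most reads are counted as hits, deleting a fixed number of letters from the start of the reads is likely to be appropriate; a low count suggests checking the search_oligodb output more closely.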

Primer stripping should be done before quality filtering (because every base causes an increase in expected errors) and before finding unique sequences (because variation in the primer-binding region will split one biological sequence over several uniques, degrading the calculation of unique sequence abundance).
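To make the first point concrete: the expected number of errors in a read is the sum of the per-base error probabilities, where a base with Phred score Q has error probability 10^(-Q/10). The sketch below works this out for a 19-letter primer-binding sequence. The uniform quality of Q30 is an assumption chosen purely for illustration; real reads will vary.

```shell
# Illustrative assumption: every primer base has Phred quality Q30.
q=30
# Length of the primer-binding sequence (19, as for V4F).
n=19

# Per-base error probability is 10^(-Q/10); the n bases together
# contribute n times that to the read's expected errors.
ee=$(awk -v q="$q" -v n="$n" 'BEGIN { printf "%.3f", n * 10 ^ (-q / 10) }')

echo "expected errors contributed by $n primer bases at Q$q: $ee"
```

The contribution is small per read, but it counts against any expected-error threshold without telling you anything about the biological sequence.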

If the reads start with a binding sequence for a primer of length L, then a simple way to strip it is to delete the first L letters of the reads using the fastx_truncate command. For example, the currently popular V4 primers V4F (GTGCCAGCMGCCGCGGTAA, length 19) and V4R (GGACTACHVGGGTWTCTAAT, length 20) can be stripped by:

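# Forward reads: delete the 19-letter V4F binding sequence.
for fq in *R1*.fastq
do
    usearch -fastx_truncate "$fq" -stripleft 19 -fastqout "$fq.stripped"
done

# Reverse reads: delete the 20-letter V4R binding sequence.
for fq in *R2*.fastq
do
    usearch -fastx_truncate "$fq" -stripleft 20 -fastqout "$fq.stripped"
done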
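The -stripleft values are simply the primer lengths, with each degenerate letter (M, H, V, W, ...) counting as one position. As a small sketch, the lengths can be derived in the shell instead of counted by hand; the variable names here are only illustrative.

```shell
# Primer sequences as given above; IUPAC codes count as one letter each.
V4F=GTGCCAGCMGCCGCGGTAA
V4R=GGACTACHVGGGTWTCTAAT

# ${#var} expands to the length of the string stored in var.
len_f=${#V4F}
len_r=${#V4R}

echo "forward primer length: $len_f"
echo "reverse primer length: $len_r"
```

These values could then be passed to -stripleft, which avoids off-by-one mistakes when switching to a different primer pair.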

If you have paired reads, it is usually more convenient to strip after merging and combining all reads into a single file (merged.fq), like this:

usearch -fastx_truncate merged.fq -stripleft 19 -stripright 20 -fastqout stripped.fq
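In a merged read, the reverse primer's binding sequence sits at the right-hand end as the reverse complement of the primer, which is why -stripright uses the V4R length. The sketch below computes that reverse complement so you know what to look for at the ends of merged reads. It assumes the merged reads are in the orientation of the forward read; the helper name revcomp is hypothetical, and it relies on the common rev and tr utilities.

```shell
# Reverse-complement a primer, including IUPAC degenerate codes
# (R<->Y, K<->M, B<->V, D<->H; S, W and N are their own complements).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVSWN' 'TGCAYRMKVHDBSWN'
}

V4R=GGACTACHVGGGTWTCTAAT
V4R_rc=$(revcomp "$V4R")

echo "merged reads are expected to end with: $V4R_rc"
```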