Orient nucleotide sequences in a FASTA or FASTQ file to the same strand as a database.
The -fastaout option is a FASTA filename for the oriented sequences.
The -fastqout option is a FASTQ filename for the oriented sequences (requires FASTQ input).
The -notmatched option is an output file for sequences with undetermined orientation. It will be FASTA or FASTQ, depending on the input file format.
The -tabbedout option is a tabbed text file giving the orientation of each input sequence. Fields are: query_label, strand, plus_count, minus_count. Strand is + or -.
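The -tabbedout file can be summarized with a few lines of script. The following is a minimal Python sketch, assuming the file is tab-separated with the four fields listed above, has no header line, and stores the counts as integers. The function names and the sample rows in the usage note are hypothetical and not part of usearch.

```python
from collections import Counter

def read_orient_table(lines):
    """Parse -tabbedout lines into dicts (assumed layout: label, strand, plus, minus)."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        label, strand, plus, minus = fields[:4]
        rows.append({
            "label": label,
            "strand": strand,
            "plus": int(plus),    # word matches on the plus strand
            "minus": int(minus),  # word matches on the minus strand
        })
    return rows

def strand_summary(rows):
    """Count how many sequences were assigned to each strand value."""
    return Counter(row["strand"] for row in rows)
```

For a quick check, pass an open file handle or any list of lines, for example strand_summary(read_orient_table(["r1\t+\t50\t2", "r2\t-\t1\t40"])), which counts one sequence on each strand for these made-up rows.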
For each input sequence, the orient command attempts to determine whether it
is on the same strand as the database sequences (which are assumed to all be
on the same strand), or reverse-complemented. If the latter, the sequence is
reverse-complemented so that the output sequences are all on the same
strand.
The command uses a simple word-counting strategy: it counts word (k-mer)
matches to the database on each strand and chooses the strand with more
matches. If too few words match, or the counts on the two strands are too
close, the orientation is undetermined and the sequence is discarded (use
-notmatched to save such sequences). The -tabbedout file reports the
results for each sequence.
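The word-counting idea can be illustrated with a short Python sketch. This is not the usearch implementation: the word length k, the min_words and min_ratio thresholds, the "?" marker for undetermined orientation, and all function names are assumptions chosen for illustration only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]

def words(seq, k):
    """Return the set of distinct k-mers (words) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(db_seqs, k=8):
    """Collect every word found in the database sequences."""
    index = set()
    for seq in db_seqs:
        index |= words(seq, k)
    return index

def orient(seq, index, k=8, min_words=4, min_ratio=2.0):
    """Return (strand, plus_count, minus_count, oriented_seq).

    strand is "+", "-", or "?" when orientation is undetermined
    (thresholds here are illustrative assumptions).
    """
    rc = reverse_complement(seq)
    plus = sum(1 for w in words(seq, k) if w in index)
    minus = sum(1 for w in words(rc, k) if w in index)
    best, other = max(plus, minus), min(plus, minus)
    # Too few matching words, or the two strands score too similarly.
    if best < min_words or best < min_ratio * other:
        return "?", plus, minus, None
    if plus > minus:
        return "+", plus, minus, seq
    return "-", plus, minus, rc
```

A query that is the reverse complement of a database sequence should come back with strand "-" and be returned in the database orientation, while a query sharing no words with the database comes back as "?".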
You can use any 16S database you like; the choice does not matter much.
The safest option is a large database, such as Greengenes or SILVA, to
make sure you have good coverage. However, word counting works well down
to fairly low identities, so a smaller database such as the RDP training
set or the NCBI BLAST 16S reference should be fine.
Multithreading is supported.
Example
usearch -orient reads.fastq -db 16s.udb -fastqout reads_plus.fq