orient command

Orient nucleotide sequences in a FASTA or FASTQ file to the same strand as a database.

The -fastaout option is a FASTA filename for the oriented sequences.

The -fastqout option is a FASTQ filename for the oriented sequences (requires FASTQ input).

The -notmatched option is an output file for sequences with undetermined orientation. It will be FASTA or FASTQ, depending on the input file format.

The -tabbedout option is a tabbed text file giving the orientation of each input sequence. Fields are: query_label, strand, plus_count, minus_count. Strand is + or -.
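As a rough illustration of working with the -tabbedout file, the following Python sketch reads the four fields listed above and tallies the strand assignments. The names OrientRecord, parse_tabbedout and strand_summary are hypothetical helpers written for this example and are not part of usearch. The sketch assumes one tab-separated record per line with no header row.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class OrientRecord(NamedTuple):
    """One line of the tabbed output (field order taken from the text above)."""
    label: str
    strand: str
    plus_count: int
    minus_count: int


def parse_tabbedout(lines: Iterable[str]) -> List[OrientRecord]:
    """Parse tab-separated lines: query_label, strand, plus_count, minus_count."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        label, strand, plus, minus = fields[:4]
        records.append(OrientRecord(label, strand, int(plus), int(minus)))
    return records


def strand_summary(records: Iterable[OrientRecord]) -> Counter:
    """Count how many sequences were assigned each strand value."""
    return Counter(r.strand for r in records)
```

To use it on a real file, pass an open file handle, for example parse_tabbedout(open("orient.txt")), where orient.txt is whatever name was given to -tabbedout.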

For each input sequence, the orient command attempts to determine whether it is on the same strand as the database sequences (which are all assumed to be on the same strand) or is reverse-complemented. If the latter, the sequence is reverse-complemented so that the output sequences are all on the same strand.

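The command uses a simple word-counting strategy: it picks the strand that gives more word (k-mer) matches to the database. If too few words match, or the counts on the two strands are too close, the orientation is undetermined and the sequence is left out of the oriented output (it is written to the -notmatched file if that option is given). The -tabbedout file reports the results for each sequence.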
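The word-counting idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not usearch's implementation: the word length K and the two thresholds MIN_WORDS and MIN_RATIO are made-up values, the function names are hypothetical, and None is used here to mark an undetermined orientation.

```python
from typing import Iterable, Optional, Set, Tuple

K = 8            # word length (assumed value, for illustration only)
MIN_WORDS = 4    # assumed minimum matches required on the winning strand
MIN_RATIO = 2.0  # assumed factor by which the winner must beat the loser

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement A/C/G/T; other letters (e.g. N) are left unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def words(seq: str, k: int = K) -> Set[str]:
    """Return the set of distinct k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db_seqs: Iterable[str], k: int = K) -> Set[str]:
    """Collect every k-mer found in the database sequences."""
    index: Set[str] = set()
    for s in db_seqs:
        index |= words(s, k)
    return index


def orient(seq: str, index: Set[str], k: int = K) -> Tuple[Optional[str], int, int, str]:
    """Return (strand, plus_count, minus_count, oriented_seq).

    strand is "+" or "-", or None when the orientation is undetermined.
    """
    rc = reverse_complement(seq)
    plus = len(words(seq, k) & index)   # matches on the sequence as given
    minus = len(words(rc, k) & index)   # matches on its reverse complement
    hi, lo = max(plus, minus), min(plus, minus)
    if hi < MIN_WORDS or hi < MIN_RATIO * lo:
        return None, plus, minus, seq   # too few matches, or too close to call
    if plus > minus:
        return "+", plus, minus, seq
    return "-", plus, minus, rc
```

Using sets of distinct words keeps the sketch short. A real implementation would more likely use a compact hashed index and thresholds tuned on real data.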

You can use any 16S database; the choice is not critical. The safest option is a large database such as Greengenes or SILVA, which ensures good coverage. However, word counting works well down to fairly low identities, so a smaller database such as the RDP training set or the NCBI BLAST 16S reference should be fine.

Multithreading is supported.

Example

usearch -orient reads.fastq -db 16s.udb -fastqout reads_plus.fq