Making an OTU table
An OTU table is made by the otutab command. The query set, i.e. the FASTA or FASTQ file containing the reads, must have sample identifiers in the labels.
Why different ways to do it?
Usearch supports several ways of putting sample names into sequence labels. This provides some backwards compatibility with earlier versions and allows flexibility in the formatting of sample names, which were probably chosen without the software package in mind. For example, QIIME does not allow an underscore in the sample identifier, which is too restrictive in my opinion.
How to check that your sample names are formatted correctly
Use the fastx_get_sample_names command.
Sample identifier syntax
The sample name can be specified by putting sample=xxx; into the label. The semi-colon marks the end of the sample identifier, so semi-colons are not allowed in the identifier, but any other character may be used. If sample= is not found, the sample identifier is assumed to start at the beginning of the label and continue up to the first character which is not alphanumeric or an underscore, unless the sample_delim option is specified (see below). Put another way, any character which is not a letter, number or underscore marks the end of the sample identifier.
The following labels have sample identifier S01. FASTA labels start with > at
the beginning of the line, FASTQ labels start with @.
>S01.123
>S01.123;size=14;
@M00967:43:000000000-A3JHG:1:1101:18327:1699;sample=S01;
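In the first and second examples, the period (.) is the first character that is neither alphanumeric nor an underscore, so the .123 is not part of the sample identifier. In the third example, the identifier is taken from the sample=S01; annotation.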
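The default rule described above can be sketched in Python to check labels before running otutab. This is only an illustration of the rule as stated in the text, not the code USEARCH itself uses; the function name sample_from_label is hypothetical, and the handling of a sample= annotation that lacks a closing semi-colon is an assumption.

```python
import re


def sample_from_label(label: str) -> str:
    """Return the sample identifier implied by the default labelling rule.

    Hypothetical helper: it mirrors the rule described in the text,
    not USEARCH's internal implementation.
    """
    # Drop the leading '>' (FASTA) or '@' (FASTQ) marker if present.
    if label[:1] in (">", "@"):
        label = label[1:]

    # Rule 1: an explicit sample=xxx; annotation. The semi-colon ends the
    # identifier. Assumption: a missing final semi-colon is tolerated.
    match = re.search(r"sample=([^;]+)", label)
    if match:
        return match.group(1)

    # Rule 2: the identifier is the leading run of letters, digits and
    # underscores; any other character ends it.
    return re.match(r"[A-Za-z0-9_]*", label).group(0)


for example in (
    ">S01.123",
    ">S01.123;size=14;",
    "@M00967:43:000000000-A3JHG:1:1101:18327:1699;sample=S01;",
):
    print(example, "->", sample_from_label(example))
```

All three example labels from the text give S01 under this sketch.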
The -sample_delim option
This option specifies a string of one or more characters that marks the end of a sample identifier. If this option is used, the sample identifier must begin with the first character in the label and continue until the first match to the delimiter string. For example, if you have reads that were processed with QIIME, then read labels start with the sample identifier, which is followed by an underscore (_) and an integer read number. Input in this format can be processed like this:
usearch -otutab qiime_reads.fq -sample_delim _ -otutabout otutable.txt
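To see what the delimiter rule does to a given label, it can be mimicked in a few lines of Python. This is a sketch of the rule as described above, not USEARCH's implementation; the function name sample_with_delim is hypothetical, and returning the whole label when the delimiter never occurs is an assumption, since the text does not say what happens in that case.

```python
def sample_with_delim(label: str, delim: str) -> str:
    """Return the text from the start of the label up to the first delimiter.

    Hypothetical helper illustrating the -sample_delim rule.
    """
    # Drop the leading '>' (FASTA) or '@' (FASTQ) marker if present.
    if label[:1] in (">", "@"):
        label = label[1:]

    end = label.find(delim)
    # Assumption: if the delimiter is absent, keep the whole label.
    return label if end == -1 else label[:end]


# A QIIME-style label: sample identifier, underscore, read number.
print(sample_with_delim("@Soil.3_1042", "_"))  # Soil.3
```

The example shows why the option is useful: with the default rule the period would end the identifier at Soil, whereas the underscore delimiter keeps Soil.3 intact.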
How to get sample names into your labels
The simplest method is to use the fastx_relabel command or the -relabel option of fastq_mergepairs, fastq_filter or fastx_uniques. If you process one file at a time, you can do something like this:
usearch -fastx_uniques reads.fastq -relabel SampleName. -fastaout uniques.fa
Note the period following SampleName.
If -relabel @ is specified, the sample name is constructed from the FASTQ filename by truncating at the first underscore or period. With typical Illumina FASTQ filenames, this is the sample name.
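The filename rule can be previewed with a short Python sketch, which is handy for checking that a batch of files will yield the sample names you expect. The function name sample_from_filename is hypothetical, and stripping the directory part of the path before truncating is an assumption not stated in the text.

```python
import os
import re


def sample_from_filename(path: str) -> str:
    """Truncate a FASTQ filename at the first underscore or period.

    Hypothetical helper illustrating the -relabel @ naming rule.
    """
    # Assumption: only the filename is used, not the directory.
    name = os.path.basename(path)
    # Split once at the first '_' or '.' and keep what comes before it.
    return re.split(r"[_.]", name, maxsplit=1)[0]


print(sample_from_filename("SampleA_S1_L001_R1_001.fastq"))  # SampleA
```

Note that a sample name which itself contains an underscore or period would be cut short by this rule, so such files need explicit relabeling instead.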
Alternatively, you could write your own script to do this task.
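A minimal sketch of such a script in Python is given here. It assumes four-line FASTQ records (no wrapped sequence or quality lines) and writes labels of the form SampleName.1, SampleName.2 and so on, so that the period ends the sample identifier under the default rule. The function name relabel_fastq and the numbering scheme are illustrative choices, not part of USEARCH.

```python
from typing import Iterable, Iterator


def relabel_fastq(lines: Iterable[str], sample: str) -> Iterator[str]:
    """Yield FASTQ lines with each label replaced by '@<sample>.<n>'.

    Hypothetical helper; assumes exactly four lines per record.
    """
    position = 0      # line position within the current record (0 to 3)
    read_number = 0   # running count of records seen so far
    for raw in lines:
        line = raw.rstrip("\r\n")
        if position == 0:
            if not line:
                continue  # tolerate blank lines between records
            if not line.startswith("@"):
                raise ValueError(f"expected a FASTQ label, got: {line!r}")
            read_number += 1
            yield f"@{sample}.{read_number}"
        else:
            # Sequence, '+' separator and quality lines pass through unchanged.
            yield line
        position = (position + 1) % 4


records = ["@M1:1\n", "ACGT\n", "+\n", "IIII\n",
           "@M1:2\n", "GGCC\n", "+\n", "IIII\n"]
for out_line in relabel_fastq(records, "S01"):
    print(out_line)
```

To relabel a file, pass an open file object as lines and write each yielded line, followed by a newline, to the output file. Tracking the position within the record, rather than testing for a leading @, avoids mistaking a quality line that happens to start with @ for a label.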