Barcode sequences and OTUs
What is a barcode sequence?
A barcode is a full or partial gene sequence used for classification, diversity estimation and/or
phylogeny. The gene is usually chosen to be well conserved and present in a large group of organisms.
The canonical example of a barcode is 16S ribosomal RNA (16S rRNA),
which is used to define species in bacteria and archaea (prokaryotes).
16S rRNA is highly conserved, making it possible to design
so-called universal PCR primers which bind to a large majority of prokaryotes. For example,
V1-V9 primers bind to conserved sequence before hypervariable region V1 and after hypervariable
region V9, giving a PCR product which is close to a full-length 16S rRNA gene (approx. 1,500nt).
Shorter sub-regions of 16S rRNA such as V3-V5 have often been used in metagenomic next-generation
sequencing to match shorter read lengths. Some sub-regions (e.g. V3-V5, approx. 500nt) are
generally sufficient to identify species, while shorter barcodes
(e.g. V4, approx. 250nt) may be effective at resolving different genera but have trouble
distinguishing species.
Defining species by a barcode
Microbial species are often defined by specifying (1) the barcode sequence of a type strain
and (2) a sequence identity threshold, such that a strain with identity above this
threshold is considered to belong to the same species, and a strain with identity below
it to a different species. For example, bacterial species are
conventionally defined by 97% 16S rRNA identity. This is a practical approach for
organisms which do not reproduce sexually, where the usual definition of a
species as a potentially inter-breeding group cannot be applied. With RNA
viruses, RdRp with a 90% identity threshold is sometimes used as part of the official
definition of a taxonomic species, and is often used to define species-like clusters
in metagenomic sequencing.
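
The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the two barcode sequences have already been aligned to the same length, it ignores columns containing a gap (one of several possible conventions for computing identity), and it treats identity exactly equal to the threshold as "same species". The function names `percent_identity` and `same_species` are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are excluded from the count
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species(aligned_query: str, aligned_type_strain: str,
                 threshold: float = 97.0) -> bool:
    """Apply the type-strain rule: same species if identity reaches the threshold.

    The default of 97% is the conventional 16S rRNA value quoted in the text;
    using >= at the boundary is an assumption made for this sketch.
    """
    return percent_identity(aligned_query, aligned_type_strain) >= threshold
```

With a 16S rRNA barcode the default threshold of 97 would be used; for RdRp, a threshold of 90 could be passed instead.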
Operational Taxonomic Units
If species are defined by a barcode and sequence identity threshold, then the number
of barcode sequence clusters at this threshold provides an estimate of the number of species
or species-like groups present in the data. Such clusters are called
Operational Taxonomic Units (OTUs). If the identity threshold is chosen to approximate species, the
clusters may be called species-like OTUs (sOTUs). Species-like OTUs based
on viral RdRp are sometimes called viral OTUs (vOTUs).
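
To make the idea of counting clusters concrete, here is a toy sketch of one simple way to form OTUs: a greedy scheme in which each sequence joins the first existing cluster whose representative it matches at or above the threshold, and otherwise starts a new cluster. The greedy strategy, the function names `identity` and `greedy_otus`, and the restriction to equal-length sequences compared position by position are all assumptions made for illustration; real barcode data would need alignment, and the result of a greedy scheme depends on input order.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Toy percent identity for equal-length sequences, compared position by position."""
    if len(a) != len(b) or not a:
        raise ValueError("toy identity needs non-empty sequences of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_otus(seqs: List[str], threshold: float) -> List[List[str]]:
    """Group sequences into clusters; the first member of each cluster is its representative."""
    clusters: List[List[str]] = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)  # close enough to this representative
                break
        else:
            clusters.append([s])  # no match found: s founds a new cluster
    return clusters


if __name__ == "__main__":
    a = "A" * 20
    b = "A" * 19 + "C"  # 95% identical to a
    c = "C" * 20        # very different from a
    for t in (95.0, 97.0):
        print(f"threshold {t}%: {len(greedy_otus([a, b, c], t))} clusters")
```

The number of clusters returned at the chosen threshold is the OTU count; in this toy data, raising the threshold splits `a` and `b` into separate clusters.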
Sequence identity thresholds at higher ranks
Clustering at a lower identity can similarly provide an estimate of the
number of groups at higher taxonomic ranks. For example, clustering 16S rRNA at
95% identity corresponds roughly to genus and 90% identity roughly to family.
With RdRp, genus is around 75% and family is very roughly 50%.
This extrapolation to higher ranks breaks down at some point, and it
is not possible to get good approximations to phylum or class ranks
with any widely used barcode gene, including 16S, 18S, ITS, COI or RdRp,
using generic sequence clustering methods, i.e. methods which are not specifically
designed for phylogeny.
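
The rough thresholds quoted in this section can be collected into a small lookup, shown here as a sketch. The table name `RANK_THRESHOLDS` and the function `lowest_shared_rank` are hypothetical, the percentages are the approximate figures given in the text rather than precise cut-offs, and the function deliberately returns `None` below the family threshold, reflecting the point that the extrapolation does not extend reliably to higher ranks.

```python
from typing import Optional

# Approximate identity thresholds (percent) taken from the text, listed from
# the lowest rank upwards. These are rough conventions, not exact boundaries.
RANK_THRESHOLDS = {
    "16S rRNA": [("species", 97.0), ("genus", 95.0), ("family", 90.0)],
    "RdRp": [("species", 90.0), ("genus", 75.0), ("family", 50.0)],
}


def lowest_shared_rank(barcode: str, percent_id: float) -> Optional[str]:
    """Return the lowest rank two sequences plausibly share at this identity.

    Returns None when identity falls below the family threshold, because no
    higher rank is approximated well by an identity threshold.
    """
    for rank, threshold in RANK_THRESHOLDS[barcode]:
        if percent_id >= threshold:
            return rank
    return None
```

For example, `lowest_shared_rank("16S rRNA", 96.0)` gives `"genus"` under these assumed cut-offs, while an identity of 85% gives `None`.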