PalmDB

Barcode sequences and OTUs

What is a barcode sequence?

A barcode is a full or partial gene sequence used for classification, diversity estimation and/or phylogeny. The gene is usually chosen to be well conserved and present in a large group of organisms. The canonical example of a barcode is 16S ribosomal RNA (16S rRNA), which is used to define species in bacteria and archaea (prokaryotes). 16S rRNA is highly conserved, making it possible to design so-called universal PCR primers which bind in a large majority of prokaryotes. For example, V1-V9 primers bind to conserved sequence before hypervariable region V1 and after hypervariable region V9, giving a PCR product which is close to a full-length 16S rRNA gene (approx. 1,500 nt). Shorter sub-regions of 16S rRNA such as V3-V5 have often been used in metagenomic next-generation sequencing to match shorter read lengths. In some cases (e.g. V3-V5, approx. 500 nt), the barcode is generally sufficient to identify species, while shorter barcodes (e.g. V4, approx. 250 nt) may be effective at resolving different genera but have trouble distinguishing species.

Defining species by a barcode

Microbial species are often defined by specifying (1) the barcode sequence of a type strain and (2) a sequence identity threshold, such that a strain with identity above this threshold is considered to belong to the same species, and to a different species if the identity is lower. For example, bacterial species are conventionally defined by 97% 16S rRNA identity. This is a practical approach for organisms which do not reproduce sexually, for which the usual definition of a species as a potentially inter-breeding group cannot be applied. With RNA viruses, RNA-dependent RNA polymerase (RdRp) with a 90% identity threshold is sometimes used as part of the official definition of a taxonomic species, and is often used to define species-like clusters in metagenomic sequencing.
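
The threshold rule above can be illustrated with a minimal Python sketch. It assumes the two barcode sequences have already been aligned to the same length by some external tool, with "-" marking gaps. The function names (percent_identity, same_species), the choice to skip gapped columns, and the use of "at or above" for the threshold are illustrative assumptions; real tools differ in how they count gaps and terminal overhangs, so identities from different programs are not always comparable.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gapped columns are ignored. Other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species(type_strain: str, query: str, threshold: float = 97.0) -> bool:
    """True if the query meets the identity threshold against the type strain.

    The default of 97% follows the 16S rRNA convention; 90% would be
    the corresponding value for RdRp.
    """
    return percent_identity(type_strain, query) >= threshold


if __name__ == "__main__":
    # Toy 10-letter "barcodes" differing at one position (9 of 10 match).
    type_strain = "ACGTACGTAC"
    query = "ACGTACGTAT"
    print(percent_identity(type_strain, query))
    print(same_species(type_strain, query))                  # 16S-style threshold
    print(same_species(type_strain, query, threshold=90.0))  # RdRp-style threshold
```

Real barcodes are hundreds of letters long, so a single difference changes the identity far less than in this toy example.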

Operational Taxonomic Units

If species are defined by a barcode and sequence identity threshold, then the number of barcode sequence clusters at this threshold provides an estimate of the number of species or species-like groups present in the data. Such clusters are called Operational Taxonomic Units (OTUs). If the identity threshold is chosen to approximate species, the clusters may be called species-like OTUs (sOTUs). Species-like OTUs based on viral RdRp are sometimes called viral OTUs (vOTUs).
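
As a rough illustration of how OTUs can be counted, the sketch below uses a simple greedy centroid scheme: each sequence joins the first existing cluster whose centroid it matches at or above the threshold, and otherwise founds a new cluster. This is one of several possible clustering strategies and is not claimed to be the method used by any particular tool. The names greedy_otus and ungapped_identity are hypothetical, and the toy identity function assumes equal-length, unaligned sequences purely to keep the example self-contained.

```python
from typing import Callable, List


def ungapped_identity(a: str, b: str) -> float:
    """Toy identity: percent of matching positions in two equal-length sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(
    seqs: List[str],
    threshold: float = 97.0,
    identity: Callable[[str, str], float] = ungapped_identity,
) -> List[List[str]]:
    """Greedy centroid clustering; returns one list of members per OTU.

    The first member of each list is the centroid. The result depends on
    input order, which is a known property of greedy schemes.
    """
    clusters: List[List[str]] = []
    for seq in seqs:
        for members in clusters:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)  # joins the first matching centroid
                break
        else:
            clusters.append([seq])  # no match: seq founds a new OTU
    return clusters


if __name__ == "__main__":
    barcodes = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
    # The number of clusters is the OTU count at the chosen threshold.
    print(len(greedy_otus(barcodes, threshold=97.0)))
    print(len(greedy_otus(barcodes, threshold=90.0)))
```

Passing a different identity callable (for example, one based on a proper pairwise alignment) changes the similarity measure without touching the clustering loop.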

Sequence identity thresholds at higher ranks

Clustering at a lower identity threshold can similarly provide an estimate of the number of groups at higher taxonomic ranks. For example, clustering 16S rRNA at 95% identity corresponds roughly to genus and 90% identity roughly to family. With RdRp, genus is around 75% and family is very roughly 50%. This extrapolation to higher ranks breaks down at some point, and it is not possible to get good approximations to phylum or class ranks with any widely used barcode gene, including 16S, 18S, ITS, COI or RdRp, using generic sequence clustering methods, i.e., methods which are not specifically designed for phylogeny.
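
The approximate thresholds quoted on this page can be collected into a small lookup table, sketched below in Python. Given a barcode and a pairwise identity, the hypothetical function lowest_shared_rank returns the lowest rank that the two sequences would be expected to share. The table name, function name and "at or above" comparison are illustrative assumptions, and the numbers are only the rough values given in the text, not precise boundaries. No entries are included above family, since the extrapolation is not reliable there.

```python
from typing import Optional

# Approximate identity thresholds (percent) taken from the text above,
# ordered from the lowest rank upward. These are rough guides only.
RANK_THRESHOLDS = {
    "16S": [("species", 97.0), ("genus", 95.0), ("family", 90.0)],
    "RdRp": [("species", 90.0), ("genus", 75.0), ("family", 50.0)],
}


def lowest_shared_rank(barcode: str, identity: float) -> Optional[str]:
    """Return the lowest rank whose threshold the identity meets, else None.

    None means the pair falls below the family threshold, where this kind
    of threshold-based estimate is not considered reliable.
    """
    if barcode not in RANK_THRESHOLDS:
        raise KeyError(f"no thresholds recorded for barcode {barcode!r}")
    for rank, threshold in RANK_THRESHOLDS[barcode]:
        if identity >= threshold:
            return rank
    return None


if __name__ == "__main__":
    print(lowest_shared_rank("16S", 96.0))   # between the genus and species thresholds
    print(lowest_shared_rank("RdRp", 80.0))  # between the genus and species thresholds
    print(lowest_shared_rank("16S", 85.0))   # below the family threshold
```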