Finding species in SSU reads

 
See also: OTU clustering, SSU metagenomics, SSU reference databases


Ideally, each species would have a unique sequence in the sequenced region, allowing both known and novel species in the sample to be identified. This is approximately true, but several complications make data analysis challenging. They are summarized below.

Issue Description
Non-unique genes: There are many known cases where different species have identical 16S sequences, for example Brucella abortus, B. melitensis and B. suis. See also the Sacchi et al. anthrax paper cited below.
 
Gene duplications: A given species may have several copies of the 16S operon, and some of these copies may have non-identical sequences. Some copies may have <97% identity to each other, especially when only a short region of the gene is sequenced. Divergent copies are difficult to distinguish from separate species when they come from a novel species or from a previously unknown duplication in a known species. See for example Sacchi et al. (2007) for a detailed analysis of 16S copies in species and strains of anthrax (Bacillus anthracis).
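The effect of region length on identity can be illustrated with a small calculation. The sketch below is not USEARCH code: it assumes a simplified identity definition (matching columns divided by alignment length), a hypothetical 1500-nt gene, and a hypothetical 250-nt sequenced region that happens to contain all 8 differences between two copies.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Simplified definition (an assumption, not necessarily the one USEARCH
    uses): matching columns divided by the number of alignment columns.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


# Hypothetical toy data: two copies of a 1500-nt gene that differ at
# 8 positions, all inside a 250-nt variable region (positions 500-749).
copy_a = "A" * 1500
diff_positions = [510, 540, 570, 600, 630, 660, 690, 720]
bases = list(copy_a)
for pos in diff_positions:
    bases[pos] = "C"
copy_b = "".join(bases)

region = slice(500, 750)
full_identity = percent_identity(copy_a, copy_b)
region_identity = percent_identity(copy_a[region], copy_b[region])

print(f"full length: {full_identity:.2f}%")
print(f"250-nt region: {region_identity:.2f}%")
```

With these toy numbers the two copies are above 97% identity over the full gene but below 97% over the short region, so a 97% threshold applied to the region would separate two copies from the same genome.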
 
Sequencing error: Depending on the quality filtering used, the error rate of 454 and Illumina reads is typically on the order of 1%. In both cases, the most common error is an incorrect base call. With pyrosequencing (454), the number of bases may also be miscalled, most often in runs of identical bases (homopolymers, e.g. AAAA... or TTTT...), where the reported length of the run is incorrect. This type of error is very rare with Illumina.
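To make the homopolymer problem concrete, the following sketch locates runs of identical bases and shows that two sequences differing only in run length become identical once runs are collapsed. The function names and the toy sequences are hypothetical, and collapsing runs is used here only as an illustration, not as a description of how USEARCH treats 454 reads.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 3):
    """Return (start, base, length) for each run of at least min_len identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def collapse_homopolymers(seq: str) -> str:
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Hypothetical toy example: the read has one T too few and one A too many.
true_seq = "ACGTTTTTGCAAAAC"
read_454 = "ACGTTTTGCAAAAAC"

print(homopolymer_runs(true_seq))
print(homopolymer_runs(read_454))
print(collapse_homopolymers(true_seq) == collapse_homopolymers(read_454))
```

The two sequences have the same length, so a position-by-position comparison would report several substitutions even though the underlying errors are two miscalled run lengths.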
 
PCR artifacts: With 454 and Illumina technologies, PCR amplicons are sequenced. Copying errors occur during amplification, and chimeric amplicons often form. While chimeric reads typically account for only a small percentage of the total reads, they may account for a large fraction of the unique sequences. This means that amplicon sequences are often not correct biological sequences, and a correct set of gene sequences would be hard to recover even if sequencing error could be corrected by denoising. Chimeras tend to produce spurious OTUs that are particularly difficult to detect.
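The difference between the fraction of reads and the fraction of unique sequences can be seen in a toy calculation. All numbers below are invented for illustration: five hypothetical real sequences with abundances summing to 950 reads, plus 50 chimeric reads that are each a distinct sequence. Labels stand in for actual sequences.

```python
from collections import Counter

# Hypothetical toy data, not measurements: 5 real sequences, 50 singleton chimeras.
real_abundances = [400, 250, 150, 100, 50]
reads = []
for i, n in enumerate(real_abundances):
    reads.extend([f"real_{i}"] * n)
chimera_ids = [f"chimera_{i}" for i in range(50)]
reads.extend(chimera_ids)
chimeric = set(chimera_ids)

# Dereplicate: count how many reads share each distinct sequence.
abundance = Counter(reads)
total_reads = sum(abundance.values())
total_uniques = len(abundance)

chimeric_reads = sum(n for seq, n in abundance.items() if seq in chimeric)
chimeric_uniques = sum(1 for seq in abundance if seq in chimeric)

read_fraction = chimeric_reads / total_reads
unique_fraction = chimeric_uniques / total_uniques

print(f"chimeric reads: {read_fraction:.1%} of {total_reads} reads")
print(f"chimeric uniques: {unique_fraction:.1%} of {total_uniques} unique sequences")
```

In this toy data set chimeras are 5% of the reads but about 91% of the unique sequences, because real sequences are seen many times while each chimera tends to be seen rarely.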
 
Primer mismatches: "Universal" primers typically match only around 80 to 90% of known species. Genes having two or more mismatches with the primer are usually not amplified in detectable amounts.
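The two-mismatch rule of thumb above can be sketched as a simple check of a primer against a candidate binding site. This is a minimal illustration, assuming the site is already located and has the same length as the primer. The primer and sites are hypothetical, degenerate primer letters are expanded with the standard IUPAC nucleotide codes, and the position of a mismatch within the primer is ignored.

```python
# Standard IUPAC nucleotide codes: each primer letter maps to the bases it matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer: str, site: str) -> int:
    """Count positions where the site base is not allowed by the primer letter."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(
        1 for p, s in zip(primer.upper(), site.upper()) if s not in IUPAC[p]
    )


def likely_amplified(primer: str, site: str, max_mismatches: int = 1) -> bool:
    """Rule of thumb from the text: two or more mismatches usually means no detectable product."""
    return primer_mismatches(primer, site) <= max_mismatches


# Hypothetical primer and binding sites, for illustration only.
primer = "ACGYTGMCAGT"
sites = {
    "perfect": "ACGCTGACAGT",
    "one_mismatch": "ACGCTGGCAGT",
    "two_mismatches": "ACGATGGCAGT",
}

for name, site in sites.items():
    print(name, primer_mismatches(primer, site), likely_amplified(primer, site))
```

The `max_mismatches` default of 1 encodes the threshold stated in the text and can be changed to explore stricter or looser assumptions.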
 
Incomplete reference databases: Several SSU sequence databases are available, containing tens of thousands of 16S and other SSU sequences. However, despite the large number of known sequences, it is widely believed that the databases represent only a fraction, probably only a small fraction, of existing species. Novel species probably account for many of the reads that do not have close matches (97% or higher) in the databases. The observation of large numbers of low-abundance novel 16S sequences in metagenomics has led to the rare biosphere hypothesis.