See also
OTU
clustering
SSU
metagenomics
SSU reference databases
Ideally, each species would have a unique sequence in the sequenced region,
allowing identification of both known and novel species in the sample. This is
approximately true, but several complications make data analysis
challenging, as summarized below.
| Issue | Description |
| --- | --- |
| Non-unique genes | There are many known examples where different species have identical 16S sequences, for example Brucella abortus, B. melitensis and B. suis. See also the Sacchi et al. anthrax paper cited below. |
| Gene duplications | A given species may have several copies of the 16S operon, and some of these may have non-identical sequences. Some of these copies may have <97% identity, especially when only a short region of the gene is sequenced. Divergent copies are difficult to distinguish from different species when they come from novel species or from unknown duplications in known species. See for example Sacchi et al. (2007) for a detailed analysis of 16S copies in species and strains of anthrax (Bacillus anthracis). |
| Sequencing error | Depending on the quality filtering used, the error rate of 454 and Illumina reads is typically of the order of 1%. In both cases, the most common error is an incorrect base call. With pyrosequencing (454), the number of bases may also be called incorrectly, most often in runs of identical bases (homopolymers, e.g. AAAA... or TTTT...) where the reported length of the run is wrong. This type of error is very rare with Illumina. |
| PCR artifacts | With 454 and Illumina technologies, PCR amplicons are sequenced. Copying errors occur during amplification, and chimeric amplicons often form. While chimeric reads typically account for only a small percentage of the total reads, they may account for a large fraction of the unique sequences. This means that amplicon sequences are often not correct biological sequences, and a correct set of gene sequences would be hard to recover even if sequencing error could be corrected by denoising. Chimeras tend to produce spurious OTUs, which are particularly difficult to detect. |
| Primer mismatches | "Universal" primers typically match only around 80 to 90% of known species. Genes having two or more mismatches with the primer are usually not amplified in detectable amounts. |
| Incomplete reference databases | Several SSU sequence databases are available containing tens of thousands of 16S and other SSU sequences. However, despite the large number of known sequences, it is widely believed that the databases represent only a fraction, probably only a small fraction, of existing species, and novel species probably account for many reads that do not have close matches (97% or higher) to the databases. The observation of large numbers of low-abundance novel 16S sequences in metagenomics has led to the rare biosphere hypothesis. |
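Three of the issues in the table (the 97% identity threshold, homopolymer length errors, and primer mismatches) can be made concrete with a few lines of Python. This is a minimal sketch, not the method of any particular tool. The function names and the `OTU_IDENTITY_THRESHOLD` constant are hypothetical, and the code assumes sequences that are already aligned to equal length and a primer with no degenerate (IUPAC) bases.

```python
from itertools import groupby

# Assumption: 97% is used here only because it is the threshold quoted in the table.
OTU_IDENTITY_THRESHOLD = 97.0


def percent_identity(a: str, b: str) -> float:
    """Percent of matching columns between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def collapse_homopolymers(seq: str) -> str:
    """Replace every run of identical bases with a single base (AAAA -> A)."""
    return "".join(base for base, _run in groupby(seq))


def primer_mismatches(primer: str, site: str) -> int:
    """Count mismatched positions between a primer and an equal-length binding site."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(p != s for p, s in zip(primer, site))


if __name__ == "__main__":
    # Two copies differing at 4 of 100 positions fall below the 97% threshold.
    copy_1 = "A" * 100
    copy_2 = "C" * 4 + "A" * 96
    identity = percent_identity(copy_1, copy_2)
    print(identity, identity >= OTU_IDENTITY_THRESHOLD)

    # A 454-style error: the run TTTT is read as TTT. After collapsing
    # homopolymers, the reference and the read become identical.
    reference = "ACGTTTTGCA"
    read = "ACGTTTGCA"
    print(collapse_homopolymers(reference) == collapse_homopolymers(read))

    # Two mismatches: per the table, such a gene is usually not amplified
    # in detectable amounts.
    print(primer_mismatches("GTGCCAGC", "GTACCAGT"))
```

Collapsing homopolymers discards real biological differences in run length, so it is shown here only to illustrate why 454 length errors inflate the number of unique sequences. It is not offered as a recommended correction step.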