Abundance estimation
There are two important sources of bias: PCR amplification bias and gene density bias. There isn't much we can do about PCR bias because, as far as I know, it isn't possible to predict PCR bias from primary sequence.

With short reads that only partially cover amplicons, it is important to trim the reads intelligently before making an abundance estimate. See global trimming.

Use the -sizeout option to create size annotations.

Abundance estimation by dereplication

These are questionable assumptions, but even if there are sequencing and/or amplification errors, using dereplication is often a good strategy.

With errors due to sequencing and amplification (imperfect copying during PCR), some sequences fail to match, creating small spurious clusters (mostly singletons) and causing the abundance of the true biological sequence to be underestimated. However, this may not be very harmful in practice, for two reasons.

1. The underestimate due to errors applies consistently to all clusters, and it is usually only ratios between abundances that are significant, not absolute values. The ratios between abundances of larger clusters should be reasonably stable against errors, and it is usually these abundances that are the most important.

2. The small, spurious clusters due to errors may be ignored or merged into the correct cluster in a post-processing step.

In the case of uchime_denovo, the most important aspect of abundance estimation is that potential parents have higher abundances than their chimeras. Amplicons that are sufficiently abundant to be parents will usually also have enough error-free reads to give a large cluster.

Abundance estimation by clustering

In the case of UCLUST, sequences with errors will tend to be merged into their correct cluster.

Example using dereplication

usearch -derep_prefix reads.fasta -output uniques.fasta -sizeout

Example using clustering

usearch -cluster_fast reads.fasta -id 0.99 -consout cons.fasta -sizeout
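
Once size annotations are present, abundance ratios can be compared and small clusters set aside in a post-processing step. The following is a minimal Python sketch of that idea, not part of USEARCH. It assumes labels carry a size annotation of the form ";size=N;"; the function names read_sizes and relative_abundances, and the min_size threshold of 2 (which drops singletons), are hypothetical choices made for illustration.

```python
import re

# Assumed annotation format: a ";size=N;" field somewhere in the FASTA label.
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")


def read_sizes(fasta_lines):
    """Return {label: size} for every header line that has a size annotation."""
    sizes = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        label = line[1:].strip()
        match = SIZE_RE.search(label)
        if match:
            sizes[label] = int(match.group(1))
    return sizes


def relative_abundances(sizes, min_size=2):
    """Drop clusters smaller than min_size (hypothetical threshold),
    then normalize the remaining sizes so they sum to 1."""
    kept = {label: n for label, n in sizes.items() if n >= min_size}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in kept.items()}


# Small in-memory example; real input would be the lines of a FASTA file.
example = [">A;size=90;", "ACGT", ">B;size=9;", "ACGA", ">C;size=1;", "ACGG"]
for label, fraction in relative_abundances(read_sizes(example)).items():
    print(label, round(fraction, 3))
```

Because only ratios between abundances are usually of interest, normalizing after discarding the smallest clusters leaves the ratios among the larger clusters unchanged.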