OTU benchmark (MOCK)

  

MOCK results
The key result is that the number of OTUs produced by the USEARCH pipeline (otupipe) is close to the expected number based on the known reference sequences, as shown in the Ref seqs and OTUs otupipe columns of the following table. It is also substantially lower than the number of clusters produced by a naive use of UCLUST (the clustering algorithm in USEARCH) followed by UCHIME.
 

Set       Ref seqs  OTUs otupipe  OTUs uclust+uchime  Ref seqs (97%)  Reads   Exact
Even1     87        74            474                 62              53,771  65
Even2     87        77            405                 62              45,128  73
Even3     87        78            443                 62              54,153  72
Uneven1   87        70            218                 62              44,926  69
Uneven2   87        57            274                 62              44,176  54
Uneven3   87        68            204                 62              50,931  71
Titanium  89        76            272                 69              25,438  37
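
The comparison in the table can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the benchmark scripts: the names RESULTS and summarize are hypothetical, and the numbers are copied from the table. For each set it computes the otupipe OTU count as a fraction of the reference count, and how many times more clusters the naive uclust+uchime method produces than otupipe.

```python
# Values copied from the table: (Ref seqs, OTUs otupipe, OTUs uclust+uchime).
# RESULTS and summarize are illustrative names, not part of otupipe.
RESULTS = {
    "Even1": (87, 74, 474),
    "Even2": (87, 77, 405),
    "Even3": (87, 78, 443),
    "Uneven1": (87, 70, 218),
    "Uneven2": (87, 57, 274),
    "Uneven3": (87, 68, 204),
    "Titanium": (89, 76, 272),
}


def summarize(results):
    """Return {set: (otupipe / ref seqs, naive / otupipe)} for each set."""
    summary = {}
    for name, (ref, otupipe, naive) in results.items():
        summary[name] = (otupipe / ref, naive / otupipe)
    return summary


if __name__ == "__main__":
    for name, (frac, fold) in summarize(RESULTS).items():
        print(f"{name:<9} otupipe/ref = {frac:.2f}   naive/otupipe = {fold:.1f}x")
```

On these numbers the naive method yields between 3.0 times (Uneven3) and about 6.4 times (Even1) as many clusters as otupipe.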

 
The sets are from Quince et al. (2011); the data is available here: http://userweb.eng.gla.ac.uk/christopher.quince/Data/AmpliconNoise.html

Ref seqs is the number of known reference sequences. The Titanium reference set should be complete. The reference sets for the Even and Uneven sets are probably incomplete due to missing paralogs.

OTUs otupipe is the number of OTUs found by otupipe.

Ref seqs (97%) is the number of reference sequences after clustering with UCLUST at 97% identity. This is a way to estimate a lower bound on the number of OTUs that should be found. Since an OTU pipeline cannot group paralogs from a single species if the paralogs are less than 97% identical, we might expect more OTUs than species even if the algorithm is performing perfectly.
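
To illustrate why clustering the reference sequences at 97% gives a count no larger than the number of reference sequences, here is a toy sketch of greedy centroid clustering. It is not UCLUST: the identity function below is a hypothetical stand-in that only compares equal-length strings position by position, whereas real tools compute identity from alignments, and the names identity, greedy_cluster and mutate are made up for this example.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions in two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("toy identity needs non-empty, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Greedy centroid clustering: a sequence joins the first centroid it
    matches at >= threshold, otherwise it becomes a new centroid.
    Returns the list of centroids (one per cluster)."""
    centroids = []
    for s in seqs:
        if not any(identity(s, c) >= threshold for c in centroids):
            centroids.append(s)
    return centroids


def mutate(seq, n):
    """Substitute the first n bases with a different base (illustration only)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] for b in seq[:n]) + seq[n:]


if __name__ == "__main__":
    base = "ACGT" * 25        # 100 nt toy reference sequence
    near = mutate(base, 2)    # 98% identical to base: merged with it
    far = mutate(base, 5)     # 95% identical to base: stays separate
    print(len(greedy_cluster([base, near, far])))
```

In this toy case three reference sequences collapse to two clusters: the 98% variant merges with its centroid, while the 95% variant, like a diverged paralog, remains a separate cluster.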

OTUs uclust+uchime is the number of clusters found by a naive method, for comparison. I ran a standard UCLUST clustering at 97% followed by UCHIME in reference database mode using the known reference sequences.

Reads is the number of reads.

Exact is the number of reference sequences that were recovered exactly by the pipeline.
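
A count like Exact could be computed along the following lines. This is a hedged sketch rather than the script used for the benchmark: it assumes that "recovered exactly" means the OTU sequence is string-identical to the full reference sequence, which may differ from the criterion actually applied, and read_fasta and count_exact are hypothetical helper names.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {label: sequence}, with sequences uppercased."""
    records = {}
    label = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]
            records[label] = []
        elif label is not None:
            records[label].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def count_exact(ref, otus):
    """Number of reference sequences identical to at least one OTU sequence.
    Assumption: 'exact' means full-length string identity."""
    otu_seqs = set(otus.values())
    return sum(1 for seq in ref.values() if seq in otu_seqs)


if __name__ == "__main__":
    ref = read_fasta([">ref1", "ACGT", "ACGT", ">ref2", "GGGG"])
    otus = read_fasta([">otu1", "acgtacgt", ">otu2", "TTTT"])
    print(count_exact(ref, otus))
```

With real files, the line lists would come from open file handles, for example read_fasta(open(path)) for a path of your choosing.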

Download
Data and scripts for reproducing my results can be downloaded here: otupipe_mock_bench.tar.gz

References
Quince, C., Lanzen, A., Davenport, R.J. and Turnbaugh, P.J. (2011) Removing noise from pyrosequenced amplicons, BMC Bioinformatics, 12, 38.