Protein clustering benchmark (ORTHO)

USEARCH performance home page
 
Results
See here for ORTHO results.

Methods

The ORTHO benchmark is based on proteins from OrthoDB v4 (Waterhouse et al., 2011). A subset containing 346,912 sequences was used so that clustering would be tractable for CD-HIT on a commodity computer with 1 GB of RAM. Much larger datasets can be clustered by USEARCH on this class of machine, but then there would be no other program to compare against. The subset was constructed by concatenating the following files, containing proteomes for eleven mammalian species:
 

   Bos_taurus.Btau_4.0.58
   Canis_familiaris.BROADD2.58
   Equus_caballus.EquCab2.58
   Felis_catus.CAT.58
   Gorilla_gorilla.gorGor3.58
   Homo_sapiens.GRCh37.58
   Mus_musculus.NCBIM37.58
   Pan_troglodytes.CHIMP2.1.58
   Pongo_pygmaeus.PPYG2.58
   Rattus_norvegicus.RGSC3.4.58
   Vicugna_pacos.vicPac1.58
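The subset construction amounts to a straightforward concatenation of the proteome files. A minimal sketch in Python (the function name and file paths are illustrative; these are not the scripts actually used for the benchmark):

```python
def concat_fasta(paths, out_path):
    """Concatenate FASTA files, in the given order, into one output file."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                out.write(f.read())

# Hypothetical usage, assuming the eleven proteome files are listed in order:
# concat_fasta(["Bos_taurus.Btau_4.0.58.fa", ...], "ortho.fa")
```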
 

Labels were truncated at the first white space in order to reduce the file size, leaving numeric identifiers. This was done because the memory required by CD-HIT scales with the size of the input data plus the size of the output data. By contrast, the memory required by USEARCH scales with the size of the output data only, and so can be substantially less for large datasets with high redundancy. The result is a 173 MB FASTA file. Sequences were sorted by decreasing length prior to clustering by USEARCH. The sort took ~10 s and ~200 MB RAM.
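The two preprocessing steps described above (truncating labels at the first white space and sorting by decreasing length) can be sketched as follows. `read_fasta` and `truncate_and_sort` are illustrative helper names, not the actual tools used for the benchmark:

```python
def read_fasta(path):
    """Yield (label, sequence) pairs from a FASTA file."""
    label, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if label is not None:
                    yield label, "".join(seq)
                label, seq = line[1:], []
            elif line:
                seq.append(line)
    if label is not None:
        yield label, "".join(seq)

def truncate_and_sort(in_path, out_path):
    """Truncate each label at the first white space and write records
    sorted by decreasing sequence length."""
    records = [(label.split()[0], seq) for label, seq in read_fasta(in_path)]
    records.sort(key=lambda rec: len(rec[1]), reverse=True)
    with open(out_path, "w") as out:
        for label, seq in records:
            out.write(">%s\n%s\n" % (label, seq))
```

Note that the truncation here keeps only the first white-space-delimited token of each label, matching the description above; sorting in memory like this assumes the dataset fits in RAM, which holds for a 173 MB file on the benchmark machine.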
 

References
Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV. OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res., Jan 2011.