PalmDB

RdRp barcodes

Defining a barcode segment

Compared to other widely used barcode genes such as 16S rRNA, viral RdRp is more difficult to work with because the boundaries of the RdRp protein-coding sequence are often at cleavage sites rather than start/stop codons. These sites generally cannot be identified by sequence analysis, especially when sequence identity to well-annotated genomes is low. It is therefore usually not possible to reliably identify full-length RdRp sequences comprising all of the RdRp protein sequence and nothing else. The usual solution to this problem is to implement a trimming procedure designed to identify a subsequence of RdRp which is comparable between diverged species.

Trimming RdRp to a globally alignable segment

For a segment to be comparable between diverged species, it must be approximately globally alignable so that it can be considered to cover the "same" part of the gene and enable apples-to-apples comparisons. This requirement is rather vague, and how strictly it can be met depends on how diverged the species are. In practice, trimming is widely used but rarely described in detail. For example, the InterPro/Pfam hidden Markov models (HMMs) for viral RdRp vary widely in length and cover different sets of subdomains in different groups. Wolf2018 made a multiple alignment across most then-known RNA viruses by trimming RdRp sequences to a segment roughly corresponding to our palmcore, though the details are not fully described.

Palmprint and palmcore barcodes

The palmprint and palmcore segments have well-defined boundaries which can be identified by sequence analysis in highly diverged genes, enabling RdRp to be trimmed consistently. These sequences are thus well suited for use as barcodes.

RdRp barcode for species-like OTUs

The identities of the palmprint and palmcore segments correlate well with the identity of the full-length RdRp gene, and both segments are thus well suited for species definition and clustering into sOTUs. In metagenomic data analysis, palmprints have the advantage of being shorter (~100 aa, vs. ~400 aa for palmcore and ~500 to 1,200 aa for the full-length gene), so an intact palmprint is more likely to be found in fragmented assembly contigs. In contrast to the much slower-evolving 16S rRNA gene, where short barcodes such as V4 cannot reliably resolve species, the fast evolution of RdRp ensures that palmprints can easily resolve species, and there is therefore no obvious disadvantage to palmprints in this context.
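As a rough illustration of how barcode identity can drive clustering into species-like OTUs, the following Python sketch assigns each sequence to the first cluster centroid it matches at or above an identity threshold. Everything specific here is an assumption made for the example: the function names identity and greedy_cluster, the 0.9 threshold, the toy sequences, and the simplification that the barcodes are already aligned to equal length with "-" for gaps. It is not the PalmDB clustering procedure.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy centroid clustering: a sequence joins the first centroid with
    identity >= threshold, otherwise it becomes a new centroid."""
    clusters: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = [name]  # no centroid matched
    return clusters


# Toy, made-up aligned barcodes (hypothetical data, not real palmprints).
toy = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",  # 9 of 10 residues match "a"
    "c": "MRSGWLPEQR",  # 3 of 10 residues match "a"
}
otus = greedy_cluster(toy, threshold=0.9)
```

With these toy inputs, "b" joins the cluster founded by "a" and "c" starts its own cluster. Because the result of greedy clustering depends on input order, a real analysis would normally fix that order deliberately, for example by sorting on sequence length or abundance.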

RdRp barcode for phylogeny

Building a high-quality tree requires a global multiple sequence alignment (MSA) for use in maximum-likelihood (ML) phylogeny estimation. Longer sequences are preferred because they contain more information, provided that homology is maintained over the entire segment. For RdRp of highly diverged viruses, this requirement implies that the sequence must be restricted to the palm domain to avoid the problem of different (i.e., non-homologous) RdRp domain content in different virus families. Here, palmcores represent a good compromise between maximizing length and minimizing the risk of over-extending into different domains.
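To make the idea of restricting an alignment to a homologous segment concrete, the sketch below slices an MSA to a fixed column window and drops rows that are mostly gaps in that window before the alignment is passed to tree building. The function name trim_alignment, the column coordinates, the max_gap_fraction cutoff and the toy alignment are all hypothetical. How the segment boundaries are actually located is not covered here; the sketch assumes they are already known as alignment columns.

```python
from typing import Dict


def trim_alignment(
    msa: Dict[str, str],
    start: int,
    end: int,
    max_gap_fraction: float = 0.5,
) -> Dict[str, str]:
    """Keep alignment columns [start, end) and drop rows whose gap
    fraction inside that window exceeds max_gap_fraction."""
    lengths = {len(row) for row in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all rows of the alignment must have the same length")
    width = lengths.pop()
    if not 0 <= start < end <= width:
        raise ValueError("window must satisfy 0 <= start < end <= alignment width")

    trimmed: Dict[str, str] = {}
    for name, row in msa.items():
        segment = row[start:end]
        gap_fraction = segment.count("-") / len(segment)
        if gap_fraction <= max_gap_fraction:
            trimmed[name] = segment
    return trimmed


# Toy alignment: "X" and "Z" columns stand in for flanking, non-comparable regions.
toy_msa = {
    "v1": "XXMKTAYIAKZZ",
    "v2": "XXMRTAYLAKZZ",
    "v3": "XX--T-----ZZ",  # fragment that barely covers the window
}
core = trim_alignment(toy_msa, start=2, end=10)
```

In this toy case, "v1" and "v2" are kept as eight-column segments and the fragmentary "v3" is dropped. Filtering out mostly-gap rows is a design choice for the example, made because such rows add little signal to an ML tree.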