RdRp barcodes
Defining a barcode segment
Compared to other widely-used barcode genes such as 16S rRNA,
viral RdRp is more difficult to work with because the
boundaries of the RdRp protein coding sequence are often at cleavage sites rather than
start/stop codons, and these sites generally cannot be identified by sequence
analysis, especially when sequence identity to well-annotated
genomes is low. Therefore, it is usually not possible to reliably identify full-length RdRp
sequences comprising all of the RdRp protein sequence and nothing else. The usual solution
to this problem is to implement a trimming procedure designed to identify a
subsequence of RdRp which is comparable between diverged species.
Trimming RdRp to a globally alignable segment
For a segment to be comparable between diverged species, it must be approximately
globally alignable so that it can be considered to cover the "same" part of the gene
and enable apples-to-apples comparisons.
This requirement is of course rather vague, and it depends on how diverged the
species might be. In practice, trimming is widely used, but is rarely described in detail.
For example, the Interpro/PFAM hidden Markov models (HMMs) for viral RdRp have
large variations in length and cover different sets of subdomains in different groups.
Wolf2018 made a multiple alignment across most then-known RNA viruses by trimming
RdRp sequences to a segment roughly corresponding to our palmcore,
though the details are not fully described.
Palmprint and palmcore barcodes
The palmprint and palmcore segments have well-defined boundaries which
can be identified by sequence analysis in highly diverged genes, enabling RdRp
to be trimmed consistently. These sequences are thus well suited for use as barcodes.
RdRp barcode for species-like OTUs
The identities of the palmprint and palmcore segments correlate well with the identity
of the full-length RdRp gene, and both segments are thus well-suited for species
definition and clustering into sOTUs. In metagenomic data analysis, palmprints have
the advantage of being shorter (~100 aa, vs. ~400 aa for palmcore and ~500 to 1,200 aa for the
full-length gene), and it is therefore more likely that an intact
palmprint will be found in fragmented assembly contigs. In contrast to the much
slower-evolving 16S rRNA gene, where short barcodes such as V4 cannot reliably
resolve species, the fast evolution of RdRp ensures that palmprints can easily
resolve species, and there is therefore no obvious disadvantage to palmprints
in this context.
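
As a rough illustration of identity-based clustering into sOTUs, the sketch below computes pairwise identity from a simple global alignment and assigns each sequence to the first centroid it matches. It is a minimal sketch under stated assumptions, not the procedure used for palmprints: the function names, the match/mismatch/gap scores, the greedy centroid strategy, and the `threshold` parameter are all hypothetical, and a real analysis would typically use an amino acid substitution matrix and a dedicated clustering tool.

```python
def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> float:
    """Fraction of identical columns in a Needleman-Wunsch global alignment.

    The scoring values are illustrative assumptions, not tuned for proteins.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting alignment columns and identical pairs.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        else:
            diag = None
        if diag is not None and score[i][j] == diag:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns if columns else 0.0


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy centroid clustering: a sequence joins the first centroid whose
    identity to it is >= threshold (hypothetical cutoff), else founds a new cluster."""
    clusters: dict[str, list[str]] = {}
    for name, seq in seqs.items():
        for centroid in clusters:
            if global_identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Toy sequences, far shorter than real palmprints, for illustration only.
toy = {"a": "ACDEFGHIKL", "b": "ACDEFGHIKV", "c": "WWWWYYYYPP"}
clusters = greedy_cluster(toy, threshold=0.9)
```

Because the result of greedy clustering depends on input order, sequences are often sorted (for example by length or abundance) before clustering; that choice is left open here.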
RdRp barcode for phylogeny
Building a high-quality tree requires a global multiple sequence alignment (MSA)
for use in maximum-likelihood phylogeny estimation (ML). Longer sequences
are preferred because they contain more information, provided that homology
is maintained over the entire segment. For RdRp of highly
diverged viruses, this requirement implies that the sequence must be restricted
to the palm domain to avoid the problem of different (i.e., non-homologous) RdRp
domain content in different virus families. Here, palmcores represent a good
compromise between maximizing length and minimizing the risk of over-extending
into different domains.
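
The trade-off between length and over-extension can be made concrete with a simple check on an existing MSA. The sketch below measures, for each column, the fraction of sequences that have a residue rather than a gap, and trims the alignment to the span between the first and last well-occupied columns. This is only a heuristic sketch: column occupancy is a crude proxy and does not establish homology, and the function names and the `min_occupancy` parameter are assumptions introduced for illustration.

```python
GAP_CHARS = "-."


def column_occupancy(msa: list[str]) -> list[float]:
    """Per-column fraction of rows holding a residue (not a gap character)."""
    if not msa:
        return []
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of an MSA must have the same length")
    return [
        sum(row[col] not in GAP_CHARS for row in msa) / len(msa)
        for col in range(width)
    ]


def trim_to_occupied(msa: list[str], min_occupancy: float) -> list[str]:
    """Trim ragged ends: keep the span from the first to the last column whose
    occupancy is >= min_occupancy (hypothetical cutoff). Interior columns are kept."""
    occupancy = column_occupancy(msa)
    keep = [col for col, frac in enumerate(occupancy) if frac >= min_occupancy]
    if not keep:
        return ["" for _ in msa]
    start, end = keep[0], keep[-1] + 1
    return [row[start:end] for row in msa]


# Toy alignment: only the central block is present in every sequence.
toy_msa = [
    "--ACDE--",
    "MKACDEGG",
    "-KACDE-G",
]
occupancy = column_occupancy(toy_msa)
core = trim_to_occupied(toy_msa, min_occupancy=1.0)
```

A lower `min_occupancy` keeps longer flanks at the cost of including columns that only some sequences cover, which mirrors the compromise described above.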