Paired read merging (assembly)
Paired-end (PE) reads are often
generated by next-generation sequencers. When the forward and reverse reads
overlap, they can be "assembled" or "merged" to give a single sequence. I
prefer to call this merging rather than assembly to distinguish from whole-genome assembly using shotgun reads.
People who use the term "assembly" call the result of merging a pair a "contig";
I call it a "consensus sequence".
Consensus sequence
Paired read mergers generate consensus sequences by
aligning the forward and reverse reads and resolving any mismatches found in
the alignment. The reverse read is reverse-complemented so that its sequence
is oriented on the same strand as the forward read. Usually, the alignment
covers only part of the forward and reverse reads, leaving unaligned segments
at the beginning of both reads. However, if the sequencing construct is shorter
than the read length, the alignment is staggered, with unaligned segments at
the ends rather than the beginning of the reads. By default, the
fastq_mergepairs command trims (discards) the unaligned segments if the
alignment is staggered.
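The reverse-complement step described above can be sketched as follows. This is a minimal illustration, not the fastq_mergepairs implementation: the function name reverse_complement_read is hypothetical, and it assumes plain A/C/G/T/N bases with one quality character per base. The quality string is reversed along with the sequence so that each Q score stays attached to its base.

```python
# Complement table for DNA bases; N (unknown call) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_read(seq: str, qual: str) -> tuple[str, str]:
    """Return the read on the opposite strand.

    Hypothetical helper: complements and reverses the bases, and reverses
    the quality string so each score still refers to the same base call.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return seq.translate(_COMPLEMENT)[::-1], qual[::-1]


if __name__ == "__main__":
    rc_seq, rc_qual = reverse_complement_read("ACGTN", "IIH#!")
    print(rc_seq, rc_qual)
```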
In mismatched positions,
the base call with the higher Q score is
chosen. If the Q scores are the same, the base call is taken from the forward read.
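The rule for choosing a base can be written in a few lines. The sketch below assumes the two reads have already been aligned without gaps and that the reverse read has been reverse-complemented; consensus_base and consensus_overlap are illustrative names, not part of any real tool.

```python
def consensus_base(fwd_base: str, fwd_q: int, rev_base: str, rev_q: int) -> str:
    """Pick the consensus base call for one aligned column."""
    if fwd_base == rev_base:
        return fwd_base
    # Mismatch: take the call with the higher Q score.
    # On a tie, the forward read wins.
    return rev_base if rev_q > fwd_q else fwd_base


def consensus_overlap(fwd_seq, fwd_qs, rev_seq, rev_qs) -> str:
    """Apply consensus_base to every column of a gapless overlap."""
    if not (len(fwd_seq) == len(fwd_qs) == len(rev_seq) == len(rev_qs)):
        raise ValueError("all inputs must cover the same aligned columns")
    return "".join(
        consensus_base(fb, fq, rb, rq)
        for fb, fq, rb, rq in zip(fwd_seq, fwd_qs, rev_seq, rev_qs)
    )


if __name__ == "__main__":
    # Column 3 is a mismatch (G at Q10 vs T at Q35), so T is chosen.
    print(consensus_overlap("ACGT", [30, 30, 10, 20], "ACTT", [30, 30, 35, 20]))
```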
Posterior quality scores
If the base calls match,
the Q score should increase. If there is a mismatch, the Q score should
usually be reduced, but not always. The Q score at a mismatch position may
stay the same or even increase if one of the reads has a low enough Q score
that the base call is predicted to be wrong, e.g. Q=2 which indicates an
error probability of 63.1%. The error probabilities after merging are called
posterior probabilities, and the corresponding Q scores are called posterior
Qs. The term "posterior" comes from Bayesian statistics, meaning after new
evidence has been taken into account. The equations for calculating
posterior Qs are given in Edgar & Flyvbjerg (2015).
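To make the idea concrete, here is a sketch of a posterior calculation derived from a simple Bayesian model. The model is an assumption of this example: all four bases are equally likely a priori, the two reads err independently, and an error is equally likely to produce any of the three wrong bases. The resulting formulas should be checked against the published equations before being relied on, and the function names are hypothetical.

```python
import math


def q_to_p(q: float) -> float:
    """Phred Q score to error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def p_to_q(p: float) -> float:
    """Error probability to Phred Q score (p must be greater than zero)."""
    return -10 * math.log10(p)


def posterior_match(px: float, py: float) -> float:
    """Posterior error probability when both reads call the same base."""
    return (px * py / 3) / (1 - px - py + 4 * px * py / 3)


def posterior_mismatch(px: float, py: float) -> float:
    """Posterior error probability of the chosen base at a mismatch.

    px belongs to the chosen call (the one with the higher Q score),
    py to the rejected call. Not defined when px and py are both zero.
    """
    return px * (1 - py / 3) / (px + py - 4 * px * py / 3)


if __name__ == "__main__":
    print(p_to_q(posterior_match(q_to_p(30), q_to_p(30))))
    print(p_to_q(posterior_mismatch(q_to_p(30), q_to_p(20))))
    print(p_to_q(posterior_mismatch(q_to_p(40), q_to_p(2))))
```

Under these assumptions, two agreeing Q30 calls give a posterior of roughly Q65, a Q30 call contradicted by a Q20 call drops to roughly Q10, and a Q40 call contradicted by a Q2 call barely moves (about Q39), because the Q2 call carries almost no information.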
Calculating correct posterior Qs is important because it enables an improved estimate of the number of errors in the read.
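One common way to estimate the number of errors in a read is to sum the error probabilities implied by its Q scores. The sketch below assumes a Phred+33 quality string (the usual FASTQ encoding; pass a different offset if your files differ), and expected_errors is an illustrative name rather than a real command.

```python
def expected_errors(qual: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one quality string.

    Each character encodes Q = ord(char) - offset, and P = 10^(-Q/10).
    The offset of 33 is an assumption about the FASTQ encoding.
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


if __name__ == "__main__":
    # 'I' is Q40, '5' is Q20, '+' is Q10 under Phred+33.
    print(expected_errors("I5+"))
```

If the Q scores of a merged read are too low or too high, this sum is wrong by the same margin, which is why the posterior Qs matter.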
Other programs calculate incorrect posterior Q scores
Most assemblers generate posterior
Q scores that are obviously wrong in at least some situations (see
benchmark results). The best (i.e., worst)
example is PANDAseq, which reports lower Q scores in most aligned positions
even when the two reads agree (tested through v2.8).