NAST alignment format

See also
nastout files

NAST (Nearest Alignment Space Termination) is a multiple alignment format originally designed for 16S rRNA, though the approach can readily be adapted to other genes and regions.

NAST was introduced in a paper by DeSantis et al., NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes, Nucl. Acids Res. (2006) 34 (suppl 2): W394-W399. At least three third-party aligners, PyNAST, NAST-ier and mothur, are also available.

The main idea of NAST is to create a reference multiple alignment with a fixed number of columns that does not change as new sequences are introduced. Columns in a NAST alignment serve as fixed reference points for a set of homologous sequences, e.g. 16S genes. Similar ideas have been applied to other genes, e.g. IMGT unique numbering for immunoglobulins (PMID 12477501). Lacking a better name, I generically refer to this approach as "NAST".

A new sequence can be aligned to a reference alignment relatively easily, by identifying the closest sequence or closest few sequences. A pair-wise alignment or small multiple alignment is then made, which can readily be mapped back to the full reference alignment. This approach allows new sequences to be annotated with features, e.g. hypervariable regions, using a pre-defined map of features to column numbers. Given the very large datasets now available for 16S, immunoglobulins and other genes and regions, some traditional methods are computationally intractable, while a NAST alignment enables efficient calculation of pair-wise distances, identification of chimeric sequences, etc.
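
As an illustration of why fixed columns are convenient, here is a minimal Python sketch of the two operations mentioned above: extracting an annotated feature by column range, and computing a pair-wise identity directly from two rows. The function names, the toy 12-column alignment and the `region_X` column range are hypothetical and are not taken from any real 16S reference alignment. Counting only columns where both rows have a letter is one of several possible identity definitions.

```python
GAP_CHARS = "-."  # assumption: both characters denote gaps/padding in a row


def region_of(nast_row, col_start, col_end):
    """Return the ungapped letters found in columns [col_start, col_end)."""
    return "".join(c for c in nast_row[col_start:col_end] if c not in GAP_CHARS)


def pairwise_identity(row_a, row_b):
    """Fraction of identical letters over columns where both rows have a letter."""
    if len(row_a) != len(row_b):
        raise ValueError("NAST rows must have the same number of columns")
    both = [
        (a, b)
        for a, b in zip(row_a.upper(), row_b.upper())
        if a not in GAP_CHARS and b not in GAP_CHARS
    ]
    if not both:
        return 0.0
    same = sum(1 for a, b in both if a == b)
    return same / len(both)


# Hypothetical feature map: feature name -> (first column, one past last column).
FEATURES = {"region_X": (4, 9)}

# Two toy rows sharing the same 12 columns.
row_a = "AC-GTTA-CGGA"
row_b = "ACCGT-AACGGT"

start, end = FEATURES["region_X"]
print(region_of(row_a, start, end))
print(region_of(row_b, start, end))
print(pairwise_identity(row_a, row_b))
```

No alignment step is needed at this point: because both rows already share the same columns, each operation is a single pass over the row.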

There are two main disadvantages of NAST-like approaches. First, novel insertions cannot be accommodated correctly because the format has a fixed number of columns by definition. Therefore, novel insertions must be deleted (this is the solution adopted in USEARCH), or misalignments must be introduced. Neither solution is entirely satisfactory, for obvious reasons. Second, some (if not most) genes and regions are simply too variable, making it impossible to build a reasonable multiple alignment, e.g. the fungal Internal Transcribed Spacer (ITS) region.
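
The first disadvantage can be made concrete with a small Python sketch. It takes a pair-wise alignment of a query against its closest reference sequence and projects the query onto that reference's row in the fixed-column alignment, deleting any query letters that have no reference column to go into. This is a simplified illustration under stated assumptions (a global pair-wise alignment covering the whole reference, `-` as the only gap character, hypothetical function and variable names). It is not the implementation used by USEARCH or any other tool.

```python
def project_to_nast(ref_nast_row, ref_aln, query_aln):
    """Place a query into the fixed columns of ref_nast_row.

    ref_aln and query_aln are the two rows of a pair-wise alignment between
    the ungapped reference sequence and the query. Returns the query's row in
    the fixed-column alignment and the number of inserted letters deleted.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("pair-wise alignment rows differ in length")

    # Collect one query character (letter or gap) per reference letter.
    per_ref_letter = []
    dropped = 0
    for r, q in zip(ref_aln, query_aln):
        if r == "-":
            # Query letter opposite a reference gap: a novel insertion with
            # no column in the fixed alignment, so it is deleted.
            if q != "-":
                dropped += 1
            continue
        per_ref_letter.append(q)

    n_ref_letters = sum(1 for c in ref_nast_row if c != "-")
    if n_ref_letters != len(per_ref_letter):
        raise ValueError("pair-wise alignment does not cover the reference row")

    # Walk the reference row: gap columns stay gaps, letter columns take the
    # query character aligned to that reference letter.
    letters = iter(per_ref_letter)
    out = [("-" if c == "-" else next(letters)) for c in ref_nast_row]
    return "".join(out), dropped


# Toy example: the reference row has 8 columns and 6 letters.
ref_nast_row = "AC-GT-TA"
ref_aln = "ACG-TTA"     # reference row of the pair-wise alignment
query_aln = "ACGGT-A"   # query has one inserted G and one deleted T

row, n_dropped = project_to_nast(ref_nast_row, ref_aln, query_aln)
print(row, n_dropped)
```

The output row always has the same number of columns as the reference, which is the point of the format. The cost is that the inserted letter is lost from the query.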

I believe that better results can usually be obtained by constructing pair-wise or multiple alignments of subsets de novo, as done for example by the uchime_ref command (compare ChimeraSlayer and the mothur Chimera.slayer command, which use a NAST-based method but are slower and less accurate). However, NAST methods can be convenient in some situations, especially where there are existing data and annotations that rely on 16S NAST or a NAST-like fixed column scheme such as IMGT numbering.