SAM file format

See also
sam_filter command
samout output file

The Sequence Alignment/Map (SAM) file format is commonly used by read mapping software. It is described in Li et al., The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). A formal specification is provided at the SAMtools web site.

SAM is a problematic format for both developers and users: it is poorly designed and described, and it has several incompatible variants in common use. My goal is to make the SAM support in USEARCH compatible with as many other programs as possible. If you run into a compatibility problem, please let me know.

SAM is a text-based file format. A SAM file optionally starts with header lines having a '@' character in the first column. In USEARCH, SAM headers are ignored in input files and are not generated in output files.
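
The header rule above can be illustrated with a short Python sketch. This is not USEARCH code; the function name split_sam_lines is hypothetical, and the sketch assumes that header lines appear only at the top of the file, so every line after the first non-header line is treated as a record.

```python
def split_sam_lines(lines):
    """Separate leading '@' header lines from the records that follow.

    Hypothetical helper for illustration; assumes headers occur only
    at the start of the file.
    """
    headers, records = [], []
    in_header = True
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if in_header and line.startswith("@"):
            headers.append(line)
        else:
            in_header = False
            records.append(line)
    return headers, records
```

A reader that mimics the USEARCH behavior described above would simply discard the returned headers and process only the records.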

Following the header lines are tab-separated lines called "records". Each record either represents an alignment or reports that a read was not aligned successfully to a target sequence (called a template in SAM terminology). There are 11 fields that are always present in a record (see table below), which are followed by zero or more tag fields.

Field		Description
1		Query sequence label (typically, a read label).
2		Integer giving the sum of the flag values that apply: 4=no hit, 16=rev-comp'd, 256=secondary hit (every hit except the top hit has this flag).
3		Target sequence label (reference database label).
4		1-based position in target.
5		Mapping quality (ignored / set to * by USEARCH).
6		CIGAR string.
7		Ref name of mate (ignored / set to * by USEARCH).
8		Position of mate (ignored / set to * by USEARCH).
9		Target sequence length.
10		Query sequence. If soft clipping is used, this is the full-length query. If hard clipping is used, this is just the aligned segment of the query. The sam_softclip option specifies soft clipping. The default is hard clipping, because with soft clipping SAM files can be very large when query sequences are long. BWA puts * in this field for secondary hits (a non-standard space-saving optimization).
11		ASCII string with Phred scores (ignored / set to * by USEARCH).
12, 13, ...		Tag fields (optional). Currently, USEARCH generates the following tags (this may be subject to change): AS, XN, XM, XO, XG, NM, MD and YT. Most tags are ignored on input. MD is required in an input file in order to generate complete alignments if the database is not provided (see alnout option to sam_filter).
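
As a rough illustration of the record layout in the table above, the following Python sketch splits one record on tabs, tests the three flag values, and collects the tag fields. It is a minimal sketch, not the parser used by USEARCH: the function name parse_sam_record and the dictionary keys are hypothetical, and the example record is made up. The TAG:TYPE:VALUE syntax for tag fields comes from the SAM specification.

```python
# Flag values from the field table above.
FLAG_NO_HIT = 4
FLAG_REVCOMP = 16
FLAG_SECONDARY = 256


def parse_sam_record(line):
    """Parse one tab-separated SAM record into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 fields, got %d" % len(fields))

    flag = int(fields[1])

    # Tag fields use the TAG:TYPE:VALUE syntax, e.g. AS:i:3 or MD:Z:2A1.
    tags = {}
    for tag_field in fields[11:]:
        name, type_code, value = tag_field.split(":", 2)
        tags[name] = int(value) if type_code == "i" else value

    return {
        "query_label": fields[0],
        "flag": flag,
        "no_hit": bool(flag & FLAG_NO_HIT),
        "rev_comp": bool(flag & FLAG_REVCOMP),
        "secondary": bool(flag & FLAG_SECONDARY),
        "target_label": fields[2],
        "target_pos": fields[3],  # kept as text; not validated here
        "cigar": fields[5],
        "query_seq": fields[9],
        "tags": tags,
    }


# Made-up record: flag 272 = 16 (rev-comp'd) + 256 (secondary hit).
example = "read1\t272\tref1\t101\t*\t4M\t*\t*\t1500\tACGT\t*\tAS:i:3\tNM:i:1\tMD:Z:2A1"
rec = parse_sam_record(example)
print(rec["rev_comp"], rec["secondary"], rec["no_hit"], rec["tags"]["MD"])
```

Because the flag is a sum of distinct powers of two, each condition can be tested independently with a bitwise AND, which is why a single integer can report both "rev-comp'd" and "secondary hit" at once.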