USEARCH cluster format (UC) is a tab-separated text file. UC output
is supported by clustering and database search. By convention, the
.uc filename extension is used. Each line is either a comment
(starts with #) or a record. Every input sequence generates one record (H, S or
N); additional record types give information about
clusters. If an input sequence matched a target sequence, then the
alignment and the identity computed from that alignment are also
provided. Fields that do not apply to a given record type are
filled with an asterisk placeholder (*).

By default, only the top hit is written to the UC output file. This reflects that
the format is primarily designed for clustering, in which case -maxaccepts > 1
is used to increase cluster quality by finding closer centroid sequences. The -uc_allhits
option can be used to specify that all hits are to be written (mostly useful for
database searches). The -uc_allhits option is supported in version 6.0.217 and
later.

| Field | Description |
|-------|-------------|
| 1 | Record type: S, H, C or N (see table below). |
| 2 | Cluster number (0-based). |
| 3 | Sequence length (S, N and H) or cluster size (C). |
| 4 | For H records, percent identity with target. |
| 5 | For H records, the strand: + or - for nucleotides, . for proteins. |
| 6 | Not used, parsers should ignore this field. Included for backwards compatibility. |
| 7 | Not used, parsers should ignore this field. Included for backwards compatibility. |
| 8 | Compressed alignment or the symbol '=' (equals sign). The = indicates that the query is 100% identical to the target sequence (field 10). |
| 9 | Label of query sequence (always present). |
| 10 | Label of target sequence (H records only). |
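
The field layout above can be read with a few lines of Python. The following is a minimal sketch, not part of USEARCH: the names `UCRecord`, `parse_uc` and `SAMPLE_LINES` are hypothetical, and the sample lines are made-up data that follow the field descriptions. It assumes every record has at least ten tab-separated fields and that any field may hold the `*` placeholder.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class UCRecord(NamedTuple):
    """One UC record (hypothetical container, not a USEARCH type)."""
    record_type: str           # field 1: S, H, C or N
    cluster: Optional[int]     # field 2: 0-based cluster number
    size: Optional[int]        # field 3: sequence length (S, N, H) or cluster size (C)
    identity: Optional[float]  # field 4: percent identity (H records)
    strand: Optional[str]      # field 5: +, - or . (H records)
    alignment: Optional[str]   # field 8: compressed alignment or "="
    query: str                 # field 9: query label (always present)
    target: Optional[str]      # field 10: target label (H records only)


def _value(text: str, convert=str):
    """Map the '*' placeholder to None, otherwise convert the text."""
    return None if text == "*" else convert(text)


def parse_uc(lines: Iterable[str]) -> Iterator[UCRecord]:
    """Yield one UCRecord per record line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
        yield UCRecord(
            record_type=fields[0],
            cluster=_value(fields[1], int),
            size=_value(fields[2], int),
            identity=_value(fields[3], float),
            strand=_value(fields[4]),
            # fields 6 and 7 (indexes 5 and 6) are unused and ignored
            alignment=_value(fields[7]),
            query=fields[8],
            target=_value(fields[9]),
        )


# Made-up example records, for illustration only.
SAMPLE_LINES = [
    "# hypothetical UC output",
    "S\t0\t292\t*\t*\t*\t*\t*\tseqA\t*",
    "H\t0\t292\t99.0\t+\t0\t0\t292M\tseqB\tseqA",
    "C\t0\t2\t*\t*\t*\t*\t*\tseqA\t*",
]
```

For a real file, an open file handle can be passed directly, for example `parse_uc(open("clusters.uc"))`, where the filename is a placeholder.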

| Record | Description |
|--------|-------------|
| H | Hit. Represents a query-target alignment. For clustering, indicates the cluster assignment for the query. If -maxaccepts > 1, there is still only one H record, giving the best hit. To get the other accepts, use another type of output file, or use the -uc_allhits option (requires version 6.0.217 or later). |
| S | Centroid (clustering only). There is one S record for each cluster; it gives the centroid (representative) sequence label in the 9th field. Redundant with the C record; provided for backwards compatibility. |
| C | Cluster record (clustering only). The 3rd field is set to the cluster size (number of sequences in the cluster) and the 9th field is set to the label of the centroid sequence. |
| N | No hit (for database search without clustering only). Indicates that no accepts were found. In the case of clustering, a query with no hits becomes the centroid of a new cluster and generates an S record instead of an N record. |
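
As an illustration of how the record types fit together in clustering output, the sketch below collects the members of each cluster from the S and H records and checks the counts against the C records. It is an assumption-laden example rather than a USEARCH tool: `cluster_members` and `SAMPLE_UC` are hypothetical names, the sample data is made up, and the code assumes default output with one H record per query (that is, without -uc_allhits).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_members(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Return {cluster number: [labels]} with the centroid label first.

    Assumes clustering output with at most one H record per query.
    """
    members: Dict[int, List[str]] = defaultdict(list)
    reported_size: Dict[int, int] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record_type = fields[0]
        if record_type == "S":
            # Centroid: field 2 is the cluster number, field 9 the label.
            members[int(fields[1])].insert(0, fields[8])
        elif record_type == "H":
            # Hit: the query (field 9) is assigned to the cluster in field 2.
            members[int(fields[1])].append(fields[8])
        elif record_type == "C":
            # Cluster record: field 3 holds the cluster size.
            reported_size[int(fields[1])] = int(fields[2])
        # N records carry no cluster assignment and are skipped.
    for cluster, size in reported_size.items():
        if len(members[cluster]) != size:
            raise ValueError(
                f"cluster {cluster}: C record reports {size} sequences, "
                f"found {len(members[cluster])}"
            )
    return dict(members)


# Made-up example records, for illustration only.
SAMPLE_UC = [
    "S\t0\t100\t*\t*\t*\t*\t*\tseqA\t*",
    "H\t0\t98\t97.0\t+\t0\t0\t98M\tseqB\tseqA",
    "S\t1\t90\t*\t*\t*\t*\t*\tseqC\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tseqA\t*",
    "C\t1\t1\t*\t*\t*\t*\t*\tseqC\t*",
]
```

The size check relies on the C record being redundant with the S and H records, so a mismatch most likely means the file was truncated or written with -uc_allhits.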