The fastx_subsample command generates a random subset of sequences in a FASTA or FASTQ file. The subset is written to filename(s) given by the -fastaout and/or -fastqout options.
This command is useful for making fast assessments of large datasets, e.g. by analyzing a small sample of NGS reads.
Paired reads are supported by using the -reverse option to specify the reverse (R2) filename. You must then provide the -output2 option for the reverse subset.
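To show why forward and reverse subsets need to be drawn together, the Python sketch below picks one set of record positions and applies it to both read lists, so mates are never separated. It is a conceptual illustration only, not usearch's implementation; the function name `subsample_pairs` and the in-memory lists of labels are assumptions made for the example.

```python
import random

def subsample_pairs(fwd, rev, n, seed=None):
    """Pick the same n positions from two parallel lists of reads (hypothetical helper)."""
    if len(fwd) != len(rev):
        raise ValueError("R1 and R2 must contain the same number of reads")
    rng = random.Random(seed)
    # Draw the index set once and reuse it for both files,
    # so every selected forward read keeps its reverse mate.
    picked = sorted(rng.sample(range(len(fwd)), n))
    return [fwd[i] for i in picked], [rev[i] for i in picked]

# Toy data standing in for the records of Sample_R1.fq and Sample_R2.fq.
fwd = [f"read{i}/1" for i in range(100)]
rev = [f"read{i}/2" for i in range(100)]
sub_fwd, sub_rev = subsample_pairs(fwd, rev, 25, seed=1)
```

Sampling each file independently would give two subsets of the right size whose reads no longer pair up, which is why a single index set is used here.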
The size of the subset must be specified with either the -sample_size option, which gives the number of sequences, or the -sample_pct option, which gives a percentage of the input sequences.
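The relationship between the two size options can be sketched in Python as follows. For example, -sample_pct 10 on an input of 200,000 sequences corresponds to about 20,000 sequences. The function name `subset_size`, the rounding rule, the cap at the input size, and the check that exactly one option is given are all assumptions of this sketch; this page does not state how usearch handles those cases.

```python
def subset_size(total, sample_size=None, sample_pct=None):
    """Number of sequences to keep, given one of the two size settings (hypothetical helper)."""
    # Assumption: exactly one of the two settings is supplied.
    if (sample_size is None) == (sample_pct is None):
        raise ValueError("give exactly one of sample_size or sample_pct")
    if sample_size is not None:
        n = sample_size
    else:
        # Assumption: fractional counts are rounded with Python's round().
        n = round(total * sample_pct / 100)
    # Assumption: never request more sequences than the input holds.
    return min(n, total)

ten_pct = subset_size(200_000, sample_pct=10)
ten_k = subset_size(50_000, sample_size=10_000)
```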
Size annotations are supported if the -sizein and -sizeout options are given.
If the -xsize option is given, any size annotations in the input sequence labels are stripped.
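Size annotations in usearch labels are semicolon-delimited fields of the form `size=N`, for example `Uniq1;size=42;`. The Python sketch below shows what reading and stripping such a field amounts to. The helper names `get_size` and `strip_size` are hypothetical, treating an unannotated label as size 1 is an assumption, and the exact output formatting of usearch itself (such as trailing semicolons) may differ.

```python
def get_size(label, default=1):
    """Return the size=N value from a label, or a default if absent (hypothetical helper)."""
    for field in label.split(";"):
        if field.startswith("size="):
            return int(field[len("size="):])
    # Assumption: a label without an annotation counts as one sequence.
    return default

def strip_size(label):
    """Remove any size=N field from a label, keeping the other fields (hypothetical helper)."""
    fields = [f for f in label.split(";") if f and not f.startswith("size=")]
    return ";".join(fields)

size = get_size("Uniq1;size=42;")
bare = strip_size("Uniq1;size=42;")
```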
The -randseed option sets a seed for the random number generator, so that the same subset can be reproduced. The value must be an integer. By default, the seed is taken from the system clock, so the subset will generally change each time the command is run.
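The effect of a fixed seed can be illustrated with Python's standard `random` module: the same seed yields the same selection on every run, while an unset seed yields a selection that generally differs. This is an analogy under stated assumptions, not usearch's random number generator, and the function name `pick_indices` is hypothetical.

```python
import random

def pick_indices(total, n, seed=None):
    """Choose n distinct positions out of total (hypothetical helper)."""
    # seed=None lets Python choose an unpredictable seed;
    # an integer seed makes the selection repeatable.
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), n))

first = pick_indices(1000, 10, seed=1)
second = pick_indices(1000, 10, seed=1)   # same seed, same selection
unseeded = pick_indices(1000, 10)         # generally differs between runs
```

Recording the seed alongside the results makes it possible to regenerate exactly the same subset later.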
The -notmatched and -notmatchedfq options give filenames for the sequences that were not selected, written in FASTA and FASTQ format, respectively.
Examples
usearch -fastx_subsample raw_reads.fastq -sample_pct 10 -randseed 1 -fastaout ten_pct.fastq
usearch -fastx_subsample derep.fa -sizein -sizeout -sample_size 10000 -fastaout ten_k.fa
usearch -fastx_subsample Sample_R1.fq -reverse Sample_R2.fq -fastqout fwd.fq \
  -output2 rev.fq -sample_pct 25