Part 2. Implementing the pipeline script
With 454 data, the first step is to make sure you understand the
read structure. HMP reads start with the
control sequence TCAG followed by the barcode and primer.
Making the barcodes.fa file The pipeline needs a FASTA file
containing the barcodes in which the barcode sequence labels are the sample
identifiers. This is the barcodes.fa file in the misc/ subdirectory, which
looks like this:
Here, sequence labels are SRA
sample identifiers, e.g.
SRS042483.
I added TCAG at the start of each barcode for compatibility with the
fastq_strip_barcode_relabel2.py script (next step below). Here's how I made the barcodes.fa file. Data in HMP archives is
described by LMD (library metadata) files (see
HMP documentation for library metadata).
LMD files are tabbed text with fields described in the info/lmd_format.txt
tutorial file.
The info/SRR058098.lmd file is the LMD file for the
reads used in this tutorial. This contains the information we need to make
the barcodes.fa file: field 7 is the barcode sequence and field 10 is the
sample identifier. The make_barcodes.bash script uses these fields to make
the barcodes.fa file in a single command line using the Linux sed and tr
commands. This command line uses fancy Linux tricks like regular expressions
and a hack to insert newlines using tr, so don't worry if you don't
understand it. The main thing here is to understand the format of the
barcodes.fa file so that you can make one for your own data.
Reformatting the reads With 454 reads, the
fastq_strip_barcode_relabel2.py script is used to strip the barcode and
primer and add a sample identifier to the
read label. This script assumes that the read starts with the barcode
and primer -- this is why I added TCAG to the barcode sequences. (This
design is to support reads with different or missing control sequences).
The primer sequence is field 9 in the LMD file. We can verify that all
the samples have the same primer like this:
cut -f9 info/SRR058098.lmd | uniq
CCGTCAATTCMTTTRAGT This gives us everything we need to run the Python script; this is the
command line from run_uparse.bash:
The sed command changes barcodelabel= to sample=. This is because I now
use sample=xxx as the convention for the sample identifier. Some older
scripts such as fastq_strip_barcode_relabel2.py use barcodelabel= for not
very good reasons.
Quality filtering and length trimming 454 reads often have highly variable lengths with quality that
falls towards the ends of the reads. To get good OTUs, the reads must be
trimmed to a fixed length and filtered
for quality. The fastq_eestats2
provides a readable report to assist you in choosing which length to use.
The command line in run_uparse.bash looks like this:
$usearch -fastq_eestats2 ../fq/reads.fq -output
$out/eestats2.txt The output file is
out/eestats2.txt:
The recommended
expected error cutoff is 1.0 (middle column). I
chose 350 as the length (yellow highlight) because this keeps two thirds of
the reads while increasing the length to 400 discards almost half. This
gives the fastq_filter options
-fastq_maxee 1.0 and -fastq_trunclen 350 used in run_uparse.bash (below).
Don't worry if discarding one third of the reads sounds like a lot; most of
the discarded reads will map to OTUs.