Hmptut tutorial: Human Microbiome Project samples (454 reads)

See also
OTU / denoising tutorial home

This tutorial uses reads from 11 samples deposited in archive SRR058098 by the Human Microbiome Project. I took a random 5% subset of the data (using the fastx_subsample command), giving 19,735 reads.

Download this archive:
/downloads/hmptut_v10.tar.gz

Make a top-level directory for the tutorials (see tutorial directories for a description of the subdirectories) and extract the data files from the archive.

mkdir -p ~/tutorials
cd ~/tutorials
tar -zxvf ~/Downloads/hmptut_v10.tar.gz

Set the $usearch environment variable to the path name of your usearch binary file.
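
For example, in a Bourne-style shell such as bash the variable can be set as shown here. This is a minimal sketch: the path $HOME/bin/usearch is a hypothetical location, so substitute the place where you actually saved the binary.

```shell
# Hypothetical location: replace with the real path to your usearch binary.
usearch="$HOME/bin/usearch"
export usearch

# Sanity check: report whether the path points to an executable file.
if [ -x "$usearch" ]; then
    echo "usearch binary found: $usearch"
else
    echo "warning: $usearch is not an executable file" >&2
fi
```

If you use bash, adding the assignment and export lines to ~/.bashrc should make the setting persist across sessions.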

The run.bash script runs a basic OTU analysis. (Denoising is not included because the UNOISE algorithm doesn't work well with pyrosequencing reads.) Run it from the scripts/ directory (commands below). Notice the dot and slash (./) before the script name: this tells the shell to look for the script file in your current directory (dot means current directory). Tutorial scripts always assume that scripts/ is your current directory.

cd ~/tutorials/hmptut/scripts
./run.bash
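
To see why the ./ prefix matters, the following self-contained sketch creates a throwaway script and tries to run it with and without the prefix. The file name hello.sh and the temporary directory are hypothetical and have nothing to do with the tutorial data. It assumes that your PATH does not include the current directory, which is the usual default.

```shell
# Work in a throwaway directory so nothing in the tutorial tree is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Create a tiny executable script (hello.sh is a hypothetical name).
printf '#!/bin/sh\necho "hello from the current directory"\n' > hello.sh
chmod +x hello.sh

# Bare name: the shell searches only the directories listed in $PATH.
hello.sh 2>/dev/null || echo "hello.sh was not found via PATH"

# With ./ the shell runs the file located in the current directory.
dot_slash_output=$(./hello.sh)
echo "$dot_slash_output"

# Clean up and return to the previous directory.
cd - >/dev/null || exit 1
rm -rf "$demo_dir"
```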

Running the script should regenerate the pre-computed files in the hmptut/out directory.

1. Run the setup_sintax.bash script to download a taxonomy reference database for 16S. Write a script to create alignments of OTU sequences to the taxonomy database using the usearch_global command. Use these options: -id 0.9 -strand both, and use the -alnout option to create human-readable alignments. Are the OTU sequences on the plus strand, minus strand, or both?

2. Write a script that runs the sintax command to predict taxonomy for the OTU sequences. Use the -strand both option (why?). You should find that most OTUs have genus predictions with bootstrap confidence value 1.0 (very high confidence), but Otu3 has a lower confidence (something like 0.66). Run the Otu3 sequence at NCBI BLAST (see: How to BLAST a 16S sequence). Looking at the BLAST hits, can you explain why the genus confidence is low for this sequence? (Hint: how many different genera have hits with 97% identity or higher?)

3. Write a script to run the fastq_eestats2 command on the truncated reads created by run.bash (out/trunc.fq). From the report generated by this command, what fraction of reads would pass a filter which (a) truncates at 200 nt and (b) sets a maximum expected error threshold of 0.5?
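
The exercises ask for scripts in the same style as run.bash. As a starting point for exercise 1, here is a minimal sketch of such a script. The file names are assumptions: otus.fa, the database path and the output name are hypothetical, so check out/ and the directory created by setup_sintax.bash for the real names. The sketch runs usearch only if both input files exist.

```shell
#!/bin/bash
# Sketch for exercise 1; run it from the scripts/ directory, like run.bash.

# Fall back to a usearch found on PATH if $usearch is not set.
usearch="${usearch:-usearch}"

# ASSUMED file names: adjust these to the files actually present.
otus=../out/otus.fa
db=../sintax/rdp_16s_v16.fa
alnout=../out/otus_vs_db.aln

align_otus() {
    # Arguments: query FASTA, database FASTA, alignment output file.
    # Options are the ones given in exercise 1.
    "$usearch" -usearch_global "$1" -db "$2" -id 0.9 -strand both -alnout "$3"
}

if [ -f "$otus" ] && [ -f "$db" ]; then
    align_otus "$otus" "$db" "$alnout"
else
    echo "input files not found; edit the otus and db variables" >&2
fi
```

The scripts for exercises 2 and 3 can follow the same pattern, replacing the command inside the function with sintax or fastq_eestats2 and their own options.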