sampledoc
News and Announcements »

Open-reference OTU picking applied to Illumina data

This document describes how to use QIIME to analyze very large data sets, generally on the HiSeq2000, using an open-reference OTU picking protocol. There are two primary options here: first is to use standard open-reference uclust (i.e., uclust_ref allowing for new clusters); and second is to use the subsampled open-reference OTU picking protocol. The results from both will be nearly identical. The difference is that the first option is slower, but available in QIIME 1.4.0 (and most earlier versions). The runtime will be limiting for multiple HiSeq2000 lanes, or for single HiSeq2000 lanes with high diversity. The second option is much faster, and will work for multiple HiSeq2000 lanes or runs. It is available in QIIME 1.5.0 and later.

This document very briefly covers option 1. Most of the text covers option 2, including a description of how to use it, what exactly is happening, and test results from applying this to some well-understood 454 data.

Note

You can always find a link to the latest version of the Greengenes reference OTUs and the AMI of the latest QIIME EC2 instances here.

Option 1: Standard open-reference OTU picking

Standard open-reference OTU picking is suitable for a single HiSeq2000 lane (unless it’s very high diversity, in which case runtime may be a limiting factor). You’ll use the pick_otus_through_otu_table.py workflow script in QIIME with a custom parameters file.

Your parameters file should look like the following:

pick_otus:otu_picking_method uclust_ref
pick_otus:refseqs_fp <PATH TO REFERENCE COLLECTION>
pick_otus:enable_rev_strand_match True

Where <PATH TO REFERENCE COLLECTION> is replaced with the path to the reference data set you’d like to pick OTUs against. The QIIME development group frequently used the Greengenes reference OTUs. On the QIIME EC2 instance, this path would be /software/gg_otus-4feb2011-release/rep_set/gg_97_otus_4feb2011.fasta.

This command should be run in parallel. Each job will need approximately 4GB of RAM, so if running on EC2 and you want to start 8 parallel jobs (recommended setting for EC2), your instance type should be m2.4xlarge.

You can then use the following commands. You should always use full paths which are represented here by $PWD, but will usually look something like /home/ubuntu/my_data/ (in other words, they should start with a /). In this example your input sequences (seqs.fna), your parameters file (ucr_params.txt), and your metadata mapping file (map.txt) are all in the same directory represented by $PWD. If you work from the directory containing those files, you can leave $PWD in the commands instead of specifying the full paths.

First, pick otus, choose representative sequences, assign taxonomy to OTUs, and build a phylogenetic tree. The -aO 8 specifies that we want to start 8 parallel jobs - adjust this according to the resources you have available. This is open-reference OTU picking, so reads will be clustered against the reference database (in parallel) and reads which fail to hit the reference data set will subsequently be clustered de novo (serially):

pick_otus_through_otu_table.py -i $PWD/seqs.fna -o $PWD/ucr/ -p $PWD/ucr_params.txt -aO 8

When working with Illumina data you typically want to filter singleton OTUs (i.e., OTUs with only one sequence) as these are likely to represent sequencing or PCR errors. In QIIME 1.4.0 (and most earlier versions) you can do that with this command:

filter_otu_table.py -i $PWD/ucr/otu_table.txt -o $PWD/ucr/otu_table_mc2.txt -c 2 -s 1

In QIIME 1.4.0-dev and later, you can filter singleton OTUs with this command:

filter_otus_from_otu_table.py -i $PWD/ucr/otu_table.biom -o $PWD/ucr/otu_table_mc2.biom -c 2

You’ll notice that depending on your version of QIIME, the extension on your OTU table file will differ. In QIIME 1.4.0 and earlier, it will be .txt. In QIIME 1.4.0-dev and later it will be .biom. We’ll continue this example assuming that your OTU table ends with .txt (if you’re working with QIIME 1.4.0-dev or later, you likely decided to go with option 2 for OTU picking).

As PCoA of UniFrac distances between samples is a frequent result of interest in microbial ecology, we’ll cover how to generate PCoA plots next. The first thing you’ll want to do is evenly sample your OTU table. To choose an even sampling depth, review the number of reads per sample:

per_library_stats.py -i $PWD/ucr97/otu_table_mc2.biom

This will print information on the number of reads per sample to the terminal. Choose a depth of sampling that maximizes the number of sequences you’ll include, and also the number of samples that have at least that many sequences: samples with fewer sequences will be excluded from your beta diversity/PCoA analysis. Even sampling is absolutely critical to getting meaningful UniFrac distances between your samples.

After choosing an even sampling depth you can use the beta_diversity_through_plots.py script to rarify your OTU table, compute UniFrac distances between samples, and run Principal Coordinates Analysis with the following command (in this example we have chosen an even sampling depth of 25,000 sequences/sample):

beta_diversity_through_plots.py -i $PWD/ucr/otu_table_mc2.biom -e 25000 -o $PWD/ucr/bdiv_even25000/ -t $PWD/ucr/rep_set.tre -m $PWD/map.txt -aO8

Again the -aO8 specifies that the job should be run in parallel on 8 processors. Adjust this according to your resources. When this completes you can open 3D interactive PCoA plots by opening the file bdiv_even25000/unweighted_unifrac_3d_continuous/unweighted_unifrac_pc_3D_PCoA_plots.html in a web browser. This may take a minute to load for very large sets of samples.

Option 2: Subsampled open-reference OTU picking

Subsampled reference OTU picking is suitable for any analysis that standard open-reference OTU picking can be used for, but additionally scales to much larger data sets (such as multiple HiSeq runs, which may require several days on ~100 processors to analyze).

Warning

If processing multiple HiSeq lanes, don’t combine the sequence data into a single file. Instead, see Using the subsampled open-reference OTU picking workflow in iterative mode.

This is an open-reference OTU picking protocol, meaning that sequences are clustered against a reference database, and reads which fail to hit the reference are subsequently clustered de novo. This differs from standard open-reference OTU picking as it was optimized at several steps to enable running on massive numbers of sequences (hundreds of millions, which is massive as of this writing). The steps in this workflow are as follows.

Step 0: Prefilter (parallel)

Prefilter the input sequence collection by searching reads against the reference set with a low percent identity threshold (default is 60%, modify with --prefilter_percent_id). The choice of 60% is described here. All reads which fail to hit the reference set are discarded as likely sequencing error.

Warning

If most or all of your sequences are being filtered at this step, your sequences may be in the reverse orientation with respect to your reference database. To address this, you should add the following to your parameters file (creating one, if necessary) and pass this file as -p to pick_subsampled_reference_otus_through_otu_table.py: pick_otus:enable_rev_strand_match True. This is included in the instructions below, but be aware that this doubles the memory used in this step of the workflow.

Step 1: Closed reference (parallel)

Apply closed-reference OTU picking against the reference collection. Generate a fasta file containing all reads that fail to hit the reference collection.

Step 2: De novo clustering of subsampled failures (serial)

Randomly subsample the sequences that failed to hit the reference collection, and write these to a new fasta file (default subsampling percentage is 0.1%, modify with -s/--percent_subsample). Cluster these reads de novo, and choose a representative set of sequences as the centroid of each OTU cluster. These are the new reference OTUs.

Step 3: Closed reference, round 2 (parallel)

Pick closed-reference OTUs against the representative sequences from the previous step. Write all sequences that fail to hit the reference collection to a fasta file.

Step 4: De novo (serial)

Pick de novo OTUs on all reads that failed to hit the reference collection in the previous step. These are the clean-up OTUs. This step can be suppress by passing --suppress_step4.

Post-OTU processing (parallel and serial, depending on step)

  1. Assemble the reference OTUs, the new reference OTUs, and the clean-up OTUs into a new OTU map, and construct an OTU table. At this stage, all OTUs with a sequence count of smaller than 2 (i.e., the singleton OTUs) are discarded. This can be modified with the --min_otu_size option, and disabled by passing --min_otu_size=1.
  2. Construct a new reference collection based on this OTU picking run. This new reference collection will be the combination of the full input reference collection, the new reference OTU representative sequences, and the clean-up OTU representative sequences. Note that this will not include representatives of the singleton OTUs by default. Also note that this differs from the representative set of sequences for this run in that it contains all of the input reference sequences, not only the ones that are represented in the current data set (which is what the representative sequences for this run contains).
  3. Taxonomy will be assigned to all OTUs (using RDP classifier by default) and representative sequences will be aligned and a tree will be constructed. Finally, an additional OTU table will be constructed that excludes reads that failed to align with PyNAST. It is recommended that this OTU table be used in downstream analysis.

To apply this analysis to seqs1.fna, picking OTUs against the reference collection refseqs.fna you can run the following command. Your parameters file should contain the following:

pick_otus:otu_picking_method uclust_ref
pick_otus:enable_rev_strand_match True

You should always use full paths which are represented here by $PWD, but will usually look something like /home/ubuntu/my_data/ (in other words, they should start with a /). In this example your input sequences (seqs1.fna), and your metadata mapping file (map.txt) are all in the same directory represented by $PWD. If you work from the directory containing those files, you can leave $PWD in the commands instead of specifying the full paths:

pick_subsampled_reference_otus_through_otu_table.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss/ -aO 8 -p $PWD/ucrss_params.txt

This command should be run in parallel. Each job will need approximately 4GB of RAM, so if running on EC2 and you want to start 8 parallel jobs (recommended setting for EC2), your instance type should be m2.4xlarge. The -aO 8 specifies that we want to start 8 parallel jobs - adjust this according to the resources you have available.

As PCoA of UniFrac distances between samples is a frequent result of interest in microbial ecology, we’ll cover how to generate PCoA plots next. The first thing you’ll want to do is evenly sample your OTU table. To choose an even sampling depth, review the number of reads per sample:

per_library_stats.py -i $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.biom

This will print information on the number of reads per sample to the terminal. Choose a depth of sampling that maximizes the number of sequences you’ll include, and also the number of samples that have at least that many sequences: samples with fewer sequences will be excluded from your beta diversity/PCoA analysis. Even sampling is absolutely critical to getting meaningful UniFrac distances between your samples.

After choosing an even sampling depth you can use the beta_diversity_through_plots.py script to rarify your OTU table, compute UniFrac distances between samples, and run Principal Coordinates Analysis with the following command (in this example we have chosen an even sampling depth of 25,000 sequences/sample):

beta_diversity_through_plots.py -i $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.biom  -e 25000 -o $PWD/ucrss/bdiv_even25000/ -t $PWD/ucrss/rep_set.tre -m $PWD/map.txt -aO8

Again the -aO8 specifies that the job should be run in parallel on 8 processors. Adjust this according to your resources. When this completes you can open 3D interactive PCoA plots by opening the file bdiv_even25000/unweighted_unifrac_3d_continuous/unweighted_unifrac_pc_3D_PCoA_plots.html in a web browser. This may take a minute to load for very large sets of samples.

Filtering an open-reference OTU table to reference OTUs only

There are cases where you may be interested in working with the closed reference subset of your open reference OTU table (meaning only those OTUs that hit the reference collection, excluding the new OTUs). Following from the above commands, to do that you can filter the new OTUs from the OTU table with the following command:

filter_otus_from_otu_table.py -i $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.biom -o $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.reference_only.biom --negate_ids_to_exclude -e $PWD/refseqs.fna

What this does is filter exclude all OTUs with identifiers that are not present in $PWD/refseqs.fna, so all of the new OTUs.

Subsampled OTU picking workflow evaluation

Several analyses were performed to confirm that results are comparable between the subsampled open-reference OTU picking workflow and the standard open-reference OTU picking workflow. These include analyses on two different data sets: one host-associated (the Costello Whole Body study) and one free-living (the Lauber 88 soils study). These two were chosen as Greengenes (the reference set being used) is known to be biased toward human-associated microbes, so I wanted to confirm that the method works when few sequences fail to hit the reference set (whole body) and when many sequences fail to hit the reference set (88 soils).

Several tests were performed:
  • beta diversity (procrustes analysis to compare subsampled OTU results to de novo, open-reference, and closed-reference OTU picking)
  • alpha diversity (test for correlation in observed OTU count between subsampled OTU results and de novo, open-reference, and closed-reference OTU picking)
  • otu category significance (reviewed significant OTUs - need a good way to quantitate this)

For all analyses, sequences that fail to align with PyNAST and singleton OTUs were discarded (these are defaults in the subsampled OTU picking workflow).

88 soils analysis

This analysis is based on the data presented in the Lauber 88 soils study.

Alpha diversity

Here I checked whether the subsampled reference OTU alpha diversities for all samples were correlated with the de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking alpha diversities. The observed species/OTUs metric was calculated on add data sets (alpha_diversity.py -m observed_species), and the Pearson correlations were computed for subsampled reference OTU picking with the three other sets of values.

Results

  • subsampled open-reference OTU picking versus de novo OTU picking: r=0.995 p=4.836e-88
  • subsampled open-reference OTU picking versus standard open-reference OTU picking: r=1.000 p=0.000
  • subsampled open-reference OTU picking versus closed-reference OTU picking: r=0.8634 p=1.405e-27

Conclusions

Subsampled open-reference OTU picking alpha diversity values are significantly correlated with de novo, standard open-reference, and closed-reference OTU picking results. This suggests that we will derive the same biological conclusions between regarding alpha diversity when using the subsampled OTU picking workflow.

Beta diversity

Here I checked whether Procrustes comparisons of unweighted UniFrac PCoA plots between subsampled open-reference OTU picking and de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking yield significant results. This was calculated using transform_coordinate_matrices.py which is described in the Procrustes tutorial. p-values are based on 1000 Monte Carlo iterations.

Results

  • subsampled open-reference OTU picking versus de novo OTU picking: M2=0.009 p<0.001
  • subsampled open-reference OTU picking versus standard open-reference OTU picking: M2=0.007 p<0.001
  • subsampled open-reference OTU picking versus closed-reference OTU picking: M2=0.039 p<0.001

Conclusions

Procrustes results are highly significant for the three comparisons, suggesting that we will derive the same biological conclusions regardless of which of these OTU picking workflows is used.

OTU category significance

Here I confirm that the same taxonomy groups are identified as significantly different across the pH gradient in these soils using ANOVA, regardless of which OTU picking workflow is applied. These results were computed with the otu_category_significance.py script. To define a category for this test I binned the pH values by truncating the values to integers (so 5.0, 5.3, and 5.9 are all binned to pH 5) and using this binned pH as the category. Since I’m just looking for consistent results across the different OTU picking methods I don’t think it’s important that this isn’t the most biologically relevant binning strategy. Note that OTU ids are not directly comparable across all analyses, so it is best to compare the taxonomies.

Results

Top 5 OTUs that differ across bins for subsampled open-reference OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
New.CleanUp.ReferenceOTU26927 1.99e-11 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae
New.CleanUp.ReferenceOTU34053 7.06e-09 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae
212596 9.26e-09 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae
112859 1.22e-08 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Hyphomicrobiaceae
New.CleanUp.ReferenceOTU36189 4.35e-08 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Rubrobacterales; f__Rubrobacteraceae

Top 5 OTUs that differ across bins for de novo OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
26819 3.19e-11 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae
28062 5.92e-10 k__Bacteria; p__Acidobacteria; c__; o__; f__Koribacteraceae
35264 2.43e-09 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae
45059 5.48e-09 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__; f__
7687 2.056e-08 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae

Top 5 OTUs that differ across bins for standard open-reference OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
DeNovoOTU26928 1.99e-11 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae
DeNovoOTU34054 7.06e-09 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae
212596 9.26e-09 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae
112859 1.22e-08 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Hyphomicrobiaceae
DeNovoOTU36190 4.35e-08 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Rubrobacterales; f__Rubrobacteraceae

Top 5 OTUs that differ across bins for closed-reference OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
212596 4.03e-09 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__CandidatusSolibacter; s__
112859 4.62e-09 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__; g__; s__
544749 5.56e-08 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae; g__; s__
541300 1.28e-07 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__CandidatusSolibacter; s__
563862 1.95e-07 k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__CandidatusSolibacter; s__

Conclusions

In lieu of a solid statistical approach to compare these results, the results appear consistent across the different OTU picking workflows. The standard open-reference and subsampled open-reference are remarkably consistent.

Additional sanity check: is the new reference dataset sane?

To confirm that the new reference data set works as expected, I applied standard open-reference OTU picking on the original input sequences against the new reference collection generated by the subsampled OTU analysis. The idea here is that most reads should now hit the reference collection. A number of reads still fail, but on close investigation these turn out to all cluster into singleton OTUs. So, this is expected as singletons are not included in the reference collection (possible to adjust this with the --min_otu_size parameter [default = 2]). The new reference collection that is generated does appear to be sane. The command used for this analysis was:

pick_otus_through_otu_table.py -i /home/ubuntu/data/lauber_88soils/seqs.fna -o /home/ubuntu/data/lauber_88soils/subsample_ref_otus_eval/ucr97_v_new_ref/ -p /home/ubuntu/data/lauber_88soils/subsample_ref_otus_eval/ucr_v_newref_params.txt -aO 3

The parameters file (-p) for this analysis contained the following lines:

pick_otus:otu_picking_method uclust_ref
pick_otus:refseqs_fp /home/ubuntu/data/lauber_88soils/subsample_ref_otus_eval/prefilter60/new_refseqs.fna
pick_otus:enable_rev_strand_match True

Whole body analysis

This analysis is based on the data presented in the Costello Whole Body study.

Alpha diversity

Here I checked whether the subsampled reference OTU alpha diversities for all samples were correlated with the de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking alpha diversities. The observed species/OTUs metric was calculated on add data sets (alpha_diversity.py -m observed_species), and the Pearson correlations were computed for subsampled reference OTU picking with the three other sets of values.

Results

  • subsampled open-reference OTU picking versus de novo OTU picking: r=0.99 p=0.0
  • subsampled open-reference OTU picking versus standard open-reference OTU picking: r=1.0 p=0.0
  • subsampled open-reference OTU picking versus closed-reference OTU picking: r=0.95 p=0.0

Conclusions

Subsampled open-reference OTU picking alpha diversity values are significantly correlated with de novo, standard open-reference, and closed-reference OTU picking results. This suggests that we will derive the same biological conclusions between regarding alpha diversity when using the subsampled OTU picking workflow.

Beta diversity

Here I checked whether Procrustes comparisons of unweighted UniFrac PCoA plots between subsampled open-reference OTU picking and de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking yield significant results. This was calculated using transform_coordinate_matrices.py which is described in the Procrustes tutorial. p-values are based on 1000 Monte Carlo iterations.

Results

  • subsampled open-reference OTU picking versus de novo OTU picking: M2=0.056 p<0.001
  • subsampled open-reference OTU picking versus standard open-reference OTU picking: M2=0.053 p<0.001
  • subsampled open-reference OTU picking versus closed-reference OTU picking: M2=0.072 p<0.001

Conclusions

Procrustes results are highly significant for the three comparisons, suggesting that we will derive the same biological conclusions regardless of which of these OTU picking workflows is used.

OTU category significance

Here I confirm that the same taxonomy groups are identified as significantly different across the body sites in this study using ANOVA, regardless of which OTU picking workflow is applied. These results were computed with the otu_category_significance.py script. Note that OTU ids are not directly comparable across all analyses, so it is best to compare the taxonomies.

Results

Top 5 OTUs that differ across bins for subsampled open-reference OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
470747 5.37e-151 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae
471122 4.35e-143 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae
470477 4.14e-135 k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae
94166 1.61e-125 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae
442949 1.33e-124 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae

Top 5 OTUs that differ across bins for de novo OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
6389 4.54e-148 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae
17234 1.16e-141 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae
18227 3.05e-139 k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae
7262 1.22e-134 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae
18575 2.74e-122 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae

Top 5 OTUs that differ across bins for standard open-reference OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
470747 5.37e-151 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae
471122 4.35e-143 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae
470477 4.14e-135 k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae
94166 1.61e-125 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae
442949 1.33e-124 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae

Top 5 OTUs that differ across bins for closed-reference OTU picking:

OTU ID Bonferroni-adjusted p-value Taxonomy
470747 3.25e-150 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae
471122 2.20e-148 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae
470477 2.51e-138 k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae
94166 4.30e-130 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae
442949 2.04e-121 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae

Conclusions

In lieu of a solid statistical approach to compare these results, the results are remarkably consistent across the different OTU picking workflows.

Additional sanity check: what reads are being discarded by the prefilter?

To investigate what reads get discarded at the prefilter stage, I evaluated a subset of the reads discarded when the prefilter was set to 80% (--prefilter_percent_id 0.80) versus when the prefilter was set to 60% (default).

Sequences filtered at 80% but not at 60%

These three have high percent id matches to 16S sequences in NCBI:

>F12Pinl.140479_129272 FFLHOYS02GCJLO orig_bc=ATACAGAGCTCC new_bc=ATACAGAGCTCC bc_diffs=0
CTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCGAAAAGACCAGCTAAGCATCATTGGCTTGGTCAGCCTTTACCTAACCAACTACCTGATACTACGTGGGCTCATCGAACAGCGCGAATTAGCTTGCTTTATGAATTATTCAGGATTTGGAGTGAACTATTCGGCAGATTCCCACGCGTTACGCACCCGTTCGCCACTTTGCTTG
>F32Indr.140459_1174716 FFO92CG02IYZBA orig_bc=GCTATCACGAGT new_bc=GCTATCACGAGT bc_diffs=0
CCGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCGAAAAGACCAGCTAAGCATCATTGGCTTGGTCAGCCTTTACCTGACCAACTACCTAATACTACGCAGGCTCATCAAACAGCGCTTTTTAGCTTTCTTCAGGATTTGGCCCGAACTGTTCGGCAGATTCCCACGCGTTACGCACCCGTTCGCCACTTTGTTCTCAACTGTTCCCACCTCCTGGGCGAGA
>F32Forr.140528_1210712 FFO92CG02IKGYS orig_bc=GCGTTACACACA new_bc=GCGTTACACACA bc_diffs=0
CCGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCGAAAAGACCAGCTAAGCATCATTGGCTTGGTCAGCCTTTACCTGACCAACTACCTAATACTACGCAGGCTCATCAAACAGCGCTTTTGAGCTTTCTTCAGGATTTGGCCCGAACTGTTCGGCAGATTCCCACGCGTTACGCACCCGTTCGCCACTTTGTTCTCAACTATTCCGATTCTTTTTTCGGTAGGCCGTTA

Sequences filtered 80% and at 60%

These three reads hit a small fragment, a human sequence, and nothing in NCBI, respectively:

>M22Pinr.140692_1148864 FFO92CG01EQIWQ orig_bc=CGCACATGTTAT new_bc=CGCACATGTTAT bc_diffs=0
GGAAAAGGGAAAAACAGATGAGACAAATAGAAAACAAATAGCAAATTAGTAGGTGTTAACATGACTTTATCAATAATTACATCAAATGTAGATGATGTTAACCATGGATTGACAAACTTTTTCTTTATAGGACCAGACAGTCAATATTTTAGGTCTTTGAGGCCATATGGTATCTGTCATAACCACTCAACTGAGCCAGGATCAAACTCTGA
>F31Nstl.140789_1153834 FFO92CG02FSK33 orig_bc=GCAGCCGAGTAT new_bc=GCAGCCGAGTAT bc_diffs=0
CANNOT INCLUDE THIS READ DUE TO IRB RESTRICTIONS
>F32Nstl.140804_1160735 FFO92CG01BRQNZ orig_bc=GCTGCTGCAATA new_bc=GCTGCTGCAATA bc_diffs=0
CTGAAACCCTGGGTCACCAAAAGGCAGGAGGAGGAGGGACAGGGCAAGGCAGGGGAAGAGAGGGGAGGCTGACTCACATACACACATATGCATGCACACATCACACCCACATTCATGTACACACACACAGATTCACATGCATGCACAGCACAATCGCACACTTGTATACACACACAGGCACA

Conclusions

Based on this analysis (and currently unpublished data – will fill in when available), a threshold of 60% was chosen as the default value for discarding sequences that are likely not rRNA.

Using the subsampled open-reference OTU picking workflow in iterative mode

The subsampled open-reference OTU picking workflow can be run in iterative mode to support multiple different sequence collections, such as several HiSeq runs. In iterative mode, the list of sequence files will be processed in order, and the new reference sequences generated at each step will be used as the reference collection for the subsequent step. After all input collections have been processed a single OTU table and tree, covering all of the input collections, will be generated.

To apply this analysis to seqs1.fna and seqs2.fna in iterative mode, picking OTUs against the reference collection refseqs.fna you can run the following command.

To apply this analysis to seqs1.fna, picking OTUs against the reference collection refseqs.fna you can run the following command. Your parameters file should contain the following:

pick_otus:otu_picking_method uclust_ref
pick_otus:enable_rev_strand_match True

You should always use full paths which are represented here by $PWD, but will usually look something like /home/ubuntu/my_data/ (in other words, they should start with a /). In this example your input sequences (seqs1.fna), and your metadata mapping file (map.txt) are all in the same directory represented by $PWD. If you work from the directory containing those files, you can leave $PWD in the commands instead of specifying the full paths:

pick_subsampled_reference_otus_through_otu_table.py -i $PWD/seqs1.fna,$PWD/seqs2.fna -r $PWD/refseqs.fna -o $PWD/ucrss_iter/ -aO 8 -p $PWD/ucrss_params.txt

This command should be run in parallel. Each job will need approximately 4GB of RAM, so if running on EC2 and you want to start 8 parallel jobs (recommended setting for EC2), your instance type should be m2.4xlarge. The -aO 8 specifies that we want to start 8 parallel jobs - adjust this according to the resources you have available.

After iterative OTU picking you can continue on with beta diversity (and other) analyses as described here.


sampledoc