A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis
© The Author(s) 2017
Received: 13 April 2017
Accepted: 21 July 2017
Published: 14 August 2017
Advances in next-generation sequencing technologies have allowed for detailed, molecular-based studies of microbial communities such as the human gut, soil, and ocean waters. Sequencing of the 16S rRNA gene, specific to prokaryotes, using universal PCR primers has become a common approach to studying the composition of these microbiota. However, the bioinformatic processing of the resulting millions of DNA sequences can be challenging, and a standardized protocol would aid in reproducible analyses.
The short-read library 16S rRNA gene sequencing pipeline (sl1p, pronounced “slip”) was designed with the purpose of mitigating this lack of reproducibility by combining pre-existing tools into a computational pipeline. This pipeline automates the processing of raw 16S rRNA gene sequencing data to create human-readable tables, graphs, and figures to make the collected data more readily accessible.
Data generated from mock communities were compared using eight OTU clustering algorithms, two taxon assignment approaches, and three 16S rRNA gene reference databases. While all of these algorithms and options are available to sl1p users, through testing with human-associated mock communities, AbundantOTU+, the RDP Classifier, and the Greengenes 2011 reference database were chosen as sl1p’s defaults based on their ability to best represent the known input communities.
sl1p promotes reproducible research by providing a comprehensive log file, and reduces the computational knowledge needed by the user to process next-generation sequencing data. sl1p is freely available at https://bitbucket.org/fwhelan/sl1p.
The recent surge of next-generation sequencing technologies have allowed the scientific community to use marker genes, most popular of which being the 16S rRNA gene, to more thoroughly understand mixed bacterial communities (i.e., microbiomes). However, the adoption of any new technology requires standards and quality control. Alongside a plethora of 16S rRNA gene amplicon studies, quality control efforts have addressed the standardization of experimental and bioinformatic methods. For example, laboratory standards have been proposed for the preparation and storage of biological samples [1–3] as well as procedures for the isolation and sequencing of DNA which mitigate environmental contamination [4, 5]. Sequencing controls have greatly reduced variability between laboratories and datasets . Similarly, efforts have been made to standardize the bioinformatic processing of amplicon sequencing results [6, 7]. Next-generation sequencing technologies are subject to varying levels of sequencing error; traditionally, processing of amplicon sequencing data has involved filtering based on input sequence quality, followed by clustering of sequences into operational taxonomic units (OTUs) which are given a taxonomic label based on their similarity to a known database (for e.g. [8–10]). Choice of algorithms for quality filtering, OTU clustering, and taxonomic assignment have been shown to affect the downstream analysis of biologically meaningful results .
OTU clustering, typically computed at 97% sequence similarity, can be divided by approach. Reference-based (or phylotyping) approaches, such as BLAST  and UCLUST-reference , compare input sequences to a reference database. In contrast, de novo-based approaches are independent of a reference set. De novo approaches include hierarchical clustering methods such as Mothur’s average linkage algorithm , and ESPIRIT , as well as greedy algorithms such as CD-HIT [15, 16], DNACLUST , UPARSE , and AbundantOTU+ . Similarly, choice of taxonomic assignment algorithm and reference database also vary across 16S rRNA amplicon studies.
Recent benchmark studies have helped identify some of the most accurate methods in each of these categories. For example, Kopylova et al. identified a series of clustering methods, including UPARSE and USEARCH, which outperformed the widely used UCLUST algorithm . Schloss and colleagues have also presented numerous comparisons of OTU clustering algorithms to find that de novo methods out perform reference-based methods [20, 21] and, more specifically, that the average neighbour algorithm often outperforms all others [20, 22, 23]. Some comparisons of taxonomic methods have also been performed (for e.g., ).
Without a comprehensive workflow, such a surplus of available methods for 16S rRNA gene data processing makes it difficult to identify the most accurate approaches. Further, because each step has been developed independently, processing often involves file and command line manipulations between steps; conducting these manipulations in high-throughput is often inaccessible to a traditionally trained microbiologist, and makes it difficult to reproduce or extend data analyses. Widely used and important tools, such as QIIME  and Mothur , have aided in these issues; however, their step-by-step approach and various parameters represent a significant barrier to effective amplicon data processing and do not fully mitigate issues of reproducibility. To combat this need for ease-of-use, reproducible data processing, and want of a non-biased assessment of processing options, we developed the short-read library 16S rRNA gene sequencing pipeline (sl1p, pronounced “slip”), a 16S rRNA data processing software. sl1p takes Illumina-generated FASTQ files as input and automates all data processing to generate a reproducible OTU table with taxonomic assignments. This pipeline is compatible with any primer set or amplicon gene, and currently offers access to eight OTU clustering algorithms, two taxonomic assignment options, three 16S rRNA gene reference databases, and two phylogenetic outputs. As presented here, the default processing steps and software used in sl1p have been determined to be the most accurate available approaches based on their assessment with synthetic communities generated as part of the Human Microbiome Project (HMP) , and a set of 190 individually picked isolates. All steps in data processing are recorded by sl1p in a log file for future reference and reproducibility.
sl1p is a tool designed to be accessible to the microbiologist without detailed bioinformatic training; as such, it is fully automated, needing one line input from the user upon startup. Further, the output of sl1p includes an R markdown file with the appropriate code to visualize read counts per sample, taxonomic assignments, α-, and β-diversity from which the user can begin their own analyses. sl1p is freely available at https://bitbucket.org/fwhelan/sl1p.
The sl1p pipeline
During initialization, the user can use command line flags to deviate from sl1p’s default functionality (Additional file 1). By default, quality filtering consists of cutadapt  to trim the PCR primers from the FASTQ input, PANDAseq (version 2.9)  to align paired-end reads, sickle (https://github.com/najoshi/sickle; version 1.33) to quality trim the resulting pairs, and USEARCH , as implemented in QIIME (v1.9.1) , to identify and remove chimeric sequences. Users have the choice of eight OTU clustering approaches: five greedy algorithms including AbundantOTU+ 0.93b (default; ), CD-HIT 3.1.1 [15, 16], DNACLUST (release 3) , UCLUST v1.2.22q , and UPARSE (USEARCH version 8.0.1623) , and two reference-based approaches, BLAST 2.2.22  and UCLUST , which can either be strictly closed (UCLUST-ref-strict) or conduct closed clustering followed by de novo on any leftover sequence not matching the reference database (UCLUST-ref). Taxonomic assignment (and OTU clustering, where appropriate) can be assigned using two methods, BLAST or the RDP Classifier 2.2 (default; ), against three reference databases: Greengenes Feb. 2011 (default), Greengenes Aug. 2013 , and Silva Release 111 . Finally, OTU tables, phylogenies, and preliminary analyses are conducted using QIIME and R (v3.3.1). Importantly, as part of sl1p’s commandline options, the user can choose to run all possible combinations of OTU clustering algorithms, taxonomic assignment methods, and choice of reference databases automatically with one command, making comparisons of available methods reproducible and easy to approach.
The sl1p pipeline is open source and publicly available at https://bitbucket.org/fwhelan/sl1p. The pipeline is written in Perl and consists of one main script which calls on auxiliary scripts to aid in reformatting data between steps as necessary. Accompanying setup and install scripts are provided to download and install sl1p.
Generation of test datasets
The Human Microbiome Project Mock Communities (HMP-mock)
Genomic DNA of two unique representations of a 20 member mock community generated as part of the Human Microbiome Project  were obtained from BEI Resources (Catalog Nos. HM-782D and HM-783D; ATCC, Manassas, VA). The first sample (HMP-mock1) is an even distribution of the 20 bacterial organisms from 17 genera, whereas the second (HMP-mock2) is a staggered distribution of the same organisms . For each sample, 3 PCR replicates were generated by using 1 μl of genomic DNA PCR amplified using 1 μl of dNTPs, 0.25 μl of Taq polymerase (Life Technologies, Carlsbad, CA) and 5 μl of PCR primers designed for the v3 region of 16S rRNA gene . These amplification products were then split across two runs of the Illumina sequencer to generate sequencing replicates. Base calling was performed using CASAVA (v1.8.2). Sequencing depth ranged from 5917 to 113,084 reads with an average of 57,257. A negative PCR control was generated in parallel.
Single and Combined Isolate Controls (URTCul)
One hundred ninety single colonies were picked from a collection of upper respiratory tract culture isolates (URTCul) and restreaked until pure on appropriate solid agar plates as described in . Once pure, isolates were picked directly into 5% Chelex, boiled, and centrifuged at 13,000 rpm for 5 min. 5 μl of the supernatant was used as template for a 50 μl PCR reaction of the variable regions 8F-926R [37, 38] of the 16S rRNA gene and sequenced using Sanger sequencing (amplicon length = 918 bps). The resulting Sanger sequences for each isolate were taxonomically assigned using independent blastn searches against NCBI’s RefSeq database. Taxonomic assignments were made to the species level; in the case of multiple species matching with percent identity within <1% of each other, multiple species names were included in the taxonomic assignment are presented (e.g. g__Streptococcus;s__infantis_mitis). This dataset contained 8 unique genera and 33 unique species.
After amplification and Illumina sequencing of each isolate individually, the raw FASTQ reads were combined in silico to create one sample (Fig. 2, URTCul-combined). Further, the taxonomic assignments of the Sanger sequencing results were consulted to create a second in silico sample in which only uniquely identified taxa were combined (Fig. 2, URTCul-uniques). The artificial sequencing depths of these 2 samples were 2148 and 423, respectively. These data are publicly available (BioProject ID PRJNA 381557).
Publicly available dataset
Data processing comparisons
The short-read library 16S rRNA gene sequencing pipeline (sl1p) was developed as an automated and reproducible 16S rRNA gene sequencing processing tool. The output of this tool consists of an R markdown file and accompanying HTML output showing preliminary analyses of the data (e.g. Additional file 4). In order to determine the most accurate default settings of this tool, we systematically tested various methods within the sl1p workflow using 2 approaches (i.) 2 mock community samples from the HMP (HMP-mock), and (ii.) 190 single bacterial isolates (URTCul-singles) and their combination as a totality of the 190 sequencing results (URTCul-combined) or the combination of unique taxa from this pool (URTCul-uniques).
sl1p removes low quality reads effectively
One of the consequences of using next-generation sequencing technologies in high-throughput is the propensity for sequencing error. For instance, Illumina technology is known to have an increased error rate towards the 3’ end of the read, and that the reverse read is generally of poorer quality then the forward. Mitigating this error prior to OTU generation and taxonomic assignment is essential in order to refrain from the generation of spurious OTUs.
sl1p utilizes a multi-step approach to quality control. Immediately following removal of sequencing primers with cutadapt, forward and reverse reads are assembled using PANDAseq. While many options are available for the merging of paired-end reads, PANDAseq includes both quality filtering and read assembly. Across our PCR and sequencing replicates of HMP-mock, approximately 12.5% of raw input reads were culled at this step (Fig. 3a); the majority of culled reads were due to mis-alignment of forward and reverse reads. Following, cutadapt was used to remove any reads containing Illumina annealing or sequencing primers. While this step removed only 2.7% of the HMP-mock input (Fig. 3a), we have found it to be an important way of removing erroneous sequencing results, and a measure of an infrequent poor Illumina sequencing run. Next, sickle was used to trim quality sequence (and to remove any reads < 100 bp post-trimming). It is at this stage where the most quality-filtering is done, with an average 29.6% read loss (Fig. 3a). However, it is this strict quality filtering that results in clean, high-quality paired-end reads (Fig. 3b); when we compare this strict threshold with lower quality cutoffs, we begin to see a decline in the final paired-end read quality as the cutoff drops below 30 (Additional file 5).
The last step in sl1p’s quality control workflow is chimera checking. Because 16S rRNA gene amplicon data is generated via PCR amplification, chimeric sequences can be an issue, especially if the PCR amplification reaction traverses a highly conserved region as is the case for multi-variable region amplicons. As such, sl1p uses QIIME’s implementation of USEARCH to conduct chimera checking on the generated paired-end reads (Fig. 1 & Additional file 1). This approach is database-dependent; however no significant differences were observed between sl1p’s 3 options for reference database (removal of 0.36%, 0.4%, and 0.39% of reads for Greengenes 2011, 2013, and Silva Release 111, respectively on the HMP-mock data).
Following sl1p’s quality control workflow, an average of 55.2% of the raw input HMP-mock reads remain. This percentage is higher to that found with the URTCul dataset (mean of 30.4%); a greater number of unassembled paired-end reads (57.9% of raw input removed during PANDAseq alignment) were observed with the URTCul v34 sequencing, possibly due to the shorter overlap in the target sequence (Additional file 6).
OTU clustering algorithms produce varying numbers of OTUs compared to known input communities
Clustering of input reads into Operational Taxonomic Units (OTUs) has been the most well-studied effect on processed reads [11, 22, 45–48]. OTUs are typically clustered based on a 97% threshold based upon imperial studies identifying this as the differentiating threshold of species ; however when sequencing is restricted to small regions within the gene, this threshold may provide differentiation between the genus and species level, depending on the organism in question .
CPU time for OTU clustering approaches implemented in sl1p
OTU picking approach
CPU time (in mins)
Choice of data processing algorithms affect taxonomic assignment
To further quantify these differences, comparisons can be made between the known taxa and relative abundance compared to each set of OTU picking, taxonomic assignment, and reference database options (Additional files 11 and 12). At this level of resolution, independent of the number of OTUs assigned to each genera, we can see that the proportions of each genera output from sl1p reflect the expected proportions in each of the HMP-mock samples. However, in some sets of processing options, some mistakes are made in taxonomic assignment. The combination of the RDP Classifier and Greengenes 2013 database, for example, incorrectly identifies genus Flexispira in place of the Heliobacter genus in HMP-mock1 (Additional file 11). In other cases, the correct assignment is made, though more conservatively left at the family, order, or class level (Additional files 11 and 12); for example, Greengenes 2013 using BLAST as the taxon assignment algorithm assigns some OTUs to the class Bacilli, failing to differentiate between the Bacillus, Listeria, Staphylococcus, Enterococcus, and Streptococcus species present in HMP-mock1. Overall, across all methods and the HMP-mock samples, BLAST in combination with Greengenes 2011 was the only combination to provide no errors in taxonomic assignment at the genus level. This accuracy comes with a small increase in computing time compared with the RDP Classifier (data not shown).
Choice of processing methods affect biologically relevant results of 16S rRNA gene sequencing
Like all bioinformatic pipelines and processing workflows, what is most important in the output is the reflection of the true underlying biology in the results. While 16S rRNA sequencing data can be analyzed in a number of ways in order to answer many unique research questions, calculations of α and β diversity are often fundamental to analyses. α diversity, or within sample diversity, is a calculation performed on each sample within a dataset. This metric can be calculated using different indices depending on the question at hand. Popular approaches include Shannon and Simpson diversity as these indices incorporate both evenness and richness of the community into their calculations [51, 52]. Other metrics, such as Chao1, are estimates of species richness . Using output of the sl1p processing pipeline, we calculated the Shannon, Chao1, and Simpson diversity metrics on the HMP-mock data (Fig. 6a & Additional file 13). Here, only the OTU clustering algorithm contributes to the estimated richness and evenness of OTU composition, except in the case of reference-based algorithms which are database-dependent (Additional file 1). We observe that the output of α diversity metrics is dependent on the processing methods employed. The range of calculated Shannon diversity scores within the same sample processed using different commonly-used approaches is greater than 1.0 (range 1.54-2.84) (Fig. 6a). Similarly, Chao1 estimates species richness anywhere from 20 to 2,451 depending on data processing options employed; Simpson diversity, in contrast, has much less observed variability between OTU clustering methods and reference database choice. Interestingly, these metrics are also affected by changes in read depth as seen in the variation between sequencing replicates (Additional file 13a); rarefaction of reads somewhat reduces this variation depending on the metric employed (Additional file 13b).
β, or between-sample, diversity is often used as a measure of difference between sample states (e.g. health and disease). Similar to α diversity, there are a variety of distance metrics one can utilize depending on the question at hand. A popular set of these metrics use the phylogenetic distances between OTUs as a contributor to the distance score. Using sl1p, we discovered that the output of these metrics are dependent on how the accompanying phylogenetic tree is generated Fig. 6b-c). Comparisons using Procrustes analysis show substantial differences in the PCoA plots generated using the weighted UniFrac method with different phylogenetic inputs (Fig. 6b-c). One approach recommended in the QIIME workflow, is the use of PyNAST  and FastTree  to create a multiple sequence alignment and phylogeny of the representative sequence from each OTU in the community (Fig. 6d, default phylogeny). However, because this phylogeny is reliant on the sequence diversity within the sequenced variable region, which is often ≤100–300 bp in length, it often does not reflect the true bacterial phylogeny but instead creates paraphyletic phyla (Fig. 6d). Because of this, sl1p generates an alternate phylogeny which represents the Greengenes reference 16S rRNA gene phylogeny trimmed to those OTUs present within the given dataset. Beginning with a curated phylogeny ensures that the phylogenetic relationships between organisms within a given sample set are preserved. Using these phylogenies to generate the Weighted and Unweighted UniFrac metrics, summarized here as Principal Coordinate Analyses (PCoAs), results in differences in the calculated distance between the samples within this community (Fig. 6b-c); Procrustes analysis was used to visualize the differences between these phylogenetic inputs. These results indicate that processing options greatly affect the output and potential interpretation of 16S rRNA gene sequencing results.
sl1p is an automated, reproducible 16S rRNA gene sequencing processing pipeline that makes 16S rRNA data processing accessible to those without formal bioinformatic training. sl1p is not restricted by variable region or choice of PCR primer set. In this study, we outline the workflow of this tool, which can be broken down into 3 main steps: FASTQ quality control, OTU clustering, and taxonomic assignment (Fig. 1). We show how sl1p can aid in the comparison of multiple options and the effects they have on downstream analyses. The quality control workflow within sl1p was determined based on the parameters necessary to obtain high quality base pair assembly along the length of each paired-end sequence (Fig. 3). In order to compare the effect of various OTU picking approaches and taxonomic assignment methods, mock communities were employed. Comparisons of OTU clustering algorithms displayed a wide range of predicted OTUs, generally over-estimating diversity. This, as well as the under-estimations made by UPARSE (Fig. 4), have been previously shown [11, 20, 23]. Further, the choice of taxonomic assignment algorithm and reference database greatly influenced the predicted taxonomic composition of the communities (Fig. 5).
Most importantly, the use of sl1p to compare data processing outputs (OTU tables, taxonomic summaries, and phylogenies) recognizes the effect processing options have on biological analyses (Fig. 6). Popular α diversity metrics such as Shannon diversity are greatly affected by OTU clustering options and sequencing depth (Fig. 6a). These results have implications on the interpretation of microbiome studies across manuscripts and research groups which may process their data using different methods. Further, the differences between sequencing runs have implications for studies which are split across multiple sequencing runs due to size of the sample set. Importantly, the rarefaction of these data did not fully mitigate these effects (Additional file 13b). Further, the alternative phylogenetic representation of the OTU data generated by sl1p better describes the bacterial tree of life, allowing for more accurate β diversity distances to be calculated between samples, furthering our knowledge of differences between varying microbial communities.
The default parameters of sl1p were carefully chosen based on the analyses presented within this study. Of course, all algorithms and tools tested have their own merits and niches within this widely growing field; this is reflected in the fact that no set of tools out performed others in all circumstances (Figs. 4, 5 and 6). We chose AbundantOTU+ as the default OTU picking approach. AbundantOTU+ most closely predicted the correct number of OTUs within HMP-mock1, HMP-mock2, URTCul-combined, and URTCul-unique, without under-estimating diversity. AbundantOTU+ was also the tool able to correctly predict the highest number of single isolate samples in the URTCul-singles dataset. This method also performed well in tests of correctly identified taxa, including the Bray-Curtis dissimilarity comparisons. For choice of taxon assignment algorithm, we chose the RDP Classifier as sl1p’s default. This tool consistently calculated the most number of accurate URTCul-singles isolates, and out-performed or tied BLAST performance on Bray-Curtis dissimilarity comparisons in all cases except for HMP-mock1. These results mirror previous comparisons of taxonomic assignment as completed by Liu et al. . Lastly, Greengenes 2011 is sl1p’s default reference database based on its superior performance in the Bray-Curtis distance comparisons of the HMP-mock communities and as one of the best choices for genus-level taxon identification for the URTCul data. Even though the Greengenes 2013 reference database represents an update to the 2011 version, it often incorrectly predicted taxonomic assignments to the species level which were incorrect compared to the known composition of the mock communities.
It is important to note that these default parameters are based on mock communities of human-associated microbes and may not represent the best combination of tools in the study of other microbiota. The authors hope that by providing a pipeline in which multiple OTU picking, taxonomic assignment, and reference database options are easily accessible, that the user can choose to easily deviate from these defaults as they see fit. Further, as the field of microbiome research continues to grow new approaches to data processing can be implemented in sl1p and benchmarked against established approaches. Having a non-biased method for tool comparison will be important for the maturation of this field.
In conclusion, we present a 16S rRNA gene sequence processing workflow with the aim of generating the most biologically meaningful outputs for the furthering of 16S rRNA gene sequencing techniques and microbiome research in general.
The following reagents were obtained through BEI Resources, NIAID, NIH, as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Even, Low Concentration), v5.1L, for 16S rRNA Gene Sequencing, HM-782D & Genomic DNA from Microbial Mock Community B (Staggered, Low Concentration), v5.2L, for 16S rRNA Gene Sequencing, HM-783D. THE URTCul sequencing data was generously provided by Steve P. Bernier and Laura Rossi. The authors wish to thank the efforts of the HMP and the Surette laboratory for depositing sequence data in publicly available resources which allow studies like these. The authors also wish to thank the efforts of the Surette laboratory in helping to make this pipeline more effective and user-friendly.
This work was supported by grants from Cystic Fibrosis Canada to MGS. FJW is supported by a Cystic Fibrosis Canada Doctoral Studentship and a Canadian Institutes of Health Research (CIHR) Fredrick Banting and Charles Best Canadian Graduate Scholarship. MGS is supported as a Canada Research Chair in Interdisciplinary Microbiome Research.
Availability of data and materials
The datasets generated and/or analysed during the current study are available as follows: HMP-mock from ; URTCul in NCBI’s SRA repository with BioProject ID PRJNA 381557; the fecal dataset analysed as part of Fig. 6 is available in NCBI’s SRA repository with Bioproject Submission SUB2392090.
FJW wrote all of the computer code that makes up the sl1p software, conducted all analyses, and wrote the manuscript text. FJW and MGS contributed to the intellectual design of this manuscript. All authors approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Sinha R, Abnet CC, White O, Knight R, Huttenhower C. The microbiome quality control project: baseline study design and future directions. Genome Biol. 2015; 16(1):276.View ArticlePubMedPubMed CentralGoogle Scholar
- Dominiani C, Wu J, Hayes RB, Ahn J. Comparison of methods for fecal microbiome biospecimen collection. BMC Microbiol. 2014; 14:103.View ArticleGoogle Scholar
- Zhao J, Li J, Schloss PD, Kalikin LM, Raymond TA, Petrosino JF, Young VB, LiPuma JJ. Effect of sample storage conditions on culture-independent bacterial community measures in cystic fibrosis sputum specimens. J Clin Microbiol. 2011; 49(10):3717–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Knudsen BE, Bergmark L, Munk P, Lukjancenko O, Priemé A, Aarestrup FM, Pamp SJ. Impact of Sample Type and DNA Isolation Procedure on Genomic Inference of Microbiome Composition. mSystems. 2016; 1(5):00095–16.View ArticleGoogle Scholar
- Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P, Parkhill J, Loman NJ, Walker AW. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 2014; 12(1):87.View ArticlePubMedPubMed CentralGoogle Scholar
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010; 7(5):335–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009; 75(23):7537–41.View ArticlePubMedPubMed CentralGoogle Scholar
- The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012; 486(7402):207–14.View ArticlePubMed CentralGoogle Scholar
- Sze M, Schloss PD. Looking for a Signal in the Noise: Revisiting Obesity and the Microbiome. mBio. 2016; 7(4):e01018–16.View ArticlePubMedPubMed CentralGoogle Scholar
- Planer JD, Peng Y, Kau AL, Blanton LV, Ndao IM, Tarr PI, Warner BB, Gordon JI. Development of the gut microbiota and mucosal IgA responses in twins and gnotobiotic mice. Nature. 2016; 534(7606):263–6.PubMedPubMed CentralGoogle Scholar
- Kopylova E, Navas-Molina JA, Mercier C, Xu ZZ, Mahé F, He Y, Zhou HW, Rognes T, Caporaso JG, Knight R. Open-Source Sequence Clustering Methods Improve the State Of the Art. mSystems. 2016; 1(1):00003–15.View ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.View ArticlePubMedGoogle Scholar
- Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010; 26(19):2460–1.View ArticlePubMedGoogle Scholar
- Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W. ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res. 2009; 37(10):76–6.View ArticleGoogle Scholar
- Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22(13):1658–9.View ArticlePubMedGoogle Scholar
- Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–2.View ArticlePubMedPubMed CentralGoogle Scholar
- Ghodsi M, Liu B, Pop M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinforma. 2011; 12(1):271.View ArticleGoogle Scholar
- Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods. 2013; 10(10):996–8.View ArticlePubMedGoogle Scholar
- Ye Y. Identification and Quantification of Abundant Species from Pyrosequences of 16S rRNA by Consensus Alignment. In: Proceedings 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). New York: Institute of Electrical and Electronics Engineers (IEEE): 2011. p. 153–7.Google Scholar
- Westcott SL, Schloss PD. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ. 2015; 3:1487.View ArticleGoogle Scholar
- Jackson MA, Bell JT, Spector TD, Steves CJ. A heritability-based comparison of methods used to cluster 16S rRNA gene sequences into operational taxonomic units. PeerJ. 2016; 4:2341.View ArticleGoogle Scholar
- Schloss PD, Westcott SL. Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rRNA Gene Sequence Analysis. Appl Environ Microbiol. 2011; 77(10):3219–226.View ArticlePubMedPubMed CentralGoogle Scholar
- Schloss PD. Application of a Database-Independent Approach To Assess the Quality of Operational Taxonomic Unit Picking Methods. mSystems. 2016; 1(2):00027–16.View ArticleGoogle Scholar
- Mizrahi-Man O, Davenport ER, Gilad Y. Taxonomic Classification of Bacterial 16S rRNA Genes Using Short Sequencing Reads: Evaluation of Effective Study Designs. PLOS ONE. 2013; 8(1):53608.View ArticleGoogle Scholar
- The Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012; 486(7402):215–1.View ArticleGoogle Scholar
- Bartram AK, Lynch MDJ, Stearns JC, Moreno-Hagelsieb G, Neufeld JD. Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end illumina reads. Appl Environ Microbiol. 2011; 77(11):3846–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci. 2010; 108(Supplement 1):4516–22.PubMedPubMed CentralGoogle Scholar
- Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight R. Improved Bacterial 16S rRNA Gene (V4 and V4-5) and Fungal Internal Transcribed Spacer Marker Gene Primers for Microbial Community Surveys. mSystems. 2016; 1(1):00009–15.View ArticleGoogle Scholar
- Parada AE, Needham DM, Fuhrman JA. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol. 2016; 18(5):1403–1414.View ArticlePubMedGoogle Scholar
- Apprill A, McNally S, Parsons R, Weber L. Minor revision to V4 region SSU rRNA 806R gene primer greatly increases detection of SAR11 bacterioplankton. Aquat Microb Ecol. 2015; 75(2):129–37.View ArticleGoogle Scholar
- Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011; 17(1):10.View ArticleGoogle Scholar
- Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD. PANDAseq: paired-end assembler for illumina sequences. BMC Bioinforma. 2012; 13(1):31.View ArticleGoogle Scholar
- Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007; 73(16):5261–7.View ArticlePubMedPubMed CentralGoogle Scholar
- DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006; 72(7):5069–072.View ArticlePubMedPubMed CentralGoogle Scholar
- Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013; 41(Database issue):590–6.View ArticleGoogle Scholar
- Sibley CD, Grinwis ME, Field TR, Eshaghurshan CS, Faria MM, Dowd SE, et al.Culture enriched molecular profiling of the cystic fibrosis airway microbiome. PLoS One. Public Library of Science. 2011; 6:e22702. doi:10.1371/journal.pone.0022702.View ArticleGoogle Scholar
- Wang Y, Zhang ZS, Ruan JS, Wang YM, Ali SM. Investigation of actinomycete diversity in the tropical rainforests of Singapore. J Ind Microbiol Biotechnol. 1999; 23(3):178–87.View ArticleGoogle Scholar
- Muyzer G, de Waal EC, Uitterlinden AG. Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. Appl Environ Microbiol. 1993; 59(3):695–700.PubMedPubMed CentralGoogle Scholar
- Moayyedi P, Surette MG, Kim PT, Libertucci J, Wolfe M, Onischi C, Armstrong D, Marshall JK, Kassam Z, Reinisch W, Lee CH. Fecal Microbiota Transplantation Induces Remission in Patients With Active Ulcerative Colitis in a Randomized Controlled Trial. Gastroenterology. 2015; 149(1):102–1096.View ArticlePubMedGoogle Scholar
- McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLOS ONE. 2013; 8(4):61217.View ArticleGoogle Scholar
- Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer; 2009. http://0-www.springer.com.brum.beds.ac.uk/us/book/9780387981413.View ArticleGoogle Scholar
- Wickham H. Reshaping Data with the reshape Package. J Stat Softw. 2007; 21(12):1–20.View ArticleGoogle Scholar
- Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
- Asnicar F, Weingart G, Tickle TL, Huttenhower C, Segata N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ. 2015; 3:1029.View ArticleGoogle Scholar
- Huse SM, Welch DM, Morrison HG, Sogin ML. Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol. 2010; 12(7):1889–98.View ArticlePubMedPubMed CentralGoogle Scholar
- Barriuso J, Valverde JR, Mellado RP. Estimation of bacterial diversity using next generation sequencing of 16S rDNA: a comparison of different workflows. BMC Bioinforma. 2011; 12(1):473.View ArticleGoogle Scholar
- Sun Y, Cai Y, Huse SM, Knight R, Farmerie WG, Wang X, Mai V. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform. 2012; 13(1):107–21.View ArticlePubMedGoogle Scholar
- May A, Abeln S, Crielaard W, Heringa J, Brandt BW. Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations. Bioinformatics. 2014; 30(11):1530–8.View ArticlePubMedGoogle Scholar
- Konstantinidis KT, Tiedje JM. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci. 2005; 102(7):2567–572.View ArticlePubMedPubMed CentralGoogle Scholar
- Mahe F, Rognes T, Quince C, Vargas CD, Dunthorn M. Swarm : robust and fast clustering method for amplicon-based studies. PeerJ. 2014; 2(593):1–13.Google Scholar
- Shannon CE. A Mathematical Theory of Communication. Bell Syst Tech J. 1948; 27(3):379–423.View ArticleGoogle Scholar
- Simpson E. Measurement of Diversity. Nature. 1949; 163:688.View ArticleGoogle Scholar
- Chao A. Nonparametric Estimation of the Number of Classes in a Population. Scand J Stat. 1984; 11(4):265–70.Google Scholar
- Caporaso JG, Bittinger K, Bushman FD, DeSantis TZ, Andersen GL, Knight R. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics. 2010; 26(2):266–7.View ArticlePubMedGoogle Scholar
- Price MN, Dehal PS, Arkin AP. FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix. Mol Biol Evol. 2009; 26(7):1641–50.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu Z, DeSantis TZ, Andersen GL, Knight R. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res. 2008; 36:e120.View ArticlePubMedPubMed CentralGoogle Scholar