Introduction

Molecular approaches designed to describe prokaryotic diversity routinely rely on classifying heterogeneous nucleic acids amplified via universal 16S rRNA gene polymerase chain reaction (PCR). The resulting mixed amplicons can be quickly, but coarsely, categorized using terminal-restriction fragment length polymorphism (t-RFLP), single-strand conformation polymorphism (SSCP), or temperature/denaturing gradient gel electrophoresis (T/DGGE) [39]. Association of taxonomic nomenclature to each group may be accomplished through sequencing, but this requires additional labor to physically isolate each gene and does not scale well for large comparative studies such as environmental monitoring. Species richness may be predicted from a few hundred sequences, but reproducible discovery of species composition may require >10⁴ sequencing reactions per sample [14].

To increase the throughput of detection of microorganisms within complex samples, multiple DNA probes have been arrayed onto solid surfaces to allow for parallel, multispecies detection. Successful differentiation of specific collections of bacteria has been achieved by using 16S rRNA gene microarrays containing tens to hundreds of probes for Enterococcus [37], Cyanobacteria [2], nitrifying bacteria [33], fish pathogens [58], and other bacterial groups. Alternatively, protein-encoding genes have been targeted to survey environments using microarrays [63, 64]. Recently, high-density 16S rRNA gene microarrays have emerged in efforts to detect any bacterial type without a priori knowledge of the community structure. Two major challenges have impeded this goal: Probes must be designed that are sensitive to only a specified branch of the prokaryotic tree, and hybridization scoring algorithms are required to interpret probe responses into reliable identifications. If a single unique probe for a taxon cannot be found, several probes can be utilized in combination with rule-based scoring. By vastly increasing the total number of probes within a microarray, more taxa can be queried and detection confidence can be improved. Using this approach, it was shown that organisms from environmental samples were accurately classified into their respective orders using an array with 62,358 probes [12, 61].

In this study, a novel microarray containing 297,851 probes targeted to 16S rRNA genes was tested by using amplicons derived from soil, water, and aerosols. The community profiles derived from the hybridizations were compared to the results from cloning-and-sequencing the same amplicons. A fraction of the clones (8%) were sufficiently divergent from database sequences to be considered novel and were not identified by the array. The microarray results confirmed the majority of clone-detected subfamilies, but additionally showed greater amplicon diversity. Importantly, the microarray detected phyla that would have otherwise been overlooked if relying solely on the clone library. Three of these phyla have been confirmed with specific PCR amplification. The results illustrate the consequence of relying only on clone libraries or high-density 16S rRNA gene microarrays when profiling a microbial community.

Methods

Microarray Design

The microarray probe design approach previously described for differentiating Staphylococcaceae [9] was applied to all known 16S rRNA gene sequences containing at least 600 nucleotides. Briefly, sequences (Escherichia coli bp positions 47 to 1473) were extracted from a multiple sequence alignment composed of more than 30,000 records within the 15 March 2002 release of the 16S rRNA gene database, greengenes.lbl.gov [11]. This region was selected because it is bounded on both ends by universally conserved segments that can be used as PCR priming sites to amplify bacterial or archaeal [13] genomic material using only two to four primers. Putative chimeric sequences were filtered from the data set by using the software package Bellerophon [25], preventing them from being misconstrued as novel organisms [28]. Filtered sequences were clustered to enable each sequence of a cluster to be complementary to a set of perfectly matching (PM) probes. Putative amplicons were placed in the same cluster as a result of common 17-mers found in the sequence. The resulting 8935 clusters, each containing approximately 3% sequence divergence, were considered operational taxonomic units (OTUs) representing all 121 demarcated prokaryotic orders. The taxonomic family of each OTU was assigned according to the placement of its member organisms in Bergey's Taxonomic Outline [22]. The taxonomic outline as maintained by Hugenholtz [26] was consulted for phylogenetic classes containing uncultured environmental organisms or unclassified families belonging to named higher taxa. The OTUs comprising each family were clustered into subfamilies by transitive sequence identity according to a previously described method [9]. Altogether, 842 subfamilies were found. The taxonomic position of each OTU as well as the accompanying NCBI accession numbers of the sequences composing each OTU can be viewed at http://greengenes.lbl.gov/Download/Clones_v_Array/.

The objective of the probe selection strategy was to obtain an effective set of probes capable of correctly categorizing mixed amplicons into their proper OTU. For each OTU, a set of 11 or more specific 25-mers (probes) was sought that were prevalent in members of the OTU but dissimilar from sequences outside it. The average number of probes chosen for each OTU was 24. In the first step of probe selection for a particular OTU, each of the sequences in the OTU was separated into overlapping 25-mers, the potential targets. Then each potential target was matched to as many sequences of the OTU as possible. It was not adequate to use a text pattern search to match potential targets and sequences because partial gene sequences were included in the reference set. Therefore, the multiple sequence alignment provided by Greengenes was necessary to provide a discrete measurement of group size at each potential probe site. For example, if an OTU containing seven sequences possessed a probe site where one member was missing data, then the site-specific OTU size was only six. In ranking the possible targets, those having data for all members of that OTU were preferred over those found only in a fraction of the OTU members. In the second step, a subset of the prevalent targets was selected and reverse-complemented into probe orientation, avoiding those capable of mishybridization to an unintended amplicon. Probes presumed to have the capacity to mishybridize were those 25-mers that contained a central 17-mer matching sequences in more than one OTU [56]. Thus, probes that were unique to an OTU solely due to a distinctive base in one of the outer four bases were avoided. Also, probes with mishybridization potential to sequences having a common tree node near the root were favored over those with a common node near the terminal branch. Probes complementary to target sequences that were selected for fabrication are termed PM probes. 
As each PM probe was chosen, it was paired with a control 25-mer [mismatching probe (MM)], identical in all positions except the thirteenth base. The MM probe did not contain a central 17-mer complementary to sequences in any OTU. The PM probe, which complements the target, and its MM probe constitute a probe pair and were analyzed together. Sets of probes for each OTU can be viewed at: http://greengenes.lbl.gov/cgi-bin/nph-show_probes_2_otu_alignments.cgi.
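The two selection steps and the pairing of PM and MM probes can be illustrated with a minimal sketch. It is not the pipeline used in this study: it works on plain ungapped strings rather than the Greengenes alignment, it omits the prevalence ranking and the tree-based preference, and it does not check the MM probe against other OTUs. The function names are hypothetical, and because the substitution made at the thirteenth base is not specified above, complementing that base is assumed here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def candidate_targets(otu_seqs, other_seqs, k=25, core=17):
    """Return k-mers of an OTU whose central core-mer occurs in no other sequence."""
    # Collect every core-length word found outside the OTU.
    foreign = set()
    for s in other_seqs:
        foreign.update(kmers(s, core))
    flank = (k - core) // 2  # four outer bases on each side for 25/17
    keep = []
    for s in otu_seqs:
        for target in kmers(s, k):
            central = target[flank:flank + core]
            if central not in foreign and target not in keep:
                keep.append(target)
    return keep


def probe_pair(target):
    """Return (PM, MM) for a 25-mer target; MM differs from PM only at base 13."""
    pm = reverse_complement(target)
    mid = 12  # zero-based index of the thirteenth base
    # Assumption: the mismatch is introduced by complementing the central base.
    mm = pm[:mid] + pm[mid].translate(COMPLEMENT) + pm[mid + 1:]
    return pm, mm
```

A target that is distinctive only in its outer four bases is rejected by this filter, because its central 17-mer would still be found among the foreign words.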

The chosen oligonucleotides were synthesized by a photolithographic method at Affymetrix Inc. (Santa Clara, CA, USA) directly onto a 1.28 × 1.28 cm glass surface at an approximate density of 10,000 molecules/μm² [6]. The entire array of 506,944 probe features was arranged as a grid of 712 rows and columns. Thus, each unique probe sequence (feature) on the array occupied a square with an 18-μm side and had a copy number of roughly 3.2 × 10⁶. Of these features, 297,851 were oligonucleotide PM or MM probes with exact or inexact complementarity, respectively, to 16S rRNA genes. The remaining were used for image orientation, normalization controls, or for pathogen-specific signature amplicon detection using additional targeted regions of the chromosome [62]. Each high-density 16S rRNA gene microarray was designed with additional probes that: (1) target amplicons of prokaryotic metabolic genes spiked into the 16S rRNA gene amplicon mix in defined quantities just before fragmentation and (2) are complementary to prelabeled oligonucleotides added into the hybridization mix. The first control collectively tests the fragmentation, biotinylation, hybridization, staining, and scanning efficiency. It also allows the overall fluorescent intensity to be normalized across all the arrays in an experiment. The second control directly assays the hybridization, staining, and scanning.
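The feature count, feature size, and copy number quoted above follow from the array geometry, as the short calculation below shows. All inputs are taken from the text; the variable names are illustrative only.

```python
rows = cols = 712
features = rows * cols              # total probe features on the array

array_side_um = 1.28 * 10_000       # 1.28 cm expressed in micrometres
feature_side_um = array_side_um / rows

density_per_um2 = 10_000            # approximate molecules per square micrometre
copies_per_feature = density_per_um2 * 18 ** 2   # using the rounded 18-um side

probes_16s = 297_851
other_features = features - probes_16s   # orientation, controls, pathogen signatures

print(features)                     # 506944
print(round(feature_side_um, 1))    # 18.0
print(copies_per_feature)           # 3240000
print(other_features)               # 209093
```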

Environmental Sampling and DNA Extraction

Air samples were collected at a flow rate of approximately 10 L/min onto 1.0 μm polyethylene terephthalate (Celanex) filters (Hoechst-Celanese, Dallas, TX) over a 24-h period simultaneously from six locations in San Antonio, TX, USA. Sample filters were washed in 10 mL buffer (0.1 M sodium phosphate, 10 mM EDTA, pH 7.4, 0.01% Tween 20), and the suspension was stored frozen until needed. One 0.6-mL aliquot of wash was taken from each thawed filter wash and combined in a “day pool”. DNA was extracted by using a modification of a soil technique [46]. After centrifugation of the day pool at 16,000 × g for 25 min, the pellets were resuspended in 400 μL sodium phosphate buffer (100 mM NaH2PO4, pH 8) and transferred into two 2-mL silica bead lysis tubes containing 0.9 g of zirconia/silica lysis bead mix (0.3 g of 0.5 mm and 0.6 g of 0.1 mm). For each lysis tube, 300 μL buffered SDS [100 mM NaCl, 500 mM Tris pH 8, 10% (wt/vol) SDS] and 300 μL phenol/chloroform/isoamyl alcohol (25:24:1) were added. Lysis tubes were inverted and finger-flicked three times to mix the buffers before bead mill homogenization with a Bio101 Fast Prep 120 machine (Qbiogene, Irvine, CA, USA) at 6.5 m/s for 45 s. The bead-beating duration was selected for its ability to release DNA from spores while not overfragmenting genomes [12]. Following lysate centrifugation at 16,000 × g for 5 min, the aqueous supernatant was removed to a new 2-mL tube and maintained at −20°C for 1 h to overnight. An equal volume of chloroform was added to the thawed supernatant prior to vortexing for 5 s and centrifugation at 16,000 × g for 3 min. The supernatant was then combined with two volumes of binding Solution 3 (MoBio, Carlsbad, CA, USA). Genomic DNA (gDNA) from the mixture was isolated on a MoBio spin column, washed with Solution 4, and eluted in 60 μL of 1× TE according to the manufacturer's instructions. 
The gDNA was further purified by passage through a Sephacryl S-200 HR spin column (Amersham, Piscataway, NJ, USA) and stored at 4°C. Each of the gDNA preparations from four different “day pools” from the week of July 14, 2003 was independently PCR-amplified. PCR products were combined to constitute the sample for the week.

Subsurface water was collected during polylactate-stimulated bioremediation of a chromate-contaminated aquifer at the Hanford 100H site, WA (http://www-esd.lbl.gov/ERT/hanford100h/). Water, approximately 150 mL, was filtered through sterile 0.22-μm anodisc filters (Whatman, Florham Park, NJ, USA) and DNA was extracted by using a modification of the procedure described for air samples. Anodisc filters were manually fragmented in a sterile whirlpak bag, and 1 mL of phosphate buffer was added. Filter fragments in buffer were transferred to a bead lysis tube. Tubes were centrifuged at 16,000 × g for 5 min and 700 μL of buffer was removed. Next, 300 μL of buffered SDS solution and 300 μL of phenol/chloroform/isoamyl alcohol (25:24:1) were added and bead beating was performed at 5.5 m/s for 30 s. After centrifugation, the aqueous phase was mixed with an equal volume of chloroform in a phase-lock gel tube (Eppendorf, Westbury, NY, USA) and further extracted. The top phase containing nucleic acids was purified as for the air samples without the need for additional Sephacryl purification. DNA was eluted in 50 μL sterile water and stored at −20°C until needed.

Subsurface soils were obtained from a uranium-contaminated soil (area 2) at the NABIR Field Research Center at Oak Ridge, TN, USA (http://www.esd.ornl.gov/nabirfrc/). More information about the soil characteristics is available at http://public.ornl.gov/nabirfrc/other/FRCSummary.pdf. DNA was extracted from triplicate 500-mg (wet weight) subsamples of soil using a BIO101 soil DNA extraction kit (Qbiogene) according to the manufacturer's protocol.

16S rRNA Gene Amplification

The 16S rRNA gene was amplified from the DNA extracts using universal primers 27f.1 (5′-AGRGTTTGATCMTGGCTCAG) and 1492R (5′-GGTTACCTTGTTACGACTT). PCR for air and soil samples was carried out by using the TaKaRa Ex Taq system (Takara Bio Inc., Japan) as follows, with at least three replicate PCR reactions performed per sample and pooled before analysis. Each PCR reaction mix contained 1× buffer, 0.8 mM TaKaRa dNTP mixture, 0.02 U/μL Ex Taq polymerase, 0.4 mg/mL bovine serum albumin (BSA), and 1.0 μM of each primer. PCR conditions were 1 cycle of 3 min at 95°C, followed by 35 cycles of 95°C (30 s), 53°C (30 s), 72°C (60 s), and a final extension at 72°C for 7 min. DNA extracts from water samples were amplified by using a slightly different protocol using a range of eight different annealing temperatures between 48°C and 58°C. Only 30 cycles were performed for amplification from water samples and amplicons from the eight different annealing temperatures were combined.

Cloning-and-Sequencing

Amplicon pools from the three environments were subjected to cloning as follows: Amplicons were ligated and cloned by using the TOPO-TA pCR2.1 kit (Invitrogen, Carlsbad, CA) according to the manufacturer's instructions. Individual clones containing organism-specific 16S rRNA gene fragments were purified by using magnetic beads [54] and sequenced from each terminus using an ABI3700 (Applied Biosystems, Foster City, CA). Reads were assembled using Phred and Phrap [16, 17] and were required to pass quality tests of Phred 20 (base call error probability < 10⁻²) to be included in the analysis. Sequencing was performed at the DOE Joint Genome Institute (JGI; http://www.jgi.doe.gov/). Putative chimeric sequences were identified by using Bellerophon [25]. Sequences were aligned to the Greengenes 7682-character format by using the NAST [10] web server (http://greengenes.lbl.gov/NAST). Similarity to public database records was calculated with DNADIST [19], by using the DNAML-F84 option assuming a transition/transversion ratio of 2.0, and an A, C, G, and T 16S rRNA gene base frequency of 0.2537, 0.2317, 0.3167, and 0.1979, respectively. These frequencies were empirically calculated from all records of the Greengenes 16S rRNA gene multiple sequence alignment over 1250 nucleotides in length. The lane mask [35] was used to restrict similarity observations to 1287 conserved columns (lanes) of aligned characters. Cloned sequences from this study were rejected from further analysis when fewer than 1000 characters could be compared to a lane-masked reference sequence. Sequences were assigned to a taxonomic node by using a sliding scale of similarity thresholds [52]. Phylum, class, order, family, subfamily, or OTU placement was accepted when a clone surpassed similarity thresholds of 80%, 85%, 90%, 92%, 94%, or 97%, respectively. When similarity to the nearest database sequence was below 94%, the clone was considered to represent a novel subfamily, and a novel class was denoted when similarity was less than 85%.
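The sliding scale of similarity thresholds can be expressed as a small lookup, sketched below. The function names are hypothetical, and treating a similarity exactly equal to a threshold as meeting it is an assumption, since the text specifies only the behaviour below 94% and below 85%.

```python
# (rank, minimum percent similarity), ordered from coarsest to finest.
THRESHOLDS = [
    ("phylum", 80.0),
    ("class", 85.0),
    ("order", 90.0),
    ("family", 92.0),
    ("subfamily", 94.0),
    ("OTU", 97.0),
]


def finest_rank(similarity_pct):
    """Return the finest rank whose threshold is met, or None if below 80%."""
    placed = None
    for rank, cutoff in THRESHOLDS:
        if similarity_pct >= cutoff:  # assumption: equality counts as meeting it
            placed = rank
    return placed


def novelty(similarity_pct):
    """Return the coarsest novelty label for a clone, or None if not novel."""
    if similarity_pct < 85.0:
        return "novel class"
    if similarity_pct < 94.0:
        return "novel subfamily"
    return None
```

For example, a clone with 93% similarity to its nearest database record would be placed to family level and flagged as a novel subfamily.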

Accumulation curves, diversity estimates (Shannon–Weaver index [43]), and nonparametric richness estimations (Chao1 and ACE [4, 5]) were calculated by using the software DOTUR [51] with the clone distance matrix as input and a nearest-neighbor clustering algorithm. Dominance in clone libraries was calculated as 1 − Shannon evenness index (1 − E), where evenness (E) is represented as follows: E = H/ln S (H = Shannon–Weaver diversity index; S = total richness in a sample).
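For illustration, the dominance measure defined above can be computed from a vector of clone counts per group as follows. This is a sketch rather than the DOTUR implementation; it assumes natural logarithms and takes S as the observed number of groups with at least one clone.

```python
import math


def shannon_index(counts):
    """Shannon-Weaver diversity H = -sum(p_i * ln p_i) over non-empty groups."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def dominance(counts):
    """Return 1 - E, where E = H / ln S and S is the observed richness."""
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        raise ValueError("evenness is undefined for fewer than two groups")
    evenness = shannon_index(counts) / math.log(richness)
    return 1.0 - evenness
```

A library in which every group holds the same number of clones has a dominance of 0, whereas a library dominated by a single group approaches 1.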

Accession Numbers

Sequences generated in this study have been deposited in Genbank as accession numbers DQ125500–DQ125935 (soil), DQ129237–DQ129656 (air), and DQ264398–DQ264650 (water). Fasta formatted records can also be obtained at http://greengenes.lbl.gov/Download/Clones_v_Array/.

Microarray Processing

Identical amplicon pools used for cloning were also used for array analysis. For air samples, 2 μg of amplicons was concentrated to a volume of less than 40 μL with a Microcon YM100 spin filter (Millipore, Billerica, MA, USA). For soil samples and water samples, 2 μg (∼10¹² gene copies) and 500 ng (∼3 × 10¹¹ gene copies) of amplicons, respectively, were concentrated using a PCR clean up kit (MoBio). The PCR products were spiked with known concentrations of amplicons derived from prokaryotic metabolic genes. This mix was fragmented to 50–200 bp using DNase I (0.02 U/μg DNA; Invitrogen) and One-Phor-All buffer per the Affymetrix protocol. The complete mixture was incubated at 25°C for 10 min, 98°C for 10 min, and then labeled. Biotin labeling was accomplished using an Enzo Bioarray Terminal Labeling Kit (Affymetrix) as per the manufacturer's instructions. Next, labeled DNA was denatured (99°C for 5 min) and hybridized to the DNA microarray at 48°C overnight (>16 h) at 60 rpm. The arrays were subsequently washed and stained. Reagents, conditions, and equipment are detailed elsewhere [44].
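The approximate gene copy numbers given above can be reproduced with a rough mass-to-copies conversion. The amplicon length of about 1500 bp and the average mass of 650 g/mol per base pair of double-stranded DNA are assumptions made for this sketch and are not values stated in the text.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def amplicon_copies(mass_g, length_bp=1500, g_per_mol_per_bp=650.0):
    """Estimate double-stranded amplicon copies from mass in grams.

    length_bp and g_per_mol_per_bp are assumed, approximate values.
    """
    molar_mass = length_bp * g_per_mol_per_bp
    return mass_g / molar_mass * AVOGADRO


soil_copies = amplicon_copies(2e-6)      # 2 ug, on the order of 10^12
water_copies = amplicon_copies(500e-9)   # 500 ng, on the order of 3 x 10^11
```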

Scanning and Probe Set Scoring

Arrays were scanned using a GeneArray Scanner (Affymetrix). The scan was captured as a pixel image using standard Affymetrix software (GeneChip Microarray Analysis Suite, version 5.1) that reduced the data to an individual signal value for each probe. Background probes were identified as those producing intensities in the lowest 2% of all intensities. The average intensity of the background probes was subtracted from the fluorescence intensity of all probes. The noise value (N) was the variation in pixel intensity signals observed by the scanner as it read the array surface. The standard deviation of the pixel intensities within each of the identified background cells was divided by the square root of the number of pixels comprising that cell. The average of the resulting quotients was used for N in the calculations described below.
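The background subtraction and noise estimate described above can be sketched as follows. The input layout, one tuple of mean intensity, pixel standard deviation, and pixel count per feature, is a hypothetical simplification of the scanner output, and the function name is illustrative.

```python
import math


def background_and_noise(cells, percent=2):
    """Return (background, noise N, background-corrected intensities).

    cells: list of (mean_intensity, pixel_std, n_pixels), one per feature.
    """
    ranked = sorted(cells, key=lambda c: c[0])
    # Background cells are those in the lowest `percent` of all intensities.
    n_bg = max(1, len(ranked) * percent // 100)
    bg_cells = ranked[:n_bg]

    background = sum(c[0] for c in bg_cells) / n_bg
    # N is the mean over background cells of std / sqrt(number of pixels).
    noise = sum(c[1] / math.sqrt(c[2]) for c in bg_cells) / n_bg

    corrected = [c[0] - background for c in cells]
    return background, noise, corrected
```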

Probe pairs scored as positive were those that met two criteria: (1) the fluorescence intensity from the perfectly matched probe (PM) was at least 1.3 times greater than the intensity from the mismatched control (MM), and (2) the difference in intensity, PM minus MM, was at least 130 times greater than the squared noise value (>130N²). The positive fraction (PosFrac) was calculated for each probe set as the number of positive probe pairs divided by the total number of probe pairs in a probe set. An OTU was considered “present” when its PosFrac was greater than 0.92 in all three replicates. Present calls were propagated upwards through the taxonomic hierarchy by considering any node (subfamily, family, order, etc.) as “present” if at least one of its subordinate OTUs was present.
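The scoring rules above can be summarized in a short sketch. The function names and data layout are hypothetical, and the handling of values exactly at a threshold is an assumption: the ratio test is taken as inclusive ("at least") and the difference test as strict, following the ">130N²" notation.

```python
def pair_is_positive(pm, mm, noise):
    """Apply the two criteria to one background-corrected probe pair."""
    return pm >= 1.3 * mm and (pm - mm) > 130 * noise ** 2


def pos_frac(pairs, noise):
    """Fraction of positive probe pairs in one probe set on one array."""
    hits = sum(pair_is_positive(pm, mm, noise) for pm, mm in pairs)
    return hits / len(pairs)


def otu_present(replicates, cutoff=0.92):
    """replicates: list of (pairs, noise), one entry per replicate array."""
    return all(pos_frac(pairs, noise) > cutoff for pairs, noise in replicates)


def propagate(present_otus, lineage):
    """Return every higher node with at least one present subordinate OTU.

    lineage: dict mapping an OTU identifier to its list of ancestor nodes.
    """
    nodes = set()
    for otu in present_otus:
        nodes.update(lineage[otu])
    return nodes
```

With these rules, a probe set of 11 pairs is called present only if all 11 pairs are positive in every replicate, because 10 of 11 gives a PosFrac of about 0.91.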

Validation of Array-Detected OTUs Not Detected by the Clone Library

PCR primers targeting specific OTUs within the Nitrospiraceae and Planctomycetaceae (Table 3) were generated by ARB's probe design feature [41] and Primer3 [50]. Melting temperatures were constrained from 45°C to 65°C and G + C content between 40% and 70% was preferred. The primers were chosen to contain 3′ bases noncomplementary to sequences outside of the subfamily. TM7 phylum-specific primers [29] were obtained from the literature. DNA sequences were generated by PCR as described above with the necessary adjustments in annealing temperatures. Amplicons were purified (PureLink PCR Purification Kit, Invitrogen), sequenced, and matched to an OTU in the same manner as described in the cloning-and-sequencing method except the minimum count of base comparisons was not used to exclude data.

Results

Cloning

Sequences from the clone libraries of the three environments were assembled into contigs from two sequencing reactions initiated at the 5′ and 3′ termini of the 16S rRNA gene. Initially, 1391 contigs with low base call error probability were accepted for analysis. After removing contigs that did not contain sufficient data to allow at least 1000 characters to be compared to a lane-masked reference sequence and filtering the sequences for putative chimeras, 1155 remained. The vetted libraries from air, soil, and water contained 417, 485, and 253 clones, respectively. Figure 1 shows the class level (85% database sequence homology) distributions of clones within each of the three ecosystem samples; not shown are 1, 8, and 0 clones (air, soil, water) considered novel at the class level compared to existing database entries. For air, soil, and water clone libraries, 9.8%, 14.4%, and 1.6% of clones, respectively, were outside the subfamily level (94%) assignment threshold and were considered novel sequences.

Figure 1

Class-level distribution of clones within libraries from air, soil, and water samples.

Clone library analysis indicated that Firmicutes dominated the air sample mostly within the class Bacilli, whereas Actinobacteria dominated the soil sample. The water sample consisted solely of three classes: flavobacteria (Bacteroidetes), β-proteobacteria, and γ-proteobacteria, each with similar distribution. Figure 2 shows the accumulation curves for the three samples where the cumulative number of subfamilies observed is plotted against the sampling effort. All communities were incompletely sampled as evidenced by the nonasymptotic curves [30]; however, the water community appeared relatively well sampled compared to air and soil.

Figure 2

Accumulation curves for air (●; n = 417), soil (□; n = 485) and water (△; n = 253) bacterial communities analyzed by clone library. The maximum diversity scenario, in which every clone sampled is a new observation, is also shown (○). Sampling efficiency is presented at the subfamily level, where a subfamily is defined as 94% sequence homology. Data for curves represent an average of 1000 simulations performed using the software DOTUR.

Comparison of Cloning with Microarray Analysis

A comparison between clone library and microarray assessment of community composition is shown for each taxonomic level in Table 1. The breadth of 16S rRNA gene sequence types was expressed as a count of distinct groups detected in each environment, by each method, at six levels of taxonomic resolution. It is clear that even at the phylum level cloning underestimated richness compared with the microarray. Among all three environments, the amplicons categorized by cloning were in general concordance with a subset of taxonomic categories reported from the arrays. This trend continued as the resolution of the comparisons increased from phylum to subfamily. Hybridization of aerosol amplicons produced “present” calls in 238 subfamilies, 178 of which were not found in the corresponding clone library. Subfamily richness appeared to be greater in the soil sample with 279 subfamilies detected; of these, 239 were not encountered in the clone library. The water sample was shown to have the lowest richness by both methods, but of the 99 subfamilies detected by microarray, only 6 were detected by cloning. For all sample types, very few subfamilies deemed present by the clone libraries were not reflected in the microarray hybridization results.

Table 1 Count of distinct taxonomic groups detected by microarrays and/or clone libraries from three sample matrices

A synopsis of the phylum-level community composition is assembled in Table 2, allowing comparison of the 34 phyla reported in at least one of the environments by at least one of the methods. By cloning, 10 phyla were detected in the aerosol and soil samples, whereas only 2 phyla were detected in the water sample. In contrast, the array method detected all of the phyla detected by cloning and an additional 17, 23, and 17 phyla in the air, soil, and water samples, respectively.

Table 2 Phyla detected in different sample matrices by cloning and sequencing or by high-density DNA microarray analysis

PCR and Sequencing Confirmation of Additional Phyla Detected by Microarray

To determine whether the additional phyla detected by microarray were true positives, and not the product of unforeseen cross-hybridization, three OTUs from diverse phyla, detected in aerosols by microarray only, were chosen for further investigation (Table 3). OTU 864 (OTU numbers correspond to 16S rRNA microarray probe sets) within the phylum Nitrospira comprised sequences discovered in sludge, soil, reservoirs, and in an aquifer. All 13 probe pairs in the probe set for OTU 864 were positive in 3 of 3 arrays. The sample was interrogated with primers designed from known sequences in the OTU and the resulting amplicons were sequenced, revealing similarity to OTU 864. Similarly, all 11 probe pairs of a Planctomycetes OTU (OTU 4948) were consistently positive despite this OTU being unrepresented among the 417 aerosol clones. Primers were designed from the five 16S rRNA gene sequences generated from a municipal wastewater plant [7] that defined OTU 4948. Taxon-specific PCR and sequencing confirmed that a sequence matching this OTU was present in the sample. Evidence for the presence of phylum TM7 came from the probe set complementary to OTU 8155. General TM7 phylum-specific primers [29] produced sequences attributed to a related TM7 OTU identified as 3664.

Table 3 Phyla detected only by DNA microarray analysis and subsequently verified by PCR

Diversity Estimates

Table 4 lists diversity estimates and richness predictions based on the clones sampled and also compares predicted richness values to those observed by both cloning and array methods. Shannon–Weaver diversity estimates for the clone libraries indicate that sample diversity follows the order air > soil >>> water. Both Chao1 and ACE nonparametric richness estimators predicted that the subfamily level richness of the air and soil samples is far greater than that observed through clone sampling, and these estimates were in strong agreement with the subfamily counts reported by the array analysis. The water sample produced a large discrepancy in subfamily richness between the cloning and array methods regardless of whether direct clone observations or nonparametric predictors were used for the comparison. A trend was observed in Table 4: the greater the dominance encountered in the clone library, the greater the difference between the subfamily counts observed by cloning and by array.

Table 4 Clone library based estimates of diversity and predicted richness compared with observed richness determined by cloning and array approaches

Discussion

The postulated complexity of each microbial community that has been isolated from the environment, combined with the number of potentially unique ecosystems, has hindered efforts to sufficiently catalog microbial biodiversity. There are estimates of thousands [14, 55] to millions [21] of unique bacterial genomes present in a gram of soil. Outdoor aerosols may be equally complex, composed of organisms released from multiple habitats, both locally and over long distances. Furthermore, microbial communities can change temporally as environmental conditions vary. The need for high-throughput accurate biological monitoring is clear.

Community fingerprinting has provided methods for rapidly profiling microbial communities with replication. These approaches, based on heterogeneity in amplicon length, endonuclease cleavage sites, melting profiles, or single strand secondary structure, have allowed high-throughput inference of species richness and evenness [39]. However, fingerprinting methods are generally deficient in providing taxonomic microbial identity and typically yield fewer than 100 clearly defined bands, peaks, or products for analysis. T-RFLP offers the greatest taxonomic resolution of the rapid fingerprinting methods, potentially capable of class identification [34] when DNA sequences in the sample are nearly identical to database reference sequences, or when over 10 restriction enzymes are used in parallel for each sample [49]. However, the reliability of T-RFLP for taxonomic assignment is unclear, because classes reported by T-RFLP but not found in a corresponding clone library have been left unverified.

PCR has made it possible to easily obtain composite samples of mixed rRNA genes from natural environments [53, 60]. Although amplification biases have been demonstrated in defined communities due to primer selection, number of cycles, and template concentration [18, 48, 59], this technique has been valuable in increasing our understanding of the complexity of individual communities [27]. Regardless of the method used to limit biases from environmental samples, most of our knowledge on microbial composition of specific communities comes from isolating individual, amplified 16S rRNA genes for cloning and sequencing. The sequences are compared to references in large databases, allowing either specific phylogenetic classification or proposal of novel taxa when a clone is sufficiently divergent from known groups. The limitation becomes the number of clones or PCR products requiring sequencing and analysis. It has been suggested that environmental samples may require over 40,000 sequencing reactions to document 50% of the richness [14]. This approach is laborious, costly, and time-consuming, often taking weeks to complete the analysis of one clone library. Thus, performing studies with sample replication becomes rapidly overwhelming. Despite the effort required, clone libraries are often the chosen method, and the current “gold standard” for obtaining the greatest estimate of diversity. Typical libraries of cloned 16S rRNA gene fragments include fewer than 1000 sequences [15, 32, 45], well below the suggested quantity.

By making it possible to conduct sequence analyses on the complete pool of 16S rRNA gene fragments at once, high-density photolithography microarrays have the ability to provide microbial identification within complex environmental samples in a high-throughput manner. Hybridization (with replicates) requires only 1.5 days compared to the clone-and-sequence method, which necessitates 3 days or longer to sequence and analyze a typical library of several hundred clones. It has previously been demonstrated that organisms from complex environmental samples can be accurately classified into their respective orders by using an array with 62,358 probes [12, 61]. The present study investigated the response of a novel microarray containing 297,851 probes when hybridized to 16S rRNA gene amplicons generated from aerosols, soil, and water. The microarray design approach was based on the anticipation that the tool would be used to characterize samples without prior knowledge of their microbial composition. For this reason, more than 30,000 diverse 16S rRNA gene sequences were clustered into 8935 OTUs consisting of 842 subfamilies. Every OTU was interrogated by 24 probes, on average, each adjacent to a control probe used to subtract the effects of nonspecific hybridization. The requirement of a sequence-specific interaction from multiple unique probes to identify the presence of each OTU was implemented to increase the confidence of detection over single probe per OTU methods.

This high-density microarray targets the most unique portions of the 16S rRNA gene for a given cluster, and the results are summarized at higher phylogenetic levels. This is in contrast to scoring probes at each node in a hierarchical tree [40]. Our approach accommodates OTUs that may be divergent from other members of the same encapsulating node (which is often the case with environmental sequences) by not requiring that a single probe solution be found for the entire node. Although the microbial census from every environment is far from complete [3], the key questions for suitability of this approach are: “Is the sequence variation from all extant prokaryotes unlimited, encompassing every possible nucleotide variation within the 16S rRNA gene?”, or conversely, “Can a majority of the organisms be classified on the basis of similarity to identified sequences in the databases?” Unlike estimates of microbial genomic variability [21], sequence variability of the 16S rRNA gene appears more constrained [52], most likely because of the functional necessity of the ribosome. Thus, to some degree, sequences not yet in the database may share some homology with targets used for probe analysis. This basic assumption of probe design, which has enabled the identification of one or a few OTUs using Southern analysis [20, 36], fluorescence in situ hybridization (FISH) [8, 23], and quantitative PCR (QPCR) [1, 24], was extended to design probes for a substantially greater number of OTUs. To allow for the detection of environmental sequences slightly divergent from those represented on our array, we do not require a sequence-specific interaction from 100% of the probe pairs defining an OTU.

It was hypothesized that a phylogenetic profile calculated from an array analysis should reflect the composition of sequence types obtained by cloning the same amplicon pool. Specifically, we tested the effectiveness of the novel microarray in detecting and categorizing environmental 16S rRNA genes into taxa with defined nomenclature. Three environments were selected for bacterial community evaluation by means of DNA extraction and universal 16S rRNA gene PCR amplification. Products were split for sampling by cloning-and-sequencing or microarray hybridization.

The three clone libraries produced 253–485 sequences each and varied in composition relative to each other. The soil and air were dominated by Actinobacteria and Bacilli, respectively (Fig. 1), and contained over 40 subfamilies each. The water library possessed considerably less richness with only six subfamilies. As predicted, accumulation curves demonstrated that hundreds of clones were insufficient to catalog all subfamilies putatively present, but that the water appeared more thoroughly sampled than the others. The divergent characteristics of the three clone libraries were considered beneficial for testing the array against dissimilar 16S rRNA gene amplicon community structures.

After each amplicon pool was hybridized to replicate arrays, probe responses were matched to OTUs in the database. Detection of an OTU required more than 92% of the probe pairs assigned to the particular probe set to hybridize such that the PM probe had a greater intensity than its MM probe partner (a positive fraction, PosFrac, above 0.92). This threshold was chosen so that sequences with minor divergence from the database entries from which the array was designed would remain detectable. The OTUs found by the array and/or the cloning method were summarized to the subfamily, family, order, class, and, lastly, phylum to discern the resolution at which the results deviate. Regardless of the resolution considered, the array consistently revealed greater richness than the corresponding clone library (Table 1). This result was expected because nonasymptotic accumulation curves demonstrated that the clone libraries were only a partial sample of the total sequence diversity. The array predicted the presence of every phylum represented in the clone library. The same concurrence held for nearly all classes and orders in all three environmental samples. The atmospheric bacterial samples offered the most relevant example of a previously uncharacterized environment, because very few aerosol-derived 16S rRNA gene sequences are publicly available. In this relatively unstudied environment, cloned sequences and array positives agreed poorly at the OTU level. Yet, the two methods concurred at the subfamily rank, with some exceptions. For example, in air samples, subfamilies within the Myxococcaceae (δ-proteobacteria) and Williamsiaceae (Actinobacteria) were overlooked by the array. Similarly, in the soil, cloned subfamilies within the Opitutaceae (Verrucomicrobia) and Propionibacteriaceae (Actinobacteria) were not found by the arrays using a PosFrac threshold of 0.92. Although the reduced PosFrac within Myxococcaceae and Opitutaceae could not be attributed to mismatches between the clone and probe sequences, it was clear that divergence at the loci targeted by the probes would prohibit a sequence-specific response for the Williamsiaceae and Propionibacteriaceae.
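The detection rule described above can be illustrated with a minimal sketch. It assumes only what is stated in the text: a probe pair counts as positive when the PM intensity exceeds the MM intensity, and an OTU is called present when the positive fraction exceeds 0.92. The function names and the example intensities are hypothetical, and any additional intensity or noise criteria used in the actual scoring are not modeled here.

```python
from typing import Sequence


def positive_fraction(pm: Sequence[float], mm: Sequence[float]) -> float:
    """Fraction of probe pairs whose PM intensity exceeds the paired MM intensity."""
    if len(pm) != len(mm) or len(pm) == 0:
        raise ValueError("PM and MM must be non-empty and of equal length")
    positives = sum(1 for p, m in zip(pm, mm) if p > m)
    return positives / len(pm)


def otu_detected(pm: Sequence[float], mm: Sequence[float],
                 threshold: float = 0.92) -> bool:
    """Call an OTU present when its positive fraction exceeds the threshold."""
    return positive_fraction(pm, mm) > threshold


# Hypothetical probe set of 24 pairs (the average probe-set size in the text).
pm_values = [200.0] * 24
mm_one_failure = [100.0] * 23 + [300.0]        # one pair fails: 23/24 positive
mm_two_failures = [100.0] * 22 + [300.0] * 2   # two pairs fail: 22/24 positive

print(otu_detected(pm_values, mm_one_failure))
print(otu_detected(pm_values, mm_two_failures))
```

For a probe set of 24 pairs, this rule tolerates one failing pair (23/24 is about 0.958) but not two (22/24 is about 0.917), which shows how little divergence at the probed loci is needed to push an OTU below the threshold.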

The greater number of phyla reported by the array, but not represented in the clone libraries, was unanticipated (Table 2). Two main factors, individually or combined, may help explain this anomaly: (1) the array approach overestimates richness as a result of nonspecific hybridization leading to false positives, or (2) cloning does not truly represent sequence distribution because of insufficient sampling or perhaps cloning bias. Three phyla detected only by array analysis (Nitrospira, Planctomycetes, and TM7) in air samples were chosen for further investigation. Amplification using specific PCR primers and sequencing of amplicons confirmed the presence of these phyla, and in two cases the exact OTU detected by the array was also confirmed. This demonstrated not only that the microarrays revealed broader diversity than a typical clone library, but also that the additional components could be identified and subsequently verified with a confirmatory third method. It is significant that entire phyla would have been overlooked if the clone library were the sole source of taxonomic sampling.

It was impractical to determine whether sequencing to extinction (asymptotic accumulation/rarefaction curves) would have revealed the additional phyla, because it has been estimated that >10^4 sequences may only be sufficient to encounter half of an environment's microbial richness [14]. However, the clone libraries presented in this work reflect the method as it is typically practiced rather than a statistically complete sampling. Nonetheless, it is possible to predict richness within microbial communities by using rarefaction and statistical estimators [4, 5, 51]. As expected for environmental bacterial community sampling efforts, accumulation curves and nonparametric richness estimators demonstrated that no community in this study was sampled to completion. Importantly, the predicted richness extrapolated from cloning observations from the air and soil was quite similar to that enumerated by array analysis. The richness detected by the array for the water sample considerably exceeded the predicted richness, apparently because of high dominance within the clone library.
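As an illustration of how a nonparametric estimator extrapolates richness from a clone library, the following sketch computes the bias-corrected Chao1 estimate from per-taxon clone counts. The text does not state which estimators were applied, so the choice of Chao1 is an assumption made for illustration only, and the function name and example counts are hypothetical.

```python
from typing import Iterable


def chao1_bias_corrected(clone_counts: Iterable[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    S_obs is the number of taxa observed, F1 the number of taxa seen
    exactly once (singletons), and F2 the number seen exactly twice
    (doubletons).
    """
    observed = [c for c in clone_counts if c > 0]
    s_obs = len(observed)
    f1 = sum(1 for c in observed if c == 1)
    f2 = sum(1 for c in observed if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical library: 8 taxa observed, 4 singletons, 2 doubletons.
example_counts = [10, 5, 2, 2, 1, 1, 1, 1]
print(chao1_bias_corrected(example_counts))  # 8 + (4 * 3) / (2 * 3) = 10.0
```

Because the estimate is driven by singletons and doubletons, a library dominated by a few abundant taxa with few rare ones yields a predicted richness close to the observed richness, which is one way high dominance can lead to a low prediction.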

In a previous study on environmental amplicon sampling, we demonstrated a lack of correlation between the numbers of clones from a phylogenetic taxon and the corresponding hybridization intensity by using a 62,358-probe array [61]. Analogous discrepancies have been documented when sampling environmental PCR products by cloning versus SSCP [31], or versus T-RFLP [42], suggesting a cloning bias. It is possible that the cloning process is limited because of nonrandom selection from a heterogeneous pool when amplicons are nonuniform in length [47] or form variable secondary structures [38]. Conversely, the 16S rRNA microarray has the potential advantage of increased sensitivity. Whereas typical clone libraries must be pruned of sequencing aberrations (including chimeras), usually leaving only hundreds of amplicons for the final taxonomic assessment, the array exposes the entire mass of PCR products to the probes. Using the described method of data analysis, the microarray requires >10^7 gene copies for detection (manuscript in preparation). In this study, between 10^11 and 10^12 molecules were sampled by the array, whereas only hundreds were analyzed by cloning. Therefore, minority amplicon types, with concentrations 4 orders of magnitude less than those in the majority, will have an increased probability of being detected by the microarray. In fact, high dominance within clone libraries correlated with large differences between the richness detected by array and cloning approaches (Table 4). The trend may predict that underestimation of the true richness can occur when cloning efforts produce only a limited pool of diversity. We acknowledge that further investigation of this trend is necessary, especially to exclude the possibility that small sample sizes alone cause the underestimation.
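The sensitivity argument above rests on simple arithmetic, sketched below. The first function gives the span, in orders of magnitude, between the molecules applied to the array and the stated detection minimum. The second gives the chance that a rare amplicon type appears at least once in a clone library, under the simplifying assumption of random, unbiased clone selection (an assumption the discussion of cloning bias itself calls into question). Both function names are hypothetical, and the library size of 485 is the largest reported in this study.

```python
import math


def detectable_orders_of_magnitude(total_molecules: float,
                                   min_copies: float) -> float:
    """Orders of magnitude between the applied molecules and the detection minimum."""
    return math.log10(total_molecules / min_copies)


def prob_seen_in_library(rel_abundance: float, n_clones: int) -> float:
    """Chance that a type at the given relative abundance is picked at least
    once among n_clones, assuming independent, unbiased clone selection."""
    return 1.0 - (1.0 - rel_abundance) ** n_clones


# 10^11 molecules applied, 10^7 copies needed for detection.
print(detectable_orders_of_magnitude(1e11, 1e7))

# A type 4 orders of magnitude below the majority, in a 485-clone library.
print(prob_seen_in_library(1e-4, 485))
```

Under these assumptions, an amplicon type at a relative abundance of 10^-4 would still be present above the array's detection minimum, yet would have less than a 5% chance of appearing even once in a library of 485 clones.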

The described high-density universal 16S rRNA microarray has been successfully used to monitor metal-reducing bacteria during uranium bioremediation [57] and flux in airborne prokaryote populations in urban settings (Andersen et al., unpublished data). In this study, we presented the results of applying PCR products to the arrays; however, by interrogating nonamplified rRNA, a significant source of bias can be alleviated. This is the focus of ongoing studies.

In summary, although this microarray is unreliable in classifying novel taxa, it was capable of confirming the majority of clone-detected subfamilies in addition to revealing greater richness, even at the phylum level. Furthermore, richness observations from the array analysis corresponded well with nonparametric richness predictions calculated from clone sampling, indicating a more complete inventory of the sampled ecosystems. A subset of taxa uniquely identified by the array was verified, illustrating the consequence of relying solely on clone libraries when profiling a microbial community. The laborious, costly, and time-consuming nature of clone library analysis diminishes its utility in studies requiring replication and temporal monitoring. The responsiveness of the 16S rRNA microarray to nucleic acids from diverse phyla in complex mixtures and its suitability for investigations requiring replication demonstrate a necessary advance toward the goal of high-throughput ecological monitoring. For these reasons, we believe the high-density DNA microarray offers a promising approach for studies of microbial ecology.