Direct-to-consumer microbiome tests show wide variability
Methodological differences lead to major discrepancies, even when identical samples are analyzed.
04/16/2026
by Doug Brunk
Direct-to-consumer (DTC) gut microbiome testing services produce highly inconsistent results, with variability across companies comparable to differences seen between patients, according to a new analysis that used standardized stool samples.
In a study published in Communications Biology that evaluated seven commercial services, researchers found that methodological differences led to major discrepancies in microbial profiles and health interpretations, even when identical samples were analyzed.
Investigators from the National Institute of Standards and Technology (NIST) submitted identical aliquots of a homogenized human fecal reference material to seven companies in triplicate, for a total of 21 tests. The standardized material removed natural biological differences, so only the testing methods were being evaluated. The reports came back within two to eight weeks and were reviewed by hand for comparison.
All companies used next-generation sequencing, either 16S ribosomal RNA gene sequencing or whole metagenome sequencing, but they varied in how they collected and handled samples, how deeply they sequenced them, and how they analyzed the data. For example, reported sequencing depth ranged from fewer than 20,000 reads to more than 10 million reads, and the number of identified species ranged from fewer than 100 to more than 1,700 across companies.
At the genus level, variability was considerable. Across all analyses, 1,208 unique taxa were identified, yet only three genera were consistently detected in every sample. Even after excluding an outlier replicate, just 17 genera were shared across all datasets, representing fewer than 2% of total taxa identified.
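To make the overlap figures concrete, the following is a minimal sketch of how shared genera across reports could be counted. The company names and genus lists are hypothetical toy data, not values from the study, and the function name `shared_and_total` is an illustrative assumption rather than the authors' method.

```python
# Hypothetical genus lists from three reports on the same sample (toy data,
# not taken from the study).
reports = {
    "company_a": {"Bacteroides", "Faecalibacterium", "Roseburia", "Blautia"},
    "company_b": {"Bacteroides", "Faecalibacterium", "Bifidobacterium"},
    "company_c": {"Bacteroides", "Faecalibacterium", "Roseburia", "Alistipes"},
}


def shared_and_total(reports):
    """Return (genera found in every report, genera found in any report)."""
    sets = list(reports.values())
    shared = set.intersection(*sets)  # detected by all reports
    total = set.union(*sets)          # detected by at least one report
    return shared, total


shared, total = shared_and_total(reports)
share = len(shared) / len(total)
print(sorted(shared), f"{share:.1%} of all genera reported")

# The article's own figures: 17 shared genera out of 1,208 unique taxa.
print(f"{17 / 1208:.1%}")
```

Applied to the article's numbers, 17 of 1,208 works out to roughly 1.4%, consistent with the "fewer than 2%" stated above.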
Measures of diversity and composition further underscored the disagreement. Samples from a single donor analyzed by different companies were as dissimilar as, or more dissimilar than, samples from different donors processed with the same method. This finding was supported by statistical modeling: for 17 of 18 common genera, methodological variability was equal to or greater than biological variability.
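As an illustration of what "dissimilar" can mean for two microbial profiles, here is a minimal sketch using Bray-Curtis dissimilarity, a common measure for comparing relative abundances. The article does not name the metric the researchers used, so the choice of Bray-Curtis is an assumption, and the two profiles below are hypothetical toy data.

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles.

    p and q map genus name -> relative abundance. The result is 0.0 for
    identical profiles and 1.0 for profiles with no genera in common.
    """
    genera = set(p) | set(q)
    num = sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in genera)
    den = sum(p.get(g, 0.0) + q.get(g, 0.0) for g in genera)
    return num / den if den else 0.0


# Hypothetical reports for the SAME sample from two different methods.
method_a = {"Bacteroides": 0.5, "Faecalibacterium": 0.3, "Roseburia": 0.2}
method_b = {"Bacteroides": 0.2, "Faecalibacterium": 0.3, "Bifidobacterium": 0.5}

print(round(bray_curtis(method_a, method_b), 3))
print(bray_curtis(method_a, method_a))
```

In this toy case the two reports on one sample differ by about half of the maximum possible dissimilarity, which is the kind of method-driven gap the study compares against differences between donors.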
Agreement across companies was also limited when examining relative abundance. For a set of 18 shared genera, no single method matched consensus values for all taxa, and each method differed for several genera. On average, only 10 to 16 of 18 taxa per company fell within the consensus range.
Intra-company reproducibility varied. Some companies produced consistent results across replicates, while others showed clear inconsistency. One company reported a replicate with a markedly different microbial profile from the same sample, including more than 50% unidentified reads, yet still issued a consumer report.
Differences extended to clinical-style outputs. Companies varied in how they defined a “healthy” microbiome and in the reference populations used. For five clinically relevant genera (Bacteroides, Bifidobacterium, Clostridium, Roseburia, and Faecalibacterium), reported abundances varied widely, and comparator ranges differed between companies.
These discrepancies translated into conflicting health interpretations. In one case, replicate samples from the same stool were classified as both “healthy” and “unhealthy,” with divergent dietary and functional recommendations. Across companies, detection of Clostridioides difficile also varied, with three reporting its presence and four reporting absence.
“Methodological variability can be of the same magnitude as biological variability,” the authors wrote, which “highlights the need for caution in interpreting and acting on these test results.”
The study was conducted by investigators at NIST. One author reported a role as cofounder of a microbiome biotechnology company; the other authors reported no conflicts of interest.
Expert insight
GI & Hepatology News invited study author Diane E. Hoffmann, JD, MSc, Director of the Law & Health Care Program at the University of Maryland Carey School of Law, to comment on the findings. Hoffmann’s research has included focus groups with gastroenterologists, users of DTC microbiome tests, and patients with chronic gut issues, for an NIH-funded study evaluating the adequacy of regulation.
Why does this study matter?
Hoffmann: More consumers are purchasing these DTC health and wellness tests, believing that they can tell them something about their gut health. This study shows that the tests may not be analytically valid, let alone clinically valid. Analytical validity goes to whether the test accurately tells you what is in the sample. The fact that the seven companies who were sent the same sample did not produce the same results should tell consumers that you can’t rely on the test results. It’s similar to a physician sending the same blood sample to three different labs; the physician would expect the same results from each lab. Not receiving the same results indicates that something is wrong.
How might the findings influence clinical practice?
Hoffmann: I would think clinicians would be less likely to suggest these tests to patients. In a separate study that I conducted with colleagues, we held focus groups with gastroenterologists and functional medicine physicians. While none of the gastroenterologists ordered these tests, some of the functional medicine doctors did, along with other tests, to get a full picture of what was going on with a patient’s gut. Perhaps the study results will discourage them from suggesting their patients order the tests.
Is there anything else you’d like to say about this work?
Hoffmann: I think it’s important that consumers understand a few things. 1) These tests are a paradigm shift from the way most diagnostic lab tests work. Rather than identifying a specific pathogen among a large number of microorganisms, microbiome-based tests are purportedly telling consumers all of the different microorganisms in their gut and the abundance of each. It’s like finding a needle in a haystack versus identifying each straw in the haystack and classifying it by color or length. 2) The reasons for the different results are several. They include different collection mechanisms (e.g., some companies ask for an actual section of stool, while others might want you to submit a sample from used toilet paper). The companies also use different sequencing techniques to identify the microorganisms. Some use 16S rRNA gene amplicon sequencing; others use metagenomic sequencing. The different techniques can produce different results.
3) There is no consensus on what constitutes a healthy microbiome. There is some agreement that diversity is a good thing, but we don’t know which specific microorganisms are necessary for a healthy gut. Defining what is a healthy microbiome is a huge challenge because it is measuring a collection of things. Healthy is not the same as typical.
4) Finally, these tests are very lightly, if at all, regulated by the FDA.
Hoffmann reported having no relevant disclosures.