United States Patent Application 20160153054
Kind Code A1
FENG; Qiang ;   et al. June 2, 2016

BIOMARKERS FOR COLORECTAL CANCER

Abstract

Biomarkers and methods for predicting the risk of a disease related to microbiota, in particular colorectal cancer (CRC), are described.


Inventors: FENG; Qiang; (Shenzhen, CN) ; ZHANG; Dongya; (Shenzhen, CN) ; QIN; Youwen; (Shenzhen, CN)
Applicant: BGI Shenzhen Co., Limited (Shenzhen, CN); BGI Shenzhen (Shenzhen, CN)
Family ID: 1000001730796
Appl. No.: 15/015358
Filed: February 4, 2016


Related U.S. Patent Documents

Application Number    Filing Date    Patent Number
PCT/CN2014/083663     Aug 5, 2014
15015358

Current U.S. Class: 506/2 ; 506/16; 702/20
Current CPC Class: C12Q 1/6886 20130101; C12Q 2600/16 20130101; C12Q 2600/112 20130101; C12Q 1/6806 20130101
International Class: C12Q 1/68 20060101 C12Q001/68

Foreign Application Data

Date           Code    Application Number
Aug 6, 2013    CN      PCT/CN2013/080872

Claims



1. A method of obtaining a set of gene markers for predicting the risk of an abnormal condition related to microbiota, comprising: a) identifying abnormal-associated gene markers by a metagenome-wide association study (MGWAS) strategy comprising: i) collecting a sample from each subject from a population of subjects with the abnormal condition (abnormal) and subjects without the abnormal condition (controls), ii) extracting DNA from each sample, constructing a DNA library from each sample, and carrying out high-throughput sequencing of each DNA library to obtain sequencing reads for each sample; iii) mapping the sequencing reads to a gene catalog, and deriving a gene profile from the mapping result; iv) performing a Wilcoxon rank-sum test on the gene profile to identify differential metagenomic gene contents between the abnormal and controls; b) ranking all of the abnormal-associated gene markers identified in step a) by a minimum redundancy-maximum relevance (mRMR) method, and identifying/classifying sequential marker sets therefrom; and c) for each sequential marker set, estimating the error rate by leave-one-out cross-validation (LOOCV) of a linear discrimination classifier, and selecting an optimal gene marker set with the lowest error rate as the set of gene markers for predicting the risk of the abnormal condition.

2. A method of diagnosing whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota, comprising: 1) obtaining sequencing reads from sample j of the subject; 2) mapping the sequencing reads to a gene catalog and deriving a gene profile from the mapping result; 3) determining the relative abundance of each gene marker in a set of gene markers, wherein the set of gene markers is obtained using the method of claim 1; and 4) calculating the index of sample j using the following formula:

$$I_j = \left[ \frac{\sum_{i \in N} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|N|} - \frac{\sum_{i \in M} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|M|} \right],$$

wherein: $A_{ij}$ is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in the gene marker set, N is a subset of all of the abnormal-associated gene markers in selected biomarkers related to the abnormal condition, M is a subset of all of the control-associated gene markers in selected biomarkers related to the abnormal condition, and |N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing the abnormal condition.

3. The method of claim 1, wherein the metagenome-wide association study (MGWAS) strategy further comprises estimating the false discovery rate (FDR).

4. The method of claim 2, wherein the gene catalog is a non-redundant gene set constructed for related microbiota.

5. The method of claim 2, wherein the abnormal condition related to microbiota is an abnormal condition related to environmental microbiota such as soil microbiota, sea microbiota, or river microbiota.

6. The method of claim 2, wherein the abnormal condition related to microbiota is a disease related to microbiota present in the animal body or the human body, wherein the microbiota is selected from the group consisting of microbiota found in the gastrointestinal tract, nasal passages, oral cavities, skin and the urogenital tract.

7. The method of claim 2, wherein the abnormal condition related to microbiota is a colorectal disease selected from the group consisting of Colorectal Cancer, Ulcerative Colitis, Crohn's Disease, Irritable Bowel Syndrome (IBS), Diverticular Disease, Hemorrhoids, Anal Fissure, and Bowel Incontinence.

8. The method of claim 2, wherein the sequencing reads are obtained via a next-generation sequencing method or a next-next-generation sequencing method.

9. The method of claim 8, wherein the sequencing reads are obtained via at least one system selected from the group consisting of Hiseq 2000, SOLID, 454, and True Single Molecule Sequencing.

10. The method of claim 2, wherein the cutoff value is obtained by a Receiver Operator Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.

11. A method for diagnosing whether a subject has colorectal cancer (CRC) or is at the risk of developing colorectal cancer, comprising: 1) obtaining sequencing reads from sample j of the subject; 2) mapping the sequencing reads to a human gut gene catalog and deriving a gene profile from the mapping result; 3) determining the relative abundance of each of the gene markers listed in SEQ ID NOs: 1-31; and 4) calculating the index of sample j using the following formula:

$$I_j = \left[ \frac{\sum_{i \in N} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|N|} - \frac{\sum_{i \in M} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|M|} \right],$$

wherein: $A_{ij}$ is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers listed in SEQ ID NOs 1-31, N is a subset of all of CRC-associated gene markers and M is a subset of all of the control-associated gene markers, wherein the subset of CRC-associated gene markers and the subset of control-associated gene markers are shown in Table 1, and |N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing colorectal cancer.

12. The method of claim 11, wherein the cutoff value is obtained by a Receiver Operator Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.

13. The method of claim 12, wherein the value of the cutoff is -0.0575.

14. A gene marker set for predicting the risk of colorectal cancer (CRC) in a subject, consisting of the genes listed in SEQ ID NOs: 1-31.

15. A kit for analyzing the gene marker set of claim 14, comprising primers used for PCR amplification that are designed according to the genes listed in SEQ ID NOs: 1-31.

16. A kit for analyzing the gene marker set of claim 14, comprising one or more probes that are designed according to the genes listed in SEQ ID NOs: 1-31.

17. A method comprising using the gene marker set of claim 14 for preparation of a kit for predicting the risk of colorectal cancer (CRC) in a subject.

18. The method of claim 2, wherein the sample is a feces sample, a nasal cavity swab, a buccal swab, a skin swab or a vaginal swab.

19. The method of claim 2, wherein the sequencing reads are obtained via steps comprising: 1) collecting the sample j from the subject and extracting DNA from the sample, and 2) constructing a DNA library and sequencing the library.

20. The method of claim 11, wherein the sequencing reads are obtained via steps comprising: 1) collecting the sample j from the subject and extracting DNA from the sample, 2) constructing a DNA library and sequencing the library.
Description



CROSS-REFERENCE TO RELATED APPLICATION

[0001] The present patent application is a continuation-in-part of PCT Patent Application No. PCT/CN2014/083663, filed Aug. 5, 2014, which was published in the English language on Feb. 12, 2015, under International Publication No. WO 2015/018307 A1, which claims priority to PCT Patent Application No. PCT/CN2013/080872, filed Aug. 6, 2013, and the disclosure of both prior applications is incorporated herein by reference.

REFERENCE TO SEQUENCE LISTING SUBMITTED ELECTRONICALLY

[0002] This application contains a sequence listing, which is submitted electronically via EFS-Web as an ASCII formatted sequence listing with a file name "Sequence_Listing.TXT", creation date of Jan. 26, 2016, and having a size of about 43.5 kilobytes. The sequence listing submitted via EFS-Web is part of the specification and is herein incorporated by reference in its entirety.

FIELD

[0003] The present invention relates to biomarkers and methods for predicting the risk of a disease related to microbiota, in particular colorectal cancer (CRC).

BACKGROUND

[0004] Colorectal cancer (CRC) is the third most common form of cancer and the second leading cause of cancer-related death in the Western world (Schetter et al., 2011, "Alterations of microRNAs contribute to colon carcinogenesis," Semin Oncol., 38:734-742, incorporated herein by reference). Many people are diagnosed with CRC, and many patients die of this disease, each year worldwide. Although current treatment strategies, including surgery, radiotherapy, and chemotherapy, have significant clinical value for CRC, relapses and metastases after surgery have hampered the success of those treatment modalities. Early diagnosis of CRC will help not only to prevent mortality, but also to reduce the costs of surgical intervention.

[0005] Current tests of CRC, such as flexible sigmoidoscopy and colonoscopy, are invasive, and patients may find the procedures and the bowel preparation to be uncomfortable or unpleasant.

[0006] The development of CRC is a multifactorial process influenced by genetic, physiological, and environmental factors. With regard to environmental factors, lifestyle, particularly dietary intake, may affect the risk of developing CRC. The Western diet, which is rich in animal fat and poor in fiber, is generally associated with an increased risk of CRC. Thus, it has been hypothesized that the relationship between the diet and CRC may be due to the influence that the diet has on the colon microbiota and bacterial metabolism, making both the colon microbiota and bacterial metabolism relevant factors in the etiology of the disease (McGarr et al., 2005, "Diet, anaerobic bacterial metabolism, and colon cancer," J Clin Gastroenterol., 39:98-109; Hatakka et al., 2008, "The influence of Lactobacillus rhamnosus LC705 together with Propionibacterium freudenreichii ssp. shermanii JS on potentially carcinogenic bacterial activity in human colon," Int J Food Microbiol., 128:406-410, both incorporated herein by reference).

[0007] Interactions between the gut microbiota and the immune system have an important role in many diseases both within and outside the gut (Cho et al., 2012, "The human microbiome: at the interface of health and disease," Nature Rev. Genet. 13, 260-270, incorporated herein by reference). Intestinal microbiota analysis of feces DNA has the potential to be used as a noninvasive test for identifying specific biomarkers that can be used as a screening tool for early diagnosis of patients having CRC, thus leading to longer survival and a better quality of life.

[0008] With the development of molecular biology and its application in microbial ecology and environmental microbiology, the emerging field of metagenomics (environmental genomics or ecogenomics) has developed rapidly. Metagenomics, comprising extracting total community DNA, constructing a genomic library, and analyzing the library with strategies similar to those of functional genomics, provides a powerful tool to study uncultured microorganisms in complex environmental habitats. In recent years, metagenomics has been applied to many environmental samples, such as oceans, soils, rivers, thermal vents, hot springs, and human gastrointestinal tracts, nasal passages, oral cavities, skin and urogenital tracts, illuminating its significant value in various areas including medicine, alternative energy, environmental remediation, biotechnology, agriculture and biodefense. For the study of CRC, the inventors performed analysis in the metagenomics field.

SUMMARY

[0009] Embodiments of the present disclosure seek to solve at least one of the problems existing in the prior art to at least some extent.

[0010] The present invention is based on at least the following findings by the inventors: Assessment and characterization of gut microbiota has become a major research area in human disease, including colorectal cancer (CRC), one of the common causes of death among all types of cancers. To carry out analysis on the gut microbial content of CRC patients, the inventors performed deep shotgun sequencing of the gut microbial DNA from 128 Chinese individuals and conducted a Metagenome-Wide Association Study (MGWAS) using a protocol similar to that described by Qin et al., 2012, "A metagenome-wide association study of gut microbiota in type 2 diabetes," Nature, 490, 55-60, the entire content of which is incorporated herein by reference. The inventors identified and validated 140,455 CRC-associated gene markers. To test the potential ability to classify CRC via analysis of gut microbiota, the inventors developed a disease classifier system based on 31 gene markers that are defined as an optimal gene set by a minimum redundancy-maximum relevance (mRMR) feature selection method. For intuitive evaluation of the risk of CRC disease based on these 31 gut microbial gene markers, the inventors calculated a healthy index. The inventors' data provide insight into the characteristics of the gut metagenome corresponding to a CRC risk, a model for future studies of the pathophysiological role of the gut metagenome in other relevant disorders, and the potential for a gut-microbiota-based approach for assessment of individuals at risk of such disorders.

[0011] It is believed that gene markers of intestinal microbiota are valuable for improving cancer detection at earlier stages for the following reasons. First, the markers of the present invention are more specific and sensitive as compared to conventional cancer markers. Second, the analysis of stool samples ensures accuracy, safety, affordability, and patient compliance, and stool samples are transportable. As compared to a colonoscopy, which requires bowel preparation, polymerase chain reaction (PCR)-based assays are comfortable and noninvasive, such that patients are more likely to be willing to participate in the described screening program. Third, the markers of the present invention can also serve as a tool for monitoring therapy of cancer patients in order to measure their responses to therapy.

BRIEF DESCRIPTION OF DRAWINGS

[0012] These and other aspects and advantages of the present disclosure will become apparent and more readily appreciated from the following descriptions taken in conjunction with the drawings. It should be understood that the invention is not limited to the precise embodiments shown in the drawings.

[0013] In the drawings:

[0014] FIG. 1 shows the distribution of P-value association statistics of all the microbial genes analyzed in this study: the association analysis of CRC p-value distribution identified a disproportionate over-representation of strongly associated markers at lower P-values, with the majority of genes following the expected P-value distribution under the null hypothesis, suggesting that the significant markers likely represent true rather than false associations;

[0015] FIG. 2 shows minimum redundancy maximum relevance (mRMR) method to identify 31 gene markers that differentiate colorectal cancer cases from controls: an incremental search was performed using the mRMR method which generated a sequential number of subsets; for each subset, the error rate was estimated by a leave-one-out cross-validation (LOOCV) of a linear discrimination classifier; and the optimum subset with the lowest error rate contained 31 gene markers;

[0016] FIG. 3 shows the discovered gut microbial gene markers associated with CRC: the CRC indexes computed for the CRC patients and the control individuals from this study are shown along with patients and control individuals from earlier studies on type 2 diabetes and inflammatory bowel disease; the boxes depict the interquartile ranges between the first and third quartiles, and the lines inside the boxes denote the medians; the calculated gut healthy index listed in Table 6 correlated well with the ratio of CRC patients in the population; and the CRC indexes for CRC patient microbiomes are significantly different from the rest (***P<0.001);

[0017] FIG. 4 shows ROC analysis of the CRC index from the 31 gene markers in Chinese cohort I, which shows excellent classification potential, with an area under the curve of 0.9932;

[0018] FIG. 5 shows the CRC index calculated for an additional 19 Chinese CRC and 16 non-CRC samples in Example 2: the boxes in the inset depict the interquartile ranges (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the lines inside denote the medians, while the points represent the gut healthy indexes in each sample; the squares represent the case group (CRC); the triangles represent the control group (non-CRC); the triangle with the * represents non-CRC individuals that were diagnosed as CRC patients;

[0019] FIG. 6 shows species involved in gut microbial dysbiosis during colorectal cancer: the differential relative abundance of two CRC-associated and one control-associated microbial species were consistently identified using three different methods: MLG, mOTU and the IMG database;

[0020] FIG. 7 shows the enrichment of Solobacterium moorei and Peptostreptococcus stomatis in the CRC patient microbiomes;

[0021] FIG. 8 shows the Receiver Operating Characteristic (ROC) curve of the CRC-specific species marker selection using the random forest method and three different species annotation methods: (A) the IMG species annotation method was carried out using clean reads to IMG version 400; (B) the mOTU species annotation method was carried out using published methods; and (C) all significant genes were clustered using MLG methods and species annotations using IMG version 400;

[0022] FIG. 9 shows the stage-specific abundance of three species that are associated with or enriched in stage II and later, using three species annotation methods: MLG, IMG and mOTU;

[0023] FIG. 10 shows the species involved in gut microbial dysbiosis during colorectal cancer: the relative abundances of one bacterial species enriched in control microbiomes and three bacterial species enriched in CRC-associated microbiomes, during different stages of CRC (three different species annotation methods were used) are shown;

[0024] FIG. 11 shows the correlation between quantification by the metagenomic approach and quantitative polymerase chain reaction (qPCR) for two gene markers;

[0025] FIG. 12 shows the evaluation of the CRC index from 2 genes in Chinese cohort II: (A) the CRC index based on 2 gene markers separates CRC and control microbiomes; (B) ROC analysis reveals marginal potential for classification using the CRC index, with an area under the curve of 0.73; and

[0026] FIG. 13 shows the validation of robust gene markers associated with CRC: qPCR abundance (in log10 scale, zero abundance plotted as -8) of three gene markers was measured in cohort II, which consisted of 51 cases and 113 healthy controls; two gene markers were randomly selected (m1704941: butyryl-CoA dehydrogenase from F. nucleatum, m482585: RNA-directed DNA polymerase from an unknown microbe), and one was targeted (m1696299: RNA polymerase subunit beta, rpoB, from P. micra): (A) the CRC index based on the three genes clearly separates CRC microbiomes from controls; (B) the CRC index classifies with an area under the receiver operating characteristic (ROC) curve of 0.84; and (C) the P. micra species-specific rpoB gene shows relatively higher incidence and abundance starting in CRC stages II and III (P = 2.15 × 10^-15) as compared to the control and stage I microbiomes.

DETAILED DESCRIPTION

[0027] Various publications, articles and patents are cited or described in the background and throughout the specification; each of these references is herein incorporated by reference in its entirety. Discussion of documents, acts, materials, devices, articles or the like which have been included in the present specification is for the purpose of providing context for the present invention. Such discussion is not an admission that any or all of these matters form part of the prior art with respect to any inventions disclosed or claimed.

[0028] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. Otherwise, certain terms used herein have the meanings as set forth in the specification. Terms such as "a", "an" and "the" are not intended to refer to only a singular entity, but include the general class for which a specific example can be used for illustration. The terminology herein is used to describe specific embodiments of the invention, but its usage does not delimit the invention, except as outlined in the claims.

[0029] In one aspect, the present invention relates to a method of obtaining a set of gene markers for predicting the risk of an abnormal condition related to microbiota, comprising

[0030] a) identifying abnormal-associated gene markers by a metagenome-wide association study (MGWAS) strategy comprising:

[0031] i) collecting a sample from each subject from a population of subjects with the abnormal condition (abnormal) and subjects without the abnormal condition (controls),

[0032] ii) extracting DNA from each sample, constructing a DNA library from each sample, and carrying out high-throughput sequencing of each DNA library to obtain sequencing reads for each sample;

[0033] iii) mapping the sequencing reads to a gene catalog, and deriving a gene profile from the mapping result;

[0034] iv) performing a Wilcoxon rank-sum test on the gene profile to identify differential metagenomic gene contents between the abnormal and controls;

[0035] b) ranking all of the abnormal-associated gene markers identified in step a) by minimum redundancy-maximum relevance (mRMR) method, and identifying or classifying sequential marker sets therefrom; and

[0036] c) for each of the sequential marker sets identified or classified from step (b), estimating the error rate by a leave-one-out cross-validation (LOOCV) of a linear discrimination classifier, and selecting an optimal gene marker set with the lowest error rate as the set of gene markers for predicting the risk of the abnormal condition.
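Steps b) and c) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the specification: the sequential marker sets are treated as the top-1, top-2, ... prefixes of an mRMR ranking that is supplied as input, a nearest-class-mean rule stands in for the linear discrimination classifier, and the function names and the toy data layout are hypothetical.

```python
def nearest_mean_predict(train_x, train_y, sample):
    """Predict the class whose mean feature vector is closest to the sample.

    Simplified stand-in for a linear discrimination classifier (assumption).
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(features, labels, predict=nearest_mean_predict):
    """Fraction of samples misclassified when each one is held out once."""
    errors = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) != labels[i]:
            errors += 1
    return errors / len(features)


def select_marker_set(samples, labels, ranked_markers):
    """Evaluate the top-1, top-2, ... marker sets of a given ranking.

    samples: one row of marker abundances per subject.
    ranked_markers: column indices in ranking order (the ranking itself,
    e.g. from mRMR, is assumed to be computed elsewhere).
    Returns (markers, error) with the lowest LOOCV error; ties keep the smaller set.
    """
    best = (None, float("inf"))
    for k in range(1, len(ranked_markers) + 1):
        subset = ranked_markers[:k]
        features = [[row[m] for m in subset] for row in samples]
        err = loocv_error(features, labels)
        if err < best[1]:
            best = (subset, err)
    return best
```

Every candidate set is scored with the same hold-one-out procedure, so the error rates are comparable across set sizes; any other classifier with the same `predict(train_x, train_y, sample)` signature could be substituted.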

[0037] In another aspect, the present invention relates to a method of diagnosing whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota, comprising:

[0038] 1) obtaining sequencing reads from sample j of the subject;

[0039] 2) mapping the sequencing reads to a gene catalog and deriving a gene profile from the mapping result;

[0040] 3) determining the relative abundance of each gene marker in a set of gene markers, wherein the set of gene markers is obtained using a method according to an embodiment of the invention;

[0041] and

[0042] 4) calculating an index of sample j by the following formula:

$$I_j = \left[ \frac{\sum_{i \in N} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|N|} - \frac{\sum_{i \in M} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|M|} \right],$$

wherein: $A_{ij}$ is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in the set of gene markers, N is a subset of all of abnormal-associated gene markers in selected biomarkers related to the abnormal condition, M is a subset of all of control-associated gene markers in the selected biomarkers related to the abnormal condition, and |N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing the abnormal condition.

[0043] In one embodiment, in a method of the present invention, the metagenome-wide association study (MGWAS) strategy further comprises estimating the false discovery rate (FDR). In one embodiment, the gene catalog is a non-redundant gene set constructed for the related microbiota. In one embodiment, the abnormal condition related to microbiota is an abnormal condition related to environmental microbiota such as soil microbiota, sea microbiota, or river microbiota. In another embodiment, the abnormal condition related to microbiota is a disease related to microbiota present in the animal body or the human body such as microbiota found in the gastrointestinal tract, nasal passages, oral cavities, skin or the urogenital tract, and the sample is a feces sample, a nasal cavity swab, a buccal swab, a skin swab or a vaginal swab. In a preferred embodiment, the abnormal condition related to microbiota is a colorectal disease selected from the group consisting of Colorectal Cancer, Ulcerative Colitis, Crohn's Disease, Irritable Bowel Syndrome (IBS), Diverticular Disease, Hemorrhoids, Anal Fissure, and Bowel Incontinence. In a most preferred embodiment, the abnormal condition related to microbiota is colorectal cancer (CRC).

[0044] In one embodiment, the sequencing reads are obtained via steps comprising: 1) collecting the sample j from the subject and extracting DNA from the sample, 2) constructing a DNA library and sequencing the library. In one embodiment, the DNA library is sequenced via a next-generation sequencing method or a next-next-generation sequencing method, preferably using at least one system selected from the group consisting of Hiseq 2000, SOLID, 454, and True Single Molecule Sequencing.

[0045] In another embodiment, the cutoff value is obtained by a Receiver Operator Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum.

[0046] In yet another aspect, the present invention relates to a method for diagnosing whether a subject has colorectal cancer (CRC) or is at the risk of developing colorectal cancer, comprising:

[0047] 1) obtaining sequencing reads from sample j of the subject;

[0048] 2) mapping the sequencing reads to a human gut gene catalog and deriving a gene profile from the mapping result;

[0049] 3) determining the relative abundance of each of the gene markers listed in SEQ ID NOs: 1-31; and

[0050] 4) calculating the index of sample j using the following formula:

$$I_j = \left[ \frac{\sum_{i \in N} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|N|} - \frac{\sum_{i \in M} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|M|} \right],$$

wherein: $A_{ij}$ is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers listed in SEQ ID NOs 1-31, N is a subset of all of the CRC-associated gene markers and M is a subset of all of the control-associated gene markers, wherein the subset of CRC-associated gene markers and the subset of control-associated gene markers are shown in Table 1, and |N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing colorectal cancer.
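As an illustration only, the index of sample j can be computed with a few lines of Python. The function name, the dictionary layout, and the marker identifiers in the example are hypothetical; the pseudocount of 10^-20 and the two averaged log terms follow the formula above.

```python
import math

PSEUDOCOUNT = 1e-20  # the 10^-20 term of the formula; keeps log10 defined at zero abundance


def sample_index(abundance, case_markers, control_markers):
    """Return I_j for one sample.

    abundance: dict mapping marker id -> relative abundance A_ij in sample j
               (markers absent from the dict are treated as zero abundance).
    case_markers: ids in subset N (CRC-associated markers).
    control_markers: ids in subset M (control-associated markers).
    """
    def mean_log(markers):
        total = sum(math.log10(abundance.get(m, 0.0) + PSEUDOCOUNT) for m in markers)
        return total / len(markers)

    return mean_log(case_markers) - mean_log(control_markers)


# Hypothetical toy sample with two markers in each subset.
toy = {"n1": 1e-5, "n2": 1e-7, "m1": 1e-6, "m2": 0.0}
index = sample_index(toy, ["n1", "n2"], ["m1", "m2"])
```

For the toy sample the first term averages log10 values of about -5 and -7, the second averages about -6 and -20, so the index is approximately 7; a sample would be flagged when its index exceeds the chosen cutoff.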

[0051] In one embodiment, the cutoff value is obtained by a Receiver Operator Characteristic (ROC) method, wherein the cutoff corresponds to the value when the AUC (Area Under the Curve) is at its maximum. In a preferred embodiment, the value of said cutoff is -0.0575.
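The specification does not spell out how the ROC method yields a single cutoff. One common reading, assumed here and not confirmed by the text, is to take the threshold that maximizes sensitivity plus specificity minus one (Youden's J statistic). The following Python sketch, with hypothetical function names and toy scores, computes the AUC and such a cutoff for a list of index values.

```python
def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def youden_cutoff(scores, labels):
    """Return the observed score that maximizes TPR - FPR when 'index > cutoff' means positive.

    Using Youden's J to pick the cutoff is an assumption of this sketch.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > cutoff)
        j = tp / n_cases - fp / n_controls
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff


# Hypothetical index values: three cases (label 1) and three controls (label 0).
toy_scores = [2.1, 1.4, 0.3, -0.2, -0.9, -1.5]
toy_labels = [1, 1, 1, 0, 0, 0]
```

Candidate cutoffs are restricted to observed index values, which keeps the search exhaustive for a finite sample; interpolating between adjacent scores is an equally defensible choice.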

[0052] In another aspect, the present invention relates to a gene marker set for predicting the risk of colorectal cancer (CRC) in a subject, the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31.

[0053] In another aspect, the present invention relates to a kit for analyzing the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31, comprising primers used for PCR amplification that are designed according to the genes listed in SEQ ID NOs: 1-31.

[0054] In another aspect, the present invention relates to a kit for analyzing the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31, comprising one or more probes that are designed according to the genes listed in SEQ ID NOs: 1-31.

[0055] In another aspect, the present invention relates to use of the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31 for predicting the risk of colorectal cancer (CRC) in a subject.

[0056] In another aspect, the present invention relates to use of the gene marker set consisting of the genes listed in SEQ ID NOs: 1-31 for preparation of a kit for predicting the risk of colorectal cancer (CRC) in a subject.

[0057] In one embodiment, the sequencing reads are obtained via steps comprising: 1) collecting the sample j from the subject and extracting DNA from the sample, 2) constructing a DNA library and sequencing the library.

[0058] The present invention is further exemplified in the following non-limiting Examples. Unless otherwise stated, parts and percentages are by weight and degrees are in Celsius. As is apparent to one of ordinary skill in the art, these Examples, while indicating preferred embodiments of the invention, are given by way of illustration only, and the agents referenced are all commercially available.

General Method

[0059] I. Methods for Detecting Biomarkers (Detect Biomarkers by Using MGWAS Strategy)

[0060] To define CRC-associated metagenomic markers, the inventors carried out an MGWAS (metagenome-wide association study) strategy (Qin et al., 2012, "A metagenome-wide association study of gut microbiota in type 2 diabetes," Nature 490, 55-60, incorporated herein by reference). Using a sequence-based profiling method, the inventors quantified the gut microbiota in samples. On average, with the requirement that there should be ≥90% identity, the inventors could uniquely map paired-end reads to the updated gene catalog. To normalize the sequencing coverage, the inventors used relative abundance instead of the raw read count to quantify the gut microbial genes. However, unlike what is done in a GWAS subpopulation correction, the inventors applied this analysis to microbial abundance rather than to genotype. A Wilcoxon rank-sum test was done on the adjusted gene profile to identify differential metagenomic gene contents between the CRC patients and controls. The outcome of the analyses showed a substantial enrichment of a set of microbial genes that had very small P values, as compared with the expected distribution under the null hypothesis, suggesting that these genes were true CRC-associated gut microbial genes.
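The per-gene comparison can be sketched as follows. This is a minimal, standard-library Python version of a two-sided Wilcoxon rank-sum test that uses mid-ranks for ties and the normal approximation; whether the inventors used an exact or approximate P value is not stated, so the approximation is an assumption, and the function name and toy abundances are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    x, y: abundances of one gene in the two groups (e.g. cases and controls).
    Returns (z, p_value); z > 0 means x tends to rank higher than y.
    """
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2          # average of ranks i+1 .. j
        t = j - i                           # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:                            # all values identical: no evidence either way
        return 0.0, 1.0
    z = (rank_sum_x - mean) / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical relative abundances of one gene in five cases and five controls.
z, p = rank_sum_test([0.9, 0.8, 0.7, 0.6, 0.5], [0.1, 0.2, 0.3, 0.4, 0.45])
```

Running the test once per gene yields the P-value distribution that is then compared with the expectation under the null hypothesis.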

[0061] The inventors next controlled the false discovery rate (FDR) in the analysis, and defined CRC-associated gene markers from these genes corresponding to a FDR.
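The text above does not name the FDR procedure or the threshold that was applied. Purely as an illustration, the sketch below uses the Benjamini-Hochberg adjustment, which is an assumption and may differ from the inventors' method; the function name is hypothetical and the threshold is left to the caller.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    A gene is declared significant at FDR level alpha when its adjusted value is <= alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from the largest P value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min           # enforces monotone adjusted values
    return adjusted


# Hypothetical P values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```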

[0062] II. Methods for Selecting the 31 Best Markers from the Biomarkers (Maximum Relevance Minimum Redundancy (mRMR) Feature Selection Framework)

[0063] To identify an optimal gene set, a minimum redundancy-maximum relevance (mRMR) feature selection method was used to select from all the CRC-associated gene markers (for detailed information, see Peng et al., 2005, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans Pattern Anal Mach Intell, 27, 1226-1238, doi:10.1109/TPAMI.2005.159, which is incorporated herein by reference). The inventors used the "sideChannelAttack" package of R software to perform the incremental search and found 128 sequential marker sets. For each sequential set, the inventors estimated the error rate by a leave-one-out cross-validation (LOOCV) of the linear discrimination classifier. The optimal selection of marker sets was the one corresponding to the lowest error rate. In the present study, the inventors made the feature selection on a set of 140,455 CRC-associated gene markers. Since it was computationally prohibitive to perform mRMR using all of the genes, the inventors derived a statistically non-redundant gene set. Firstly, the inventors pre-grouped the 140,455 colorectal cancer associated genes that were highly correlated with each other (Kendall correlation >0.9). Then the inventors chose the longest gene of each group as a representative gene for the group, since longer genes have a higher chance of being functionally annotated and will draw more reads during the mapping procedure. This generated a non-redundant set of 15,836 significant genes. Subsequently, the inventors applied the mRMR feature selection method to the 15,836 significant genes and identified an optimal set of 31 gene biomarkers that are strongly associated with colorectal cancer for colorectal cancer classification, which are shown in Table 1.
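The pre-grouping step (Kendall correlation >0.9, longest gene kept per group) can be sketched as follows. The text does not state how correlated genes were merged into groups, so this sketch assumes single-linkage grouping with a union-find structure and uses Kendall's tau-a (no tie correction); the gene names, abundance profiles and gene lengths are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for a, b in combinations(range(len(x)), 2):
        s = (x[a] - x[b]) * (y[a] - y[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def representative_genes(profiles, lengths, threshold=0.9):
    """Group genes whose profiles correlate above threshold; keep the longest per group."""
    genes = list(profiles)
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for g, h in combinations(genes, 2):
        if kendall_tau(profiles[g], profiles[h]) > threshold:
            parent[find(g)] = find(h)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(max(members, key=lambda g: lengths[g]) for members in groups.values())

# Hypothetical abundance profiles across five samples and gene lengths in bp.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],   # perfectly rank-correlated with geneA
    "geneC": [5, 1, 4, 2, 3],
}
lengths = {"geneA": 900, "geneB": 1500, "geneC": 600}
kept = representative_genes(profiles, lengths)
```

The pairwise loop is quadratic in the number of genes, so at the scale of 140,455 genes a real pipeline would need a more economical strategy than this illustration.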

TABLE 1. Enrichment information for the 31 optimal gene markers.

Gene id   Correlation coefficient with CRC   mRMR rank   Enrichment (1 = Control, 0 = CRC)   SEQ ID NO:
2361423   -0.558205377                       1           0                                   1
2040133   -0.500237832                       2           0                                   2
3246804   -0.454281109                       3           0                                   3
3319526   0.441366585                        4           1                                   4
3976414   0.431923463                        5           1                                   5
1696299   -0.499397182                       6           0                                   6
2211919   0.410506085                        7           1                                   7
1804565   0.418663439                        8           1                                   8
3173495   -0.55118428                        9           0                                   9
482585    -0.454270958                       10          0                                   10
181682    0.400814213                        11          1                                   11
3531210   0.383705453                        12          1                                   12
3611706   0.413879567                        13          1                                   13
1704941   -0.468122499                       14          0                                   14
4256106   0.42048024                         15          1                                   15
4171064   0.43365554                         16          1                                   16
2736705   -0.417069104                       17          0                                   17
2206475   0.411512652                        18          1                                   18
370640    0.399015232                        19          1                                   19
1559769   0.427134509                        20          1                                   20
3494506   0.382302723                        21          1                                   21
1225574   -0.407066113                       22          0                                   22
1694820   -0.442595115                       23          0                                   23
4165909   0.410519669                        24          1                                   24
3546943   -0.395361093                       25          0                                   25
3319172   0.448526551                        26          1                                   26
1699104   -0.467388978                       27          0                                   27
3399273   0.388569946                        28          1                                   28
3840474   0.383705453                        29          1                                   29
4148945   0.407802676                        30          1                                   30
2748108   -0.426515966                       31          0                                   31

[0064] III. Gut Healthy Index (CRC Index)

[0065] To exploit the potential ability of disease classification by gut microbiota, the inventors developed a disease classifier system based on the gene markers that the inventors defined. For intuitive evaluation of the risk of disease based on these gut microbial gene markers, the inventors calculated a gut healthy index (CRC index).

[0066] To evaluate the effect of the gut metagenome on CRC, the inventors defined and calculated the gut healthy index for each individual on the basis of the selected 31 gut metagenomic markers as described above. For each individual sample, the gut healthy index of sample j, denoted by I.sub.j, was calculated by the formula below:

$$I_j = \left[ \frac{\sum_{i \in N} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|N|} - \frac{\sum_{i \in M} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|M|} \right],$$

wherein A_ij is the relative abundance of marker i in sample j, N is a subset of all of the abnormal-associated gene markers in the selected biomarkers related to the abnormal condition (namely, a subset of all of the CRC-associated gene markers in these 31 selected gut metagenomic markers), M is a subset of all of the control-associated gene markers in the selected biomarkers related to the abnormal condition (namely, a subset of all control-associated markers in these 31 selected gut metagenomic markers), and |N| and |M| are numbers (sizes) of the biomarkers in these two sets, respectively.

[0067] IV. Receiver Operating Characteristic (ROC) Analysis

[0068] The inventors applied the ROC analysis to assess the performance of the colorectal cancer classification based on metagenomic markers. Based on the 31 gut metagenomic markers selected above, the inventors calculated the CRC index for each sample. The inventors then used the "Daim" package of R software to draw the ROC curve.
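As a rough illustration of what the ROC analysis computes, the following Python sketch derives ROC points and the area under the curve (AUC) from per-sample index values. It is not the "Daim" package and does not reproduce its interface; the function names and the index values are hypothetical, and the AUC is obtained through the pairwise (Mann-Whitney) formulation.

```python
def roc_points(case_scores, control_scores):
    """Return (FPR, TPR) pairs, calling a sample positive when score >= threshold."""
    thresholds = sorted(set(case_scores) | set(control_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in case_scores) / len(case_scores)
        fpr = sum(s >= t for s in control_scores) / len(control_scores)
        points.append((fpr, tpr))
    return points

def roc_auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical CRC index values for four patients and four controls.
cases = [6.4, 3.2, 1.1, -0.5]
controls = [-5.5, -2.8, 0.3, -7.1]
points = roc_points(cases, controls)
auc = roc_auc(cases, controls)
```

With these toy values 15 of the 16 case/control pairs are ordered correctly, so the AUC is 0.9375.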

[0069] V. Disease Classifier System

[0070] After identifying biomarkers using the MGWAS strategy, and following the rule that the biomarkers used should yield the highest classification between disease and healthy samples with the least redundancy, the inventors ranked the biomarkers by a minimum redundancy-maximum relevance (mRMR) method and found sequential marker sets (the size can be as large as the number of biomarkers). For each sequential set, the inventors estimated the error rate using a leave-one-out cross-validation (LOOCV) of a classifier. The optimal selection of marker sets corresponded to the lowest error rate (in some embodiments, the inventors have selected 31 biomarkers).
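The selection loop over sequential marker sets can be sketched in Python as below. Two points are assumptions made for illustration only: a nearest-centroid classifier stands in for the linear discrimination classifier, and the feature ranking is taken as given (in the described method it would come from mRMR). The data and all names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: dist2(centroids[label], sample))

def loocv_error(x, y):
    """Leave-one-out cross-validation error rate."""
    errors = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) != y[i]:
            errors += 1
    return errors / len(x)

def best_sequential_set(x, y, ranked_features):
    """Evaluate the top-1, top-2, ... feature sets; return (size, error) of the best."""
    best = None
    for size in range(1, len(ranked_features) + 1):
        cols = ranked_features[:size]
        subset = [[row[c] for c in cols] for row in x]
        err = loocv_error(subset, y)
        if best is None or err < best[1]:
            best = (size, err)
    return best

# Hypothetical data: three cases (1) and three controls (0), three ranked features.
x = [
    [1.0, 0.0, 5.0], [1.2, 0.1, 0.0], [0.4, 0.2, 5.0],
    [0.0, 1.0, 0.0], [0.1, 1.1, 5.0], [0.5, 0.9, 0.0],
]
y = [1, 1, 1, 0, 0, 0]
best = best_sequential_set(x, y, [0, 1, 2])
```

On this toy data the first feature alone misclassifies two of six samples, while the first two features together give no LOOCV errors, so the two-feature set is selected.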

[0071] Finally, for intuitive evaluation of the risk of disease based on these gut microbial gene markers, the inventors calculated a gut healthy index. The larger the healthy index, the higher the risk of disease. The smaller the healthy index, the more healthy the subjects. The inventors can build an optimal healthy index cutoff using a large cohort. If the healthy index of the test sample is larger than the cutoff, then the subject is at a higher disease risk. If the healthy index of the test sample is smaller than the cutoff, then the subject has a low risk of disease. The optimal healthy index cutoff can be determined using a ROC method when the AUC (Area Under the Curve) is at its maximum.

[0072] The following examples are offered to illustrate, but not to limit the claimed invention.

Example 1

Identifying 31 Biomarkers from 128 Chinese Individuals and Using a Gut Healthy Index to Evaluate their Colorectal Cancer Risk

[0073] 1.1 Sample Collection and DNA Extraction

[0074] Stool samples from 128 subjects (cohort I), including 74 colorectal cancer patients and 54 healthy controls (Table 2) were collected in the Prince of Wales Hospital, Hong Kong with informed consent. To be eligible for inclusion in this study, individuals had to fit the following criteria for stool sample collection: 1) no taking of antibiotics or other medications, no special diets (diabetics, vegetarians, etc.), and having a normal lifestyle (without extra stress) for a minimum of 3 months; 2) a minimum of 3 months after any medical intervention; 3) no history of colorectal surgery, any kind of cancer, or inflammatory or infectious diseases of the intestine. Subjects were asked to collect stool samples before a colonoscopy examination in standardized containers at home and store the samples in their home freezer immediately. Frozen samples were then delivered to the Prince of Wales Hospital in insulating polystyrene foam containers and stored at -80° C. immediately until use.

[0075] Stool samples were thawed on ice and DNA extraction was performed using the Qiagen QIAamp DNA Stool Mini Kit according to the manufacturer's instructions. Extracts were treated with DNase-free RNase to eliminate RNA contamination. DNA quantity was determined using a NanoDrop spectrophotometer, a Qubit Fluorometer (with the Quant-iT™ dsDNA BR Assay Kit) and gel electrophoresis.

TABLE 2. Baseline characteristics of colorectal cancer cases and controls in cohort I.

Parameter                    Controls (n = 54)   Cases (n = 74)
Age                          61.76               66.04
Sex (M:F)                    33:21               48:26
BMI                          23.47               23.9
eGFR                         72.24               74.15
DM (%)                       16 (29.6%)          29 (39.2%)
Enterotype (1:2:3)           26:22:6             37:31:6
Stage of disease (1:2:3:4)   n.a.                16:21:30:7
Location (proximal:distal)   n.a.                13:61

BMI: body mass index; eGFR: epidermal growth factor receptor; DM: diabetes mellitus type 2.

[0076] 1.2 DNA Library Construction and Sequencing

[0077] DNA library construction was performed following the manufacturer's instruction (Illumina HiSeq 2000 platform). The inventors used the same workflow as described previously to perform cluster generation, template hybridization, isothermal amplification, linearization, blocking and denaturation, and hybridization of the sequencing primers (Qin, J. et al. (2012), "A metagenome-wide association study of gut microbiota in type 2 diabetes," Nature 490, 55-60, incorporated herein by reference).

[0078] The inventors constructed one paired-end (PE) library with an insert size of 350 bp for each sample, followed by high-throughput sequencing to obtain around 30 million PE reads of a length of 2×100 bp. High quality reads were extracted by filtering out low quality reads containing `N`s in the read, filtering out adapter contamination and human DNA contamination from the raw data, and trimming low quality terminal bases of reads. 751 million metagenomic reads (high quality reads) were generated (5.86 million reads per individual on average, Table 3).

[0079] 1.3 Reads Mapping

[0080] The inventors mapped the high quality reads (Table 3) to a published reference gut gene catalog established from European and Chinese adults (Qin, J. et al. (2012), "A metagenome-wide association study of gut microbiota in type 2 diabetes," Nature, 490, 55-60, incorporated herein by reference) (identity >=90%), and the inventors then derived the gene profiles using the same method of Qin et al. 2012, supra. From the reference gene catalog, as Qin et al. 2012, supra, the inventors derived a subset of 2,110,489 (2.1M) genes that appeared in at least 6 of the 128 samples.
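The prevalence filter described above (genes appearing in at least 6 of the 128 samples) can be sketched in Python as follows. The sketch assumes that a gene "appears" in a sample when its abundance there is greater than zero; the gene names and abundance values are hypothetical.

```python
def filter_by_prevalence(profile, min_samples=6):
    """Keep genes with non-zero abundance in at least min_samples samples."""
    return {
        gene: abundances
        for gene, abundances in profile.items()
        if sum(a > 0 for a in abundances) >= min_samples
    }

# Hypothetical profile: gene -> relative abundance in each of eight samples.
profile = {
    "gene_common": [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 0.0, 0.4],  # present in 6 samples
    "gene_rare": [0.0, 0.0, 0.5, 0.0, 0.1, 0.2, 0.3, 0.1],    # present in 5 samples
}
kept_genes = filter_by_prevalence(profile)
```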

TABLE 3. Summary of metagenomic data and mapping to reference gene catalog. The fourth column reports P-value results from Wilcoxon rank-sum tests.

Parameter                          Controls             Cases                 P-value
Average raw reads                  60162577             60496561              0.8082
After removing low quality reads   59423292 (98.77%)    59715967 (98.71%)     0.831
After removing human reads         59380535 ± 7378751   58112890 ± 10324458   0.419
Mapping rate                       66.82%               66.27%                0.252

[0081] 1.4 Analysis of Factors Influencing Gut Microbiota Gene Profiles

[0082] To ensure robust comparison of the gene content of the 128 metagenomes, the inventors generated a set of 2,110,489 (2.1M) genes that were present in at least 6 subjects, and generated 128 gene abundance profiles using these 2.1 million genes. The inventors used the permutational multivariate analysis of variance (PERMANOVA) test to assess the effect of different characteristics, including age, BMI, eGFR, TCHO, LDL, HDL, TG, gender, DM, CRC status, smoking status and location, on the gene profiles of the 2.1M genes. The inventors performed the analysis using the "vegan" package of R, and the permuted p-value was obtained after 10,000 permutations. The inventors also corrected for multiple testing using the "p.adjust" function of R with the Benjamini-Hochberg method to get the q-value for each gene.
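The Benjamini-Hochberg adjustment mentioned above can be illustrated with a short Python sketch. It implements the standard step-up adjustment that the "BH" method of R's "p.adjust" is understood to compute, but it is a stand-alone illustration, not that function; the p-values are hypothetical and are not taken from Table 4.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p-values for four tests.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four values the adjusted results are approximately 0.02, 0.04, 0.04 and 0.02.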

[0083] When the inventors performed permutational multivariate analysis of variance (PERMANOVA) on 13 different covariates, only CRC status was significantly associated with these gene profiles (q=0.0028, Table 4), showing a stronger association than the second-best determinant, body mass index (q=0.15). Thus, the data suggest an altered gene composition in CRC patient microbiomes.

TABLE 4. PERMANOVA analysis using the microbial gene profile. Analysis was conducted to test whether clinical parameters and colorectal cancer (CRC) status have a significant impact on the gut microbiota with q < 0.05. BMI: body mass index; DM: diabetes mellitus type 2; HDL: high density lipoprotein; TG: triglyceride; eGFR: epidermal growth factor receptor; TCHO: total cholesterol; LDL: low density lipoprotein.

Phenotype    Df   SumsOfSqs   MeanSqs    F.Model    R2         Pr(>F)   q-value
CRC Status   1    0.679293    0.679293   1.95963    0.015314   0.0004   0.0028
BMI          1    0.484289    0.484289   1.39269    0.011019   0.033    0.154
DM Status    1    0.438359    0.438359   1.257642   0.009883   0.084    0.27272
Location     1    0.436417    0.436417   1.228172   0.016772   0.0974   0.27272
Age          1    0.397282    0.397282   1.138728   0.008957   0.1923   0.4487
HDL          1    0.38049     0.38049    1.083265   0.010509   0.271    0.542
TG           1    0.365191    0.365191   1.039593   0.010089   0.3517   0.564964
eGFR         1    0.358527    0.358527   1.023138   0.009471   0.38     0.564964
CRC Stage    1    0.357298    0.357298   1.002413   0.013731   0.441    0.564964
Smoker       1    0.347969    0.347969   0.999825   0.013511   0.4439   0.564964
TCHO         1    0.321989    0.321989   0.915216   0.008893   0.6539   0.762883
LDL          1    0.306483    0.306483   0.871306   0.00847    0.7564   0.814585
Gender       1    0.267738    0.267738   0.765162   0.006036   0.9528   0.9528

[0084] 1.5 CRC-Associated Genes Identified by MGWAS

[0085] 1.5.1 Identification of colorectal cancer associated genes. The inventors performed a metagenome wide association study (MGWAS) to identify the genes contributing to the altered gene composition in the CRC samples. To identify the association between the metagenomic profile and colorectal cancer, a two-tailed Wilcoxon rank-sum test was used in the 2.1M (2,110,489) gene profiles. The inventors identified 140,455 gene markers, which were enriched in either case or control samples with P<0.01 (FIG. 1).

[0086] 1.5.2 Estimating the false discovery rate (FDR). Instead of a sequential P-value rejection method, the inventors applied the "qvalue" method proposed in a previous study (J. D. Storey and R. Tibshirani (2003), "Statistical significance for genomewide studies," Proceedings of the National Academy of Sciences of the United States of America, 100, 9440, incorporated herein by reference) to estimate the FDR. In the MGWAS, the statistical hypothesis tests were performed on a large number of features of the 140,455 genes. The false discovery rate (FDR) was 11.03%.

[0087] 1.6 Gut Microbiota-Based CRC Classification

[0088] The inventors proceeded to identify potential biomarkers for CRC from the genes associated with the disease, using the minimum redundancy maximum relevance (mRMR) feature selection method. However, since the computational complexity of this method did not allow them to use all 140,455 genes from the MGWAS approach, the inventors had to reduce the number of candidate genes. First, the inventors selected a stricter set of 36,872 genes with higher statistical significance (P<0.001; FDR=4.147%). Then the inventors identified groups of genes that were highly correlated with each other (Kendall's τ>0.9) and chose the longest gene in each group, generating a statistically non-redundant set of 15,836 significant genes. Finally, the inventors used the mRMR method and identified an optimal set of 31 genes that were strongly associated with CRC status (FIG. 2, Table 5). The inventors computed a CRC index based on the relative abundance of these markers, which clearly separated the CRC patient microbiomes from the control microbiomes (Table 6), as well as from 490 fecal microbiomes from two previous studies on type 2 diabetes in Chinese individuals (Qin et al. 2012, supra) and inflammatory bowel disease in European individuals (J. Qin et al. (2010), "A human gut microbial gene catalogue established by metagenomic sequencing," Nature, 464, 59, incorporated herein by reference) (FIG. 3, the median CRC-indexes for patients and controls in this study were 6.42 and -5.48, respectively; Wilcoxon rank-sum test, q<2.38×10^-10 for all five comparisons, see Table 7). Classification of the 74 CRC patient microbiomes against the 54 control microbiomes using the CRC index exhibited an area under the receiver operating characteristic (ROC) curve of 0.9932 (FIG. 4). At the cutoff -0.0575, the true positive rate (TPR) was 1, and the false positive rate (FPR) was 0.07407, indicating that the 31 gene markers could be used to accurately classify CRC individuals.

TABLE 6. Calculated gut healthy index of the 128 samples (CRC patients and non-CRC controls). Type: Con_CRC = non-CRC controls; CRC = CRC patients.

Sample ID  Type     CRC-index     Sample ID   Type  CRC-index
502A       Con_CRC  -7.505749695  A10A        CRC   13.26483131
512A       Con_CRC  -5.150023018  M2.PK002A   CRC   7.002094781
515A       Con_CRC  -4.919398163  M2.PK003A   CRC   5.108478224
516A       Con_CRC  -2.793151285  M2.PK018A   CRC   2.243592264
517A       Con_CRC  -8.078128133  M2.PK019A   CRC   -0.057498133
519A       Con_CRC  -7.556675412  M2.PK021A   CRC   7.878402029
530A       Con_CRC  -0.194519906  M2.PK022A   CRC   9.047909247
534A       Con_CRC  -5.251127609  M2.PK023A   CRC   5.428574192
536A       Con_CRC  -7.08635459   M2.PK024A   CRC   5.032760805
M2.PK504A  Con_CRC  -5.470747464  M2.PK026A   CRC   6.257085759
M2.PK514A  Con_CRC  -4.441183208  M2.PK027A   CRC   1.59430903
M2.PK520B  Con_CRC  -8.101427301  M2.PK029A   CRC   9.331138747
M2.PK522A  Con_CRC  0.269338093   M2.PK030A   CRC   4.728023967
M2.PK523A  Con_CRC  -6.980913756  M2.PK032A   CRC   6.055831256
M2.PK524A  Con_CRC  -9.027027667  M2.PK037A   CRC   4.227424374
M2.PK531B  Con_CRC  -5.483143199  M2.PK038A   CRC   2.669264211
M2.PK532A  Con_CRC  -5.96003222   M2.PK041A   CRC   4.558926807
M2.PK533A  Con_CRC  -7.718764145  M2.PK042A   CRC   3.47308125
M2.PK543A  Con_CRC  -9.844975269  M2.PK043A   CRC   5.347387703
M2.PK548A  Con_CRC  -4.062846751  M2.PK045A   CRC   8.09166979
M2.PK556A  Con_CRC  -4.15150788   M2.PK046A   CRC   9.235279951
M2.PK558A  Con_CRC  -9.712104855  M2.PK047A   CRC   8.45229555
M2.PK602A  Con_CRC  -7.380042553  M2.PK051A   CRC   6.602608047
M2.PK615A  Con_CRC  3.232971256   M2.PK052A   CRC   3.207800397
M2.PK617A  Con_CRC  -8.878473599  M2.PK055A   CRC   5.088317256
M2.PK619A  Con_CRC  -8.279540689  M2.PK056B   CRC   5.504229632
M2.PK630A  Con_CRC  -5.993197547  M2.PK059A   CRC   5.466091636
M2.PK644A  Con_CRC  1.230424198   M2.PK063A   CRC   3.758294225
M2.PK647A  Con_CRC  -7.181191393  M2.PK064A   CRC   3.763414393
M2.PK649A  Con_CRC  -1.576643721  M2.PK065A   CRC   6.486959786
M2.PK653A  Con_CRC  -4.246899704  M2.PK066A   CRC   1.199091901
M2.PK656A  Con_CRC  -5.80900221   M2.PK067A   CRC   9.938025463
M2.PK659A  Con_CRC  -7.805935646  M2.PK069B   CRC   -0.04402983
M2.PK663A  Con_CRC  -5.007057718  M2.PK083B   CRC   8.394697958
M2.PK699A  Con_CRC  -8.827532431  M2.PK084A   CRC   9.25322799
M2.PK701A  Con_CRC  -0.981728615  M2.PK085A   CRC   7.852591304
M2.PK705A  Con_CRC  -8.822384737  MSC103A     CRC   4.05476664
M2.PK708A  Con_CRC  -6.573782359  MSC119A     CRC   4.331580986
M2.PK710A  Con_CRC  -7.558945558  MSC120A     CRC   3.865826479
M2.PK712A  Con_CRC  -9.207916748  MSC1A       CRC   9.930238103
M2.PK723A  Con_CRC  -4.481542621  MSC45A      CRC   9.331894011
M2.PK725A  Con_CRC  -7.520375154  MSC4A       CRC   0.006971195
M2.PK729A  Con_CRC  -5.318926226  MSC54A      CRC   12.10968629
M2.PK730A  Con_CRC  -4.3710193    MSC5A       CRC   3.272778932
M2.PK732A  Con_CRC  -5.20132309   MSC63A      CRC   7.74197911
M2.PK750A  Con_CRC  -6.64771202   MSC6A       CRC   8.063701275
M2.PK751A  Con_CRC  -3.65391467   MSC76A      CRC   6.730976418
M2.PK797A  Con_CRC  -4.675123647  MSC78A      CRC   6.999247399
M2.PK801A  Con_CRC  -7.766321018  MSC79A      CRC   6.805539524
509A       Con_CRC  -2.479402638  MSC81A      CRC   8.465000094
A60A       Con_CRC  1.078322254   M118A       CRC   8.675933723
506A       Con_CRC  -4.246837899  M123A       CRC   8.627635602
A21A       Con_CRC  -4.440375851  M2.Pk.001A  CRC   7.78045553
A51A       Con_CRC  -2.809587066  M2.Pk.005A  CRC   4.534189338
                                  M2.Pk.009A  CRC   8.188718934
                                  M2.Pk.017A  CRC   6.225010462
                                  M84A        CRC   3.497922009
                                  M89A        CRC   0.394210537
                                  M2.Pk.007A  CRC   5.703428174
                                  M2.Pk.010A  CRC   7.231959163
                                  M122A       CRC   8.387516145
                                  M2.Pk.004A  CRC   4.246104721
                                  M2.Pk.008A  CRC   5.299578303
                                  M2.Pk.011A  CRC   6.354957821
                                  M2.Pk.015A  CRC   7.719629705
                                  M113A       CRC   7.528437656
                                  M116A       CRC   10.54991338
                                  M117A       CRC   0.072052278
                                  M2.Pk.006A  CRC   9.368358379
                                  M2.Pk.012A  CRC   1.112535148
                                  M2.Pk.014A  CRC   8.671786146
                                  M2.Pk.016A  CRC   8.898356611
                                  M115A       CRC   7.241420602
                                  M2.Pk.013A  CRC   7.331598086

Example 2

Validating the 31 Biomarkers

[0089] The inventors validated the discriminatory power of the CRC classifier using another new independent study group, including 19 CRC patients and 16 non-CRC controls that were also collected in the Prince of Wales Hospital.

[0090] For each sample, DNA was extracted and a DNA library was constructed followed by high throughput sequencing as described in Example 1. The inventors calculated the gene abundance profile for these samples using the same method as described in Qin et al. 2012, supra. The relative abundance of each of the gene markers as set forth in SEQ ID NOs: 1-31 was then determined. The index of each sample was then calculated using the following formula:

$$I_j = \left[ \frac{\sum_{i \in N} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|N|} - \frac{\sum_{i \in M} \log_{10}\left(A_{ij} + 10^{-20}\right)}{|M|} \right],$$

wherein: A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers as set forth in SEQ ID NOs 1-31, N is a subset of all of the abnormal-associated gene markers and M is a subset of all of the control-associated gene markers, the subset of CRC-associated gene markers and the subset of control-associated gene markers are shown in Table 1, and |N| and |M| are numbers (sizes) of the biomarkers in these two subsets, respectively, wherein

|N| is 13 and |M| is 18.

[0091] Table 8 shows the calculated index of each sample, and Table 9 shows the relevant gene relative abundance of a representative sample, V30.

[0092] In this assessment analysis, the top 19 samples with the highest gut healthy index were all CRC patients, and all of the CRC patients were diagnosed as CRC individuals (Table 8 and FIG. 5). Only one of the non-CRC controls (FIG. 5, the triangle with *) was diagnosed as a CRC patient. At the cutoff -0.0575, the error rate was 2.86%, validating that the 31 gene markers can accurately classify CRC individuals.

TABLE 8. Calculated gut healthy index of the 35 samples. Type: Con_CRC = non-CRC controls; CRC = CRC patients.

Sample ID  Type     CRC-index     Sample ID  Type  CRC-index
V27        Con_CRC  0.269338056   V35        CRC   13.16483131
V19        Con_CRC  -0.981728643  V8         CRC   12.12968629
V26        Con_CRC  -2.793151257  V13        CRC   10.54991338
V10        Con_CRC  -4.371019     V7         CRC   9.958035463
V18        Con_CRC  -4.440375832  V17        CRC   9.2432279
V1         Con_CRC  -4.675123655  V2         CRC   9.235252955
V14        Con_CRC  -4.919398178  V15        CRC   8.465000028
V9         Con_CRC  -5.007057768  V25        CRC   8.188718932
V33        Con_CRC  -5.20132324   V20        CRC   7.852591353
V29        Con_CRC  -5.251127667  V3         CRC   7.74197955
V6         Con_CRC  -5.470747485  V24        CRC   7.528437632
V21        Con_CRC  -5.96003246   V16        CRC   6.225010478
V22        Con_CRC  -6.64771297   V30        CRC   6.055831257
V23        Con_CRC  -7.181191336  V31        CRC   5.088317266
V5         Con_CRC  -7.558945528  V28        CRC   3.865826489
V32        Con_CRC  -8.101427363  V4         CRC   3.758294237
                                  V11        CRC   2.669264236
                                  V34        CRC   2.243592293
                                  V12        CRC   1.199091982
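The reported error rate can be checked directly from the values in Table 8. The Python sketch below applies the cutoff of -0.0575 to the 35 index values transcribed from the table; treating an index strictly greater than the cutoff as a CRC call is an assumption about the decision rule, and the variable names are illustrative.

```python
CUTOFF = -0.0575

# CRC index values transcribed from Table 8.
controls = {
    "V27": 0.269338056, "V19": -0.981728643, "V26": -2.793151257,
    "V10": -4.371019, "V18": -4.440375832, "V1": -4.675123655,
    "V14": -4.919398178, "V9": -5.007057768, "V33": -5.20132324,
    "V29": -5.251127667, "V6": -5.470747485, "V21": -5.96003246,
    "V22": -6.64771297, "V23": -7.181191336, "V5": -7.558945528,
    "V32": -8.101427363,
}
patients = {
    "V35": 13.16483131, "V8": 12.12968629, "V13": 10.54991338,
    "V7": 9.958035463, "V17": 9.2432279, "V2": 9.235252955,
    "V15": 8.465000028, "V25": 8.188718932, "V20": 7.852591353,
    "V3": 7.74197955, "V24": 7.528437632, "V16": 6.225010478,
    "V30": 6.055831257, "V31": 5.088317266, "V28": 3.865826489,
    "V4": 3.758294237, "V11": 2.669264236, "V34": 2.243592293,
    "V12": 1.199091982,
}

# A sample is called CRC when its index exceeds the cutoff (assumed rule).
false_positives = [s for s, v in controls.items() if v > CUTOFF]
false_negatives = [s for s, v in patients.items() if v <= CUTOFF]
n_samples = len(controls) + len(patients)
error_rate = (len(false_positives) + len(false_negatives)) / n_samples
```

Under this rule only control V27 lies above the cutoff, giving 1 misclassification out of 35 samples, which rounds to the 2.86% stated in the text.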

TABLE 9. Gene relative abundance of Sample V30.

Gene id   Enrichment (1 = Control, 0 = CRC)   SEQ ID NO:   Calculation of gene relative abundance
2361423   0                                   1            2.24903E-05
2040133   0                                   2            8.77418E-08
3246804   0                                   3            0
3319526   1                                   4            0
3976414   1                                   5            0
1696299   0                                   6            4.04178E-06
2211919   1                                   7            7.89676E-07
1804565   1                                   8            0
3173495   0                                   9            0.000020166
482585    0                                   10           0
181682    1                                   11           0
3531210   1                                   12           0
3611706   1                                   13           0
1704941   0                                   14           1.73798E-06
4256106   1                                   15           0
4171064   1                                   16           9.35913E-08
2736705   0                                   17           1.41059E-07
2206475   1                                   18           3.12301E-07
370640    1                                   19           0
1559769   1                                   20           0
3494506   1                                   21           0
1225574   0                                   22           0
1694820   0                                   23           4.57783E-07
4165909   1                                   24           0
3546943   0                                   25           0
3319172   1                                   26           0
1699104   0                                   27           4.74411E-06
3399273   1                                   28           6.0661E-08
3840474   1                                   29           0
4148945   1                                   30           3.00829E-07
2748108   0                                   31           8.14399E-08
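As a worked example of the index formula, the Python sketch below recomputes the index of sample V30 from the relative abundances in Table 9. Markers with enrichment flag 0 are taken as the CRC-associated set N (13 markers) and markers with flag 1 as the control-associated set M (18 markers), consistent with |N| = 13 and |M| = 18 stated above; the function and variable names are illustrative.

```python
import math

# (SEQ ID NO, enrichment flag, relative abundance) transcribed from Table 9.
# Flag 0 = CRC-associated marker (set N); flag 1 = control-associated marker (set M).
TABLE_9 = [
    (1, 0, 2.24903e-05), (2, 0, 8.77418e-08), (3, 0, 0.0), (4, 1, 0.0),
    (5, 1, 0.0), (6, 0, 4.04178e-06), (7, 1, 7.89676e-07), (8, 1, 0.0),
    (9, 0, 0.000020166), (10, 0, 0.0), (11, 1, 0.0), (12, 1, 0.0),
    (13, 1, 0.0), (14, 0, 1.73798e-06), (15, 1, 0.0), (16, 1, 9.35913e-08),
    (17, 0, 1.41059e-07), (18, 1, 3.12301e-07), (19, 1, 0.0), (20, 1, 0.0),
    (21, 1, 0.0), (22, 0, 0.0), (23, 0, 4.57783e-07), (24, 1, 0.0),
    (25, 0, 0.0), (26, 1, 0.0), (27, 0, 4.74411e-06), (28, 1, 6.0661e-08),
    (29, 1, 0.0), (30, 1, 3.00829e-07), (31, 0, 8.14399e-08),
]

def crc_index(markers, pseudocount=1e-20):
    """Mean log10 abundance of CRC-associated markers minus that of control-associated markers."""
    crc_logs = [math.log10(a + pseudocount) for _, flag, a in markers if flag == 0]
    control_logs = [math.log10(a + pseudocount) for _, flag, a in markers if flag == 1]
    return sum(crc_logs) / len(crc_logs) - sum(control_logs) / len(control_logs)

index_v30 = crc_index(TABLE_9)
```

The mean log10 abundance is about -10.24 for the 13 CRC-associated markers and about -16.30 for the 18 control-associated markers, so the index is about 6.056, in agreement with the value of 6.055831257 listed for V30 in Table 8. The pseudocount of 10^-20 only matters for markers with zero abundance, each of which contributes exactly -20 to its sum.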

[0093] The inventors have therefore identified and validated a set of 31 markers that was determined using a minimum redundancy-maximum relevance (mRMR) feature selection method based on 140,455 CRC-associated markers. The inventors have also developed a gut healthy index to evaluate the risk of CRC disease based on these 31 gut microbial gene markers.

Example 3

Identifying Species Biomarkers from the 128 Chinese Individuals

[0094] Based on the sequencing reads of the 128 microbiomes from cohort I in Example 1, the inventors examined the taxonomic differences between control and CRC-associated microbiomes to identify microbial taxa contributing to the dysbiosis. For this, the inventors used taxonomic profiles derived from three different methods, as supporting evidence from multiple methods would strengthen an association. First, the inventors mapped metagenomic reads to 4650 microbial genomes in the IMG database (version 400) and estimated the abundance of microbial species included in that database (denoted IMG species). Second, the inventors estimated the abundance of species-level molecular operational taxonomic units (mOTUs) using universal phylogenetic marker genes. Third, the inventors organized the 140,455 genes identified by MGWAS into metagenomic linkage groups (MLGs) that represent clusters of genes originating from the same genome, and they annotated the MLGs at the species level using the IMG database whenever possible, grouped the MLGs based on these species annotations, and estimated the abundance of these species (denoted MLG species).

[0095] 3.1 Species Annotation of IMG Genomes

[0096] For each IMG genome, using the NCBI taxonomy identifier provided by IMG, the inventors identified the corresponding NCBI taxonomic classification at the species and genus levels using NCBI taxonomy dump files. The genomes without corresponding NCBI species names were left with their original IMG names, most of which were unclassified.

[0097] 3.2 Data Profile Construction

[0098] 3.2.1 Gene Profiles

[0099] The inventors mapped their high-quality reads to a published reference gut gene catalog established from European and Chinese adults (identity >=90%), and the inventors then derived the gene profiles using the same method of Qin et al. 2012, supra.

[0100] 3.2.2 mOTU Profile

[0101] Clean reads (high quality reads, as in Example 1) were aligned to the mOTU reference (79268 sequences total) with default parameters (S. Sunagawa et al. (2013), "Metagenomic species profiling using universal phylogenetic marker genes," Nature methods, 10, 1196, incorporated herein by reference). 549 species-level mOTUs were identified, including 307 annotated species and 242 mOTU linkage groups without representative genomes, the latter of which were putatively Firmicutes or Bacteroidetes.

[0102] 3.2.3 IMG-Species and IMG-Genus Profiles

[0103] Bacterial, archaeal and fungal sequences were extracted from the IMG v400 reference database (V. M. Markowitz et al. (2012), "IMG: the Integrated Microbial Genomes database and comparative analysis system," Nucleic acids research, 40, D115, incorporated herein by reference) downloaded from http://ftp.jgi-psf.org. 522,093 sequences were obtained in total, and a SOAP reference index was constructed based on 7 equal-sized segments of the original file. Clean reads were aligned to the reference using a SOAP aligner (R. Li et al. (2009), "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, 25, 1966, incorporated herein by reference) version 2.22, with the parameters "-m 4 -s 32 -r 2 -n 100 -x 600 -v 8 -c 0.9 -p 3". SOAP coverage software was then used to calculate the read coverage of each genome, normalized by genome length, and further normalized to the relative abundance for each individual sample. The profile was generated based on uniquely-mapped reads only.
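The two normalization steps described above (by genome length, then to a per-sample relative abundance) can be sketched in Python as follows. The sketch does not call or reproduce the SOAP coverage software; it assumes that uniquely-mapped read counts per genome are already available, and the genome names, read counts and lengths are hypothetical.

```python
def genome_relative_abundance(read_counts, genome_lengths):
    """Length-normalize per-genome read counts, then scale them to sum to 1 for the sample."""
    coverage = {g: read_counts[g] / genome_lengths[g] for g in read_counts}
    total = sum(coverage.values())
    return {g: c / total for g, c in coverage.items()}

# Hypothetical uniquely-mapped read counts and genome lengths (bp) for one sample.
reads = {"genome_a": 100, "genome_b": 100}
lengths = {"genome_a": 1000, "genome_b": 3000}
abundance = genome_relative_abundance(reads, lengths)
```

Although both genomes attract the same number of reads in this toy sample, the shorter genome receives three times the relative abundance of the longer one (0.75 versus 0.25), which is the purpose of the length normalization.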

[0104] 3.3 Identification of Colorectal Cancer-Associated MLG Species

[0105] Based on the profile of the 140,455 identified colorectal cancer-associated marker genes, the inventors constructed the colorectal cancer-associated MLGs using the method described in the previous type 2 diabetes study (Qin et al. 2012, supra). All of the genes were aligned to the reference genomes of the IMG database v400 to obtain genome-level annotation. An MLG was assigned to a genome if >50% constitutive genes were annotated to that genome, otherwise the genome was labeled unclassified. A constitutive gene is a gene that is transcribed continually as opposed to a facultative gene, which is only transcribed when needed. A total of 87 MLGs with a gene number over 100 were selected as colorectal cancer-associated MLGs. These MLGs were grouped based on the species annotations of these genomes to construct MLG species.
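The >50% assignment rule can be illustrated with a small Python sketch. It assumes that each gene of an MLG carries at most one genome annotation (or none), that the fraction is taken over all genes in the MLG, and that an MLG failing the rule is the item reported as unclassified; the function name and genome labels are hypothetical.

```python
from collections import Counter

def assign_mlg(gene_annotations, min_fraction=0.5):
    """Assign an MLG to the genome annotating more than min_fraction of its genes."""
    counts = Counter(a for a in gene_annotations if a is not None)
    if not counts:
        return "unclassified"
    genome, hits = counts.most_common(1)[0]
    if hits / len(gene_annotations) > min_fraction:
        return genome
    return "unclassified"

# Hypothetical per-gene genome annotations for three MLGs (None = no annotation).
mlg_majority = assign_mlg(["genome_x", "genome_x", "genome_x", None, "genome_y"])
mlg_split = assign_mlg(["genome_x", "genome_x", "genome_y", "genome_y"])
mlg_sparse = assign_mlg(["genome_x", "genome_y", None, None])
```

Because the rule requires strictly more than 50%, an MLG split evenly between two genomes remains unclassified in this sketch.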

[0106] To estimate the relative abundance of an MLG species, the inventors estimated the average abundance of the genes of the MLG species, after removing the genes with the 5% lowest and 5% highest abundance. The relative abundance of the IMG species was estimated by summing the abundance of the IMG genomes belonging to that species.
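The trimmed average used for MLG species can be sketched in Python as below. The text does not say how the 5% is rounded when the gene count is not a multiple of 20, so rounding down is an assumption here, and the abundance values are hypothetical.

```python
def trimmed_mean(abundances, fraction=0.05):
    """Mean after dropping the lowest and highest `fraction` of values (rounded down)."""
    ordered = sorted(abundances)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

# Hypothetical abundances of 20 genes in one MLG species, with one extreme outlier.
gene_abundances = list(range(1, 20)) + [1000]
mlg_abundance = trimmed_mean(gene_abundances)
```

With 20 genes, one value is removed from each end, so the outlier of 1000 does not affect the estimate of 10.5.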

[0107] These analyses identified 30 IMG species, 21 mOTUs and 86 MLG species that were significantly associated with CRC status (Wilcoxon rank-sum test, q<0.05; see Tables 10, 11). Eubacterium ventriosum was consistently associated with or enriched in the control microbiomes using all three methods (Wilcoxon rank-sum tests, IMG: q=0.0414; mOTU: q=0.012757; MLG: q=5.446×10^-4), and Eubacterium eligens was enriched according to two methods (Wilcoxon rank-sum tests, IMG: q=0.069; MLG: q=0.00031). Conversely, Parvimonas micra (q<1.80×10^-5), Peptostreptococcus stomatis (q<1.80×10^-5), Solobacterium moorei (q<0.004331) and Fusobacterium nucleatum (q<0.004565) were consistently associated with or enriched in CRC patient microbiomes using all three methods (FIG. 6, FIG. 7). P. stomatis has been associated with oral cancer, and S. moorei has been associated with bacteremia. Recent work using 16S rRNA sequencing has reported a significant enrichment of F. nucleatum in CRC tumor samples, and this bacterium has been shown to possess adhesive, invasive and pro-inflammatory properties. The inventors' results confirmed this association in a new cohort with different genetic and cultural origins. However, the highly-significant enrichment of P. micra, an obligate anaerobic bacterium that can cause oral infections like F. nucleatum, in CRC-associated microbiomes is a novel finding. P. micra is involved in the etiology of periodontia, and it produces a wide range of proteolytic enzymes and uses peptones and amino acids as an energy source. It is known to produce hydrogen sulphide, which promotes tumor growth and the proliferation of colon cancer cells. Further research is required to verify whether P. micra is involved in the pathogenesis of CRC, or if its enrichment is a result of CRC-associated changes in the colon and/or rectum. Nevertheless, it represents a potential biomarker for non-invasive diagnosis of CRC.

[0108] 3.4 Species Marker Identification

[0109] In order to evaluate the predictive power of these taxonomic associations, the inventors used the random forest ensemble learning method (D. Knights, E. K. Costello, R. Knight (2011), "Supervised classification of human microbiota," FEMS microbiology reviews, 35, 343, incorporated herein by reference) to identify key species markers in the species profiles from the three different methods.

[0110] 3.4.1 MLG Species Marker Identification

[0111] Based on the constructed 87 MLGs with gene numbers over 100, the inventors performed the Wilcoxon rank-sum test on each MLG using a Benjamini-Hochberg adjustment, and 86 MLGs were selected as colorectal cancer-associated MLGs with q<0.05. To identify MLG species markers, the inventors used the "randomForest 4.5-36" package of R version 2.10 to analyze the 86 colorectal cancer-associated MLG species. Firstly, the inventors sorted all of the 86 MLG species by the importance given by the "randomForest" method. MLG marker sets were constructed by creating incremental subsets of the top ranked MLG species, starting from 1 MLG species and ending at 86 MLG species.

[0112] For each MLG marker set, the inventors calculated the false prediction ratio in the cohort of 128 Chinese individuals (cohort I). Finally, the MLG species sets with the lowest false prediction ratio were selected as MLG species markers. Furthermore, the inventors drew the ROC curve using the probability of illness based on the selected MLG species markers.

[0113] 3.4.2 IMG Species and mOTU Species Markers Identification

[0114] Based on the IMG species and mOTU species profiles, the inventors identified the colorectal cancer-associated IMG species and mOTU species with q<0.05 (Wilcoxon rank-sum test with Benjamini-Hochberg adjustment). Subsequently, the IMG species markers and the mOTU species markers were selected using the random forest approach, as in the MLG species marker selection.
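
For readers unfamiliar with the Benjamini-Hochberg adjustment used to obtain the q-values, a minimal pure-Python sketch is given here. It is an illustration of the standard step-up procedure, not the inventors' implementation, and the function name `benjamini_hochberg` and the toy P-values are assumptions.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy P-values, for illustration only.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(q)
```

As a rough plausibility check against the tables, the last row of Table 11 (P-value 0.026354, q-value 0.02666) appears consistent with this formula at rank 86 of 87 tests, since 0.026354 × 87 / 86 ≈ 0.02666.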

[0115] This analysis revealed that 16 IMG species, 10 species-level mOTUs and 21 MLG species were highly predictive of CRC status (Tables 12, 13), with a predictive power of 0.86, 0.90 and 0.94 in ROC analysis, respectively (FIG. 8). Parvimonas micra was identified as a key species from all three methods, and Fusobacterium nucleatum and Solobacterium moorei from two out of three methods, providing further statistical support for their association with CRC status.
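
The "predictive power" figures quoted from the ROC analysis are areas under the ROC curve. As a minimal sketch, assuming each sample has a predicted probability of illness, the area can be computed as the fraction of case/control pairs in which the case scores higher (ties counted as one half). The function name `roc_auc` and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def roc_auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0      # case ranked above control
            elif c == k:
                wins += 0.5      # tie counts as half
    return wins / (len(case_scores) * len(control_scores))


# Toy predicted probabilities, for illustration only.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
print(auc)
```

An area of 0.5 corresponds to random guessing and 1.0 to perfect separation of cases from controls.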

[0116] 3.5 MLG, IMG and mOTU Species Stage Enrichment Analysis

[0117] Encouraged by the consistent species associations with CRC status, and to take advantage of the records of disease stages of the CRC patients (Table 2), the inventors explored the species profiles for specific signatures identifying early stages of CRC. The inventors hypothesized that such an effort might even reveal stage-specific associations that are difficult to identify in a global analysis. To identify which species were associated with or enriched in the four colorectal cancer stages or in healthy controls, the inventors carried out a Kruskal-Wallis test for the MLG species with a gene number over 100, and for all of the IMG species and mOTU species with q<0.05 (Wilcoxon rank-sum test with Benjamini-Hochberg adjustment), and obtained the species enrichment information from the highest rank mean among the four CRC stages and the control. The inventors also compared the significance between every two groups using a pairwise Wilcoxon rank-sum test.
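
The rank-mean-based enrichment call described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: abundances from all groups are pooled and ranked with average ranks for ties, the group with the highest rank mean is reported as the enriched group, and the Kruskal-Wallis H statistic is computed without tie correction and without a P-value. The function names and the toy abundances are hypothetical.

```python
from typing import Dict, List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_means(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean pooled rank of each group."""
    labels: List[str] = []
    pooled: List[float] = []
    for name, vals in groups.items():
        labels.extend([name] * len(vals))
        pooled.extend(vals)
    sums = {name: 0.0 for name in groups}
    for name, r in zip(labels, midranks(pooled)):
        sums[name] += r
    return {name: sums[name] / len(groups[name]) for name in groups}


def kruskal_h(groups: Dict[str, List[float]]) -> float:
    """Kruskal-Wallis H statistic (no tie correction)."""
    n = sum(len(v) for v in groups.values())
    means = rank_means(groups)
    total = sum(len(v) * means[g] ** 2 for g, v in groups.items())
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


# Toy abundances, for illustration only.
toy = {"control": [1, 2, 3], "stage I": [4, 5, 6], "stage II": [7, 8, 9]}
means = rank_means(toy)
enriched_in = max(means, key=means.get)
print(means, enriched_in, kruskal_h(toy))
```

The "Control rank mean" and "Case rank mean" columns of Tables 10 to 13 appear to be rank means of this kind over the pooled 128 samples: for Peptostreptococcus stomatis in Table 10, 54 × 37.25926 + 74 × 84.37838 ≈ 8256 = 128 × 129 / 2.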

[0118] In Chinese cohort I, several species showed significantly different abundances in the different CRC stages. Among these, the inventors did not identify any species enriched in stage I compared to the other CRC stages and the control samples. Peptostreptococcus stomatis, Prevotella nigrescens and Clostridium symbiosum were enriched in stage II or later compared to the control samples, suggesting that they colonize the colon/rectum after the onset of CRC (FIG. 9). However, Fusobacterium nucleatum, Parvimonas micra, and Solobacterium moorei were enriched in all four stages compared to the control samples and were most abundant in stage II (FIG. 10), suggesting that they play a role in both CRC etiology and pathogenesis, and implicating them as potential biomarkers for early CRC.

Example 4

Validation of Markers by qPCR

[0119] The 31 gene biomarkers were derived using the admittedly expensive deep metagenome sequencing approach. Translating them into diagnostic biomarkers would require reliable detection using simpler and less expensive methods such as quantitative PCR (TaqMan probe-based qPCR). Primers and probes were designed using Primer Express v3.0 (Applied Biosystems, Foster City, Calif., USA). The qPCR was performed on an ABI7500 Real-Time PCR System using the TaqMan® Universal PCR Master Mix reagent (Applied Biosystems). Universal 16S rDNA was used as an internal control, and the abundance of each gene marker was expressed as a level relative to 16S rDNA.
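
The text does not give the formula used to express marker abundance relative to 16S rDNA. As a hedged sketch only, the following assumes the common delta-Ct convention, in which the relative level is the amplification efficiency raised to the negative difference between the marker and 16S threshold cycles, with perfect doubling per cycle as the default. The function name, the parameter names and the Ct values are all assumptions.

```python
def relative_level(ct_marker: float, ct_16s: float, efficiency: float = 2.0) -> float:
    """Marker abundance relative to the 16S rDNA internal control.

    Assumes the delta-Ct convention: efficiency ** -(Ct_marker - Ct_16S).
    An efficiency of 2.0 means the product doubles every cycle.
    """
    return efficiency ** -(ct_marker - ct_16s)


# Invented Ct values: the marker crosses threshold 10 cycles after 16S rDNA.
level = relative_level(ct_marker=25.0, ct_16s=15.0)
print(level)
```

Under these assumptions a marker detected ten cycles later than 16S rDNA is roughly a thousand-fold less abundant.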

[0120] To validate the test, the inventors selected two case-enriched gene markers (m482585 and m1704941) and measured their abundance by qPCR in a subset of 100 samples (55 cases and 45 controls). Quantification of each of the two genes using the two platforms (metagenomic sequencing and qPCR) showed strong correlations (Spearman r=0.93-0.95, FIG. 11), suggesting that the gene markers could also be reliably measured using qPCR.
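
The platform comparison above relies on the Spearman correlation, which is the Pearson correlation of the ranks of the two measurements. A minimal pure-Python sketch is given here for illustration; the function names and the toy values are assumptions and do not reproduce the inventors' data.

```python
import math
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the midranks."""
    return pearson(midranks(x), midranks(y))


# Toy abundances from two hypothetical platforms, for illustration only.
sequencing = [1.0, 2.0, 3.0, 4.0, 5.0]
qpcr = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman(sequencing, qpcr)
print(rho)
```

Because only ranks are used, the coefficient is insensitive to the different measurement scales of sequencing and qPCR.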

[0121] Next, in order to validate the markers in previously unseen samples, the inventors measured the abundance of these two gene markers using qPCR in 164 fecal samples (51 cases and 113 controls) from an independent Chinese cohort (cohort II). Both case-enriched gene markers were significantly associated with CRC status, at significance levels of q=6.56×10^-9 (m1704941, butyryl-CoA dehydrogenase from F. nucleatum) and q=0.0011 (m482585, RNA-directed DNA polymerase from an unknown microbe). The gene from F. nucleatum was present in only 4 out of 113 control microbiomes, suggesting a potential for developing specific diagnostic tests for CRC using fecal samples. The CRC index based on the combined qPCR abundance of the two case-enriched gene markers separated the CRC samples from control samples in cohort II (Wilcoxon rank-sum test, P=4.01×10^-7; FIG. 12A). However, the moderate classification potential (inferred from an area under the ROC curve of 0.73; FIG. 12B) using only these two genes suggested that additional biomarkers could improve the classification of CRC patient microbiomes.

[0122] Another gene from P. micra was the highly conserved rpoB gene (namely m1696299, with identity of 99.78%) encoding RNA polymerase subunit β, often used as a phylogenetic marker (F. D. Ciccarelli et al. (2006), "Toward automatic reconstruction of a highly resolved tree of life," Science, 311, 1283, incorporated herein by reference). Since the inventors repeatedly identified P. micra as a novel biomarker for CRC using several strategies including species-agnostic procedures, the inventors performed an additional qPCR experiment for this marker gene on Chinese cohort II as described above and found a significant enrichment in CRC patient microbiomes (Wilcoxon rank-sum test, P=2.15×10^-15). When the inventors combined this gene with the two qPCR-validated genes, the CRC index from these three genes clearly separated case from control samples in Chinese cohort II (Wilcoxon rank-sum test, P=5.76×10^-13, FIG. 13A) and showed reliable classification potential with an improved area under the ROC curve of 0.84 (FIG. 13B). The abundance of rpoB from P. micra was significantly higher compared to control samples starting from CRC stage II (FIG. 13C), agreeing with the inventors' results from species abundance analysis, and providing further evidence that this gene could serve as a non-invasive biomarker for the identification of early stage CRC.
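
The definition of the CRC index is not restated in this section. The sketch below therefore assumes one plausible form: the mean log10 abundance of the case-enriched markers minus the mean log10 abundance of the control-enriched markers, with a small pseudocount to avoid taking the logarithm of zero. The function name, the pseudocount value of 1e-20, the hypothetical control marker "c_example" and the abundances are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def crc_index(
    abundance: Dict[str, float],
    case_markers: Sequence[str],
    control_markers: Sequence[str] = (),
    pseudocount: float = 1e-20,
) -> float:
    """Assumed index: mean log10 of case markers minus mean log10 of control markers."""

    def mean_log(markers: Sequence[str]) -> float:
        if not markers:
            return 0.0  # an empty marker group contributes nothing
        logs = [math.log10(abundance.get(m, 0.0) + pseudocount) for m in markers]
        return sum(logs) / len(logs)

    return mean_log(case_markers) - mean_log(control_markers)


# Invented relative abundances for the three qPCR gene markers.
sample = {"m1704941": 1e-3, "m482585": 1e-5, "m1696299": 1e-4, "c_example": 1e-6}
three_gene_index = crc_index(sample, ["m1704941", "m482585", "m1696299"])
with_control = crc_index(sample, ["m1704941", "m482585", "m1696299"], ["c_example"])
print(three_gene_index, with_control)
```

With only case-enriched markers, as in the three-gene qPCR panel, the assumed index reduces to the mean log10 abundance of those markers.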

[0123] Sequence Information for the primers and probes for the selected 3 gene markers:

TABLE-US-00008

>1696299
Forward | AAGAATGGAGAGAGTTGTTAGAGAAAGAA | (SEQ ID NO: 32)
Reverse | TTGTGATAATTGTGAAGAACCGAAGA | (SEQ ID NO: 33)
Probe | AACTCAAGATCCAGACCTTGCTACGCCTCA | (SEQ ID NO: 34)

>1704941
Forward | TTGTAAGTGCTGGTAAAGGGATTG | (SEQ ID NO: 35)
Reverse | CATTCCTACATAACGGTCAAGAGGTA | (SEQ ID NO: 36)
Probe | AGCTTCTATTGGTTCTTCTCGTCCAGTGGC | (SEQ ID NO: 37)

>482585
Forward | AATGGGAATGGAGCGGATTC | (SEQ ID NO: 38)
Reverse | CCTGCACCAGCTTATCGTCAA | (SEQ ID NO: 39)
Probe | AAGCCTGCGGAACCACAGTTACCAGC | (SEQ ID NO: 40)
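
As an illustrative consistency check, the primers and probe listed for marker 1696299 can be located on the first 180 bases of SEQ ID NO: 6 (Parvimonas micra ATCC 33270) from the sequence listing of this document. The sketch below is a plain string search, assuming the reverse primer anneals to the opposite strand and is therefore matched as its reverse complement. The function and variable names are hypothetical, and this is not a primer design tool.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# First 180 bases of SEQ ID NO: 6, copied from the sequence listing.
TEMPLATE = (
    "aatcaatttagaattggtttatcaagaatggagagagttgttagagaaagaatgtcaact"
    "caagatccagaccttgctacgcctcaaggacttattaatataagacctcttgttgcgtct"
    "ttaaaagaattcttcggttcttcacaattatcacaattcatggatcaaaacaatccactt"
).upper()

FORWARD = "AAGAATGGAGAGAGTTGTTAGAGAAAGAA"   # SEQ ID NO: 32
REVERSE = "TTGTGATAATTGTGAAGAACCGAAGA"      # SEQ ID NO: 33
PROBE = "AACTCAAGATCCAGACCTTGCTACGCCTCA"    # SEQ ID NO: 34

# 0-based start positions on the template; -1 would mean "not found".
fwd_start = TEMPLATE.find(FORWARD)
probe_start = TEMPLATE.find(PROBE)
rev_start = TEMPLATE.find(reverse_complement(REVERSE))

# Amplicon runs from the first base of the forward primer
# to the last base of the reverse primer binding site.
amplicon_length = rev_start + len(REVERSE) - fwd_start
print(fwd_start, probe_start, rev_start, amplicon_length)
```

In this check the probe lies between the two primer sites, and the three oligonucleotides span a 133-base region of the listed sequence.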

TABLE-US-00009 TABLE 5 The 31 gene markers identified by the mRMR feature selection method. Detailed information regarding their enrichment, occurrence in colorectal cancer cases and controls, a statistical test of association, taxonomy and identity percentage are listed.

Marker gene ID | Wilcoxon test P-value | q-value | Enrich | Control (n = 54) count | Control rate (%) | Case (n = 74) count | Case rate (%) | Blastn to IMG v400: identity | Blastn to IMG v400: taxonomy | Blastp to KEGG v59: description
3546943 | 1.59E-06 | 1.90465E-06 | Case | 3 | 5.56 | 27 | 36.49 | 99.09 | Bacteroides sp. 2_1_56FAA | zinc protease
1225574 | 1.47E-06 | 1.8957E-06 | Case | 0 | 0.00 | 13 | 17.57 | 88.88 | Clostridium hathewayi DSM 13479 | lactose/L-arabinose transport system substrate-binding protein
2736705 | 5.35E-07 | 8.4594E-07 | Case | 0 | 0.00 | 21 | 28.38 | 99.68 | Clostridium hathewayi DSM 13479 | NA
2748108 | 2.12E-07 | 4.38881E-07 | Case | 0 | 0.00 | 20 | 27.03 | 99.82 | Clostridium hathewayi DSM 13479 | RNA polymerase sigma-70 factor, ECF subfamily
2040133 | 7.46E-11 | 7.70506E-10 | Case | 7 | 12.96 | 44 | 59.46 | 99.4 | Clostridium symbiosum WAL-14163 | cobalt/nickel transport system permease protein
1694820 | 9.78E-08 | 2.52552E-07 | Case | 1 | 1.85 | 18 | 24.32 | 99.17 | Fusobacterium sp. 7_1 | V-type H+-transporting ATPase subunit K
1704941 | 1.16E-08 | 5.12764E-08 | Case | 1 | 1.85 | 21 | 28.38 | 99.13 | Fusobacterium vincentii nucleatum ATCC 49256 | butyryl-CoA dehydrogenase
482585 | 3.81E-09 | 2.36224E-08 | Case | 9 | 16.67 | 50 | 67.57 | NA | NA | RNA-directed DNA polymerase
3246804 | 4.19E-08 | 1.44418E-07 | Case | 1 | 1.85 | 24 | 32.43 | NA | NA | citrate-Mg2+:H+ or citrate-Ca2+:H+ symporter, CitMHS family
1696299 | 8.50E-10 | 6.58857E-09 | Case | 1 | 1.85 | 33 | 44.59 | 99.78 | Parvimonas micra ATCC 33270 | DNA-directed RNA polymerase subunit beta
1699104 | 1.00E-08 | 5.12764E-08 | Case | 1 | 1.85 | 31 | 41.89 | 98.08 | Parvimonas micra ATCC 33270 | glutamate decarboxylase
2361423 | 4.89E-13 | 1.51641E-11 | Case | 7 | 12.96 | 55 | 74.32 | 93.87 | Peptostreptococcus anaerobius 653-L | transposase
3173495 | 1.14E-12 | 1.77065E-11 | Case | 4 | 7.41 | 44 | 59.46 | 93.98 | Peptostreptococcus anaerobius 653-L | transposase
3494506 | 4.93E-06 | 5.27005E-06 | Control | 19 | 35.19 | 4 | 5.41 | 90.37 | Burkholderiales bacterium 1_1_47 | ribosomal small subunit pseudouridine synthase A
2211919 | 3.59E-08 | 1.3927E-07 | Control | 49 | 90.74 | 39 | 52.70 | 80.99 | Coprobacillus sp. 8_2_54BFAA | NA
2206475 | 6.49E-07 | 9.58475E-07 | Control | 23 | 42.59 | 5 | 6.76 | 98.59 | Eubacterium ventriosum ATCC 27560 | beta-glucosidase
3976414 | 1.57E-07 | 3.48653E-07 | Control | 15 | 27.78 | 3 | 4.05 | 87.12 | Faecalibacterium cf. prausnitzii KLE1255 | adenosylcobinamide-phosphate synthase CobD
3319172 | 1.12E-07 | 2.666E-07 | Control | 19 | 35.19 | 2 | 2.70 | 84.22 | Faecalibacterium prausnitzii A2-165 | UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate--D-alanyl-D-alanine ligase
3319526 | 7.04E-08 | 1.98403E-07 | Control | 21 | 38.89 | 7 | 9.46 | 90.01 | Faecalibacterium prausnitzii L2-6 | replicative DNA helicase
4171064 | 4.69E-08 | 1.45363E-07 | Control | 29 | 53.70 | 10 | 13.51 | 94.94 | Faecalibacterium prausnitzii L2-6 | cytidine deaminase
370640 | 4.06E-06 | 4.49308E-06 | Control | 12 | 22.22 | 0 | 0.00 | 99.4 | Bacteroides clarus YIT 12056 | NA
1804565 | 7.31E-07 | 9.85539E-07 | Control | 16 | 29.63 | 1 | 1.35 | NA | NA | branched-chain amino acid transport system ATP-binding protein
3399273 | 4.88E-07 | 8.40846E-07 | Control | 41 | 75.93 | 23 | 31.08 | NA | NA | two-component system, LytT family, response regulator
3531210 | 9.76E-06 | 9.75675E-06 | Control | 8 | 14.81 | 0 | 0.00 | NA | NA | GDP-L-fucose synthase
3611706 | 1.67E-06 | 1.91677E-06 | Control | 13 | 24.07 | 0 | 0.00 | NA | NA | anti-repressor protein
3840474 | 9.76E-06 | 9.75675E-06 | Control | 6 | 11.11 | 0 | 0.00 | NA | NA | NA
4148945 | 5.46E-07 | 8.4594E-07 | Control | 23 | 42.59 | 8 | 10.81 | NA | NA | NA
4165909 | 1.60E-06 | 1.90465E-06 | Control | 8 | 14.81 | 0 | 0.00 | NA | NA | N-acetylmuramoyl-L-alanine amidase
4256106 | 3.69E-07 | 6.72327E-07 | Control | 21 | 38.89 | 4 | 5.41 | NA | NA | integrase/recombinase XerD
181682 | 6.97E-07 | 9.82079E-07 | Control | 27 | 50.00 | 8 | 10.81 | 99.25 | Roseburia intestinalis L1-82 | NA
1559769 | 2.83E-07 | 5.48673E-07 | Control | 17 | 31.48 | 5 | 6.76 | 88.65 | Coprococcus catus GD/7 | polar amino acid transport system substrate-binding protein

TABLE-US-00010 TABLE 7 CRC index estimated in CRC, T2D and IBD patients and healthy cohorts. P-values and q-values are for the comparison with CRC patients.

Cohort/group | Median CRC index | P-value | q-value
CRC patients | 6.420958803 | NA | NA
CRC controls | -5.476945331 | 1.96E-21 | 2.44E-21
T2D patients | -0.108110996 | 1.33E-27 | 2.21E-27
T2D controls | -1.471692382 | 6.21E-31 | 3.11E-30
IBD patients | -2.214296342 | 2.38E-10 | 2.38E-10
IBD controls | -4.724156396 | 7.56E-29 | 1.89E-28

TABLE-US-00011 TABLE 10 IMG and mOTU species associated with CRC with q-value <0.05

Species | Control rank mean | Case rank mean | Enrichment (1: Control; 0: Case) | P-value | q-value

30 IMG species
Peptostreptococcus stomatis | 37.25926 | 84.37838 | 0 | 1.29E-12 | 3.34E-09
Parvimonas micra | 38.43519 | 83.52027 | 0 | 1.13E-11 | 1.46E-08
Parvimonas sp. oral taxon 393 | 39.81481 | 82.51351 | 0 | 1.28E-10 | 1.10E-07
Parvimonas sp. oral taxon 110 | 43.52778 | 79.80405 | 0 | 4.71E-08 | 3.04E-05
Gemella morbillorum | 43.87037 | 79.55405 | 0 | 7.77E-08 | 4.01E-05
Burkholderia mallei | 45.19444 | 78.58784 | 0 | 4.84E-07 | 0.000156
Fusobacterium sp. oral taxon 370 | 45.02778 | 78.70946 | 0 | 3.93E-07 | 0.000156
Fusobacterium nucleatum | 45.09259 | 78.66216 | 0 | 4.33E-07 | 0.000156
Leptotrichia buccalis | 45.60185 | 78.29054 | 0 | 7.30E-07 | 0.000209
Beggiatoa sp. PS | 46.53704 | 77.60811 | 0 | 2.79E-06 | 0.000601
Prevotella intermedia | 46.47222 | 77.65541 | 0 | 2.67E-06 | 0.000601
Streptococcus dysgalactiae | 47.06481 | 77.22297 | 0 | 3.09E-06 | 0.000613
Streptococcus pseudoporcinus | 47.5 | 76.90541 | 0 | 8.58E-06 | 0.001581
Paracoccus denitrificans | 47.48148 | 76.91892 | 0 | 9.35E-06 | 0.001608
Solobacterium moorei | 47.66667 | 76.78378 | 0 | 1.17E-05 | 0.001884
Streptococcus constellatus | 48.2037 | 76.39189 | 0 | 2.20E-05 | 0.003153
Crenothrix polyspora | 48.76852 | 75.97973 | 0 | 4.20E-05 | 0.005697
Filifactor alocis | 49.06481 | 75.76351 | 0 | 5.84E-05 | 0.007533
Sulfurovum sp. SCGC AAA036-O23 | 52.12037 | 73.53378 | 0 | 6.60E-05 | 0.008105
Clostridium hathewayi | 49.68519 | 75.31081 | 0 | 0.000115 | 0.013431
Lachnospiraceae bacterium 5_1_57FAA | 50.10185 | 75.00676 | 0 | 0.000178 | 0.019084
Peptostreptococcus anaerobius | 50.14815 | 74.97297 | 0 | 0.000186 | 0.019221
Streptococcus equi | 50.58333 | 74.65541 | 0 | 0.00029 | 0.027747
Streptococcus anginosus | 50.66667 | 74.59459 | 0 | 0.000316 | 0.029114
Leptotrichia hofstadii | 50.99074 | 74.35811 | 0 | 0.000342 | 0.030424
Peptoniphilus indolicus | 51.2963 | 74.13514 | 0 | 0.000581 | 0.048307
Eubacterium ventriosum | 80.98148 | 52.47297 | 1 | 1.77E-05 | 0.00269
Adhaeribacter aquaticus | 77.06481 | 55.33108 | 1 | 0.000271 | 0.026839
Eubacterium eligens | 77.90741 | 54.71622 | 1 | 0.000482 | 0.041404
Haemophilus sputorum | 77.66667 | 54.89189 | 1 | 0.000608 | 0.048977

21 mOTU species
Parvimonas micra | 46.2963 | 77.78378 | 0 | 4.11E-08 | 1.80E-05
Peptostreptococcus stomatis | 46.25 | 77.81757 | 0 | 6.56E-08 | 1.80E-05
motu_linkage_group_731 | 50.42593 | 74.77027 | 0 | 1.08E-06 | 0.000198
Gemella morbillorum | 47.93519 | 76.58784 | 0 | 1.57E-06 | 0.000215
Clostridium symbiosum | 48.66667 | 76.05405 | 0 | 1.89E-05 | 0.00173
Solobacterium moorei | 51.22222 | 74.18919 | 0 | 6.31E-05 | 0.004331
Fusobacterium nucleatum | 54.62037 | 71.70946 | 0 | 9.15E-05 | 0.004565
unclassified Fusobacterium | 54.22222 | 72 | 0 | 0.000176 | 0.00806
Clostridium ramosum | 50.92593 | 74.40541 | 0 | 0.000289 | 0.012202
Clostridiales bacterium 1_7_47FAA | 51.27778 | 74.14865 | 0 | 0.000365 | 0.013366
Bacteroides fragilis | 51.09259 | 74.28378 | 0 | 0.00045 | 0.01371
motu_linkage_group_624 | 51.01852 | 74.33784 | 0 | 0.000448 | 0.01371
Clostridium bolteae | 51.81481 | 73.75676 | 0 | 0.000952 | 0.026134
motu_linkage_group_407 | 81.13889 | 52.35811 | 1 | 6.00E-06 | 0.000659
motu_linkage_group_490 | 80.46296 | 52.85135 | 1 | 3.06E-05 | 0.002403
motu_linkage_group_316 | 79.61111 | 53.47297 | 1 | 8.17E-05 | 0.004487
motu_linkage_group_443 | 79.66667 | 53.43243 | 1 | 7.63E-05 | 0.004487
Eubacterium ventriosum | 78.09259 | 54.58108 | 1 | 0.000325 | 0.012757
motu_linkage_group_510 | 77.84259 | 54.76351 | 1 | 0.000443 | 0.01371
motu_linkage_group_611 | 77.2963 | 55.16216 | 1 | 0.000606 | 0.017499
motu_linkage_group_190 | 75.16667 | 56.71622 | 1 | 0.001694 | 0.044273

TABLE-US-00012 TABLE 11 List of 86 MLG species formed after grouping MLGs with more than 100 genes using the species annotation when available.

Species | Control rank mean | Case rank mean | Enrichment (1: Control; 0: Case) | P-value | q-value
Parvimonas micra | 38.40741 | 83.54054 | 0 | 3.16E-12 | 2.75E-10
Fusobacterium nucleatum | 40.32407 | 82.14189 | 0 | 2.97E-11 | 1.29E-09
Solobacterium moorei | 42.2037 | 80.77027 | 0 | 3.85E-09 | 1.12E-07
Clostridium symbiosum | 46.31481 | 77.77027 | 0 | 1.64E-06 | 3.56E-05
CRC 2881 | 51.25926 | 74.16216 | 0 | 2.57E-06 | 4.46E-05
Clostridium hathewayi | 46.77778 | 77.43243 | 0 | 3.92E-06 | 5.69E-05
CRC 6481 | 52.09259 | 73.55405 | 0 | 1.36E-05 | 0.000107
Clostridium clostridioforme | 50.2037 | 74.93243 | 0 | 1.27E-05 | 0.000107
Clostridiales bacterium 1_7_47FAA | 48.16667 | 76.41892 | 0 | 2.02E-05 | 0.000135
Clostridium sp. HGF2 | 48.27778 | 76.33784 | 0 | 2.36E-05 | 0.000147
CRC 2794 | 51.03704 | 74.32432 | 0 | 3.50E-05 | 0.000179
CRC 4136 | 50.99074 | 74.35811 | 0 | 5.22E-05 | 0.000233
Bacteroides fragilis | 49.09259 | 75.74324 | 0 | 5.97E-05 | 0.000236
Lachnospiraceae bacterium 5_1_57FAA | 49.96296 | 75.10811 | 0 | 7.37E-05 | 0.000273
Desulfovibrio sp. 6_1_46AFAA | 53.33333 | 72.64865 | 0 | 0.000214 | 0.000546
Coprobacillus sp. 3_3_56FAA | 50.53704 | 74.68919 | 0 | 0.000265 | 0.000623
Cloacibacillus evryensis | 52.73148 | 73.08784 | 0 | 0.000359 | 0.000801
CRC 2867 | 52.31481 | 73.39189 | 0 | 0.000552 | 0.001162
Fusobacterium varium | 54.57407 | 71.74324 | 0 | 0.000586 | 0.001186
Clostridium bolteae | 51.39815 | 74.06081 | 0 | 0.000647 | 0.001223
Subdoligranulum sp. 4_3_54A2FAA | 51.56481 | 73.93919 | 0 | 0.000758 | 0.001373
Clostridium citroniae | 51.71296 | 73.83108 | 0 | 0.000861 | 0.001529
Lachnospiraceae bacterium 8_1_57FAA | 51.88889 | 73.7027 | 0 | 0.001024 | 0.001782
Streptococcus equinus | 54.52778 | 71.77703 | 0 | 0.001581 | 0.002457
CRC 4069 | 53.7963 | 72.31081 | 0 | 0.001632 | 0.00249
Lachnospiraceae bacterium 3_1_46FAA | 52.53704 | 73.22973 | 0 | 0.00178 | 0.002612
Dorea formicigenerans | 52.98148 | 72.90541 | 0 | 0.002703 | 0.003409
Synergistes sp. 3_1 syn1 | 54.37963 | 71.88514 | 0 | 0.003358 | 0.004002
Lachnospiraceae bacterium 3_1_57FAA_CT1 | 54.07407 | 72.10811 | 0 | 0.004478 | 0.005109
CRC 3579 | 54.05556 | 72.12162 | 0 | 0.005638 | 0.006289
Alistipes indistinctus | 54.50926 | 71.79054 | 0 | 0.008262 | 0.008766
Con 10180 | 82.03704 | 51.7027 | 1 | 4.87E-06 | 6.05E-05
Coprococcus sp. ART55/1 | 80.85185 | 52.56757 | 1 | 8.22E-06 | 8.94E-05
Con 7958 | 75.27778 | 56.63514 | 1 | 1.36E-05 | 0.000107
butyrate-producing bacterium SS3/4 | 80.57407 | 52.77027 | 1 | 1.98E-05 | 0.000135
Haemophilus parainfluenzae | 80.49074 | 52.83108 | 1 | 2.54E-05 | 0.000148
Con 154 | 80.35185 | 52.93243 | 1 | 3.30E-05 | 0.000179
Con 4595 | 77.21296 | 55.22297 | 1 | 4.17E-05 | 0.000202
Con 1617 | 76.12963 | 56.01351 | 1 | 5.61E-05 | 0.000233
Con 1979 | 79.94444 | 53.22973 | 1 | 5.62E-05 | 0.000233
Con 1371 | 78.46296 | 54.31081 | 1 | 7.54E-05 | 0.000273
Con 1529 | 75.05556 | 56.7973 | 1 | 9.25E-05 | 0.00031
Eubacterium eligens | 79.53704 | 53.52703 | 1 | 9.03E-05 | 0.00031
Con 1987 | 79.42593 | 53.60811 | 1 | 0.000101 | 0.000324
Con 5770 | 79.39815 | 53.62838 | 1 | 0.000104 | 0.000324
Con 1197 | 75.42593 | 56.52703 | 1 | 0.000128 | 0.000383
Con 4699 | 78.78704 | 54.07432 | 1 | 0.000152 | 0.000441
Clostridium sp. L2-50 | 76.37963 | 55.83108 | 1 | 0.000167 | 0.000469
Con 2606 | 77.5 | 55.01351 | 1 | 0.000189 | 0.000514
Eubacterium ventriosum | 78.62963 | 54.18919 | 1 | 0.000207 | 0.000545
Bacteroides clarus | 75.55556 | 56.43243 | 1 | 0.000247 | 0.000597
Eubacterium biforme | 74.68519 | 57.06757 | 1 | 0.000247 | 0.000597
Faecalibacterium prausnitzii | 78.25926 | 54.45946 | 1 | 0.00034 | 0.000779
Con 563 | 72.7037 | 58.51351 | 1 | 0.000556 | 0.001162
Con 6037 | 77.5463 | 54.97973 | 1 | 0.000561 | 0.001162
Con 8757 | 77.17593 | 55.25 | 1 | 0.000634 | 0.001223
Ruminococcus obeum | 77.53704 | 54.98649 | 1 | 0.000629 | 0.001223
Con 1513 | 76.59259 | 55.67568 | 1 | 0.000701 | 0.001298
Roseburia intestinalis | 76.99074 | 55.38514 | 1 | 0.001079 | 0.001841
Ruminococcus torques | 76.92593 | 55.43243 | 1 | 0.001186 | 0.001984
Con 4829 | 76.7963 | 55.52703 | 1 | 0.001335 | 0.002151
Con 569 | 73.41667 | 57.99324 | 1 | 0.001334 | 0.002151
Con 10559 | 76.59259 | 55.67568 | 1 | 0.001561 | 0.002457
Con 1604 | 71.92593 | 59.08108 | 1 | 0.001781 | 0.002612
Con 2494 | 74.35185 | 57.31081 | 1 | 0.001802 | 0.002612
Con 1867 | 76.38889 | 55.82432 | 1 | 0.001908 | 0.002722
Con 1241 | 76.27778 | 55.90541 | 1 | 0.002132 | 0.00294
Con 5752 | 73.65741 | 57.81757 | 1 | 0.002163 | 0.00294
Con 7367 | 76.23148 | 55.93919 | 1 | 0.002112 | 0.00294
Con 6128 | 76.22222 | 55.94595 | 1 | 0.002274 | 0.003043
Con 5615 | 76.07407 | 56.05405 | 1 | 0.002372 | 0.003104
Klebsiella pneumoniae | 74.7037 | 57.05405 | 1 | 0.00239 | 0.003104
Con 4909 | 75.72222 | 56.31081 | 1 | 0.002685 | 0.003409
Con 356 | 75.94444 | 56.14865 | 1 | 0.002808 | 0.00349
Eubacterium rectale | 75.90741 | 56.17568 | 1 | 0.002953 | 0.003619
Con 6068 | 75.74074 | 56.2973 | 1 | 0.003338 | 0.004002
Con 4295 | 74.98148 | 56.85135 | 1 | 0.004171 | 0.004904
Con 2703 | 74.55556 | 57.16216 | 1 | 0.00437 | 0.005069
Con 2503 | 74.14815 | 57.45946 | 1 | 0.004522 | 0.005109
Con 631 | 70.01852 | 60.47297 | 1 | 0.006178 | 0.006804
Con 561 | 70.5 | 60.12162 | 1 | 0.008137 | 0.00874
Con 8420 | 72.64815 | 58.55405 | 1 | 0.008068 | 0.00874
Con 425 | 73.19444 | 58.15541 | 1 | 0.008397 | 0.008802
Con 7993 | 73.74074 | 57.75676 | 1 | 0.009358 | 0.009692
Burkholderiales bacterium 1_1_47 | 72.37963 | 58.75 | 1 | 0.009707 | 0.009935
Con 600 | 69.53704 | 60.82432 | 1 | 0.026354 | 0.02666

TABLE-US-00013 TABLE 12 IMG and mOTU species markers. IMG and mOTU species markers identified using the random forest method among species associated with CRC. Species markers are listed by their importance as reported by the method.

Species | Control rank mean | Case rank mean | Enrichment (1: Control; 0: Case) | P-value | q-value

16 IMG species markers
Peptostreptococcus stomatis | 37.25926 | 84.37838 | 0 | 1.29E-12 | 3.34E-09
Parvimonas micra | 38.43519 | 83.52027 | 0 | 1.13E-11 | 1.46E-08
Parvimonas sp. oral taxon 393 | 39.81481 | 82.51351 | 0 | 1.28E-10 | 1.10E-07
Parvimonas sp. oral taxon 110 | 43.52778 | 79.80405 | 0 | 4.71E-08 | 3.04E-05
Gemella morbillorum | 43.87037 | 79.55405 | 0 | 7.77E-08 | 4.01E-05
Fusobacterium sp. oral taxon 370 | 45.02778 | 78.70946 | 0 | 3.93E-07 | 1.56E-04
Burkholderia mallei | 45.19444 | 78.58784 | 0 | 4.84E-07 | 1.56E-04
Fusobacterium nucleatum | 45.09259 | 78.66216 | 0 | 4.33E-07 | 1.56E-04
Leptotrichia buccalis | 45.60185 | 78.29054 | 0 | 7.30E-07 | 2.09E-04
Prevotella intermedia | 46.47222 | 77.65541 | 0 | 2.67E-06 | 6.01E-04
Beggiatoa sp. PS | 46.53704 | 77.60811 | 0 | 2.79E-06 | 6.01E-04
Crenothrix polyspora | 48.76852 | 75.97973 | 0 | 4.20E-05 | 5.70E-03
Clostridium hathewayi | 49.68519 | 75.31081 | 0 | 1.15E-04 | 1.34E-02
Lachnospiraceae bacterium 5_1_57FAA | 50.10185 | 75.00676 | 0 | 1.78E-04 | 1.91E-02
Eubacterium ventriosum | 80.98148 | 52.47297 | 1 | 1.77E-05 | 2.69E-03
Haemophilus sputorum | 77.66667 | 54.89189 | 1 | 6.08E-04 | 4.90E-02

10 mOTU species markers
Peptostreptococcus stomatis | 46.25 | 77.81757 | 0 | 6.56E-08 | 1.80E-05
Parvimonas micra | 46.2963 | 77.78378 | 0 | 4.11E-08 | 1.80E-05
Gemella morbillorum | 47.93519 | 76.58784 | 0 | 1.57E-06 | 0.000215
Solobacterium moorei | 51.22222 | 74.18919 | 0 | 6.31E-05 | 0.004331
unclassified Fusobacterium | 54.22222 | 72 | 0 | 0.000176 | 0.00806
Clostridiales bacterium 1_7_47FAA | 51.27778 | 74.14865 | 0 | 0.000365 | 0.013366
motu_linkage_group_624 | 51.01852 | 74.33784 | 0 | 0.000448 | 0.01371
motu_linkage_group_407 | 81.13889 | 52.35811 | 1 | 6.00E-06 | 0.000659
motu_linkage_group_490 | 80.46296 | 52.85135 | 1 | 3.06E-05 | 0.002403
motu_linkage_group_316 | 79.61111 | 53.47297 | 1 | 8.17E-05 | 0.004487

TABLE-US-00014 TABLE 13 21 MLG species markers identified using the random forest method from 106 MLGs with a gene number over 100.

Species | Control rank mean | Case rank mean | Enrichment (1: Control; 0: Case) | P-value | q-value

21 MLG species markers
Parvimonas micra | 38.40741 | 83.54054 | 0 | 3.16E-12 | 2.75E-10
Fusobacterium nucleatum | 40.32407 | 82.14189 | 0 | 2.97E-11 | 1.29E-09
Solobacterium moorei | 42.2037 | 80.77027 | 0 | 3.85E-09 | 1.12E-07
CRC 2881 | 51.25926 | 74.16216 | 0 | 2.57E-06 | 4.46E-05
Clostridium hathewayi | 46.77778 | 77.43243 | 0 | 3.92E-06 | 5.69E-05
CRC 6481 | 52.09259 | 73.55405 | 0 | 1.36E-05 | 0.000107
Clostridiales bacterium 1_7_47FAA | 48.16667 | 76.41892 | 0 | 2.02E-05 | 0.000135
Clostridium sp. HGF2 | 48.27778 | 76.33784 | 0 | 2.36E-05 | 0.000147
CRC 4136 | 50.99074 | 74.35811 | 0 | 5.22E-05 | 0.000233
Bacteroides fragilis | 49.09259 | 75.74324 | 0 | 5.97E-05 | 0.000236
Clostridium citroniae | 51.71296 | 73.83108 | 0 | 0.000861 | 0.001529
Lachnospiraceae bacterium 8_1_57FAA | 51.88889 | 73.7027 | 0 | 0.001024 | 0.001782
Dorea formicigenerans | 52.98148 | 72.90541 | 0 | 0.002703 | 0.003409
Con 10180 | 82.03704 | 51.7027 | 1 | 4.87E-06 | 6.05E-05
Con 7958 | 75.27778 | 56.63514 | 1 | 1.36E-05 | 0.000107
butyrate-producing bacterium SS3/4 | 80.57407 | 52.77027 | 1 | 1.98E-05 | 0.000135
Haemophilus parainfluenzae | 80.49074 | 52.83108 | 1 | 2.54E-05 | 0.000148
Con 154 | 80.35185 | 52.93243 | 1 | 3.30E-05 | 0.000179
Con 1979 | 79.94444 | 53.22973 | 1 | 5.62E-05 | 0.000233
Con 5770 | 79.39815 | 53.62838 | 1 | 0.000104 | 0.000324
Con 1513 | 76.59259 | 55.67568 | 1 | 0.000701 | 0.001298

Although explanatory embodiments have been shown and described, it will be appreciated by those skilled in the art that the above embodiments cannot be construed to limit the present disclosure, and that changes, alternatives, and modifications can be made to the embodiments without departing from the nature, principles and scope of the present disclosure.

Sequence CWU 1

1

401816DNAPeptostreptococcus anaerobius 653-Lisolated from gut, Peptostreptococcus anaerobius 653-L 1atggccaaaa cacctatcgt agataagggg tgcttcatat cgaatgatgt taaaaggtca 60atagttttaa acctatgtga gactaagtca atggatctaa ttgcaagaga acactgtgta 120tctcctagta gtgttgccag aatacttcgt ttaactgaag ataggagaag aaaaaattat 180cttcctagga ttctatcaat agacgaattc aagtcagtaa atacagttga tgcgtctatg 240agtgtaaatt taactgattt agaaggcggt catatttttg atatcctggt ggataggagg 300caaagatacc tctttgagta ctttaattcc tatcccttga aggtcagaaa aagggtagaa 360tatgtgacta cagacatgta taagccatat attgatcttg ccaagaaggt ctttccaaat 420gccaatattg tggtagataa attccatata gtacagctct tgacaagaga gctaaacaag 480ttaaggataa atgagatgaa gaagcttaat accaggtcta gagagtataa aatactgaag 540agatactgga aaatacccct taggaagaag agagacttaa acagtatata tttttacaag 600aataggcact ttaaaaatat gaccagttca attgatatat tagactatat gttaaaggaa 660tttcccaact taaaagaggc ctatgatttt tatcaaaact tcctattaag tatatctaat 720aatgatgtcg ctatgcttga agacattcta aatactagga ctgatgaaat tcccatgtgt 780tttaggaaga gtataaaaag ccttaaaaag cttaga 8162504DNAClostridium symbiosum WAL-14163isolated from gut, Clostridium symbiosum WAL-14163 2atggttgcac ttgtatggct actgattgaa atgaaatata aaatcagtgt cccatctcca 60ctgttgctca gcatggttta caaacttttg cttccggcta tgcctgccta tcttctggct 120aaaatcccct ctgggaaatt aacggccagc ttgagaagaa tgccgatttc tacccatatc 180atgcttgtat tgatcgtcat gctccgcttt gcgccgactg tgctgcatga atttggagaa 240gtcagggaag ccatgaaaat tcgtggcttc ttaaaatcgg tcggtaatgt tttgaggcat 300ccaatggaca cgttggaata cgccattgtt ccgatggtgt tccgctcctt aaagatcgcg 360gacgagttag cagcttctgc catagtcagg ggaattgaaa gcccctacaa gaaagaaagc 420tactatgtca gccggatcgc tgcgctggat tactttttga ttgttgtcag cgtgggagct 480gccgtgtgct gctgtctttt atag 50431305DNAUnknownisolated from gut, unidentified 3atgttagcaa tcgtaggttt attaactatc ctggtcgtaa tgtttctgat tatgacaaaa 60aaatgttcga ctctggtcgc actgattgca gttcccatga ttgcatgtgt tattgtgggt 120cagggcgccg atatgggagg gtacataacg gccggtatca aaagtgtggc cgccaccgga 180gtcatgttta tttttgcagt ggcctttttc 
ggtgtcatgg gtgatgtggg tgcatttgaa 240atcgtagtga ataaaatact caggattatt gggaaagatc ctttgaaaat ctgtatcggc 300acgctgatta tcacattgat gacccacctg gacggctccg gcgcaacgac atttttgatc 360acaataccgg cgctgctgcc gatatacgat aaattgaaga tggatcggcg tgtgctggca 420actatagtgg cggcaggagc aggaaccatg aatctcgtcc cttggggagg gccgacgatc 480cgagcagcga cggcactgga ggtctcactg accgagcttt acaatcctat gattgtccct 540cagctttgcg gagtcgccgc ctgcgtgaca gtggcggtga tgtttggcct gaaggaacgg 600aaacgtttaa aagggactct ggaatctgtt tcggtagagc ctccgaaatt tgaggactta 660ccggaggagg agagagtgaa acgccgtccc caccttgtct ggtttaacat tctgctcatt 720atagttacaa ttgtgtcatt ggttatggag cttttgccgc cggccggctg ttttatggcg 780gcgctgtgca tcgcaatgct ggttaactac cgtgatttaa aggatcaggg aaaacggatg 840gacgagcatg cggtagcggc catgatgatg gcatccaccc tgtttggcgc aggctgcttt 900accggtatcc tgggaggctg cggcatgctg gaagcgatgg cccagggact ctgtgatatt 960ctcccggtag ccattatggg tcacattgcg attttggtgg cagttttctc catgcctctg 1020tcgctgatgt tcgatccgga cagcttctac tatgcagtac ttccggtaat tgcagtggcg 1080gccgaggtgg ccggtgttcc ggcattggca gtgggccgcg cggcgatatg cggacagatt 1140actgttggat tccccatttc accactgact ccatccacct tccttctgac aggactaacg 1200ggcgtggatc tcggggacca tcagaagcac agtttcgtgt ggctgtggct gatttccctg 1260acgattgtgc tggttgccgt ggtgatgggc gtaattccgg tatag 130541401DNAFaecalibacterium prausnitzii L2-6isolated from gut, Faecalibacterium prausnitzii L2-6 4atgccgaacg aacgacatta ctccaatgaa ctgaatctgg aaagcgtggg catcaatctg 60ccctacaaca tgcaggccga gcagagcgtg ctgggtgcgg tgctgctcaa gccggaaaca 120ctgaccgacc tggttgagat catccggccg gaaatgttct acacccggca gaacgcccaa 180atttattcgg aaatgctccg gctgttcacc agcgaccaga ccattgattt cgtcaccctg 240ctggacgcgg tcatctcaga cggcgtgttt cccagcgcgg acgaggcgaa agtctacctg 300accggtctgg ccgagacggt gcccagcatc tccaacgtga aagcctacgc ccagatcgtg 360caggaaaaat atctggtccg ccagctcatg ggtgtggcga aagatatctt gcaggatgcg 420ggcgacgagc cggacgcgga cctgctgctg gaaaacgccg agcagcgcat ttatgagatc 480cgctccgggc gggattccag cgccctgacg cccctttctt ccagcatggt ggaaacgctg 540accaatctgc 
agaagatcag cggcccggat gccgataagt acaagggcat ccctacaggc 600ttccgcctgc tggacaccgt gctcaccggc cttggccgcg gcgaccttat tattctggct 660gcccgccccg gtatgggcaa gaccagtttt gcgctgaaca ttgccacccg cgtggccatg 720cagcagaaag taccggtggc catcttcagc ctcgaaatga ccaaggagca gctgaccaac 780cggatcctct cggcggaggc cggcatcgac agccaggcgt tccgcaccgg cgccctccgg 840gcggaggact gggagtacct ggcccttgcc accgagaagc tccatgacgc gcccatttat 900atggatgaca cctcgggcat caccatcacc gagatgaaag ccaagatccg ccgggtgaac 960caggacccca gccgccccaa tgtggggctc atcgtcatcg actatctgca gctgatgacc 1020acgggccagc gcaccgagaa ccgtgtacag gagatcagct ccatcacccg aaacctcaag 1080atcatggcca aagagatgaa tgtgcccatc attgcgctga gccagctgtc ccgtgcggtg 1140gaaaagcagg gcaacaactc ctcccaccgc ccccagctgt ccgacctgcg tgattccggt 1200tccatcgagc aggacgccga ctgcgtgctg ttcctctacc gtgattctta ttacgccagc 1260cagaacccgg acggtgccga ggtggacgcc gacacggccg agtgcatcgt ggccaaaaac 1320cgccacggtg agaccagtac cgtgccgctg ggctgggatg gtgcccacac ccgctttatg 1380gatgtggact tcaaacgctg a 14015858DNAFaecalibacterium cf. prausnitzii KLE1255isolated from gut, Faecalibacterium cf. 
prausnitzii KLE1255 5atctccaaac tggaaaaaac gctgcgggca cggttcccga aaacgcagca gggcgaactg 60ctggccgggg cggtgctggc cttctgcctg ccggtgggca cctttctgct cacaagcgcc 120gtgtgccttc tggcggcaaa aatcagcccc tggctcggcc ttgccgtgca gatgttctgg 180tgcgggcagg cgctggcggc aaagggactt gtgcaggaga gccggaacgt ttacaacaag 240ctggtaaagc ccgacctgcc cgccgcccgc aaggccgtga gccgcatcgt ggggcgggac 300accgagaacc tgaccgccga gggcgtgacc aaggctgccg tggagactgt ggccgagaat 360gccagcgacg gcgtgattgc gccgctgctg tacatgctgc tgggcggcgc gccgctggcg 420ctgacctaca aggccgtcaa caccatggac agcatggtgg gctacaaaaa cgagacctat 480ctctacttcg gccgggcggc ggcaaagctg gacgatatgg caaactacat tcccagccgc 540cttgccgccc tgctgtgggc ggcggctgct gccctgaccg gcaacgatgc caaaggcgcg 600tggcgcatct ggcggcggga ccggcgcaat cacgccagcc ccaacagcgc ccagaccgaa 660agcgcctgcg ccggtgcgct gggcgtgcag ctggccgggc cggcctacta ctttggcgaa 720tactacccga aacccaccat cggcgatgcc ctgcgcccca ttgagccgca ggacatcctg 780cgggccgacc gcatgatgta cgccgccagc attctggcgc tggtgctcgg gcttgtgata 840cgggggttcg ttgtatga 8586930DNAParvimonas micra ATCC 33270isolated from gut, Parvimonas micra ATCC 33270 6aatcaattta gaattggttt atcaagaatg gagagagttg ttagagaaag aatgtcaact 60caagatccag accttgctac gcctcaagga cttattaata taagacctct tgttgcgtct 120ttaaaagaat tcttcggttc ttcacaatta tcacaattca tggatcaaaa caatccactt 180gcagaactta ctcataagag aagattatca gcattaggac ctggtggtct tagtagagat 240agagcaggat acgaagtaag agacgttcat gaaagtcact acggaagaat ttgtccgata 300gaaactccag aaggtccaaa catcggtctt attacttctc ttacaactta tgcaagagtt 360gatcaatatg gatttattga aacaccatat cgtgttgtaa ataatggaat tgctacaaag 420gacattgttt atttaactgc tgatgaagaa gatgaagtta ttatcgctca agccaatgaa 480ccacttgatg aaaatggacg ttttgtaaac gaaagagtaa gtggtcgtgg tattaatggc 540gaaaatgata tttatccaag agatacaatt caacttatgg acgtttctcc tcaacaaatt 600gtatcagttg gtacagcaat gattcctttc cttgaaaatg acgatgctac tcgtgcgttg 660atgggttcaa acatgcaaag acaagcagtg cctctacttg ttactgaagc tcctattgta 720ggaaccggta tagaacataa agcggcaaga gatagtggtg ttgttatcat tgctaaaaat 780tcaggaattg 
ttacaaaagt tgatagtgat gaaattcata ttaaaagaga tttagataat 840gtagttgata aatatagatt acttaaattt aaacgttcaa atcaaggaac aacaattaat 900caaagaccta tagttaatga aaatgacaga 9307336DNACoprobacillus sp. 8_2_54BFAAisolated from gut, Coprobacillus sp. 8_2_54BFAA 7atggcgattg atactgaatt agcaaaaaga ttacgttcat atcgtaattt taaacattta 60acacaaaaag atgttgctgc gcatttaaat gttcctcatt ctgcaatttc cgatatagaa 120aatggtaaaa gagacattac tgttagcgag ttaaaagtgt tttcaaattt atatggtaga 180agtgtagaag aaattatgag cgggaaaaaa tatgactatt ataatattgc caatatcgct 240cgtttactta ctgaacttcc tgatgatgat ttaaaagaaa tcatgtttat tattgaatat 300aaaagaaaaa gaaatgaaga acgtcatttg aaataa 3368594DNAUnknownisolated from gut, unidentified 8atggcaatgc tcactgtaga aaatatcaat gtatattacg gcgtgatcca cgcccttaaa 60gacatctcct ttcaggtaaa cgaaggcgag atcgtcgcac tgatcggcgc aaacggtgcc 120ggcaaaacca ccaccctgca gactgtcagc ggcatgctga gcgcaaagtc cggttcgatc 180cgatttcagg atcaggagat ttccagaatg ccggagcaca aaatcgtgaa gcagggaatt 240tcccacgtcc ccgaaggacg ccggatgttc tccaatctga cggttttgga aaacctgaaa 300atgggcgctt acaccagaaa agacaagcag gaaatcaaca attccctgga aatggtttat 360gagcggtttc cccgcttaaa ggaacgtacc cgccagctgg caggaactct ttccggcggt 420gaacagcaga tgcttgcaat gggacgtgca ctgatgtctc atccgaagat catccttctg 480gatgaaccgt ctatgggact ttcaccgatt tttgtaaatg agattttcga aattatcaag 540aaagtcagtg cagccggcac gaccgtactt ctggtagagc agaatgcaaa gaaa 5949432DNAPeptostreptococcus anaerobius 653-Lisolated from gut, Peptostreptococcus anaerobius 653-L 9tatttttaca agaataggca ctttaaaaat atgaccagtt cagttgatat attagattat 60atgttaaaag aatttcccaa cttaaaagat gcctatgatt tttatcaaaa cttcctatta 120agtatatcta ataatgatgt ggctatgctt gaagatattc taaatactag gactgataaa 180ataccaatgt gttttaggaa gagtataaaa agccttaaaa agtttagaaa gtatgtggta 240aattcactga aatatgacta tacgaatgcc atggtggagg gtaaaaacaa caagataaag 300gtaattaaaa gagtatccta cggatatagg agttttagga attttaaggc aaggataatg 360ctaatggaaa ggtataaaat acaaaagggc aacatccata gttatcagtt tgctatggat 420gctgccgcat aa 432101935DNAUnknownisolated from gut, 
unidentified 10aatatccgat atggcaacgg agctctggta gtagtccggg caagggaaaa ccttgtacat 60ggcgaagcag agcagattac cttcaatact aaaatattag aaaggtgcgt gaggcatttg 120agaaatccga ttgaagtatt gaaaactcta caagagaaag caggcaacga gaactatcaa 180tttgaacgcc tgtaccgaaa tctgtacaac gaggagtttt tcctattggc atacggaaat 240ctctctgcaa aagagggaaa tctgaccaag ggaacagacg gcgccacaat agacggaatg 300ggaatggagc ggattcgcaa gctgattgaa agcctgcgga accacagtta ccagccgtcc 360cctgcgagac gtgcctatat cccaaaatct aatggaaaac ggcgtccgtt aggcataccc 420tctgttgacg ataagctggt gcaggaagtt gtgaggttaa ttctcgaaag tgtgtatgaa 480agcaattttt ctgaacattc gcatggtttt agaccgaaca ggagctgtca cacggcactg 540acccagattc aaagaaactt cacaggggtt aaatggttca ttgaggggga catcaaaggt 600tattttgaca ccatcgacca ccatatcctt gtggatattt taagaaggcg cataaaggac 660gaatacctaa tctcgctgat atggaaattt ctgaaagccg gatacttaga agactggaaa 720ttcaatccta cctattccgg cactccgcaa ggctcggtca tcagtccaat acttgccaat 780atctacctta acgaattcga tacctatgtt gaagaataca tagagaaatt caaccgtggt 840aaaagacgtg aaagaaacag tgagtatcgc ttttatagtg atggcgcatc gaaactgagg 900gtaaagtacc gcgggttatg ggaaataatg acagccgatg aaaaagaaaa agccaaatgt 960gaagtaaatg agctcatgaa aaaagcaaaa cagattccag ctatgaatcc gatggacagc 1020aattaccgcc gtctgctcta ttgcaggtat gcggatgatt ttatttgcgg agtaatcgga 1080agcaaggaag atgcagaaac catcaaggct gattttagcc ggtacctgaa agaaaagctg 1140ggactggata tgtcggaaga aaagacactg attacacact caaacgaaaa agcggcgttc 1200cttggctacg aaatcgctgt ttccagaagc aatgaataca aaaagataag caacggacag 1260aaggcaagaa cctttaatgg gcgtgttcat ctatttatgc cacataataa atgggttaag 1320aagctgacca gttgcggagc aatggaaatc aaacagcagg acggcaaaga aatatggaaa 1380ccgcaggcga ggaaagacct catcaacaaa gagccgattg aaatcctaag catttacaat 1440gccgaaattc gtgggctgta caattattat tgtttggcaa gcaacgtatg caagctgcag 1500aaatattact acatcatgga atacagcatg taccagacgt ttgcagcgaa gtaccgtgat 1560aatttgcgga aaacgattaa caagcatacc cgaaacggcg tgtttggtgt cagctacact 1620acaaaaaccg gcaacgagaa acgggcgaca ttcgtgaaag gaagcttcca aaaacggact 1680gtcagcttag attacagtga tgaaatcccc 
tcttatcctg ccgcaaaata tagtcggaaa 1740aacggcttaa ttgagcggtt acagggtgga aaatgtgaac tatgcggaca gcagaccgac 1800aatgtaaaag ttcatcatgt caggaagctg aaagaattag ccggtatgaa agaatgggaa 1860agaaaaatgg ttcagatgaa cagaaaaact ctggttgttt gtaatacatg ttatggaaac 1920ataacaggca agtaa 1935111062DNARoseburia intestinalis L1-82isolated from gut, Roseburia intestinalis L1-82 11atggaaaaag taaaggcatt ttgtaaacgg aaaaacattg agatatccgt caagcgctac 60ctgattgatg cacttggtgc gatggcacag ggattatttg catcgctttt gatcggaacg 120atcatcagta cacttggaac gcagcttaat attccgattc ttgtgacagt cgggacttac 180gcgaaagcgg cagtcggacc ggcaatggcg atcgcaatcg gatatgcact gcaggcagcg 240cctttagtac tgttttcact tgcggcagtc ggtgcggcgg caaatgaact tggcggggca 300ggcggaccgc ttgcggtact tgtggttgca atttttgcag cagaatttgg aaaagcagtt 360tccaaagaga caaaaatcga tattattgtc actccgtttg tgaccatttt tgtcggggtc 420gcgctttcta tctggtgggc tccggcgatc ggtgcggcag cgagtgcagt cggtaatgcg 480atcatgtggg caaccgagct gcagccgttt ttcatgggaa tcattgtatc tgtgatcgtc 540gggattgcac tgacactgcc gatcagcagc gcagcaatct gtgcagcact tggactgacc 600ggattagccg gtggtgcagc acttgccgga tgctgtgcgc agatggtcgg atttgcagtg 660gcaagtttcc gtgaaaataa atggggcgga ttgtttgcac agggaatcgg tacatccatg 720cttcagatgg gtaatatcgt gaaaaatccg cgcatctggc tgccggcgac attggcgtct 780gcaatcaccg gaccgatcgc aatgtgtctg ttccatttac agatgaatgg tgcagcagtt 840tcctccggta tgggaacctg tggactggtc ggacagattg gtgtctatac gggatggatc 900gcagatattg aagcgggaag caaagctgcc attacaccga tggactggat cggactgatt 960ttcgtaagct ttcttctgcc gggcgtttta tcatggcttt ttagtgtgtt attccgtaag 1020atcggctgga tcaaagaagg cgatatgagg ctggacttat aa 106212873DNAUnknownisolated from gut, unidentified 12atgaaacgta ttttattaac tggagcaagt ggatttatag gtaaaaacat taaagagaca 60ttaaacagta aatatgacat atggagcccg tcaagccagg agctggattt aaaagatacc 120gaatgcgttg aagcatattt gaagcagcat tctttcgatg taatattgca tgcagcaaat 180tgtaatgata caaggaattc catatcagca tacgatgtac tcaatggaaa tctcagaatg 240ttttttaacc tagagagatg ttctcactat tatggaaaaa tgatttattt tgggtctggg 300gcagaatatg acagaagtaa 
taacatccct aatatgtcag aggactattt tgataccagt 360gttccgaaag atgcttacgg actttcaaaa tatattatgg caaaagcctg tttaaatcag 420aagaacattt atgaattgtg tttatttgga gtatacggaa aatatgagga atgggagaga 480agatttatct ctaatgcgat atgtcgtgca ttaaagggta tggatattac gcttcataaa 540aatgtatact ttgattattt gtgggtagat gacctcataa aaattatttc ttttttcatt 600gagaaagata acttgaggta caagaggtac aatgtgtgta gaggcgagaa ggttgatcta 660tattcgctgg cagtacaggt aaagaagact ttggatagcg aatgttcaat attagttggt 720gagcctggat ggaagaggga gtatactgcg gataacaata gaatgttgaa cgaaatgaat 780ggtttatctt ttacaaaact ggaagtgacg atagctgaat tgtgtgaata ttataaagag 840catttatcag aaatagttac tgaaaaattg taa 87313777DNAUnknownisolated from gut, unidentified 13atgaagaata tgataaaaat atttgaaaat gacgaattcg gaaaagtgag aacagtcatt 60aaggacggcg aaccgtggct tgtaggaaaa gatgttgcgg aaattttagg gtattccaac 120acaagggacg ctctttcacg tcatgtggat accgaggata aaaccaccgt cgtgatttcc 180gacagtggtt caaattacaa gagcaagacc actattatca atgaaagcgg cttttacagc 240ttagttctct caagcaaaat gccgagagcc aaagagttca ggcgttgggt gaccgccgaa 300gtcctcccca ccatcagacg caccggcggc tacgtttcca acgaggatat gttcatcaaa 360aactatctcc cctttctcga cgagccatac cgtgacctgt tccgacttca aatgaccatt 420atcaacaagc tgaatgaacg tatccgccac gatcagccgc tggtggagtt tgcgaatcag 480gtgtcaaata ccgataatct tatcgacatg aacgcaatgg caaagcttgc gagagcggaa 540aatatccccg tcggcagaaa caagctttac ggctggctga aaggaaaagg tgtgcttatg 600gcaaacaatc tgccgtatca ggcttttatc gaccgcggat atttttccgt aaaggagtcg 660gtgtttgaaa ctgcgactat gacaaagact tatcagcaga cgtttgttac gggcaggggg 720cagcagttcg tcataaattt gctgaagaaa tattatggga aggaggtttt gcaataa 77714687DNAFusobacterium nucleatum vincentii ATCC 49256isolated from gut, Fusobacterium nucleatum vincentii ATCC 49256 14tctgcaaaag aaaaagttgc tgcattagtt gctgcattaa aagcagatgg atatgatttt 60actgttggta tccctcttga tacaccaata ggaaaatctg aaagagttgt aagtgctggt 120aaagggattg gagataaaaa gaatatgaag ctaattgaaa acttagcaaa acaagctgga 180gcttctattg gttcttctcg tccagtggca gaaacattgc aatatgtacc tcttgaccgt 240tatgtaggaa tgtcaggaca 
aaaatttgtt ggaaaccttt atatagcttg tggaatttca 300ggagctttac aacatttaaa aggaattaaa gatgcaacaa caatagttgc tataaataca 360aactcaaatg ctccaatatt taagaatgca gactatggaa tagttggaga tttagcagaa 420attttacctt tattaactaa ggaattagat aatggagaag ctaaaaaaga tgcaccacct 480atgaagaaaa tgaagagagt tatacctaga gtagtgtata gtcctcatgt atatgtatgt 540agtggttgtg gacatgaata caatcctgat ttaggagatg aagattctga cataaaacca 600ggaactagat ttaaagattt accagaagat tggacttgtc ctgattgtgg agatccaaaa 660tctggatata tagatgcaaa aaaataa 687151206DNAUnknownisolated from gut, unidentified 15atgaggttat tttttgatat ggtatgtaac ggcagggcat tgcaaaatgt acaaatgtat 60aaattgaata tggttttaga tgtacacccc tatgctatta cagcaccgtc aaaaactggt 120ggccgttggc agacatatgt aaaggaaggt gataagcgta agattataag ggcttcttca 180aaggaaaaac taatggacaa attatatact gcctattttg ttcaaaatgg tgtttctggt 240atgaccatgg acaagctttt tctcgaatgg ttagcttata aggaatgtat cacaaatagt 300atgaatacga ttcgcagaca tgaacaacac tggaaaaagt attttcagga tatttcccca 360aataaggtat cttcctatga tcgtctggaa ttgcagaaag aatgtaatca gttaataaaa 420gttaataacc tttcttccaa agaatggcag aatgtaaaaa caattctttt aggtatgttt 480gactatgcct ttgaaaaagg atatattaat acaaacccca tgcccagtat taaaatcact 540gttaaattcc gtcaggtcaa taaaaagagt ggtaggactg aaacatatca gacagacgaa 600tacaaagcac ttatgcaata tctagatgca gaatatacag ctacagaaga ccttgcttta 660ttggctgtta aatttgattt ttttattgga tgccgtgttg ctgagttggt agctctcaag 720tggtgtgatg ttgaaaatct acggcattta catatttgta gggaagaggt taaagagtct 780gtccgtgttg gtgatacctg gaaagatgtt tataccgttt cagagcatac taagacatat 840acagaccggt ctataaattt agttcctaat gcgattgcta ttttaaatca tatccgtctt 900aaaatggctt ataatgtatc tgacgatgat tatatcttta cccggaacgg ttcccggatc 960acttcacgcc agattaatta tattcttgaa aaagcatgta caaaactggg aattatgatt 1020aagaggtcgc

ataaggtaag aaaaacggtt gcaagtcgtc tcaatgtcgg tgaggttccg 1080ttagattcta ttcgtgagct gttaggtcat gcaaatttaa gcactacact aagttatatt 1140tataatccgt tatcggaaaa agaaacctat aacctgatgt ccagagcctt ggggaaagtt 1200caatag 120616945DNAFaecalibacterium prausnitzii L2-6isolated from gut, Faecalibacterium prausnitzii L2-6 16atgaacagag aaacggtgaa catggtgcgc agtccgattt ctgtggaggg gaacatccgg 60cttgttccgt attatccggc ctacgataca gcacttgcgt ggtatcagga tgcacagctc 120tgcaaacagg tagataacag ggacttcgtt tatgatttgc cgctgctgaa gcggatgtat 180cattatctgg acacacacgg ggaactgttt tatattgagt atcggggtgt gctttgtggt 240gacgtcagcc tgcggacgac cggcgagctg gccatcgtca tctgcaagga gtaccagaat 300aaacacatcg ggcggaaggt catcgaaaaa atgctggagc tggctcggga aaggggcttg 360gcggagtgct tcgcgcacat ctattctttc aatacccagt cgcagaaaat gtttgaatcc 420attggctttg tcccacagga cgaagaacgc tatatctaca aattgcaaaa aggagaaccg 480actatgacaa aactgactct ggaagaaaag caggagctca tccggatggc ccttgcggcc 540agggagaggg cttacgtgcc ttacagcgac tttatggtgg gcgctgccct gcgcgccgag 600gatggccgtg tctttaccgg ctgccatgtg gagaatgccg cctttacccc caccagctgc 660gccgagcgca ccgcgctgtt caaagccgtg agcgagggcg tgaccaaatt tacggacatc 720gccgtggtag gctcccgccg gggcgagatc aatcagcaga tcacctcgcc ctgcggcgtc 780tgccgtcagg cactgtttga gtttggcggc ccggagctga acgtcatcat ggccaaaacg 840ccggatgatt tcatggagcg cagcatggat gagctgctgc cctttggctt cggtccctcc 900aatgtggcgg gcaacaaggc cgtggaagag gaagaaaaag gctga 94517627DNAClostridium hathewayi DSM 13479isolated from gut, Clostridium hathewayi DSM 13479 17atgcctatac ttcagcagct tctcacatta gtagagcagc acttcggtaa caaatgcgaa 60atcgtgcttc atgatctgac aaaggattac aaccatacca ttgtcgatat ccgaaacgga 120gacattaccc atcgttccat cgggggctgc ggaagcaact tagggctgga agtcctgcgc 180ggaaccgtgc tggatgggga tcgttttaac tatgttacca ccacacagga cggaaagatt 240ctccgttcct catcgatcta tctaaaaaat gatcagggcg aggtcatcgg atcgatctgc 300gtgaacctgg atatcacaga gacacttcag tttgaagggt atttacgcca gtttaaccag 360tttgacagct ttacttccaa cgacgaggag attttcgctc ccgacgtgaa taatcttctc 420agccatctga ttcagatggg acaggaacag 
atcggaaagc ctgcgctgga gatgaacaag 480aacgagaaga ttgagtttat ccgtttcctt gaccagaaag gagcattcct catcacgaag 540tccggggaac agatctgtga acttctggga atcagcaaat ttacctttta taattacctt 600gaaagcagcc gcagccagtc ggattcg 62718708DNAEubacterium ventriosum ATCC 27560isolated from gut, Eubacterium ventriosum ATCC 27560 18gcagcttcaa actacgacct ttgtacaaca atccttagaa atgaatgggg atacgatggt 60atcgtaatga ctgactggtg ggccaagatg aacgacgttg tagaaggtgg cgaagaatca 120aatcaggata caagagatat ggttcgctca cagaacgacg tatatatggt tgtaaacaat 180aacggcgcag aagttaactc aaacaacgac aacacagaga aatcaattaa agagggaaga 240cttacaatcg gagaacttca gcgagctgca atcaacatct gcaacttcat tctttcagca 300cctgttattg aaagagaatt agttgacaca gacgttgcaa aacattacga ttcagttcca 360aatgatcagg ccaagtatga agtatttaac attgaaaaag acaataaggt aatgttcaat 420agcggagcag aagcaacatt ggaagttgaa gacgaagggg aatacacaat tattgttaac 480atctcatttg acaagtccaa cttatcacag tcaacagtaa acgttaatgc caacggcaca 540acaatggtag taatccagac taatggaaca gacggcaact ggattacaca gaagctttgc 600aaggttaaac ttgacaaggg tgtatacaac ttaaaacttg aagaagtatt agcaggaatc 660aaagttaaat atattcagtt taagaagatt cctaagaaaa ataaataa 708191161DNABacteroides clarus YIT 12056isolated from gut, Bacteroides clarus YIT 12056 19atgaaaatca aacaattagc gaaaagcgca tcattcttgc tggtggcagg ttttatcagt 60tttactattc cgtcgtgtag cagtgaagaa gaaatcatca tccttcagga tgtaaaagta 120aacagtgaaa gcttcaatct ggccgaagac ggcagtacga ccatagaagt caaggtagta 180cccgaaaata ctccaatagc caaagccgta ctcagcacat cattatttaa tgaaagcggt 240gttttcgaag taacccgact cactcccaaa ggtaacggtg tatggcagat agcagcaaaa 300gtaaaggact tctcacgcat tcaaaacggt caggacgtaa tactttccgt ctatcaggaa 360gataatatgt atatccaaac cacattgaaa ataaacgacc catatagcat cgagggtaaa 420tatacaccgg tccatccgca agcctttact ttctacagtg ccgaagacgg caaactgatg 480gagattccgt tcatcatcac agccgacaac gcagccgacc ttgccgccat cagctacgac 540aatataaagg tagtcaatgg caccggaagc tctacaccca gcataagtat cacacatttc 600gcaatagctc cgatgacagg taaaacaggc ttctatctgc aagtggataa cgcccaactc 660gaaacggtaa aaaaagccat cacaaccatc 
gcttttttgg actgccgggt tatgataacc 720ggccctaacg gccgtgttgc ctatactcct gtgcgcctca ttgtttcttc tccgaagtgc 780atcatcaagg acgaccaact cagcctgctg catacagaat tgtccgcccc ggagtttaat 840agacaaatca ccatagatat gacccacgat ttttatcgtt tgggcaaaca gaatgataaa 900acaacctttg aggcgtttga aaaccgaggc ttgtataact cacaaggaga aatggcagat 960gcagaccctc agttcatttc gttgggttat accactcagg gcaaaaatac aacatgtaac 1020gtaactttaa aacatgatgc cacaattcct gcaatcggca cttaccacat ggtagaacgc 1080ctaaaaggat attgggaata tgacggaaag aaatatccga ccgtttgtac agacctgcaa 1140ttccaaatca cgattaaata a 116120750DNACoprococcus catus GD/7isolated from gut, Coprococcus catus GD/7 20atgaaaggaa aaagagttat tgcaggcatt ctgcttgcag gaattttagc agttaccctg 60gcagggtgta aaaacacaga taacactaaa gaagaatcag aaaagccggt tattaccctc 120ggcagcgata gctatccacc atacaattat ctgaatgagg atggtgtacc gacgggcata 180gatgtggaac tagctacaga agctttcaaa agaatgggat atcaggtgaa tgtcgtccaa 240atcaactggg aggagaaaaa agaactggta gagagtggaa agatcgattg tatcatgggt 300tgtttttcta tggaaggacg tcttgacgat taccgctggg caggggcgta catagcaagc 360cgtcaggttg tagcggtaaa tgaggacagt gatatttata aattgagtga ccttgaggga 420aagaacctgg ctgtccagtc cacaactaaa ccggaagtta tatttctgaa ccggttggat 480aagagaatcc acaaactggg aaatctgatc agtcttggac accgcgagct gatatataca 540tttcttggga aaggatatgt agatgcagtt gccgcacatg aggaatcaat catccagtat 600atgaaggatt atgacataga cttccgtatc ctggaagaat cgctgatgat tacggggata 660ggtgttgctt tcgcaaaaga tgatgacaga ggaattgtga gcagatggac cagacccttg 720aagaaatgcg taaggatggc acgtctttga 75021696DNABurkholderiales bacterium 1_1_47isolated from gut, Burkholderiales bacterium 1_1_47 21atgattgctg aacaaatact tttttctcag gggtttggaa cccgccatga atgcacggga 60cttattcttc aaggccgatt tcatgtcaac ggcactgcag tgactgatcc cgatgaggat 120atcccaacgg aaaatctgac cttcgaggtt gacggggttg aatggccttt ttttgaaaaa 180gccatcattc tgctgaacaa acccgagcac tatgaatgct ctttgaagcc aattcatcat 240ccgagtgtgc tctctctgct gcctccgccc ctgcgtgtca gaaaagtcca gccggtgggc 300cgtctcgatg aagacaccac aggactgctt ttattaacgg atgacggaaa gctgattcat 
360cggctcacgc accctaagaa acatgtcacc aaaatctatc gggttgcact taagcatccg 420atcaccgaaa agcaaatcgc tcatctcctt aagggagtac agcttgcaga ttcgccggat 480atcgtcaaag ccgtcagctg cgaaaaagtc tccgaactcg tcattgatct cggcattacg 540caaggcaagt atcatcaggt caaacgcatg atggctgccg tatctaatcg agtcgtcgcg 600ctggaacgaa tccgtttcgg aaacctctcg ctgccggaag acttaaaacc gggagaatgg 660acctgggtca aatccgtcaa ggaaattacc ggatga 69622912DNAClostridium hathewayi DSM 13479isolated from gut, Clostridium hathewayi DSM 13479 22atgagcctgc gaaccatgat caaaggagga tttacaatga aaaaaatgat cgctgggtta 60ttgtgtggct gtatgatcgc ggcttcttta acgggctgtg gaaaagcgcc tgcttccgat 120ggcggtgcga cagaaaaggc tgccggtgcg gaggcagaaa aagcagcgga taaaaccgaa 180gcttcttccg attccggttc caaagtaatt aatgtctggt cgtttaccga cgaagtgcca 240aagatgattg aaaagtacaa agaaatgcat ccggattttg attatgagat taaaacaaca 300attattgcga ctactgatgg cgcgtaccag ccggcgctgg atcaggcact ggcatccggc 360ggcagtgatg cgccggatat ctactgtgcg gaagccgcat ttgtcctgaa atatacgcag 420ggtgacgcca gccgttatgc cgcgccatac gaagatctgg gaattgacgc ggatggtaag 480attaaatcct ctgagatcgc acagtatgcg gtcgatatcg gaacgaatcc tgacggtaaa 540gtggtggcgc tgggctacca ggcaaccggc ggagcgttta tctatcgccg ttccatcgcc 600aaggacacct ggggaaccga tgatccgaag gaaattggtg caaagcttgg cgcaggcacc 660aatgactgga cacaattctt taacgcggca gaagagctga agggcaaggg ctacggcatt 720gtatccggcg acggagatat ctggcacgca gtggaaaaca gctcggacaa aggctggatt 780gtggacggaa aattaaacat cgatccaaag agagaggcat ttctggattt atccaagaag 840ttaaaggaca acggctatca caatgacaca caggactggc aggatgcggg gtttgccgat 900atgaagggag aa 91223483DNAFusobacterium sp. 7_1isolated from gut, Fusobacterium sp. 
7_1 23atggaaaata taatgacaat atttcaacaa tatggtggag tagtatttgg agttttaggg 60gcagctcttg cagttttatt atctggtatt ggttcagcaa gaggagttgg aattgcaggg 120caggcagcag caggtttagt tattgatgaa cctgaaaagt ttggtaaagc tatggtactt 180caacttttac ctggaacaca aggactttat ggatttgtaa tagggctttt tattatgttt 240agacttacac ctgaaatgac aatagcagaa ggtttgtatt tgttaatggc aggacttcca 300gttggttttg ttggattaag atcagctcta tatcaagggc aagttgcagt agcaggtatt 360aacattctag caaaaaatga acctcatcaa acaaaaggaa taatacttgc agtaatggtt 420gaaacttatg caattttagc atttgctatg tctttcctat tactaaatca agtaaaattt 480taa 483241155DNAUnknownisolated from gut, unidentified 24actgacgctg tgaacatgct ttcggcactg ggtgtcatca acggttacga cgatggctct 60tacaagccgg acgcaactgt tactcgtgcg gagatggcaa agatgatctt tgttgtccgc 120aacaataaga ttgacgattc ggcttacaag aacaactcta ctaagctgac cgacgtcaac 180aagcactggg ctgcaggcta catcaagttc tgcgaatccc agggcatcat cgcaggcaag 240ggcaacaaca agtttgaccc ggatgcaacc gttaccggcg tagaagcagc taagatgctg 300ctcgtagttt ccggttacga tgctcagaag gctggtctga ccggttctgc atggcagact 360aacgtcctga agtacgctgg cgctgctggc attctggacg gcgttaactc cgctctggag 420tctggcctgc cgcgtcagta cgctgctcag atgatctaca acaccctcga cgttaaccgt 480gtaaagtggt ccgaagactc caagtccttc gacgacgttc tcaacggcgg cgttaaggag 540actgttggta aggcttacat gggcctgtgc tacgattacg gtactctgac cgaaatcgat 600accgattctc tgaccatcaa gctcgactct gactacgact ctgacaacta ccacaactct 660gaccgcaact acaagggtgg cgacaaggtt tccttcacca aggtaggcga ggactacacc 720gcactgctcg gccagaaggt taaggtaatg ttcaaggacg gcaagaccaa caatgttctg 780ggcgtatact ccatctctga caacaaggtt tacaccaccc ggatgaacaa ggttgagctg 840gacggtcaga agatcaagtt cggcggtact tcttactctg ttgataacac caagaagatc 900gacctgacct tcatcggcgt taacggcacc aagaacgaga ctgttggtat tgcttacttt 960gacaaggacg gcgcactgaa cgacgataag tccaacggtg taacttctct gtctgaggta 1020accttcgttg ataccgacgg caacaacaag atcgataccg ctctggttat cgagaaggtt 1080gctggtgaag ttaccaacgt tgcttccgac aagatcacct tcgctggtaa gacctataag 1140ttcgctgacg agcag 115525768DNABacteroides sp. 
2_1_56FAAisolated from gut, Bacteroides sp. 2_1_56FAA 25atgcgggtag gttccgtaca agaaacggag caagagaaag gctgtgccca tttcctcgaa 60cacgtaactt tcggcggtac ccgccatttt cctaaacgct ctttagtaga gtacctcgag 120tccttaggaa tgaagtacgg acaagatatc aacgctttca ccggtttcga ccgtacaatc 180tatatgttcg cagttcccac cgatcatgcc aaagacgaag ttctcgatcg ttcattacta 240atcctatgcg attggttgga cggtgtcact atagatccgg aaaaagtaga gaatgaaaaa 300ggaatcattc ttgaagaact acgcggattc gatccggaag acgatttcta tccgctcaaa 360atcggacaag gcatattcag tcaccgtatg cctttgggca caacagacga tatccgcaag 420gtcaccccgc aagtgctcaa aaattattat cgcaaatggt atgtaccctc tttggcaaca 480ttggtcattg taggcgacat atctcccttg gagatcgaat ctaaaatcaa agaacgtttc 540aaatccctgc ccggacgtcc ggtcaatgac ttccggacct acccgttaga gtacacccgg 600ggaatccatc tggcctccat acgagactcg ctgcaaaccc gtacaaaagt cgaattaatg 660attccacacc cttgcacagt agagcgcacc atggaagacg ctataacaaa gagaaaggac 720gcctgctcgt cagtgccatt tcttcacgat tccgtgcccg gaaactaa 768261218DNAFaecalibacterium prausnitzii A2-165isolated from gut, Faecalibacterium prausnitzii A2-165 26atggataaaa tcagtgcaag cgttctttta aaaggcctgg ccccggaggg ctttgccgat 60gaggcagcca tcgaccttgt gaccaccgac agccgcgagg tgcgccccgg gtgcatcttt 120gtggcgttcc ccggtgaaaa atttgacggc cacgatttcg cggccaaagc actggaagaa 180ggggccgaat atgtcgtcct caaccacccg gtgaaggatg tcccggcggg gaagtgtttc 240ctctgcccgg acagctaccg cgccatgatg atgatgggtg ccaactaccg ccgccagttt 300tcccccaagg tagtgggcgt gacgggcagc gtgggcaaga ccacgaccaa acagatgacc 360tacgccgcca ttgcgggctt tggcaatacc atcaagaccg agggcaacca gaacaacgag 420ctgggcctgc cccgcaccat gttccgcatc ggcaaagaga cggagtacgc ggtggtggag 480atgggcatga gccaccgggg cgagatcgag cggctgagcc gctgcgcccg cccggatgtg 540ggcatcatca cctgcatcgg cgtgtcgcac attggcaacc tgggcagcca ggagaatatc 600tgcaaggcca agctggagat ctgcgagggc ctgcccaacg gtgccccgct ggtgctcaac 660ggggatgacc cgttcctgcg ggccgcgaag ctgccggagc atgtccaccc ggtctggttc 720agcctggggg atgagaacgc ggacgtctgc gccctgaaca tccggcagga ggacgacggc 780atgaccttta cgctggaaga ccgggaggaa gacaccaccg aggtgcacat 
cccggccatg 840ggccgccaca atgtggccaa cgcgctggcg gcttacgccg ccgccacccg gctgggcctg 900aacgccaagc gggtcattgc ggggctgagc cagttccagc agaccgggat gcgccagaaa 960gtgatccaca gcaagggtgt ggatgtcatc gaggactgtt acaacgccaa ccccgacagc 1020atgaaagccg cgctggccat gttcaaagag tacccctgca agcgccgctt tgccctgctg 1080ggcgatatgc tggagctggg cgagatcagc ccggaagccc atgagacggt gggcaaacag 1140gctgcggaat acggcgtgga cttcctcgtg gcctatggcc cggaggcgaa acgcacggcc 1200caggctgccg cggctgcc 121827522DNAParvimonas micra ATCC 33270isolated from gut, Parvimonas micra ATCC 33270 27gaaatgccaa acaaaaaaat aaatgaagat cctgtagcac ctcaagttgc agcagaaatt 60atcagagaat acttaaaaac agaaggaaat gcaacacaaa atcttgcaac tttctgtcaa 120acttatatgg aaccaactgc aacagcattg atggcagaga attttgaaaa aaatgcaatt 180gataaagatg aatatgcaat gactgcagac cttgaaaaca gatgtgtcga tattattgga 240aacttatggc atatgaatcc aaaagaagaa cctataggaa catctactgt cggttcatca 300gaagcttgta tgctaggtgg actagctatg cttttcagat ggaaacatct tgcagataaa 360gcaggagtta atagattcac caaaaaaaga cctaatcttg taatttcttc aggatatcaa 420gtatgttggg aaaaattctg tcgctactgg gatattgaaa tgagaactgt tccactagat 480atggaacatt tatctctcaa tatggataca gtaatggatt at 52228729DNAUnknownisolated from gut, unidentified 28atgaagaaaa tgatgaaatt aaatatagcg atttgcgatg acaataaatt agtattggac 60aatgagaaaa cactgattga agaaacgttg aaagagatgg gaataccgta caagatggac 120aagtatcaaa atcctgaaaa tcttatcaaa aatgcatggc aatacgatat ggtgttttta 180gatgtggaaa tggatgaagt caacggaatt atggcggcgg aaagtattca caatatcaat 240aaggaatgtt tgctgttttt cgtaaccaac cacgaggttt atatggacta tgctatgaac 300gagtatgcat tcagattttg ggtaaagcct atgtcgaaag aaaaactgaa atttgggtta 360gaatcggcat tgaaacggtt ggagagcgat aacaaatgca tagaattcaa caccgacaga 420aatgttgtga atataccgat aaataaaatt atttttatat gtgccgagaa caaaaagacg 480actattgtta cggtagatga acaatttgta attgaccgtc cgtataaagt ggtcaaagat 540atgataaatt catatttctt ctatgagtca cacgcaagtt actatgtgaa cttaaattat 600gtaaaggcgt attcaccgtc gcacgtcaaa tgcggaatag gaaatcatga atatgaaatt 660catatgtcgc gaagaaaata tacggaattt aataaatatt 
ttatagattg gatgggtgaa 720caaaaatga 729292529DNAUnknownisolated from gut, unidentified 29atggaaaata taaagataca attcaaggga ataacccgta acactgacga tggaataagt 60gctgacggtg aatgcatgga gcttattaat gctcgcgtga acaattcaag tatagaaccg 120atcggtaaac cgataatgct aaagcagact gcacacacgt attccaagat ataccatcat 180tctatagcta aaaggtatat aggaataacc gagtccggtc agatgtacga aatgccggag 240gatctttcat cagaaactat aatgaccggt gatttgaagg caaaaagcat agaatttatc 300ggaaatacaa tatcggtaat aacagatgaa ggtataaggt atatcctttt caggaacggt 360tcatatattt atcttggtga aattcctgac gtacctgagt tcggaattga taaggaagtg 420aaagctgttt ccgttgaaat agatgaaata tcagataacg atgatgaagt aaggtatgga 480aacttcacta aagttcttag cgaagctaat aaaaacgggt gttactgcta ttctgcagcg 540ttttgcgcgg ctttcaggat gtttgacgga agttatatca agtcaactga aatacagatt 600atattccttg attctgatga ttcagtgact attacttatg gagacaggaa taaccctcag 660aatattgaat tgtccggagg ttattcaaat cagttttttg ctcaaacaaa ttcgaatgga 720gttatgcagg cacatatact ttgctttaag ccttcattct tttttgaaga atatgatctt 780tccgcatgga gcgatattat aataggaata gaaatatttt ccactgataa ttttaggaca 840agactgcaga aagattattt cggagtctat atatcacagt ttgagatgaa ctgcaagaaa 900ccgattgaaa gggcgaataa tatcagcctg atgtataata ttacatcgtt aaaacttggt 960gaaacaaaaa aatctgttga tattgacgtt tctatagata accttgcaac ccttccgcac 1020atggttgaca gttttaacac gcatcattca atattgccaa aatcgtctta ttcatataac 1080aacaggcttc atcttatcgg aataaagaga actctttcaa gtggtgtaag agtgtcttct 1140accgcaaagg aatatcaatt cctgatacac atatacattc atgcttcaga cggtgataaa 1200gttatagaga aatgggaaat aggtaaatac ataaggacat tcatcatgta ccctgacagc 1260agggcataca agatgatcat atatagatat gaatataatg ttccggttgt aggaatccag 1320attgatttga aaaaaagcga ttactttgat ttttcatttt attgtaagga atatgaatat 1380gaaagaggaa gcgttaaaca aaatacaggg ttctttgacg ttataaagat gagcgatttt 1440gaaagcatgg aagttggaga aacaacagac aacatggatt acgaaaaagg aaatgtaatg 1500tatgtttcaa acctgaacaa tccgtttttc tttcctgctg accaggttta tcagttcaat 1560actgatattg tcggagtaca gtcaaacgtc gtggccctat ctcaaggaca gttcggccag 1620ttccctcttt acgtattcac caaagacggt 
atatacgcca tgaatgtagg aagcggagaa 1680gtcgcatatt caaatcagac acctgttacg cgtgacgtgt gcaacaatcc ggattctata 1740tgcggacttg atactatggt cgcattttca accgaccgcg gtcttatggt aattaacgga 1800actgttacag agctaatctc ggaaaagata tacggattcc ttccttcatg ttccgtatct 1860tcacctataa tagttaagat attagatgta gcttctctgg gtgacgatat atcaagcgtt 1920gtgttccctg actatataga agaagcaaag ataggataca actatgaagc aaaggaaatt 1980gttgttgcaa acatgaattt tccttattcg tacgtttatt cattgaagac cggggaatgg 2040cataaaatat cacagaatat agattcattc gtcaactcct acccttacac gtgggctgta 2100agcggaaacc agatacttga ccttaacaac acccatagaa gcgtgtctac catagcactt 2160ataagcaggc ctatcaagat gggtactctt acacacaagc gaatacttca gacagcttta 2220aggggaatag taaaaagaag cctttccgac ctttacataa aaggtgagcc ggtaatgttc 2280agaggtgaca cggtagatat attttctgac gtaggaatgt atgtacttgc ttcaaatgat 2340gccgaacatt ttgagctggt tgcaaagaag gaaaagatgg tagatataag ggacctggtg 2400acgaagatga acaagagcaa gccttataaa tacttcatgg tatgtcttgt aggaggtgta 2460aggactgacg tttcaataaa ctacatagaa atgaatgtgg atgaaagctt tacgaacagg 2520cttagatag 252930147DNAUnknownisolated from gut, unidentified 30aggattgcag ccctctttcg agcggctttg cagggagggg tcttgccccc tccgcggggc

60
gtcagggaca ccgcccctta cagcaccgac gcgcccggca ggcgggcact gacgagaatt 120
ttcacgaaaa acatcggttt tccgtaa 147

SEQ ID NO: 31; length 543; DNA; Clostridium hathewayi DSM 13479; isolated from gut, Clostridium hathewayi DSM 13479
atgagcatgc aacaagatga agatatgttg ttacagtcga tttatgaaga gtatcaggga 60
acgctccgcc ggattgcgag agcgctgaat gttcccaaca tggaactgga agatgtagtt 120
caggaaacgt ttattgctta ttttaggaag tattcattaa catggtcgcc aacgcttaag 180
aaggcgatgc tggtgaaaat cttaaaggga aaagcaattg actgtctcag aaagaatgga 240
cattatgaaa aggtcagtct tgatgaggag aattcaataa gatgtattga gatgctgacc 300
acctatgtgg taacagatcc cattgatatt attatcagtg aggaatcgat acagaggatt 360
actacggaaa tagccaatat gaggcaggaa tggaaagaga tggctgtttt gtattttctc 420
gagcagagaa ccattccgga gatttgtgaa atgctggaga taccgggaac ggtttgccgc 480
tcccggattt acaggacaag aatgtgtctg aaaaagattc tcggaccgaa atacgatatt 540
taa 543

SEQ ID NO: 32; length 29; DNA; Unknown; primer
aagaatggag agagttgtta gagaaagaa 29

SEQ ID NO: 33; length 26; DNA; Unknown; primer
ttgtgataat tgtgaagaac cgaaga 26

SEQ ID NO: 34; length 30; DNA; Unknown; primer
aactcaagat ccagaccttg ctacgcctca 30

SEQ ID NO: 35; length 24; DNA; Unknown; primer
ttgtaagtgc tggtaaaggg attg 24

SEQ ID NO: 36; length 26; DNA; Unknown; primer
cattcctaca taacggtcaa gaggta 26

SEQ ID NO: 37; length 30; DNA; Unknown; primer
agcttctatt ggttcttctc gtccagtggc 30

SEQ ID NO: 38; length 20; DNA; Unknown; primer
aatgggaatg gagcggattc 20

SEQ ID NO: 39; length 21; DNA; Unknown; primer
cctgcaccag cttatcgtca a 21

SEQ ID NO: 40; length 26; DNA; Unknown; primer
aagcctgcgg aaccacagtt accagc 26

* * * * *
