References of "Botta, Vincent"
Peer Reviewed
Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies
Botta, Vincent ULg; Louppe, Gilles ULg; Geurts, Pierre ULg et al

in PLoS ONE (2014)

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification.

A walk into random forests: adaptation and application to Genome-Wide Association Studies
Botta, Vincent ULg

Doctoral thesis (2013)

Understanding the underlying mechanisms of common diseases is one of the major goals of current research in medicine. As most of these disorders are linked to genetic factors, identification of the associated variants forms an excellent strategy towards the elucidation of molecular and cellular dysfunctions, and in fine could lead to better personalised diagnostics and treatments. Genome-Wide Association Studies (GWAS) aim to discover variants spread over the genome that could lead, in isolation or in combination, to a particular trait or an unfortunate phenotype such as a disease. The basic idea behind these studies is to statistically analyse the genetic differences between groups of healthy (controls) and diseased (cases) individuals. Advances in genetic marker technology indeed allow for dense genotyping of hundreds of thousands of Single Nucleotide Polymorphisms (SNPs) per individual. This makes it possible to characterise representative samples composed of several hundred to several thousand cases and controls, each one characterised by up to a million genetic markers sampling the genomic variations among these individuals. The standard approach to genome-wide association studies is based on univariate hypothesis tests. In this approach each genetic marker is analysed in isolation from the others, in order to assess its potential association with the studied phenotype, in practice by the computation of so-called p-values based on some statistical assumptions about the data-generation mechanism. Because of the very high ratio between the large number of SNPs genotyped and the limited number of individuals, multiple-testing corrections need to be applied when carrying out these analyses, leading to reduced statistical power. While this standard approach has been at the basis of many novel loci unravelled in recent years for several complex diseases, it has several intrinsic limitations.
A first limitation is that this approach does not directly account for correlations among the explanatory variables. A second intrinsic limitation of GWAS is that they cannot account for genetic interactions, i.e. causal effects that are only observed when specific combinations of mutations and/or non-mutations are present at the same time. The third limitation of univariate approaches is that they do not directly allow the genetic risk to be assessed, since many of the identified markers (with similarly small p-values) actually account for the same underlying causal factor: exploiting their information to predict the genetic risk is hence far from straightforward. Within bioinformatics, machine learning has actually become one of the major potential sources of progress. As a matter of fact, biology has nowadays become one of the main drivers of research in machine learning, and is by itself already a very competitive research field. Among the subfields of machine learning, supervised learning and its extensions, such as semi-supervised learning, stand out as the most mature and at the same time most rapidly evolving area of research. Within this context, the purpose of this thesis was to study the application of random forest types of methods to genome-wide association studies, with the twofold goal of (i) inferring predictive models able to assess disease risk and (ii) identifying causal mutations explaining the phenotype. The choice of this family of methods was originally motivated by the fact that these methods are a priori well suited for that kind of analysis due to some of their interesting properties. They are indeed able to deal efficiently with very large amounts of data without relying on strong assumptions about the underlying mechanisms linking genetic and environmental factors to phenotypes, and they can also provide interpretable information, in the form of scorings and/or rankings of SNPs, so as to help in the identification of causal genetic loci.
In the first part of this manuscript, we analyse the state of the art in the application field of genome-wide association studies and in supervised machine learning, and subsequently describe in detail the three tree-based ensemble methods that we have implemented and applied in our research; in Part II, we report our empirical investigations, in three successive steps, namely (i) a preliminary study on simulated datasets yielding controlled conditions with known ground-truth and allowing for a first sanity check of the T-Trees methods under ideal conditions; (ii) a detailed study on a given real-life dataset concerning Crohn's disease, where we try to understand the main features of the three different algorithms in terms of predictive accuracy and capability of identification of relevant genetic information, and their sensitivity with respect to various kinds of quality control procedures and algorithmic parameters; (iii) a systematic replication study, where we confirm, on 7 different datasets from the Wellcome Trust Case Control Consortium, the main outcomes of our study on Crohn's disease, while using default parameter settings.

Peer Reviewed
Raw genotypes vs haplotype blocks for genome wide association studies by random forests
Botta, Vincent ULg; Hansoul, Sarah ULg; Geurts, Pierre ULg et al

in Proc. of MLSB 2008, second workshop on Machine Learning in Systems Biology (2008, September)

We consider two different representations of the input data for genome-wide association studies using random forests, namely raw genotypes, described by a few thousand to a few hundred thousand discrete variables, each one describing a single nucleotide polymorphism, and haplotype block contents, represented by the combinations of about 10 to 100 adjacent and correlated genotypes. We adapt random forests to exploit haplotype blocks, and compare this with the use of raw genotypes, in terms of predictive power and localization of causal mutations, by using simulated datasets with one or two interacting effects.

Prediction of genetic risk of complex diseases by supervised learning
Botta, Vincent ULg; Geurts, Pierre ULg; Hansoul, Sarah et al

Scientific conference (2008, May)

Peer Reviewed
A novel formulation of inhaled doxycycline reduces allergen-induced inflammation, hyperresponsiveness and remodeling by matrix metalloproteinases and cytokines modulation in a mouse model of asthma
Guéders, Maud ULg; Bertholet, P.; Perin, Fabienne ULg et al

in Biochemical Pharmacology (2008), 75(2), 514-26

Background: In this study, we assess the effectiveness of inhaled doxycycline, a tetracycline antibiotic displaying matrix metalloproteinase (MMP) inhibitory effects, in preventing allergen-induced inflammation, hyperresponsiveness and remodeling. MMPs play key roles in the complex cascade of events leading to the asthmatic phenotype. Methods: Doxycycline was administered by aerosol by means of a novel formulation as a complex with hydroxypropyl-gamma-cyclodextrin (HP-gamma-CD) used as an excipient. BALB/c mice (n = 16–24 in each group) were sensitized and exposed to aerosolized ovalbumin (OVA) from day 21 to 27 (short-term exposure protocol) or 5 days/odd weeks from day 22 to 96 (long-term exposure protocol). Results: In the short-term exposure model, inhaled doxycycline decreased allergen-induced eosinophilic inflammation in bronchoalveolar lavage (BAL) and in peribronchial areas, as well as airway hyperresponsiveness. In lung tissue, exposure to doxycycline via the inhaled route induced a fourfold increase in IL-10 levels and a twofold decrease in IL-5 and IL-13 levels, and diminished MMP-related proteolysis and the proportion of activated MMP-9 as compared to placebo. In the long-term exposure model, inhaled doxycycline significantly decreased the extent of glandular hyperplasia, airway wall thickening, smooth muscle hyperplasia and subepithelial collagen deposition, which are well-recognized features of airway remodeling. Conclusion: Doxycycline administered by aerosol decreases allergen-induced airway inflammation and hyperresponsiveness and inhibits the development of bronchial remodeling in a mouse model of asthma by modulation of cytokine production and MMP activity.
