Reference : Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies
Scientific journals : Article
Engineering, computing & technology : Multidisciplinary, general & others
http://hdl.handle.net/2268/165977
English
Botta, Vincent [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation]
Louppe, Gilles [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation]
Geurts, Pierre [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique]
Wehenkel, Louis [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation]
2-Apr-2014
PLoS ONE
Public Library of Science
Yes (verified by ORBi)
International
1932-6203
San Francisco
CA
[en] machine learning ; data mining ; random forest ; SNP ; genome-wide association studies ; genetics ; linkage disequilibrium ; correlation ; decision trees
[en] The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification.
10.1371/journal.pone.0093379

File(s) associated to this reference

Fulltext file(s):

File : journal.pone.0093379.pdf
Version : Publisher postprint
Size : 630.93 kB
Access : Open access

