One of the long-standing open challenges in computational systems biology is inferring the topology of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into a separate regression problem for each gene in the network, in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene, and a high feature importance is considered putative evidence of a regulatory link between the two genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As a second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that it outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available.
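The three ideas in the abstract (per-gene regression decomposition, subsampling to turn a feature ranker into an ensemble importance measure, and rankwise averaging across methods) can be illustrated with a minimal sketch. It is not the published NIMEFI implementation. The base rankers here are simple stand-ins (absolute Pearson and Spearman correlation) rather than the support vector regression, elastic net, random forest or symbolic regression methods named above. Aggregating by mean rank over subsamples is an assumption, and all function names, parameters and the toy data are hypothetical.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def to_ranks(values):
    """Replace each value by its position in sorted order."""
    ranks = [0] * len(values)
    for r, i in enumerate(sorted(range(len(values)), key=lambda i: values[i])):
        ranks[i] = r
    return ranks

# Two stand-in base rankers: each returns one score per predictor column.
def pearson_scores(X, y):
    return [abs(pearson(col, y)) for col in zip(*X)]

def spearman_scores(X, y):
    ry = to_ranks(y)
    return [abs(pearson(to_ranks(col), ry)) for col in zip(*X)]

def ensemble_scores(X, y, base_scorer, rounds=30, frac=0.8, seed=0):
    """Subsample the samples, rank predictors each round, return the negated mean rank."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mean_rank = [0.0] * p
    for _ in range(rounds):
        idx = rng.sample(range(n), max(3, int(frac * n)))
        scores = base_scorer([X[i] for i in idx], [y[i] for i in idx])
        order = sorted(range(p), key=lambda j: -scores[j])
        for r, j in enumerate(order, start=1):
            mean_rank[j] += r / rounds
    return [-m for m in mean_rank]  # higher value = more important

def infer_network(expr, base_scorer):
    """Regression decomposition: one importance problem per target gene."""
    n_genes = len(expr[0])
    edges = {}
    for target in range(n_genes):
        regs = [g for g in range(n_genes) if g != target]
        X = [[row[g] for g in regs] for row in expr]
        y = [row[target] for row in expr]
        for g, s in zip(regs, ensemble_scores(X, y, base_scorer)):
            edges[(g, target)] = s  # key is (candidate regulator, target)
    return edges

def rankwise_average(edge_scores_per_method):
    """Rank the edges within each method, then order edges by their average rank."""
    edges = list(edge_scores_per_method[0])
    avg = {e: 0.0 for e in edges}
    for scores in edge_scores_per_method:
        for r, e in enumerate(sorted(edges, key=lambda e: -scores[e]), start=1):
            avg[e] += r / len(edge_scores_per_method)
    return sorted(edges, key=lambda e: avg[e])  # best edge first; ties keep input order

# Toy data (hypothetical): gene 1 is driven by gene 0, genes 2 and 3 are noise.
rng = random.Random(1)
expr = []
for _ in range(40):
    g0, g2, g3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    expr.append([g0, 2 * g0 + 0.1 * rng.gauss(0, 1), g2, g3])

per_method = [infer_network(expr, s) for s in (pearson_scores, spearman_scores)]
ranking = rankwise_average(per_method)
```

Because both stand-in rankers are symmetric correlation measures, the sketch cannot orient an edge: it scores 0 → 1 and 1 → 0 alike. Mean ranks are also computed per target gene, so scores from different targets are only loosely comparable. A real pipeline would plug regression-based importance measures into `ensemble_scores` in place of the correlation scorers.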
Disciplines:
Computer science
Author, co-author:
Ruyssinck, Joeri; Ghent University > Department of Information Technology > iMinds
Huynh-Thu, Vân Anh; Université de Liège - ULiège > GIGA-Management : Coordination ALMA-GRID
Geurts, Pierre; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique
Dhaene, Tom; Ghent University > Department of Information Technology > iMinds
Demeester, Piet; Ghent University > Department of Information Technology > iMinds
Saeys, Yvan; VIB Inflammation Research > Department of Respiratory Medicine > Laboratory of Immunoregulation