Random forests have been widely used for their ability to provide so-called importance measures, which give insight, at a global (per-dataset) level, into the relevance of input variables for predicting a given output. More recently, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per-instance) level. In this context, we first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under certain conditions. We then derive a local MDI measure of variable relevance, which has a natural connection with the global MDI measure and can be related to a new notion of local feature relevance. We further link local MDI importances with Shapley values and discuss them in the light of related measures from the literature. The measures are illustrated through experiments on several classification and regression problems.
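The following is a minimal sketch (not the paper's code) of the kind of correspondence the abstract describes, under the assumptions studied in earlier MDI work (Louppe et al., 2013): for totally randomized, fully developed trees on discrete inputs, the global MDI of a feature approximates the Shapley value of the cooperative game v(S) = I(X_S; Y). It uses scikit-learn's ExtraTreesClassifier with max_features=1 as a stand-in for totally randomized trees and compares its normalized MDI scores against brute-force Shapley values of the empirical mutual-information game; the toy dataset, the choice of game, and the estimator settings are illustrative assumptions, not the paper's exact construction.

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def entropy(labels):
    """Plug-in Shannon entropy of a discrete sample."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def mutual_information(X, y, subset):
    """Empirical I(X_S; Y) = H(Y) - H(Y | X_S) for a feature subset S."""
    if not subset:
        return 0.0
    # Encode each joint configuration of X_S as one discrete symbol.
    codes = np.unique(X[:, subset], axis=0, return_inverse=True)[1].ravel()
    h_cond = sum((codes == c).mean() * entropy(y[codes == c])
                 for c in np.unique(codes))
    return entropy(y) - h_cond


def shapley_mi(X, y):
    """Brute-force Shapley values of the game v(S) = I(X_S; Y)."""
    p = X.shape[1]
    phi = np.zeros(p)
    for j in range(p):
        others = [i for i in range(p) if i != j]
        for k in range(p):
            for S in combinations(others, k):
                weight = 1.0 / (comb(p, k) * (p - k))  # = k!(p-k-1)!/p!
                phi[j] += weight * (mutual_information(X, y, list(S) + [j])
                                    - mutual_information(X, y, list(S)))
    return phi


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 3))  # three binary inputs
y = (np.logical_xor(X[:, 0], X[:, 1]) | (X[:, 2] == 1)).astype(int)

# max_features=1 draws a single random split variable per node, and the
# trees are grown until leaves are pure: a stand-in for totally randomized,
# fully developed trees.
forest = ExtraTreesClassifier(n_estimators=500, max_features=1,
                              criterion="entropy", random_state=0).fit(X, y)

phi = shapley_mi(X, y)
# feature_importances_ is normalized to sum to one; the Shapley values sum
# to I(X; Y) (efficiency), so compare the normalized vectors.
print("global MDI        :", np.round(forest.feature_importances_, 3))
print("Shapley (rescaled):", np.round(phi / phi.sum(), 3))
```

Note that features 0 and 1 interact only through the XOR, so each has zero marginal mutual information with y; both the Shapley averaging over conditioning subsets and the MDI averaging over random trees nevertheless credit them, which is the kind of correspondence the abstract refers to.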
Disciplines :
Computer science
Mathematics
Author, co-author :
Sutera, Antonio ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Stochastic Methods
Louppe, Gilles ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Huynh-Thu, Vân Anh ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Wehenkel, Louis ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Geurts, Pierre ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Language :
English
Title :
From global to local MDI variable importances for random forests and when they are Shapley values