A new way to train decision trees: tree alternating optimization (TAO)

Miguel Á. Carreira-Perpiñán

Friday, January 18, 2019

10:45am Meeting room 302 (Mountain View), level 3

Miguel Á. Carreira-Perpiñán, Professor, University of California at Merced, USA

A new way to train decision trees: tree alternating optimization (TAO)

Abstract:

Decision trees with hard decision nodes stand apart from other machine learning models in their interpretability, fast inference and automatic feature selection – in addition to their ability to handle naturally nonlinear data, multiclass classification and regression, discontinuous predictor functions, and discrete input features. This has made decision trees widespread in certain practical applications for decades, in spite of the fact that their predictive accuracy is generally considered inferior to that of other models.

In fact, this perceived lower accuracy is greatly due to the lack of a good learning algorithm. Learning a decision tree from data is a difficult optimization problem because of the search over discrete tree structures and the lack of gradient information over the node parameters. The most widespread tree learning algorithms in practice (CART, C4.5 and variations of them) date back to the 1980s and are based on greedily growing the tree structure and setting the node parameters via a heuristic criterion.

We present a new algorithm, tree alternating optimization (TAO), that makes it practical for the first time to learn decision trees with a guarantee that the classification error is systematically reduced. Starting with a given tree structure and node parameters, TAO iteratively updates the parameters so that the resulting tree has the same or a smaller structure and provably reduces or leaves unchanged the classification error. TAO can be applied to both axis-aligned and oblique trees and our experiments show it consistently outperforms various other algorithms while being highly scalable to large datasets and trees.

Further, TAO can handle a sparsity penalty, so it can learn “sparse oblique trees”, having few nonzero parameters at each node. This combines the best of axis-aligned and oblique trees: flexibility to model correlated data, low generalization error, fast inference, smaller trees and interpretable nodes that involve only a few features in their decision. This makes decision trees attractive even in tasks where they were traditionally unsuitable, such as MNIST handwritten digit classification.

This is joint work with my PhD students Pooya Tavallali and Arman Zharmagambetov.