Tree-based models excel with medium-sized tabular data

In our recent customer use case, “Achieve accurate lot cycle time predictions for more on-time deliveries,” we determined that the most appropriate ML model for the customer’s needs was a gradient boosted tree-based model, specifically the LightGBM (Light Gradient Boosting Machine) implementation. This decision is supported by recent research by Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux at Inria Saclay Centre and Sorbonne University, whose work concluded that tree-based models outperform deep learning on medium-sized tabular data.

Noting that deep learning has “enabled tremendous progress on text and image datasets,”¹ the researchers stated that it had not been shown to be superior on tabular data. To compare the models’ performance, they collected 45 tabular datasets, each comprising more than 3,000 real-world examples. They then trained standard and novel deep learning methods, including a vanilla neural network, ResNet, and two Transformer-based models, as well as tree-based models including XGBoost, gradient boosting machines, and Random Forests, among others. Each model was trained 400 times, searching randomly through a predefined hyperparameter space.
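The random search described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers’ benchmark code: the search space, its ranges, and the `toy_score` function are all hypothetical stand-ins, and a real run would train and validate a model inside the scoring function.

```python
import random

# Hypothetical search space for a gradient boosted tree model. The names and
# ranges are illustrative assumptions, not the settings used in the paper.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform draw
    "max_depth": lambda rng: rng.randint(2, 10),
    "n_estimators": lambda rng: rng.choice([100, 250, 500, 1000]),
}


def sample_config(rng):
    """Draw one value per hyperparameter, independently of the others."""
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}


def random_search(evaluate, n_iter=400, seed=0):
    """Score n_iter random configurations and return the best (score, config)."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_iter):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


def toy_score(config):
    """Stand-in for a validation score (higher is better); purely made up."""
    return -abs(config["learning_rate"] - 0.1) - abs(config["max_depth"] - 6) / 10


best_score, best_config = random_search(toy_score)
print(best_score, best_config)
```

Because every configuration is drawn independently, the same budget of iterations can be applied to each model family, which is what makes the comparison between them even-handed.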

When averaged across all tasks, the best tree-based models performed 20 to 30 percent better than the best deep learning models. The authors also found neural networks to be much more susceptible than tree-based models to random or uninformative features. When the authors removed uninformative features, the performance gap between the two model families narrowed. When they added random features to the datasets, the performance of the neural networks declined sharply.
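To make the random-feature experiment concrete, the sketch below shows one way to append uninformative columns to a dataset. It is an assumption-laden illustration: the helper name `add_random_features`, the Gaussian noise, and the tiny dataset are all made up here, and the model training and scoring that would follow are left out.

```python
import random


def add_random_features(rows, n_extra, seed=0):
    """Return copies of the rows with n_extra Gaussian-noise columns appended.

    Hypothetical helper: the noise columns carry no information about the
    target, so a robust model should score about the same with or without them.
    """
    rng = random.Random(seed)
    return [list(row) + [rng.gauss(0.0, 1.0) for _ in range(n_extra)] for row in rows]


# Tiny made-up dataset with two informative features per row.
rows = [[0.2, 1.5], [0.9, 0.3], [0.4, 2.2]]
augmented = add_random_features(rows, n_extra=3)
print(len(augmented[0]))  # 5 columns: 2 original + 3 noise
```

Training the same model on `rows` and on `augmented` and comparing validation scores would then show how sensitive that model is to uninformative features.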

The authors concluded, “Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed.”

REFERENCE

1. Grinsztajn, L., Oyallon, E., Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? NeurIPS 2022 Datasets and Benchmarks Track, Nov 2022, New Orleans, United States. hal-03723551v2
https://hal.archives-ouvertes.fr/hal-03723551v2

Advantages of decision-tree models

by SmartFactory Automation Solution Experts Team
