Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach
Marcin Czajkowski, Marek Krętowski
Abstract
The problem of underfitting and overfitting in machine learning is often associated with a bias-variance trade-off. Underfitting manifests most clearly in tree-based inducers when they are used to classify gene expression data. To improve the generalization ability of decision trees, we introduce an evolutionary, multi-test tree approach tailored to this specific application domain. The general idea is to apply, in each splitting rule, gene clusters of varying size that consist of functionally related genes. This is achieved by using a few simple tests that mimic each other’s predictions and built-in information about the discriminatory power of genes. The tendencies to underfit and overfit are limited by the multi-objective fitness function that minimizes tree error, split divergence and attribute costs. Evolutionary search for multi-tests in internal nodes, as well as the overall tree structure, is performed simultaneously. This novel approach, called Evolutionary Multi-Test Tree (EMTTree), may bring far-reaching benefits to the domain of molecular biology, including biomarker discovery, finding new gene-gene interactions and high-quality prediction. Extensive experiments carried out on 35 publicly available gene expression datasets show that we managed to significantly improve the accuracy and stability of decision trees. Importantly, EMTTree does not substantially increase the overall complexity of the tree, so that the patterns in the predictive structures are kept comprehensible.
|Journal series||Expert Systems with Applications, ISSN 0957-4174, e-ISSN 1873-6793, (N/A 140 pts)|
|Publication size in sheets||0.6|
|Keywords in English||Data mining, Evolutionary algorithms, Decision trees, Underfitting, Gene expression data|
|Internal identifier||ROC 19-20|
|Score||= 140.0, 04-03-2020, ArticleFromJournal|
|Publication indicators||: 2018 = 2.696; : 2018 = 4.292 (2) - 2018=4.577 (5)|
|Citation count*||1 (2020-04-03)|
* The presented citation count is obtained through Internet information analysis and is close to the number calculated by the Publish or Perish system.