Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach

Marcin Czajkowski, Marek Krętowski

Abstract

The problem of underfitting and overfitting in machine learning is often associated with a bias-variance trade-off. Underfitting manifests most clearly in tree-based inducers when they are used to classify gene expression data. To improve the generalization ability of decision trees, we introduce an evolutionary, multi-test tree approach tailored to this specific application domain. The general idea is to apply, in each splitting rule, gene clusters of varying size that consist of functionally related genes. This is achieved by using a few simple tests that mimic each other’s predictions, together with built-in information about the discriminatory power of genes. The tendencies to underfit and overfit are limited by a multi-objective fitness function that minimizes tree error, split divergence and attribute costs. The evolutionary search for multi-tests in internal nodes and for the overall tree structure is performed simultaneously. This novel approach, called Evolutionary Multi-Test Tree (EMTTree), may bring far-reaching benefits to the domain of molecular biology, including biomarker discovery, finding new gene-gene interactions and high-quality prediction. Extensive experiments carried out on 35 publicly available gene expression datasets show that we managed to significantly improve the accuracy and stability of decision trees. Importantly, EMTTree does not substantially increase the overall complexity of the tree, so the patterns in the predictive structures remain comprehensible.
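To make the multi-test idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a multi-test routes a sample by majority vote over a few univariate threshold tests, that split divergence can be approximated by the share of tests disagreeing with that vote, and that the three objectives are combined as a weighted sum. The names `Test`, `multi_test_goes_left`, `split_divergence` and `fitness`, and the weights `w_div` and `w_cost`, are hypothetical; the abstract does not specify any of them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Test:
    """A simple univariate test on one gene (hypothetical structure)."""
    gene: int         # index of the gene (attribute) being tested
    threshold: float  # expression level used as the cut point

    def goes_left(self, sample: Sequence[float]) -> bool:
        return sample[self.gene] <= self.threshold


def multi_test_goes_left(tests: Sequence[Test], sample: Sequence[float]) -> bool:
    """Route a sample by majority vote of the tests (assumed voting rule).

    With an even number of tests, a tie sends the sample to the right branch.
    """
    votes = sum(t.goes_left(sample) for t in tests)
    return 2 * votes > len(tests)


def split_divergence(tests: Sequence[Test], samples: Sequence[Sequence[float]]) -> float:
    """Mean share of tests that disagree with the majority vote (assumed measure)."""
    total = 0.0
    for sample in samples:
        majority = multi_test_goes_left(tests, sample)
        disagreeing = sum(t.goes_left(sample) != majority for t in tests)
        total += disagreeing / len(tests)
    return total / len(samples)


def fitness(error: float, divergence: float, cost: float,
            w_div: float = 0.1, w_cost: float = 0.1) -> float:
    """Weighted-sum fitness to be minimized (assumed form and weights)."""
    return error + w_div * divergence + w_cost * cost


# Usage: three tests on three genes, two samples.
tests = [Test(0, 1.0), Test(1, 2.0), Test(2, 0.5)]
samples = [[0.5, 3.0, 0.2], [2.0, 3.0, 0.2]]
routes = [multi_test_goes_left(tests, s) for s in samples]
divergence = split_divergence(tests, samples)
score = fitness(error=0.2, divergence=divergence, cost=0.5)
```

In this sketch, tests that "mimic each other's predictions" would yield a divergence near zero, so the divergence term penalizes multi-tests whose components disagree. An evolutionary search would then compare candidate trees by `fitness`, preferring lower values.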
Authors: Marcin Czajkowski (FCS / SD), Software Department; Marek Krętowski (FCS / SD), Software Department
Journal series: Expert Systems with Applications, ISSN 0957-4174, e-ISSN 1873-6793 (N/A 140 pts)
Issue year: 2019
Vol: 137
Pages: 392-404
Publication size in sheets: 0.6
Keywords in English: Data mining, Evolutionary algorithms, Decision trees, Underfitting, Gene expression data
ASJC Classification: 1702 Artificial Intelligence; 1706 Computer Science Applications; 2200 General Engineering
DOI: 10.1016/j.eswa.2019.07.019
Internal identifier: ROC 19-20
Language: en (English)
Score (nominal): 140
Score source: journalList
Score: Ministerial score = 140.0, 04-03-2020, ArticleFromJournal
Publication indicators: Scopus SNIP (Source Normalised Impact per Paper): 2018 = 2.696; WoS Impact Factor: 2018 = 4.292 (2) - 2018 = 4.577 (5)
Citation count*: 1 (2020-04-03)
* The presented citation count is obtained through analysis of Internet information and is close to the number calculated by the Publish or Perish system.