A rough-granular approach to the imbalanced data classification problem
Katarzyna Borowska, Jarosław Stepaniuk
Abstract
More than two decades ago the imbalanced data problem turned out to be one of the most important and challenging problems. Indeed, missing information about the minority class leads to a significant degradation in classifier performance. Moreover, comprehensive research has proved that there are certain factors increasing the problem’s complexity. These additional difficulties are closely related to the data distribution over decision classes. In spite of numerous methods which have been proposed, the flexibility of existing solutions needs further improvement. Therefore, we offer a novel rough–granular computing approach (RGA, in short) to address the mentioned issues. New synthetic examples are generated only in specific regions of feature space. This selective oversampling approach is applied to reduce the number of misclassified minority class examples. A strategy relevant for a given problem is obtained by formation of information granules and an analysis of their degrees of inclusion in the minority class. Potential inconsistencies are eliminated by applying an editing phase based on a similarity relation. The most significant algorithm parameters are tuned in an iterative process. The set of evaluated parameters includes the number of nearest neighbours, complexity threshold, distance threshold and cardinality redundancy. Each data model is built by exploiting different parameters’ values. The results obtained by the experimental study on different datasets from the UCI repository are presented. They prove that the proposed method of inducing the neighbourhoods of examples is crucial in the proper creation of synthetic positive instances. The proposed algorithm outperforms related methods in most of the tested datasets. The set of valid parameters for the Rough–Granular Approach (RGA) technique is established.
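The idea of analysing how far an information granule is included in the minority class can be illustrated with a minimal sketch. This is not the RGA algorithm from the paper. It assumes, purely for illustration, that a granule is the set of k nearest neighbours of an example under Euclidean distance, and that the degree of inclusion is the fraction of granule members labelled as minority. All function names (`knn_granule`, `inclusion_degree`, `minority_inclusion_degrees`) and the toy data are hypothetical.

```python
import math


def knn_granule(index, points, k):
    """Return indices of the k nearest neighbours of points[index].

    Assumption: the granule is a plain k-NN neighbourhood (Euclidean distance).
    """
    dists = [
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    ]
    dists.sort()  # ties are broken by the smaller index
    return [j for _, j in dists[:k]]


def inclusion_degree(granule, labels, minority_label):
    """Fraction of granule members that belong to the minority class."""
    if not granule:
        return 0.0
    hits = sum(1 for j in granule if labels[j] == minority_label)
    return hits / len(granule)


def minority_inclusion_degrees(points, labels, minority_label, k):
    """Map each minority example's index to the inclusion degree of its granule."""
    return {
        i: inclusion_degree(knn_granule(i, points, k), labels, minority_label)
        for i, label in enumerate(labels)
        if label == minority_label
    }


# Toy data (hypothetical): class 1 is the minority, class 0 the majority.
points = [(0, 0), (0, 1), (1, 0), (5.5, 5.5),          # minority
          (5, 5), (5, 6), (6, 5), (6, 6), (4, 5)]      # majority
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

degrees = minority_inclusion_degrees(points, labels, minority_label=1, k=2)
```

In this toy setting the three clustered minority points have granules made up entirely of minority examples, while the point at (5.5, 5.5) is surrounded by majority examples. A selective oversampling strategy could use such degrees to decide in which regions synthetic examples are generated; how RGA actually does this is specified in the paper, not here.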
|Journal series||Applied Soft Computing, [Applied Soft Computing Journal], ISSN 1568-4946, e-ISSN 1872-9681, (N/A 200 pts)|
|Publication size in sheets||5280.35|
|Keywords in English||Data preprocessing, Class imbalance, Granular computing, Information granules, Rough sets, SMOTE|
|Internal identifier||ROC 19-20|
|Score||= 200.0, 04-03-2020, ArticleFromJournal|
|Publication indicators||: 2018 = 2.369; : 2018 = 4.873 (2) - 2018=4.858 (5)|