Logistic regression methods for classification of imbalanced data sets
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2012 |
Subjects: | |
Online Access: | http://umpir.ump.edu.my/id/eprint/3649/ http://umpir.ump.edu.my/id/eprint/3649/1/CD6314_SANTI_PUTERI_RAHAYU%28DR%29.pdf |
Summary: | Classification of imbalanced data sets is an important research topic in the data mining community, since the data sets in many real-world problems have imbalanced class distributions. This thesis aims to develop simple and effective imbalanced classification algorithms by first improving the algorithmic performance of two general classifiers, Kernel Logistic Regression Newton-Raphson (KLR-NR) and Regularized Logistic Regression Newton-Raphson (RLR-NR), both of which are Logistic Regression (LR)-based methods. Both methods have a strong statistical foundation and are well-known classifiers that require only the simple solution of an unconstrained optimization problem, yet perform as well as the Support Vector Machine (SVM), which is regarded as a state-of-the-art classifier in kernel methods and the data mining community. However, imbalanced LR-based methods have not been developed as extensively as imbalanced SVM-based methods. Hence, effective imbalanced LR-based methods are needed for wide use in data mining applications. Numerical results show that applying the Truncated Newton method to KLR-NR and RLR-NR, which yields Newton Truncated Regularized KLR (NTR-KLR) and NTR RLR (NTR-LR) respectively, is effective in handling the numerical problems posed by the huge matrix of the linear system in the Newton-Raphson update rule, i.e. the training time and the singularity problem. Because these algorithms use equivalent iterative methods, the results can also be seen as further explanation of the success of the Truncated Newton method in the TR-KLR and TR Iteratively Re-weighted Least Squares (TR-IRLS) algorithms respectively. Moreover, using only the simple solution of an unconstrained optimization problem, the proposed NTR-KLR and NTR-LR each achieve classification performance comparable to RBFSVM (SVM with a Radial Basis Function kernel).
The limitation of both proposed general classification algorithms on imbalanced data, namely their limited accuracy in classifying the minority class, motivated this research to improve their classification performance on imbalanced data sets. In general, numerical results show that adapting Modified AdaBoost methods to NTR-KLR and NTR-LR, which yields AdaBoost NTR Weighted KLR (AB-WKLR) and AB NTR Weighted RLR (AB-WLR) respectively, significantly improves the accuracy and stability of the general classifiers NTR-KLR and NTR-LR. The improvements in both the g-means error and the standard deviation of g-means under 5-Fold SCV reached more than 60. Furthermore, numerical results demonstrate that the proposed AB-WKLR and AB-WLR each perform comparably to AdaBoostSVM in classifying imbalanced data sets, using only the simple solution of an unconstrained weighted optimization problem. Thus, both proposed imbalanced LR-based methods are simple and effective for classification of imbalanced data sets and show promising results. |
---|
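
The summary describes replacing the exact solve in the Newton-Raphson update with a truncated (approximate) one, and evaluating imbalanced classifiers by g-means. The sketch below illustrates the general idea for L2-regularized logistic regression; it is not the thesis's NTR-LR algorithm. The choice of conjugate gradient as the inner solver, the backtracking line search, the 0/1 label encoding, and all function names and default parameters (`lam`, `cg_iters`, and so on) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def objective(w, X, y, lam):
    # Negative log-likelihood for labels in {0, 1} plus an L2 penalty.
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (w @ w)


def conjugate_gradient(hess_vec, b, max_iter=30, tol=1e-6):
    # Approximately solve H x = b using only Hessian-vector products,
    # stopping early ("truncating") after max_iter steps.
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Hp = hess_vec(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def fit_rlr_truncated_newton(X, y, lam=1.0, newton_iters=20, cg_iters=30):
    # Newton iterations in which the linear system H step = grad is solved
    # only approximately, so the Hessian is never formed or inverted.
    w = np.zeros(X.shape[1])
    for _ in range(newton_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + lam * w
        if np.linalg.norm(grad) < 1e-6:
            break
        s = p * (1.0 - p)

        def hess_vec(v, s=s):
            # (X^T diag(s) X + lam I) v without building the matrix.
            return X.T @ (s * (X @ v)) + lam * v

        step = conjugate_gradient(hess_vec, grad, max_iter=cg_iters)
        # Backtracking line search (an added safeguard, assumed here).
        f0, t = objective(w, X, y, lam), 1.0
        while objective(w - t * step, X, y, lam) > f0 - 1e-4 * t * (grad @ step) and t > 1e-8:
            t *= 0.5
        w = w - t * step
    return w


def g_mean(y_true, y_pred):
    # Geometric mean of sensitivity (minority class 1) and specificity (class 0).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return np.sqrt(sensitivity * specificity)
```

With `lam > 0` the regularized Hessian is positive definite, so the inner solve is well posed even when the unregularized system would be singular, which is one way to read the singularity problem mentioned in the summary. Predictions can be obtained as `(X @ w > 0).astype(int)` and scored with `g_mean`.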