Keywords
Data mining, logical decision trees, imbalanced data sets, ensemble classification
INTRODUCTION
Imbalanced class distributions are frequently encountered in real-world classification problems, arising in fraud detection, risk management, text classification, medical diagnosis, and many other domains. Such imbalanced data sets differ from balanced ones not only in the skewness of their class distributions, but also in the increased importance of the minority class. Despite their frequent occurrence and large impact in day-to-day applications, the imbalance issue is not properly addressed by many standard machine-learning algorithms, because they assume either balanced class distributions or equal misclassification costs [1].
Various approaches have been proposed to cope with imbalanced classes, including over/under-sampling [2], [3], the Synthetic Minority Oversampling Technique (SMOTE), cost-sensitive learning [4], modified kernel-based methods, and active learning [5], [6]. Although these methods do somewhat alleviate the problem, they are often based on heuristics rather than on clear guidance. For instance, in oversampling, the optimal oversampling ratio varies widely across data sets and is usually determined through repeated cross-validation or other heuristics.
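To make that ratio-selection heuristic concrete, the following minimal sketch tunes an oversampling ratio by cross-validation. It is illustrative only: it assumes the scikit-learn and imbalanced-learn libraries are available, and the synthetic data set stands in for a real imbalanced problem.

# Minimal sketch: selecting an oversampling ratio by cross-validation.
# Assumes scikit-learn and imbalanced-learn; the data set is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Synthetic imbalanced binary data (about 10% minority) as a placeholder.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

best_ratio, best_auc = None, -np.inf
for ratio in [0.2, 0.4, 0.6, 0.8, 1.0]:  # minority/majority size after resampling
    model = Pipeline([
        ("oversample", RandomOverSampler(sampling_strategy=ratio, random_state=0)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ])
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_ratio, best_auc = ratio, auc

print(f"selected ratio={best_ratio}, mean AUROC={best_auc:.3f}")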
Recent findings suggest that the class imbalance problem should be approached from multiple angles and evaluated with different metrics [7]. These findings show that the degree of imbalance is not the only factor hindering the learning process. Rather, the difficulties arise from various other factors such as overlapping classes, lack of representative data, and small disjuncts. Even worse, the effect of such factors is amplified when the class distribution is highly imbalanced [1].
Decision tree learning is one of the methods used for classification and diagnosis. As with many other machine learning methods, a decision tree is learned from a data set of already-classified instances and is later used as a classifier. The set of instances used to “train” the decision tree is called the training set.
RELATED WORK
In [1], the authors observe that, with the continuous expansion of data availability in many large-scale, complex, and networked systems such as surveillance, security, the Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry.
In [2], all of the data reduction methods studied improved the identification of small classes (by 20-30%), but the differences were insignificant. However, significant differences in the accuracies, true-positive rates, and true-negative rates obtained with the 3-nearest-neighbor method and C4.5 on the reduced data favored the Neighborhood Cleaning Rule (NCL). The results suggest that NCL is a useful method for improving the modeling of difficult small classes, and for building classifiers that identify these classes in real-world data.
In [3], the authors compare the performance of classifiers generated from unbalanced data sets with the performance of classifiers generated from balanced versions of the same data sets. This comparison isolates and quantifies the effect that the training set's class distribution has on learning, and contrasts the performance of the classifiers on the minority and majority classes. A second study assesses which distribution is "best" for training with respect to two performance measures: classification accuracy and the area under the ROC curve (AUC).
In [4], the authors note that Support Vector Machines (SVMs) have been extensively studied and have shown remarkable success in many applications; however, their success is very limited when they are applied to learning from imbalanced data sets in which negative instances heavily outnumber the positive instances (e.g., in gene profiling and credit card fraud detection). The paper discusses the factors behind this failure and explains why the common strategy of under-sampling the training data may not be the best choice for SVMs.
In [5], the authors address the problem that occurs when there are significantly fewer observations of the target concept. Various real-world classification tasks, such as medical diagnosis, text categorization, and fraud detection, suffer from this phenomenon, and standard machine learning algorithms yield better prediction performance with balanced data sets. The paper demonstrates that active learning is capable of addressing the class imbalance problem by providing the learner with more balanced classes.
In [6], the authors consider large, real-world inductive learning problems in which the number of training examples must often be limited due to the costs of procuring, preparing, and storing the examples and/or the computational costs of learning from them. In such circumstances, a question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? The article helps answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data.
PROPOSED ALGORITHM
A. Splitting Decision Trees
Decision tree learning is a common method used in data mining. Most commercial packages offer sophisticated tree classification algorithms, but they are expensive. Decision tree algorithms generate tree-structured classification rules, which are written as conjunctions and disjunctions of feature (or attribute) values. These classification rules are constructed by:
- selecting the best splitting feature based on a certain criterion,
- partitioning the input data according to the values of the best splitting feature, and
- recursively repeating this process until certain stopping criteria are met.
The selected best splitting feature affects not only the current partition of the input data, but also the subsequent best splitting features, because it changes the sample distribution of the resulting partitions. Thus, best-splitting-feature selection is arguably the most significant step in decision tree building, and decision trees are named after the splitting criteria they use: for example, ID3 and C4.5 use Shannon entropy-based criteria (information gain and the gain ratio), while CART uses the Gini impurity measure.
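As a concrete illustration of these criteria, here is a minimal sketch (assuming only NumPy; the function names are illustrative, not taken from any cited implementation) of Shannon entropy, Gini impurity, and the information gain of a candidate binary split:

# Minimal sketch of the splitting criteria named above.
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector (used by ID3/C4.5)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini impurity of a label vector (used by CART)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, mask):
    """Entropy reduction achieved by splitting y into y[mask] / y[~mask]."""
    n, n_left = len(y), mask.sum()
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0
    return entropy(y) - (n_left / n) * entropy(y[mask]) - (n_right / n) * entropy(y[~mask])

# The best splitting feature is the one whose partition maximizes the gain.
y = np.array([0, 0, 0, 0, 1, 1])
mask = np.array([True, True, True, False, False, False])  # candidate partition
print(information_gain(y, mask), gini(y))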
B. Bagging-Based Ensembles
Bagging consists of training different classifiers on bootstrapped replicas of the original training data set. That is, a new data set is formed to train each classifier by randomly drawing instances (with replacement) from the original data set, usually preserving the original data set size.
A bagging algorithm does not require recomputing any kind of weights; therefore, there is no weight-update formula to adapt and no computations in the algorithm to change. In these methods, the key factor is the way each bootstrap replica is collected (Algorithm 1), that is, how the class imbalance problem is handled when obtaining a useful classifier in each iteration, without neglecting the importance of diversity.
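The following minimal sketch (NumPy only; the names are illustrative) shows one way a bootstrap replica can be collected for standard bagging, together with an under-bagging variant in which each replica keeps all minority instances and draws an equal number of majority instances, so each base classifier trains on balanced data:

# Minimal sketch of collecting bootstrap replicas for bagging.
import numpy as np

def bootstrap_replica(X, y, rng):
    """Standard bagging replica: draw n instances with replacement."""
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]

def underbag_replica(X, y, rng):
    """Under-bagging replica: keep all minority instances, sample an
    equal number of majority instances with replacement."""
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = rng.choice(np.where(y == majority)[0], size=len(min_idx), replace=True)
    idx = np.concatenate([min_idx, maj_idx])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)        # imbalanced: 8 majority, 2 minority
Xb, yb = underbag_replica(X, y, rng)   # balanced replica for one iteration
print(np.bincount(yb))                 # -> [2 2]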
C. First-Order Logical Decision Trees
In a first-order logical decision tree, each example is represented as a small database l (here drawn from the KEEL repository); a test in a node corresponds to checking whether a query ← C succeeds in l ∧ B, with B the background knowledge supplied by the Lift-Boosting Ensemble of α-Trees (LEAT) framework. Note that it is not sufficient to use for C the conjunction conj in the node itself. Since conj may share variables with nodes higher in the tree, C consists of the conjunctions that occur along the path from the root to the current node. Therefore, when an example is sorted to the left, C is updated by adding conj to it. When an example is sorted to the right, C need not be updated: a failed test never introduces new variables.
A first-order logical decision tree (FOLDT) is a binary decision tree in which the nodes of the tree contain a conjunction of literals, and different nodes may share variables, under the following restriction: a variable that is introduced in a node (which means that it does not occur in higher nodes) must not occur in the right branch of that node. An example of a logical decision tree is shown in Fig. 1. It encodes the target hypothesis of Example 1. |
PSEUDO CODE
Procedure logical_classify(s : sample) returns class:
Step 1: C := true
Step 2: N := root
Step 3: while N ≠ leaf(c) do
Step 4:   let N = inode(conj, left, right)
Step 5:   if C ∧ conj succeeds in B ∧ s (with B the background knowledge) then
Step 6:     C := C ∧ conj; N := left
Step 7:   else N := right
Step 8: return c
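A minimal Python rendering of this procedure is sketched below. A full first-order query engine is beyond the scope of the sketch, so the test "C ∧ conj succeeds in B ∧ s" is abstracted as a caller-supplied succeeds function; all names are illustrative:

# Minimal Python sketch of the classification procedure above.
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    label: str                     # class c stored in the leaf

@dataclass
class INode:
    conj: object                   # conjunction of literals tested at this node
    left: Union["INode", Leaf]     # branch taken when the test succeeds
    right: Union["INode", Leaf]    # branch taken when the test fails

def logical_classify(sample, root, succeeds: Callable) -> str:
    C = []                                 # Step 1: C := true (empty conjunction)
    N = root                               # Step 2: start at the root
    while not isinstance(N, Leaf):         # Step 3: descend until a leaf
        if succeeds(C + [N.conj], sample): # Steps 4-5: does C ∧ conj succeed?
            C = C + [N.conj]               # Step 6: extend C with conj...
            N = N.left                     # ...and sort the example left
        else:
            N = N.right                    # Step 7: failed tests add no variables
    return N.label                         # Step 8: return the leaf's class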
SIMULATION RESULTS
The simulation studies evaluate the proposed approach on 20 data sets drawn from the KEEL Imbalanced Data Sets repository. Many of the KEEL data sets are derived from the UCI Machine Learning Repository, either by choosing imbalanced data sets or by collapsing/removing some classes to turn them into imbalanced binary data sets. paw02a, 03subcl5, and 04clover are artificial data sets that do not come from the UCI repository; they are designed to simulate noisy, borderline, and imbalanced examples [8]. Aside from the stated modifications, each data set from the KEEL repository is used “as is.”
Table 1 reports the results of logical decision tree classification and of the α-diversified classifiers built with the EAT framework: EAT (α-diversified single tree), BEAT (α-diversified bagged trees), and UBEAT (α-diversified under-bagged trees), with K = 3. Note that the EAT framework can be plugged into any bagging-based ensemble algorithm, such as over-bagging. Along with the two baseline algorithms and a single C4.5 tree, these six algorithms are applied to the data sets, and AUROC values from 5 × 2 cross-validation are recorded.
The results show that α-diversified ensembles achieve better AUROC performance than the corresponding baseline approaches.
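For reference, the 5 × 2 cross-validation AUROC protocol can be reproduced with the following minimal sketch (assuming scikit-learn; a single decision tree and a synthetic data set stand in for the six compared algorithms):

# Minimal sketch of the 5 x 2 cross-validation AUROC protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# 5 repetitions of 2-fold CV = the 5 x 2 protocol.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]   # minority-class scores
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"mean AUROC over 5x2 CV: {np.mean(aucs):.3f}")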
CONCLUSION AND FUTURE WORK
The simulation results showed that the proposed algorithm performs better than the existing semi-supervised clustering algorithms. The proposed work presents a new ensemble framework based on first-order logical relationships, called FOLR, that consistently yields higher AUROC values on class-imbalanced data sets. It modifies the traditional pruning rules of decision tree algorithms to directly reflect a condition-based evaluation metric. A first-order logical relationship tree generalizes well to unobserved instances only if the instances are described in terms of features that are correlated with the target concept. The resulting decision trees also provide a clear indication of which fields are most important for prediction or classification.
As future work, the system needs to be tested on a larger collection of data sets, and further training and classification work is required to handle comparative sentences expressed in different domains.
Tables at a glance
Table 1

Figures at a glance
Figure 1
Figure 2
References
- H. He and E.A. Garcia, “Learning from Imbalanced Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, Sept. 2009.
- J. Laurikkala, “Improving Identification of Difficult Small Classes by Balancing Class Distribution,” Proc. Eighth Conf. AI in Medicine in Europe: Artificial Intelligence Medicine, pp. 63-66, 2001.
- G. Weiss and F. Provost, “The Effect of Class Distribution on Classifier Learning: An Empirical Study,” technical report, Dept. of Computer Science, Rutgers Univ., 2001.
- K. McCarthy, B. Zabar, and G. Weiss, “Does Cost-Sensitive Learning Beat Sampling for Classifying Rare Classes?” Proc. Int’l Workshop Utility-Based Data Mining, pp. 69-77, 2005.
- R. Akbani, S. Kwek, and N. Japkowicz, “Applying Support Vector Machines to Imbalanced Data Sets,” Proc. 15th European Conf. Machine Learning, 2004.
- S. Ertekin, J. Huang, and C.L. Giles, “Learning on the Border: Active Learning in Imbalanced Data Classification,” Proc. 30th Ann. Int’l ACM SIGIR Conf., pp. 823-824, 2007.
- N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429-449, 2002.
- K. Napierala, J. Stefanowski, and S. Wilk, “Learning from Imbalanced Data in Presence of Noisy and Borderline Examples,” Proc. Seventh Int’l Conf. Rough Sets and Current Trends in Computing, pp. 158-167, 2010.