ISSN ONLINE(2320-9801) PRINT (2320-9798)

A Pragmatic Approach of Preprocessing the Data Set for Heart Disease Prediction

Dr. Durairaj.M, Sivagowry.S
  1. Assistant Professor, Department of Computer Science, Engineering and Technology, Bharathidasan University, India.
  2. Research Scholar, Department of Computer Science, Engineering and Technology, Bharathidasan University, India.
Corresponding Author: SHARMA VIVEK, E-mail: vivek03sharma@rediffmail.com

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

The medical ecosystem is rich in data but lacks techniques for extracting information from it, because effective analysis tools for discovering hidden relationships and trends are scarce. By applying data mining techniques, valuable knowledge can be extracted from the health care system and applied to the accurate diagnosis and proper treatment of disease. Heart disease is a group of conditions affecting the structure and function of the heart, and it has many root causes. It has been the leading cause of death worldwide over the past ten years. Researchers have developed many hybrid data mining techniques for diagnosing heart disease. This paper describes a preprocessing technique and analyzes the prediction accuracy after preprocessing the noisy data. The accuracy is observed to increase to 91% after preprocessing. Hybridizing Swarm Intelligence techniques with the Rough Set Algorithm for exact reduction of relevant features for prediction is left as future work.

Keywords

Data Mining, Artificial Neural Network, Multilayer Perceptron, Radial Basis Function, Root Mean Squared Error, Prediction Accuracy, Redunt Features

INTRODUCTION

Data Mining is the exploration of large datasets to extract hidden and previously unknown patterns, relationships and knowledge that are difficult to detect with traditional statistics [17]. Data mining techniques are the result of a long process of research and product development [5]. Data Mining involves several steps that lead from raw data collection to some form of new knowledge. The iterative process consists of Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation and Knowledge Representation. Figure 1 shows that Data Mining is the core of the processes involved in Knowledge Discovery. Knowledge Discovery is the process of deriving high level knowledge from low level data [38].
Medical Data Mining is a challenging domain that involves a great deal of imprecision and uncertainty. Providing quality services at affordable cost is the major challenge faced by health care organizations. Poor clinical decisions may lead to disastrous consequences. Health care data is massive, yet clinical decisions are often based on the doctor’s experience rather than on the knowledge-rich data hidden in the database. In some cases this results in errors and excessive medical costs, which affect the quality of service to patients [33]. Medical history data comprise a number of tests essential for diagnosing a particular disease. Health care can benefit from Data Mining [21], [25], [26] by employing it as an intelligent diagnostic tool [34]. The accuracy of prediction can also be improved by using Data Clustering Algorithms [4]. Researchers in the medical field have succeeded in identifying and predicting disease with the aid of Data Mining techniques [29]. Association rules from Data Mining have also been used significantly [7], [11], [12].
The paper is organized as follows: Section II describes heart disease and its impact on society. Section III describes the data set collected for experimentation. Section IV describes the preprocessing techniques and the training of the data set using Multilayer Perceptron and Radial Basis Function networks. Section V concludes the paper.

HEART DISEASE

The rise of health care cost is one of the world’s most important problems [21]. A heart attack happens when the flow of blood is irregular and the heart muscle is injured because of inadequate oxygen supply [47]. The World Health Organization reported in 2008 that 30% of total global deaths are due to Cardio Vascular Disease (CVD). By 2030, almost 25 million people will die from CVDs, mainly from heart disease and stroke [10], [14], [49]. These are projected to remain the single leading cause of death. CVD is expected to be the leading cause of death in developing countries due to changes in lifestyle, work culture and food habits. Hence, more careful and efficient methods for detecting cardiac disease, together with periodic examination, are of high importance [27], [28].

RELATED WORK

A Genetic Algorithm (GA) [1], [16] is used to determine the attributes for the diagnosis of heart disease. Feature extraction is done with the aid of the GA, which reduces the number of attributes to 6. Naïve Bayes [35], Classification via Clustering and Decision Tree are the classifiers used for testing the reduced data set. The Decision Tree is observed to perform best but takes more time to build the model. Naïve Bayes performs consistently before and after the reduction of attributes. Classification via Clustering performs poorly. The Weka tool is used for evaluation.
A weighted fuzzy rule based Clinical Decision Support System (CDSS) is proposed in [2], [13]. It consists of two phases. The first phase is an automated approach for generating fuzzy rules, and the second develops a fuzzy rule based Decision Support System. The CDSS is compared with a Neural Network based system in terms of Sensitivity, Specificity and Accuracy. The Cleveland, Hungarian and Switzerland data sets are used. The Sensitivity of the Neural Network (NN) and the CDSS is 52.47% and 45.22% respectively, the Specificity is 52.46% and 68.75%, and the Accuracy is 53.86% and 57.85%.
Data Mining Classification [3], [9] is based on supervised machine learning algorithms. The Tanagra tool is used to classify the data, and the results are evaluated using 10-fold cross validation. The Naïve Bayes, K-nn [32] and Decision List algorithms are considered, and their performance is analyzed in terms of accuracy and the time taken to build the model. Naïve Bayes is considered the better choice since it takes less time than the other algorithms and also results in lower error rates. The Naïve Bayes algorithm gives an accuracy of 52.23%.
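The 10-fold cross validation mentioned above can be sketched as follows. This is a minimal illustration using only the standard library, not the procedure implemented in Tanagra; the function name `k_fold_indices`, the sample count of 25 and the seed are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index starting at offset f,
    # so fold sizes differ by at most one.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx

# Hypothetical data set of 25 samples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every instance is used for testing once and for training in the remaining folds.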
The Intelligent Heart Disease Prediction System (IHDPS) [25] is developed using three Data Mining techniques, namely Decision Tree, Naïve Bayes and Neural Network. Each technique has its own strength in realizing the objectives of Data Mining. The DMX query language is used, which answers complex “what if” queries that a conventional Decision Support System cannot. Five Data Mining rules are defined and evaluated using the three models. Naïve Bayes [26], [16] is found to be the most effective in heart disease diagnosis.
Misclassification Analysis [41] is used for Data Cleaning. A Complementary Neural Network is used to enhance the performance of the network classifier. The Falsity NN is obtained by complementing the target output of the training data, and the Truth NN and Falsity NN are trained for membership values. Two techniques are used. In the first technique, new training data are obtained by eliminating all misclassification patterns. In the second technique, only the misclassification patterns are eliminated. The classification accuracy improves after Data Cleaning, and Technique II shows higher accuracy than Technique I.
Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [19], [20], Support Vector Machine (SVM), Decision Tree and Artificial Neural Network are compared. The performance of the algorithms is compared in terms of Sensitivity, Specificity, Accuracy, Error rate, True Positive rate and False Positive rate. SVM predicts with the lowest error rate and the highest accuracy.
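The comparison measures named above are all derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the counts passed to it are hypothetical and are not taken from any of the cited studies.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard measures from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),          # also the true positive rate
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,        # equals 1 - accuracy
    }

# Hypothetical counts for illustration only.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```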
Subhagata Chattopadhyay [48] has mined some important predisposing factors of heart attack. 300 real world cases and 12 factors are taken for the study. Divisive Hierarchical Clustering (DHC) techniques are used to cluster the sample with ‘single’, ‘average’ and ‘complete’ linkage. It is also observed that men in the age group of 48-60 are prone to severe and moderate heart attacks, whereas women over 50 years are mostly affected by mild attacks.

DATA SET

The data set used for experimentation is obtained from the data mining repository of the University of California, Irvine (UCI). The Cleveland, Hungary, Switzerland, VA Long Beach and Statlog data sets are collected. The Cleveland, Hungary, Switzerland and VA Long Beach data sets contain 76 attributes, of which 14 are taken for experimentation. The Cleveland and Statlog data sets are the ones most commonly used for testing by researchers in the medical domain, because all the other data sets have more missing values than the Cleveland data set [46].
A. Attributes Used
Table 1 shows the attributes used for heart disease prediction.
A data set gathered for mining may contain numeric attributes, nominal attributes or both. The data set collected for heart disease contains both. Of the 14 attributes, age, trestbps, chol, thalach and oldpeak are numeric and the remaining 9 are nominal. The relative importance of the variables in predicting heart disease is shown in Figure 2.

PROPOSED WORK

The proposed work methodology for preprocessing is as follows:
a. As a preliminary stage, the data set is preprocessed using the NumerictoNominal and Replace Missing Value techniques.
b. After cleaning, the data set is trained and the accuracy is measured.
c. The next stage is the extraction of redundant features for prediction.
d. This is to be done using Swarm Intelligence techniques hybridized with the Rough Set Algorithm.
e. The data set is validated to acquire the optimal Redunt feature for prediction.
Figure 3 describes the proposed work methodology.

THE DATA SET PREPROCESSING AND TRAINING

An Artificial Neural Network is a mathematical model inspired by biological Neural Networks [8], [23], [24]. A Neural Network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation [10]. A Neural Network is considered an adaptive system that changes its structure during its learning phase [6]. Neural Networks are used to model complex relationships between inputs and outputs or to find patterns in data. The weights of the interconnections are adjusted to produce the desired output [38].
The Multilayer Perceptron (MLP) is a Neural Network based on supervised learning, and the network is trained using the Back Propagation Algorithm [15]. Back Propagation is the most popular Neural Network training algorithm. The Feed Forward Neural Network, or Multilayer Perceptron, is the most widely studied network for classification. MLP uses a non-linear activation function. The hidden neurons enable the network to handle highly complex tasks [30], [31].
Figure 4 gives the architecture of the MLP network. One of the most important characteristics of a Perceptron network is the number of neurons in the hidden layer(s). If too few neurons are used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long and, worse, the network may overfit the data. When overfitting occurs, the network begins to model random noise in the data. The result is a model that fits the training data extremely well but generalizes poorly to new, unseen data. Validation must be used to test for this.
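The forward pass of an MLP with a single hidden layer can be sketched as below. This is an illustrative sketch rather than the network configuration used in the experiments: the sigmoid activation, the single output neuron and all weight values are assumptions, and the Back Propagation weight update is not shown.

```python
import math

def sigmoid(z):
    """Non-linear activation function (assumed; the paper does not name one)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: input -> one hidden layer -> one output neuron."""
    # Each hidden neuron applies the activation to a weighted sum of the inputs.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    # The output neuron combines the hidden activations in the same way.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical network with 2 inputs and 2 hidden neurons.
y = mlp_forward(
    x=[1.0, 1.0],
    hidden_weights=[[1.0, -1.0], [0.5, 0.5]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, -1.0],
    output_bias=0.0,
)
```

The number of rows in `hidden_weights` is the number of hidden neurons, which is the quantity that trades off underfitting against overfitting.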
The Radial Basis Function Network is an ANN that uses radial basis functions as activation functions. Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a nonlinear RBF activation function, and a linear output layer. The input can be modeled as a vector of real numbers $\mathbf{x} \in \mathbb{R}^n$. The output of the network is then a scalar function of the input vector, $\varphi : \mathbb{R}^n \to \mathbb{R}$, and is given by
$$\varphi(\mathbf{x}) = \sum_{i=1}^{N} a_i \, \rho\left(\lVert \mathbf{x} - \mathbf{c}_i \rVert\right)$$
where $N$ is the number of neurons in the hidden layer, $\mathbf{c}_i$ is the centre vector for neuron $i$, $a_i$ is the weight of neuron $i$ in the linear output neuron, and $\rho$ denotes the radial basis function. Functions that depend only on the distance from a centre vector are radially symmetric about that vector, hence the name radial basis function. In the basic form, all inputs are connected to each hidden neuron.
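The network output described above, a weighted sum of radial functions of the distance to each centre, can be sketched in a few lines. The Gaussian basis function and the width parameter `beta` are assumptions, since the text does not fix a particular radial function; the centres and weights are hypothetical.

```python
import math

def rbf_output(x, centres, weights, beta=1.0):
    """Weighted sum of Gaussian radial basis functions (Gaussian form assumed)."""
    total = 0.0
    for c, a in zip(centres, weights):
        # Squared Euclidean distance between the input and this centre.
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        total += a * math.exp(-beta * sq_dist)
    return total

# Hypothetical network with N = 2 hidden neurons on 2-dimensional input.
out = rbf_output([0.0, 0.0], centres=[[0.0, 0.0], [1.0, 0.0]], weights=[1.0, 1.0])
```

A neuron whose centre coincides with the input contributes its full weight, and the contribution decays as the input moves away from the centre.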
Figure 5 shows the architecture of the Radial Basis Function Network. RBF networks are two-layered feed-forward networks. The hidden nodes implement a set of radial basis functions, and the output nodes implement a set of linear summation functions as in MLP. The network training is divided into two stages. In the first stage, the weights from the input to the hidden layer are determined. In the second stage, the weights from the hidden to the output layer are determined. These networks are very good at interpolation.
Data Preprocessing [45] plays a significant role in Data Mining. The training phase of Knowledge Discovery becomes very difficult if the data contain irrelevant or redundant information, or noisy and unreliable values. Medical data contain many missing values, so preprocessing is an obligatory step before training on them. A total of 303 instances are trained before preprocessing.
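One common way to handle the missing values mentioned above is to replace them with the column mean for numeric attributes and the column mode for nominal attributes. The sketch below assumes that rule; the paper does not state how its Replace Missing Value step fills the gaps, and the function name and sample values are hypothetical.

```python
from collections import Counter

def replace_missing(column, nominal):
    """Fill None entries with the mode (nominal) or the mean (numeric)."""
    present = [v for v in column if v is not None]
    if nominal:
        # Most frequent observed label.
        fill = Counter(present).most_common(1)[0][0]
    else:
        # Arithmetic mean of the observed values.
        fill = sum(present) / len(present)
    return [fill if v is None else v for v in column]

# Hypothetical columns: a numeric attribute and a nominal attribute.
chol_filled = replace_missing([200, None, 240, 250], nominal=False)
thal_filled = replace_missing(["3", None, "7", "3"], nominal=True)
```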
Table 2 compares the performance of MLP and RBF before preprocessing the data, in terms of Correlation Coefficient, Mean Absolute Error, RMSE, RAE and Root Relative Squared Error. It is observed that the RMSE and Correlation Coefficient values are low. The Correlation Coefficient is a measure of the statistical correlation between predicted and actual values. A Correlation Coefficient of 1 indicates perfect statistical correlation, and a value of 0 indicates no correlation. The Correlation Coefficient of the RBF network is close to 0. Hence, training a network with MLP will yield better results than RBF.
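The five measures listed above can be computed from paired actual and predicted values as sketched below. The formulas are the usual textbook definitions, with RAE and RRSE taken relative to a predictor that always outputs the mean of the actual values; the tool used in the experiments may define its baseline slightly differently. The sample values are hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Correlation, MAE, RMSE, RAE and RRSE (assumes non-constant inputs)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the baseline that always predicts the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / base_abs,
        "rrse": math.sqrt(sq_err / base_sq),
    }

# Hypothetical targets and network outputs.
scores = regression_metrics([0, 1, 0, 1], [0.1, 0.8, 0.3, 0.9])
```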
The heart disease data set has both numeric and nominal attributes. The primary step of preprocessing is the conversion of numeric attributes to nominal attributes, for which the NumerictoNominal conversion is used. The Table 1 shows the result before preprocessing the data set. Before preprocessing, the data set has no categorization into absence or presence of disease, and only the num attribute depicts the output. After converting the numeric attributes to nominal, the presence and absence of disease are easily analyzed. Nominal conversion of attributes is therefore found to be an effective preprocessing technique. The MLP and RBF networks are trained after preprocessing, and the results obtained are promising.
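The numeric to nominal conversion can be sketched as treating every distinct numeric value of an attribute as its own category label. This is an assumption about how the NumerictoNominal step behaves, chosen because the paper does not describe any binning; the function name and the sample `num` column are hypothetical.

```python
def numeric_to_nominal(values):
    """Map each distinct numeric value to a string label, keeping None as missing."""
    distinct = sorted({v for v in values if v is not None})
    mapping = {v: str(v) for v in distinct}
    nominal = [mapping[v] if v is not None else None for v in values]
    labels = [mapping[v] for v in distinct]
    return nominal, labels

# Hypothetical values of the num (output) attribute.
nominal, labels = numeric_to_nominal([0, 2, 1, 0, 3])
```

After the conversion a classifier sees the attribute as a set of discrete classes rather than as a quantity to regress on.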
The data is again subjected to classification using both MLP and RBF, and the prediction accuracy is calculated in both cases. MLP is observed to outperform RBF in accuracy, and its Kappa Statistic is also higher. A Kappa Statistic of 0.7 or higher is taken to indicate good statistical correlation, and the higher the Kappa value, the better the correlation. Table 3 compares the performance of both networks after converting numeric attributes to nominal.
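The Kappa Statistic compares the observed agreement between predicted and actual classes with the agreement expected by chance. A minimal sketch using the standard Cohen's kappa formula follows; the confusion matrix values are hypothetical and are not the results reported in Table 3.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows actual, columns predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Hypothetical two-class confusion matrix with 100 instances.
kappa = cohen_kappa([[45, 5], [10, 40]])
```

In this example the observed agreement is 0.85 and the chance agreement is 0.5, which puts kappa at the 0.7 threshold quoted above.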
From Table 3, it is inferred that both MLP and RBF perform better after preprocessing, but MLP outperforms the Radial Basis Function network in accuracy and in Relative Absolute Error. It is also observed that the RAE value improves after preprocessing the data set. Replace Missing Value is another preprocessing technique used in the analysis. After replacing the missing values, the network is trained again to evaluate its performance. The Multilayer Perceptron network again outperforms the Radial Basis Function network, as revealed by the following figures.

CONCLUSION

Heart disease prediction is a major challenge in the health care industry. Selecting fewer attributes without affecting the accuracy of diagnosis is a challenging task in Data Mining. Removing and correcting noisy data and extracting information from medical data would help medical practitioners in many ways. Apart from the removal of noisy data, feature extraction is a significant task in the prediction of Cardio Vascular Disease. The experiments show that preprocessing the data yields promising results: it enhances the diagnosis and prediction accuracy, which reached nearly 91%. The MLP network predicts with higher accuracy and lower error rates than RBF. Extracting relevant features and properly training the network will result in more promising diagnosis. The future direction of this research is the extraction of the relevant Redunt feature, which would further improve the prediction accuracy.
 

Tables at a glance

Table 1 Table 2 Table 3
 

Figures at a glance

Figure 1 Figure 2 Figure 3 Figure 4
Figure 5 Figure 6 Figure 7
 

References