Impact of Encryption Techniques on Classification Algorithm for Privacy Preservation of Data

Jharna Chopra; Sampada Satav

M.E. Scholar, CTA, SSGI, Bhilai, Chhattisgarh, India
Asst. Prof., CSE, SSGI, Bhilai, Chhattisgarh, India

Visit for more related articles at International Journal of Innovative Research in Science, Engineering and Technology

Abstract

In this paper, the Naïve Bayesian and K-Nearest Neighbour algorithms have been implemented for classification, and AES, Triple DES and Rijndael for encryption, on nine real-world datasets. The goal of the research is to evaluate the performance of the classification algorithms when the dataset is encrypted, using a variety of performance metrics (classification accuracy, precision, recall (sensitivity), specificity, and lift/gain charts), and to determine the impact of encryption on these algorithms. We found that, aside from the obvious time penalty incurred by applying an encryption algorithm to protect user privacy, the performance of the classification algorithms remained the same on most of the datasets. However, the time penalty for encrypting the data before classification varied greatly depending on the encryption algorithm used.

Keywords

Classification, data mining, statistical methods, logistic regression, regression trees, discriminant analysis

INTRODUCTION

In the recent past, there has been an exponential increase in the amount of stored data, and managers and decision makers face the problem of information overload. For example, in 1992, Frawley, Piatetsky-Shapiro and Matheus reported that the amount of data in the world doubles every twenty months. In 1998, Cios, Pedrycz and Swiniarski reported that Wal-Mart alone uploads twenty million point-of-sale (POS) transactions every day. Today we have far more information stored than we can handle, and as data volume increases, making meaningful decisions becomes increasingly difficult. To address these issues, researchers turned to a new research area called Data Mining and Knowledge Discovery in Databases. In the past decades, data mining methods have been widely used to extract knowledge from large datasets. Classification, a supervised method used to assign records to one of several classes, is the most widely used data mining method. But with this increasing volume of data comes the question of data privacy: how can we process this huge volume of data while keeping user privacy intact?

There have been several studies comparing classification algorithms. However, most of these studies were performed without taking privacy into account. This paper classifies various datasets using two of the most popular classification algorithms, with the datasets encrypted, in order to understand the pros and cons of enabling privacy preservation.

STATEMENT OF THE PROBLEM

An abundance of classification algorithms has been developed to solve data classification problems. Machine learning and data mining are among the most highly researched fields today. However, the applicability of the algorithms varies greatly with the scenario under consideration. A number of commercial tools are also available which provide a wide range of classification techniques. No single algorithm has been demonstrated to be superior in all scenarios. Similarly, since much of the data being classified using these algorithms is personal to the user, it is also very important to consider which encryption algorithm should be used. The research question is therefore: "What is the impact of encryption on the performance of data classification algorithms?" The primary focus of our research is to evaluate the impact of encryption on the performance of two of the most popular classification algorithms, using both statistical and machine learning methods on multiple datasets. An important aspect of this work is the use of a variety of performance criteria to evaluate the learning methods. The performance criteria we have chosen are precision, recall and specificity. The dataset chosen for the project is the Newsgroup dataset.

EXPERIMENTAL PROCEDURE

The following classification algorithms have been selected within the scope of this paper:

• K-Nearest Neighbour, and

• Naïve Bayesian

The following encryption algorithms have been selected within the scope of this paper:

• AES,

• Triple DES, and

• Rijndael

There are three phases to building this project: (1) building the software that implements all of the above algorithms using standard implementations; (2) verifying the algorithms by running them on a sample dataset and checking the results; and (3) running the classification algorithms on datasets encrypted with the above encryption algorithms. The models were evaluated using the following evaluation methods:

• Precision,

• Recall/Sensitivity, and

• Specificity

Fig 1 describes the proposed methodology for privacy preservation of data using various encryption algorithms.
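For reference, the three evaluation measures listed above are usually defined per class in terms of the true positive (TP), false positive (FP), false negative (FN) and true negative (TN) counts for that class. The formulas below are the standard textbook definitions, not taken from the paper's own equations:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
```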

RESULTS AND DISCUSSION

This section discusses the performance results of each algorithm and addresses the research question. The selected algorithms were evaluated on a publicly available dataset.

A) Classification Accuracy

Table 1 shows, for each dataset, the estimated classification accuracy of the algorithms with and without encryption. As Table 1 shows, the classification accuracy of the Naïve Bayesian algorithm tends to be better when encryption is used. The results also show that the effect of enabling encryption on the accuracy of the algorithms is minimal. AES and Rijndael show slightly better accuracy than Triple DES. The classifiers show comparable results even with encryption enabled.
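The paper does not describe how the encrypted records are fed to the classifiers. One setting in which comparable accuracy would be expected is when each token is transformed deterministically, so that equal plaintext tokens always map to equal protected tokens. The sketch below illustrates that idea under loud assumptions: it uses a keyed HMAC-SHA256 digest from the Python standard library as a stand-in (it is not AES, Triple DES or Rijndael, and it is not reversible), a tiny hypothetical corpus, and a hand-written multinomial Naïve Bayes. All function names and data are illustrative.

```python
import hashlib
import hmac
import math
from collections import Counter, defaultdict


def pseudonymize(tokens, key):
    """Replace each token by a keyed digest (deterministic stand-in, NOT AES/3DES)."""
    return [hmac.new(key, t.encode("utf-8"), hashlib.sha256).hexdigest()
            for t in tokens]


def train_nb(docs, labels):
    """Collect class counts, per-class token counts and the vocabulary."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest Laplace-smoothed log-probability."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((token_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus (not the Newsgroup data used in the paper).
train_docs = [["engine", "car", "wheel"], ["car", "road", "engine"],
              ["orbit", "rocket", "moon"], ["moon", "launch", "orbit"]]
train_labels = ["autos", "autos", "space", "space"]
test_docs = [["car", "wheel", "road"], ["rocket", "launch", "moon"],
             ["engine", "moon", "orbit"]]

key = b"demo-key"  # illustrative key only
enc_train = [pseudonymize(d, key) for d in train_docs]
enc_test = [pseudonymize(d, key) for d in test_docs]

plain_model = train_nb(train_docs, train_labels)
enc_model = train_nb(enc_train, train_labels)

plain_pred = [predict_nb(plain_model, d) for d in test_docs]
enc_pred = [predict_nb(enc_model, d) for d in enc_test]
print(plain_pred == enc_pred)
```

Because this classifier only depends on which tokens are equal to which, a one-to-one token mapping leaves its counts, and hence its predictions, unchanged. A randomized encryption mode would not have this property, so the sketch should not be read as an explanation of the reported results.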

B) Recall, Precision and Specificity

Table 2 shows the confusion matrix for the neural network (NN) classifier trained on the white wine dataset. We use this table to illustrate our evaluation techniques for recall, precision and specificity. The table cells contain sample counts in the test dataset. The columns represent the predicted class and the rows represent the actual class. We can see from the table that the NN could not predict classes 8 and 9. For example, the numbers of samples with actual label 6 that were incorrectly predicted as 5 and 7 are 101 and 50, respectively. The ATotal column gives the number of test samples whose actual label is specified by the row, and the PTotal row gives the number of test samples whose predicted label is specified by the column. Suppose we are interested in class 6. From the table, 558 samples were actually labeled 6 and 720 samples were predicted as 6. The cell shaded green is the number of true positives (TP), the cells shaded orange are the false positives (FP), the cells shaded yellow are the false negatives (FN), and the cells shaded blue are the true negatives (TN). From Table 2, TP = 405 and FP = 3 + 10 + 135 + 136 + 30 + 1 = 315 (see the color coding). Therefore the precision for class 6 is:

$$\text{Precision}_6 = \frac{TP}{TP + FP} = \frac{405}{405 + 315} = \frac{405}{720} \approx 0.56$$

Here a total of 315 samples were incorrectly predicted as 6. The precision, recall and specificity for each class are calculated in the same way. The overall precision, recall and specificity are computed as a weighted average over the classes. The results for recall, precision and specificity are tabulated in separate tables.
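The per-class calculation and the weighted average described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 3x3 confusion matrix is hypothetical (it is not Table 2), rows are actual classes and columns are predicted classes, and the weights are assumed to be the number of actual samples per class, since the paper does not state which weights it uses.

```python
def per_class_metrics(cm):
    """Compute precision, recall and specificity for each class.

    cm[i][j] is the number of samples with actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # actual k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, actual other
        tn = total - tp - fn - fp
        results.append({
            "support": tp + fn,  # number of actual samples of class k
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        })
    return results


def weighted_average(results, name):
    """Average a metric over classes, weighted by support (an assumption)."""
    total = sum(r["support"] for r in results)
    return sum(r[name] * r["support"] for r in results) / total


# Hypothetical confusion matrix for three classes.
cm = [[8, 2, 0],
      [1, 6, 3],
      [0, 2, 8]]

metrics = per_class_metrics(cm)
for name in ("precision", "recall", "specificity"):
    print(name, round(weighted_average(metrics, name), 4))
```

With support weights, the weighted recall equals the overall classification accuracy, which is one reason precision and specificity are reported alongside it.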

C) Performance by Dataset

Table 3 below shows the statistics of the models for each problem (dataset). The C5.0 algorithm has the best accuracy for the adult, house, segment, white wine and red wine datasets; naïve Bayes (NB) has the best accuracy for the NHANES and cars datasets; neural networks (NN) have the best accuracy for the credit and vehicle datasets. Logistic regression (LR) tied with NB for the best accuracy on the NHANES dataset. CHAID, support vector machines (SVM), discriminant analysis (DA), QUEST, and classification and regression trees (CART) never produced the best accuracy for any of the datasets. Overall classification accuracy alone does not distinguish between the types of errors the classifier makes (i.e. false positives versus false negatives). For example, two or more classifiers may exhibit the same accuracy but behave differently on each category.
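The last point can be made concrete with a small numeric sketch. The two sets of counts below are hypothetical binary-classification results chosen for illustration; they do not come from the paper's tables.

```python
def summarize(tp, fn, fp, tn):
    """Return accuracy, recall and specificity for binary confusion counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Two hypothetical classifiers evaluated on the same 100 test samples
# (50 positive, 50 negative).
a = summarize(tp=40, fn=10, fp=10, tn=40)  # errors split evenly
b = summarize(tp=30, fn=20, fp=0, tn=50)   # all errors are false negatives

print(a)
print(b)
```

Both classifiers get 80 of the 100 samples right, yet the second misses more positives and never raises a false alarm, which accuracy alone does not reveal.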

CONCLUSION

In this paper, classification algorithms have been implemented on nine datasets. The goal of the research was to evaluate the performance of the classification algorithms on both multi-class and binary classification problems using a variety of performance metrics: classification accuracy, precision, recall, specificity, and lift/gain charts. According to the experimental results, the C5.0 model proved to have the best performance; it performed better on many of the datasets used. Neural networks, naïve Bayes and logistic regression also performed well. However, there is no universally best learning algorithm: in our analysis, none of the algorithms outperformed the others on every problem. The performance of a classification algorithm depends on the performance metric and the characteristics of the dataset.
