Partitioning Clustering Algorithms for Data Stream Outlier Detection

Dr. S. Vijayarani¹, Ms. P. Jothi²

Assistant Professor, Department of Computer Science, School of Computer Science and Engineering, Bharathiar University, Coimbatore, Tamilnadu, India
M.Phil Research Scholar, Department of Computer Science, School of Computer Science and Engineering, Bharathiar University, Coimbatore, Tamilnadu, India

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

In recent years, many researchers have focused on mining data streams and have proposed numerous techniques and algorithms for tasks such as data stream classification, data stream clustering and frequent pattern mining. Data stream clustering techniques group similar data items in a stream and can also be used to detect outliers; this approach is called cluster-based outlier detection. The main objective of this research work is to cluster data streams and detect the outliers in them. Two partitioning clustering algorithms, CLARANS and E-CLARANS (Enhanced CLARANS), are used for this purpose. Two performance measures, clustering accuracy and outlier detection accuracy, are used for evaluation. The experimental results show that the proposed E-CLARANS clustering algorithm is more accurate than the existing CLARANS algorithm.

Keywords

Data stream, Data stream clustering, Outlier detection, CLARANS, E-CLARANS

INTRODUCTION

A data stream is a continuous, rapid flow of a sequence of items; it is not possible to control the order in which the items arrive, nor to store the entire stream. Application areas that generate data streams include sensor networks, traffic management, call detail records, blogging and Twitter posts [1]. Because resources are limited relative to such huge volumes of data, conventional data mining systems are not equipped to deal with them. Data stream clustering, which groups related objects into clusters, is a well-known task in data stream mining. With the help of data stream clustering methods [2], we can detect outliers; an outlier is an object that does not conform to the behaviour of normal data objects. Applications of outlier detection include web logs, fraud detection, click streams, telecommunications and web documents. Clustering-based outlier mining methods [14] are unsupervised in nature; the objective here is to find the outliers in a data stream using a partitioning clustering method. An object that does not belong to any cluster, or that belongs to a small cluster, is declared an outlier, so the outlier detection process depends heavily on the clustering technique.

The remainder of this paper is organized as follows. Section 2 reviews the literature. Section 3 describes how the CLARANS and E-CLARANS clustering algorithms are used to detect outliers in data streams. Section 4 discusses the experimental results, and conclusions are given in Section 5.

RELATED WORK

In [8], the authors presented a clustering algorithm called CLARANS, which is based on randomized search. They also developed two spatial data mining algorithms, SD(CLARANS) and NSD(CLARANS). Their experimental results and analysis indicated that both algorithms are effective and can lead to discoveries that are difficult to obtain with existing spatial data mining algorithms. Their experiments also showed that CLARANS is more efficient than the existing clustering methods.

The paper [4] reviewed several clustering procedures and multivariate outlier procedures. The features of multivariate outliers are discussed and their applications are highlighted in this survey. Finally, the authors discussed further research challenges concerning multivariate outliers.

In [5], the authors discussed partitioning clustering based outlier detection for data streams. Incoming data are placed into a window of a specified size, and the data reported as outliers are also stored. Using the K-means algorithm, they found small clusters that lie far away from the other clusters and termed them outliers.

In [9], the authors compared two partitioning clustering approaches, CLARANS and Fuzzy C-Means. Measured by clustering accuracy and outlier accuracy, the CLARANS clustering algorithm performed better in both clustering and outlier detection.

METHODOLOGY

In a data stream, clustering is applied both to group the data items and to detect the outliers. Clustering and outlier detection are among the most important problems in data streams. The main objective of this research work is to analyse the performance of two partitioning clustering algorithms, CLARANS and E-CLARANS, for detecting outliers. The system architecture of the research work is as follows.

A. DATASET

The dataset used in this research work is the Pima Indian data set, which contains 768 instances and 8 attributes. It is taken from the UCI machine learning repository [3]. A data stream is a massive, unbroken sequence of data, and it is not possible to store the complete stream; for this reason, we divide the data into chunks of the same size in different windows.
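
The division of the data into equal-sized chunks can be sketched as follows. This is a minimal illustration in Python, not the MATLAB code used in the experiments; the function name `fixed_size_windows` and the chunk size of 256 are assumptions chosen only so that 768 records divide evenly.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def fixed_size_windows(stream: Iterable[T], window_size: int) -> Iterator[List[T]]:
    """Yield consecutive, non-overlapping chunks of at most `window_size` records."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, window_size))
        if not chunk:  # the stream is exhausted
            return
        yield chunk


# Placeholder records standing in for the 768 instances (hypothetical values).
records = list(range(768))
windows = list(fixed_size_windows(records, 256))
print(len(windows), [len(w) for w in windows])
```

Because the generator reads the input lazily, only one chunk is held in memory at a time; the final chunk is shorter when the stream length is not a multiple of the chunk size.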

B. CLUSTERING

Cluster analysis is used in a wide range of applications, such as stock market analysis, data analysis, image processing and financial market analysis [14]. In data streams, clustering is an unsupervised task that is used to group objects and also to detect outliers efficiently. Data stream clustering approaches are of different types: distance based, grid based, partition based, hierarchical and so on.

C. OUTLIER DETECTION

Outlier detection over streaming data is an active research area in data stream mining; it aims to detect objects whose behaviour is different from, or exceptional compared with, that of normal objects [5]. An outlier is an item that is notably dissimilar to or inconsistent with the other data objects. Web logs, click streams, telecommunications, fraud detection and web documents are application areas of outlier detection in data streams. Outliers are also referred to by other names, such as noise, anomalies, unknown objects, or objects that do not fit with the related objects. Clustering-based outlier detection is an effective technique for this problem. In our research we have used the partitioning-cluster-based outlier detection algorithms CLARANS and E-CLARANS.
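
The cluster-based rule used in this work (an object that belongs to no cluster, or to a small cluster, is an outlier) can be illustrated with a short Python sketch. The function name `flag_small_cluster_outliers`, the use of `None` for unassigned objects, and the `min_cluster_size` cut-off are assumptions made for illustration; the paper does not state how small a cluster must be.

```python
from collections import Counter
from typing import List, Optional, Sequence


def flag_small_cluster_outliers(
    labels: Sequence[Optional[int]], min_cluster_size: int
) -> List[bool]:
    """Return True for objects with no cluster (None) or in an undersized cluster."""
    sizes = Counter(label for label in labels if label is not None)
    return [label is None or sizes[label] < min_cluster_size for label in labels]


# Hypothetical cluster labels for nine objects in one window.
labels = [0, 0, 0, 1, 0, None, 2, 2, 2]
flags = flag_small_cluster_outliers(labels, min_cluster_size=2)
print(flags)
```

Here the fourth object is alone in cluster 1 and the sixth has no cluster, so only those two are flagged.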

D. CLARANS

This method applies a partitioning clustering algorithm to data streams [9]. First, the data are split into chunks of the same size in different windows. Each database (s) is then treated as a set of data points (dp) with partition size s/p and a max neighbor value of k = 3. The minimum cost for each data point (dp) identifies the neighbor value, and the counters are initialised to i = 1 and j = 1. The distance for each data point is then calculated, and the maximum distance (n) is chosen for each data point. If a neighbor (s) has a lower cost, current is set to s; otherwise j is incremented by 1. When j > max neighbor, the cost of current is compared with the minimum cost; if it is less than the minimum cost, the minimum cost is set to the cost of current. Finally, the clusters are grouped so as to satisfy the condition threshold value ≤ min cost; the nodes are thereby clustered and the outliers are identified.
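
For orientation, the randomized search at the core of CLARANS can be sketched in Python as below. The sketch follows the general scheme of the algorithm published in [8] (a node is a set of k medoids, a neighbor differs in one medoid, and j is reset to 1 after each improvement); it is not the implementation evaluated in this paper. The names `clarans`, `total_cost`, `assign`, `num_local` and `seed`, the Euclidean distance, and the toy window are all assumptions. The thresholding step is left out because the text does not specify how the threshold is chosen.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def total_cost(points: Sequence[Point], medoids: Sequence[int]) -> float:
    """Sum of Euclidean distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def clarans(points: Sequence[Point], k: int, num_local: int = 2,
            max_neighbor: int = 3, seed: int = 0) -> Tuple[List[int], float]:
    """Return (medoid indices, cost) of the best local minimum found."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < len(points)")
    rng = random.Random(seed)
    best, min_cost = [], math.inf
    for _ in range(num_local):                        # i = 1 .. num_local
        current = rng.sample(range(len(points)), k)   # arbitrary starting node
        current_cost = total_cost(points, current)
        j = 1
        while j <= max_neighbor:
            # A random neighbor swaps one medoid for one non-medoid.
            candidates = [i for i in range(len(points)) if i not in current]
            neighbor = list(current)
            neighbor[rng.randrange(k)] = rng.choice(candidates)
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                j = 1                                 # restart the neighbor count
            else:
                j += 1
        if current_cost < min_cost:                   # keep the best local minimum
            best, min_cost = current, current_cost
    return best, min_cost


def assign(points: Sequence[Point], medoids: Sequence[int]) -> List[int]:
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points]


# Toy window with two well-separated groups (hypothetical data).
window = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = clarans(window, k=2, num_local=5, max_neighbor=20)
print(medoids, round(cost, 3), assign(window, medoids))
```

The default `max_neighbor=3` mirrors the value quoted above; the toy call uses a larger value because with so few trials the search can stop before it reaches a good set of medoids.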

E. E-CLARANS (Enhanced Clarans)

In E-CLARANS, the data are first split into chunks of the same size in different windows. Each database (S) is then treated as a set of data points (dp) with partition size s/p and a max neighbor value of k = 3. The minimum cost for each data point (dp) identifies the neighbor value, and the counters are initialised to i = 1 and j = 1. The distance for each data point is then calculated, and the maximum distance (n) is chosen for each data point. Current is set to an arbitrary node in n:k; for each data point, j is set to 1, a random neighbor (s) of current is selected, and the cost differential of the two nodes is calculated. If s has a lower cost, current is set to s; otherwise j is incremented by 1. When j > max neighbor, the cost of current is compared with the minimum cost; if it is less than the minimum cost, the minimum cost is set to the cost of current. Finally, the clusters are grouped so as to satisfy the condition threshold value ≤ min cost; the nodes are thereby clustered and the outliers are detected.

EXPERIMENTAL RESULTS

We have implemented these two partitioning clustering algorithms in MATLAB 7.10 (R2010a). To evaluate the performance of the algorithms, two measures, clustering accuracy and outlier accuracy, are used. Window sizes of 3 and 5 are used.
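
The text does not give formulas for the two measures. One plausible reading, shown here purely as an assumption, scores clustering by purity against the known class labels (each cluster is credited with its majority class) and scores outlier detection by the fraction of records whose outlier flag matches a reference flag. The function names and the sample values are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def clustering_accuracy(cluster_labels: Sequence[Hashable],
                        class_labels: Sequence[Hashable]) -> float:
    """Purity: share of records that belong to the majority class of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, klass in zip(cluster_labels, class_labels):
        by_cluster[cluster][klass] += 1
    correct = sum(max(counts.values()) for counts in by_cluster.values())
    return correct / len(class_labels)


def outlier_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Share of records whose predicted outlier flag equals the reference flag."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical labels for six records and flags for four records.
purity = clustering_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
agreement = outlier_accuracy([True, False, False, True], [True, False, True, True])
print(round(purity, 3), agreement)
```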

A. CLUSTERING ACCURACY

From Figure 2, it is observed that the proposed E-CLARANS clustering algorithm achieves better clustering accuracy than the CLARANS clustering algorithm.

B. OUTLIER ACCURACY

From Figure 3, it is observed that the proposed E-CLARANS clustering algorithm achieves better outlier accuracy than the CLARANS clustering algorithm.

CONCLUSION

Data streams are fast, unbounded arrivals of ordered and unordered data, and data stream clustering techniques make it possible to handle such data. Detecting outliers in data streams is a challenging research problem. In this paper, we have analysed the performance of the CLARANS and E-CLARANS clustering algorithms for detecting outliers. To find the better clustering algorithm for outlier detection, two performance measures are used. The experimental results show that outlier detection accuracy and clustering accuracy are higher for the proposed E-CLARANS than for CLARANS.

Tables at a glance

Table 1, Table 2

Figures at a glance

Figure 1, Figure 2, Figure 3

References

[1] C. Aggarwal, Ed., "Data Streams: Models and Algorithms", Springer, 2007.

[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams", in Proc. of VLDB, pp. 81-92, 2003.

[3] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Univ. of CA, Dept. of CIS, Irvine.

[4] G. S. David Sam Jayakumar and Bejoy John Thomas, "A New Procedure of Clustering Based on Multivariate Outlier Detection", Journal of Data Science 11 (2013).

[5] Hossein Moradi Koupaie, Suhaimi Ibrahim, Javad Hosseinkhani, "Outlier Detection in Stream Data by Clustering Method", International Journal of Advanced Computer Science and Information Technology (IJACSIT), Vol. 2, No. 3, pp. 25-34, 2013.

[6] J. Chandrika, K. R. Ananda Kumar, "Dynamic Clustering of High Speed Data Streams", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 2, No. 1, March 2012.

[7] Rajendra Pamula, Jatindra Kumar Deka, Sukumar Nandi, "An Outlier Detection Method Based on Clustering", Second International Conference on Emerging Applications of Information Technology, 2011.

[8] Raymond T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining", VLDB'94.

[9] S. Vijayarani, P. Jothi, "A New Approach for Detecting Outliers in Data Streams", International Journal of Engineering Sciences & Research Technology, ISSN: 2277-9655, pp. 3128-3133, November 2013.

[10] Shifei Ding, Fulin Wu, Jun Qian, Hongjie Jia, "Research on data stream clustering algorithms", Artificial Intelligence Review, Springer, 2013.

[11] Sudipto Guha, Adam Meyerson, Nina Mishra and Rajeev Motwani, "Clustering Data Streams: Theory and Practice", IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, pp. 515-528, May/June 2003.

[12] T. Soni Madhulatha, "Overview of streaming-data algorithms", Advanced Computing: An International Journal (ACIJ), Vol. 2, No. 6, November 2011.

[13] Yi-Hong Lu, Yan Huang, "Mining Data Streams Using Clustering", Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Vol. 4, pp. 18-21, 2005.

[14] Yogita, Durga Toshniwal, "Clustering Techniques for Streaming Data - A Survey", in Proc. of the IEEE, 2012.