
Hybrid Clustering and Classification for Entropy Reduction: A Review

Palwinder Kaur1, Usvir Kaur2, Dr. Dheerendra Singh3
  1. Student, Department of Computer Science and Engineering, S.G.G.S.W. University, Fatehgarh Sahib, Punjab, India.
  2. Assistant Professor, Department of Computer Science and Engineering, S.G.G.S.W. University, Fatehgarh Sahib, Punjab, India.
  3. Professor, Department of Computer Science and Engineering, Shaheed Udham Singh College of Engineering and Technology, Tangori, Mohali, Punjab, India.

Abstract

Clustering is an unsupervised learning problem. Better clustering improves the accuracy of search results and helps to reduce retrieval time. Clustering dispersion, known as entropy, is the disorder that occurs in retrieved search results; it can be reduced by combining a clustering algorithm with a classifier. Clustering with weighted k-means yields unlabelled data, which can be labelled using neural networks or support vector machines. A neural network is an interconnected group of nodes used for classifying data, whereas an SVM is a classification function that distinguishes between members of the two classes in the training data. We use neural networks and SVMs for classification because they can recognize patterns.

Keywords

Clustering, weighted k-means, neural network classifier, SVM classifier, entropy reduction system.

I. INTRODUCTION

Web mining is the process of retrieving data, according to user needs, from the bulk of data present on the web. It is important to the overall use of data mining by companies and for their internet/intranet-based applications and information access. Usage mining is valuable not only to businesses using online marketing, but also to e-businesses whose business is based solely on the traffic provided through search engines. This type of web mining helps to gather important information from customers visiting a site. In search results, however, there is often a lot of randomness and inaccuracy due to improper clustering and classification.
Clustering is an unsupervised learning problem. Clusters are formed on the basis of similar characteristics or features. Clustering is defined as the process of maximizing inter-cluster dissimilarity and minimizing intra-cluster dissimilarity. After clustering, a classification process is performed to determine labels for the data tuples that were unlabelled (had no class). Entropy is the disorder that occurs in retrieved search results; it arises from dispersion in clustering.

II. RELATED WORK

A. CLUSTERING - Unlabelled data from a large dataset can be grouped in an unsupervised manner using clustering algorithms. Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. A good clustering algorithm results in high intra-cluster similarity and low inter-cluster similarity [1].
B. TYPES OF CLUSTERING ALGORITHMS
Exclusive clustering - Data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. The k-means algorithm is of this type.
Overlapping clustering - Uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership (see the sketch after this list). Fuzzy c-means is of this type.
Hierarchical clustering - Based on repeatedly merging the two nearest clusters.
Probabilistic clustering - Uses a completely probabilistic approach; the mixture-of-Gaussians algorithm is of this type.
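To make the distinction between exclusive and overlapping clustering concrete, here is a tiny Python sketch (with illustrative values only, not data from any cited work) contrasting a hard, k-means-style assignment with a fuzzy c-means-style membership for one point lying between two cluster centers:

```python
import numpy as np

# One point between two cluster centers (illustrative values).
point = np.array([0.4])
centers = np.array([[0.0], [1.0]])

d = np.linalg.norm(point - centers, axis=1)   # distance to each center

# Exclusive (k-means-style): the point belongs to exactly one cluster.
hard = np.zeros(2)
hard[d.argmin()] = 1.0

# Overlapping (fuzzy c-means-style, fuzzifier m = 2): partial memberships
# proportional to 1/d^2, normalized to sum to 1.
fuzzy = (1.0 / d**2) / (1.0 / d**2).sum()

print(hard)   # [1. 0.]          -> member of cluster 0 only
print(fuzzy)  # ~[0.69, 0.31]    -> partial member of both clusters
```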

III. NEED FOR A HYBRID CLUSTERING AND CLASSIFICATION ALGORITHM

• Accuracy - search results must be accurate.
• Fast retrieval - data retrieval time must be short.
• Entropy reduction - dispersion in the retrieved data leads to unreliability.
• Outlier detection - detection of irrelevant or exceptional data.

IV. WEIGHTED K-MEANS FOR CLUSTERING

In text clustering, clusters of documents on different topics are characterized by different subsets of terms or keywords. Clustering high-dimensional data also faces the data sparsity problem. The weighted k-means algorithm extends the k-means clustering process to calculate a weight for each dimension in each cluster and uses the weight values to identify the subsets of important dimensions that characterize different clusters [2].
It includes three steps (a code sketch follows the list):
A. Partitioning the objects - After initializing the dimension weights of each cluster and the cluster centers, a cluster membership is assigned to each object [2]. The Minkowski distance is used instead of the Euclidean distance because it can be used for higher-dimensional data [2]:
$$d_p(x_i, x_j) = \left( \sum_{l=1}^{d} \lvert x_{il} - x_{jl} \rvert^{p} \right)^{1/p} \qquad (1)$$
where d is the dimensionality of the data, p = 1 gives the Manhattan metric, and x_i, x_j are the two points (an object and a cluster center) whose distance is being measured.
B. Updating cluster centers - Cluster centers are updated by computing the means of the objects in the same cluster [2].
C. Calculating dimension weights - The whole data set is analyzed to update the dimension weights [2].
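The following minimal Python sketch puts the three steps together. It illustrates the weighted k-means idea described above, not the exact algorithm of [2]; the function name weighted_kmeans, the weight-update exponent beta, and all parameter defaults are assumptions made for the example.

```python
import numpy as np

def weighted_kmeans(X, k, p=2, beta=2.0, n_iter=50, seed=0):
    """Sketch of weighted k-means: each cluster keeps one weight per
    dimension, and objects are assigned by weighted Minkowski distance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    weights = np.full((k, d), 1.0 / d)  # start from uniform dimension weights

    for _ in range(n_iter):
        # Step A: partition the objects. Assign each object to the center
        # with the smallest weighted Minkowski distance (p=1 -> Manhattan).
        dist = np.stack([
            (weights[j] * np.abs(X - centers[j]) ** p).sum(axis=1)
            for j in range(k)
        ])
        labels = dist.argmin(axis=0)

        # Step B: update cluster centers as the mean of each cluster's objects.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

        # Step C: recompute dimension weights from within-cluster dispersion;
        # dimensions with small dispersion in a cluster get larger weights.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            disp = (np.abs(members - centers[j]) ** p).sum(axis=0) + 1e-12
            w = disp ** (-1.0 / (beta - 1.0))
            weights[j] = w / w.sum()

    return labels, centers, weights
```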

V. NEURAL NETWORK AS CLASSIFIER

A neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. It is used for pattern recognition. Each circular node represents an artificial neuron, and an arrow represents a connection from the output of one neuron to the input of another. Neural networks are of two types:
• Artificial method.
• Feed-forward method.
A. Advantages of using neural networks (a usage sketch follows this list):
• High tolerance of noisy data [3].
• Can classify data on which the network has not been trained.
• A classifier that can reduce entropy effectively.
• Takes the output of the clustering algorithm as its input.
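As an illustration of how the clustering stage can feed a neural-network classifier, here is a minimal sketch assuming scikit-learn's MLPClassifier as the feed-forward network; the random feature vectors and the reuse of the hypothetical weighted_kmeans function from the earlier sketch are stand-ins, not the authors' setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in feature vectors (in practice, e.g., document-term weights).
X_train = np.random.rand(200, 10)

# Labels produced by the clustering stage (weighted_kmeans sketch above).
labels, _, _ = weighted_kmeans(X_train, k=3)

# The feed-forward network learns to reproduce the cluster labels, so it
# can later label data the clustering stage never saw.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_train, labels)

X_new = np.random.rand(5, 10)   # unseen data
print(clf.predict(X_new))       # predicted labels for the new data
```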

VI. SVM AS CLASSIFIER

SVM stands for support vector machine. It is very useful in information retrieval (IR) and target recognition [11]. It offers some of the most robust and accurate results and is insensitive to the number of dimensions. An SVM is a classification function that distinguishes between members of the two classes in the training data [9].
A. Linear SVM - For a linearly separable dataset, a linear classification function is used that corresponds to a separating hyperplane f(x) passing through the middle of the two classes, separating them. Once this function is determined, a new data instance x_n can be classified simply by testing the sign of the function f(x_n). To ensure that the maximum-margin hyperplane is actually found, an SVM classifier attempts to maximize the following function [9] with respect to w and b:
$$L_p = \frac{1}{2}\lVert \vec{w} \rVert^{2} - \sum_{i=1}^{t} \alpha_i\, y_i (\vec{w} \cdot \vec{x}_i + b) + \sum_{i=1}^{t} \alpha_i \qquad (2)$$
where t is the number of training examples, the α_i (i = 1, ..., t) are non-negative Lagrange multipliers, L_p is called the Lagrangian, and the vector w and constant b define the hyperplane [9].
B. Non-linear SVM
Here a kernel function is added, which makes the SVM more flexible [10]. A brief sketch of both variants follows.
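The following short sketch shows both variants, assuming scikit-learn's SVC; the synthetic two-class dataset is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class training data (stand-in for a real dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)  # separating hyperplane f(x)
rbf_svm = SVC(kernel="rbf").fit(X, y)        # kernel added: more flexible

# A new instance x_n is classified by the sign of f(x_n); predict()
# applies this test internally.
print(linear_svm.predict(X[:3]))
print(rbf_svm.predict(X[:3]))
```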
C. Advantages of SVM
1) Can provide good generalization [10].
2) Can greatly reduce the entropy of retrieved results [10].

VII. ENTROPY FACTOR OF CLUSTERING

Entropy is a very important concept in information theory (IT) and can be used to reflect the uncertainty of a system. According to Shannon's theory [4], information is the elimination or reduction of uncertainty in people's understanding of things [4]. He calls the degree of uncertainty entropy.
Suppose a discrete random variable X takes x1, x2, ..., xn, a total of n different values, and the probability that xi appears in the sample is P(xi). Then the entropy of the random variable X is [4]:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
The (normalized) entropy value ranges between 0 and 1. If H(P) = 0 (or close to 0), it indicates a lower level of uncertainty and a higher similarity in the sample. On the other hand, if H(P) = 1, it indicates a higher level of uncertainty and a lower similarity in the sample. For instance, in a real network environment, the data packets of a particular type of network attack show certain characteristics. In a DoS attack, the packets sent over a period of time are much more similar to one another than normal network packets are, which shows smaller entropy, that is, lower randomness. Another example is a network probing attack, which frequently scans a specific port over a certain period of time, so the destination ports show smaller entropy compared with the random port selection of normal packets.

As an effective measure of uncertainty, entropy, proposed by Shannon [5], has been a useful mechanism for characterizing information content in many diverse fields and applications. In order to measure the uncertainty in rough sets, many researchers have applied entropy to rough sets and proposed different entropy models. Rough entropy is an extended entropy that measures the uncertainty in rough sets. Given an information system IS = (U, A, V, f), where U is a non-empty finite set of objects and A is a non-empty finite set of attributes, for any B ⊆ A let IND(B) be the equivalence relation with partition U/IND(B) = {B1, B2, ..., Bm}. The rough entropy E(B) of the equivalence relation IND(B) is defined by [5]:
$$E(B) = -\sum_{i=1}^{m} \frac{\lvert B_i \rvert}{\lvert U \rvert} \log_2 \frac{1}{\lvert B_i \rvert} \qquad (3)$$
where |B_i|/|U| denotes the probability of any element x ∈ U being in the equivalence class B_i, 1 ≤ i ≤ m, and |M| denotes the cardinality of a set M.
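To make the entropy measure concrete, the following short Python sketch computes H(X) for a discrete distribution; the probabilities are illustrative, not data from this paper.

```python
import numpy as np

def shannon_entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i), over the nonzero probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# Uniform over two values: maximum uncertainty, H = 1 bit.
print(shannon_entropy([0.5, 0.5]))    # 1.0

# Skewed distribution (e.g., DoS traffic dominated by one packet pattern):
# lower entropy, i.e., lower randomness.
print(shannon_entropy([0.95, 0.05]))  # ~0.286
```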

VIII. CONCLUSION

In this paper, we have reviewed the weighted k-means clustering algorithm, as it is suitable for high-dimensional data and supports efficient outlier detection. To label the unlabelled data, we have presented classification by neural networks, because neural networks handle noisy data effectively and can also work on data they were not trained on. Using this hybrid technique, the entropy of the retrieved data can be reduced, retrieval time shortened, and accuracy greatly improved.


References