Keywords 
Clustering, Weighted k-means, Neural network classifier, SVM classifier, Entropy reduction system.
I. INTRODUCTION 
Web mining is the process of retrieving data relevant to a user's need from the vast amount of data present on the web. It
is important to the overall use of data mining by companies and their internet/intranet based applications and information
access. Usage mining is valuable not only to businesses using online marketing, but also to e-businesses whose business is
based solely on the traffic provided through search engines. This type of web mining helps to gather important
information from customers visiting the site. In the search results, however, there is often a lot of randomness and
inaccuracy due to improper clustering and classification.
Clustering is an unsupervised learning problem. Clusters are formed on the basis of similar characteristics or similar
features. Clustering can be defined as the process of maximizing inter-cluster dissimilarity while minimizing intra-cluster
dissimilarity. After clustering, a classification process is performed to determine labels for the data tuples that
were unlabelled (had no class). Entropy is the disorder that appears in retrieved search results; it arises from
dispersion in clustering.
II. RELATED WORK 
A. CLUSTERING
The unlabelled data from a large dataset can be grouped in an unsupervised manner using clustering
algorithms. Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters) so that
observations in the same cluster are similar in some sense. A good clustering algorithm results in high intra-cluster similarity
and low inter-cluster similarity [1].
B. TYPES OF CLUSTERING ALGORITHMS 
Exclusive clustering: Data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster
it cannot be included in another cluster. The k-means algorithm is an example.
Overlapping clustering: Overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to two or
more clusters with different degrees of membership. Fuzzy c-means is an example.
Hierarchical clustering: This is based on the union between the two nearest clusters.
Probabilistic clustering: This kind of clustering uses a completely probabilistic approach. The mixture-of-Gaussians
algorithm is an example.
III. NEED FOR HYBRID CLUSTERING AND CLASSIFICATION ALGO 
Accuracy: Search results must be accurate.
Fast retrieval: Data retrieval time must be short.
Entropy reduction: Dispersion in the retrieved data leads to unreliability.
Outlier detection: Irrelevant or exceptional data must be detected.
IV. WEIGHTED K-MEANS FOR CLUSTERING
In text clustering, clusters of documents on different topics are characterized by different subsets of terms or keywords. A data
sparsity problem arises when clustering high-dimensional data. In this algorithm, we extend the k-means clustering process
to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important
dimensions that characterize different clusters [2].
It includes three steps:
A. Partitioning the objects: After initialization of the dimension weights of each cluster and the cluster centers, a cluster
membership is assigned to each object [2]. The Minkowski distance measure is used instead of the Euclidean distance because it
can be applied to higher-dimensional data [2]:

d(xi, xj) = ( Σ_{k=1}^{d} |xik − xjk|^p )^{1/p}    (1)

where d = dimensionality of the data,
p = 1 gives the Manhattan metric,
xi, xj = the two points (e.g., an object and a cluster center) whose distance is measured.
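As a quick illustration of the distance measure in Eq. (1), the sketch below (not from the paper; the sample points are invented) computes the Minkowski distance for an arbitrary p:

```python
def minkowski_distance(x, y, p=1):
    """Minkowski distance between two d-dimensional points.

    p=1 gives the Manhattan metric, p=2 the Euclidean metric.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# p=1 (Manhattan) sums absolute coordinate differences: 3 + 4 + 0 = 7.0
print(minkowski_distance([1, 2, 3], [4, 6, 3], p=1))
# p=2 recovers the familiar Euclidean distance: sqrt(9 + 16) = 5.0
print(minkowski_distance([0, 0], [3, 4], p=2))
```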
B. Updating cluster centers: To update the cluster centers, find the means of the objects in the same cluster [2].
C. Calculating dimension weights: The whole data set is analyzed to update the dimension weights [2].
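The three steps above can be sketched roughly as follows. This is a simplified illustration of the entropy-weighting idea of [2], not the paper's algorithm: the gamma parameter, the softmax-style weight update, the toy data, and the naive first-k-points initialization are all assumptions made for the example.

```python
import math

def weighted_kmeans(X, k, gamma=1.0, iters=10):
    """Sketch of weighted k-means: each cluster carries a per-dimension
    weight vector, and weights shift toward dimensions with low
    within-cluster dispersion (assumed softmax-style update)."""
    d = len(X[0])
    centers = [list(X[i]) for i in range(k)]      # naive init: first k points
    weights = [[1.0 / d] * d for _ in range(k)]
    labels = [0] * len(X)
    for _ in range(iters):
        # Step A: assign each object to the nearest center under the
        # cluster's weighted squared distance.
        for i, x in enumerate(X):
            labels[i] = min(range(k), key=lambda l: sum(
                weights[l][j] * (x[j] - centers[l][j]) ** 2 for j in range(d)))
        for l in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == l]
            if not members:
                continue
            # Step B: move the center to the mean of its members.
            centers[l] = [sum(m[j] for m in members) / len(members)
                          for j in range(d)]
            # Step C: recompute weights from per-dimension dispersion D_lj
            # (shifted by min(D) before exp() for numerical stability).
            D = [sum((m[j] - centers[l][j]) ** 2 for m in members)
                 for j in range(d)]
            mn = min(D)
            e = [math.exp(-(Dj - mn) / gamma) for Dj in D]
            s = sum(e)
            weights[l] = [ej / s for ej in e]
    return labels, centers, weights

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers, weights = weighted_kmeans(X, k=2)
print(labels)  # the two spatial groups end up in different clusters
```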
V. NEURAL NETWORK AS CLASSIFIER
A neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. It is used for pattern
recognition. Each circular node represents an artificial neuron, and an arrow represents a connection from the output of
one neuron to the input of another. Neural networks are of two types:
Artificial method.
Feed-forward method.
A. Advantages of using neural networks
High tolerance of noisy data [3].
Can classify data on which it has not been trained.
A classifier that can reduce entropy effectively.
Takes the output of the clustering algorithm as its input.
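As a minimal, hypothetical illustration of the artificial neuron that feed-forward networks are built from (not the classifier of [3]), the sketch below trains a single sigmoid neuron by gradient descent on an invented two-class toy set:

```python
import math

def train_neuron(data, labels, lr=0.5, epochs=200):
    """Train one sigmoid neuron (weights + bias) by gradient descent
    on the log-loss; labels are 0 or 1."""
    w = [0.0] * len(data[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
            err = p - y                      # gradient of the log-loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err

    return w, b

def predict(w, b, x):
    """Classify by the sign of the pre-activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Invented, linearly separable toy data: class 1 has larger feature values.
data = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
labels = [0, 0, 1, 1]
w, b = train_neuron(data, labels)
print([predict(w, b, x) for x in data])  # expected: [0, 0, 1, 1]
```

A real feed-forward classifier stacks many such neurons in layers and trains them jointly by backpropagation; the single neuron is only the building block.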
VI. SVM AS CLASSIFIER
SVM stands for support vector machine. It is very useful in information retrieval (IR) and target recognition [11]. It offers some
of the most robust and accurate results and is insensitive to the number of dimensions. SVM learns a classification function to
distinguish between members of the two classes in the training data [9].
A. Linear SVM: For a linearly separable dataset, a linear classification function is used that corresponds to a separating
hyperplane f(x) that passes through the middle of the two classes, separating them. Once this function is determined, a new
data instance xn can be classified simply by testing the sign of f(xn). To ensure that the maximum-margin hyperplane is
actually found, a support vector machine classifier minimizes the following function [9] with respect to w and b:

Lp = (1/2) ||w||^2 − Σ_{i=1}^{t} αi [ yi (w · xi + b) − 1 ]    (2)

where t is the number of training examples, αi, i = 1, . . . , t, are non-negative Lagrange multipliers, yi ∈ {+1, −1} are the
class labels, Lp is called the Lagrangian, and the vector w and constant b define the hyperplane [9].
B. Non-linear SVM
In non-linear SVM a kernel function is added, which makes the SVM more flexible [10].
C. Advantages of SVM 
1) Can provide good generalization [10].
2) SVM can greatly reduce the entropy of retrieved results [10].
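A rough sketch of a linear SVM follows, assuming batch sub-gradient descent on the regularized hinge loss rather than the Lagrangian QP formulation described above; the toy data, lambda, and learning-rate settings are invented for the example.

```python
def linear_svm(data, labels, lam=0.001, lr=0.1, epochs=500):
    """Linear SVM f(x) = w.x + b trained by batch sub-gradient descent
    on the regularized hinge loss; labels must be +1 or -1."""
    n, d = len(data), len(data[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Gradient of the regularizer (lam/2) * ||w||^2.
        gw, gb = [lam * wj for wj in w], 0.0
        for x, y in zip(data, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:        # point violates the margin: hinge active
                gw = [gwj - y * xj / n for gwj, xj in zip(gw, x)]
                gb -= y / n
        w = [wj - lr * gwj for wj, gwj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Invented, linearly separable toy data.
data = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]]
labels = [-1, -1, 1, 1]
w, b = linear_svm(data, labels)
# Classify by the sign of f(x) = w.x + b.
print([1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1
       for x in data])  # expected: [-1, -1, 1, 1]
```

Solving the Lagrangian dual of Eq. (2) exactly is a quadratic program; the sub-gradient approach above trades exactness for simplicity.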
VII. ENTROPY FACTOR OF CLUSTERING
Entropy is a very important concept in information theory (IT), which can be used to reflect the uncertainty of systems.
According to Shannon's theory [4], information is what eliminates or reduces people's uncertainty about things.
He calls the degree of uncertainty entropy.
Suppose a discrete random variable X takes x1, x2, ..., xn, a total of n different values, and the probability that xi appears
in the sample is P(xi); then the entropy of the random [4] variable X is:

H(X) = − Σ_{i=1}^{n} P(xi) log2 P(xi)
When normalized, the entropy value ranges between 0 and 1. If H(P) = 0 (or close to 0), it indicates a low level of uncertainty and
high similarity in the sample. On the other hand, if H(P) = 1, it indicates a high level of uncertainty and low
similarity in the sample. For instance, in a real network environment, for a particular type of network attack the data
packets show a certain kind of characteristics. In a DoS attack, for example, the data packets sent in a period of time are
much more similar to each other than normal network packets are, which yields a smaller entropy, that is, lower randomness.
Another example is a network probing attack, which frequently scans a specific port in a certain period of time, so the
destination ports will show a smaller entropy compared with the random port selection of normal packets.
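The port-entropy contrast described above can be reproduced with a small Shannon-entropy estimator (a sketch; the port lists are invented examples):

```python
import math

def shannon_entropy(samples):
    """Estimate H(X) = sum_i -P(xi) * log2 P(xi) from observed frequencies."""
    n = len(samples)
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# A DoS-like flood hits one destination port: low entropy (low randomness).
dos_ports = [80] * 8
# Normal traffic spreads over many ports: higher entropy.
normal_ports = [80, 443, 22, 8080, 53, 25, 110, 3306]
print(shannon_entropy(dos_ports))     # 0.0 bits
print(shannon_entropy(normal_ports))  # 3.0 bits (8 equally likely ports)
```

Dividing by log2(n) would normalize the result into the [0, 1] range used in the discussion above.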
As an effective measure of uncertainty, entropy, proposed by Shannon [5], has been a useful mechanism for
characterizing information content in various modes and applications in many diverse fields. In order to measure the
uncertainty in rough sets, many researchers have applied entropy to rough sets and proposed different entropy models
for rough sets. Rough entropy is an extended entropy that measures the uncertainty in rough sets. Given an information system IS
= (U, A, V, f), where U is a nonempty finite set of objects and A is a nonempty finite set of attributes, for any B ⊆ A let
IND(B) be the equivalence relation with partition U/IND(B) = {B1, B2, ..., Bm}. The rough entropy E(B) of the equivalence
relation IND(B) is defined by [5]:
E(B) = − Σ_{i=1}^{m} (|Bi| / |U|) log2 (1 / |Bi|)    (3)

where |Bi| / |U| denotes the probability of any element x ∈ U being in equivalence class Bi, 1 ≤ i ≤ m, and |M| denotes the
cardinality of a set M.
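A rough entropy of the form E(B) = − Σ (|Bi|/|U|) log2(1/|Bi|), the form consistent with the where-clause above (assumed here), can be computed as in this sketch; the universe and equivalence classes are invented examples:

```python
import math

def rough_entropy(U, partition):
    """Rough entropy of the partition U/IND(B) = {B1, ..., Bm},
    assuming E(B) = -sum_i (|Bi|/|U|) * log2(1/|Bi|)."""
    n = len(U)
    return sum(-(len(Bi) / n) * math.log2(1.0 / len(Bi)) for Bi in partition)

U = list(range(8))
coarse = [[0, 1, 2, 3], [4, 5, 6, 7]]   # two equivalence classes of size 4
finest = [[x] for x in U]               # every object is discernible
print(rough_entropy(U, coarse))  # 2.0 : coarse knowledge, high uncertainty
print(rough_entropy(U, finest))  # 0.0 : finest partition, no roughness
```

Note the behavior matches the intuition above: the coarser the equivalence classes, the larger the rough entropy.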
VIII. CONCLUSION
In this paper, we have presented the weighted k-means clustering algorithm, as it is suitable for high-dimensional data and also
performs outlier detection efficiently. In order to label the unlabelled data, we have presented classification by neural networks,
because neural networks can work effectively with noisy data and can also classify data on which they were not trained. Using this
hybrid technique, the entropy of the retrieved data can be reduced, and retrieval time and accuracy can be greatly enhanced.
Tables at a glance
Table 1

Figures at a glance
Figure 1
Figure 2
Figure 3
References 
[1] Vipin Kumar, Himadri Chauhan and Dhiraj Panwar, "K-Means Clustering Approach to Analyse NSL-KDD Intrusion Detection Dataset", Vol. 3, Issue 4, Sept 2013.
[2] Liping Jing, Michael K. Ng and Joshua Zhexue Huang, "An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 8, August 2007.
[3] Son Lam Phung and Abdesselam Bouzerdoum, "A Pyramidal Neural Network for Visual Pattern Recognition", IEEE Transactions on Neural Networks, Vol. 18, No. 2, March 2007.
[4] Quan Qian, Tianhong Wang and Rui Zhan, "Relative Network Entropy Based Clustering Algorithm for Intrusion Detection", Vol. 15, No. 1, pp. 16-22, Jan 2013.
[5] Xiangjun Li and Fen Rao, "A Rough Entropy Based Approach to Outlier Detection", Journal of Computational Information Systems, Vol. 8, pp. 10501-10508, 2012.
[6] J. Y. Liang and Z. Z. Shi, "The Information Entropy, Rough Entropy, Knowledge Granulation in Rough Set Theory", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12 (1), pp. 37-46, 2004.
[7] Velmurugan T. and Santhanam T., "Performance Evaluation of K-Means and Fuzzy C-Means Clustering Algorithms for Statistical Distributions of Input Data Points", European Journal of Scientific Research, Vol. 46, No. 3, pp. 320-330, 2010.
[8] Z. Deng, K. Choi, F. Chung, and S. Wang, "Enhanced Soft Subspace Clustering Integrating Within-Cluster and Between-Cluster Information", Pattern Recognition, Vol. 43, No. 3, pp. 767-781, 2010.
[9] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, Vol. 14, pp. 1-37, 2008.
[10] Laura Auria and Rouslan A. Moro, "Support Vector Machines (SVM) as a Technique for Solvency Analysis", 2008.
[11] Feng Wenge, "Application of SVM Classifier in IR Target Recognition", Physics Procedia, Vol. 24, pp. 2138-2142, 2012.
