ISSN ONLINE(2320-9801) PRINT (2320-9798)

All submissions of the EM system will be redirected to Online Manuscript Submission System. Authors are requested to submit articles directly to Online Manuscript Submission System of respective journal.

Clustering Of Web Usage Data Using Chameleon Algorithm

T.Vijaya Kumar1, Dr. H.S.Guruprasad2
  1. Research Scholar, Department of IS & E, BMSCE, Bangalore, Karnataka, India.
  2. Professor and Head, Department of CS & E, BMSCE, Bangalore, Karnataka, India.
Related article at Pubmed, Scholar Google

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

Clustering is a discovery process in data mining which groups set of data items, in such a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. These discovered clusters depict the characteristics of the underlying data distribution. Clustering is useful in characterizing customer groups based on purchasing patterns, categorizing web documents that have similar functionality. In this work, graphbased clustering is proposed to form clusters based on web usage patterns. First sessions are constructed using time oriented approach. Based on the constructed sessions and page requests, adjacency matrix is created. Then data points are generated using adjacency matrix as input. Chameleon clustering algorithm takes data points as input and forms clusters. Chameleon uses a two phase approach to find the clusters. In the first phase, it uses a graph partitioning algorithm to cluster the data items into several relatively small sub-clusters. In the second phase, it uses an algorithm to form genuine clusters by repeatedly combining these sub clusters. Then these clusters are plotted on a plane using MATLAB where different clusters are distinguished by distinct colours and distinct symbols. In this paper, the server log files of the Website www.enggresources.com is considered for overall study and analysis.

Keywords

Web usage mining, K Nearest Neighbor graph, Chameleon algorithm.

INTRODUCTION

Clustering is a process of grouping a given set of unlabelled patterns into a number of clusters such that similar patterns are assigned to one cluster. Each pattern can be represented by a vector having many parameters. Most of traditional clustering algorithms use either interconnectivity or closeness to merge two clusters. The key feature of Chameleon algorithm is it considers both interconnectivity and closeness in identifying the most similar pair of clusters. Chameleon also uses a novel approach to model the degree of interconnectivity and closeness between each pair of clusters. The complete detail of Chameleon hierarchical clustering algorithm is given in [1]. Two clusters are merged only if the inter-connectivity and closeness between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters. Chameleon is a clustering algorithm based on K Nearest Neighbour (KNN) graph. In KNN graph, two data points far away from each other may reside in two disconnected sub-graphs and even if two points are in the same sub-graph, they may be separated in the following partition sets. Chameleon first constructs a KNN graph of the data set, and applies hMETIS to partition the graph in to clusters. It then defines the relative interconnectivity between two clusters and relative closeness within a cluster, and repeatedly merges clusters until the specified number of clusters is achieved. Chameleon uses a two phase approach to find the clusters. In the first phase, it uses a graph partitioning algorithm to cluster the data items into several relatively small sub-clusters. In the second phase, it uses an algorithm to form genuine clusters by repeatedly combining these sub clusters. The rest of the paper is organized as follows. Section 2 gives a brief description about the related work. Section 3 depicts the proposed model. The experimental results of the proposed approach are given in section 4 and section 5 concludes the discussion.

RELATED WORK

Web usage mining helps in modelling access behaviour of web users by extracting useful information from web server log files. Web usage mining refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interaction with web resources on one or more web sites. Several researchers are working on Web Usage mining and have contributed various methodologies, tools for Web Usage mining. Models based on association rules, clustering algorithms, sequential analysis and Markov models have been used for discovering the knowledge from Web usage data. All these models are predominantly based on usage information from Web usage data alone. Significant improvement can be achieved by making use of domain knowledge, which is usually available from domain experts, content providers, and Web designers. Cooley et al. in [2, 3], covered Web usage mining process & various steps involved in it. It serves as the primary thesis to understand fundamentals of Web usage mining. Norwati Mustapha et al. [4], have proposed a model for mining user’s navigation pattern based on Expectation modelling algorithm and used it for finding maximum likelihood estimates of parameters in probabilistic models. In [5], Rajhans Mishra et al. have adopted the similarity upper approximation based clustering of web logs using various similarity metrics. In [6], K.Santhisree et al. have presented a technique to cluster web transactions based on the set similarity measures from web log data which identifies the behaviour of the user’s page visits and order of occurrence of visits. They have formed the Web data Clusters using the Similarity Upper Approximations. In [7], Esin Saka et al. have proposed a hybrid approach which combines the strengths of Spherical KMeans algorithm for fast clustering of high dimensional datasets in the original feature domain and the flock-based algorithm which iteratively adjusts the position and speed of dynamic flock’s agents on a visualization plane. The hybrid algorithm decreases the complexity of FClust from quadratic to linear with further improvements in the cluster quality. In [8], S Park et al. have investigated the use of fuzzy ART neural network, to enhance the performance of the K-Means algorithm. A major limitation with fuzzy ART is the unrestricted growth of clusters. Fuzzy ART networks sometimes produce numerous clusters each with only a small number of members. The authors have used fuzzy ART as an initial seed generator and K-Means as the final clustering algorithm. In [9], Santosh K et al. have developed a clustering algorithm that groups users according to their Web access patterns. The algorithm is based on the ART1 version of adaptive resonance theory. ART1 offers an unsupervised clustering approach that adapts to the changes in user’s access patterns over time without losing earlier information. It applies specifically to binary vectors. They have compared their algorithm’s performance with the traditional K-Means clustering algorithm and showed that the ART1- based technique performed better in terms of intra cluster distances. In [10], Antonia S et al. have developed a variant of the classical SOM called Growing Hierarchical SOM (GHSOM). They have suggested a new visualization technique for the patterns in the hierarchical structure. In [11], Kate A. Smith et al. have developed LOGSOM, a system that utilizes Kohonen’s Self Organizing Map to organize Web page in to two dimensional maps. The organization of the Web pages is based on the user’s navigation behaviour rather than the content of the Web pages. In [12], T.Vijaya Kumar et al. have proposed a framework for finding useful information from Web Usage Data that uses Self Organizing Maps (SOM). Sessions are constructed using the concept hierarchy and the link information. Then they have used SOM to form the cluster. In [13], K.Santhisree et al. have presented a technique to cluster web transactions based using rough set DBSCAN clustering algorithm. They have considered datasets of msnbc.com and found user access patterns, the order of visits of the hyperlinks of each user and inter cluster similarity among the clusters. In [14], Pradeep Kumar et al. have presented a new indiscernibility-based rough agglomerative hierarchical clustering algorithm for sequential data. In their approach, similarity upper approximation is used to form initial clusters and the concept of constrained similarity approximation is used to form subsequent clusters. They have compared their approach with the traditional hierarchical clustering algorithm on msnbc.com web navigation data set. In [15], Mehrbad Jalali et al. have proposed a novel based approach on Longest Common Subsequence algorithm for classifying user navigation patterns for predicting user’s future requests. In [16], J. Vellingiri et al. have proposed a novel clustering algorithm called Fuzzy Possibilistic C-Means algorithm, which integrates extended partition entropy and inter class resemblance to predict the user behaviour. In [17], K. Suresh et al. have used improved Fuzzy C-Means (FCM) algorithm to cluster the Web Usage data from msnbc.com. In [18], Subhash K. Shinde et al. have developed architecture for online recommendation in Web Usage Mining System. They have proposed an architecture of Online Recommendation in Web Usage Mining (OLRWMS) for enhancing accuracy of classification by interaction between classifications, evaluation, current user activities and user profiles. In [19], Supriya Kumar De et al. have used soft computing technique such as rough approximation based clustering to cluster web transactions In [20], Costantinos Dimopoulos et al. have considered the problem of web page usage prediction in a web site by modelling user’s navigation history and web page content with weighted suffix trees. Each navigation session is described by a weighted suffix tree. Then the similarity matrix is constructed which represents the similarity of each pair (i, j) of sequences. Then the clusters are formed by utilizing the content of the web pages in order to enrich the quality of the clustering. The weighted suffix tree is used to capture the most important navigational experience which works like a prediction model without explicitly storing strings and probabilities of occurrences. They have performed evaluation of the proposed scheme with experiments on various web site log files and web pages. In [21], Chen Wu et al. have presented Semantic Web Log Model (SWLM) to improve the efficiency and accuracy of web log mining. In [22], H. Hannan Inbarani et al. have proposed an approach for feature selection based on rough set theory for Web Usage mining. In [23], Sebastian A Rios et al. have used concept based approach to add semantics in to the mining process. They have applied the proposed solution to a real web site to produce offline enhancements of contents and structure and compared their approach with four different Web Usage Mining methods. In [24], Yunjuan Xie et al. have presented a novel approach to clustering Web site users into different groups and generating common user profiles using the concept of mass distribution in Dempster-Shafer’s theory. In [25], Jinfeng Li et al. have proposed a two stage clustering algorithm, named Chameleon Based on Clustering Feature Tree (CBCFT), which hybridizes the Clustering tree algorithm BIRCH with Chameleon algorithm for customer segmentation.

PROPOSED ALGORITHM

The main goal of the proposed system is to find web user clusters from web server log files. The proposed Web Usage Mining System is shown in Fig 1. The WUM system in the proposed approach is partitioned into two modules. In the first module User identification and Session constructions are considered for Data preprocessing phase. In the Second phase Chameleon clustering algorithm is used to form the clusters.
Web log data is usually diverse and voluminous. This raw log data must be assembled into a consistent, integrated and comprehensive form, in order to be used for pattern discovery. As in most data mining applications, data preprocessing involves removing and filtering redundant and irrelevant data, removing noise, transforming and resolving any inconsistencies, identifying unique users and finally constructing user sessions. There are several ways to identify individual users. The most obvious solution is to assume that each combination of IP address and browser identifies a single user. Then sessions are constructed from the individual user’s data. The session is termed as “user activity over a span of time on web site”. Here sequence of web page requests made by single users on particular web domain for a specific time period is identified. This is termed as session construction. Pattern discovery is the next stage of web usage mining. In this phase, frequent access patterns are determined from reconstructed sessions. In our work, this phase is divided into 2 phases namely constructing adjacency matrix and identifying clusters. The adjacency matrix is constructed from the sessions considering sessions as rows and pages as columns. Using the adjacency matrix the data points are generated which is considered as input for the clustering phase. There are several types of clustering and
and different density. Under graph-based clustering Chameleon algorithm is used to merge the clusters. CHAMELEON is an agglomerative hierarchical clustering where the objects are considered as nodes and proximity between nodes are considered as weights. The m by m proximity matrix for m data points can be represented as a dense graph in which each node is connected to all others and the weight of the edge between any pair of nodes reflects their pairwise proximity. For most data sets, objects are highly similar to a small number of objects and weakly similar to most other objects. This property is used to sparsify the proximity matrix by setting many of the low similarity values to 0 before starting the actual clustering process. This sparcification can be performed by removing all links that have a similarity below a specified threshold and by keeping only links to the k nearest neighbours of the point. This approach creates a K Nearest Neighbour graph. Most of the Agglomerative clustering techniques operate by merging the two most similar clusters based on the cluster similarity. The cluster similarity is computed either by using relative closeness or by relative interconnectivity. Chameleon operates by merging the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness [26]. Relative interconnectivity is the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Two clusters are combined if the points in the resulting cluster are almost as strongly connected as points in each of the original clusters. Relative closeness is the absolute closeness of two clusters normalized by the internal closeness of the clusters. Two clusters are combined only if the points in the resulting cluster are almost as close as in each of the original clusters. Chameleon is an agglomerative clustering algorithm which partitions the data using efficient graph partitioning algorithm and combines it with a novel hierarchical clustering algorithm that uses the notion of closeness and interconnectivity together with the local modelling of clusters to form the final clusters. CHAMELEON algorithm consists of two phases, namely Partitioning and Merging. In partition phase, it uses multilevel partition algorithm hMETIS which is built in multilevel partitioning algorithm for hyper graph partition. hMETIS enhances the speed and efficiency of algorithm. In the Merging phase, the sub clusters are merged based on the similarity value which depends on relative interconnectivity and relative closeness. Fig 2 provides an overview of the overall approach used by Chameleon algorithm to obtain the clusters.

PSEUDO CODE

An outline of the proposed system is summarized below.

Step1: Indexing the web page

Input: User identified log file
Output: indexed file
For Each User U
Add request page to array A
End for
For each request in array A
Remove redundancy in array
End for
For each request in array A
Add request page and index value to Hash table
Write request page and index value to IPindex file
End for
Step2. Session construction
Input: User identified log file & indexed file
Output: Session file
For each User U based on Distinct (ip+browser)
For each request r of User U
If (time_of_current_req – time_of_first_req < 30 mins)
Add this request to current session
Else
Create new session & Write session details (ip, time, req) to session file
curr_req_time=curr_req_time - 30
End For
End For
Step3. Generate adjacency matrix and data points.
Step4. Obtaining clusters using Chameleon algorithm
Input: data points obtained by adjacency matrix
Output: cluster file
1: Build a k-nearest neighbour graph of the data points
2: Partition the graph using multi-level graph partitioning algorithm hMETIS
3: repeat
4: Merge the clusters that best preserve the cluster self-similarity with respect to Relative interconnectivity and Relative closeness
5: until No more clusters can be merged

SIMULATION RESULTS

For experimentation, data from Server log files of the Web site www.enggresources.com collected from 19-07-2009 to 11-08-2009 is considered for the overall study and analysis. Error records, requests for images and multimedia files are removed from Server log file by using a tool called Web log filter. IP address, timestamp, user agent, request and referrer are retained for further processing. User Identification is considered as the next step. In user identification, IP address and user agent are used. That is, a combination of IP address and user agent is used to identify a unique user. In session construction, Time oriented approach is used for identifying user sessions. Session timeout threshold is set as 30 minutes. Each Web page is assigned with unique index. Then sessions are identified based on time out and it is written on to a separate file. Every unique session is also given unique index. Now the file contains session number and pages referred in that particular session. The adjacency matrix is constructed from the sessions considering sessions as rows and pages as columns. Using the adjacency matrix the data points are generated which is considered as input for the clustering phase. Then Chameleon algorithm is used to obtain the clusters. A KNN graph of the data set is constructed and hMETIS is applied to partition the graph in to clusters. Then the relative interconnectivity between two clusters and relative closeness within a cluster, are used to repeatedly merge the clusters until the specified number of clusters is achieved. Mat lab is used to plot the clustering data points on a plane where the different clusters are distinguished by distinct colours and distinct symbols. The output for the data set with ten clusters is shown in Fig 3. Fig 4 shows output for the same data set with six clusters. Apart from Time oriented and Navigation oriented, sessions can also be also obtained by using concept matching approach [27]. Fig 5 shows the output for the clusters obtained from sessions which are constructed by using concept hierarchy and web site graph with ten cluster numbers.

CONCLUSION AND FUTURE WORK

Clustering process groups the related data together to form clusters. Chameleon is agglomerative hierarchical clustering algorithm that overcomes the limitations of existing clustering algorithms. The main feature of Chameleon algorithm is that it considers both interconnectivity and closeness in identifying the most similar pair of clusters. In this paper, Chameleon algorithm is used to form the clusters from the data points. First the sessions are constructed using time–oriented approach. Sessions are transformed into adjacency matrix and data points. Then Chameleon algorithm is used to find clusters which are plotted on plane using MATLAB. As a future work rough set theory and semantic information can be used to find relative interconnectivity and relative closeness in order to obtain the most similar pair of clusters.
 

Figures at a glance

Figure 1 Figure 2 Figure 3 Figure 4 Figure 5
Figure 1 Figure 2 Figure 3 Figure 4 Figure 5
 

References