ISSN: 2229-371X
Mrs. K. Devipriyaa1* and Dr. (Mrs.) B.Kalpana2 |
Corresponding Author: Mrs. K. Devipriyaa , E-mail: krdevipriyaa@gmail.com, kalpanabsekar@gmail.com |
Related article at Pubmed, Scholar Google |
Visit for more related articles at Journal of Global Research in Computer Sciences
Web Usage Mining (WUM) is the process of extracting knowledge from Web user’s access data by exploiting Data Mining techniques. It mines the secondary data (web logs) derived from the users' interaction with the web pages during certain period of Web sessions. In the present research work, a hybrid method is proposed, which uses the ant-based clustering and LCS classification methods to find and predict user’s navigation behaviour. The proposed system works in two phases, (i) the offline phase and (ii) the online phase. The offline phase takes care of preprocessing and clustering, while the classification and prediction is performed during the online phase. Ant-based clustering method used to discover or extract user’s navigational pattern from web log files. The LCS classification algorithm uses the knowledge from offline stage and predicts the users’ next request.
Keywords |
web usage mining; pre-processing; pattern discovery; pattern analysis; ant-based clustering; LCS classification |
INTRODUCTION |
Web mining is the application of data mining techniques to web-based data for the purpose of learning or extracting knowledge. Based on several researches done in the area of web mining, we can broadly classify it into three domains: web content mining, web structure mining, and web usage mining. Web usage information mining could help to engage new customers, maintain current customers, track customers who are leaving web site, and so on [1]. A general web usage mining system (Figure 1) consists of three steps, namely, preprocessing, pattern discovery and pattern analysis |
Figure 1: Knowledge Discovery |
Data preprocessing is responsible for converting the usage, content, and structure information contained in the web log file into a format that is suitable for pattern discovery. Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. Pattern analysis is the last step in the overall Web Usage mining process. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. |
A clustering algorithm discovers groups in the set of documents such as documents within a group or more similar than document across groups. The clustering method used in an ant-based clustering method[12], where artificial ants act as agents, which communicate and influence themselves through the configuration of objects on the floor. Thus, the agents construct groups of similar objects or construct clusters. The classification algorithm based on Longest Common Sequence algorithm is used [9]. The main aim of this algorithm is to use the knowledge from clustering stage and predict the users’ next request. It uses a weight matrix to calculate the LCS and the path with least weight is chooses to predict the next request. |
In the following section we give an overview over the related work. Section 3 explains the methodology in more detail. Section 4 goes into detail how we implement the proposed method i.e. the experimental procedure of the proposed method and results are shown. We concluded our work in Section 5. |
RELATED WORK |
Identifying Web browsing strategies is a crucial step in Website design and evaluation, and requires approaches that provide information on both the extent of any particular type of user behavior and the motivations for such behavior [9]. Pattern discovery from web data is the key component of web mining and it converge algorithms and techniques from several research areas. Baraglia and Palmerini (2002) [12] proposed a WUM system called SUGGEST that provide useful information to make easier the web user navigation and to optimize the web server performance. Liu and Keselj (2007)[13] proposed the automatic classification of web user navigation patterns and proposed a novel approach to classifying user navigation patterns and predicting users’ future requests. Mobasher et al. (2000) [15] and Nakagawa and Mobasher (2003) [16] presents a WebPersonalizer system which provides dynamic recommendations, as a list of hypertext links, to users. Jespersen et al. (2002) [10] proposed a hybrid approach for analyzing the visitor click sequences. Jalali et al. (2008a [7] and 2008b [8]) proposed a system for discovering user navigation patterns using a graph partitioning model. An undirected graph based on connectivity between each pair of Web pages was considered and weights were assigning to edges of the graph. Dixit and Gadge (2010) [5] presented another user navigation pattern mining system based on the graph partitioning. An undirected graph based on connectivity between Referrer and URI pages was presented along with a preprocessing method to process unprocessed web log file and a formula for assigning weights to edges of the undirected graph. Ant-based clustering due to its flexibility and self-organization has been applied in a variety of areas from problems arising in e-commerce to circuit design, and text-mining to web-mining, etc (Jianbin et al., 2007) [11]. The various works proposed in this area with particular emphasize on web usage mining, clustering and classification was provided in this section. In this present work, research work is one another attempt made to propose a hybrid system that uses clustering and classification methods to discover the user’s navigation pattern and analyze them from the server’s web log file. |
METHODOLOGY |
The navigation pattern of an online user often reflects user’s mental model and for this reason, websites owners and developers pay more attention to the navigation patterns. In order to study these patterns effectively, browsing patterns gathered from a site is very important. Several solutions have been proposed and the usage of clustering and classification is more frequently used on such solutions. This research work is one another attempt made to propose a hybrid system that uses clustering and classification methods to discover the user’s navigation pattern and analyze them from the server’s web log file. |
Preprocessing |
Preprocessing of a web log file simply reformats the entries of a log file into a form that can be used directly by the subsequent steps of the log analyzer. The preprocessing is performed in four steps as given below. (i) Cleaning (ii)User identification (iii) Session identification and (iv) Formatting. |
In the Cleaning step, (i.e.,) cleaning of data, unwanted data will be deleted. Examples of unwanted data include requests for images, Javascripts, flash animations, video, etc. These data are not required for user navigation and hence are deleted form the log file. The identification of users through the IP address is the most frequently used method as it is simple, easy to capture and is never empty. Session Identification is performed using session timeout value. The original log file is thus pruned from unwanted data and is formatted which is done in the last step. This formatted data is taken as input in the next process of web log analysis. Several methods and techniques have already been developed for this step. Some of the frequently used solutions are statistical analysis, clustering, classification, association rules, sequent patterns and dependency modeling. Etminani et al. (2009) [12] used an Ant-based clustering approach to discover navigation patterns for discovering navigation pattern. This method has some issues when combined with log file knowledge discovery. The issues are |
(i) One of the critical problems of ant-based clustering is the high-execution times. |
(ii) The second issue is the inability of ant-based algorithm to detect the completion of clustering process. |
(iii) Another important issue is that one ant can produce the same results as many ants in the models function like stochastic sampling algorithms and the result is repeated and are considered during clustering |
All the above mentioned problems motivated the present research work to use a classification algorithm, after the antbased clustering. The classification algorithm used is the Longest Common Subsequence (LCS), proposed by Jalali et al.[9]. The study divides the pattern discovery and analysis phase into two phases, the online and offline phase. The offline phase consists of analyzing the log file and producing the clusters. The online phase, on receiving a new user request, the URL and session to which the user belongs are identified and are classified to the correct cluster. Both the ant-based clustering algorithm and the LCS algorithm are explained in the next section. |
Pattern Analysis |
The final stage is the analysis of the patterns discovered in the previous step. This is done in two stages: |
(i) Validation : To identify relevant rules or patterns from which interesting patterns can be discovered |
(ii) Interpretation : Mathematical interpretations which can be used by scientists to discover knowledge |
Ant-based clustering (offline Phase) |
The ant algorithm is mainly based on the version described in Handl and Meyer (2002)[6]. Deneubourg et al. in [4] proposed ant-based clustering and sorting. In the case of antbased clustering and sorting, two related types of natural ant behaviors are modeled. When clustering, ants gather items to form heaps. And when sorting, ants discriminate between different kinds of items and spatially arrange them according to their properties. Lumer and Faieta in [14] proposed ant-based data clustering algorithm (shown in Figure 2), which resembles the ant behavior described in [4]. As shown in Figure 2, the agents (ants) and data are randomly initialized on a toroidal grid. By moving agents,Data is sorted according to its neighbors. The picking and dropping probabilities, given a grid position and a particular data item i, are computed using the density functions: |
Figure 2: Ant based algorithm |
During the online phase, when a new request arrives at the server, the URL requested and the session to which the user belongs are identified, the underlying knowledge base is updated, and a list of suggestion is appended to the requested page. The online phase is shown in Figure 3. |
Figure 3: Online Phase |
From the clustering results, we have a set of clusters np' = <np1, np2, …npn> where npi = <P1, P2, …, Pk> where k is the set of web pages identified as user navigation patterns and 1 £ i £ n. Sequence W' = <P1, P2, …Pm> is a current active session and m is size of active session window. Before classifying an active session to construct the prediction list, the pages in active session windows is sorted based on values stored in the co-occurrence matrix M. After this step, for building the prediction list, the system must find the cluster based on LCS algorithm. After applying this algorithm, the system finds a cluster with highest degree of LCS in respect to sequence W'. |
When the prediction engine finds more than one cluster based on LCS algorithm, then the prediction engine selects a cluster in such a way that, if the difference between positions of last elements of longest common subsequence founded in the cluster and the position of first element of this sequence is minimized, the system chooses this cluster. In this module, if the first page in the next user activity is different with prediction list, it needs again to classify with new user activities. |
RESULTS AND DISCUSSION |
In order to test the effectiveness of the proposed system, server web log data file was obtained. The system was tested with several data collected from 90 days for easy discussion, experiments projected here are from one day, that is, data collected on 29-12-2009. |
As mentioned in section 3, the preprocessing is conducted in four steps, namely (i) Cleaning (ii) User Identification (iii) Session Identification and (iv) formatting. |
Figure 4: Effect of cleaning step on raw web log file |
Ant Based clustering is an application for clustering similar interested users into a single class. The result after such grouping is shown in Figure 5. |
Figure 5: Cluster Groups |
Knowledge gained from the clustering results include the number of visits made to a single webpage, website traffic, , most frequently viewed pages cluster information, the users navigation behaviour, etc. The user profile and path accessed by the users are shown in figure 6. |
Figure 6 Extracted navigation patterns |
To predict the user’s next request, LCS classification algorithm was used. The LCS finds the longest navigation sequence cluster that matches with the user’s referral URL For example, Table I shows the navigation pattern of four users belonging to the same cluster, constructed over 3 sessions. |
Table I: Sample pattern |
In order to make pattern analysis and prediction, the LCS algorithm calculates a weight matrix with each pattern discovered. The weight W for edge E can be computed as: |
The accuracy of the system while testing with threshold values ranging from 0.1 to 1.0. From the results, it is apparent that the proposed ant-based clustering when combined with LCS produces better result when compared to the Jalali’s system [9], which uses graph partitioning clustering algorithm and LCS algorithm. The result shows a trend that when the threshold value increases, the accuracy also increases and the maximum accuracy achieved by the proposed system is 74%. |
CONCLUSION |
The results thus prove that the application of clustering and classification have positive impact during user navigation pattern discovery process and gives its best result in both finding the navigation pattern and predicting future request. Future research can also try to combine clustering and association rules to discover more knowledge from the clustered data. Different clustering algorithms can also be investigated to improve the trend analysis and knowledge discovery. |
References |
|