An Efficient Algorithm for Mining Frequent
Items in Data Streams

Dr. S. Vijayarani; Ms. P. Sathya

An Efficient Algorithm for Mining Frequent Items in Data Streams

Dr. S. Vijayarani¹, Ms. P. Sathya²

Assistant Professor, Dept. of CSE, Bharathiar university, Tamil Nadu, India
Research Scholar, Dept. of CSE, Bharathiar University, Tamil Nadu, India

Related article at Pubmed, Scholar Google

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

Data stream is a continuous, real time, ordered sequence of items. In data stream, data arrives endlessly and the volume of data can be potentially infinite. In recent years advances in hardware and software technologies have resulted in automated storage of data from a variety of process. Data mining techniques are applied in data streams to find out the significant knowledge. The term data mining refers, to find relevant and useful information from large database. Some of the important techniques in data mining are association rule, classification, clustering, frequent episodes, and deviation detection. Frequent pattern mining is used to find important frequent patterns from the large dataset. Click stream analysis, market basket analysis, web link enquiry, genome study, network monitoring and medicine designing are some of the important areas where frequent pattern mining is used. Most commonly used frequent pattern mining algorithms are Apriori, partition algorithm, pincer- search algorithm, fp-growth algorithm, dynamic item set counting algorithm and so on. In this research paper Éclat and Rapid Association Rule mining algorithm are used for finding the frequent item sets in data streams. The experimental results show that the performance of RARM algorithm is better than Éclat.

Keywords

Association rules mining, Data mining, Data streams, Éclat algorithm, frequent pattern mining, RARM algorithm

INTRODUCTION

In data streams the data elements arrive incessantly and infinitely [2]. There is no limit on the end points in data stream data. The system has no control over the order in which the data elements arrive. The repeated scanning method is impossible in data streams. Due to the continuous flow of data the user should select the single pass algorithm to find out the frequent item sets [4]. In addition to that the algorithm does not drop any events in that stream data. The data streams unremitting data has lots of problems to store, compute as well as communication capabilities in computing system. Call records, web page visits, sensor reading are some of the examples of tuples in data streams. The hasty growth of uninterrupted data has many challenges to store, computation and communication capabilities in computing system.

In data stream data enters at high speed and continuous way. It is not possible to store them in a data warehouse. To identify a bit that is some portion of information in the database and remove the unnecessary information from the large data base. Important data mining techniques such as clustering, classification, frequent item mining, etc. [1] [12] are used for extracting the hidden knowledge from the data streams. In this paper, we have concentrated on mining frequent items from data streams. In association rule mining, association rules are discovered only when the rules whose support and confidence is greater than or equal to the minimum support and confidence. There are two important steps in association rule mining. The first step is to find the frequent items from the data set. During the second step, the association rules are generated from the frequent items.

This paper will focus on the following sections. In Section 2, we present an overview of the frequent item set mining in data streams and its related works. Section 3 discusses how Éclat algorithm is used for finding the frequent items in data streams. Section 4 gives the rapid association rule mining algorithm (RARM). Section 5 discusses the performance evaluation of the two algorithms. Section 6 discusses the conclusion and future work.

FREQUENT ITEM SET MINING IN DATA STEAMS

In a transactional database, the items which occur repeatedly in transactions are called frequent items. Frequent item set mining is used to discover useful patterns in customer’s transaction databases. This type of finding item set helps businesses to make important decisions, such as catalogue drawing, cross marketing and consumer shopping performance scrutiny. A customer’s transaction database consists of set of transactions, where each transaction is an item set. The frequent item set mining is to find all frequent item set in a given transaction database. There are many kinds of frequent patterns. They are [11], [2], [15]

? Frequent item set

? Subsequent item set

? Sub structure item set

Frequent item set -> ex: set of items like milk, bread appears together in a transactional data is known as frequent item set. (Items frequently appears together)

Subsequence item set -> ex: PC, digital camera, memory card which means a person first buys a PC, then a digital camera and then a memory card. If these items are appearing together in a shopping database, it is a sequential pattern.

Sub structure items -> ex: graph, tree, and lattice. Different structural forms combined with item sets or subsequence is called as substructure.

A. Related Works

Syed Khairuzzaman Tanbeer et.al [14] proposed a prefix-tree structure called CPS-tree. The CPS tree uses a new technique called as dynamic tree restructuring technique to handle the stream data. The main disadvantage of this algorithm is every time a new item arrives, it reconstructs the tree. So it causes more memory space as well as time. Pauray S.M. Tsai [13] proposed a new technique called the weighted sliding window (WSW) algorithm. This algorithm calculates the weight of each transaction in each window. The candidate item set generation may take more time and memory. For candidate generation, an apriori algorithm is used.

Hue-Fu Li et.al [7] proposed an efficient bit-sequence based algorithm called MFI-Trans SW (Mining Frequent Item sets with in a Transaction Sensitive Sliding window). MFI algorithm worked on three phases. If the window size is increased, the memory usage of MFI-TransSW is also increased. If the window size increases, the processing time of phase 1 and phase 2 of MFI-TimeSW is also increased.

Yo Unghee Kim et.al [17] proposed an efficient algorithm WSFI mine (Weighted Support Frequent Item sets mining) with normalized weight over data stream. This WSFI-mine algorithm can mine all frequent item sets in one scan from the database.

Chowdhury Farhan Ahmed et. al [6] they recommended a novel algorithm for sliding window based high utility pattern mining over data stream called as HUPMS (High Utility Pattern Mining in Stream data). This algorithm is only suitable for interactive mining Jing Guo,Peng Zhang et. al [9] discussed how to mine frequent patterns across multiple data streams. In this paper they selected real time news paper data for analysis. In multiple streams it is important to discover collaborative frequent patterns and comparative frequent patterns.

Anushree Gowtham Ringe et. al [3], This work was mainly focused on, how to prevent the misuse of sensitive data, in a stream. In this paper they proposed a novel technique for preserving the privacy of data stream.

ÉCLAT ALGORITHM IN DATA STREAMS

Equivalence Class Clustering and bottom up Lattice Traversal is an acronym for ECLAT algorithm [2] [5]. The name implies, that the algorithm uses bottom up searching method to find out the frequent item set. This algorithm is also used to perform item set mining. It uses tid set intersection to compute the support of a candidate item set. Compared with other algorithms like Apriori, FP- growth, partition algorithm, the ECLAT algorithm does not required candidate generation phase and pruning phase.

Éclat algorithm first put some id value to all the item set in a database. It attempts to improve the support counting step by indexing the data base. This algorithm does not create candidate generation process directly. The support of the candidate item set can be computed to intersect the tid list of suitably chosen subsets. Éclat counts the supports of all item sets much more efficiently using the intersection of tid lists. Compared with other algorithms, the memory usage is much lower as at any depth.

RARM ALGORITHM IN DATA STREAMS

Rapid Association Rule Mining algorithm is an abbreviation for RARM [2]. This algorithm uses tree structure to represent the set of transactions. This algorithm avoids candidate generation process. It uses a data structure called Support Ordered Trie Item set (SOTrieIT) to generate large 1 item sets and large 2 item sets. This algorithm scans the data base only once. This algorithm first scans the database and forms some tree structure.

PERFORMANCE EVALUATION

Experimental results of éclat algorithm and RARM algorithm are discussed in this section. It is implemented in Microsoft visual studio 2008 with SQL server 2000. For testing frequent pattern mining over transactional data, synthetic data streams data sets [10] [18] are used. The synthetic data used in this paper is Kosakshi from IBM data generator. This data set contains 88054 transactions and 46 attributes

A. Number of Frequent Items

There are five windows W1, W2, W3, W4 and W5 are used in this work. The performance of the Éclat and RARM algorithms are compared by the two factors namely execution time and the number of frequent items discovered in each window. Window sliding concept is used in this work. After finding the frequent items in W1 the next window W2 automatically slides. Different sizes of transactions 100,500 and 1000 are tested and their results are obtained.

Table 1 illustrates the number frequent items identified by Éclat and RARM algorithms. From the results, it can be shown the more number of frequent items are identified by RARM algorithm compared to Éclat algorithm.

B. Time Efficiency:

Another performance factor used for measuring the efficiency of Éclat and RARM is execution time. Execution time is nothing but how much time required for identifying the frequent items in each window.

Table II shows the execution time required for Éclat and RARM algorithms. From this we come to know that Éclat algorithm needs more time for finding the frequent items compared to RARM algorithm.

CONCLUSION AND FUTURE WORK

In this research work we have analyzed the performance of two algorithms namely Éclat and Rapid Association Rule Mining for mining frequent items in data streams. Execution time and number of frequent items are the main performance factors of this work. From the experimental results, we observed that, the RARM algorithm required minimum execution time and it also identified more number of frequent items compared to Éclat algorithm. Mining techniques will then be very significant in order to conduct advanced analysis, such as determining trends and finding interesting patterns, on streaming data. Data streams data are from various sources, and it has much confidential information also, so we can protect these confidential data by applying a privacy technique in future [8]. It is often a challenge to perform privacy for continuously arriving data.

Tables at a glance


Table 1	Table 2

Figures at a glance


Figure 1	Figure 2

References

Arun k Pujari “Data mining techniques second edition” ISBN 978 81 7371 672 0 universities press 2010

Aggarwal, C. In C. Aggarwal (Ed.), “Data streams: Models and algorithms”. Springer, 2007.

AnushreeGowthamRinge, DeekshaSood and TurgaToshniwa”Compression and privacy preservation of data streams using moments”,Information journal of machine learning and computing. 2011

Albert Bifet, Geoff Holmes, Richard Kirkby and Bernhard Pfahringer” Data Stream Mining A Practical Approach” May 2011

.Borgelt C “efficient implementations of apriori and éclat”. In: 2nd workshop of frequent item set mining implementations. 2004.

ChowdaryFarhaahmed, Byeong-SooJeong, “Efficient mining of high utility patterns over data streams with a sliding window model”. Springerlink.com, 2011

Hue-Fu Li, Suh-Li “Mining frequent item sets over data streams using efficient window sliding technique”, Elsevier publication. 2009.

Han J, Cheng H, Xin D, Yan X “Frequent pattern mining: current status and future directions”. 2007

Jing Guo, Peng Zhang, Jianlong Tan and li Guo “Mining frequent patterns across multiple data streams”, 2011.

J. Han and M. kamber, “Data Mining: Concepts and techniques,” Series Editor Morgan Kaufmann Publishers, ISDN 1-55860-489-8. 2000

Koh, J.-L., and Shieh, S.-F, “An efficient approach for maintaining association rules based on adjusting FP-tree structures”. In Lee Y-J, Li J, Whang K-Y, Lee D (eds) Proc. of DASFAA 2004. Springer-Verlag, Berlin Heidelberg New York, 417–424

Margaret H. Dunham“Data Mining: Introductory and Advanced Topics”.

PaurayS.M.Tsai, “Mining frequent item sets in data streams using the weighted sliding window model”, Elsevier publication 2009.

Syed KhairuzzamanTabeer, ChowdaryFarhaahmed, Byeong-SooJeong, Young Koo Lee”Efficient frequent pattern mining over data streams” 2008.

Tanbeer, S. K., Ahmed, C. F., Jeong, B.-S., and Lee, Y.-K. 2008. “CP-tree: a tree structure for single-pass frequent pattern mining”S. In Proc. of PAKDD, Lect Notes ArtifInt, 1022-1027.

Younghee Kim, Won Young Kim and Ungmo Kim “Mining frequent item sets with normalized weight in continuous data streams”. Journal of information processing systems. 2010.

www.borgelt.net/slides/fpm.pdf