Keywords
Apriori algorithm, frequent items, association
INTRODUCTION
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets [1]. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.
Data mining is becoming increasingly common in both the private and public sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. In the public sector, data mining applications were initially used as a means to detect fraud and waste, but have grown to also be used for purposes such as measuring and improving program performance.
ASSOCIATION RULE MINING
Association rule mining, one of the most important and well-researched techniques of data mining, was first introduced in [2]. It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories. Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control.
Association rule mining aims to find association rules that satisfy predefined minimum support and confidence thresholds in a given database [3]. The problem is usually decomposed into two subproblems. The first is to find those itemsets whose occurrences exceed a predefined threshold in the database; these itemsets are called frequent or large itemsets. The second is to generate association rules from those large itemsets under a minimum-confidence constraint. Suppose one of the large itemsets is Lk = {I1, I2, …, Ik}. Association rules with this itemset are generated in the following way: the first rule is {I1, I2, …, Ik-1} ⇒ {Ik}; by checking its confidence, this rule can be determined to be interesting or not. Further rules are generated by deleting the last item in the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. This process iterates until the antecedent becomes empty. Since the second subproblem is quite straightforward, most research focuses on the first.
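The rule-generation scheme just described can be sketched as follows. This is an illustrative reading, not code from the paper: the `support` mapping and all item names are hypothetical, and the loop stops before the antecedent becomes empty (since an empty antecedent gives a degenerate rule).

```python
def rules_from_itemset(itemset, support, min_conf):
    """Generate rules from one large itemset Lk by repeatedly moving
    the last antecedent item into the consequent. `support` maps
    frozensets of items to their support counts, assumed precomputed
    during frequent-itemset mining."""
    items = list(itemset)
    rules = []
    # Start with {I1..Ik-1} => {Ik}, then shrink the antecedent.
    for j in range(len(items) - 1, 0, -1):
        antecedent = frozenset(items[:j])
        consequent = frozenset(items[j:])
        # confidence(A => B) = support(A ∪ B) / support(A)
        conf = support[frozenset(items)] / support[antecedent]
        if conf >= min_conf:
            rules.append((antecedent, consequent, conf))
    return rules

# Hypothetical support counts for the large itemset {A, B, C}
support = {frozenset("ABC"): 3, frozenset("AB"): 4, frozenset("A"): 6}
print(rules_from_itemset("ABC", support, min_conf=0.6))
```

With these counts, {A, B} ⇒ {C} has confidence 3/4 = 0.75 and is kept, while {A} ⇒ {B, C} has confidence 3/6 = 0.5 and is rejected.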
The first subproblem can be further divided into two steps: candidate large itemset generation and frequent itemset generation. Itemsets whose support exceeds the support threshold are called large or frequent itemsets; itemsets that are expected, or have the potential, to be large or frequent are called candidate itemsets.
In many cases, the algorithms generate an extremely large number of association rules, often thousands or even millions, and the rules themselves are sometimes very large. It is nearly impossible for end users to comprehend or validate such a large number of complex association rules, which limits the usefulness of the data mining results [4]. Several strategies have been proposed to reduce the number of association rules, such as generating only "interesting" rules, generating only "non-redundant" rules, or generating only those rules satisfying other criteria such as coverage, leverage, lift, or strength.
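The measures named above are standard and easy to compute from raw counts. The following is an illustrative sketch (the function name and the example counts are hypothetical, not from the paper):

```python
def measures(n, count_a, count_b, count_ab):
    """Common interestingness measures for a rule A => B, computed
    from raw occurrence counts over n transactions."""
    supp_a, supp_b, supp_ab = count_a / n, count_b / n, count_ab / n
    confidence = supp_ab / supp_a
    lift = confidence / supp_b            # > 1 suggests positive correlation
    leverage = supp_ab - supp_a * supp_b  # deviation from independence
    return {"support": supp_ab, "confidence": confidence,
            "lift": lift, "leverage": leverage}

# Hypothetical counts: 100 transactions, A in 40, B in 50, A and B in 30
m = measures(n=100, count_a=40, count_b=50, count_ab=30)
print(m)  # confidence 0.75, lift 1.5
```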
Apriori is more efficient during the candidate generation process [5]. Apriori uses pruning techniques to avoid counting certain itemsets while guaranteeing completeness: these are the itemsets that the algorithm can prove will not turn out to be large.
APRIORI ALGORITHM
The basic Apriori algorithm, designed by Agrawal and others in 1993, generates all frequent itemsets. Apriori uses a recursive, level-wise method; its core is:
L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1, minsup);
    for each transaction t ∈ D {      // scan D for counts
        Ct = subset(Ck, t);           // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ minsup};
}
return L = ∪k Lk;                     // all of Lk
First, the database is scanned once to obtain the frequent 1-itemsets L1. The algorithm then loops: in the k-th pass, the frequent (k-1)-itemsets are self-joined and pruned to generate the candidate frequent k-itemsets Ck, which are stored in a tree using a hash function. The database is scanned again, and for each transaction T the same hash function locates the candidate k-itemsets contained in T and increments their support counts by 1. If a candidate's support count is greater than or equal to the minimum support, it becomes a frequent k-itemset. The loop ends when no further candidate frequent k-itemsets can be generated [6].
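The level-wise procedure above can be made concrete in a short, runnable sketch. This is a minimal illustration of the self-join, prune, and count steps (it counts candidates by direct set containment rather than a hash tree; all names and the sample transactions are illustrative):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: level-wise candidate generation with
    subset pruning. `transactions` is a list of frozensets; `minsup`
    is an absolute support count. Returns {itemset: support}."""
    def frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= minsup}

    items = {i for t in transactions for i in t}
    L = frequent({frozenset([i]) for i in items})  # L1
    all_frequent = dict(L)
    k = 2
    while L:
        # Self-join: unions of (k-1)-itemsets that yield k-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(candidates)
        all_frequent.update(L)
        k += 1
    return all_frequent

T = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori([frozenset(t) for t in T], minsup=3))
```

On this toy database every single item and every pair is frequent at minsup = 3, but {A, B, C} occurs only twice and is pruned away by the count step.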
A. Limitations of Apriori Algorithm |
The various limitations are as follows : |
1) It only records the presence or absence of an item in the transactional database [7].
2) It is not efficient for large datasets.
3) The algorithm treats all items in the database equally, considering only the presence or absence of an item within a transaction [8]; it does not take into account the significance of an item to the user or business.
These disadvantages can be addressed by using attributes such as weight and quantity: the quantity attribute gives an estimate of how much of an item has been purchased by the customer, while the profit attribute captures the profit ratio and the total profit an item brings to the business.
PROPOSED METHODOLOGY
The following steps describe the proposed methodology and how the proposed work is carried out. First, a transactional database is assumed, on which the proposed methodology is applied.
STEP 1: Assume a database consisting of the items purchased by customers and the total profit achieved by each item. Calculate the profit ratio for each item by applying the Q-factor.
STEP 2: Assume a transactional database consisting of the items purchased by customers and the total number of transactions in which each item appears.
STEP 3: Apply the Apriori algorithm of association rule mining to determine the frequency of each itemset.
STEP 4: Calculate the confidence measure of each itemset.
STEP 5: Sort the itemsets based on the user-specified minimum confidence.
STEP 6: Apply the profit-weighing (PW) factor to the sorted itemsets.
STEP 7: Output the frequent itemsets that give maximum profit to the business.
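The steps above can be sketched end to end. The exact Q-factor and PW-factor formulas are given in Fig. 1 and Fig. 2 and are not reproduced in the text, so the formulas below are assumptions: the Q-factor is read as each item's share of total profit, and the PW-factor as an itemset's support times the summed Q-factors of its members. All names and numbers are illustrative.

```python
def q_factor(profits):
    """Assumed Q-factor (Fig. 1): each item's profit as a fraction
    of the total profit across all items."""
    total = sum(profits.values())
    return {item: p / total for item, p in profits.items()}

def pw_factor(support, itemset, q):
    """Assumed PW-factor (Fig. 2): itemset support (frequency) times
    the summed Q-factors of its member items."""
    return support * sum(q[i] for i in itemset)

def profitable_patterns(frequent, profits, min_pw):
    """Steps 6-7: score confidence-filtered frequent itemsets by the
    PW-factor and keep those at or above a threshold, most profitable
    first."""
    q = q_factor(profits)
    scored = {s: pw_factor(c, s, q) for s, c in frequent.items()}
    return sorted(((s, pw) for s, pw in scored.items() if pw >= min_pw),
                  key=lambda kv: -kv[1])

# Hypothetical inputs: item profits and confidence-filtered itemsets
profits = {"A": 50, "B": 30, "C": 20}
frequent = {frozenset({"A", "B"}): 4, frozenset({"B", "C"}): 3}
print(profitable_patterns(frequent, profits, min_pw=2.0))
```

Here {A, B} scores 4 × (0.5 + 0.3) = 3.2 and is kept, while {B, C} scores 3 × 0.5 = 1.5 and falls below the threshold.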
A. Example of Efficient Frequent itemset generation using attributes |
The following example illustrates how these steps are performed on a particular transactional dataset. The first step of the proposed approach is the computation of the profit ratio for all items based on profit, calculated as the Q-factor using the formula in Fig. 1, where i = 1 to n, n is the number of items, and P is the profit of an item.
a) Calculation of profit ratio: Table 1 shows five items A, B, C, D, E. Each item has a unique transactional ID and the profit gained when that item is purchased by the customer. The profit ratio is calculated using the Q-factor for each item and the profit associated with it. This profit ratio is used to calculate the PW-factor in the final step of the proposed methodology.
b) Applying the Apriori algorithm on the transactional database: One of the contemporary approaches for frequent itemset mining is the Apriori algorithm. Table 2 shows the transactional database with a minimum support of 3. Minimum support is the user-specified support threshold used to determine the frequency of each itemset's occurrence within the transactional database. As shown in Table 2, there are 10 transactions, which record the presence or absence of each item, i.e., in how many transactions a particular item is or is not purchased by the customer. From this transactional database, the frequency of each item's occurrence is calculated using the support measure of the Apriori algorithm, as shown in Table 3.
Table 3 shows the final frequent itemsets satisfying the assumed minimum support. The itemsets marked in red in the diagram are not frequent, as they do not satisfy the minimum support, so they are excluded from the final output.
c) Frequent pattern selection using the confidence measure: Table 4 shows the frequent patterns selected using the confidence measure of the Apriori algorithm. The minimum confidence is assumed to be 60%. The frequent patterns are sorted, and those with confidence ≥ 60% are selected.
d) Calculation of the profit and weighing factor (PW-factor): The next step in frequent pattern mining is the calculation of the PW-factor for each frequent pattern selected based on confidence. It is calculated using the equation in Fig. 2. Table 5 shows the values obtained by applying the PW-factor to each frequent itemset. Here, the frequency is the support calculated for each itemset in Table 3, and the Q-factor is the profit ratio calculated in Table 1.
e) Efficient frequent pattern selection: Table 6 shows the sorted frequent patterns whose PW-factor is ≥ 2.0. This efficient frequent pattern selection step yields the frequent patterns that give maximum profit to the business. In general, standard association rule mining algorithms produce enormous numbers of patterns, and users are expected to shortlist the patterns that are interesting to their own businesses. The results show, however, that the proposed approach generates only a small number of interesting association patterns that are both statistically and semantically important for business development. The high-utility patterns discovered in historical buying behavior signify the importance of those items to the growth of the enterprise.
CONCLUSION
In this work, the Apriori algorithm is applied to a transactional database, and its measures are used to generate frequent itemsets from the database. The Apriori algorithm, however, has certain limitations. The efficiency of association rule mining can be improved by using attributes such as profit and quantity, which provide valuable information to both the customer and the business. In contrast to traditional association rule mining algorithms, the proposed approach generates only a small number of interesting association patterns that are both statistically and semantically important for business development. This in turn affects business utility because, in most cases, frequency plays a vital part in business development, including sales backup and more.
Tables at a glance
Table 1, Table 2, Table 3, Table 4, Table 5, Table 6
Figures at a glance
Figure 1, Figure 2
References
- Pieter Adriaans and Dolf Zantinge, "Data Mining", New York: Addison Wesley, 1996; "Introduction to Data Mining and Knowledge Discovery", Third Edition, Potomac, MD: Two Crows Corporation, 1999.
- Agrawal, R., Imielinski, T., and Swami, A. N., "Mining association rules between sets of items in large databases", In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 207-216, 1993.
- Hegland, M., "Algorithms for Association Rules", Lecture Notes in Computer Science, Volume 2600, Pages 226-234, Jan 2003.
- Techapichetvanich, K., Datta, A., "Visual Mining of Market Basket Association Rules", Lecture Notes in Computer Science, Volume 3046, Pages 479-488, Jan 2004.
- Agrawal, R. and Srikant, R., "Fast algorithms for mining association rules", In Proc. 20th Int. Conf. Very Large Data Bases, 487-499, 1994.
- Kong Fang, Qian Xue-zhong, "Research of improved apriori algorithm in mining association rules", Computer Engineering and Design, vol. 29, no. 16, pp. 4220-4223, 2008.
- Tang, P., Turkia, M., "Parallelizing frequent itemset mining with FP-trees", Technical Report, titus.compsci.ualr.edu/~ptang/papers/parfi.pdf, Department of Computer Science, University of Arkansas at Little Rock, 2005.
- Han, J., Pei, J., "Mining frequent patterns by pattern growth: methodology and implications", ACM SIGKDD Explorations Newsletter 2, 2, 14-20, 2000.