Data
Mining: Apriori Algorithm
NOTES:
a) The following
content is modified from the following original source:
· APRIORI ALGORITHM: EXAMPLE (Source 1)
Data
Mining is a, “computational process for
discovering patterns in large data sets involving methods at the intersection
of computer science and statistics” (Wikipedia).
Apriori
Algorithm is a classic algorithm used in data mining for learning association
rules. Shopping centers use association rules to place the items next to each
other so that users buy more items.
Amazon.com uses association rules to recommend items based on the
current item you are browsing/buying. The auto-complete feature in Google uses
association as well. Hence, once you type a word (say, X) in “Google
auto-complete”, it displays the frequently associated words that users type
along with X. (Source 1)
Below,
you will see a step-by-step description of the Apriori Algorithm to find the set
of three items (from the below list of transactions) that are bought
together most frequently.
Start with the following input values:
Transaction
ID Items Bought
T1
{Mango, Onion,
Nintendo, Key-chain, Eggs, Yo-yo}
T2
{Doll, Onion,
Nintendo, Key-chain, Eggs, Yo-yo}
T3
{Mango, Apple, Key-chain,
Eggs}
T4
{Mango, Umbrella,
Corn, Key-chain, Yo-yo}
T5
{Corn, Onion,
Onion, Key-chain, Ice-cream, Eggs}
In
this problem, we say an item/item set is frequently bought if it is bought at
least 60% of times. For this problem, the item should be bought at least 3
times.
For
simplicity M = Mango, O = Onion and so on……
Transaction
ID Items Bought
T1
{M, O, N, K, E, Y}
T2
{D, O, N, K, E, Y}
T3
{M, A, K, E}
T4
{M, U, C, K, Y}
T5
{C, O, K, I, E}
Step 1: Count the number
of transactions in which each item occurs. Note that, ‘O=Onion’ is bought 4
times in total, but, it occurs in just 3 transactions.
Item
No of transactions
M
3
O
3
N
2
K
5
E
4
Y
3
D
1
A
1
U
1
C
2
I
1
Step 2: Now, remember we said an item is said to be
frequently bought if it is bought at least 3 times. Hence, we remove all the
items that are bought less than 3 times from the above table and we are left
with the following:
Item
Number of transactions
M
3
O
3
K
5
E
4
Y
3
Step 3: Using the table generated in Step 2, find
the top two items (largest and the second largest) that are the most
transacted. As per the above table, your answer should be K and E.
Step 4: Now, that leaves M, O and Y in the table
from Step 2. They all appear in the same number of transactions (=3). Which of
these appear the most with K and E? Step
4 should provide the answer.
In
this step, you should count the number of transactions as described below
(refer to the initial transaction table below):
· M appears with K
in T1, T3 and T4. M appears with E in T1 and T3. In all, M appears with K and E
in five transactions.
· O appears with K
in T1, T2 and T5. O appears with E in T1, T2 and T5. In all, O appears with K
and E in six transactions.
· Y appears with K
in T1, T2 and T4. Y appears with E in T1 and T2. In all, Y appears with K and E
in five transactions.
Referring back to the initial
transaction table:
Transaction ID Items Bought
T1 {M,
O, N, K, E, Y}
T2 {D,
O, N, K, E, Y}
T3 {M,
A, K, E}
T4 {M,
U, C, K, Y}
T5 {C,
O, K, I, E}
From
the above calculations, it’s clear that O appears the maximum number of times
with K and E. Hence, the set of three items that appear the most together is K,
E and O.
(NOTE: The above data table and
the algorithm steps are courtesy of
Source 1)
Program Output:
Implementing Apriori Algorithm
Step 1: Counting Transaction for Each Item:
Item Number
of Transactions
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: Discarding Items with less than 3
transactions:
Item Number
of Transactions
M 3
O 3
K 5
E 4
Y 3
Step 3:
The top two items that are the most
transacted are K and E
Step 4:
The contenders for the third item that
is transacted most with K and E are:
M
O
Y
M appears with K and E 5 times
O appears with K and E 6 times
Y appears with K and E 5 times
The set of three items that appear the
most together are KEO