Data Mining: Apriori Algorithm

 

 

NOTES: 

a)      The following content is modified from the following original source: 

·       APRIORI ALGORITHM: EXAMPLE  (Source 1)

·       Java Implementation

 

Data Mining is a, “computational process for discovering patterns in large data sets involving methods at the intersection of computer science and statistics” (Wikipedia).

Apriori Algorithm is a classic algorithm used in data mining for learning association rules. Shopping centers use association rules to place the items next to each other so that users buy more items.  Amazon.com uses association rules to recommend items based on the current item you are browsing/buying. The auto-complete feature in Google uses association as well. Hence, once you type a word (say, X) in “Google auto-complete”, it displays the frequently associated words that users type along with X. (Source 1)

Below, you will see a step-by-step description of the Apriori Algorithm to find the set of three items (from the below list of transactions) that are bought together most frequently.

Start with the following input values:

Transaction ID    Items Bought

T1                         {Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}

T2                         {Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}

T3                         {Mango, Apple, Key-chain, Eggs}

T4                         {Mango, Umbrella, Corn, Key-chain, Yo-yo}

T5                         {Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}

 

In this problem, we say an item/item set is frequently bought if it is bought at least 60% of times. For this problem, the item should be bought at least 3 times.

For simplicity M = Mango, O = Onion and so on……

Transaction ID    Items Bought

T1                         {M, O, N, K, E, Y}

T2                         {D, O, N, K, E, Y}

T3                         {M, A, K, E}

T4                         {M, U, C, K, Y}

T5                         {C, O, K, I, E}

 

Step 1: Count the number of transactions in which each item occurs. Note that, ‘O=Onion’ is bought 4 times in total, but, it occurs in just 3 transactions.

Item No of transactions

M      3

O       3

N       2

K       5

E        4

Y        3

D       1

A       1

U       1

C       2

I         1

 

Step 2:  Now, remember we said an item is said to be frequently bought if it is bought at least 3 times. Hence, we remove all the items that are bought less than 3 times from the above table and we are left with the following:

Item Number of transactions

M      3

O       3

K       5

E        4

Y        3

 

Step 3:   Using the table generated in Step 2, find the top two items (largest and the second largest) that are the most transacted. As per the above table, your answer should be K and E.

 

Step 4:   Now, that leaves M, O and Y in the table from Step 2. They all appear in the same number of transactions (=3). Which of these appear the most with K and E?  Step 4 should provide the answer.

In this step, you should count the number of transactions as described below (refer to the initial transaction table below):

·       M appears with K in T1, T3 and T4. M appears with E in T1 and T3. In all, M appears with K and E in five transactions.

·       O appears with K in T1, T2 and T5. O appears with E in T1, T2 and T5. In all, O appears with K and E in six transactions.

·       Y appears with K in T1, T2 and T4. Y appears with E in T1 and T2. In all, Y appears with K and E in five transactions.

Referring back to the initial transaction table:

Transaction ID         Items Bought

T1                             {M, O, N, K, E, Y}

T2                             {D, O, N, K, E, Y}

T3                             {M, A, K, E}

T4                             {M, U, C, K, Y}

T5                             {C, O, K, I, E}

From the above calculations, it’s clear that O appears the maximum number of times with K and E. Hence, the set of three items that appear the most together is K, E and O.

(NOTE: The above data table and the algorithm steps are courtesy of Source 1)

 

Program Output:

Implementing Apriori Algorithm

Step 1: Counting Transaction for Each Item:

Item           Number of Transactions

M                3

O                3

N                2

K                 5

E                 4

Y                 3

D                 1

A                 1

U                1

C                 2

I                  1

 

Step 2: Discarding Items with less than 3 transactions:

Item           Number of Transactions

M                3

O                3

K                 5

E                 4

Y                 3

 

Step 3:

The top two items that are the most transacted are K and   E

 

Step 4:

The contenders for the third item that is transacted most with K and E are:

M

O

Y

M appears with K and E  5 times

O appears with K and E  6 times

Y appears with K and E  5 times

The set of three items that appear the most together are KEO