Association Rule Mining in R:

Association Rule Mining (also known as Association Rule Learning) is a technique that is commonly used to find out associations between variables. It is used by e-commerce websites, grocery stores and other businesses with a large transactional database. A common example would be?—?Amazon knows what we might order next based on our previous purchases. We experience something similar in case of Spotify too?—?they know what song we might listen next based on the songs we have played previously. All of it is incorporated and at some level, data mining concepts and association rule mining algorithms are used.

To perform Association Rule Mining in R, we use the following packages in R.

> install.packages(“arules”)

> install.packages(“arulesViz”)

Generating Rules

Three parameters are there to control the number of rules that are to be generated viz. Support and Confidence. Another parameter is used known as Lift which is generated using Support and Confidence and is also one of the major parameters to filter the generated rules.

Support indicates how frequently the item set has appeared in the dataset.

Confidence indicates how often the rule has been found to be true.

The Apriori algorithm generates 15 rules with the given constraints. :

Minval : Least value of the support of an item set that needs to satisfy to be a part of a rule.

smax : Most support value for an item set.

arem : Additional Rule Evaluation Parameter. In the above code we have constrained the number of rules using Support and Confidence. There are several other ways to constrain the rules using the arem parameter in the function.

aval : logical indicating whether to return the additional rule evaluation measure selected with arem.

The traditional support value only considers both LHS and RHS items for calculating support. If you want to use only the LHS items for the calculation then you need to set this to FALSE.

maxtime : Most amount of time allowed to check for subsets.

minlen : least number of items required in the rule.

maxlen : Most number of items that can be present in the rule.

Limiting the number of rules generated

Quite a number of time, we would want to limit the number of rules generated. For eg, we can use association rules as predictors in Regression/Classification. We can generate rules with Right Hand Side of the rule as your response and use the rules generated as the modelling features. Here we would not want to use all the rules generated as the predictors because many rules are actually subsets of bigger rules and hence you would want to eliminate them.

So if we want stronger rules, we have to increase the confidence. If we want lengthier rules, then increase maxlen. If we need to eliminate the shorter rules, decrease minlen.

At times we might be interested to find the maximum number of items and then remove the shorter rules which are the subsets of the longer rules.

How to measure the strength of a rule?

From the transaction data the apriori() generates utmost relevant rules and also shows the support, confidence and lift related to those rules. So these three measures can be used to find the relative strength of the rules. So what are the means of these three terms?

Let’s consider the rule A =; B in order to compute these metrics.

Support=Number of transactions with both A and B/Total number of transactions=P(A?B)

Confidence=Number of transactions with both A and B/Total number of transactions with A=P(A?B)P(A)

Expected Confidence=Number of transactions with B/Total number of transactions=P(B)

Lift=Confidence/Expected Confidence=P(A?B)P(A).P(B)

Lift is the factor by which, the co-occurrence of A and B exceeds the expected probability of A and B co-occurring, had they been independent. So, higher the lift, higher the chance of A and B occurring together.

BUSINESS USE:

Example: (Source- datacamp.com)

Given is a set of transaction data. You can see transactions numbered 1 to 5. Each transaction shows items bought in that transaction. You can see that Diaper is bought with Beer in three transactions. Similarly, Bread is bought with milk in three transactions making them both frequent item sets. Association rules are given in the form as below:

A=;B Support,Confidence

The part before =; is referred to as if (Antecedent) and the part after =; is referred to as then (Consequent).

Where A and B are sets of items in the transaction data. A and B are disjoint sets.

Computer=;Anti?virus SoftwareSupport=20%,confidence=60%

Above rule says:

1. 20% transaction show Anti-virus software is bought with purchase of a Computer

2. 60% of customers who purchase Anti-virus software is bought with purchase of a Computer

In the following section you will learn about the basic concepts of Association Rule Mining:

Basic Concepts of Association Rule Mining

1. Itemset: Collection of one or more items. K-item-set means a set of k items.

2. Support Count: Frequency of occurrence of an item-set

3. Support (s): Fraction of transactions that contain the item-set

Support(A=;B)=P(AUB)=n(AUB)N

In words it is the number of transactions with both A and B divided by the total number of transactions. N is the total number of transactions.

Support(Bread=;Milk)=35=0.6=60%

Note: P(AUB) is the probability of A and B occurring together. P denotes probability.

Go ahead, try finding the support for Milk=;Diaper as an exercise.

1. Confidence (c): For a rule A=;B Confidence shows the percentage in which B is bought with A.

Confidence(A=;B)=P(AUB)P(A)=n(AUB)n(A)

The number of transactions with both A and B divided by the total number of transactions having A.

Confidence(Bread=;Milk)=34=0.75=75%

Now find the confidence for Milk=;Diaper.

Note: Support and Confidence measure how interesting the rule is. It is set by the minimum support and minimum confidence thresholds. These thresholds set by client help to compare the rule strength according to your own or client’s will. The closer to threshold the more the rule is of use to the client.

1. Frequent Itemsets: Item-sets whose support is greater or equal than minimum support threshold (min_sup). In above example min_sup=3. This is set on user choice.

2. Strong rules: If a rule A=>BSupport, Confidence satisfies min_sup and min_confidence then it is a strong rule.

3. Lift: Lift gives the correlation between A and B in the rule A=>B. Correlation shows how one item-set A effects the item-set B. A and B are independent if:

P(AUB)=P(A)P(B)P(AUB)=P(A)P(B)

otherwise dependent. Lift is given by:

Lift(A,B)=P(AUB)P(A)P(B)

So, higher the lift, higher the chance of A and B occurring together.

Goal of Association Rule Mining

When you apply Association Rule Mining on a given set of transactions T your goal will be to find all rules with:

1. Support greater than or equal to min_support

2. Confidence greater than or equal to min_confidence

APRIORI Algorithm

In this part we have discussed the algorithm that will be running behind R libraries for Market Basket Analysis. This will help in understanding the clients more and perform analysis with more attention.

Association Rule Mining is viewed as a two-step approach:

1. Frequent Itemset Generation: Find all frequent item-sets with support >= pre-determined min support count

2. Rule Generation: List all Association Rules from frequent item-sets. Calculate Support and Confidence for all rules. Prune rules that fail min_support and min_confidence thresholds.

Frequent Itemset Generation is the most computationally expensive step because it requires a full database scan.

Among the above steps, Frequent Item-set generation is the most costly in terms of computation.

Above in the example there are only 5 transactions, but in real-world transaction data for retail can exceed up to GBs and TBs of data for which an optimized algorithm is needed to remove out Item-sets that will not help in later steps. For this APRIORI Algorithm is used. It states:

Any subset of a frequent itemset must also be frequent. In other words, No superset of an infrequent itemset must be generated or tested

It is represented in Itemset Lattice which is a graphical representation of the APRIORI algorithm principle. It consists of k-item-set node and relation of subsets of that k-item-set.

In the above figure, in the bottom is all the items in the transaction data and then as we start moving upwards creating subsets till the null set. For d number of items size of the lattice will become 2d. This shows how difficult it will be to generate Frequent Item-set by finding support for each combination. The following figure shows how much APRIORI helps to reduce the number of sets to be generated:

If item-set {a,b} is infrequent then we do not need to take into account all its super-sets.

Eg.

One can start by creating Candidate List for the 1-itemset that will include all the items, which are present in the transaction data, individually. Considering retail transaction data from real-world, you can see how expensive this candidate generation is. Here APRIORI plays its role and helps reduce the number of the Candidate list, and useful rules are generated at the end. In the following steps, you will see how we reach the end of Frequent Itemset generation,that is the first step of Association rule mining.

The next step involves listing all frequent itemsets. Then take the last non-empty Frequent Itemset, which in this example is L2={I1, I2},{I2, I3}. Then make all non-empty subsets of the item-sets present in that Frequent Item-set List. Follow along as shown in below illustration: