Apriori algorithm – Association Rule Mining

Filed Under: Python Advanced
Association Rule Mining Apriori 101

Today we’ll cover the Apriori Algorithm, which is used for Market Basket Analysis.

While it is often enough for an expert in any other traditional subject (like math or physics) to know their subjects well, it is definitely not so for the programmer or data scientist.

It is important to have sound logic, problem-solving skills, efficient programming skills, domain knowledge, and knowledge about algorithms.

Keeping that in mind, today I brought something special for you – Association Rule Mining (or Market Basket Analysis).

It has wide use in industry and is one of my favorite algorithms because of its simplicity and ingenuity. So let’s get right into the topic.

What is Market Basket Analysis?

Consider a scenarioYou are the CEO of a huge shopping store (like Amazon or Walmart).

You are looking for a way to increase your sales, with the least effort.

You have the transaction history of all your customers, and you observe that when people buy tea, 50% of the time they buy milk as well. Similarly, when customers buy a pair of trousers, they also tend to look for a shirt.

And you are hit with an idea! You direct your employees to shift the items that are bought frequently, much closer together. This way, customers are more inclined to consider buying these items together.

And your sales skyrocket… WIN – WIN.

This is exactly what is used by every online service nowadays – Netflix, Amazon, Flipkart… you name it. In fact, it is also used by TV channels ( specific ads along with IPL), insurance companies, etc. but mostly shopping services.

This is Market Basket Analysis. From a dataset of transactions, it is possible to find and extract “rules” for which items are sold together, and then the items can be grouped together for more profit.

There are many algorithms for association rule mining, but two of the most popular are:

  • Apriori algorithm
  • FP tree algorithm

Benefits of Market Basket

  • Identifying items that can also be bought together and coordinating the location of such items nearby (such as in a catalog or on a website) to enable the consumer to purchase all products.
  • An alternative use for the location of physical goods in a shop is to distinguish items that are frequently bought at the same time and allow customers to walk around the store and find what they are searching for to theoretically increase the possibility of additional sales with impulses.
  • At the same time, clients could be predisposed to purchase clustered goods. This helps the presentation of cross-selling goods or may mean that when such things are packaged together, consumers may be able to purchase more goods.
  • A company representative can also use market basket analysis to decide the best offers to give to maintain the business of the customer when consumers approach a business to break a relationship.

Implementing the Apriori Algorithm in Python

First off, we’re doing this directly from scratch so that you get the concepts. There are of course many packages like that you can use for actual datasets, but concepts are more important:

1. Create the dataset

So let’s create our transaction dataset!

df = pd.DataFrame()
df['Transaction id'] = ['T'+str(i) for i in range(9)]
df['items'] = [['A','B','E'],
               ['B','D'],
               ['B','C'],
               ['A','B','D'],
               ['A','C'],
               ['B','C'],
               ['A','C'],
               ['A','B','C','E'],
               ['A','B','C']]
Item Dataset For Mba
Item Dataset For MBA

2. Count each product in the sets (1-itemsets)

Next we find the counts of each candidate item:

items = []
counts = {}
for i in range(df.shape[0]):
  for item in df['items'][i]:
    if item not in items:
      items.append(item)
      counts[item]=1
    else:
      counts[item] +=1
Counts Candidates Apriori Itemsets
1 Itemset Counts – Apriori

Now, we check the counts against minimum support, which is our threshold. So, say our support is 2. This means we only consider items that have occurred two or more times.

3. Grouping the items together (2-itemsets)

We move on to the two-item groupings.

counts = {'AB':0,'AC':0,
          'AD':0,'AE':0,
          'BC':0,'BD':0,
          'BE':0,'CD':0,
          'CE':0,'DE':0}

for item in df['items']:
    print(item)
    if 'A' in item:
      if 'B' in item:
        counts['AB']+=1
      if 'C' in item:
        counts['AC']+=1
      if 'D' in item:
        counts['AD']+=1
      if 'E' in item:
        counts['AE']+=1
    if 'B' in item:
      if 'C' in item:
        counts['BC']+=1
      if 'D' in item:
        counts['BD']+=1
      if 'E' in item:
        counts['BE']+=1
    if 'C' in item:
      if 'D' in item:
        counts['CD']+=1
      if 'E' in item:
        counts['CE']+=1
    if 'D' in item:
      if 'E' in item:
        counts['DE']+=1
2 Itemset Counts Apriori
2 Itemset Counts – Apriori

These are called 2-itemsets. Similarly, next we shall find 3-itemsets.

But first, we check against our min. support, and since AD,CD,CE,DE don’t satisfy the condition, we can remove them from our list.

How this helps is, we will generate the 3-itemset for a group if and only if all of its subsets are present in the 2-itemset list.

4. Creating groups of 3 products (3-itemsets)

So our 3-itemset is only ABC and ABE

counts = {'ABC':0,'ABE':0}
for item in df['items']:
  print(item)
  if 'A' in item:
    if 'B' in item:
      if 'C' in item:
        counts['ABC']+=1
      if 'E' in item:
        counts['ABE']+=1

Then we get the counts as:

3 Itemset Counts Apriori
3 Itemset Counts – Apriori

Since no 4-itemsets can be created from these two items, we are done !

Conclusion

The items ‘A’, ‘B’ and ‘C’ are bought together with 2/9 probability, and the same goes for items ‘A’, ‘B’ and ‘E’.

Perhaps you can understand the benefit of this algorithm more from the story of Walmart, who used the Apriori algorithm to discover a strange occurrence:

Some time ago, Wal-Mart decided to combine the data from its loyalty card system with that from its point of sale systems.

The former provided Wal-Mart with demographic data about its customers, the latter told it where, when, and what those customers bought.

Once combined, the data was mined extensively and many correlations appeared.

Some of these were obvious; people who buy gin are also likely to buy tonic. They often also buy lemons.

However, one correlation stood out like a sore thumb because it was so unexpected.

On Friday afternoons, young American males who buy diapers (nappies) also have a predisposition to buy beer.

No one had predicted that result, so no one would ever have even asked the question in the first place. Hence, this is an excellent example of the difference between data mining and querying.

DSSResources.com, October 2002

I hope all of you enjoyed this article. I sure did. Bookmark the site and keep checking in.

Leave a Reply

Your email address will not be published. Required fields are marked *

close
Generic selectors
Exact matches only
Search in title
Search in content
Search in posts
Search in pages