Association rules can be thought of as an If-Then Relationship.
ARM(Association Rule Mining) is one of the important techniques in data science. In ARM, the frequency of patterns and associations in the dataset is identified among the item sets then used to predict the next relevant item in the set. This ARM technique is mostly used in business decisions according to customer purchases.
Suppose item A is being bought by the customer, then the chances of item B being picked by the customer too under the same transaction ID is found out.
For example, People who buy diapers are likely to buy baby powder. Or we can rephrase the statement by saying: If (People buy diaper), Then( they buy baby powder).
Note the if, then rule.
This does not necessarily mean that if people buy the baby powder, they buy diapers.In general, we can say that if condition A tends to B it does not necessarily mean that B tends to A. Watch the directionality!.
There are two elements of these rules:
The first is the — Antecedent (IF): This is an item/group of items that are typically found in the itemsets or Datasets.
The Second is the — Consequent (THEN): This comes along as an item with an Antecedent/Group of Antecedents.
There are Three Ways to Measure Association:
- Support: An indication of how frequent the itemset appears in the dataset
2. Confidence: An indication of how often the rules have been found to be true.
3. Lift: Greater lift values (>1) indicate stronger associations between X and Y and they depend on one another.
Now that we understand how to quantify the importance of association of products within an itemset, the next step is to generate rules from the entire list of items and identify the most important ones.
This is not as simple as it might sound.
Supermarkets will have thousands of different products in the store. After some simple calculations, it can be shown that just 10 products will lead to 50000+ rules!! And this number increases exponentially with the increase in the number of items. Finding Lift values for each of these will get computationally very expensive.
How to Deal with these Problems?? How to come up with a set of most important association rules to be considered??
Apriori algorithm comes to our rescue for this.
Apriori algorithm is a classical algorithm used in the field of data mining for the Association Rule Mining. It mines frequent itemsets and relevant association rules among relational databases that may contain a large number of transactions. It Builds on associations and correlations between the itemsets.
Apriori Algorithm is mainly applied in domains such as market basket analysis to help customers purchase products and services with ease and increase sales for the merchants.
Apriori uses a “bottom-up” approach, where frequent subsets are extended one item at a time(a step is known as a candidate), and groups of candidates are tested against the data. The algorithm terminates when no future successful extension is found.
Consider the following database, where each row is a transaction and each cell is an individual item of the transaction:
The association rules that can be determined from this database are the following:
- 100% of sets with alpha also contain beta.
- 50% of sets with alpha, beta also have epsilon.
- 50% of sets with alpha, beta also have theta.
we can also illustrate this through a variety of examples.
- The algorithm scans the database too many times, which reduces the overall performance. Due to this, the algorithm assumes that the database is permanent in memory.
- Time and Space complexity of this algorithm is very high.
- Algorithms such as Max_Miner try to identify the maximal frequent itemsets without enumerating their subsets and perform “jumps” in the search space rather than a purely bottom-up approach.