Association rule mining is a common data mining technique. It refers to the process of finding frequent patterns and associations among items or objects in transaction databases, relational databases, and other data sets. The method is widely used in market basket analysis: it was originally applied to discover relationships between items in supermarket sales databases, the most famous example being the "beer and diapers" story.

It was observed that many customers who bought diapers also bought beer, so placing diapers and beer together increased sales of both, and placing potato chips between them increased sales of all three.

This phenomenon raises a natural question: which goods are customers likely to buy together in a single shopping trip?

Basic concepts

First, we have a transactional database:

  • The objects under study are called items. In the market basket analysis setting, each product a customer purchases is an item
  • The set of all items: $I = \{i_1, i_2, \dots, i_m\}$
  • The itemset corresponding to each transaction $t_i$ is a subset of $I$
  • Any subset of $I$ is called an itemset: $X = \{i_{j1}, i_{j2}, \dots, i_{jp}\}$
  • The transaction database $D$ is a collection of transactions: $D = \{t_1, t_2, \dots, t_m\}$
  • The number of items contained in an itemset is called its length, and an itemset of length k is called a k-itemset
  • The number of occurrences of an itemset X in database D is called its frequency, denoted count(X)

Support

The support of an itemset X is the fraction of transactions in the data set that contain the itemset.


$$support(X) = \frac{count(X)}{|D|}$$

Given a minimum support threshold minsup, if support(X) >= minsup, then X is called a frequent itemset, or simply X is said to be frequent.
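As a minimal sketch of the definition (the helper name and toy transactions below are invented just for illustration), support can be computed directly from a list of transactions:

```python
from typing import FrozenSet, List

def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

# Toy transactions, invented just to show the calculation
transactions = [
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer", "chips"}),
]
print(support(frozenset({"beer", "diapers"}), transactions))  # 0.5 (2 of 4 transactions)
```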

Confidence

Confidence is defined for an association rule X -> Y: it is the fraction of transactions containing X that also contain Y.


$$conf(X \rightarrow Y) = \frac{|XY|}{|X|} = \frac{sup(XY)}{sup(X)}$$
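Continuing the sketch above (reusing the hypothetical `support` helper and `transactions` list), confidence is simply a ratio of supports:

```python
from typing import FrozenSet, List

def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """conf(X -> Y) = sup(X union Y) / sup(X)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Of the 3 toy transactions containing diapers, 2 also contain beer
print(confidence(frozenset({"diapers"}), frozenset({"beer"}), transactions))  # ≈ 0.667
```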

Apriori algorithm

The Apriori algorithm is a classic association rule mining algorithm and the most commonly used algorithm for mining frequent itemsets. Its main steps are as follows:

  1. Find all frequent itemsets: all itemsets whose support >= minsup
    1. Count the support of each candidate k-itemset and keep the frequent k-itemsets $L_k$
    2. Generate the (k+1)-candidate set $C_{k+1}$ from the frequent k-itemsets, set k = k + 1, and repeat
  2. Generate possible association rules from each frequent itemset

For example, take the minimum support minsup = 2.

Described in words, the algorithm flow is as follows (a minimal code sketch of this loop follows the list):

  1. Scan the transaction database (TDB), count the support of each candidate item, and obtain the candidate table C1
  2. Compare the support of each candidate in C1 against the minimum support to obtain the frequent 1-itemsets L1
  3. Generate the candidate set C2 from L1
  4. Compare the supports of the candidates in C2 against the minimum support to obtain L2
  5. Generate C3 from L2
  6. Obtain the frequent 3-itemsets L3 from C3
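The sketch below illustrates this level-wise loop under the same assumptions as the earlier snippets: it reuses the hypothetical `support` helper and toy `transactions` list, and generates candidates by a naive pairwise join (without the pruning step discussed next). With minsup expressed as a fraction, 0.5 corresponds to "2 of 4 transactions".

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Level-wise search: C1 -> L1 -> C2 -> L2 -> ... using a naive pairwise join."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]  # C1
    frequent = {}  # maps each frequent itemset to its support
    k = 1
    while candidates:
        # Lk: keep the candidates whose support meets the threshold
        lk = {}
        for c in candidates:
            s = support(c, transactions)
            if s >= minsup:
                lk[c] = s
        frequent.update(lk)
        # C(k+1): unions of two frequent k-itemsets that contain exactly k+1 items
        candidates = {a | b for a, b in combinations(lk, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

print(apriori_frequent_itemsets(transactions, minsup=0.5))
```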

Following the above process, we also find that the algorithm generates too many candidate combinations at each step. We can prune them by exploiting properties of frequent itemsets:

  • Any subset of a frequent itemset must also be frequent
  • Apriori pruning rule: if an itemset is infrequent, then every superset of it is also infrequent, so those supersets need not be generated or tested

As shown in the figure, AB is infrequent, so all of its supersets (circled by the red dotted line) are infrequent and can be pruned. The method of generating the candidate set can therefore be summarized as follows.

Assume that the items within each itemset of $L_k$ are ordered:

  1. Combine the itemsets in $L_k$ pairwise to generate $C_{k+1}$
  2. Pruning: discard any candidate whose k-subsets are not all frequent (see the sketch below)
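A minimal sketch of this join-and-prune step, under the same assumptions as the earlier snippets (itemsets represented as frozensets):

```python
from itertools import combinations

def generate_candidates(lk, k):
    """Join frequent k-itemsets pairwise, then apply Apriori pruning."""
    # Join step: unions of two frequent k-itemsets that form a (k+1)-itemset
    joined = {a | b for a, b in combinations(lk, 2) if len(a | b) == k + 1}
    # Prune step: every k-subset of a surviving candidate must itself be frequent
    return {c for c in joined
            if all(frozenset(sub) in lk for sub in combinations(c, k))}

# L2 from the toy data above; {beer, diapers, chips} is joined but then pruned,
# because its subset {diapers, chips} is not frequent
l2 = {frozenset({"beer", "diapers"}), frozenset({"beer", "chips"})}
print(generate_candidates(l2, 2))  # set()
```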

Generating association rules

The above describes how to generate frequent itemsets; the next step is generating association rules. For each frequent itemset l, generate every non-empty proper subset s; if s satisfies the minimum confidence, that is:


$$confidence((l - s) \rightarrow s) = \frac{sup(l)}{sup(l - s)} \ge minconf$$

then output the association rule (l - s) -> s.
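A minimal sketch of this rule generation step, again reusing the hypothetical helpers above (`frequent` is a dict from itemset to support, as returned by `apriori_frequent_itemsets`):

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """For each frequent itemset l, emit (l - s) -> s whenever sup(l) / sup(l - s) >= minconf."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                antecedent = l - s
                conf = sup_l / frequent[antecedent]  # antecedent is frequent by the Apriori property
                if conf >= minconf:
                    rules.append((antecedent, s, conf))
    return rules

frequent = apriori_frequent_itemsets(transactions, minsup=0.5)
for antecedent, consequent, conf in generate_rules(frequent, minconf=0.8):
    print(set(antecedent), "->", set(consequent), f"conf={conf:.2f}")
```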

As shown in the figure, with minimum confidence minconf = 80%, the following association rules can be obtained from the frequent itemset BCE; the rules whose confidence exceeds 80% are BC->E and CE->B.

Lift

Now that we have an algorithm for finding frequent itemsets and association rules, how do we make sure that the computed support and confidence are actually meaningful? Here's an example:

While playing basketball may seem intuitively uncorrelated with eating oatmeal, the rule basketball -> oatmeal appears to have high support and confidence based on the calculations above. This shows that support and confidence do not always accurately reflect the correlation between items, so a new measure, lift, was introduced.


$$lift = \frac{P(A \cup B)}{P(A)P(B)} = \frac{conf(A \rightarrow B)}{sup(B)}$$

If lift > 1, then A promotes B (positive correlation); if lift < 1, A suppresses B.
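A minimal sketch of the lift calculation, reusing the hypothetical `confidence` and `support` helpers defined earlier:

```python
def lift(antecedent, consequent, transactions):
    """lift(A -> B) = conf(A -> B) / sup(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# ≈ 1.33 for the toy data above: chips and beer co-occur more often than chance would suggest
print(lift(frozenset({"chips"}), frozenset({"beer"}), transactions))
```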

The Apriori workflow can then be summarized in four steps:

  1. Set minsup and minconf
  2. Find frequent itemsets
  3. Find the rules with conf >= minconf based on the frequent itemsets
  4. Calculate the lift of the rules found in the previous step and select those with the highest lift

Because the Apriori algorithm can generate a very large number of candidate sets, a newer algorithm, FP-growth, was developed. It builds on the ideas of Apriori but uses a more advanced data structure (the FP-tree) to reduce the number of database scans, which greatly speeds things up. FP-growth only needs to scan the database twice, whereas Apriori scans the data set once for each level of candidate itemsets to determine which patterns are frequent, so FP-growth is faster than Apriori. We'll cover that algorithm later. The above covers the basics of the Apriori algorithm; a follow-up post will take the time to write a simple Apriori implementation, so let me set a flag here ~