Introduction to Association Rules

Association rule mining discovers items in a data set and the relationships between them, and it has many application scenarios in everyday life. “Basket analysis” is a common one: from consumer transactions we can mine the relationships between products, and then drive more sales through product bundles or related-item recommendations.

In this article we will cover:

  1. Several important concepts in association rules: support, confidence, and lift
  2. How the Apriori algorithm works
  3. How we should mine association rules in practice

Important concepts in association rules

Take supermarket shopping as an example. Here is a list of items purchased by several customers:

Order no.  Items purchased
1 Milk, bread, diapers
2 Coke, bread, diapers, beer
3 Milk, diapers, beer, eggs
4 Bread, milk, diapers, beer
5 Bread, milk, diapers, coke

Support

Support is a percentage: the ratio of the number of orders containing a given combination of items to the total number of orders. The higher the support, the more frequently the combination occurs.

We can see that beer appears in 3 of the 5 orders, so the support of beer is 3/5 = 0.6. Similarly, diapers appear in all 5 orders, so the support of diapers is 5/5 = 1. Diapers and beer appear together in 3 orders, so the support of {diapers, beer} is 3/5 = 0.6.

Confidence

Confidence is the probability that a customer who buys item A also buys item B.

Looking at the orders above: the number of orders containing both diapers and beer is 3, and the number of orders containing beer is 3, so confidence (beer → diapers) = 3/3 = 1.

The number of orders containing both beer and diapers is 3 and the number of orders containing diapers is 5, so confidence (diapers → beer) = 3/5 = 0.6.

Lift

Lift measures the degree to which “the appearance of item A increases the probability that item B is purchased”. When we make product recommendations, lift is therefore the measure we focus on.

Let's look at the formula for lift:

Lift (A → B) = Confidence (A → B) / Support (B)

Now let's calculate how much diapers lift beer: Lift (diapers → beer) = Confidence (diapers → beer) / Support (beer) = 0.6 / 0.6 = 1, which makes sense, because diapers appear in every order, so buying diapers tells us nothing extra about beer. To see a lift greater than 1, look at eggs and beer instead: Confidence (eggs → beer) = 1/1 = 1 and Support (beer) = 0.6, so Lift (eggs → beer) = 1 / 0.6 ≈ 1.67.

How to interpret 1.67?

  1. Lift (A → B) > 1: buying A increases the probability of buying B (there is a lift)
  2. Lift (A → B) = 1: buying A makes no difference; A and B are independent
  3. Lift (A → B) < 1: buying A decreases the probability of buying B

In fact, we can see this directly from the lift formula: the more often A and B appear together, and the lower the overall support of B on its own, the larger the lift. In other words, if B tends to appear only when A also appears, then A lifts the probability of B.
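To make the three measures concrete, here is a minimal Python sketch (the data is the order list above; the helper names are just for illustration) that computes support, confidence, and lift:

orders = [
    {'milk', 'bread', 'diapers'},
    {'coke', 'bread', 'diapers', 'beer'},
    {'milk', 'diapers', 'beer', 'eggs'},
    {'bread', 'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers', 'coke'},
]

def support(itemset):
    # fraction of orders that contain every item in the itemset
    return sum(itemset <= order for order in orders) / len(orders)

def confidence(a, b):
    # P(B | A): of the orders containing A, the fraction that also contain B
    return support(a | b) / support(a)

def lift(a, b):
    # how much A raises the probability of B compared with B's base rate
    return confidence(a, b) / support(b)

print(support({'diapers', 'beer'}))       # 0.6
print(confidence({'beer'}, {'diapers'}))  # 1.0
print(confidence({'diapers'}, {'beer'}))  # 0.6
print(lift({'diapers'}, {'beer'}))        # 1.0
print(lift({'eggs'}, {'beer'}))           # approximately 1.67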

How Apriori works

Let's take a look at how Apriori, the classic association rule algorithm, works.

The Apriori algorithm is essentially a process of finding frequent itemsets, so we first need to understand what a frequent itemset is.

A frequent itemset is an itemset whose support is greater than or equal to the minimum support threshold; an itemset whose support falls below the threshold is an infrequent itemset.

Here’s an example:

Suppose we arbitrarily set the minimum support to 0.2; in this example we keep an itemset only if its support exceeds this threshold, i.e. it must appear in at least two of the five orders. First, we calculate the support of each individual item:

Item  Support
milk 4/5
bread 4/5
diapers 5/5
coke 2/5
beer 3/5
egg 1/5

Eggs appear in only one order (support 1/5 = 0.2), which does not exceed the minimum support, so eggs do not belong to any frequent itemset. After filtering, the frequent single items are as follows:

Item  Support
milk 4/5
bread 4/5
diapers 5/5
coke 2/5
beer 3/5

On this basis, we combine the remaining items in pairs and compute the support of each pair:

Items  Support
Milk, bread 3/5
Milk, diapers 4/5
Milk, Coke 1/5
Milk, beer 2/5
Bread, diapers 4/5
Bread, Coke 2/5
Bread, beer 2/5
Diapers, Coke 2/5
Diapers, beer 3/5
Coke, beer 1/5

After filtering, keeping only the pairs whose support is greater than the minimum support (0.2):

Items  Support
Milk, bread 3/5
Milk, diapers 4/5
Milk, beer 2/5
Bread, diapers 4/5
Bread, Coke 2/5
Bread, beer 2/5
Diapers, Coke 2/5
Diapers, beer 3/5

On this basis, we combine the items in groups of three and compute the support of each triple:

Items  Support
Milk, bread, diapers 3/5
Milk, bread, Coke 1/5
Milk, bread, beer 1/5
Milk, diapers, Coke 1/5
Milk, diapers, beer 2/5
Milk, Coke, beer 0/5
Bread, diapers, Coke 2/5
Bread, diapers, beer 2/5
Bread, Coke, beer 1/5
Diapers, Coke, beer 1/5

After filtering, keeping only the triples whose support is greater than the minimum support (0.2):

Items  Support
Milk, bread, diapers 3/5
Milk, diapers, beer 2/5
Bread, diapers, Coke 2/5
Bread, diapers, beer 2/5

On this basis, we combine the items in groups of four and compute the support of each 4-itemset:

Items  Support
Milk, bread, diapers, Coke 1/5
Milk, bread, diapers, beer 1/5
Milk, bread, Coke, beer 0/5
Milk, diapers, Coke, beer 0/5
Bread, diapers, Coke, beer 1/5

None of the 4-item combinations exceeds the minimum support, so the algorithm ends, and the result from the previous round is the set of frequent itemsets we are looking for: {milk, bread, diapers}, {milk, diapers, beer}, {bread, diapers, Coke} and {bread, diapers, beer}.

Let’s summarize the Apriori algorithm process above:

  1. Set K = 1 and calculate the support of every K-itemset
  2. Filter out the itemsets whose support is below the minimum support
  3. If no itemsets remain, the frequent (K-1)-itemsets from the previous round are the final result
  4. Otherwise set K = K + 1 and repeat steps 1-3

A rough sketch of this loop follows.
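It is a brute-force illustration, not an optimized implementation; the transaction format and helper names are assumptions for this example, and candidates are formed from all items rather than only the frequent ones:

from itertools import combinations

orders = [
    {'milk', 'bread', 'diapers'},
    {'coke', 'bread', 'diapers', 'beer'},
    {'milk', 'diapers', 'beer', 'eggs'},
    {'bread', 'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers', 'coke'},
]

def apriori(transactions, min_support):
    def support(itemset):
        # fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = sorted(set().union(*transactions))
    k, previous = 1, []
    while True:
        # step 1: candidate K-itemsets
        candidates = (frozenset(c) for c in combinations(items, k))
        # step 2: keep only the itemsets above the minimum support
        current = [c for c in candidates if support(c) > min_support]
        # step 3: nothing left, so the (K-1)-itemsets are the final result
        if not current:
            return previous
        # step 4: otherwise move on to K + 1
        previous, k = current, k + 1

for itemset in apriori(orders, 0.2):
    print(sorted(itemset))   # prints the four frequent 3-itemsets found above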

We can see that Apriori has the following shortcomings in the calculation process:

  1. It can generate a huge number of candidate sets, because it enumerates every possible combination of items
  2. Every round requires rescanning the data set to compute the support of each candidate itemset

This is like a “full table scan” query in a database, which is very IO- and time-consuming. In databases we know how to use indexes to retrieve data quickly, but can Apriori be optimized in a similar way?

Improved algorithm for Apriori: FP-growth algorithm

The FP-growth algorithm builds on the Apriori principle. It discovers frequent itemsets by storing the data set in an FP tree, but it does not directly discover association rules between items. FP-growth only needs to scan the database twice, whereas Apriori scans the data set once for every candidate frequent itemset, so FP-growth is far more efficient. The algorithm discovers frequent itemsets in two steps: (1) construct the FP tree; (2) mine frequent itemsets from the FP tree.

Create the header table

Enough with the concepts, which are not here to pad the word count — let's get straight to work! Suppose we mine frequent itemsets from the following data (the same five transactions used in the code example at the end of this article):

Transaction  Items purchased
1  A, B
2  B, C, D
3  B, C
4  A, B, C
5  A, B, C, D

First, create the item header table. In this step we scan the data set once, delete the single items that do not meet the minimum support (assume the minimum support is 0.2), and sort the remaining items by support from high to low. For the data above, every item survives and the ordering is B (5), C (4), A (3), D (2).
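Here is a small sketch of this first scan over the transactions listed above (variable names are only illustrative):

from collections import Counter

transactions = [['A', 'B'], ['B', 'C', 'D'], ['B', 'C'],
                ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]
min_support = 0.2

# first scan: count each single item, drop the infrequent ones,
# and sort the survivors by count from high to low
counts = Counter(item for t in transactions for item in t)
min_count = min_support * len(transactions)
header_table = {item: n for item, n in counts.most_common() if n >= min_count}
print(header_table)   # {'B': 5, 'C': 4, 'A': 3, 'D': 2}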

Build the FP tree

The whole process is to scan the data set a second time. For each transaction, insert its items in descending order of support (that is, the ordering given by the item header table in step 1), walking down the tree: if a node already exists, increment its count by 1; if it does not exist, create it. At the same time, the linked list in the item header table is updated during creation, so that each item points to all of its nodes in the tree.

First, sort the items in each row of the original data by support from high to low; the data then becomes:

B, A
B, C, D
B, C
B, C, A
B, C, A, D

Let's insert each of the above rows into the FP tree in turn, noting that the root node of the FP tree is a NULL node. After the first row {B, A} is inserted, the tree contains the single path NULL → B:1 → A:1.

Then insert the second row of data. Its first item is also B, and B already exists in the tree, so we keep the existing B node and increment its count by 1; C and D are new, so a new branch C:1 → D:1 is created under B.

Similarly, read the remaining three rows of data into the FP tree.

The resulting FP tree is as follows: the NULL root has a single child B:5; under B there are two children, A:1 and C:4; under C there are two children, D:1 and A:2; and under that A node there is one child, D:1.
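For reference, here is a minimal sketch of this second scan over the same transactions, building the tree and the header-table links (class and variable names are only illustrative):

transactions = [['A', 'B'], ['B', 'C', 'D'], ['B', 'C'],
                ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]
order = {'B': 0, 'C': 1, 'A': 2, 'D': 3}   # header-table order, support descending

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

root = Node(None, None)   # the root of the FP tree is a NULL node
header_links = {}         # item -> list of its nodes (the header table's linked list)

for t in transactions:
    node = root
    # insert the items of each transaction in header-table order
    for item in sorted(t, key=order.get):
        if item in node.children:
            node.children[item].count += 1            # node exists: increment
        else:
            node.children[item] = Node(item, node)    # new branch
            header_links.setdefault(item, []).append(node.children[item])
        node = node.children[item]

def show(node, depth=0):
    # print the tree as "item:count", indented by depth
    for child in node.children.values():
        print('  ' * depth + f'{child.item}:{child.count}')
        show(child, depth + 1)

show(root)
# B:5
#   A:1
#   C:4
#     D:1
#     A:2
#       D:1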

Mining frequent itemsets from the FP tree

We have finally built the FP tree, so how do we read it? After obtaining the FP tree, we mine the frequent itemsets for each frequent item one by one. The process is as follows: first, obtain the prefix paths of the frequent item; then treat those prefix paths as a new data set and build the conditional FP tree of the item; then, for each frequent item in that conditional FP tree, obtain its prefix paths and build a new conditional FP tree. Iterate until the conditional FP tree contains only one frequent item.

The FP tree is read from the bottom up, that is, we start from a child node and follow its parent pointers upwards.

First, let's look at the frequent itemsets containing A. A appears at two nodes in the tree: A:1 (directly under B) and A:2 (under C).

Call these two branches tree 1 and tree 2. Looking at tree 2 (the A:2 node) first, we get the prefix path {B:2, C:2}, where the count 2 is determined by the number of occurrences of A at that node. We then create a conditional FP tree from this prefix path; the construction process is the same as above.

Note that the header table of this conditional FP tree contains two elements, so for each element we obtain its prefix path and build a new conditional FP tree from it, until the conditional FP tree contains only one element.

  • For element B, the prefix path is {}, so the frequent itemset {A:2, B:2} is returned;

  • For element C, the prefix path {B:2} is obtained, and a conditional FP tree (a single node B:2) is built from it.

    Note that this conditional FP tree contains only one element, so the frequent itemset {A:2, C:2, B:2} is returned. Since element C is itself a frequent item, {A:2, C:2} is also a frequent itemset.

Adding {A:2}, which is itself a frequent itemset, the frequent itemsets corresponding to A in this branch are: {A:2}, {A:2, C:2}, {A:2, B:2}, {A:2, C:2, B:2}.

Similarly, let's look at tree 1 (the A:1 node), which is simpler and has only one prefix path, {B:1}. Following the same method, this branch gives the frequent itemsets {A:1, B:1} and {A:1}.

To sum up, both branches contain the frequent itemsets {A, B} and {A}. Merging the two branches, the frequent itemsets containing A are: {A:3}, {A:3, B:3}, {A:2, C:2}, {A:2, C:2, B:2}. Written as (itemset): frequency, that is (A,): 3, (A, B): 3, (A, C): 2, (A, B, C): 2.

Similarly, following the same method, we can find in turn the frequent itemsets containing D, C, and B: (D,): 2, (C, D): 2, (B, D): 2, (B, C, D): 2, (C,): 4, (B, C): 4, (B,): 5.
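These counts are easy to double-check by brute force; the following short sketch simply counts every itemset that appears in at least two of the same five transactions:

from itertools import combinations
from collections import Counter

transactions = [['A', 'B'], ['B', 'C', 'D'], ['B', 'C'],
                ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]

# count every itemset that occurs in the data
counts = Counter()
for t in transactions:
    for k in range(1, len(t) + 1):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

# keep the itemsets that appear at least twice
frequent = {itemset: n for itemset, n in counts.items() if n >= 2}
print(frequent)   # the same frequent itemsets and counts as listed above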

Practice

These classic algorithms have already been implemented by many people, so we can simply use an existing implementation. For the FP-growth algorithm above, for example, there is a dedicated package called pyfpgrowth.

Code:

import pyfpgrowth as fp

itemsets = [['A', 'B'], ['B', 'C', 'D'], ['B', 'C'], ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]

# mine the itemsets that appear at least 2 times
patterns = fp.find_frequent_patterns(itemsets, 2)
print(patterns)

Output:

{('D',): 2, ('C', 'D'): 2, ('B', 'D'): 2, ('B', 'C', 'D'): 2, ('A',): 3, ('A', 'B'): 3, ('A', 'C'): 2, ('A', 'B', 'C'): 2, ('C',): 4, ('B', 'C'): 4, ('B',): 5}
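If you also want the association rules themselves rather than just the frequent patterns, pyfpgrowth provides generate_association_rules; here is a quick sketch (the 0.7 minimum confidence is just an example value):

import pyfpgrowth as fp

itemsets = [['A', 'B'], ['B', 'C', 'D'], ['B', 'C'], ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]
patterns = fp.find_frequent_patterns(itemsets, 2)

# derive the rules whose confidence is at least 0.7 from the frequent patterns
rules = fp.generate_association_rules(patterns, 0.7)
print(rules)   # e.g. ('A',) maps to (('B',), 1.0), because every order with A also has B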

A lot of people will ask: if using the algorithm takes only two lines of code, why bother studying how it works instead of just writing (or copying) the code? Let me give an example. Back in the third grade of primary school, there was a math test on which the teacher set a bonus question:

Bonus question (10 marks): 1 + 2 + 3 + … + 99 + 100. Hint: refer to the formula for the area of a trapezoid.

Of course, this question is easy for a junior high school student, but it is still quite hard for a third-grade primary school student. However, I knew the formula for the area of a trapezoid:

Trapezoid area = (top base + bottom base) × height ÷ 2

So, the sum can be calculated accordingly:

1 + 2 + 3 + ... + 99 + 100 = (1 + 100) × 100 ÷ 2 = 5050

In fact, algorithm principles work the same way: once we understand them, we can apply them by analogy, rather than only in the scenario where we first learned them.
