This is the ninth day of my participation in the Gwen Challenge.

Principle of the Apriori algorithm

The Apriori algorithm is a well-known algorithm for mining association rules.

Suppose we run a grocery store with a limited variety of items and we are very interested in which items are often bought together. We stock only four items: item 0, item 1, item 2, and item 3. What are all the combinations of items that might be bought together? A combination may contain a single item, such as item 0, or two, three, or all four items. We do not care whether someone bought two units of item 0 and four units of item 2; we only care whether each item was bought at all.

The figure below shows all possible combinations of the items:

  • In the figure, the number 0 denotes item 0 itself.
  • The first set going down the figure is $\phi$, the empty set, i.e., a set containing nothing.
  • Lines between item sets indicate that two or more sets can be combined to form a larger set.

**Our goal is to find the sets of items that are frequently purchased together. We measure how often a set occurs using the support of the set.**

The support of a set is the percentage of transaction records that contain that set.
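In symbols: for an itemset $X$ and a set of transaction records $D$, $\text{support}(X) = \frac{|\{T \in D : X \subseteq T\}|}{|D|}$, i.e., the number of records containing $X$ divided by the total number of records.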

Question: how do we calculate the support of a given set, such as {0,3}?

  • We can iterate through every record and check whether it contains both 0 and 3; if it does, we increment a counter. After scanning all the data, the support is that counter divided by the total number of transaction records, as sketched below.
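As a minimal sketch of that scan (the `transactions` list here is a made-up example, not the data set used later in this article):

```python
# Toy transaction records; each record lists the items bought together.
transactions = [[0, 1, 3], [0, 3], [1, 2], [0, 2, 3]]
target = {0, 3}

# Count the records that contain every item of the target set.
hits = sum(1 for record in transactions if target.issubset(record))
support = hits / len(transactions)
print(support)  # 3 of the 4 records contain both 0 and 3 -> 0.75
```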

**Note:** the procedure above yields the support of only the single set {0,3}. It has to be repeated for every possible set. Counting the sets in the figure below shows that even with just four items we need to traverse the data 15 times, and the number of traversals grows dramatically as the number of items increases: a data set containing $N$ items has $2^N - 1$ possible itemset combinations. There is no shortage of stores selling 10,000 items or more, and even a store that sells only 100 items has about $1.26 \times 10^{30}$ possible itemset combinations. That amount of computation would take even today's computers a very long time to complete.
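A quick back-of-the-envelope check of that growth (nothing here beyond the $2^N - 1$ formula above):

```python
# Number of non-empty itemset combinations for N distinct items is 2**N - 1.
for n in (4, 10, 100):
    print(n, 2 ** n - 1)
# 4   -> 15
# 10  -> 1023
# 100 -> 1267650600228229401496703205375  (the ~1.26 * 10**30 figure above)
```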

The principle behind the Apriori algorithm helps us reduce the number of itemsets that might be of interest and thus reduce the required computation time.

Apriori algorithm principle:

  • If an item set is frequent, then all subsets of it are frequent; for example, if {1,2} is frequent, then {1} and {2} must also be frequent.

  • Read in reverse (the contrapositive), the principle says that if an item set is infrequent, then all of its supersets are also infrequent.

Given that the item set {2,3} is infrequent, we can immediately conclude that {0,2,3}, {1,2,3}, and {0,1,2,3} are also infrequent, so there is no need to compute the support of these item sets.
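As a rough illustration of that pruning step (the `items` list, the `infrequent` set, and the size-3 candidates are made-up values for the {2,3} example above, not the code used later):

```python
from itertools import combinations

items = [0, 1, 2, 3]
infrequent = {frozenset({2, 3})}  # assume {2,3} has already been found infrequent

# Build size-3 candidates, then drop any candidate with a known infrequent subset.
candidates = [frozenset(c) for c in combinations(items, 3)]
pruned = [c for c in candidates
          if not any(bad.issubset(c) for bad in infrequent)]
print(pruned)  # {0,2,3} and {1,2,3} are skipped; only {0,1,2} and {0,1,3} remain
```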

The general process of the Apriori algorithm:

  1. Collect data: Use any method.
  2. Prepare data: Any data type will do, since we only store sets.
  3. Analyze data: Use any method.
  4. Train the algorithm: Use the Apriori algorithm to find frequent itemsets.
  5. Test the algorithm: No test procedure is required.
  6. Use the algorithm: Discover frequent itemsets and the association rules between items.

Data set scanning method:

```python
from numpy import *


def loadDataSet():
    '''
    Load the sample data set.
    :return: dataset
    '''
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


def createC1(dataSet):
    '''
    Build C1, the list of all candidate itemsets of size 1.
    :param dataSet: list of transaction records
    :return: list of single-item frozensets
    '''
    C1 = []
    for record in dataSet:
        for item in record:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # frozenset so the itemsets can be used as dictionary keys later
    return list(map(frozenset, C1))


def scanDataset(dataset, ck, minSupport):
    '''
    Scan the data set and determine which candidate itemsets are frequent.
    :param dataset: list of transaction records
    :param ck: list of all candidate itemsets of size k
    :param minSupport: minimum support threshold
    :return: (list of frequent itemsets, dict mapping each candidate to its support)
    '''
    # Count how many records contain each candidate itemset.
    selectedSetCount = {}
    for record in dataset:
        for candidateSet in ck:
            if candidateSet.issubset(record):
                if candidateSet not in selectedSetCount:
                    selectedSetCount[candidateSet] = 1
                else:
                    selectedSetCount[candidateSet] += 1

    # Keep the candidates whose support meets the threshold.
    numItems = float(len(dataset))
    retList = []
    supportData = {}
    for key in selectedSetCount:
        support = selectedSetCount[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData


if __name__ == '__main__':
    from pprint import pprint
    dataset = loadDataSet()
    c1 = createC1(dataset)
    pprint(scanDataset(dataset, c1, 0.5))
```
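With a minimum support of 0.5, this run should report {1}, {2}, {3}, and {5} as frequent single-item sets, while {4} appears in only one of the four transactions (support 0.25) and is filtered out.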