Writing above: Dealing with real data sets (i.e., changing time), how to mine and analyze these dynamic data sets or the same data sets at two different times is a valuable problem; Therefore, this paper proposes a new concept called emerging pattern to represent the items with significant changes in support, and designs a more efficient classification algorithm based on this concept

Efficient Mining of Emerging Patterns: Discovering Trends and Differences

motivation

In temporal database, it is usually used to study the changes of item sets. For example, in the financial field, trend prediction can be made by analyzing the changes of item sets. Another example is to compare the changes of different categories of items, comparative analysis can be carried out; Or when the total base is too large, there is no obvious change in the proportion of the total data, but the actual value changes greatly. According to the above requirements, this paper proposes emerging Pattern (EP) to quantify these changes

define

  • Item: The smallest unit in a data set, indivisible and represented by xix_ixi

  • Itemset: consists of a number of different items, represented by XXX, where kkk-itemsetitemsetitemset indicates that the itemset contains KKK items

  • Large/Small itemSet: Given a minimum support threshold of sigma \sigma sigma (in percentage form), If suppD (X) supp_ \ mathcal {D} suppD (X) (X) = countD (X) ∣ D ∣ p sigma \ frac {count_ \ mathcal {D} (X)} {\ mid \ mathcal {D} \ mid} \ ge \sigma∣ countD(X)≥σ, then use Largeσ(D)Large_\sigma(\mathcal{D})Largeσ(D) to represent the set of all items in D\mathcal{D}D that conform to this inequality, On the other hand is a Small sigma (D) Small_ \ sigma (\ mathcal {D}) Small sigma (D)

  • Interval closed: No element in an interval is smaller than and larger than the left boundary. Largeσ(D)Large_\sigma(\mathcal{D})Largeσ(D) is a closed interval

  • Transaction: A transaction consisting of a number of different items that can be divided into multiple item sets, represented by TTT

  • Support: To measure whether an item/item set is a quantitative indicator that users want, SuppDsupp_ \mathcal{D}suppD = countD(X)∣D∣ frac{count_\mathcal{D}(X)}{\mid \mathcal{D} \mid}∣ countD(X), Where the numerator represents the total number of occurrences of XXX in data set D\mathcal{D}D, and the denominator represents the number of transactions contained in the data set; This paper uses the rate of change of support (GrowthRate) to measure whether an item/item set is needed

    • Supp1 (X)supp_1(X)supp1(X) supp1(X) supp2(X)supp_2(X)supp2(X) supp2(X) = 0
    • Supp1 (X)supp_1(X)supp1(X) = 0 and supp2(X)≠supp_2(X) \not=supp2(X)= 0, Then GrowthRate(X)GrowthRate(X)GrowthRate(X) = ∞\infty∞;
    • Otherwise, GrowthRate(X)GrowthRate(X) = supp2(X)supp1(X)\frac{supp_2(X)}{supp_1(X)}supp1(X)supp2(X)

    Among them, Supp1 supp_1 supp1 (X) (X) (X) and supp2 supp_2 (X) (X) supp2 (X) is XXX in the original data set respectively D1 \ mathcal {D} _1D1 and change the data set of D2 \ mathcal {D} _2D2 D2mathcal {D}_2D2 is lower than d1mathcal {D}_1D1

  • Since this algorithm considers the rate of change, it is more intuitive to describe the problem with two-dimensional coordinates, as shown in the following figure.

    According to the three conditions l1l_1l1, l2l_2l2 and l3l_3l3, we can get the interval of 1,2 and 3 EP.

    • Interval 1: it is lower than two preset thresholds at the same time, which means that the proportion of this item in the whole is very small, but it changes greatly. Unfortunately, there is no good way to solve this situation at present
    • Interval 2: The main object of this paper is to restrict the retrieval space (rectangle BCDGBCDGBCDG) by using the largest item sets of the two data sets as the boundary, so as to rapidly and accurately mine all EP; For example, sales of a certain item have increased dramatically
    • Interval 3: greater than two preset ground thresholds at the same time, which means that the base of this item is very large, and the change cannot be observed obviously only by numerical value, so it needs to be highlighted in the form of ratio
  • Border: An ordered pair such as < L\mathcal{L}L, R\mathcal{R}R> is called a border, where L\mathcal{L}L is the left boundary and R\mathcal{R}R is the right boundary (Ps. Note that these two boundaries are sets, not a number, so they can be regarded as minimum and maximum sets respectively); Therefore, the boundary of the closed interval [L\mathcal{L}L, R\mathcal{R}R] is < L\mathcal{L}L, R\mathcal{R}R>

    • Similarly, large/small sets of items can also be bounds, Use LargerBorderσ(D)LargerBorder_\ Sigma (\mathcal{D})LargerBorderσ(D) and SmallBorder sigma (D) SmallBorder_ \ sigma (\ mathcal {D}) SmallBorder sigma (D) said
    • Second, in the data set, SmallBorderσ(D)SmallBorder_\sigma(\mathcal{D})SmallBorderσ(D) must be empty (no smaller border can exist), To find all epS for this, just specify LargerBorderσ(D)LargerBorder_\sigma(\mathcal{D})LargerBorderσ(D)

algorithm

Border-Diff

The main purpose of this process is to derive a new boundary from the difference between a pair of boundaries

(minimal itemsets) After the first step of the basic version of border-diff, L \ mathcal {L} L = {{1}, {1, 4}, {1, 3}, {1, 3, 4}, {1, 2}, {1, 2, 4}, {1, 2, 3}, {2, 3, 4}} (note is written in the enumeration order to); The second step is to remove supersets (i.e., non-minimal itemsets) and keep the minimal subset (supersets can be obtained through the subset enumeration tree extension). Mathcal {L}L = {{1}, {2, 3, 4}}

MBD-LLBorder

reference

  1. Efficient Mining of Emerging Patterns: Discovering Trends and Differences