Write in the front: now I start to contact the relevant knowledge of data mining, starting from mining algorithm, this paper records some personal understanding of HUIminer algorithm literature translation, if there is any error, please correct.

Frequent mining

The principle of

All super sets of infrequent itemsets are infrequent, and all subsets of frequent itemsets are frequent. When an infrequent item set is found, its superset must also be infrequent, so the superset is not checked.

disadvantages

Only the occurrence and absence of item sets are counted, and many incidental values (commodity price, occurrence times, etc.) cannot be added into the calculation, resulting in the phenomenon of high value and low frequency, high frequency and low value (such as 32G memory bar sales and latiao sales).


Efficient itemset Mining (Hui-Miner)

Because the original efficient itemset mining pattern produces many candidate sets with low thresholds, the utility list is used to replace the role of candidate sets.

sample

eachIn the utility table () have corresponding utility values ()

Is aboutThe set of theta is called the utility set

: for each item in the transaction sheet, and each term is given byThe utility value of the transaction is allThe sum of the utility values of theta

:The utility value of

:In the transaction tableNumber of occurrences in the term

:, the transaction item setAll items in theThe sum of the total utilities of PI

:, the utility setsAll the elements ofIn the transaction tableThe total utility of

:, the utility item setIn the databaseThe total utility of

:, in the databaseAll of theinThe sum of the utility values of the terms

:, database DB contains the set X that appears in the utility setThe total utility value of all transaction items of the item

When twu(X) is less than the set threshold, all supersets of X are not valid itemsets, becauseThere are

Ps. In fact, I think this is the premise of the whole algorithm, because it solves the problem that the utility value of the previous item set is neither monotony nor anti-monotony, so that the whole item set is possible to sort, and generate the residual utility item set, and construct the residual utility list (the highlight of this article).

Constructing a utility list

The utility table is a new table that has a smaller set of items (because unary items with inefficient values have been removed) after the twU operation has been deleted and sorted.

Initialize the utility list (unary)

Scan the database for the first time: Generate the TWU (I) of all items. If the value of an item is lower than the minimum threshold, the item set containing the item is skipped in subsequent processing, the items below the threshold are deleted, and the items greater than the threshold are arranged in ascending order. In the example, the threshold is 30.

Scan the database for the second time: generate the database view as shown in the following figure

:, all terms after the item set X in T are represented by T/X (because the previous ones are already sorted), for example, refer to the table above (enumeration tree is easier to understand)

: is the sum of utility values of the remaining items excluding the common prefix item set in each transaction item, and the calculation formula is:

Supplement:

Where each utility list of set X contains three properties:

: The id of the transaction containing set X

:

:

Binary utility table

Instead of scanning the database again,

By comparing the tids in two utility lists and taking the intersection to generate a new multivariate utility list,

The first row of figure (a) is the subscript of the item containing {e} in the transaction table T, and the second row is the subscript of the item containing {c} in the transaction table T (compared with Fig.4).

Figure (b) where

  • Iutils of each item is the sum of the two Iutils, and Rutils is the small Rutils of the two items (as shown in Fig.5, since they are all sorted, the combination of the front item and the back item will inevitably lead to the reduction of the utility value of the remaining items (Rutils), and the remaining uncombined items are less).
  • The sum of the values of the Iutils column is the utility value of the item set
  • Adding the values of the iutils and rutils columns gives you the ru value of the item set’s extension

Multiple utility table

The specific method is similar to the binary calculation. Select items from the binary item set, such as {ec} and {eb} to generate {ECB}. However, it should be noted that there is an overlap {e} between the calculations (as shown in figure (a) below), so u(e, T) should be subtracted (as shown in Figure (b) below)

Here is the pseudocode for the multivariate utility table:

HUI – Miner algorithm

Search space

Using an enumeration tree structure, each node is sorted by twU value as follows:

  1. The root node is an empty set
  2. The first layer node is a unary item set
  3. Starting at the second level, add elements at the end of each node whose TWU value is greater than the twU value of the last element in the parent node. (You can simply think of the following nodes as being added to the previous nodes individually, since the twU values are ordered at each level.)
  4. Repeat step 3 until all nodes are generated (no duplicate nodes appear for each layer of nodes)

Pruning strategy

  1. An item set is an efficient item set if the sum of all iutils in the corresponding utility list is greater than the threshold.
  2. If the sum of all iutils and Rutils in the utility list corresponding to the item set is greater than the threshold, the item set needs further determination.
  3. If the sum of all iutils and Rutils in the utility list corresponding to the item set is less than the threshold, the item set and all its extensions are inefficient and pruned.

Algorithm implementation

At the same time, in order to improve efficiency and reduce the scanning times of utility list, all item sets represented by child nodes of a node in the collection enumeration tree have the same prefix item set. For this reason, we store extension items and prefix items separately. The table in Fig.7(b) can be constructed as follows:


Afterword.

Personal perception is because the former frequent mining model generates candidate sets according to anti-monotonicity, and then the concept of TWU is proposed to make up for the lack of consideration of the real value of items in the previous model. The Utility value of a single item (set) is non-monotone, so the Hui-Miner algorithm proposes TWU (Transaction Weighted Utilities), which has anti-monotone, and designs a data structure — Utility-list. The Utility value calculation around the Utility List is the core of the algorithm

Reference:

1.Mengchi Liu, Junfeng Qu: Mining High Utility Itemsets without Candidate Generation

2. Hui-miner: An efficient high-utility Itemset mining algorithm