Write above: This algorithm is the first one that can fully mine all efficient plots. It is slightly different from up-span in definition, but it is this little difference that leads to a big change in the accuracy of mining results

Utility-Driven Mining of High Utility Episodes

The sample

The utility value table

Complex sequence of events

define

Most of the basic definitions can refer to the UP-span algorithm, and the following are some supplementary and key definitions.

The Occurrence period is defined in the same way as in the up-span, which is added here. A complex plot of alpha = <} {A, D – > B > \ alpha = \ big < \ lbrace A, D \ \ rbrace \ rightarrow B big > alpha = ⟨ ⟩ {} A, D, B, It is called complex because it contains A set of parallel occurrences ({A,D}\lbrace A,D \rbrace{A,D}) and A set of serial occurrences ({A,D},{B}\lbrace A,D \rbrace, \lbrace B \rbrace{A,D},{B}), Period of time the incident set as occSet (alpha) occSet (\ alpha) occSet (alpha) = {(T1, T2), [T1, T5], [T1, T6], [T3 and T5], [T3, T6]} \ lbrace [T_1, T_2], [T_1, T_5], [T_1, T_6], [T_3, T_5], [T_3, T_6] \ rbrace {[T1, T2], [T1, T5], [T1, T6], [T3 and T5], [T3, T6]}, it is important to note that Events AAA and DDD must occur simultaneously (i.e., at the same point in time) or they are serial events, which is inconsistent with the original scenario α\alphaα
The Maximum time duration is similar to the sliding window in the early plot mining algorithm. It is a time constraint (MTD) that needs to be specified by the user, so as to avoid the increase of calculation amount and the decrease of correlation between events due to the long time span. For any event α\alphaα, the time span Te−Ts≤MTDT_{\rm e} -t_ {\rm s} \le MTDTe−Ts≤MTD, it is obvious that the minimum occurrence time span Mo (α) Mo (\alpha) Mo (α) always satisfies this inequality.

Ps. It should be noted here that the definition in up-span is Te−Ts+1≤MTDT_{\rm e} -t_ {\rm s} +1 \le MTDTe−Ts+1≤MTD. The reason why the UMEpi algorithm is designed this way is to provide a more accurate boundary and to be easier to understand
There are two formulas for calculating episoder-weighted utility:
- The UP – Span: The EWU (alpha, mo (alpha)) = ∑ I = 1 k – 1 u (Ei, Ti) + ∑ I = TeTs + MTDu (SEi, Ti) EWU (\ alpha, mo (\ alpha)) = \ sum_} {\ rm I = 1 ^ {} \ rm k – 1 u (E_ {I} \ rm, T_ {I} \ rm) + \ sum_ {\ rm I = T_e} ^ + MTD} {T_s u (SE_ {I} \ rm, T_ {I} \ rm) EWU (alpha, mo (alpha)) = ∑ I = 1 k – 1 u (Ei, Ti) + ∑ I = TeTs + MTDu (SEi, Ti)
- TSpan: The EWU (alpha, mo (alpha)) = ∑ I = 1 ku (Ei, Ti) + ∑ I = TeTs + MTDu (SEi, Ti) EWU (\ alpha, mo (\ alpha)) = \ sum_} {\ rm I = 1 ^ {} \ rm k u (E_ {I} \ rm, T_ {I} \ rm) + \ sum_ {\ rm I = T_e} ^ + MTD} {T_s u (SE_ {I} \ rm, T_ {I} \ rm) EWU (alpha, mo (alpha)) = ∑ I = 1 ku (Ei, Ti) + ∑ I = TeTs + MTDu (SEi, Ti)
The alpha = < (E1),… , > (Ek), Ti (1 I k or less) or less \ alpha = \ big < (E_1), \ ldots, (E_} {\ rm k) \ big > and \; T_{\rm I}(1 \le I \le k)α=⟨(E1),… ,(Ek)⟩,Ti(1≤ I ≤ K), TSpan double-counting event sets at point in time TeT_{\rm e}Te, resulting in inaccuracy of EWU, for example:

Set of alpha = <} {A, D – > B >, mo (alpha) = \ alpha = \ [T1, T2] big < \ lbrace A, D \ \ rbrace \ rightarrow B big > and \, mo (\ alpha) = [T_1, T_2]α=⟨{A,D}→B⟩,mo(α)=[T1,T2], the following calculation can be obtained

$UP-Span: EWU(\alpha, mo(\alpha)) = u(\lbrace A,D \rbrace, T_1)+u(\lbrace B,C \rbrace, T_2)+u(\lbrace A,D,E \rbrace, T_3) \\ TSpan: EWU(\alpha, mo(\alpha)) = u(\lbrace A,D \rbrace, T_1)+u(B, T_2)+u(\lbrace B,C \rbrace, T_2)+u(\lbrace A,D,E \rbrace, T_3)$

It is obvious that the event BBB occurring at the time point T2T_2T2 is calculated repeatedly, which leads to the larger calculation of the final result. UMEpi algorithm adopts the calculation method of UP-span

It should be noted that EWU is different from TWU and SWU in that it can only be seen as local pruning. Instead of the other two which can be pruned in place from the beginning (because of occurrence and expansion modes) for example MTD=2, minUtil=50%, TU=94MTD =2, \; minUtil = 50\%, \; TU = 94 MTD = 2, minUtil = 50%, TU = 94 plots of alpha = < A, B, D > \ alpha = \ big < A, B, D \ big > alpha = ⟨ A, B, D ⟩ EWU (alpha) EWU (\ alpha) EWU (alpha) = $24 an inefficient use, should be pruned, But its extension plot

\big

⟨E→B→{A,B,D}⟩($48), < > B > {D, E}} {A, B, D > \ big < \ lbrace D, E \ rbrace \ \ rightarrow \ lbrace rightarrow B A, B, D \ rbrace \ big > ⟨ > B > {D, E} {A, B, D} ⟩ ($51) It’s an efficient plot. Therefore, in the plot utility mining, there is a phenomenon that “inefficient plot can be expanded into efficient plot”, which is also the difficulty in the mining process
→b→{a,b,d}>
(Remaining Utility of an episode)ru(α,Ti)=∑e ‘hosted α∧α≺e’ u(e ‘,Ti)ru(\alpha, T_{\rm i}) = \sum_{\rm e^\prime \not\in \alpha \land \alpha \prec e^\prime}u(e^\prime, T_{\rm I})ru(α,Ti)=∑e ‘∈α∧α≺e’ u(e ‘,Ti) where e ‘e^\primee’ is a event after the event α\alphaα on the time point TiT_{\rm I}Ti, which is the key of EWU calculation, Since EWU can only be counted as a boundary value if residual scenario utility (a potentially extensible set of events) is taken into account, Moreover, it should be noted that the time span of calculating the sum of residual utility of a certain plot of Z in a complex sequence of events is Te+1≤Ti≤Ts+MTDT_{\rm e}+1 \le T_{\rm I} \le T_{\rm s} + MTDTe+1≤Ti≤Ts+MTD

According to the residual utility of the plot, we can optimize EWU and obtain the new formula:
$EWU(\alpha, mo(\alpha)) = u(\alpha, mo(\alpha)) + ru(E_{\rm k}, T_{\rm e}) + \sum_{\rm i=T_{e}+1}^{T_{\rm s}+MTD}tu(T_{\rm i}) \\ = \sum_{\rm i=1}^{\rm k}u(E_{\rm i}, T_{\rm i}) + ru(E_{\rm k}, T_{\rm e}) + \sum_{\rm i=T_{\rm e}+1}^{T_{\rm s}+MTD}tu(T_{\rm i})$
The Te Ti Ts] + MTDT_ or less or less {\} rm e \ le T_ {I} \ rm \ le T_ rm {\ s]} + MTDTe Ti Ts] + MTD or less or less
Given a complex event sequence SSS, maximum time MTD, and user-set threshold minUtil, When an event (set) in SSS satisfies u(α)≥minUtil×TUu(\alpha) \ge minUtil \times TUu(α)≥minUtil×TU, we call the event (set) HUE

strategy

Episode-Weighted Downward Closure Property

Lexicographic sequence tree (Lexicographic sequence tree) is based on the root node of the prefix tree is empty, and all the children of the parent node (also called the prefix node) are formed by concatenation (string, and). These children nodes are ordered between them

Ps. AndEHINThe spatial retrieval enumeration tree is consistent

In the dictionary sequence tree we can use u(γ)≤EWU(γ)≤EWU(α)≤minUtil×TUu(\gamma) \le EWU(\gamma) \le EWU(\gamma) \le EWU(\alpha) \le minUtil \times TUu(γ)≤EWU(γ)≤EWU(α)≤minUtil×TU, where γ\gammaγ is the extended plot α\alphaα (the extensions are divided into Simult-connection and serial-connection), The derivation can be proved by comparing their definitions

algorithm

The algorithm is also divided into two steps: 1) scan to generate first-order HUEs; 2) The Hue-span method is constantly called to generate high-order HUEs with low-order combination

The Span of SimultHUE procedure:

The Span of SerialHUE procedure:

conclusion

The main contributions of this paper are as follows: 1) It is demonstrated that low-weight plots can also be combined with high-weight plots to generate high-weight plots, which has not been fully considered by previous HUEM algorithms; 2) A more compact EWU is redefined using the residual utility value, which greatly improves the computing speed; 3) Compared with up-span algorithm, moSet(α)moSet(\alpha)moSet(α) stores all the occurrence subsequences of the plot α\alphaα (i.e., potentially combinable episodes).

UMEpi algorithm

Utility-Driven Mining of High Utility Episodes

The sample

define

strategy

Episode-Weighted Downward Closure Property

algorithm

conclusion

Related Posts

Five common neural networks (1)- feedforward neural network and its parameters learning and back propagation

Build and deploy an alphabet recognition system

In the next five years, the market segment in Taiwan will usher in great changes