I am participating in the Mid-Autumn Festival Creative Submission contest, please see: Mid-Autumn Festival Creative Submission Contest for details

[Association rules] Looking for the most popular moon cake collocation – Wu Ren moon cake still?

The Mid-Autumn Festival is coming, and mooncakes are the most controversial theme of the year. There are even five benevolence mooncakes “roll” out of the moon cake circle

Now moon cakes are also popular in the style of mixing and matching, all kinds of moon cakes together

So the topic of this article is mooncakes, and the purpose is to find out which mooncake combinations are the most popular among people

The data set used in this paper is the sales record of a mooncake shop of a merchant, and the record format is:

2018-81-0-16-551; U00107335; YLjcK475224;3 3 4 0;11;21;31
2018-21-29-9-31; U0010609; kHZJU943213;2 1 3;11;01;11
2018-71-6-16-491; U00104128; wIVVz889572;2 4;11;01;01
2018-51-5-15-561; U00109576; ZEspU652465;3 3 3 4;11;11;11
2018-21-26-14-131; U00101209; ELeMt778295;4 2 6 7;11;01;41
Copy the code

Each field with; The meaning of the fields is the order time, user name, order number, commodity number, payment method, city number, status code

The field we pay attention to is the commodity number, which contains the number of the moon cake purchased this time. Each number is for one type of moon cake, separated by a space

The meaning of the commodity number is as follows:

0Fruit moon cake1Ice skin moon cake2: Salted egg yolk mooncake4: Chocolate heart mooncake5Five kernel moon cake6Mooncake filled with chocolate7: Oreo filled mooncake...Copy the code

Operating environment:

jdk=1.8
scala=2.1212.
Copy the code
Language used: Scala100Article ten thousand the dataCopy the code

Method overview: This paper aims to introduce how to use Apriori(association rule) algorithm in data mining algorithm to realize a moon cake collocation recommendation. We will preprocess data, generate frequent item sets and association rules, and finally generate specific data of moon cake collocation through association rules.

First of all, the Apriori algorithm is introduced

Apriori algorithm

Example: Beer and diapers:

When Wal-Mart analyzed sales records, they found that beer and diapers were often bought together, so they rearranged the shelves to put the two together, which actually increased beer sales. Why: When a dad buys diapers for his baby, he buys himself a beer?

Overview: Apriori algorithm is one of the most influential mining Boolean association rules of frequent itemsets algorithm, its name Apriori derived from the algorithm using Prior knowledge of the nature of frequent itemsets. Next, we will use the supermarket order example to understand the important concepts related to association analysis: Support, Confidence, Lift.

Support: The probability of occurrence of an event, in this case, the ratio of the occurrence of a product combination to the total number of occurrences. Support(‘Bread’) = 4/5 = 0.8 Support(‘Milk’) = 4/5 = 0.8 Support(‘Milk’) = 3/5 = 0.6 Confidence(Confidence) : It’s essentially A conditional probability, which is the probability of buying good B if you buy good A. Example: Confidence(‘Bread’ – > ‘Milk’) = Support(‘Bread+Milk’)/ Support(‘Bread’) = 0.6/0.8 = 0.75 Lift: Refers to the degree to which the appearance of good A increases the probability of the appearance of good B. Lift(A->B) = Confidence(A, B)/Support(B) example: Lift(‘Bread’ -> ‘Milk’) = 0.75/0.8 = 0.9375

There are three cases of Lift:

  • Lift(A->B)>1: indicates that A increases the occurrence probability of B.
  • Lift(A->B)=1: indicates that A does not increase or decrease the occurrence probability of B.
  • Lift(A->B)<1: indicates that A reduces the occurrence probability of B.

Principle: The process of mining association rules by this algorithm is the process of finding frequent itemsets:

  • Frequent item set: item set whose Support is greater than or equal to the Min Support threshold.
  • Infrequent set: item set whose support is less than the minimum.

Process:

  • K = 1, calculate the support degree of K item set;
  • Filter out items less than the minimum support;
  • If the item set is empty, the result of the corresponding k-1 item set is the final result. Otherwise K = K+1 repeat steps 2-3

Data preprocessing

Get Gets the order number of the item

val data: Iterator[String] = fromFile("\\output\\data").getLines()
    while(data.hasNext){
      val line: String = data.next()
      val orderList: String = line.split(";") (3)
      val orderID: Array[String] = orderList.split("")}Copy the code

Filter out the abnormal data of the single number of the moon cake type

orderID.filter(elem => elem.toInt >=0 || elem.toInt<=5)// The number ranges from 0 to 5
Copy the code

Each order number is processed with a specific symbol -> and then saved for subsequent use

orderID.mkString("- >")
val out = new FileWriter("\\output\\lastdata")
      out.write(orderID.mkString("- >") +"\n")
      out.close()
Copy the code

Generating association rules

Main program entry

 val data = new AprioriAlgorithm(new File("\\output\\lastdata"))
    data.runApriori()
    println("===Support Items===")
    data.toRetItems.foreach(println)
    println("===Association Rules===")
    data.associationRules.foreach(println)
Copy the code

The core code

This part preprocesses the input data and saves the delimited data to the Set

 var transactions: List[Set[String]] = List(a)var itemSet: Set[String] = Set(a)for (line <- Source.fromFile(inputFile).getLines()) {
    val elementSet = line.trim.split("- >").toSet
    if (elementSet.size > 0) {
      transactions = transactions :+ elementSet
      itemSet = itemSet ++ elementSet
    }
  }
Copy the code

Calculates the support of item sets

  var toRetItems: Map[Set[String], java.lang.Double] = Map(a)var associationRules: List[(Set[String].Set[String].Double)] = List(a)def getSupport(itemComb: Set[String) :Double = {
    val count = transactions.filter(transaction => itemComb.subsetOf(transaction)).size
    count.toDouble / transactions.size.toDouble
  }
Copy the code

K starts from 2 and mines frequent item sets. After the mining of 2 item sets is completed, it mines 3 item sets until the end

def runApriori(minSupport: Double = 0.45, minConfidence: Double = 0.7) = {
    var itemCombs: Set[(Set[String].Double)] = Set(a)var currentCSet: Set[Set[String]] = itemSet.map(word => Set(word))
    var k: Int = 2
    breakable {
      while (true) {
        val currentItemCombs: Set[(Set[String].Double)] = currentCSet.map(wordSet => (wordSet, getSupport(wordSet)))
          .filter(wordSetSupportPair => (wordSetSupportPair._2 > minSupport))
        val currentLSet = currentItemCombs.map(wordSetSupportPair => wordSetSupportPair._1).toSet
        if (currentLSet.isEmpty) break
        currentCSet = currentLSet.map(wordSet => currentLSet.map(wordSet1 => wordSet | wordSet1))
          .reduceRight((set1, set2) => set1 | set2)
          .filter(wordSet => (wordSet.size == k))
        itemCombs = itemCombs | currentItemCombs  / / merge
        k += 1}}for (itemComb <- itemCombs) {
      toRetItems += (itemComb._1 -> itemComb._2)
    }
    calculateAssociationRule(minConfidence)
  }
Copy the code

The association rule minConfidence= 0.7 was calculated

def calculateAssociationRule(minConfidence: Double = 0.7) = {
    toRetItems.keys.foreach(item =>
      item.subsets.filter(wordSet => (wordSet.size < item.size & wordSet.size > 0))
        .foreach(subset => {
          associationRules = associationRules :+ (subset, item diff subset,
            toRetItems(item).toDouble / toRetItems(subset).toDouble)
        }
        )
    )
    associationRules = associationRules.filter(rule => rule._3 > minConfidence)
  }
}
Copy the code

Program parameter setting

Set the parameters minSupport = 0.45, minConfidence = 0.7

After filtering out an item set, the result of the run is as follows

===Support Items===
(Set(0.1),0.8755124487551245)
(Set(0.4),0.45055494450554945)
(Set(2.1),0.45495450454954506)
(Set(2.0),0.45465453454654536)
(Set(4.1),0.45395460453954606)
===Association Rules= = = (Set(4),Set(0),0.9235499077679853)
(Set(2),Set(1),0.9270578647106764)
(Set(2),Set(0),0.9264466177669112)
(Set(4),Set(1),0.9305185488829679)
(Set(0),Set(1),0.9349706353443673)
(Set(1),Set(0),0.932282793867121)
Copy the code

As can be seen from the results, the combination of fruit moon cake and ice skin moon cake is the most popular, with a support degree of 0.8755 and a confidence degree of 0.9349 0.9322, indicating that many users will buy these two kinds of moon cake, and if they buy one of them, there is a 90% probability that they will buy the other one, so merchants can put the two kinds of products together. Increased sales, salted egg yolk mooncakes and chocolate mooncakes are still very popular, with a support rating of more than 0.45

Unexpectedly, the traditional ice skin mooncake is so popular that it has become a companion with the novel mooncake like fruit mooncake

More and more kinds of moon cakes, more and more collocation, some collocation is “fancy”, I hope to increase the sales of goods through the above similar means, to achieve the purpose of using the Internet to life

In the result is not to see the trace of five kernel moon cake, it is really hidden ah, but the older generation or five kernel has special feelings, anyway, I do not like to eat

extension

Collaborative filtering:

In collaborative filtering, these users are called neighbors, and then organize them into a sorted directory to recommend to you what they like. The point of the problem is how to find users who are similar to you, and how to organize the preferences of those neighbors into a sorted directory for users. Collaborative filtering simply is to use a certain interests, share a common experience of the group’s preferences to recommend user information of interest, personal information through cooperation mechanism to give a fair degree of response (e.g., grade) and recorded in order to achieve the purpose of filtering and screening information to help others, have different response is limited to particularly interested in, Keeping track of particularly uninteresting information is also important.

This is the end of today’s sharing, the content is not very difficult, Mid-Autumn Festival is coming, what is your favorite moon cake?