Mining couples, bros, besties, cheating men and single dogs with Python association rules

Public account: You and the cabin | Author: Peter | Editor: Peter

Hello, I’m Peter

This article walks through one machine learning algorithm in action: association rule analysis.

The whole story starts with a campus card. I believe all of you have used one: a multi-functional campus information system covering identity authentication, campus consumption, data sharing and more. It stores a large amount of data, including student spending records, dormitory access control, library visits and so on.

This article uses the campus-card consumption details of students at a university in Nanjing from April 1-20, 2019. Through statistical visualization and association rule analysis, we uncover interesting information hidden in students' card usage: couples, bros, besties, suspected cheating men, single dogs and more.

The data set used is available at: github.com/Nicole456/A…

Import data

import pandas as pd
import numpy as np
import datetime 
import plotly_express as px
import plotly.graph_objects as go

1. Data 1: Basic information of each student’s campus card

2. Data 2: Detailed data of each consumption and recharge of campus card

3. Data 3: Details of access control

Data size

In [8]:

print("df1: ", df1.shape)
print("df2: ", df2.shape)
print("df3: ", df3.shape)
df1:  (4341, 5)
df2:  (519367, 14)
df3:  (43156, 6)

Missing values

# Missing values per column
df1.isnull().sum()

# Percentage of missing values per column
df2.apply(lambda x: sum(x.isnull()) / len(x), axis=0)

Headcount comparison

Number of persons by sex

Number of students in different majors

In [16]:

df5 = df1["Major"].value_counts().reset_index()

df5.columns = ["Major","Number"]
df5.head()

The number of men and women in different majors

In [18]:

df6 = df1.groupby(["Major","Sex"])["CardNo"].count().reset_index()
df6.head()

fig = px.treemap(
    df6,
    path=[px.Constant("all"), "Major", "Sex"],  # important: the path defines the hierarchy
    values="CardNo",
    color="Major"   # color by major
)

fig.update_traces(root_color="maroon")
# fig.update_traces(textposition="top right")
fig.update_layout(margin=dict(t=30,l=20,r=25,b=30))

fig.show()

Access Control Information

Address information

In [21]:

# Split the raw address into place name and in/out direction
address = df3["Address"].str.extract(r"(?P<Address_New>[\w]+)\[(?P<Out_In>[\w]+)\]")
df3[["Address_New", "Out_In"]] = address
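The named-group regex can be sanity-checked on a toy string (the place name and direction below are made up; the real data uses Chinese text, which `\w` also matches in Python 3):

```python
import re

# Hypothetical access record of the form "place[direction]"
record = "Dorm3[enter]"

pattern = re.compile(r"(?P<Address_New>[\w]+)\[(?P<Out_In>[\w]+)\]")
m = pattern.match(record)
print(m.group("Address_New"), m.group("Out_In"))  # Dorm3 enter
```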

Access time

In [25]:

df8 = pd.merge(df3, df1, on="AccessCardNo")
df8.loc[:, 'Date'] = pd.to_datetime(df8.loc[:, 'Date'], format='%Y/%m/%d %H:%M', errors='coerce')

df8["Hour"] = df8["Date"].dt.hour
# df8["Minute"] = df8["Date"].dt.minute

# Number of access records per hour
df9 = df8.groupby(["Hour", "Out_In"]).agg({"AccessCardNo": "count"}).reset_index()
df9.head()
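A quick note on the `errors='coerce'` argument: rows that do not match the given format become `NaT` instead of raising an exception, which keeps the pipeline running on messy card data. A minimal illustration on a made-up series:

```python
import pandas as pd

# One well-formed timestamp, one malformed row
s = pd.Series(["2019/4/1 7:36", "not a date"])

# errors="coerce" turns the malformed row into NaT instead of raising
parsed = pd.to_datetime(s, format="%Y/%m/%d %H:%M", errors="coerce")
print(parsed.dt.hour.tolist())
```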

# Prepare canvas
fig = go.Figure()

# Add one trace per direction
fig.add_trace(go.Scatter(
    x=df9.query("Out_In == 'out'")["Hour"].tolist(),
    y=df9.query("Out_In == 'out'")["AccessCardNo"].tolist(),
    mode='lines+markers',  # draw both lines and markers
    name='out'))           # trace name shown in the legend

fig.add_trace(go.Scatter(
    x=df9.query("Out_In == 'enter'")["Hour"].tolist(),
    y=df9.query("Out_In == 'enter'")["AccessCardNo"].tolist(),
    mode='lines+markers',
    name='enter'))

fig.show()

Consumption information

In [30]:

df10 = pd.merge(df2, df1[["CardNo", "Sex"]], on="CardNo")

Merge information

In [32]:

df10["Card_Sex"] = df10["CardNo"].apply(lambda x: str(x)) + "_" + df10["Sex"]

Main consumption locations

In [33]:

# Transactions count and total spend per location
df11 = (df10.groupby("Dept")
            .agg({"Card_Sex": "count", "Money": sum})
            .reset_index()
            .sort_values("Money", ascending=False))
df11.head(10)

fig = px.bar(df11, x="Dept", y="Card_Sex")
fig.update_layout(title_text='Number of customers in different locations', xaxis_tickangle=45)

fig.show()

fig = px.bar(df11,x="Dept",y="Money")
fig.update_layout(title_text='Amount spent in different places',xaxis_tickangle=45) 

fig.show()

Association rule mining

Time processing

There are two main steps in the time processing:

  • Time format conversion
  • Time discretization: one bucket every 5 minutes

Here we assume that if two transactions fall into the same bucket, the two people consumed together.

import datetime

def change_time(x):
    # Convert to the standard time format
    result = str(datetime.datetime.strptime(x, "%Y/%m/%d %H:%M"))
    return result

def time_five(x):
    # '2022-02-24 15:46:09' --> '2022-02-24 15_9'
    res1 = x.split(":")[0]
    res2 = str(round(int(x.split(":")[1]) / 5))
    return res1 + "_" + res2


df10["New_Date"] = df10["Date"].apply(change_time)
df10["New_Date"] = df10["New_Date"].apply(time_five)
df10.head(3)
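As a quick sanity check on the bucketing (the timestamps below are made up), two purchases two minutes apart land in the same bucket and would therefore count as "together":

```python
import datetime

def bucket(x):
    # Same logic as change_time + time_five above, composed into one function
    std = str(datetime.datetime.strptime(x, "%Y/%m/%d %H:%M"))
    head, minute = std.split(":")[0], std.split(":")[1]
    return head + "_" + str(round(int(minute) / 5))

print(bucket("2019/4/1 7:34"))  # 2019-04-01 07_7
print(bucket("2019/4/1 7:36"))  # 2019-04-01 07_7  -- same bucket
```

Note that `round` centers each bucket on a multiple of 5 (7:34 and 7:36 share a bucket, but 7:38 does not); using integer division `// 5` instead would give aligned 5-minute windows.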

Extract the people present in each time bucket:

df11 = df10.groupby(["New_Date"])["Card_Sex"].apply(list).reset_index()
df11["Card_Sex"] = df11["Card_Sex"].apply(lambda x: list(set(x)))  # de-duplicate within a bucket

all_list = df11["Card_Sex"].tolist()

# Equivalent but much slower loop version:
# all_list = []
# for i in df10["New_Date"].unique().tolist():
#     lst = df10[df10["New_Date"] == i]["Card_Sex"].unique().tolist()
#     all_list.append(lst)

Finding frequent itemsets

In [44]:

import efficient_apriori as ea

# Frequent itemsets end up in `itemsets`, association rules in `rules`
itemsets, rules = ea.apriori(all_list,
                min_support=0.005,
                min_confidence=1
               )
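For intuition about what the library computes, here is a minimal hand-rolled sketch of the pair-counting step (the transactions and IDs below are made up; `efficient_apriori` does this at scale, for itemsets of every size, with candidate pruning):

```python
from itertools import combinations
from collections import Counter

# Toy transactions: each inner list is one 5-minute bucket (hypothetical IDs)
transactions = [
    ["A_male", "B_female"],
    ["A_male", "B_female", "C_male"],
    ["B_female", "C_male"],
    ["A_male", "B_female"],
]

min_support = 0.5  # a pair must appear in at least half of the buckets

# Count how many buckets each pair co-occurs in
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(set(t)), 2):
        pair_counts[pair] += 1

# Keep only pairs whose support clears the threshold
n = len(transactions)
frequent_pairs = {p: c for p, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)
```

The counts attached to each pair in the results below are exactly these co-occurrence counts; support is the count divided by the total number of buckets.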

One person

Single-person itemsets are the most numerous: 2,565 entries.

len(itemsets[1])  # 2565 entries

# Partial data
{('181539_male',): 52,
 ('180308_female',): 47,
 ('183262_female',): 100,
 ('182958_male',): 88,
 ('180061_female',): 83,
 ('182936_male',): 80,
 ('182931_male',): 87,
 ('182335_female',): 60,
 ('182493_female',): 75,
 ('181944_female',): 67,
 ('181058_male',): 93,
 ('183391_female',): 63,
 ('180313_female',): 82,
 ('184275_male',): 69,
 ('181322_female',): 104,
 ('182391_female',): 57,
 ('184153_female',): 31,
 ('182711_female',): 40,
 ('181594_female',): 36,
 ('180193_female',): 84,
 ('184263_male',): 61,
 ...}

Two people

len(itemsets[2])  # 378 entries

After viewing all the data, the following results were obtained:

('180433_male', '180499_female'): 34,

# Suspected cheating man 1
('180624_male', '181013_female'): 36,
('180624_male', '181042_female'): 37,

# Suspected cheating man 2
('181461_male', '180780_female'): 38,
('181461_male', '180856_female'): 34,

('181597_male', '183847_female'): 44,

('181699_male', '181712_female'): 31,

('181889_male', '180142_female'): 33,

# Suspected cheating man 3: NB
('182239_male', '182304_female'): 39,
('182239_male', '182329_female'): 40,
('182239_male', '182340_female'): 37,
('182239_male', '182403_female'): 35,

('182873_male', '182191_female'): 31,

('183343_male', '183980_female'): 44,

1. Suspected cheating man 1: 180624

Go back to the raw data and look at the overlap between his consumption times and those of different girls.

(1) Overlap with female student 181013:

  • April 1, 7:36 am: probably breakfast together; lunch together at 11:54
  • Further overlaps at other times, such as April 10 and April 12

(2) Overlap with female student 181042:

2. A look at suspected cheating man 3

Data mining revealed that he was related to four girls at the same time.

('182239_male', '182304_female'): 39,
('182239_male', '182329_female'): 40,
('182239_male', '182340_female'): 37,
('182239_male', '182403_female'): 35,

Besides possible boyfriend/girlfriend relationships, many of the 2-itemsets are bros or besties:

('180450_female', '180484_female'): 35,
('180457_female', '180493_female'): 31,
('180460_female', '180496_female'): 31,
('180493_female', '180500_female'): 47,
('180504_female', '180505_female'): 43,
('180505_female', '180506_female'): 35,
('180511_female', '181847_female'): 42,
('180523_male', '182415_male'): 34,
('180526_male', '180531_male'): 33,
('180545_female', '180578_female'): 41,
('180545_female', '180615_female'): 47,
('180551_female', '180614_female'): 31,
('180555_female', '180558_female'): 36,
('180572_female', '180589_female'): 31,
('181069_male', '181103_male'): 44,
('181091_male', '181103_male'): 33,
('181099_male', '181102_male'): 31,
('181099_male', '181107_male'): 34,
('181102_male', '181107_male'): 35,
('181112_male', '181117_male'): 43,
('181133_male', '181136_male'): 52,
('181133_male', '181571_male'): 45,
('181133_male', '181582_male'): 33,
...

3-4 people

The 3- and 4-itemsets likely come from classmates or dorm-mates, and there are comparatively few of them:

len(itemsets[3])  # 18 entries

{('180363_female', '181876_female', '183979_female'): 40,
 ('180711_female', '180732_female', '180738_female'): 35,
 ('180792_female', '180822_female', '180849_female'): 35,
 ('181338_male', '181343_male', '181344_male'): 40,
 ('181503_male', '181507_male', '181508_male'): 33,
 ('181552_male', '181571_male', '181582_male'): 39,
 ('181556_male', '181559_male', '181568_male'): 35,
 ('181848_female', '181865_female', '181871_female'): 35,
 ('182304_female', '182329_female', '182340_female'): 36,
 ('182304_female', '182329_female', '182403_female'): 32,
 ('183305_female', '183308_female', '183317_female'): 32,
 ('183419_female', '183420_female', '183422_female'): 49,
 ('183419_female', '183420_female', '183424_female'): 45,
 ('183419_female', '183422_female', '183424_female'): 48,
 ('183420_female', '183422_female', '183424_female'): 51,
 ('183641_female', '183688_female', '183690_female'): 32,
 ('183671_female', '183701_female', '183742_female'): 35,
 ('183713_female', '183726_female', '183737_female'): 36}

There is only one 4-itemset.

Conclusion

Association rule analysis is a classical data mining algorithm, which is widely used in consumption details data, supermarket basket data, finance, insurance, credit cards and other fields.

When we use association analysis to mine frequent combinations and strong association rules, we can formulate corresponding marketing strategies or discover relationships between different objects.

In fact, the mining process above also has certain shortcomings:

  • Too loose: transactions are grouped only by time bucket, ignoring students' major, consumption location and other information
  • Too strict: a narrow 5-minute bucket can filter out many genuine co-occurrences