Source: GZH Data kaleidoscope

Article link: mp.weixin.qq.com/s/7uBQ3_sR2…

The series on causal inference is split into two parts, with the table of contents below. Part 1 can be read via the original article link.

Causal inference with the DoWhy framework, in two parts

Part 1

1. DoWhy causal inference framework

2. Data source and preprocessing

3. Data correlation exploration

Part 2

Implementation of causal inference

1. Calculate the expected frequency and preliminarily judge the causal relationship

2. Create causal diagrams based on assumptions

3. Identify causal effects

4. Estimate causal effects

5. Refute the results

This article builds on the analysis in Part 1. The data preprocessing steps and their source code are in the full original article; this part focuses on implementing causal inference.

Implementation of causal inference

After data preprocessing and correlation analysis, we have preliminary results on how the variables are correlated, but it is still unknown whether any of these relationships are causal. Further causal inference is therefore needed, following the four steps of the DoWhy framework: modeling, identification, estimation, and refutation.
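The four steps can be sketched as a single function. This is a minimal sketch rather than the article's own implementation: the estimator ("backdoor.propensity_score_weighting"), the refuter ("random_common_cause"), and the helper name run_four_steps are assumptions chosen for illustration. The sketch also assumes the dowhy package is installed and that the treatment column is boolean.

```python
def run_four_steps(df, common_causes):
    """Model, identify, estimate, and refute with DoWhy (hypothetical helper)."""
    # Imported lazily so the function can be defined without dowhy installed.
    from dowhy import CausalModel

    # 1. Model: declare treatment, outcome, and the assumed common causes.
    model = CausalModel(
        data=df,
        treatment="different_room_assigned",
        outcome="is_canceled",
        common_causes=common_causes,
    )
    # 2. Identify: derive an estimand from the modeling assumptions.
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    # 3. Estimate: the estimator below is an assumed choice, not the article's.
    estimate = model.estimate_effect(
        estimand, method_name="backdoor.propensity_score_weighting"
    )
    # 4. Refute: check how the estimate reacts to an added random common cause.
    refutation = model.refute_estimate(
        estimand, estimate, method_name="random_common_cause"
    )
    return estimate, refutation
```

A call might look like run_four_steps(data, ["booking_changes"]), where data is the preprocessed DataFrame from Part 1 and the list of common causes is whatever the analyst is willing to assume.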

1. Calculate the expected frequency and preliminarily judge the causal relationship

According to the correlation analysis, customer cancellations are highly correlated with three factors: “parking spaces”, “total length of stay”, and “reserved room type differs from assigned room type”. Some other factors, such as “booking changes” and “special requests”, show a weaker correlation with cancellation.

Correlation does not imply causation, and Figure 9-10 shows that the positive and negative samples in the data set are imbalanced, so a preliminary check on causation is needed here. For the variables “is_canceled” and “different_room_assigned”, we randomly sample 1,000 observations from the data set and count how many of them have the same value for both variables, that is, bookings where the customer was assigned a different room and canceled, or was assigned the reserved room and did not cancel. This process is repeated 10,000 times and the counts are averaged. The implementation code is as follows.

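    # data is the preprocessed DataFrame from Part 1.
    # Sample 1,000 bookings and count the rows where the two flags agree;
    # repeat and average the counts.
    counts_sum = 0
    for i in range(1, 10000):  # as written, this runs 9,999 iterations
        rdf = data.sample(1000)
        counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
        counts_sum += counts_i
    counts_sum / 10000
    # Output: 517.9752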
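Because every sample is drawn uniformly at random, the averaged count should converge to the sample size multiplied by the share of rows in the full data set where the two flags agree, so the loop can be cross-checked with a single pass over the data. Here is a minimal sketch in plain Python on a small made-up population. The rows and the helper names are hypothetical and are not the hotel data.

```python
import random

# Hypothetical (is_canceled, different_room_assigned) pairs, invented for
# illustration only.
population = [(1, 0)] * 300 + [(0, 0)] * 500 + [(0, 1)] * 120 + [(1, 1)] * 80

def agreement_share(rows):
    """Fraction of rows where both flags have the same value."""
    return sum(1 for canceled, different in rows if canceled == different) / len(rows)

def simulated_expected_count(rows, sample_size, repeats, seed=0):
    """Average number of agreeing rows over repeated random samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repeats):
        sample = rng.sample(rows, sample_size)  # sampling without replacement
        total += sum(1 for canceled, different in sample if canceled == different)
    return total / repeats

sample_size = 100
exact = sample_size * agreement_share(population)  # closed-form expectation
approx = simulated_expected_count(population, sample_size, repeats=2000)
print(exact, round(approx, 2))
```

The two printed numbers should be close. On the real data, the same check is the mean of the comparison is_canceled == different_room_assigned over all rows, multiplied by 1,000.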

As a baseline, two binary variables that take the same value purely at random would agree in about 50% of the observations. A count close to 50% of the sample therefore does not settle the question at this stage: assigning a room type different from the reserved one may or may not lead the customer to cancel, so further analysis is needed.
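The 50% reference value can be made concrete. For two independent binary variables with rates p and q, the probability that they take the same value is p·q + (1 − p)·(1 − q), which equals 50% when at least one of the two rates is 0.5 and differs otherwise. The sketch below uses illustrative rates; they are assumed values and are not estimated from the data set.

```python
def chance_agreement(p_canceled, p_different):
    """P(both flags equal) for two independent binary variables."""
    both_one = p_canceled * p_different
    both_zero = (1 - p_canceled) * (1 - p_different)
    return both_one + both_zero

# Illustrative marginal rates (assumed, not taken from the hotel data).
for p, q in [(0.5, 0.5), (0.5, 0.1), (0.3, 0.1)]:
    rate = chance_agreement(p, q)
    print(f"P(cancel)={p}, P(different room)={q} -> agreement {rate:.2f}")
```

The first two pairs give an agreement rate of 0.50, while the third gives 0.66. When both flags are far from evenly split, the chance baseline is therefore better computed from the observed rates than assumed to be 50%.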

The final expected count is about 518 out of 1,000, that is, the cancellation flag and the room-mismatch flag take the same value in roughly 52% of the sampled bookings, which is close to 50%.

Booking changes, i.e. the variable “booking_changes”, are also one of the factors that lead to a mismatch between the reserved and the assigned room type, so it is important to control for this variable. Therefore, 1,000 customers with 0 booking changes are randomly sampled here, and the random test above is repeated 10,000 times and averaged. The implementation code is as follows.

    # Same test, restricted to bookings with no changes.
    counts_sum = 0
    for i in range(1, 10000):  # as written, this runs 9,999 iterations
        rdf = data[data["booking_changes"] == 0].sample(1000)
        counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
        counts_sum += counts_i
    counts_sum / 10000
    # Output: 492.0499

For customers with 0 booking changes, the expected count is about 492, roughly 50% of the sample, which is in line with the result above.

For customers with at least one booking change, 1,000 customers are likewise sampled and the random test is repeated 10,000 times. The implementation code is as follows.

    # Same test, restricted to bookings with at least one change.
    counts_sum = 0
    for i in range(1, 10000):  # as written, this runs 9,999 iterations
        rdf = data[data["booking_changes"] > 0].sample(1000)
        counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
        counts_sum += counts_i
    counts_sum / 10000
    # Output: 663.4134

For customers with more than 0 booking changes, however, the expected count is about 663, which differs substantially from the previous two results. This suggests that “booking changes” may be a confounding variable.
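The three tests differ only in which rows are sampled, so they can share one helper. The sketch below runs on a synthetic stand-in for the hotel data: the rates used to generate it are invented, and the helper name expected_agreement is hypothetical. It is only meant to show how a variable that shifts both flags produces different counts per stratum.

```python
import numpy as np
import pandas as pd

def expected_agreement(df, sample_size=1000, repeats=200, seed=0):
    """Average count of sampled rows where is_canceled equals different_room_assigned."""
    rng = np.random.default_rng(seed)
    agree = (df["is_canceled"] == df["different_room_assigned"]).to_numpy()
    counts = [
        rng.choice(agree, size=sample_size, replace=False).sum()
        for _ in range(repeats)
    ]
    return float(np.mean(counts))

# Synthetic data (assumed rates): booking changes raise the chance of a room
# mismatch and lower the chance of cancellation.
gen = np.random.default_rng(42)
n = 20_000
booking_changes = gen.poisson(0.3, size=n)
changed = booking_changes > 0
demo = pd.DataFrame({
    "booking_changes": booking_changes,
    "different_room_assigned": gen.random(n) < np.where(changed, 0.30, 0.10),
    "is_canceled": gen.random(n) < np.where(changed, 0.15, 0.45),
})

strata = {
    "no changes": demo[demo["booking_changes"] == 0],
    "with changes": demo[demo["booking_changes"] > 0],
}
results = {name: expected_agreement(part) for name, part in strata.items()}
print(results)
```

With these invented rates, the stratum with changes shows a clearly higher agreement count than the stratum without. This is the same qualitative pattern as the 492 versus 663 gap reported above.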

However, more than one confounding variable may affect customer cancellations. In that case, the DoWhy framework treats all unspecified variables as potential confounders.

2. Create causal diagrams based on assumptions

Based on the exploration of expected frequency and the data analyst’s own experience, we make the following assumptions about the relationship between variables.

– Market segment, i.e. the “market_segment” field, has two categories, “individuals” and “travel agents”, and indicates the source of the hotel reservation. The booking channel affects the time between the reservation and the arrival at the hotel, i.e. the “lead_time” field.

– The “country” field indicates the customer’s destination country. How popular the destination is with tourists affects whether customers book hotels in advance, and thus “lead_time”. Different countries also have different eating habits, so the destination country is related to the meal choice, i.e. the “meal” field.
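The assumptions listed so far can be written down as directed edges. The sketch below covers only the three edges stated above and renders them as a DOT-format string. DoWhy is understood to accept a graph in DOT or GML form, which is worth checking against the installed version. The helper name to_dot is hypothetical.

```python
# (cause, effect) pairs taken from the assumptions stated above.
assumed_edges = [
    ("market_segment", "lead_time"),
    ("country", "lead_time"),
    ("country", "meal"),
]

def to_dot(edges):
    """Render (cause, effect) pairs as a DOT digraph string."""
    lines = [f"    {cause} -> {effect};" for cause, effect in edges]
    return "digraph {\n" + "\n".join(lines) + "\n}"

causal_graph = to_dot(assumed_edges)
print(causal_graph)
```

Keeping the edges in a plain list makes each assumption easy to review and extend before the graph is handed to the causal model.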

For the full article, see GZH Data kaleidoscope: mp.weixin.qq.com/s/7uBQ3_sR2…
