Use Python for correlation analysis of data

When we analyze data, the data usually have more than one dimension, which makes the analysis harder because we must consider the relationships between dimensions. These relationships need to be quantified, and correlation analysis is one way to do so. This article uses Python to explain the correlation analysis of data.

Three concepts need to be introduced before correlation analysis: dimension, covariance, and correlation coefficient. Take Figure 1 as an example. It is an employee information table with N employees (employee 1, employee 2, ..., employee N), each of whom has five attributes: height, weight, age, length of service, and education background. Each employee's record is an observation, also called a sample; this article uses the term observation. Each attribute is called an indicator, variable, dimension, or attribute; this article uses the term dimension. The table therefore contains N observations and 5 dimensions.

Figure 1. Employee information sheet

Covariance is defined as Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}: the expectation of the product of each dimension's deviation from its own expectation. For discrete data, the expectation is usually estimated by the mean. For example, in Figure 1, let X be height and Y be weight. E(X) is then the mean height, E(Y) is the mean weight, and the covariance is computed from the differences between each height and E(X) and between each weight and E(Y). The correlation coefficient is ρXY = Cov(X, Y)/[σ(X)σ(Y)], where σ(X) and σ(Y) are the standard deviations of X and Y. In other words, the correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. Similarly, if each observation has p dimensions, computing the covariance between every pair of dimensions (including each dimension with itself) gives a p×p matrix whose entry in row i and column j is the covariance between dimensions i and j. This is called the covariance matrix, and dividing each entry by the corresponding standard deviations, as above, gives the correlation matrix.
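To make the two definitions concrete, here is a minimal sketch that computes the covariance and the correlation coefficient by hand with numpy. The height and weight values are hypothetical and exist only for illustration; the mean stands in for the expectation, as described above.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) values for five employees.
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight = np.array([55.0, 60.0, 62.0, 70.0, 78.0])

# Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}, with the mean standing in for E.
cov_xy = np.mean((height - height.mean()) * (weight - weight.mean()))

# Population standard deviations (np.std divides by N by default).
sigma_x = height.std()
sigma_y = weight.std()

# rho_XY = Cov(X, Y) / [sigma(X) * sigma(Y)]
rho_xy = cov_xy / (sigma_x * sigma_y)

print("covariance:", cov_xy)
print("correlation (by hand):", rho_xy)
print("correlation (np.corrcoef):", np.corrcoef(height, weight)[0, 1])
```

Whether the covariance and standard deviations are divided by N or by N - 1, the factor cancels in the ratio, so the hand-computed coefficient should agree with np.corrcoef.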

The following Python code walks through a correlation analysis.

First, the data set. The data used in this article come from seaborn: the well-known iris flower data set. It can be obtained by executing the code below. One caveat: load_dataset sometimes fails with an error saying that the iris data set does not exist. This may be related to the seaborn version, or to load_dataset being unable to download the file. In that case, download the data from https://github.com/mwaskom/seaborn-data into the seaborn-data folder. By default this folder is in the user's home directory, and sns.get_data_home() returns its location.

import seaborn as sns
data = sns.load_dataset('iris')
df = data.iloc[:, :4]  # select the first four columns

This data set has 150 rows and 5 columns. The fifth column, species, is a categorical label, so we only use the first 4 numeric columns. A sample of the data set is shown in Figure 2.
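If load_dataset keeps failing, one possible workaround is to read a manually downloaded iris.csv with pandas. The sketch below is only an illustration: load_numeric_columns is a hypothetical helper (not part of seaborn or pandas), and the in-memory CSV with two rows in the iris layout stands in for the downloaded file.

```python
import io

import pandas as pd


def load_numeric_columns(source):
    """Read a CSV (path or file-like object) and keep only the numeric columns."""
    data = pd.read_csv(source)
    return data.select_dtypes(include="number")


# Stand-in for a downloaded iris.csv; in practice, pass the file path instead.
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)

df_local = load_numeric_columns(sample_csv)
print(df_local.columns.tolist())
```

Selecting columns by dtype instead of by position drops the text column species without relying on the column order.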

Figure 2. Sample dataset

Now let’s do the correlation analysis.

Let's start with a simple analysis of the correlation between columns 1 and 3 of the data set, that is, between the sepal_length and petal_length columns. We can use numpy, scipy, or pandas for this. First, numpy.

import numpy as np
X = df['sepal_length']
Y = df['petal_length']
result1 = np.corrcoef(X, Y)

result1 is a two-dimensional matrix, as shown in Figure 3.

Figure 3. Calculation results of result1

The values on the main diagonal of the matrix (the diagonal from the upper left corner to the lower right corner) are all 1, because each of them is the correlation of a dimension with itself: X = 1 × X and Y = 1 × Y, so each dimension is a perfect linear function of itself. The two off-diagonal values in Figure 3 are the correlation we are looking for. They are equal because they represent ρXY and ρYX, which are the same quantity. Similarly, we can compute the correlations among all four dimensions in df with the code below, where rowvar=False tells numpy that each column, rather than each row, is a dimension.

result2 = np.corrcoef(df, rowvar=False)

The result is shown in Figure 4.

Figure 4. Calculation results of result2

Figure 4 is a 4×4 matrix with 16 values, each giving the correlation between one dimension and another (including each dimension with itself). The main diagonal is all 1s, and the remaining values are mirrored across the main diagonal, so this is a symmetric matrix.
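The properties described above (unit diagonal, symmetry, and the meaning of rowvar) can be checked directly. The sketch below uses synthetic random data as a stand-in for df, so the numbers themselves are not the iris results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for df: 150 observations (rows) of 4 dimensions (columns).
data = rng.normal(size=(150, 4))

corr = np.corrcoef(data, rowvar=False)

print(corr.shape)
# Each dimension is perfectly correlated with itself.
print(np.allclose(np.diag(corr), 1.0))
# rho_XY equals rho_YX, so the matrix equals its transpose.
print(np.allclose(corr, corr.T))
# rowvar=False gives the same result as transposing the data first,
# so that every row holds one dimension.
print(np.allclose(corr, np.corrcoef(data.T)))
```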

Next, we use scipy for the analysis. Here's the code.

import scipy.stats as ss
result3 = ss.pearsonr(X, Y)

The result is (0.8717537758865831, 1.0386674194498099e-47), a pair of numbers. The first is the correlation coefficient of X and Y, which matches the earlier numpy result. The second is the p-value familiar from statistics: the probability of observing a correlation at least this strong if the two dimensions were in fact uncorrelated. It is not the probability that the two are uncorrelated, but a smaller value is stronger evidence against the no-correlation hypothesis. Our value is extremely small, so the linear correlation between the two is statistically significant; how strong the correlation is can be read from the first number. As the correlation approaches 1, the p-value approaches 0. Note that pearsonr handles one pair of dimensions at a time and does not return a correlation matrix the way np.corrcoef does.
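Because pearsonr works on a single pair of dimensions, a correlation matrix (and a matching matrix of p-values) has to be assembled by looping over the pairs. The following is a rough sketch of that idea: pearson_matrices is a hypothetical helper, not a scipy function, and the data are synthetic. Setting the diagonal p-values to 0 is a convention chosen here for illustration.

```python
import numpy as np
from scipy import stats


def pearson_matrices(data):
    """Return (r, p) matrices built from pairwise pearsonr calls on the columns."""
    n_dims = data.shape[1]
    r = np.ones((n_dims, n_dims))   # a dimension correlates perfectly with itself
    p = np.zeros((n_dims, n_dims))  # diagonal p-values set to 0 by convention
    for i in range(n_dims):
        for j in range(i + 1, n_dims):
            r_ij, p_ij = stats.pearsonr(data[:, i], data[:, j])
            r[i, j] = r[j, i] = r_ij
            p[i, j] = p[j, i] = p_ij
    return r, p


rng = np.random.default_rng(0)
data = rng.normal(size=(150, 3))
# Make the third dimension depend strongly on the first one.
data[:, 2] = data[:, 0] + rng.normal(scale=0.1, size=150)

r, p = pearson_matrices(data)
print(r.round(3))
print(p)
```

The r matrix should match np.corrcoef(data, rowvar=False), while the p matrix adds the significance information that numpy does not provide.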

Finally, the pandas method.

Since df is already a pandas DataFrame, its methods can be used directly. Here's the code.

result4 = X.corr(Y)
result5 = df.corr()

result4 is 0.8717537758865833, and result5 is shown in Figure 5. Both agree with the results obtained earlier, apart from floating-point rounding in the last digit.

Figure 5. Calculation results of result5
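The agreement between pandas and numpy can be verified programmatically instead of by eye. This is a small sketch on synthetic data with hypothetical column names a, b, and c; np.isclose and np.allclose are used because the last digits may differ through floating-point rounding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.normal(size=(150, 3)), columns=["a", "b", "c"])
# Make column c depend on column a so that one pair is clearly correlated.
frame["c"] = 2 * frame["a"] + rng.normal(scale=0.1, size=150)

pairwise = frame["a"].corr(frame["c"])  # Series.corr: one pair of dimensions
matrix = frame.corr()                   # DataFrame.corr: Pearson by default

# The pairwise value is the same number as the matching matrix entry.
print(np.isclose(pairwise, matrix.loc["a", "c"]))
# The pandas matrix agrees with numpy up to floating-point rounding.
print(np.allclose(matrix.to_numpy(), np.corrcoef(frame.to_numpy(), rowvar=False)))
```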

Next up is plotting. Two charts are commonly used for analyzing correlation: the scatter plot and the heat map. A scatter plot shows the distribution and trend of every data point, which gives an intuitive view of the relationship between dimensions. Its drawback is that it does not suit large amounts of data: generating the image becomes slow, and too many points make the picture hard to read. A heat map instead describes the relationship between dimensions through numeric values and colors. It conveys less information but is better suited to large data volumes. Let's look at scatter plots first.

Scatter plots can be generated using seaborn or pandas. The seaborn code is as follows.

sns.pairplot(df)
sns.pairplot(df, hue='sepal_width')

The result of the first line of code is shown in Figure 6: a large figure containing 16 subplots, each plotting one dimension against another. The subplots on the main diagonal are histograms showing the distribution of each dimension. The second line draws a similar figure but colors each data point according to its sepal_width value; the result is shown in Figure 7. The sepal_width column has 23 distinct values, each assigned its own color, which is why the generated figure is colored.

Figure 6. Ordinary correlation diagram drawn by Seaborn

Figure 7. Correlation diagram drawn by Seaborn, colored by one column's data

Another way to draw is to use pandas.

import pandas as pd
pd.plotting.scatter_matrix(df, figsize=(12, 12), range_padding=0.5)

The result is shown in Figure 8. Seaborn is generally recommended, because the pandas version is slightly less customizable and less refined.

Figure 8. The correlation diagram generated by pandas

And finally, the heat map. The code is as follows.

import matplotlib.pyplot as plt
figure, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(df.corr(), square=True, annot=True, ax=ax)

A small problem may appear when you first run this code, as shown in Figure 9. It is caused by the matplotlib version rather than by the code: seaborn is built on matplotlib, so upgrading matplotlib is enough. My original matplotlib version was 3.1.1; after upgrading to 3.2.2 the bug is gone, and the correct figure is shown in Figure 10. In the sns.heatmap call, square=True makes each cell a square, and annot=True writes each cell's value into the figure.
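Because a correlation matrix is symmetric, the upper triangle of the heat map repeats the lower one. As an optional variation (not something the code above does), the sketch below hides the redundant cells with the mask parameter of sns.heatmap and fixes the color scale to the range from -1 to 1 with vmin and vmax. The data are synthetic, the column names a to d are hypothetical, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
# Synthetic stand-in for df with hypothetical column names.
frame = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
corr = frame.corr()

# True marks the cells to hide: everything strictly above the main diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, cmap="coolwarm",
            square=True, annot=True, ax=ax)
```

Fixing vmin and vmax keeps the colors comparable between different data sets, since a correlation coefficient always lies between -1 and 1.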

Figure 9. Heat map generated by the old version of matplotlib

Figure 10. Heat map generated by the new version of matplotlib

From calculation to visualization, this article has introduced several Python methods for finding correlations in multidimensional data. Choose whichever method suits your needs.

About the author: Mort, a data analysis enthusiast, is good at data visualization, pays more attention to machine learning, and hopes to learn more and communicate with friends in the industry.
