Public account: You and the cabin | Author: Peter | Editor: Peter

Initial Data Exploration with Pandas

This article introduces initial data exploration with pandas. After generating or importing data, a quick exploration reveals its basic properties, such as each field's type, the index, the maximum values, and the missing values, and gives us a preliminary picture of the dataset as a whole.

Simulated data

The methods introduced in this article use a simulated dataset that contains string, numeric, and datetime columns. Some values in the data are deliberately missing.

The data is read with pandas' read_excel function.

A Series is generated as well.
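The original spreadsheet is not reproduced here, so the following is only a minimal sketch that builds a comparable dataset in memory instead of calling read_excel. It is a smaller stand-in (4 rows, 5 columns) rather than the 7 x 8 table used in the article, and the column names name and birthday as well as every value are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's spreadsheet; all values are invented.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dan"],        # string column with a missing value
    "age": [20, 23, 22, 21],                    # integer column
    "math": [120.0, np.nan, 98.0, 135.0],       # float column with a missing value
    "english": [110, 125, 130, 99],             # integer column
    "birthday": pd.to_datetime(
        ["2001-03-05", "1998-07-19", None, "2000-11-30"]
    ),                                          # datetime column; None becomes NaT
})

# A Series created from a plain list.
s = pd.Series([1, 2, 3, 4], name="s")

print(df)
print(s)
```

With a real workbook, pd.read_excel("some_file.xlsx") would take the place of the DataFrame constructor; the file name here is a placeholder.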

Data sampling

Viewing the head and tail

  • head(n): returns the first n rows; the default is 5
  • tail(n): returns the last n rows; the default is 5
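A short sketch of both calls, using a hypothetical ten-row frame so that the default of 5 is visible:

```python
import pandas as pd

# Hypothetical data: ten rows with an id and a score column.
df = pd.DataFrame({"id": range(10), "score": range(100, 110)})

first_default = df.head()    # first 5 rows
first_three = df.head(3)     # first 3 rows
last_default = df.tail()     # last 5 rows
last_two = df.tail(2)        # last 2 rows

print(first_three)
print(last_two)
```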

Viewing a random sample

By default, sample() returns one random row. You can also specify the number of rows to return.
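As a sketch on hypothetical data, the calls might look like this. The random_state argument is an addition not mentioned in the article; it fixes the seed so that the same rows come back on every run.

```python
import pandas as pd

# Hypothetical data: ten rows with an id and a score column.
df = pd.DataFrame({"id": range(10), "score": range(100, 110)})

one_row = df.sample()                        # one random row
three_rows = df.sample(3)                    # three random rows, without replacement
repeatable = df.sample(3, random_state=42)   # fixed seed gives the same rows each run

print(one_row)
print(three_rows)
```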

Viewing the data shape: shape

The shape is the number of rows and columns in the data, which tells you how large the dataset is.

  • DataFrame: a tuple of two values, (rows, columns)
  • Series: a one-element tuple holding only the number of rows
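A minimal illustration with a hypothetical 3 x 2 frame:

```python
import pandas as pd

# Hypothetical data: three rows, two columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
s = df["a"]

print(df.shape)  # (3, 2)
print(s.shape)   # (3,)

# The tuple can be unpacked directly.
n_rows, n_cols = df.shape
```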

Data size

The size is the total number of elements in the data. For a DataFrame, it is the product of the two values returned by shape.

df.size  # 56 = 7 * 8

Data dimensions: ndim

The ndim attribute returns the number of dimensions of the data: 1 for a Series and 2 for a DataFrame.
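A quick check on hypothetical data, which also shows that ndim equals the length of the shape tuple:

```python
import pandas as pd

# Hypothetical data: three rows, two columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
s = df["a"]

print(df.ndim)  # 2
print(s.ndim)   # 1

# ndim is the number of entries in shape.
same_for_df = df.ndim == len(df.shape)
same_for_s = s.ndim == len(s.shape)
```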

Basic information: info

Displays the data types, index, number of columns, column names, memory usage, and other information. Series gained this method only in pandas 1.4; earlier versions do not provide it.
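A sketch of calling info() on a hypothetical frame. The method prints its report and returns None, so the second call captures the text in a buffer through the buf parameter in case the report is needed as a string.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"name": ["Ann", "Bob", "Dan"], "math": [120.0, np.nan, 135.0]})

# Prints the report to standard output.
df.info()

# Capture the same report as a string instead of printing it.
buf = io.StringIO()
df.info(buf=buf)
report = buf.getvalue()
```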

Data types: dtypes

df.dtypes  # the data type of each column
s.dtype  # no trailing "s": a Series has a single data type

Row indexes and column names

Both can be viewed through the axes attribute. A DataFrame has a row index and column names, while a Series has only a row index.
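A small sketch on hypothetical data showing what axes returns for each type:

```python
import pandas as pd

# Hypothetical data: three rows, two columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
s = df["a"]

df_axes = df.axes   # list of two items: the row index, then the column names
s_axes = s.axes     # list of one item: the row index

print(df_axes)
print(s_axes)
```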

Viewing the row index

The row index is available through the dedicated index attribute.
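For instance, with a hypothetical frame whose rows are labeled by name:

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({"math": [120, 98, 135]}, index=["Ann", "Bob", "Dan"])
s = df["math"]

print(df.index)
print(s.index)

# A column taken from the frame shares the frame's row index.
labels = list(df.index)
```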

Viewing the column names

df.columns

Viewing the underlying values

The underlying data can be retrieved as a NumPy array in two ways:

  • values (an attribute)
  • to_numpy() (a method)
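A minimal sketch of both on hypothetical data. Because the frame mixes an integer column with a float column, the resulting array is upcast to a common float64 type.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one integer column and one float column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

arr_attr = df.values        # attribute
arr_method = df.to_numpy()  # method, same contents

# A single column converts to a one-dimensional array.
col = df["a"].to_numpy()

print(arr_method)
print(arr_method.dtype)
```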

Viewing missing values

isnull() returns an object of the same shape as the data, in which each entry is True where the value is missing and False otherwise.
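A sketch on hypothetical data. Chaining sum() or any() onto the Boolean result is a common follow-up that the article does not show; it summarizes the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row is missing in both columns.
df = pd.DataFrame({"name": ["Ann", None, "Dan"], "math": [120.0, np.nan, 135.0]})

mask = df.isnull()                      # Boolean frame, same shape as df
missing_per_column = df.isnull().sum()  # number of missing values in each column
any_missing = df.isnull().any()         # whether each column has any missing value

print(mask)
print(missing_per_column)
```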

Memory usage: memory_usage()

View memory usage for each column, in bytes:

df.memory_usage()
s.memory_usage()
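As a hedged addition, memory_usage() also accepts deep=True, which inspects the contents of object columns such as strings rather than counting only the references. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: one string column and one integer column.
df = pd.DataFrame({"name": ["Ann", "Bob", "Dan"], "age": [20, 23, 21]})

shallow = df.memory_usage()         # bytes per column, plus an "Index" entry
deep = df.memory_usage(deep=True)   # also measures the stored string objects
total_bytes = int(deep.sum())

print(shallow)
print(deep)
```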

Descriptive statistics

Note: these statistics apply only to numeric columns.

Summary statistics: describe

Returns the count, mean, standard deviation, minimum, quartiles, and maximum of the numeric columns.

df.describe()
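A sketch of the output structure on hypothetical data. The include="all" argument is an addition not used in the article; it extends the summary to non-numeric columns as well.

```python
import pandas as pd

# Hypothetical data: one string column and two numeric columns.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Dan"],
    "age": [20, 22, 24],
    "math": [100.0, 110.0, 120.0],
})

summary = df.describe()                 # numeric columns only
everything = df.describe(include="all") # also summarizes the string column

print(summary)
print(summary.loc["mean", "age"])  # 22.0
```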

Viewing the mean

An aggregation over a DataFrame usually returns a Series, and an aggregation over a Series returns a single value.

The following code calculates the mean by column:

df.mean(numeric_only=True)  # by column; pandas 2.0+ needs numeric_only=True when string or datetime columns are present

# Result
age         21.714286
chinese    111.285714
math       117.000000
english    119.571429
dtype: float64

To view the mean of a column:

df["math"].mean()  # 117.0

The following code calculates the mean by row:

df.mean(axis=1, numeric_only=True)  # by row, over the numeric columns

0    89.50
1    96.25
2    87.50
3    93.50
4    89.25
5    95.50
6    95.25
dtype: float64

Built-in mathematical methods in pandas

pandas has a variety of mathematical functions built in.

The default is axis=0, which computes one result per column; axis=1 computes one result per row.
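The effect of the axis argument can be sketched on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical data: three rows, two numeric columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_means = df.mean()         # axis=0 by default: one value per column
row_means = df.mean(axis=1)   # one value per row

print(col_means)  # a: 2.0, b: 20.0
print(row_means)  # 5.5, 11.0, 16.5
```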

# Most of these need numeric data; select the numeric columns first or pass numeric_only=True where supported.
df.abs()  # absolute value of every element
df.mode()  # mode (most frequent value) of each column
df.mean()  # mean of each column
df.mean(1)  # mean of each row
df.max()  # maximum of each column
df.min()  # minimum of each column
df.median()  # median of each column
df.std()  # sample standard deviation of each column (Bessel-corrected)
df.var()  # unbiased variance of each column
df.corr()  # pairwise correlation coefficients between columns
df.count()  # number of non-null values in each column
df.prod()  # product of the values in each column
df.mad()  # mean absolute deviation (removed in pandas 2.0)
df.cumprod()  # cumulative product
df.cumsum(axis=0)  # cumulative sum
df.nunique()  # number of distinct values in each column
df.sem()  # standard error of the mean
df.idxmax()  # index label of the maximum of each column
df.idxmin()  # index label of the minimum of each column
df.cummin()  # cumulative minimum
df.cummax()  # cumulative maximum
df.skew()  # sample skewness (third moment)
df.kurt()  # sample kurtosis (fourth moment)
df.quantile()  # sample quantile (the median by default; pass q for other percentages)
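A few of these methods are sketched below on hypothetical data. Since DataFrame.mad() is no longer available in pandas 2.0 and later, the sketch also computes the mean absolute deviation by hand; this manual expression is a substitute suggested here, not something taken from the article.

```python
import pandas as pd

# Hypothetical data: column b is constant on purpose.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 6.0], "b": [10.0, 10.0, 10.0, 10.0]})

# Manual mean absolute deviation: average distance of each value from its column mean.
mad = (df - df.mean()).abs().mean()

distinct = df.nunique()       # number of distinct values per column
max_labels = df.idxmax()      # index label of the first maximum in each column
median = df.quantile(0.5)     # 50% quantile of each column

print(mad)         # a: 1.5, b: 0.0
print(distinct)    # a: 4, b: 1
print(max_labels)  # a: 3, b: 0
print(median)      # a: 2.5, b: 10.0
```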

Conclusion

This article introduced data exploration in pandas, which helps us quickly understand the basic information about a dataset and makes subsequent data processing and analysis easier.