The Pandas module is used for data processing tasks such as data type conversion, missing value handling, descriptive statistical analysis, and data summarization.

1. Constructing Series and DataFrames

The core objects in Pandas are the Series and the DataFrame. A Series corresponds to a single field (column) of a data set; a DataFrame is a data set containing at least two fields (Series).

1. Constructing a Series

A Series can be constructed from a list, a tuple, or a one-dimensional NumPy array. In the resulting Series, the first column is the row index (row number), which automatically starts from 0, and the second column holds the actual values of the Series.

A Series built from a dictionary differs in that the first column contains concrete row names corresponding to the dictionary keys, while the second column holds the Series values corresponding to the dictionary values.

A Series can also be built from a single column of a DataFrame.
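A minimal sketch of these constructions (the values, row names, and column names below are made up for illustration):

import numpy as np
import pandas as pd

s1 = pd.Series([2.8, 3.01, 8.99])                 # from a list: the row index defaults to 0, 1, 2
s2 = pd.Series((2.8, 3.01, 8.99))                 # from a tuple
s3 = pd.Series(np.array([2.8, 3.01, 8.99]))       # from a one-dimensional ndarray
s4 = pd.Series({"a": 2.8, "b": 3.01, "c": 8.99})  # from a dict: keys become the row names
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
s5 = df["x"]                                      # from a single DataFrame column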

2. Accessing Series elements

The indexing methods and the mathematical and statistical functions that apply to one-dimensional arrays can also be applied to a Series, but a Series has its own additional access methods.

For a Series built from a dictionary, you can index by row number (position) as well as by row name (label).

To apply mathematical functions to a Series, the NumPy module is preferred.

To compute statistics on a Series, the Series' own methods are preferred.
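A short sketch under the same assumptions (the labels and values are again invented):

import numpy as np
import pandas as pd

s = pd.Series({"a": 2.8, "b": 3.01, "c": 8.99, "d": 8.59, "e": 5.18})
s["b"]               # index by row name (label)
s[["a", "c", "e"]]   # several row names at once
s.iloc[1]            # index by row number (position)
np.log(s)            # mathematical functions: the NumPy module is preferred
s.mean(), s.std()    # statistics: the Series' own methods are preferred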

3. Constructing a DataFrame

A DataFrame is essentially a data set, with rows representing observations and columns representing variables. A DataFrame can store Series of different data types, whereas arrays and Series can only hold homogeneous data.

When constructing a DataFrame manually, the dictionary approach is preferred, because the other constructions do not carry specific variable (column) names.

A DataFrame can also be constructed by reading external data.
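A minimal sketch of the dictionary construction (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"],
                   "age": [21, 25, 22],
                   "height": [1.72, 1.68, 1.80]})
df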

2. Reading external data

1. Reading text files

Use the read_table or read_csv function in Pandas

- filepath_or_buffer: the path where the TXT or CSV file is stored;
- sep: the separator between fields in the original data set, Tab by default;
- header: whether to use the first row of the original data set as the table header; by default the first row is used as the field names;
- names: if the original data set has no field names, this parameter adds a header to the DataFrame while reading;
- index_col: which columns of the original data set to use as the row index (labels) of the DataFrame;
- usecols: which variables to read from the original data set;
- dtype: sets a data type for each field of the original data set while reading;
- skiprows: the number of rows to skip at the beginning of the data set when reading;
- converters: conversion functions for specific columns, given in dictionary form;
- skipfooter: the number of rows to skip at the end of the data set when reading;
- nrows: the number of rows to read;
- na_values: which values in the original data set should be treated as missing values;
- skip_blank_lines: whether to skip blank lines in the original data set, True by default;
- parse_dates: if True, attempt to parse the row index as dates; if a list, parse the corresponding date columns; if a nested list, merge the listed columns into a date column; if a dictionary, parse the corresponding columns (the dictionary values) and generate a new field named after the dictionary key;
- thousands: the thousands separator used in the original data set;
- comment: the comment character; while reading, lines beginning with this character are skipped;
- encoding: the character encoding, which sometimes needs to be specified if the file contains Chinese characters.

import pandas as pd

# the TXT path below is a placeholder for the original file
a = pd.read_table("F:\\data.txt", sep=",", skiprows=2, skipfooter=3, comment="#",
                  encoding="utf8", thousands="&", parse_dates={"birthday": [0]})
a

The original data set separates columns with commas, hence the sep argument; parse_dates merges the date column into the new field birthday; comment specifies the character that marks rows to skip; encoding handles the Chinese characters in the file; and thousands specifies the thousands separator so that numeric data are read correctly.

2. Reading spreadsheets

Use the read_excel function

- io: the path to the spreadsheet;
- sheetname: the sheet to read, given either as an integer position or as a sheet name;
- header: whether to use the first row of the data set as the table header;
- skiprows: the number of rows to skip at the beginning when reading;
- skip_footer: the number of rows to skip at the end when reading;
- index_col: which columns to use as the row index (labels) of the DataFrame;
- names: if the original data set has no field names, this parameter adds a header to the DataFrame while reading;
- parse_cols: the columns to parse;
- parse_dates: if True, attempt to parse the row index as dates; if a list, parse the corresponding date columns; if a nested list, merge the listed columns into a date column; if a dictionary, parse the corresponding columns (the dictionary values) and generate a new field named after the dictionary key;
- na_values: which values in the raw data represent missing values;
- thousands: the thousands separator used in the original data set;
- convert_float: converts all numeric fields to floating point, True by default;
- converters: conversion functions for specific columns, given in dictionary form.

b = pd.read_excel(io="F:\\data_test02.xlsx", header=None, converters={0: str},
                  names=['ID', 'name', 'color', 'price'])
b

Since the first column is actually of character type, the converters parameter is used to prevent it from being automatically converted to a numeric field when read in.

3. Reading database data

Run pip install pymysql or pip install pymssql (for MySQL and SQL Server respectively). CASE 1: connecting with pymysql.

- host: the MySQL server to access;
- user: the user name for accessing the MySQL database;
- password: the password for accessing the MySQL database;
- database: the name of the database to access;
- port: the port number of the MySQL service;
- charset: the character set used when reading from the MySQL database; if the tables contain Chinese characters, it can be set to utf8 or gbk.

CASE 2: connecting with pymssql.

In pymysql, the host parameter of the connect function specifies the server to access, while the corresponding parameter in pymssql's connect function is server.

MySQL example:

import pymysql
# connect to the MySQL database
conn = pymysql.connect(host='localhost', user='root', password='test',
                       database='test', port=3306, charset='utf8')
# read data (the table name below is a placeholder)
user = pd.read_sql('select * from user_table', conn)
# close the connection
conn.close()
# output the data
user
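For comparison, a minimal pymssql sketch; the server name, credentials, database, and table name below are placeholders rather than values from the original example:

import pymssql
import pandas as pd

# pymssql uses server= where pymysql uses host=
conn = pymssql.connect(server="localhost", user="sa", password="test",
                       database="test", charset="utf8")
orders = pd.read_sql("select * from some_table", conn)
conn.close()
orders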

3. Data type conversion and descriptive statistics

This involves getting to know the data: the size of the data just read in, the type of each variable, the values of the key statistical indicators, and the frequency of each unique value of the discrete variables. Take the second-hand car listings of a platform as an example:

cars = pd.read_table(r"F:\sec_cars.csv", sep=",")

The head method previews the first five rows and tail the last five.
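For example, assuming cars has been read as above:

cars.head()   # preview the first five rows
cars.tail()   # preview the last five rows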

print(cars.shape)

print(cars.dtypes)

object refers to the character type; however, the registration time (boarding_time) should be a date type and the new-car price (New_price) should be a floating-point type. They are corrected below.

The to_datetime function of the pandas module converts the field to a date type:

cars.boarding_time = pd.to_datetime(cars.boarding_time, format="%Y年%m月")

The new-car price contains the character "万" (ten thousand), so it cannot be converted to a numeric type directly. Three steps are needed: first convert the field to a string via the str accessor, then slice off the last character to drop "万", and finally use the astype method to perform the type conversion.

# modify the new-car price
cars.New_price = cars.New_price.str[:-1].astype("float")

# re-examine the data types
cars.dtypes

Next, get a feel for the data by characterizing all of the numerical variables with basic statistics (minimum, mean, median, maximum, etc.). A descriptive analysis can be produced with the describe method:

cars.describe()

To further understand the shape of the distributions, for example whether the data are skewed or show "sharp peaks and heavy tails", compute the skewness and kurtosis of the numerical variables.

The columns attribute returns the names of all variables in the data set; a Boolean index combined with slicing is then used to keep only the numeric variables.

num_variables = cars.columns[cars.dtypes != "object"][1:]
num_variables

In the custom function below, the skew method computes skewness and the kurt method computes kurtosis; the two results are then combined into a Series.

def skew_kurt(x):
    skewness = x.skew()
    kurtosis = x.kurt()
    return pd.Series([skewness, kurtosis], index=["skew", "kurt"])

Use the apply method, whose purpose is to apply a function (here the custom function) along the specified axis (axis=0, i.e. down the columns).

cars[num_variables].apply(func=skew_kurt,axis=0)


The statistics above are all for numerical variables. For character variables (such as a used car's Brand, Discharge, etc.), the describe method can also be used; the difference is that "object" must be passed to the include parameter in the form of a list.

cars.describe(include=["object"])

The result contains four statistics for the discrete variables: the number of non-missing observations, the number of unique levels, the most frequent discrete value, and its frequency. Taking used car brands as an example, there are 10,984 used cars covering 104 brands, among which Buick is the most common with 1,346 cars.

Further statistics are needed for the frequency, and even the relative frequency, of each discrete value. Take the standard displacement (Discharge) of the used cars as an example:

Freq = cars.Discharge.value_counts()
Freq_ratio = Freq / cars.shape[0]
Freq_df = pd.DataFrame({'Freq': Freq, 'Freq_ratio': Freq_ratio})
Freq_df.head()

The DataFrame consists of two columns, the frequency and the relative frequency of each standard displacement, and its row index (labels) are the distinct displacement values. If the row labels need to become a column of the DataFrame, use the reset_index method. Setting its inplace parameter to True performs the operation directly on the original data set, changing it in place; otherwise only a preview of the change is returned.

Freq_df.reset_index(inplace=True)
Freq_df.head()

4. Processing character and date data

This covers how to manipulate character variables in a DataFrame and how to handle date data, for example extracting the year, month, and day of the week from a date variable, or computing the time difference between two dates.

df = pd.read_excel(r"F:\data_test03.xlsx")

# check the data types
df.dtypes

Convert birthday to a date type and the phone number tel to a string:

df.birthday = pd.to_datetime(df.birthday, format="%Y/%m/%d")
df.tel = df.tel.astype("str")
df.dtypes

The age and working age are obtained by subtracting the date of birth and the start-of-work date from the current date. The current date comes from the today function of the datetime module; since we are counting whole years, we further extract the year of each date via dt.year.

Df ["age"]=pd.datetime.today().year-df.birthday.dt.year df["workage"]=pd.datetime.today().year-df.start_work.dt.yearCopy the code

Hiding the middle four digits of the phone number tel uses the string replace method. Since replace has to be applied to every observation of the variable, which would be repetitive, the Series apply method is used instead; note that the func argument of apply is an anonymous (lambda) function here. The idea is simply to replace the middle four digits with asterisks.

df.tel = df.tel.apply(func=lambda x: x.replace(x[3:7], "****"))

To extract the email domain, use the string split method: split the address on the at sign and take the second element (list index 1).

Df ["email_domain"]=df.email.apply(func = lambda x: x.split("@")[1])Copy the code

To extract a person's profession from the other variable, a string regular expression is used. Whether using string "methods" or string regular expressions, the str accessor must be applied to the variable first. findall is the matching function; the regex symbols (.*?) form a group, and by default the content matched inside the parentheses is returned.

df['profession'] = df.other.str.findall('profession:(.*?),')

To drop variables from a data set, use the DataFrame's drop method. The first argument it accepts is the list of variables to delete. In particular, the axis parameter must be set to 1, because by default the drop method deletes row records from the DataFrame.

df.drop(['birthday', 'start_work', 'other'], axis=1, inplace=True)

Some commonly used "methods" for date data are listed below:
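A few commonly used dt accessors, shown on a small made-up datetime Series (illustrative, not the original table, and not exhaustive):

import pandas as pd

s = pd.to_datetime(pd.Series(["2016-01-05", "2023-10-18"]))
s.dt.year       # year
s.dt.month      # month
s.dt.day        # day of the month
s.dt.weekday    # day of the week (Monday = 0)
s.dt.quarter    # quarter of the year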

5. Common data cleaning methods

This covers whether the data set contains duplicates or missing values, its completeness and consistency, and whether it contains outliers.

1. Handling duplicate observations

df = pd.read_excel(r"F:\data_test04.xlsx")

The any function returns True if at least one of the conditions is True.

# check whether duplicate observations exist
any(df.duplicated())

To delete the duplicates, use drop_duplicates; inplace=True means the operation is performed directly on the original data set.

df.drop_duplicates(inplace=True)
df

Of the original 10 rows of observations, 7 remain after deduplication; the rows with row numbers 3, 8, and 9 were deleted.

2. Handling missing values

There may be two reasons for the loss of observations. On the one hand, it is caused by human factors (such as omissions in the recording process, personal privacy and reluctance to disclose, etc.), and on the other hand, it is caused by machine or equipment failure (such as power failure or equipment aging, etc.).

In general, there are three ways to deal with missing values (denoted by NaN in Python): deletion, replacement, and interpolation. The deletion method means deleting observations with missing values directly when the proportion of missing observations is very low (e.g. below 5%), or deleting a variable directly when its proportion of missing values is very high (e.g. above 85%). The replacement method means substituting a constant for the missing values: for continuous variables the mean or median can be used, and for discrete variables the mode. The interpolation method predicts the missing values from the other, non-missing variables or observations; common interpolation methods include regression interpolation, K-nearest-neighbour interpolation, and Lagrange interpolation.
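The deletion and replacement methods are demonstrated in the cases below. As a quick sketch of the interpolation idea only (using pandas' built-in linear interpolation, which is a simpler technique than the regression, K-nearest-neighbour, or Lagrange methods named above, and not part of the original example), a numeric Series with gaps can be filled like this:

import numpy as np
import pandas as pd

income = pd.Series([5000, np.nan, 5800, np.nan, 6400])
# each NaN is filled by interpolating linearly between its neighbours
income.interpolate(method="linear")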

CASE 1: the deletion method

df = pd.read_excel(r"F:\data_test05.xlsx")

Two deletion approaches are applied to the missing values in this data set: row deletion, i.e. deleting every row record that contains a missing value, using the dropna method; and variable deletion, where the drop method is used to delete the age variable, since it has the most missing values in the original data set.

df.dropna()

Df. drop("age",axis=1)Copy the code

CASE 2: the replacement method uses the fillna method. Its method parameter accepts 'ffill' and 'bfill', meaning forward and backward filling respectively. Forward filling replaces a missing value with the previous value, and backward filling replaces it with the next value. With backward filling the last record may still contain a missing value, because there is no next value available to use. In the author's opinion, forward or backward filling is generally suitable for time-series data sets, because such data have continuity from one record to the next, whereas ordinary independent samples are not suited to this method.

df.fillna(method='ffill')

df.fillna(method='bfill')

Another alternative is to call fillna again, but with the value parameter instead of the method parameter, for example replacing every missing value with the constant 0.

# replace all missing values with 0
df.fillna(value=0)

A more flexible approach is to use a different replacement value for each variable with missing data (a dictionary is passed to the value parameter): the mode for gender, the mean for age, and the median for income.

# replace missing values with statistics
df.fillna(value={'gender': df.gender.mode()[0],
                 'age': df.age.mean(),
                 'income': df.income.median()})

Note that the code above does not actually change the df DataFrame, because the dropna, drop, and fillna calls do not set the inplace parameter to True. To actually modify the data set, choose an appropriate missing value treatment and set that method's inplace parameter to True.

3. Outlier handling

Outliers are observations that lie far from the rest, i.e. "out-of-group" observations. They usually arise from human recording errors or equipment faults, and they can seriously affect model building and prediction. Of course, outliers are not always bad: in some scenarios finding them is valuable for the business, for example taking down "phishing" sites or blocking bonus-hunting users.

Two methods are generally used for outlier detection. One is the n-standard-deviations method, whose criterion is: outlier > x̄ + nσ or outlier < x̄ − nσ, where x̄ is the sample mean and σ is the sample standard deviation; when n = 2, observations satisfying the condition are outliers, and when n = 3 they are extreme outliers. The other is the box-plot method, whose criterion is: outlier > Q3 + n·IQR or outlier < Q1 − n·IQR, where Q1 is the lower quartile (25th percentile), Q3 is the upper quartile (75th percentile), and IQR is the difference between the upper and lower quartiles; when n = 1.5, observations satisfying the condition are outliers, and when n = 3 they are extreme outliers.

The choice between the two methods is as follows: if the data approximately follow a normal distribution, the n-standard-deviations method is preferred, because the distribution is then relatively symmetric; otherwise the box-plot method is preferred, because quantiles are not affected by extreme values. When outliers are present, they can generally be removed with the deletion method (provided the proportion of outlying observations is not too large) or handled with the replacement method (replacing them with, say, the largest value below the upper bound or the smallest value above the lower bound, or with the mean or median).

sun = pd.read_table(r"F:\sunspots.csv", sep=",")
sun.head()

xbar = sun.counts.mean()
xstd = sun.counts.std()
print('standard deviation method, upper-limit test:\n', any(sun.counts > xbar + 2 * xstd))
print('standard deviation method, lower-limit test:\n', any(sun.counts < xbar - 2 * xstd))
# box-plot method
Q1 = sun.counts.quantile(q=0.25)
Q3 = sun.counts.quantile(q=0.75)
IQR = Q3 - Q1
print('box-plot method, upper-limit test:\n', any(sun.counts > Q3 + 1.5 * IQR))
print('box-plot method, lower-limit test:\n', any(sun.counts < Q1 - 1.5 * IQR))

Both the standard deviation test and the box-plot test find that the sunspot data contain outliers, and these outliers lie above the upper critical value.

Next, a histogram and a kernel density curve of the sunspot counts are drawn to check whether the data approximately follow a normal distribution, and the final outlier criterion is chosen accordingly:

Use ("ggplot") # draw the histogram Plot (kind="hist",bins=30,normed=True) # Plot the kernel density plotCopy the code

Both the histogram and the kernel density curve show a skewed distribution, specifically right-skewed. For this reason the box-plot method is chosen to identify the outliers in the sunspot data. The next step is to handle these outliers by deletion or replacement.

Here replacement is used: outliers are replaced with the largest value below the upper test limit (or the smallest value above the lower limit).

UL = Q3 + 1.5 * IQR
print('upper limit for outliers:', UL)
replace_value = sun.counts[sun.counts < UL].max()
print('replacement value:', replace_value)
sun.counts[sun.counts > UL] = replace_value

With the box-plot criterion, a year is considered an outlier when its sunspot count exceeds 148.85, and the outlying values are replaced with 141.7.

6. Obtaining data subsets

Subsets of a DataFrame are obtained in Pandas with the iloc, loc, and ix indexers, all of which use the syntax [rows_select, cols_select].

iloc can filter data only by row and column numbers. Think of the "i" in iloc as "integer": [rows_select, cols_select] can only be specified with integers or integer lists. Indexing works like array indexing: it starts at 0, elements can be taken at intervals, and slices still exclude the upper bound.

loc is more flexible than iloc. Think of the "l" in loc as "label": specific row labels (row names) and column labels (field names) can be passed to [rows_select, cols_select]. Note that labels are not positional indexes. In addition, rows_select can be a specific filter condition, which is not possible with iloc.

ix is a hybrid of iloc and loc; think of it as "mix". This indexer combines the advantages of iloc and loc and makes subsetting more flexible.

1. When the row number of the original data matches the row label (name)

# construct the data set
df1 = pd.DataFrame({'name': ['Zhang San', 'Li Si', 'Wang Er', 'Ding Yi', 'Li Wu'],
                    'gender': ['male', 'female', 'female', 'female', 'male'],
                    'age': [23, 26, 22, 25, 27]},
                   columns=['name', 'gender', 'age'])
df1

Take the middle three rows of the data set and return the two columns name and age. ① With iloc, use row and column numbers:

df1.iloc[1:4, [0, 2]]

② With loc, the rows are specified by row name (label) rather than by positional index:

df1.loc[1:3, ['name','age']]


③ ix gives the same result as loc, except that the columns can be filtered either by column number or by specific variable name:

df1.ix[1:3, ['name', 'age']]

2. When the row number differs from the row label (name), or there is no row number

df2 = df1.set_index('name')
df2

Take the middle three rows of the data set. ① With iloc, by position:

df2.iloc[1:4,:]


② loc must use row labels; row numbers cannot be used here:

df2.loc[['Li Si', 'Wang Er', 'Ding Yi'], :]

③ ix here behaves the same as iloc:

df2.ix[1:4,:]


3. Take out the name and age of all males

Only loc and ix can be used for conditional filtering on columns:

df1.loc[df1.gender == 'male', ['name', 'age']]
df1.ix[df1.gender == 'male', ['name', 'age']]

7. The pivot table function

pd.pivot_table(data, values=None, index=None, columns=None, 
               aggfunc='mean', fill_value=None, margins=False, 
               dropna=True, margins_name='All')


- data: the data set from which to build the pivot table;
- values: the list of fields to place in the values box;
- index: the list of fields to place in the row-labels box;
- columns: the list of fields to place in the column-labels box;
- aggfunc: the statistical function applied to the values, the mean by default; other statistical functions from the NumPy module can also be specified;
- fill_value: a scalar used to fill missing values;
- margins: bool, whether to display the row and column totals, False by default;
- dropna: bool, whether to drop columns whose values are entirely missing, True by default;
- margins_name: the name of the row/column totals, 'All' by default.

1. Mean statistics for a single grouping variable

Summary statistics (the mean of price) based on the single grouping variable color:

diamonds = pd.read_table(r"F:\diamonds.csv", sep=",")
diamonds.head()

pd.pivot_table(data=diamonds, index='color', values='price',
               margins=True, margins_name='Total')

2. Contingency table of two grouping variables

For a contingency table, a grouping variable is needed for both the rows and the columns, so both the index and the columns parameters must be given a grouping variable, and the aggfunc parameter is set to the size function of the NumPy module:

import numpy as np
pd.pivot_table(data=diamonds, index='clarity', columns='cut', values='carat',
               aggfunc=np.size, margins=True, margins_name='Total')

8. Merging and joining tables

For vertical concatenation of multiple tables, the tables must have the same number of columns and the same data types; for horizontal extension, the tables must share a common matching field. The Pandas module provides the concat and merge functions for combining tables.

1. The concat function

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None)


- objs: the objects to be concatenated; can be a list of Series, DataFrames, or panel data;
- axis: the axis along which to concatenate, 0 by default (rows are stacked vertically); 1 means columns are joined horizontally;
- join: the concatenation mode, 'outer' by default, meaning all data are kept; if changed to 'inner', only the common data are kept;
- join_axes: the axes to retain after concatenation;
- ignore_index: bool, whether to ignore the indexes of the original data sets, False by default; if True, the original indexes are discarded and a new index is generated;
- keys: adds a new index level to the concatenated result to distinguish the original parts.

# construct the data sets df1 and df2
df1 = pd.DataFrame({'name': ['Zhang San', 'Li Si', 'Wang Er'],
                    'age': [21, 25, 22],
                    'gender': ['male', 'female', 'male']})
df2 = pd.DataFrame({'name': ['Ding Yi', 'Zhao Wu'],
                    'age': [23, 22],
                    'gender': ['female', 'female']})
# vertical concatenation of the data sets
pd.concat([df1, df2], keys=['df1', 'df2'])

To distinguish the rows coming from df1 and df2 after concatenation, the keys parameter of the concat function is used. If ignore_index is set to True, the keys parameter no longer takes effect:

pd.concat([df1,df2] , keys = ['df1','df2'],ignore_index=True)


2. The merge function

The merge function can only merge two tables at a time; to join n tables, merge must be called n−1 times.

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'))


- left: the primary (left) table to join;
- right: the secondary (right) table to join;
- how: the join type, 'inner' by default; other options are 'left', 'right', and 'outer';
- on: the common field used to join the two tables;
- left_on: the join field in the primary table;
- right_on: the join field in the secondary table;
- left_index: bool, whether to use the primary table's row index as the join key, False by default;
- right_index: bool, whether to use the secondary table's row index as the join key, False by default;
- suffixes: if the join result contains overlapping variable names, the suffixes used to distinguish them.

# construct the data sets
df3 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'name': ['Zhang San', 'Li Si', 'Wang Er', 'Ding Yi', 'Zhao Wu'],
                    'age': [27, 24, 25, 23, 25],
                    'gender': ['male', 'male', 'male', 'female', 'female']})
df4 = pd.DataFrame({'Id': [1, 2, 2, 4, 4, 4, 5],
                    'kemu': ['subject 1', 'subject 1', 'subject 2', 'subject 1',
                             'subject 2', 'subject 3', 'subject 1'],
                    'score': [83, 81, 87, 75, 86, 74, 88]})
df5 = pd.DataFrame({'id': [1, 3, 5],
                    'name': ['Zhang San', 'Wang Er', 'Zhao Wu'],
                    'income': [13500, 18000, 15000]})
# join the three tables: first join df3 and df4
merge1 = pd.merge(left=df3, right=df4, how='left', left_on='id', right_on='Id')
merge1

merge2 = pd.merge(left=merge1, right=df5, how='left')
merge2

To horizontally extend the three tables into one wide table, merge is called twice. As the code above shows, the first merge joins df3 and df4; because the common field has different names in the two tables, the left_on and right_on parameters must be specified. The second merge joins the first result with df5; this time left_on and right_on are not needed, because the first result already contains the id variable, so the shared variable is picked automatically for the join.

9. Grouped aggregation

Another very common operation is grouped aggregation, i.e. computing statistics of numerical variables within groups defined by grouping variables. Using the diamonds data as an example, for each combination of color and cut we count the number of diamonds and compute the minimum x, the average price, and the maximum depth.

Use the groupby and aggregate methods in Pandas.

The first step is to specify the grouping variables, which is done with the DataFrame's groupby "method". The second step is to compute the desired statistic for each numerical variable; note that the variable names and their statistical functions must be supplied in dictionary form.

# specify the grouping variables
grouped = diamonds.groupby(by=['color', 'cut'])
# aggregate each variable with its own statistic
result = grouped.aggregate({'color': np.size, 'x': np.min,
                            'price': np.mean, 'depth': np.max})
result


# rename the variables of the result data set
result.rename(columns={'color': 'counts', 'x': 'min_x',
                       'price': 'avg_price', 'depth': 'max_depth'},
              inplace=True)
result

The grouping variables color and cut become the row index of the DataFrame. If these two index levels need to be converted back into ordinary columns, use the DataFrame's reset_index method:

result.reset_index(inplace=True)
result


Conclusion

Pandas module

- read_csv: function for reading text files such as TXT and CSV
- read_excel: function for reading spreadsheets

pymysql / pymssql modules

- close: "method" for closing the connection between the database and Python

Pandas module

- read_sql: function for reading database data
- head / tail: "methods" for viewing the first and last rows of a DataFrame
- shape: returns the numbers of rows and columns of a DataFrame
- dtypes: returns the data types of the variables in a DataFrame
- to_datetime: function for converting a variable to a datetime type
- columns: returns the variable names of a DataFrame
- index: returns the row index of a DataFrame
- apply: mapping "method" for a Series or DataFrame
- value_counts: "method" for frequency statistics of the values of a Series
- reset_index: "method" for converting a row index into variables
- drop_duplicates: "method" for deleting duplicate observations
- drop: "method" for deleting variables or observations
- dropna: "method" for deleting missing values
- fillna: "method" for filling missing values
- quantile: "method" for computing quantiles of a Series
- plot: "method" for plotting a Series or DataFrame
- iloc / loc / ix: "methods" for taking subsets of a DataFrame
- pivot_table: function for building pivot tables
- groupby: "method" for specifying grouping variables
- aggregate: "method" for aggregation statistics
- rename: "method" for renaming the variables of a DataFrame