I often meet two types of people: those who can write crawlers but don’t know how to analyze the data any further, and those who usually do analysis in Excel but don’t know much about Python. If either of these sounds like you, the following guide will suit you well, and it is worth bookmarking.

Excel is the most commonly used tool in data analysis. By comparing Python functions with their Excel counterparts, this article shows how to complete Excel-style data processing and analysis in Python. The 36 functions used here to generate and import data, clean data, preprocess data, sort, filter, summarize, and pivot are all described in pandas’ 1,787-page official documentation.

The content of this article is divided into 9 parts as follows:

01 Generate a data table
02 Data Table Check
03 Data Table Cleaning
04 Data Preprocessing
05 Data Extraction
06 Data Filtering
07 Data Summary
08 Data Statistics
09 Data Output

01 Generate a data table

The first part covers generating the data table. There are two common ways: importing external data, and writing the data directly. Excel’s File menu provides a Get External Data function that supports importing from multiple sources, including databases, text files, and web pages.

Get external data

Python supports importing many types of data. To import data into Python, we need to import the pandas library first. For convenience, we also import the numpy library.

```python
import numpy as np
import pandas as pd
```

Import table

The following are the methods for importing data from Excel and CSV files to create data tables. The code here is kept minimal; these functions have many optional parameters, such as column names, the index column, and data formats. Interested readers can refer to pandas’ official documentation.

```python
df = pd.DataFrame(pd.read_csv('name.csv', header=1))
df = pd.DataFrame(pd.read_excel('name.xlsx'))
```
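As a sketch of those optional parameters, a read_csv call might look like the following; the parameter values are illustrative, not taken from the article’s example files.

```python
# illustrative read_csv call with common optional parameters
df = pd.read_csv('name.csv',
                 header=0,                 # which row holds the column names
                 index_col='id',           # column to use as the index
                 dtype={'price': float},   # enforce a data format per column
                 encoding='utf-8')         # file encoding
```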

Create table

Another way to create a table is to write the data directly. In Excel you can type values straight into cells; in Python you can do it with the code below. The DataFrame function in the pandas library generates a table with six rows of data and six fields per row. We deliberately set some NA values and problematic fields in the data, such as values containing spaces; these will be dealt with later in the data cleaning step. From here on we name the data table df, short for DataFrame.

```python
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   'date': pd.date_range('20130102', periods=6),
                   'city': ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   'age': [23, 44, 54, 32, 34, 32],
                   'category': ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   'price': [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])
```

This is the newly created table. We have not set an index column, the price field contains NA values, and the city field contains some dirty data.

df

02 Data Table Check

The second part is checking the data table. The amount of data processed in Python is usually large. For example, the New York taxi data and the Citi Bike riding data introduced in our previous articles both run to tens of millions of rows, so we cannot grasp the whole table at a glance; we need methods that extract its key information. Another purpose of checking the table is to get an overview of the data: the size of the table, the space it occupies, the data formats, whether there are null values and duplicates, and what the data actually contains. This prepares us for the subsequent cleaning and preprocessing.

Data dimensions (rows and columns)

In Excel you can find the last row and column with Ctrl+Down and Ctrl+Right. Python uses shape to view the dimensions of a table, i.e. the number of rows and columns. The result (6, 6) means the table has 6 rows and 6 columns. Here’s the code.

```python
# check the dimensions of the data table
df.shape
(6, 6)
```

Data table information

Use the info function to view overall information about the data table, which includes the data dimensions, column names, data formats, memory usage, and more.

```python
# check the data table information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id          6 non-null int64
date        6 non-null datetime64[ns]
city        6 non-null object
category    6 non-null object
age         6 non-null int64
price       4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0 bytes
```

Viewing data formats

In Excel you check a value’s format by selecting the cell and looking at the number type shown on the Home tab. Python uses dtypes to return the data formats.

dtypes views data formats: you can check all the columns of a table at once, or check a single specified column separately.

```python
# check the format of all columns
df.dtypes
id                 int64
date      datetime64[ns]
city              object
category          object
age                int64
price            float64
dtype: object

# check the format of a single column
df['age'].dtype
dtype('int64')
```

Check for null values

The way to find null values in Excel is Go To Special, which locates empty cells in a data table; it lives under Find & Select on the Home tab.

isnull is the Python function that checks for null values. It returns booleans: True where a value is null, False where it is not. It can check the entire table or a single column.

df_isnull

```python
# check a single column for null values
df['price'].isnull()
0    False
1     True
2    False
3    False
4     True
5    False
Name: price, dtype: bool
```
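The whole-table check mentioned above is the same call without selecting a column; a minimal sketch:

```python
# check every cell in the table; returns a table of booleans
df.isnull()
```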

View unique values

The way to see unique values in Excel is to color them using “conditional formatting.” Use the unique function in Python to view unique values.

View unique values

unique views the distinct values of a specific column in a data table; it cannot be applied to the whole table at once. Below is the code that returns the unique values of the city column, similar to the result after removing duplicates in Excel.

```python
# check the unique values of the city column
df['city'].unique()
array(['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '], dtype=object)
```

View table values

The values attribute in Python views the values in a data table. They are returned as an array, without header information.

```python
# check the values of the data table
df.values
array([[1001, Timestamp('2013-01-02 00:00:00'), 'Beijing ', '100-A', 23,
        1200.0],
       [1002, Timestamp('2013-01-03 00:00:00'), 'SH', '100-B', 44, nan],
       [1003, Timestamp('2013-01-04 00:00:00'), ' guangzhou ', '110-A', 54,
        2133.0],
       [1004, Timestamp('2013-01-05 00:00:00'), 'Shenzhen', '110-C', 32,
        5433.0],
       [1005, Timestamp('2013-01-06 00:00:00'), 'shanghai', '210-A', 34,
        nan],
       [1006, Timestamp('2013-01-07 00:00:00'), 'BEIJING ', '130-F', 32,
        4432.0]], dtype=object)
```

View column names

The columns attribute views the column names of the data table.

```python
# check the column names of the data table
df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
```

View the first N rows of data

The head function views the first N rows of a table; by default head() displays the first 5 rows. The code below views the data in the first three rows.

```python
# check the first 3 rows of data
df.head(3)
```

df_head(3)

tail() views the last N rows of the table; by default it displays the last 5 rows. You can set the parameter value to determine how many rows to view. The code below views the data in the last three rows.

```python
# check the last 3 rows of data
df.tail(3)
```

03 Data Table Cleaning

The third part is cleaning the problems in the data table. The main topics are handling null values, letter case, data formats, and duplicate values. Logical validation across fields is not covered here.

Handling null values (delete or fill)

We intentionally set several NA values in the price field when creating the table. Null values can be handled in several ways: delete the rows that contain them, or fill them in, for example with 0 or with the mean. Null values can also be inferred from the logic of other fields.
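As a sketch of that logic-based filling (not shown in the original code), pandas offers forward-filling and interpolation; both assume the missing value can be estimated from neighboring rows:

```python
# carry the previous non-null price forward into the gap
df['price'].fillna(method='ffill')

# or estimate the missing price by linear interpolation between neighbors
df['price'].interpolate()
```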

In Excel, null values can be handled with Find and Replace, uniformly replacing them with 0 or the mean; they can also be located with Go To Special.

Find and replace null values

Python is flexible in handling nulls. You can use the dropna function to delete the rows containing null values, or the fillna function to fill them in. In the code and results below, you can see that after dropna the two rows containing NA values are gone, and a table without null values is returned.

```python
# drop rows that contain any null value
df.dropna(how='any')
```

Null values can also be filled with a number. The following code uses the fillna function to fill the null values with 0.

```python
# fill null values with the number 0
df.fillna(value=0)
```

Here we choose the filling approach: the null values in the price column are filled with the column’s mean, again using fillna. The mean function computes the mean of the non-null price values, which is then used as the fill value. You can see that the two previously null fields now show 3299.5.

```python
# fill the nulls in price with the column mean
df['price'].fillna(df['price'].mean())
0    1200.0
1    3299.5
2    2133.0
3    5433.0
4    3299.5
5    4432.0
Name: price, dtype: float64
```

df_nan

Clean up spaces

In addition to null values, whitespace in characters is also a common problem in data cleaning. Here is the code for clearing whitespace in characters.

```python
# strip leading and trailing spaces from the city field
df['city'] = df['city'].map(str.strip)
```

Case conversion

In English fields, inconsistent capitalization is another common problem. Excel has the UPPER and LOWER functions, and Python has string methods of the same names. The city column of our table has this problem; we convert all its letters to lowercase. Here is the code and the result.

```python
# convert the city column to lowercase
df['city'] = df['city'].str.lower()
```

lower
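For completeness, the uppercase counterpart uses the str.upper method; a minimal sketch:

```python
# convert the city column to uppercase instead
df['city'].str.upper()
```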

Changing the data format

In Excel you modify the data format with Format Cells. In Python the astype function modifies the data format.

Set the cell format

dtype views the data format; its counterpart astype changes it. The code below converts the price field to int format.

```python
# convert the price column to int
df['price'].astype('int')
0    1200
1    3299
2    2133
3    5433
4    3299
5    4432
Name: price, dtype: int32
```

Changing column names

rename is the function that changes column names; here we change the category column in the table to category-size. Below are the code and the result of the change.
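A minimal sketch of the call, using rename with a columns mapping:

```python
# rename the category column to category-size
df.rename(columns={'category': 'category-size'})
```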

df_rename

Deleting duplicate values

Many data tables also contain duplicate values. Excel has a Remove Duplicates function on the Data tab that deletes duplicate values from a table; by default it keeps the first occurrence and deletes the later repeats.

Delete duplicates

In Python, the drop_duplicates function deletes duplicate values. Take the city column in the data table, which contains duplicate values. By default, drop_duplicates() keeps the first occurrence and deletes the later ones (the same logic as Excel). Adding the keep='last' parameter instead removes the earlier duplicates and keeps the last occurrence. Below is the detailed code and a comparison of the results.

In the original city column, beijing is duplicated, appearing in the first and last positions.

```python
df['city']
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object
```

Using the default drop_duplicates() deletes the duplicates: in the result, the first occurrence of beijing is kept and the last one is removed.

```python
# drop duplicates, keeping the first occurrence (default)
df['city'].drop_duplicates()
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
Name: city, dtype: object
```

After setting the keep= ‘last’ parameter, contrary to the previous result of deleting the duplicate value, the first occurrence of Beijing is deleted, and the last occurrence of Beijing is preserved.

```python
# drop duplicates, keeping the last occurrence
df['city'].drop_duplicates(keep='last')
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object
```
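A related option, keep=False, drops every occurrence of a duplicated value; a minimal sketch:

```python
# drop all rows whose city value appears more than once
df['city'].drop_duplicates(keep=False)
```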

Value modification and replacement

The final problem in data cleaning is modifying or replacing values, which in Excel is done with Find and Replace.

Find and replace

Python uses the replace function to replace data. Shanghai is written two ways in the city field, shanghai and sh; we use replace to replace sh.

```python
# replace sh with shanghai in the city column
df['city'].replace('sh', 'shanghai')
0      beijing
1     shanghai
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object
```

This is the second part of the series, covering parts 4-6: data preprocessing, data extraction, and data filtering.

04 Data Preprocessing

The fourth part is preprocessing: getting the cleaned data ready for the later statistics and analysis. It mainly covers merging data tables, setting an index, sorting, grouping and marking data, and splitting values into columns.

Table merge

The first step is merging different tables. Here we create a new table df1 and merge the two tables df and df1. Excel has no direct table-merge function; the effect can be achieved step by step with the VLOOKUP function. In Python it is done in one go with the merge function. We create the df1 table to be merged with df.

```python
# create the df1 data table
df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    'gender': ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    'pay': ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    'm-point': [10, 12, 20, 40, 40, 40, 30, 20]})
```

df1

The merge function joins the two tables in the inner mode: the rows matched in both tables are combined into a new table, which we call df_inner.

```python
# merge the two tables
df_inner = pd.merge(df, df1, how='inner')
```

df_inner

In addition to inner, there are left, right, and outer ways to merge. The differences between these approaches are explained and contrasted in detail in my other articles.

```python
# other join modes
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')
```

Set index column

After merging, we set an index column for the df_inner table. The index has many uses, such as data extraction, summarizing, and filtering. The function that sets the index is set_index.

```python
# set the id column as the index
df_inner.set_index('id')
```

df_inner_set_index

Sort (by index, by value)

In Excel, you can sort the data table directly with the Sort button on the Data tab, which is quite simple. Python uses the sort_values and sort_index functions for sorting.

Sorting

In Python, you can sort a data table either by index or by the values of a specified column. First we sort the table by the users’ ages in the age column, using sort_values.

```python
# sort by the age column
df_inner.sort_values(by=['age'])
```
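sort_values also takes an ascending parameter; a sketch of a descending sort:

```python
# sort by age in descending order
df_inner.sort_values(by=['age'], ascending=False)
```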

The sort_index function sorts the table by the values of the index column.

```python
# sort by index
df_inner.sort_index()
```

sort_index

Data grouping

In Excel you can group values with an approximate-match VLOOKUP, or with a PivotTable. The corresponding tool in Python is the where function in numpy.

The where function evaluates the data and groups it. In the code below we test the values of the price column, putting rows that meet the condition into one group and the rest into another, and mark them in a group field.

```python
# group by price: above 3000 is 'high', otherwise 'low'
df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')
```

where

Besides where, you can also group by testing the values of several fields at once. In the code below, rows where the city column equals beijing and the price column is greater than or equal to 4000 are marked with sign 1.

```python
# mark rows where city is beijing and price >= 4000
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1
```

sign

Splitting values into columns

The opposite of grouping is splitting a value into columns, which Excel provides as Text to Columns on the Data tab. Python uses the split function to achieve it.

Text to Columns

The values in the category column contain two pieces of information: a numeric category ID in front and a letter size value behind, joined by a hyphen. We use the split function to split this field and then merge the split table back into the original table.

```python
# split category on '-' and generate a data table;
# the index equals df_inner's index, and the columns are named category and size
split = pd.DataFrame((x.split('-') for x in df_inner['category']),
                     index=df_inner.index, columns=['category', 'size'])
```

split

```python
# merge the split table back into the original table by index
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
```

merge_1

05 Data Extraction

The fifth part is data extraction, one of the most common tasks in data analysis. It mainly uses three functions: loc, iloc, and ix. loc extracts by label value, iloc extracts by position, and ix can extract by label and position at the same time. The following shows how to use each of them.

Extract by label (loc)

The loc function extracts data by the index labels of the data table. The code below extracts the single row whose index label is 3.

```python
# extract the single row with index label 3
df_inner.loc[3]
id                            1004
date           2013-01-05 00:00:00
city                      shenzhen
category                     110-C
age                             32
price                         5433
gender                      female
m-point                         40
pay                              Y
group                         high
sign                           NaN
category_1                     110
size                             C
Name: 3, dtype: object
```

The colon limits the range of extracted data: the value before the colon is the starting label and the value after it is the ending label. The code below extracts rows 0 through 5; note that with loc both endpoints are inclusive.

```python
# extract rows with index labels 0 through 5
df_inner.loc[0:5]
```

df_inner_loc1

The reset_index function restores the default numeric index. Here we first reset the index, then set the date field as the table’s index and extract data by date.

```python
# reset the index
df_inner.reset_index()
```

reset_index

```python
# set the date column as the index
df_inner = df_inner.set_index('date')
```

set_index_date

A colon limits the range of extracted data; leaving the spot before the colon empty means starting from the beginning. The code below extracts all data up to January 4, 2013.

```python
# extract all rows up to 2013-01-04
df_inner[:'2013-01-04']
```

Extract by date

Extraction by location (iloc)

The iloc function extracts data from the table by position. Here the numbers before and after the colon are no longer index labels but data positions, counted from 0; with iloc the end position is exclusive.

```python
# use iloc to extract data by position range
# (the exact slice here is illustrative: the first 3 rows and first 2 columns)
df_inner.iloc[:3, :2]
```

iloc1

Besides extracting a region, iloc can also pick out individual positions: the first list [0, 2, 5] gives the row positions and the second list [4, 5] gives the column positions.

```python
# extract rows 0, 2, 5 and columns 4, 5 by position
df_inner.iloc[[0, 2, 5], [4, 5]]
```

iloc2

Extract by label and position (ix)

ix is a hybrid of loc and iloc: it can extract by index label and by position at the same time. In the code below the rows are selected by the date index label and the columns by position. (Note that ix has since been deprecated and removed in current versions of pandas.)

```python
# extract rows up to the label '2013-01-03' and the first 4 columns by position
df_inner.ix[:'2013-01-03', :4]
```

ix
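Since ix is no longer available in current pandas, a sketch of an equivalent selection chains loc and iloc (assuming the date index set above):

```python
# rows by label with loc, then the first 4 columns by position with iloc
df_inner.loc[:'2013-01-03'].iloc[:, :4]
```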

Extraction by conditions (regions and conditional values)

Besides label and position, you can also extract data by specific conditions. loc and the isin function are used together to extract data that matches given conditions.

Use the isin function to check whether each city value is beijing.

```python
# check whether city is beijing
df_inner['city'].isin(['beijing'])
date
2013-01-02     True
2013-01-05    False
2013-01-07     True
2013-01-06    False
2013-01-03    False
2013-01-04    False
Name: city, dtype: bool
```

Nesting isin inside the loc extraction extracts the rows where the result is True. Here we change the condition to whether the city value is beijing or shanghai, and extract the matching rows.

```python
# first check whether city is beijing or shanghai, then extract the matching rows
df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]
```

loc extraction by filter condition

String extraction can also do work similar to splitting columns, pulling a specified part out of a combined value.

```python
category = df_inner['category']
0    100-A
3    110-C
5    130-F
4    210-A
1    100-B
2    110-A
Name: category, dtype: object

# extract the first three characters and generate a data table
pd.DataFrame(category.str[:3])
```

category_str

06 Data Filtering

The sixth part is data filtering: combining the conditions and, or, and not with greater-than, less-than, and equals to filter data, then counting and summing the results. This is similar to Excel’s Filter function and its countifs and sumifs functions.

Filter by condition (and, or, not)

Excel’s Data tab provides a Filter function for filtering data tables by different criteria. Python uses the loc function together with filter conditions. Combined with the sum and count functions, this also reproduces Excel’s sumif and countif.

Filter

Filter with an “and” condition: age greater than 25 and city equal to beijing. Only one row meets both requirements.

1 # "and" used to filter 2 df_inner. Loc [(df_inner [' age '] > 25) & (df_inner [' city '] = = 'Beijing'), ['id','city','age','category','gender']]Copy the code

and

Filter with an “or” condition: age over 25 or city equal to beijing. Six rows meet the requirements.

1 # "or" used filter 2 df_inner. Loc [(df_inner [' age '] > 25) | (df_inner [' city '] = = 'Beijing'), ['id','city','age','category','gender']].sort 3(['age'])Copy the code

or

Appending the price field and the sum function to the previous code sums the price values of the filtered rows, equivalent to Excel’s sumifs function.

```python
# sum the price field after filtering
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'),
             ['id', 'city', 'age', 'category', 'gender', 'price']].sort_values(['age']).price.sum()

19796
```

Filter with a “not” condition: city not equal to beijing. Four rows meet the condition. The filtered results are sorted by the id column.

1 # use "not" filtered condition 2 df_inner. Loc [(df_inner [' city ']! = 'Beijing'), [' id ', 'city', 'age', 'category', 'gender']]. Sort ([' id '])Copy the code

not

Appending the city column and the count function to the previous code counts the filtered rows, equivalent to Excel’s countifs function.

```python
# count the city column after filtering
df_inner.loc[(df_inner['city'] != 'beijing'),
             ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()

4
```

Another way to filter is to use the query function. Here is the code and the filtering results.

```python
# filter with the query function
df_inner.query('city == ["beijing", "shanghai"]')
```

query

Appending the price field and the sum function to the previous code sums the filtered price values, equivalent to Excel’s sumifs function.

```python
# sum the price field after query filtering
df_inner.query('city == ["beijing", "shanghai"]').price.sum()

12230
```

This is the third part of the series, covering parts 7-9: data summary, data statistics, and data output.

07 Data Summary

The seventh part is classifying and summarizing data. Excel uses Subtotal and PivotTables to summarize data by specific dimensions. The main Python functions are groupby and pivot_table; the following shows how to use each.

Classified summary

Excel’s Data tab provides the Subtotal function, which summarizes a data table by a specified field and aggregation method. Python does this with the groupby function, which also supports multi-level grouping.

Classification summary 1

groupby is the function for classifying and summarizing: you can specify a single column name or several at once, and groupby groups them in the order given. You also need to specify the aggregation method after grouping; the common ones are counting and summing.

```python
# count all columns grouped by city
df_inner.groupby('city').count()
```

You can name a column after groupby to summarize that specific column. The code below counts the id field by city.

```python
# count the id column grouped by city
df_inner.groupby('city')['id'].count()
city
beijing      2
guangzhou    1
shanghai     2
shenzhen     1
Name: id, dtype: int64
```

Adding a second column name on top of the previous code groups and counts by both the city and size fields.

```python
# count the id column grouped by city and size
df_inner.groupby(['city', 'size'])['id'].count()
city       size
beijing    A       1
           F       1
guangzhou  A       1
shanghai   A       1
           B       1
shenzhen   C       1
Name: id, dtype: int64
```

Besides counting and summing, the grouped data can be aggregated along several dimensions at once. In the code below the price field is grouped by city, and the count, total, and mean of price are computed together.

```python
# group by city and compute the count, sum, and mean of price
df_inner.groupby('city')['price'].agg([len, np.sum, np.mean])
```

groupby1

Pivot table

Excel’s Insert tab provides the PivotTable function for summarizing data tables by specific dimensions. Python provides the same capability through the pivot_table function.

pivot

Pivot tables are another common way to classify and summarize data, and they are more powerful than groupby. The code below sets city as the row field, size as the column field, and price as the value field, computes the count and sum of price, and adds row and column totals.

```python
# pivot table: city as rows, size as columns, count and sum of price as values
pd.pivot_table(df_inner, index=['city'], values=['price'], columns=['size'],
               aggfunc=[len, np.sum], fill_value=0, margins=True)
```

pivot_table

08 Data Statistics

The eighth part is data statistics, mainly covering data sampling, standard deviation, covariance, and correlation coefficients.

Data sampling

Excel’s Data Analysis add-in provides a sampling function, as shown in the figure below. Python uses the sample function to sample data.

Data sampling

sample is the function that draws a random sample of the data; just set n to the number of rows, and the function returns the sampled result.

```python
# simple random sample of 3 rows
df_inner.sample(n=3)
```

Simple random sampling

The weights parameter sets sampling weights: data with higher weights is more likely to be selected, so different weights change the sampling result. Here we set the weights of the six rows manually, the first four to 0 and the last two to 0.5.

```python
# set the sampling weights manually
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)
```

Manually set the sampling weight to 1

As can be seen from the sampling results, the last two data with high weights are selected.

Manually set sampling weight 2

The sample function also has a replace parameter that sets whether to sample with replacement.

```python
# sample without replacement
df_inner.sample(n=6, replace=False)
```

Do not put back after sampling

```python
# sample with replacement
df_inner.sample(n=6, replace=True)
```

Sample and put it back

Descriptive statistics

Excel’s Data Analysis add-in provides a Descriptive Statistics function. In Python, descriptive statistics are produced with describe.

Descriptive statistics

The describe function performs descriptive statistics, automatically computing the count, mean, standard deviation, and other figures. The code below runs descriptive statistics on the table, uses the round function to set the decimal places of the result, and transposes the output with T.

```python
# descriptive statistics, rounded to 2 decimal places and transposed
df_inner.describe().round(2).T
```

Standard deviation

The std function in Python calculates the standard deviation of a specific column of data.

```python
# standard deviation of the price column
df_inner['price'].std()
1523.3516556155596
```

Covariance

Excel’s Data Analysis add-in provides covariance calculation. In Python, the cov function calculates the covariance between two fields or between all the columns of a data table.

covariance

The cov function calculates the covariance between two fields; it can be applied to a specific pair of fields or to all column pairs of the whole table.

```python
# covariance between the price and m-point fields
df_inner['price'].cov(df_inner['m-point'])
17263.200000000001
```

cov

Correlation analysis

Excel’s Data Analysis add-in provides correlation coefficient calculation; Python completes correlation analysis with the corr function, which returns the correlation coefficient.

The correlation coefficient

The corr function calculates correlation coefficients, either between two specific columns or across all columns of the table. The coefficient lies between -1 and 1: close to 1 means positive correlation, close to -1 means negative correlation, and 0 means no correlation.

```python
# correlation between the price and m-point fields
df_inner['price'].corr(df_inner['m-point'])
```

corr

09 Data Output

The ninth part is data output. The processed and analyzed data can be output in XLSX format and CSV format.

Write to Excel

```python
# export to an Excel file
df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')
```

excel

Write to CSV

```python
# export to a CSV file
df_inner.to_csv('excel_to_python.csv')
```
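to_csv also accepts optional parameters; a sketch with illustrative values:

```python
# drop the index and use an Excel-friendly encoding (illustrative values)
df_inner.to_csv('excel_to_python.csv', index=False, encoding='utf-8-sig')
```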

In data processing, much of the basic work is repetitive and mechanical, and it can be automated with custom functions. The following example defines a simple function (named table_info here for illustration) that retrieves a data table’s basic information automatically.

```python
# create the data table
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   'date': pd.date_range('20130102', periods=6),
                   'city': ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   'age': [23, 44, 54, 32, 34, 32],
                   'category': ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   'price': [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

# custom function that prints a table's basic information
# (the function name is illustrative)
def table_info(x):
    shape = x.shape
    types = x.dtypes
    colums = x.columns
    print('Data dimensions (rows, columns):\n', shape)
    print('Data format:\n', types)
    print('Column names:\n', colums)

# check the information of the df table
table_info(df)

Data dimensions (rows, columns):
 (6, 6)
Data format:
 id                 int64
date      datetime64[ns]
city              object
category          object
age                int64
price            float64
dtype: object
Column names:
 Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
```

Those are the basics of doing data analysis in Python the way you would in Excel. What do you mainly use Python for, and do you do much data analysis with it? Feel free to discuss in the comments section, and remember to share, like, comment, and follow!