Introduction to the

Today we’ll cover the advanced tutorials for Pandas, including reading and writing files, selecting subsets, and graphical representations.

Read and write files

A key step in data processing is to read the file for analysis, and then write the analysis results to the file again.

Pandas supports reading and writing in a variety of file formats:

In [108]: pd.read_
 read_clipboard() read_excel()     read_fwf()       read_hdf()       read_json        read_parquet     read_sas         read_sql_query   read_stata
 read_csv         read_feather()   read_gbq()       read_html        read_msgpack     read_pickle      read_sql         read_sql_table   read_table
Copy the code

The Titanic. CSV provided by Pandas’ website will be used as an example to explain how Pandas can be used.

Titanic. CSV provides the information of more than 800 passengers on Titanite. It is a matrix of 891 rows x 12 columns.

We use Pandas to read the CSV:

In [5]: titanic=pd.read_csv("titanic.csv")
Copy the code

The read_CSV method converts the CSV file to a DataFrame for pandas.

By default, we use the DF variable directly, and the first 5 lines and the last 5 lines are displayed by default:

In [3]: titanic Out[3]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 211 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female... 0ston /O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) Female... 0 113803 53.1000 C123 S 4 5 03 Allen, Mr. William Henry male... 0 373450 8.0500 NaN S.. . . . . . . . . . . . 886 887 0 2 Montvila, Rev. Juozas male ... 0 211536 13.0000 NaN S 887 888 11 Graham, Miss. Margaret... 0 112053 30.0000 B42 S 888 889 0 3 Miss. Catherine Helen "Carrie" female... 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male... 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick Male... 0 370376 7.775 NaN Q [891 rows x 12 columns]Copy the code

You can use head(n) and tail(n) to specify a specific number of rows:

In [4]: titanic.head(8) Out[4]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 211 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female... 0ston /O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) Female... 0 113803 53.1000 C123 S 4 5 03 Allen, Mr. William Henry male... 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male... 0 330877 8.4583 NaN Q 6 7 0 1 McCarthy, Mr. Timothy J male... E46 S 7 8 0 3 Palsson, Master. Gosta Leonard male... 1 349909 21.075 NaN S [8 rows x 12 columns]Copy the code

Use dtypes to view the data types for each column:

In [5]: titanic.dtypes
Out[5]: 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
Copy the code

Use to_excel to convert DF to excel files, and use read_excel to read excel files again:

In [11]: titanic.to_excel('titanic.xlsx', sheet_name='passengers', index=False)

In [12]: titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')
Copy the code

Use info() to do a preliminary count of DF:

In [14]: titanic.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: Float64 (2), INT64 (5), Object (5) Memory Usage: 83.6+ KBCopy the code

The choice of DF

Select column data

DF’s head or tail methods can only display all column data. The following methods can select specific column data.

In [15]: ages = Titanic ["Age"] In [16]: ages. Head () Out[16]: 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 Name: Age, dtype: float64Copy the code

Each column is a Series:

In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series

In [7]: titanic["Age"].shape
Out[7]: (891,)
Copy the code

You can also choose from several options:

In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]: 
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male
Copy the code

If multiple columns are selected, the result returned is of type DF:

In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
Copy the code

Select row data

We talked about how to select column data, now let’s look at how to select row data:

Select customers over 35 years old:

In [12]: above_35 = titanic[titanic["Age"] > 35] In [13]: above_35.head() Out[13]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 6 7 0 1 McCarthy, Mr. Timothy J male... 0 17463 51.8625 E46 S 11 12 11 Bonnell, Miss. Elizabeth female... 0 113783 26.5500c103 S 13 14 03 Andersson, Mr. Anders Johan male... 5 347082 31.2750 NaN S 15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) Female... 0 248706 16.000 NaN S [5 rows x 12 columns]Copy the code

Use ISIN to select Pclass in 2 and 3 for all customers:

In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])] In [17]: class_23.head() Out[17]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris Male 22.0 1 0 A/5 21171 7.2500 NaN S 2 3 1 3 Heikkinen Miss. Laina female 26.0 0 STON/O2. 3101282 7.9250 NaN S 4 50 3 Allen, Mr. William Henry Male 35.0 00 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James Male NaN 0 0 330877 8.4583 NaN Q 7 8 0 3 Palsson, Master. Gosta Leonard Male 2.0 3 1 349909 21.0750 NaN SCopy the code

Isin above is equal to:

In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
Copy the code

Filter Age is not empty:

In [20]: age_no_na = titanic[titanic["Age"].notna()] In [21]: age_no_na.head() Out[21]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 211 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female... 0ston /O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) Female... 0 113803 53.1000 C123 S 4 5 03 Allen, Mr. William Henry male... 0 373450 8.0500 NaN S [5 rows x 12 columns]Copy the code

Select both rows and columns

We can select both rows and columns.

Rows and columns can be selected using LOC and ILOC. The difference between loC and ILOC is that LOC selects by name and ILOC selects by number.

Select age>35 from passenger name:

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]: 
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object
Copy the code

The first value in loC represents row selection and the second value represents column selection.

Use ILOC to select:

In [25]: titanic.iloc[9:25, 2:5] Out[25]: Pclass Name Sex 9 2 Nasser, Mrs. Nicholas (Adele Achem) female 10 3 Sandstrom, Miss. Marguerite Rut female 11 1 Bonnell, Miss. Elizabeth female 12 3 Saundercock, Mr. William Henry male 13 3 Andersson, Mr. Anders Johan male .. . . . 20 2 Fynney, Mr. Joseph J male 21 2 Beesley, Mr. Lawrence male 22 3 McGowan, Miss. Anna "Annie" female 23 1 Sloper, Mr. William Thompson male 24 3 Palsson, Miss. Torborg Danira female [16 rows x 3 columns]Copy the code

Plot plots

How to convert DF into a diversified graphical display?

To use matplotlib on the command line, start ipython’s QT environment:

ipython qtconsole --pylab=inline
Copy the code

Use plot directly to show the passenger information we read above:

import matplotlib.pyplot as plt

import pandas as pd

titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')

titanic.plot()
Copy the code

The horizontal coordinate is the index in DF, and the column coordinate is the name of each column. Note that the columns above show only numeric types.

We only show age information:

titanic['Age'].plot()
Copy the code

The default is bar chart, we can convert the form of graph, such as dot chart:

Titanic. The plot. Scatter (x = "PassengerId", y = "Age", alpha = 0.5)Copy the code

Select PassengerId as X-axis and age as Y-axis in the data:

In addition to scatter plots, many other images are supported:

[method_name for method_name in dir(titanic.plot) if not method_name.startswith("_")]
Out[11]: 
['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']
Copy the code

Here’s another box:

titanic['Age'].plot.box()
Copy the code

It can be seen that most of the passengers are between 20 and 40 years old.

You can also graph the selected columns separately:

titanic.plot.area(figsize=(12, 4), subplots=True)
Copy the code

Specify a specific column:

titanic[['Age','Pclass']].plot.area(figsize=(12, 4), subplots=True)
Copy the code

You can also draw and fill:

fig, axs = plt.subplots(figsize=(12, 4));
Copy the code

Draw an empty graph and fill it:

titanic['Age'].plot.area(ax=axs);

axs.set_ylabel("Age");

fig
Copy the code

Create a new column using an existing column

Sometimes we need to transform an existing column to get a new one. For example, if we want to add an Age2 column with the value Age +10, we can do this:

titanic["Age2"]=titanic["Age"]+10; Titanic [["Age","Age2"]]. Head () Out[34]: Age2 0 22.0 32.0 1 38.0 48.0 2 26.0 36.0 3 35.0 45.0 4 35.0 45.0Copy the code

Columns can also be renamed:

titanic_renamed = titanic.rename( ... : columns={"Age": "Age2", ... : "Pclass": "Pclas2"})Copy the code

Convert column names to lowercase:

titanic_renamed = titanic_renamed.rename(columns=str.lower)
Copy the code

The statistical

Let’s look at the average age of passengers:

Titanic [r]. "Age" scheme () Out [35] : 29.69911764705882Copy the code

Select the median:

Titanic [["Age", "Fare"]]. Median () Out[36]: Age 28.0000 Fare 14.4542 dType: float64Copy the code

More information:

titanic[["Age", "Fare"]].describe()
Out[37]: 
              Age        Fare
count  714.000000  891.000000
mean    29.699118   32.204208
std     14.526497   49.693429
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     80.000000  512.329200
Copy the code

Use AGG to specify specific aggregation methods:

titanic.agg({'Age': ['min', 'max', 'median', 'skew'],'Fare': ['min', 'max', 'median', 'mean']})
Out[38]: 
              Age        Fare
max     80.000000  512.329200
mean          NaN   32.204208
median  28.000000   14.454200
min      0.420000    0.000000
skew     0.389108         NaN
Copy the code

You can use groupby:

Titanic [["Sex", "Age"]].groupby("Sex").mean() Out[39]: Age Sex female 27.915709 male 30.726645Copy the code

Groupby all columns:

titanic.groupby("Sex").mean()
Out[40]: 
        PassengerId  Survived    Pclass        Age     SibSp     Parch  
Sex                                                                      
female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682   
male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702   
Copy the code

You can also select specific columns after groupby:

Titanic. Groupby ("Sex")["Age"]. Mean () Out[41]: Sex female 27.915709 male 30.726645 Name: Age, dType: float64Copy the code

Can be categorized into count:

titanic["Pclass"].value_counts()
Out[42]: 
3    491
1    216
2    184
Name: Pclass, dtype: int64
Copy the code

The above is equivalent to:

titanic.groupby("Pclass")["Pclass"].count()
Copy the code

DF restructuring

You can sort by a column:

titanic.sort_values(by="Age").head()
Out[43]: 
     PassengerId  Survived  Pclass                             Name     Sex  \
803          804         1       3  Thomas, Master. Assad Alexander    male   
755          756         1       2        Hamalainen, Master. Viljo    male   
644          645         1       3           Baclini, Miss. Eugenie  female   
469          470         1       3    Baclini, Miss. Helene Barbara  female   
78            79         1       2    Caldwell, Master. Alden Gates    male   
Copy the code

Sort by multiple columns:

titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[44]: 
     PassengerId  Survived  Pclass                       Name     Sex   Age  \
851          852         0       3        Svensson, Mr. Johan    male  74.0   
116          117         0       3       Connors, Mr. Patrick    male  70.5   
280          281         0       3           Duane, Mr. Frank    male  65.0   
483          484         1       3     Turkula, Mrs. (Hedwig)  female  63.0   
326          327         0       3  Nysveen, Mr. Johan Hansen    male  61.0   
Copy the code

Select the specific row and column data, in the following example we will select the part of the data that is female:

female=titanic[titanic['Sex']=='female'] female_subset=female[["Age","Pclass","PassengerId","Survived"]].sort_values(["Pclass"]).groupby(["Pclass"]).head(2) female_subset Out[58]: Age Pclass PassengerId passenger 1 38.0 1 2 1 356 22.0 1 357 1 726 30.02 727 1 443 28.02 444 1 855 18.03 856 1 654 18.0 3, 655Copy the code

Pivot can be used to convert the axis:

female_subset.pivot(columns="Pclass", values="Age")
Out[62]: 
Pclass     1     2     3
1       38.0   NaN   NaN
356     22.0   NaN   NaN
443      NaN  28.0   NaN
654      NaN   NaN  18.0
726      NaN  30.0   NaN
855      NaN   NaN  18.0

female_subset.pivot(columns="Pclass", values="Age").plot()
Copy the code

This article is available at www.flydean.com/02-python-p…

The most popular interpretation, the most profound dry goods, the most concise tutorial, many tips you didn’t know waiting for you to discover!

Welcome to pay attention to my public number: “procedures those things”, understand technology, more understand you!