This is the fourth day of my participation in the August Text Challenge.More challenges in August

Hello, everyone, I am the talented brother.

In daily data processing, we often need to type transform some data for the convenience of subsequent processing. Because I don’t like to remember them, I search for them every time I don’t remember the specific function method, which still feels a little Fei time.

Today let’s take a look at some common conversion operations and save them for a rainy day.

Directory:

[toc]

1. Specify the data type when loading data

In general, to save trouble I just pd.dataframe (data) or pd.read_xx(filename).

Such as:

import pandas as pd

df = pd.read_excel('Data type Conversion case data.xlsx')
df
Copy the code
countries popularity score Yearning degree
0 China 10 10 10.0
1 The United States 6 5.8 7.0
2 Japan 2 1.2 7.0
3 Germany 8 6.8 6.0
4 Britain 7 6.6 nan

We look at the dtypes attribute

df.dtypes
Copy the code
Country Object Popularity INT64 Score FLOAT64 Desirability FLOAT64 DType: objectCopy the code

You can see that the country field is of type Object, the popularity is of type int, and the score and desirability are both of type float. In fact, for the degree of yearning we might want an int integer, and the country field is a string.

So, we can specify each field data type with the parameter dtype when loading the data.

import pandas as pd

df = pd.read_excel('Data type Conversion case data.xlsx',
                   dtype={
                       'countries':'string'.'Desirability':'Int64'
                   }
                  )
df
Copy the code
countries popularity score Yearning degree
0 China 10 10 10
1 The United States 6 5.8 7
2 Japan 2 1.2 7
3 Germany 8 6.8 6
4 Britain 7 6.6

Look again at the dtypes attribute

df.dtypes
Copy the code
Country String Popularity INT64 Score FLOAT64 Yearning int64 DType: objectCopy the code

Similarly, when creating DataFrame data, you can use the dtype parameter to set the data type (in this case, all fields).

df = pd.DataFrame({'A': [1.2.3.4.].'B': [1.3.5.7]
                  },
                  dtype='float32'
                 )
df.dtypes
Copy the code
A    float32
B    float32
dtype: object
Copy the code

2. Astype Converts data types

For existing data, we often use astype to convert data types, either for a Series column or for multiple columns at the same time.

In [1]: df. Popularity astype('float')
Out[1] :0    10.0
1     6.0
2     2.0
3     8.0
4     7.0Name: Popularity, dType: float64 In [2]: df.astype({'countries':'string'.'Desirability':'Int64'})
Out[2]: Country popularity rating0China10  10.0    10
1The United States6   5.8     7
2Japan2   1.2     7
3Germany8   6.8     6
4Britain7   6.6  <NA>
Copy the code

3. Pd. to_xx Converts data types

3.1.pd. to_datetime converts to the time type

  • The string for the date like is converted to a date
  • Timestamp conversion to date, etc
  • Numeric strings are converted to dates in format

If an unconverted case is encountered, an error will be reported by default. You can set errors=’ COERce ‘to NaT for the unconverted case.

# Convert a string to a date
In [3]: s = pd.Series(['3/11/2000'.'3/12/2000'.'3/13/2000'])

In [4]: s
Out[4] :0    3/11/2000
1    3/12/2000
2    3/13/2000
dtype: object

In [5]: pd.to_datetime(s, infer_datetime_format=True)
Out[5] :0   2000- 03 -11
1   2000- 03 -12
2   2000- 03 -13
dtype: datetime64[ns]

You can also convert the timestamp to a date
In [6]: s = pd.Series([1490195805.1590195805.1690195805])

In [7]: pd.to_datetime(s, unit='s')
Out[7] :0   2017- 03 -22 15:16:45
1   2020- 05 -23 01:03:25
2   2023- 07 -24 10:50:05
dtype: datetime64[ns]

In [8]: s = pd.Series([1490195805433502912.1590195805433502912.1690195805433502912])

In [9]: pd.to_datetime(s, unit='ns')
Out[9] :0   2017- 03 -22 15:16:45.433502912
1   2020- 05 -23 01:03:25.433502912
2   2023- 07 -24 10:50:05.433502912
dtype: datetime64[ns]

# Number string is converted to date according to format
In [10]: s = pd.Series(['20200101'.'20200202'.'202003'])

In [11]: pd.to_datetime(s, format='%Y%m%d', errors='ignore')
Out[11] :0    20200101
1    20200202
2      202003
dtype: object

In [12]: pd.to_datetime(s, format='%Y%m%d', errors='coerce')
Out[12] :0   2020- 01-011   2020- 02-022          NaT
dtype: datetime64[ns]
Copy the code

It should be noted that the default start time for the date conversion of the above timestamp is 1970-01-01, which will differ by 8 hours for the domestic time. We have the following ways to process it.

In [13]: s
Out[13] :0    1490195805
1    1590195805
2    1690195805
dtype: int64

Greenwich time by default
In [14]: pd.to_datetime(s, unit='s')
Out[14] :0   2017- 03 -22 15:16:45
1   2020- 05 -23 01:03:25
2   2023- 07 -24 10:50:05
dtype: datetime64[ns]

# Add 8 hours to the start time
In [15]: pd.to_datetime(s, unit='s', origin=pd.Timestamp('the 1970-01-01 08:00:00'))
Out[15] :0   2017- 03 -22 23:16:45
1   2020- 05 -23 09:03:25
2   2023- 07 -24 18:50:05
dtype: datetime64[ns]

Add 8 hours manually
In [16]: pd.to_datetime(s, unit='s') + pd.Timedelta(days=8/24)
Out[16] :0   2017- 03 -22 23:16:45
1   2020- 05 -23 09:03:25
2   2023- 07 -24 18:50:05
dtype: datetime64[ns]
Copy the code

3.2.pd.to_numeric is converted to a numeric type

In [17]: s = pd.Series(['1.0'.'2', -3])

In [18]: pd.to_numeric(s)
Out[18] :0    1.0
1    2.0
2   -3.0
dtype: float64

In [19]: pd.to_numeric(s, downcast='signed')
Out[19] :0    1
1    2
2   -3
dtype: int8

In [20]: s = pd.Series(['apple'.'1.0'.'2', -3])

In [21]: pd.to_numeric(s, errors='ignore')
Out[21] :0    apple
1      1.0
2        2
3       -3
dtype: object

In [22]: pd.to_numeric(s, errors='coerce')
Out[22] :0    NaN
1    1.0
2    2.0
3   -3.0
dtype: float64
Copy the code

3.3.pd.to_timedelta to a time difference type

Convert numbers, time difference string like, and so on to time difference data types

In [23] :import numpy as np

In [24]: pd.to_timedelta(np.arange(5), unit='d')
Out[24]: TimedeltaIndex(['0 days'.'1 days'.'2 days'.'3 days'.'4 days'], dtype='timedelta64[ns]', freq=None)

In [25]: pd.to_timedelta('1 days 06:05:01. 00003')
Out[25]: Timedelta('1 days 06:05:01. 000030')

In [26]: pd.to_timedelta(['1 days 06:05:01. 00003'.'15.5 us'.'nan'])
Out[26]: TimedeltaIndex(['1 days 06:05:01. 000030'.'0 days 00:00:00. 000015500', NaT], dtype='timedelta64[ns]', freq=None)
Copy the code

4. Determine the data type intelligently

The convert_dtypes method can be used to do smart data type conversions, see

5. Data type filtering

See in some learning groups often have friends ask how to filter the specified data type of the field, today we also introduce.

Pandas provides a select_dtypes() function that filters by field data type to select required fields and excluded fields.

Signature: 
df.select_dtypes(include=None, exclude=None) - >'DataFrame'
Docstring:
Return a subset of the DataFrame's columns based on the column dtypes.
Copy the code

There are the following data types:

Number: number, int, float

Boolean: the Boolean

Time: datetime64

Time difference: timedelta64

Category: the category

String: string

Object: the object

In [27]: df
Out[27]: Country popularity rating0China10  10.0  10.0
1The United States6   5.8   7.0
2Japan2   1.2   7.0
3Germany8   6.8   6.0
4Britain7   6.6   NaN

In [28]: df.dtypes
Out[28] : the countryobjectPopularity INT64 Score FLOAT64 Desirability float64 DType:object

In [29]: df.select_dtypes(include='int')
Out[29]: Popularity0    10
1     6
2     2
3     8
4     7

In [30]: df.select_dtypes(include='number')
Out[30]: Popularity rating Desirability0    10  10.0  10.0
1     6   5.8   7.0
2     2   1.2   7.0
3     8   6.8   6.0
4     7   6.6   NaN

In [31]: df.select_dtypes(include=['int'.'object'])
Out[31]: National popularity0China10
1The United States6
2Japan2
3Germany8
4Britain7

In [32]: df.select_dtypes(exclude=['object'])
Out[32]: Popularity rating Desirability0    10  10.0  10.0
1     6   5.8   7.0
2     2   1.2   7.0
3     8   6.8   6.0
4     7   6.6   NaN
Copy the code

The above is all the content of this time, I hope to help you, if you like, remember to click a thumbs-up, brush a look at oh!

In fact, there are more parameter methods that you can use either help or? Take a look at the demo!