
This section describes basic usage of the pandas data structures. The following code creates the sample objects used below:

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
   ...: 

Head and tail

head() and tail() quickly preview a Series or DataFrame. They display five rows by default, but the number of rows to display can also be specified.

In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]: 
0   -1.157892
1   -1.344312
2    0.844885
3    1.075770
4   -0.109050
dtype: float64

In [6]: long_series.tail(3)
Out[6]: 
997   -0.289388
998   -1.020544
999    0.589993
dtype: float64

Properties and underlying data

pandas objects expose several attributes for accessing metadata:

  • shape:

    • the axis dimensions of the object, consistent with the underlying ndarray
  • Axis labels

    • Series: index (the only axis)
    • DataFrame: index (rows) and columns
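As a quick sketch of reading these attributes (reusing the df constructed at the start of this section):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 3),
                  index=pd.date_range('1/1/2000', periods=8),
                  columns=['A', 'B', 'C'])

print(df.shape)    # (8, 3), the same convention as the underlying ndarray
print(df.index)    # the row labels (a DatetimeIndex here)
print(df.columns)  # the column labels
```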

Note: it is safe to assign values to these attributes!

In [7]: df[:2]
Out[7]: 
                   A         B         C
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df
Out[9]: 
                   a         b         c
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03  1.071804  0.721555 -0.706771
2000-01-04 -1.039575  0.271860 -0.424972
2000-01-05  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427
2000-01-07  0.524988  0.404705  0.577046
2000-01-08 -1.715002 -1.039268 -0.370647

A pandas object (Index, Series, DataFrame) is a container for arrays that store the data and perform the calculations. For most types the underlying array is a numpy.ndarray, but pandas and third-party libraries generally extend the NumPy type system with custom arrays (see Data Types).

To get the actual data inside an Index or Series, use the .array property:

In [10]: s.array
Out[10]: 
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64

In [11]: s.index.array
Out[11]: 
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

Here array generally means an ExtensionArray. Exactly what an ExtensionArray is, and why pandas uses it, is beyond the scope of this section; see Data Types for more information.

To extract a NumPy array instead, use to_numpy() or numpy.asarray():

In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

When the Series or Index is backed by an ExtensionArray, to_numpy() may copy the data and coerce the values. See Data Types for details.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take datetimes with time zones as an example: NumPy has no datetime type that stores time zone information, so pandas offers two representations:

  1. An object-dtype numpy.ndarray of Timestamp objects, each carrying the correct tz information;

  2. A datetime64[ns] array, also a numpy.ndarray, in which the values have been converted to UTC and the time zone discarded.

Time zone information can be preserved with dtype=object:

In [14]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [15]: ser.to_numpy(dtype=object)
Out[15]: 
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

Or discarded with dtype='datetime64[ns]':

In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]: 
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the raw data out of a DataFrame is a bit more complex. If all of the columns share a single data type, DataFrame.to_numpy() returns the underlying data:

In [17]: df.to_numpy()
Out[17]: 
array([[-0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949],
       [ 1.0718,  0.7216, -0.7068],
       [-1.0396,  0.2719, -0.425 ],
       [ 0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784],
       [ 0.525 ,  0.4047,  0.577 ],
       [-1.715 , -1.0393, -0.3706]])

If the DataFrame is homogeneous, pandas can modify the original ndarray in place, and the changes are reflected in the data structure. For heterogeneous data, i.e. a DataFrame with columns of different dtypes, this does not hold. Note: unlike with axis labels, assigning to the values attribute is not supported.

::: tip

When extracting heterogeneous data, the dtype of the resulting ndarray is chosen to accommodate all of the data involved. If the DataFrame contains any strings, the resulting dtype is object; if the values are only floats and integers, the resulting dtype is float.

:::

Previously, pandas recommended Series.values for extracting data from a Series or DataFrame. Older libraries and tutorials still use this approach, but pandas has since improved on it and now recommends .array or .to_numpy() instead of .values, which has the following drawbacks:

  1. When a Series contains an extension type, it is unclear whether .values returns a NumPy array or an ExtensionArray. Series.array always returns an ExtensionArray without copying data, while Series.to_numpy() always returns a NumPy array, at the cost of copying and coercing the values.

  2. When a DataFrame contains multiple data types, DataFrame.values may copy the data and coerce all values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view of the same data in the DataFrame.
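The upcasting rule from the tip above is easy to check; a minimal sketch (the frames and column names here are made up for illustration):

```python
import pandas as pd

# int and float columns: the result is upcast to float64
mixed_num = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})
print(mixed_num.to_numpy().dtype)  # float64

# adding a string column forces the widest common dtype: object
mixed_str = pd.DataFrame({'i': [1, 2], 's': ['a', 'b']})
print(mixed_str.to_numpy().dtype)  # object
```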

Accelerated operations

With the numexpr and bottleneck support libraries, pandas can accelerate certain types of binary numeric and boolean operations.

These libraries are especially useful when working with large data sets, where the speed-up can be significant. numexpr uses smart chunking, caching, and multiple cores; bottleneck is a set of specialized Cython routines that are exceptionally fast on arrays containing NaN values.

Here is a sample comparison (a DataFrame of 100 columns x 100,000 rows):

Operation   0.11.0 (ms)   Prior version (ms)   Ratio to prior
df1 > df2   13.32         125.35               0.1063
df1 * df2   21.71         36.63                0.5928
df1 + df2   22.04         36.50                0.6039

It is highly recommended that you install both support libraries, see Recommended Support Libraries for more information.

Both libraries are enabled by default and can be toggled with the following options:

New in version 0.20.0

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)

Binary operations

There are two key points of interest when performing binary operations between pandas data structures:

  • Broadcast mechanism between multi-dimensional (DataFrame) and low-dimensional (Series) objects;
  • Missing value handling in calculations.

The two issues can be handled at the same time, but below they are demonstrated separately.

Match/broadcast mechanism

DataFrame supports the add(), sub(), mul(), div() methods and the related radd(), rsub(), etc. for binary operations. For broadcasting behavior, the Series input is of primary interest. These functions can match on either the index or the columns via the axis keyword:

In [18]: df = pd.DataFrame({
   ....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....: 

In [19]: df
Out[19]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df['two']

In [22]: df.sub(row, axis='columns')
Out[22]: 
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)
Out[23]: 
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [24]: df.sub(column, axis='index')
Out[24]: 
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In [25]: df.sub(column, axis=0)
Out[25]: 
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

A Series can also be aligned with one level of a MultiIndexed DataFrame:

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....:                                         (1, 'c'), (2, 'a')],
   ....:                                        names=['first', 'second'])
   ....: 

In [28]: dfmi.sub(column, axis=0, level='second')
Out[28]: 
                   one       two     three
first second                              
1     a      -0.377535  0.000000       NaN
      b      -1.569069  0.000000 -1.962513
      c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688

Series and Index also support the divmod() builtin, which performs floor division and modulo at the same time, returning a two-tuple of the same type as the left-hand operand. For example:

In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [33]: rem
Out[33]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

divmod() also supports element-wise divisors:

In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [40]: div
Out[40]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In [41]: rem
Out[41]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64

Missing values and fill-value operations

The arithmetic functions of Series and DataFrame accept a fill_value option: a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrames, the result is NaN only where the value is missing in both DataFrames at the same location; if a value is missing in just one of them, fill_value specifies the value to use in its place. (NaNs in a result can also be replaced afterwards with fillna.)

The following code creates df2, modeled on df but with the 'three' column also covering index 'a':

In [41]: df2 = pd.DataFrame({
   ....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:     'three': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])})
   ....: 
In [42]: df
Out[42]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [43]: df2
Out[43]: 
        one       two     three
a  1.394981  1.772517  1.000000
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [44]: df + df2
Out[44]: 
        one       two     three
a  2.789963  3.545034       NaN
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343

In [45]: df.add(df2, fill_value=0)
Out[45]: 
        one       two     three
a  2.789963  3.545034  1.000000
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343
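The fillna alternative mentioned above replaces NaN after the fact rather than during the operation; a small sketch with made-up frames:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})
b = pd.DataFrame({'x': [np.nan, 2.0]})

# each location is missing in one of the operands, so every sum is NaN
total = a + b

# fillna then substitutes a chosen value for the remaining NaNs
print(total.fillna(0.0))
```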

Comparison operations

Similar to the arithmetic operations in the previous section, Series and DataFrame also support the binary comparison methods eq, ne, lt, gt, le, and ge:

No.  Method  Meaning
1    eq      equal to
2    ne      not equal to
3    lt      less than
4    gt      greater than
5    le      less than or equal to
6    ge      greater than or equal to
In [46]: df.gt(df2)
Out[46]: 
     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [47]: df2.ne(df)
Out[47]: 
     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False

These operations produce a pandas object of the same type as the left-hand input, with dtype bool. Such boolean objects can be used in indexing operations; see the Boolean indexing section.
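As a minimal sketch of such boolean indexing (random data, hypothetical column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['one', 'two'],
                  index=list('abcd'))

mask = df['one'] > 0   # a boolean Series (dtype: bool)
print(df[mask])        # keeps only the rows where 'one' is positive
```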

Boolean reductions

The reductions empty, any(), all(), and bool() summarize data to a single boolean value.

In [48]: (df > 0).all()
Out[48]: 
one      False
two       True
three    False
dtype: bool

In [49]: (df > 0).any()
Out[49]: 
one      True
two      True
three    True
dtype: bool

You can further reduce the above result to a single Boolean value.

In [50]: (df > 0).any().any()
Out[50]: True

The empty property checks whether a pandas object is empty:

In [51]: df.empty
Out[51]: False

In [52]: pd.DataFrame(columns=list('ABC')).empty
Out[52]: True

Use the bool() method to evaluate a single-element pandas object as a boolean:

In [53]: pd.Series([True]).bool()
Out[53]: True

In [54]: pd.Series([False]).bool()
Out[54]: False

In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True

In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False

::: danger Warning

The following code:

>>> if df:
...     pass

or

>>> df and df2

Both snippets above try to evaluate multiple values as a single boolean, so each raises an error:

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

:::

See the gotchas section for details.

Comparing objects for equivalence

There is often more than one way to compute the same result. Take df + df and df * 2 as an example. Applying what was covered in the previous section, one might test whether the two expressions give the same result with (df + df == df * 2).all(), but this expression returns False:

In [57]: df + df == df * 2
Out[57]: 
     one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True

In [58]: (df + df == df * 2).all()
Out[58]: 
one      False
two       True
three    False
dtype: bool

The boolean DataFrame df + df == df * 2 contains some False values because NaN values do not compare as equal:

In [59]: np.nan == np.nan
Out[59]: False

To check equivalence, NDFrames such as Series and DataFrame provide the equals() method, which treats NaNs in corresponding locations as equal:

In [60]: (df + df).equals(df * 2)
Out[60]: True

Note: the Series or DataFrame indexes must be in the same order for equals() to return True:

In [61]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})

In [62]: df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [63]: df1.equals(df2)
Out[63]: False

In [64]: df1.equals(df2.sort_index())
Out[64]: True

Comparing array-like objects

Element-wise comparison between a pandas data structure and a scalar value is straightforward:

In [65]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[65]: 
0     True
1    False
2    False
dtype: bool

In [66]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[66]: array([ True, False, False])

pandas also compares the elements of two array-like objects of the same length:

In [67]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[67]: 
0     True
1     True
2    False
dtype: bool

In [68]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[68]: 
0     True
1     True
2    False
dtype: bool

ValueError is raised when comparing Index or Series objects of unequal length:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note: This operation differs from Numpy’s broadcast mechanism:

In [69]: np.array([1, 2, 3]) == np.array([2])
Out[69]: array([False,  True, False])

When NumPy cannot broadcast the comparison, it returns False:

In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False

Merge overlapping datasets

A problem that occasionally arises is combining two similar data sets where values in one are preferred over the other. An example is two series representing a particular economic indicator, one of "higher quality" and one of "lower quality". Often the lower-quality series extends further back in history or covers more data. Therefore, to combine the two DataFrame objects, missing values in one DataFrame are conditionally filled with like-labeled values from the other. The function implementing this is combine_first(), shown below:

In [71]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
   ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
   ....: 

In [72]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....: 

In [73]: df1
Out[73]: 
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [74]: df2
Out[74]: 
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [75]: df1.combine_first(df2)
Out[75]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

A general method for combining DataFrames

The combine_first() method above calls the more general DataFrame.combine() method. combine() takes another DataFrame and a combiner function, aligns it with the calling DataFrame, and then passes paired Series (i.e., columns with the same name) to the combiner function.

The following code reproduces combine_first() from above:

In [76]: def combiner(x, y):
   ....:     return np.where(pd.isna(x), y, x)
   ....: 
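To see the equivalence, the combiner can be passed to DataFrame.combine(); this sketch rebuilds the df1/df2 from earlier and checks the result against combine_first():

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                    'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                    'B': [np.nan, np.nan, 3., 4., 6., 8.]})

def combiner(x, y):
    # take x where it is present, otherwise fall back to y
    return np.where(pd.isna(x), y, x)

# element-wise combination, equivalent to df1.combine_first(df2)
result = df1.combine(df2, combiner)
print(result)
```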