Public account: Youerhuts author: Peter editor: Peter

Hello, I’m Peter

In a real business situation, our data might exist in different library tables. In many cases, we need to carry out multi-table join query to achieve data extraction, through SQL join, such as left Join, left Join, inner Join and so on to achieve.

Pandas also has merge functions, such as concat, Append, Join, and merge. This article focuses on the merge function, which is the most important data merge function in Pandas.

Would you give up SQL?

Pandas serialized articles

There have been 13 new articles in the Pandas series, all in case + graphic style. There are many personal recommended articles:

parameter

Learn from pandas.pydata.org/pandas-docs…

pd.merge(left,   # Two data boxes to be merged
         right, 
         how='inner'.# 'left', 'right', 'outer', 'inner', 'cross'
         on=None.The default is the same key
         left_on=None.# Specify different connection fields: the key is different, but the value of the key has the same content
         right_on=None, 
         left_index=False.# Join by index
         right_index=False, 
         sort=False.# Sort or not
         suffixes=('_x'.'_y'),   # Change the suffix
         copy=True, 
         indicator=False.# Display field source
         validate=None)

Copy the code

The specific explanation of parameters is as follows:

  • Left and Right: Data frames to be merged

  • {‘ left ‘, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default: ‘inner’

    1, left: left connection, retain all data left; Right is similar; Analogous to SQL left Join or right Join

    SQL = full outer join SQL = full outer join SQL = full outer join SQL

    3. Inner: cross join, similar to SQL inner Join

    Cross: Creates the Cartesian product of two DataFrame frames, keeping the left order by default

  • On: column property of the join; The default is the same field for both DataFrame

  • Left_on/right_ON: Specifies two different keys to join

  • Left_index and right_index: Merge by index

  • Suffixes: Specify our own desired suffix

  • Indictor: Shows where fields come from

Simulated data

We created four DataFrame boxes; Df1, df2 and df3 have the same key userID. Df4 has a similar key, userid1, and the value is also AC, which has the same parts as the userID value of DF1 or DF2.

import pandas as pd
import numpy as np
Copy the code

Parameters left and right

Left and how are two data frames that need to be connected.

  • Pd. merge(left,right), personal habit
  • left.merge(right)

The graphical process is as follows:

  • The two data enclosures DF1 (left) and DF2 (right) have the same userID
  • The default is to associate by the same field (key), fetch the same value in the key (AC), and show all records for each key, such as a with multiple records

How parameters

inner

Inner is called inner join. It will directly associate with the same column attribute userID and fetch the same data information a and C under the attribute

⚠️ The above diagram uses how=”inner” by default

outer

It’s called outer join, and in the process of concatenation it takes the union of the keys in the two data boxes

  • To fetch the union of all intersection keys. In this case, the union of users
  • NaN if a key does not have data in a data box

The graphical process is as follows:

  • The join is based on the same field: userID
  • Keep all the data on both sides, so abcDE exists
  • If an edge does not have a value under the key, the result is supplemented with a NaN. For example, if b exists in the userID of DF1 but not in DF3, then score corresponding to b is NaN, similar to CD. If e exists in DF3 but does not exist in DF1, the value of age is NaN

left

Based on the key in the left data box; If the left side exists but the right side does not, then the right side is represented by NaN

The graphical process is as follows:

  • The difference is that there is no e;
  • When how=”left”, only the values below the userID in dF1 (left) are retained, excluding e

right

Based on the value of the key in the right data box; If the right side exists but the left side does not, the left side is represented by NaN

The graphical process is as follows:

  • When how=”right”, only all userids in DF3 (right) are retained
  • Result Only userID a and E of DF3 are retained

cross

Cartesian product: The amount of data n1*n2 occurs when the data in two data boxes are cross-matched

The graphical process of the Cartesian product is as follows:

  • The amount of data that appears is 4*2, and the data under the userID is cross-matched
  • The same field userID in the final result is given the default suffix to avoid confusion_x, _y

Parameters on

If two data boxes to be joined have the same key, the default is to use the same key for the join.

In all of the illustration examples above, the on parameter is joined by default with the same key, so it can sometimes be omitted.

Here’s another example:

You can also swap the left and right positions:

The above two examples are for data boxes that have only one key of the same. What if more than one key is used to join? To illustrate this with an example from the website, let’s create two DataFrame: df5 and DF6

Now merge the two data boxes:

The graphical process of merging is as follows:

  • The fields key1 and key2 of the two connections are specified with the on parameter
  • Only if the values of key1 and key2 in two data boxes are identical (intersection) are retained. For example, key1=K0,key2=K0 and key1=K1,key2=K0.

Looking at an example of how=”outer” joins:

Take a look at the graphic process:

  • Specify the two keys for the connection, key1 and key2
  • Using how=”outer” preserves all data in both data boxes. If there is no key value in a data box, NaN is taken

Parameters left_on and right_on

When joining, both data boxes have the same fields, such as userID or key1 and key2. But how do two data boxes not have the same key, but the values of these keys have the same part, such as our dF1, df3:

At this point, we use left_on and right_on arguments, specifying the key of the connection on both sides:

If we do not specify, the system will report an error because the two data boxes do not have the same key and cannot be connected:

Parameter suffixes

The default suffixes are _x_ and _y if the same fields appear as a result of the connection. This parameter is to change our default suffix. Let’s review the formation of the Cartesian product;

Now we can specify the desired suffix:

indicator

This parameter indicates which DataFrame the generated record came from: both, left_only, and right_only

With arguments, a new field _merge is displayed:

By default, the source is not displayed without an argument.

conclusion

The merge function is very powerful, and is used frequently in work. It is possible to implement join effects in SQL. Hopefully, the diagrams in this article will help you understand the use of the merge function. Pandas also has several merger-related functions, such as Join, concat, and Append, that will be covered in the next article.