Dry! Visually explain and visualize each complex DataFrame operation

While most data scientists may commend Pandas for its ability to prepare data, many may not be able to take advantage of all of these capabilities. Manipulating data frames can quickly become a complex task, so instructions, visualizations, code, and tips are provided in each of the eight techniques in Pandas to remember how to do it.

! [](https://static001.geekbang.org/infoq/da/da1d667840a5eef6d6489a818f34824d.png)

Pandas provides a wide variety of DataFrame operations, but many of them are complex and seem unapproachable. This article introduces eight basic DataFrame operations that cover almost everything a data scientist needs to know about operations. Each method will include instructions, visualizations, code, and tips for remembering it.

Pivot

A PivotTable creates a new “PivotTable” that projects existing columns in the data into the elements of the new table, including indexes, columns, and values. The initial DataFrame will be the columns that become the index, and these columns will be displayed as unique values, and the combination of the two columns will be displayed as values. This means that Pivot cannot handle duplicate values.

! [](https://static001.geekbang.org/infoq/10/107a215b464a664caa5b7675785040e9.png)

The code for rotating a DataFrame named _df_ is as follows:

! [](https://static001.geekbang.org/infoq/4c/4c0903573c257ea487429db33068a114.png)

Remember: Pivot — is a Pivot around something outside the realm of data processing. In sports, people can “spin” around their feet: the giant panda’s spin is similar to. The state of the original DataFrame rotates around the central element of the DataFrame to a new element. Some elements are actually being rotated or transformed (for example, column “bar”) and are therefore important.

Melt

Melt can be considered “opaque” because it converts matrix-based data (with two dimensions) to list-based data (columns represent values and rows represent unique data points), while pivots do the opposite. Consider a two-dimensional matrix with one dimension of “B” and “C” (column names) and another dimension of “A”, “B” and “C” (row indexes).

We select an ID, a dimension, and a column/column containing values. Columns containing values are converted to two columns: one for the variable (the name of the value column) and one for the value (the number contained in the variable).

! [](https://static001.geekbang.org/infoq/41/41b1a37529a694f137e389c4ade2768a.png)

The result is each combination of the ID column values (A, B, c) and value columns (b, C) and their corresponding values, organized in a list format.

Mels can be performed as on DataFrame _df_ :

! [](https://static001.geekbang.org/infoq/c9/c987325fabdf73795fbbc7a2e34b0610.png)

Remember: To Melt like a candle is to break a solidified composite into smaller individual elements (wax droplets). Fusing a two-dimensional DataFrame decompress its solidified structure and record its fragments as items in a list.

Explode

Is a useful way to get rid of lists of data. When a column explodes, all the lists in it will be under the same index as the new column (to prevent this, just call.reset_index () after that). Non-list items such as strings or numbers are not affected, and empty lists are NaN values (you can clear them using.dropna ()).

! [](https://static001.geekbang.org/infoq/ba/ba17ac3a1f9a4b74d9c1868ddaa7ebf5.png)

DataFrame _df _Explode is very simple:

! [](https://static001.geekbang.org/infoq/ef/efa30b5aa4d870919888aba2a612de73.png)

Remember: Explode releases all of its internal content -Explode lists separate its elements.

Stack

Stack takes dataframes of any size and “stacks” columns into subindexes of existing indexes. Thus, the resulting DataFrame has only one column and two level indexes.

! [](https://static001.geekbang.org/infoq/e1/e1830f54ac7b78cbd327570f324eb323.png)

Stacking a table named _df_ is as simple as _df.stack () _.

To access the height value of the dog, an index-based retrieval, such as df.loc [‘dog’], is called only twice. Loc [‘ height ‘].

Remember: in appearance, _ stack_ takes on the two-dimensional nature of a table and stacks columns into multilevel indexes.

Unstack

Unstacking takes multiple index Dataframes and stacks them, converting the index of the specified level to a column of the new DataFrame with the corresponding value. Calling the stack after calling it on the table does not change the stack (due to the presence of “0”).

! [](https://static001.geekbang.org/infoq/f2/f275946ef0ab3f7d9962f1cf09727208.png)

The parameter in the stack is its level. In the list index, an index of -1 returns the last element. This is the same as the level. Level -1 indicates that the last index level (the rightmost one) will be unstacked. As another example, when the level is set to 0 (the first index level), the values will become columns, and the subsequent index level (the second index level) will become the index of the transformed DataFrame.

! [](https://static001.geekbang.org/infoq/f8/f826aa0d40da7e4d48a37801f914cb59.png)

The stack can be performed in the same manner as the stack, but with the level argument: df.unstack (level = -1).

Merge

Merging two Dataframes combines them in columns (horizontally) between shared “keys.” This key allows tables to be merged, even if they are not sorted in the same way. The completed merge DataFrame by default adds the suffixes __x_ and __y_ to the value column.

! [](https://static001.geekbang.org/infoq/5c/5c08b4b5ed40192653ae58652e796a05.png)

To merge two dataframes df1 and DF2 (where DF1 contains leftkey and DF2 contains rightkey), call:

! [](https://static001.geekbang.org/infoq/e4/e4a3a1b7a98a6c1bac156fdef516ece5.png)

Merging is not a feature of pandas, but is attached to the DataFrame. Always assume that the DataFrame on which the merge is based is the “left table” and that the DataFrame called as an argument in the function is the “right table” with the corresponding key.

By default, the merge function performs an internal join: if the key name of each DataFrame is not listed in another key, that key is not included in the merged DataFrame. On the other hand, if a key is listed twice in the same DataFrame, each value combination for the same key will be listed in the merge table. For example, if DF1 has three key _foo_ values and DF2 has two values of the same key, there will be six entries in the final DataFrame, where leftkey = foo and rightkey = foo.

! [](https://static001.geekbang.org/infoq/2f/2f066dc326181130934736c9aa4f53a3.png)

Remember: Merging data frames is like merging lanes when driving horizontally. Imagine that each column is a lane on a highway. In order to merge, they must merge horizontally.

Join

In general, a join is preferable to a merge because it has a cleaner syntax and a greater likelihood of joining two Dataframes horizontally. The syntax for the join is as follows:

! [](https://static001.geekbang.org/infoq/9a/9adb0fc8a6eaef21f80b31df41c11216.png)

When using a join, the common key columns (like _right_on_ and left_ON in the merge) must be named the same. The how argument is a string that represents one of four _ join _ methods that can combine two dataframes:

‘left’ : includes all _ elements of _df1, including _ elements of _df2 only if the key is _df1_. Otherwise, the missing portion of the merged DataFrame of _DF2_ will be marked as NaN.
‘Right’ : ‘left’, but on another DataFrame. All _ elements including _df2, including _ elements of _df1 only if their key is a key of _DF2_.
Outer: includes all elements from DataFrames, even if the key does not exist elsewhere – missing elements are marked as NaN’s.
“Inner” : the key containing only the element exists in the two data frame keys (intersection). Default merge.

Remember: If you have used SQL, the word “join” should immediately be associated with add by column. If not, then “join” and “Merge” have very similar meanings in terms of definition.

Concat

While merges and joins work horizontally, concatenation or concat for short, dataframes are joined in rows (vertically). For example, consider using _pandas. Concat ([dF1, df2]) _ concatenation of two dataframes dF1 and DF2 with the same column name:

! [](https://static001.geekbang.org/infoq/f7/f732a03b257a496e28756a4cae4ddc9c.png)

Although a concat determinant join can be used by setting the axis parameter to _1_, it is easier to use a join.

Note that concat is the pandas function, not one of the dataframes. Therefore, it accepts a list of Dataframes to connect to.

If another column of a DataFrame is not included, that column is included by default, with the missing value listed as NaN. To prevent this, add an additional parameter _join =’inner’_ that concatenates only columns shared by two dataframes.

! [](https://static001.geekbang.org/infoq/01/0143c30bc68cb5c4f571bd8cb117f8b1.png)

Remember: You can concatenate other items in lists and strings. Concatenation is the appending of additional elements to an existing body rather than adding new information (like a column by column join). Since each index/row is a separate item, concatenation adds other items to the DataFrame, which can be thought of as a list of rows.

Append is another way to combine two dataframes, but it performs the same functionality as concat, is less efficient and versatile.

If you like, welcome to like and forward!

Past wonderful links:

Github file download is accelerated online

Advanced Python Development (open source in PDF)

Don’t leave after you see the surprise!

I carefully organized 2TB video courses and books related to computer /Python/ machine learning/deep learning, worth 1W yuan. Follow the wechat official account “Computer and AI” and click the menu below to get the web disk link.

! [](https://static001.geekbang.org/infoq/90/903237ffd0a3b3ae06272386f26ecb9e.png)

Dry! Visually explain and visualize each complex DataFrame operation

Explode

Related Posts

Docker Development Tutorial (a) Installation and Testing (Windows Home Edition)

Six lines of code! Convert PDF to Word in Python

Parameter learning and gradient descent