12 diagrams: handling duplicate values in Pandas
Pandas provides two functions for handling duplicate values:
- duplicated(): detects whether duplicate values exist
- drop_duplicates(): removes duplicate values
Pandas article series
This article is part of an ongoing Pandas series; you are welcome to read the earlier installments.
Simulated data
Two datasets are simulated in this article:
1. A set of order data, which will be used later
import pandas as pd
import numpy as np
# Import the simulated order data and set it aside for later
df1 = pd.read_excel("Order duplicate value.xlsx")
df1
2. Another simulated dataset:
df2 = pd.DataFrame(np.ones([10, 2]),  # generate 10*2 data filled with 1s
                   columns=["col1", "col2"]
                   )
df2
Two more columns are then added, each filled by sampling randomly from a list:
# Add two columns
list1 = ["a"."b"]
list2 = [2.3]
# Select 10 elements from the list at random
df2["col3"] = np.random.choice(list1,10)
df2["col4"] = np.random.choice(list2,10)
df2
duplicated()
The duplicated() function checks whether there are duplicate values in the data. It marks whether values in a Series, or rows in a DataFrame, are duplicates.
Each row is compared against the records that came before it.
Syntax
For DataFrame type data:
pandas.DataFrame.duplicated(subset=None,keep='first')
Or for Series data:
pandas.Series.duplicated(keep='first')
The three possible values of the keep parameter are:
- 'first': marks duplicates as True, except for the first occurrence
- 'last': marks duplicates as True, except for the last occurrence
- False: marks all duplicates as True
Determining which rows are duplicates
This function tells you which rows are duplicates: duplicates are marked True, everything else False.
The subset parameter
df2.duplicated(subset=["col3"]) Look separately to see if the col3 column is repeated
# the results
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
df2.duplicated(subset=["col1"]) # col1 by itself: all 1, all repeat
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
The two examples above check a single column for duplicates. The following example checks a combination of columns:
df2.duplicated(subset=["col3"."col4"]) # Look at both col3 and COL4
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 True
dtype: bool
The keep parameter
df2.duplicated(subset=["col3"],keep="last")
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False # last occurrence
8 True
9 False # last occurrence
dtype: bool
- With keep="last" (used above), the last occurrence of each value is treated as the original: it is marked False, and every earlier occurrence is compared against it and marked True
- In the next example, keep="first" (the default) treats the first occurrence as the original and compares later rows against it; repeats are marked True
df2.duplicated(subset=["col3"],keep="first") # Default is first
0 False # first occurrence
1 True
2 False # first occurrence
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
df2.duplicated(subset=["col3"],keep=False) # Mark all duplicate values to True
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
drop_duplicates()
This function is used to remove duplicate values from data
Syntax
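For DataFrame data, the signature mirrors duplicated() (ignore_index requires pandas 1.0 or later):
pandas.DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)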
- subset: the column or columns used to identify duplicates; defaults to all columns
- keep: which occurrence to keep after deduplication; the default is the first occurrence
- inplace: whether to return a deduplicated copy or modify the original data in place; this parameter works the same way as in other pandas functions
- ignore_index: whether the result keeps the original index or is relabeled with the natural numbers 0, 1, 2, …, n-1
Use all default parameters
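With every argument left at its default, the call is simply (a minimal sketch; the exact rows dropped depend on the randomly sampled col3/col4 values):
df2.drop_duplicates()  # all columns, keep='first', inplace=False, ignore_index=False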
The result has the following characteristics:
- The index is kept from the original data
- For each set of duplicates, the row retained is the first occurrence
- All columns are used to decide whether a row is a duplicate
- A later row counts as a duplicate only when it is identical to an earlier row across every column being compared
The subset parameter
subset specifies which columns are used when dropping duplicates; two cases are sketched below:
1. Dropping based on a single column
2. Dropping based on multiple columns
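A quick sketch of both, reusing df2 from above:
# 1. Deduplicate based on a single column
df2.drop_duplicates(subset=["col3"])

# 2. Deduplicate based on a combination of columns
df2.drop_duplicates(subset=["col3", "col4"])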
The keep parameter
keep determines which occurrence of each duplicate is retained: the first or the last (see the sketch below)
1. keep="first"
2. keep="last"
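Sketched with df2:
df2.drop_duplicates(subset=["col3"], keep="first")  # keep the first occurrence of each value (default)
df2.drop_duplicates(subset=["col3"], keep="last")   # keep the last occurrence of each value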
Cross-checking with duplicated(subset=["col3"], keep="last") shown earlier: rows 7 and 9 are marked False because they are the last occurrences, and those are exactly the rows that drop_duplicates(..., keep="last") retains.
The ignore_index parameter
This parameter controls whether the result keeps the original index or is renumbered directly from 0:
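For example:
df2.drop_duplicates(subset=["col3"], ignore_index=True)  # the result's index is renumbered 0, 1, 2, ...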
The inplace parameter
With the default value False, drop_duplicates() returns a new, deduplicated DataFrame and leaves the original untouched.
If inplace=True, nothing is returned (the call gives None) because the original data is modified directly: inspect df2 itself to see the result.
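A sketch of the in-place form (note that df2 itself changes after this call):
result = df2.drop_duplicates(subset=["col3"], inplace=True)
print(result)  # None: nothing is returned
df2            # df2 now holds the deduplicated data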
A practical case
At the beginning of the article we imported the order data (df1). The requirements are:
- Each order may have several status records, or just one
- We want to pull out every order whose final status is "passed"
For example, order S1 has 3 status records, two of which are "passed", but we only want the most recent one: 2021-01-06
Solution, step 1: find every order that was approved at some point; only S7 was never approved.
The following code finds the orders that were ultimately approved:
order_pass = df1.query("Status == 'passed'")["Order Number"].unique()
order_pass
Solution, step 2: filter df1 down to just the rows belonging to those passed orders (one approach is sketched below).
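A minimal sketch using isin(), assuming the translated column name "Order Number" from the snippet above:
# Keep only the rows whose order number appears in order_pass
df3 = df1[df1["Order Number"].isin(order_pass)].copy()
df3
The .copy() makes df3 an independent DataFrame, so the in-place deduplication in the next step does not touch df1.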
Solution, step 3: drop the duplicates from df3
df3.drop_duplicates(
    subset="Order Number",   # deduplicate by order number
    keep="last",             # keep the last (most recent) status per order
    inplace=True,            # modify df3 in place
    ignore_index=True        # renumber the index
)
df3 # No S7 in the result