
Sample data:

import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'], 'b': [2, 2, 6, 8, 10]})
df

Check whether a single column has duplicate values

  1. Use value_counts() to count how many times each value occurs in a column. By default the result is sorted by count in descending order, so you only need to look at the first row: if its count is 1 there are no duplicates, and if it is greater than 1 the column contains duplicate values.
df['a'].value_counts()

  2. Use drop_duplicates() to delete duplicate values, keeping only the first occurrence, then check whether the result is equal to the original df. If the comparison returns False, the column contains duplicate values.
df.equals(df.drop_duplicates(subset=['a'], keep='first'))

False
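
The two single-column checks above can be wrapped up as small helpers. This is a minimal sketch that rebuilds the sample df from the start of the article; the function names has_dupes_by_counts and has_dupes_by_drop are made up for illustration and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})


def has_dupes_by_counts(frame, column):
    # Hypothetical helper: value_counts() sorts by count in descending
    # order, so the first entry holds the largest count in the column.
    counts = frame[column].value_counts()
    return (not counts.empty) and bool(counts.iloc[0] > 1)


def has_dupes_by_drop(frame, column):
    # Hypothetical helper: if dropping duplicates changes the frame,
    # the column must have contained duplicate values.
    deduped = frame.drop_duplicates(subset=[column], keep='first')
    return not frame.equals(deduped)


print(has_dupes_by_counts(df, 'a'))  # True: 'Python' and 'Java' repeat
print(has_dupes_by_drop(df, 'a'))    # True
```

Both helpers return a plain bool, so either one can be used directly in an if statement.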


Check whether any rows are duplicated across all columns

drop_duplicates() deletes duplicate rows and keeps only the first occurrence. When subset is not specified, all columns are compared by default.

df.equals(df.drop_duplicates(keep='first'))

False
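
As a cross-check on the whole-row comparison above, the same question can be asked with DataFrame.duplicated(), which returns a boolean Series marking each row that repeats an earlier one. This is an alternative to the equals() approach used in this article, shown here only as a sketch on the sample df.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

# The article's approach: the frame changes when duplicate rows are dropped
changed = not df.equals(df.drop_duplicates(keep='first'))

# Alternative: duplicated() flags every row that repeats an earlier row
flagged = bool(df.duplicated(keep='first').any())

print(changed, flagged)  # True True: rows 0 and 1 are both ('Python', 2)
```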


Count the number of duplicate rows

len(df) - len(df.drop_duplicates(keep="first"))
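
On the sample df the subtraction above works out to 5 - 4 = 1. The sketch below spells out the intermediate values and, as an alternative that is not used elsewhere in this article, compares the result with summing the boolean mask returned by duplicated().

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

n_total = len(df)                                 # 5 rows in total
n_unique = len(df.drop_duplicates(keep='first'))  # 4 distinct rows
n_dupes = n_total - n_unique                      # 1 extra copy

# Alternative: each True in the mask is one row that repeats an earlier row
n_dupes_alt = int(df.duplicated(keep='first').sum())

print(n_dupes, n_dupes_alt)  # 1 1
```

Note that this counts the extra copies only: a row that appears three times contributes 2 to the total.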


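To list the duplicate rows, first delete the duplicates while keeping the first occurrence of each row, which gives a data set of unique rows. Then use drop_duplicates(keep=False) to delete every row of df that has a duplicate, including its first occurrence, and combine the two results. Rows that were never duplicated now appear twice in the combined data set, while rows that were duplicated appear once. Finally, apply drop_duplicates(keep=False) to the combined data set; the rows that remain are the ones that were duplicated in df.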

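# DataFrame.append() was removed in pandas 2.0, so pd.concat() is used to combine the two results
pd.concat([df.drop_duplicates(keep="first"), df.drop_duplicates(keep=False)]).drop_duplicates(keep=False)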
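
The steps described above can be broken out one at a time to see what each intermediate result contains. This is a sketch on the sample df; it combines the two frames with pd.concat(), and the variable names are chosen here for illustration. The last line shows boolean indexing with duplicated(keep=False), an alternative not used in this article that returns every copy of a duplicated row rather than one copy of each.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

# Step 1: one copy of every distinct row (index 0, 2, 3, 4)
unique_rows = df.drop_duplicates(keep='first')

# Step 2: only the rows that never repeat (index 2, 3, 4)
never_repeated = df.drop_duplicates(keep=False)

# Step 3: stack both; never-repeated rows now appear twice
combined = pd.concat([unique_rows, never_repeated])

# Step 4: drop everything that appears twice, leaving the duplicated rows
duplicate_rows = combined.drop_duplicates(keep=False)
print(duplicate_rows)  # a single row: a='Python', b=2, at index 0

# Alternative: keep every copy of each duplicated row (index 0 and 1)
all_copies = df[df.duplicated(keep=False)]
```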
