The table used in this article is as follows:

Let's first look at the raw data:

    import pandas as pd

    # Read the Excel file into a DataFrame and print it
    df = pd.read_excel(r'C:\Users\admin\Desktop\test.xlsx')
    print(df)

result:

                category                                     goods  sales  price
    0           Fruit                                     Apple     34     12
    1  home appliance                                        TV     56   3498
    2  home appliance                              refrigerator     78   2578
    3           Books  Python from getting started to giving up     25     78
    4           Fruit                                     grape    789      7

1. Getting unique values

There are two common ways to get unique values. The first is to remove duplicates, as covered in the earlier article on handling duplicate values in pandas. The second is to call the unique() method:

    import pandas as pd

    df = pd.read_excel(r'C:\Users\admin\Desktop\test.xlsx')
    # unique() returns the distinct values of the column in order of first appearance
    print(df['category'].unique())

result:

    ['Fruit' 'home appliance' 'Books']
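The two approaches can be compared side by side. This is a minimal sketch under one assumption: the `category` column is rebuilt by hand from the printed table instead of being read from test.xlsx.

```python
import pandas as pd

# Hand-built stand-in for the 'category' column of the table (an assumption,
# so the example runs without the Excel file).
df = pd.DataFrame({
    "category": ["Fruit", "home appliance", "home appliance", "Books", "Fruit"],
})

# Approach 1: drop duplicates. The result is a Series that keeps the index
# label of the first occurrence of each value.
deduped = df["category"].drop_duplicates()

# Approach 2: unique(). The result is an array of the distinct values
# in order of first appearance, with no index attached.
uniques = df["category"].unique()

print(deduped)
print(list(uniques))
```

drop_duplicates() is the better fit when the original row positions still matter; unique() is simpler when only the values are needed.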

2. Interval segmentation

Interval segmentation divides a set of numeric values into intervals. For example, splitting 10 people into three groups by age is interval segmentation.
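The age example can be sketched as follows. The ages, the interval edges and the group names are all made up for illustration, and the sketch uses `pd.cut`, which section 2.1 describes.

```python
import pandas as pd

# Hypothetical ages of 10 people (illustration data, not from the article's table).
ages = pd.Series([5, 12, 17, 23, 31, 38, 45, 52, 67, 80])

# Three age groups: (0, 18], (18, 60] and (60, 100].
# The edges and the labels are assumptions chosen for this example.
groups = pd.cut(ages, bins=[0, 18, 60, 100], labels=["minor", "adult", "senior"])

# Show each age next to the group it was assigned to.
print(pd.DataFrame({"age": ages, "group": groups}))
```

With these made-up ages, the three groups end up with 3, 5 and 2 people.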

2.1 The cut method

The cut method takes a bins parameter that specifies the edges of the intervals:

    import pandas as pd

    df = pd.read_excel(r'C:\Users\admin\Desktop\test.xlsx')
    # Four edges define three intervals: (7, 12], (12, 78] and (78, 2578]
    print(pd.cut(df['price'], bins=[7, 12, 78, 2578]))

result:

    0       (7.0, 12.0]
    1               NaN
    2    (78.0, 2578.0]
    3      (12.0, 78.0]
    4               NaN
    Name: price, dtype: category
    Categories (3, interval[int64, right]): [(7, 12] < (12, 78] < (78, 2578]]

Let's look at the result. The four values passed to bins define three intervals, (7, 12], (12, 78] and (78, 2578], each open on the left and closed on the right. Rows 1 and 4 are NaN because their prices fall outside every interval: 3498 is greater than the last edge, and 7 sits exactly on the open left edge of the first interval.
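How values on the edges are treated can be adjusted with two other parameters of `pd.cut`, `include_lowest` and `right`. A minimal sketch, assuming the prices are typed in by hand from the table so that no Excel file is needed:

```python
import pandas as pd

# Prices copied by hand from the article's table (row order preserved).
price = pd.Series([12, 3498, 2578, 78, 7], name="price")
bins = [7, 12, 78, 2578]

# Default: intervals are open on the left and closed on the right, so
# 7 (on the open left edge) and 3498 (past the last edge) become NaN.
default = pd.cut(price, bins=bins)

# include_lowest=True closes the first interval on the left, so 7 is binned.
with_lowest = pd.cut(price, bins=bins, include_lowest=True)

# right=False makes every interval closed on the left and open on the right:
# [7, 12), [12, 78), [78, 2578). Now 2578 falls outside instead of 7.
left_closed = pd.cut(price, bins=bins, right=False)

print(pd.DataFrame({
    "price": price,
    "default": default,
    "include_lowest": with_lowest,
    "right=False": left_closed,
}))
```

Values beyond the outermost edges, such as 3498 here, stay NaN in all three variants; covering them requires widening the bins themselves.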

2.2 The qcut method

The qcut method is similar to cut, but instead of specifying the interval edges in advance, you specify only the number of groups. pandas then chooses the edges from the quantiles of the data, so that each group contains as nearly the same number of values as possible.

    import pandas as pd

    df = pd.read_excel(r'C:\Users\admin\Desktop\test.xlsx')
    # Split the price column into 3 groups of roughly equal size
    print(pd.qcut(df['price'], 3))

result:

    0         (6.999, 34.0]
    1    (1744.667, 3498.0]
    2    (1744.667, 3498.0]
    3      (34.0, 1744.667]
    4         (6.999, 34.0]
    Name: price, dtype: category
    Categories (3, interval[float64, right]): [(6.999, 34.0] < (34.0, 1744.667] < (1744.667, 3498.0]]
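To see where edges such as 34.0 and 1744.667 come from, they can be compared with the quantiles of the column. A minimal sketch, again assuming the prices are typed in by hand from the table:

```python
import pandas as pd

# Prices copied by hand from the article's table.
price = pd.Series([12, 3498, 2578, 78, 7], name="price")

# retbins=True makes qcut return the computed edges along with the result.
binned, edges = pd.qcut(price, 3, retbins=True)

# The two inner edges are the 1/3 and 2/3 quantiles of the data.
quantiles = price.quantile([1 / 3, 2 / 3])

print(edges)
print(quantiles)

# Number of rows that landed in each interval.
print(binned.value_counts().sort_index())
```

With five rows and three groups the split cannot be perfectly even, so the groups hold 2, 1 and 2 rows.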

Note: when the data is evenly distributed, the intervals produced by cut with equal-width bins and by qcut are basically the same. When the data is unevenly distributed (as with the prices here, where a few values are far larger than the rest), the intervals from the two methods can differ greatly.
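The note can be checked with a small experiment. This sketch assumes that "cut" means `pd.cut` with an integer `bins`, which splits the value range into equal-width intervals; the evenly spread series is made up, and `bin_counts` is a hypothetical helper defined only for this example.

```python
import pandas as pd


def bin_counts(binned):
    """Return the number of rows per interval, listed in interval order."""
    return binned.value_counts().sort_index().tolist()


# Evenly spread values (made up) and the skewed prices from the article's table.
even = pd.Series(range(1, 10))
skewed = pd.Series([12, 3498, 2578, 78, 7])

for name, data in [("even", even), ("skewed", skewed)]:
    # cut with an integer bins: three intervals of equal width.
    equal_width = bin_counts(pd.cut(data, bins=3))
    # qcut: three intervals holding roughly equal numbers of rows.
    equal_count = bin_counts(pd.qcut(data, 3))
    print(name, "cut:", equal_width, "qcut:", equal_count)
```

For the evenly spread values both methods put three rows in every interval. For the skewed prices, the equal-width split leaves its middle interval empty, while qcut still spreads the rows across all three intervals.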