Hello, I’m Genius.

Recently I encountered a requirement when processing data, the core of which was to find the maximum number of consecutive days of behavior. Here, we use Beijing air quality data as a case for demonstration. The requirement is to find out the longest duration of continuous air quality pollution in Beijing and determine its cycle.

From the data in the figure above, we can see that the air quality pollution lasted for 6 days from January 21 to January 26.

However, in the actual data processing, our original data will often be larger, not necessarily directly visible. Next, we introduce several solutions for your reference.

1. Obtain case data

You can reply 0427 directly in the background to obtain the case data, but also through the following ways to obtain the case data.

import akshare as ak

air_quality_hist_df = ak.air_quality_hist(city="Beijing", period="day", start_date="2021-01-01", end_date="2021-04-26")
air_quality_hist_df.head()
Copy the code

Since we only need to use aqi and grade the quality and pollution according to international standards, we simply do the data processing as follows :(the data obtained by the background directly replies to 0427 is the processed data.)

import pandas as pd

# Reset index
aqi = air_quality_hist_df['aqi'].reset_index()
Change the aQI column to int
aqi.aqi = aqi.aqi.astype('int')
# Air quality grading using a subunit
aqi['Air Quality'] = pd.cut(aqi.aqi,
                        bins=[0.100.500],
                        labels=['good'.'pollution'])
Take 10 samples to preview
aqi.sample(10)
Copy the code

2. Find the number of days of continuous pollution

In combination with the case of “” last time, here we provide a new way to solve the problem, a total of two ways to solve the problem.

The following solution comes from Xiao Ming ge and CAI Ge

2.1. Idea 1: Calculate the difference by time sorting and then count by group

Only brother last time the solution is this idea, look back at the original code is relatively immature, today we look at xiao Ming brother’s solution, very wonderful.

Step 1: Screen air quality for pollution data

t = aqi.query('Air quality ==" pollution "')
t.sample(5)
Copy the code

Step 2: Add auxiliary columns (auxiliary columns may not be added to the original datatOn)

The logic here is roughly as follows:

  • The auxiliary ranking column (sorted in chronological order) is the number of days between
  • Then use the time field (time) and the interval days to calculate the difference to a date
  • If these dates are the same, then these days are continuous polluted days
groupids = pd.to_datetime(aqi.time)-pd.to_timedelta(aqi.time.rank(),unit='d')
groupids.sample(5)
Copy the code

Step 3: Count groups to obtain the consecutive days, and calculate the minimum and maximum values of groups to obtain the start and end dates of continuous pollution

t.groupby(groupids).agg({
    'time': lambda x:f'{x.min()}~{x.max()}'.# Ask for start and end dates
    'Air Quality':"count".# Find consecutive days
}).nlargest(5.'Air Quality') # Take the top 5 groups of data with the largest air quality field
Copy the code

The complete code is as follows:

t = aqi.query('Air quality ==" pollution "')
t.groupby(
    pd.to_datetime(t.time)-pd.to_timedelta(t.time.rank(),unit='d')
         ).agg(
    {
    'time': lambda x:f'{x.min()}~{x.max()}'.'Air Quality':"count",
    }
).nlargest(5.'Air Quality')
Copy the code

2.2. Idea 2: Compare the air quality markers of two adjacent days

There are two ways to do this. One is to create helper columns using a loop, and the other is to create helper columns using shift and cumsum.

Solution 1: Create helper columns using loops

  • Create a helper column. The value of the helper column is obtained as follows
  • If the air quality is good, the auxiliary column value is +1; If the current air quality is different from the previous day, the auxiliary column value is also +1
  • If none of the above is met, the auxiliary column value remains unchanged
last = None
num = 0
groupids = []
for v inAir quality. Values:ifv ! = lastorv ! ='pollution':
        num += 1
    groupids.append(num)
    last = v
Copy the code

Using this logic, we can obtain the auxiliary column data as follows:

We can see that the number of consecutive polluted days is obtained by grouping the count by the auxiliary columns, as shown in the red area.

aqi.groupby(groupids).agg(
    {
    'time': lambda x:f'{x.min()}~{x.max()}'.'Air Quality':"count",
    }
).nlargest(5.'Air Quality')
Copy the code

Solution 2: Create helper columns using shift and cumsum

  • Create the shift column for air quality and move it one bit
  • If the shift column and the air quality column are equal, the column is judged to be 0, otherwise it is 1
  • Auxiliary column judgment column sum

We can also find that continuous days of air quality (both good and polluted) can be obtained by grouping the counts according to the auxiliary columns, as shown in the red area.

(
    aqi.query('Air quality ==" pollution "') # Here's the filter for polluted weather.groupby((aqi. Air quality. Shift ()! = aq.air quality).cumsum())# auxiliary column
    .time.agg(['count'.'min'.'max']) # Count and get date range
    .nlargest(5.'count'))Copy the code

According to the output of Xiao Mingge, adjust the code as follows:

aqi.query("Air quality ==' pollution '").groupby((aqi. Air quality! = aq.air quality. Shift ()).cumsum()).agg({'time': lambda x: f"{x.min()}~{x.max()}".'Air Quality': "count"}).nlargest(5.'Air Quality')
Copy the code

Above is all the content of this time, in fact, we may also encounter similar scenes in daily work and life, such as: calculate the user continuous login days, calculate the user continuous payment days, calculate the continuous rain days in the southern plum rain season, and so on!