Johnny said that he found the wealth code, and recently made a set of stock algorithms, one of which involves the calculation method of the number of consecutive days of the stock limit, we all know that the stock market is closed on weekends, there is a fault, need a little skill. I asked if it was a time series, and he said I was lying, and I knew I was lying. Ask him how, he won’t tell me, all those years of brotherhood ain’t shit. Do you really think I’m not as smart as him? Huh!

I decided to stay up overnight to find out how to calculate the maximum number of consecutive days in Hive.

The background,

In the service requirements of website platform, the “maximum login days” of the user is a common requirement.

Original data:

u0001 2019-10-10
u0001 2019-10-11
u0001 2019-10-12
u0001 2019-10-14
u0001 2019-10-15
u0001 2019-10-17
u0001 2019-10-18
u0001 2019-10-19
u0001 2019-10-20
u0002 2019-10-20
Copy the code

Note: Data is a simplified version, the two columns are user_id, log_IN_date respectively. The actual situation needs to get the above form of data from the collected data after deduplication.

Create table and import data to Hive:

create table test.user_log_1 (user_id string, log_in_date string) row format delimited fields terminated by ' ';

load data local inpath '/var/lib/hadoop-hdfs/data/user_log.txt' into table test.user_log_1 ;
Copy the code

Take a look at the data:

hive> select * from test.user_log_1 ; OK u0001 2019-10-10 u0001 2019-10-11 u0001 2019-10-12 u0001 2019-10-14 u0001 2019-10-15 u0001 2019-10-17 u0001 2019-10-18 u0001 2019-10-20 u0001 2019-10-20 Time taken: 0.076 seconds, touchback: 10 row(s)Copy the code

Second, the algorithm

The core is to sort by access time, the login time column minus the sequence number after sorting, get a date value, according to this value can be grouped and counted.

Step 1: Sort

Group by user_id and sort by log_in_date:

select user_id, log_in_date, row_number(a)over(partition by user_id order by log_in_date) as rank from test.user_log_1;
Copy the code

Results:

u0001	2019-10-10	1
u0001	2019-10-11	2
u0001	2019-10-12	3
u0001	2019-10-14	4
u0001	2019-10-15	5
u0001	2019-10-17	6
u0001	2019-10-18	7
u0001	2019-10-19	8
u0001	2019-10-20	9

u0002	2019-10-20	1
Copy the code

The maximum number of consecutive login days for the u0001 user is 4 days, which can be verified using the following calculation method.

2. Step 2: Make the date difference between column 2 and column 3

You can see the pattern, small dates, small line numbers; If you differentiate the date from the line number, the difference for consecutive logins should be the same.

select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number(a)over(partition by user_id order by log_in_date) as rank from test.user_log_1)m;
Copy the code

Results:

u0001	2019-10-09
u0001	2019-10-09
u0001	2019-10-09
u0001	2019-10-10
u0001	2019-10-10
u0001	2019-10-11
u0001	2019-10-11
u0001	2019-10-11
u0001	2019-10-11
u0002	2019-10-19
Copy the code

Obviously, the maximum number of consecutive logins is 4.

3. Step 3: Group and sum by column 2

select user_id, dts, count(1) num from (select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number(a)over(partition by user_id order by log_in_date) as rank from test.user_log_1)m)m2 group by user_id, dts;
Copy the code

Results:

u0001	2019-10-09	3
u0001	2019-10-10	2
u0001	2019-10-11	4
u0002	2019-10-19	1
Copy the code

Step 4: Find the maximum number

The sequence of consecutive login days for each user has been calculated, and then the maximum login days for each user can be taken as follows:

select user_id, max(num) from (select user_id, dts, count(1) num from (select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number(a)over(partition by user_id order by log_in_date) as rank from test.user_log_1)m)m2 group by user_id, dts)m3 group by user_id;
Copy the code

The result is consistent with our expectation, the maximum login days of user U0001 is 4.

u0001	4
u0002	1
Copy the code

Iii. Expansion (the maximum trading day of the stock)

We know that the stock market, such as our A-share, weekend is not open, so A stock if Friday trading limit, Monday then trading limit, this is two consecutive days trading limit, using the above method is not, use the LEAD function to try:

select user_id, log_in_date, lead(log_in_date) over(partition by user_id order by log_in_date) end_date from test.user_log_1;
Copy the code

The results of

u0001	2019-10-10	2019-10-11
u0001	2019-10-11	2019-10-12
u0001	2019-10-12	2019-10-14
u0001	2019-10-14	2019-10-15
u0001	2019-10-15	2019-10-17
u0001	2019-10-17	2019-10-18
u0001	2019-10-18	2019-10-19
u0001	2019-10-19	2019-10-20
u0001	2019-10-20	NULL
u0002	2019-10-20	NULL
Copy the code

Ha, ha, ha, ha.

The first column is the uid, and the last two columns are the date, so the date in the two columns is between Monday and Friday.

  1. If THE date in the third column, minus the date in the second column,The difference is equal to 1, obviously continuous;
  2. If THE date in the third column, minus the date in the second column,The difference is equal to 3But the third column date is Monday, so it’s consecutive.

Above two kinds of conditions are integrated, the biggest that can calculate a stock rises continuously stop number of days, you learned waste.

Hadoop Hadoop data migration usage detail Hbase repair tool Hbck data warehouse modeling layered theory article understand Hive data storage and compression components focus on learning these several