3 billion logs, retrieval + paging + background display, have you ever encountered more bizarre requirements?

Hello, Mr. Shen. I want to ask you about a database query log, which is displayed on the front desk page.

Requirements:

(1) Query logs according to certain search conditions;

(2) Query and display related log information through the foreground Web page;

(3) Retrieval requirements include user, time range, type and other specific fields;

Hope to do:

(1) Query speed as fast as possible;

(2) support paging query;

Current plan:

The log information is stored in the Oracle database. The Oracle database is partitioned according to the date. A partition table is generated every day, and the total amount of data in each partition table is about 1000W. Index related query fields such as user and type to meet the query requirements of different dimensions.

Potential problems:

A cross-partition query will take 3-4 minutes to calculate the total number of records. Is there any optimization method?

== Problem description ==

This requirement is still very abnormal, usually logs will be filtered/structured/summarized, into the data warehouse, the establishment of a wide table of business, the query on the wide table, generally will not specifically check a row of records.

If you want to support retrieval and display it line by line in the Web background, you need to address at least a few issues:

(1) Storage problems;

(2) Retrieval problems;

(3) Scalability problems (data volume expansion, retrieval field expansion);

First, storage problems

Can I use a relational database to store logs?

This can be done if the log format is fixed and the search criteria fixed.

Such as:

2019-08-11 23:19:20 uid=123 action=pay type=wechat money=12

Can be converted to a table:

t_log(date, time, uid, action, type, money)

Then index related fields to meet the requirements of background query and display.

The amount of data is too large, how to solve?

According to the description of the topic, the daily data volume is about 1000W level, and the data volume of one year is about 36Y level.

If using Oracle storage, 1000W is a partition table:

365 partitions are required in a year, and cross-partition query performance is low and not suitable.

Change to one partition per month:

Single partition 3Y records, most partitions have no write operations (insert, modify, delete), only read operations on the index, read and write performance can withstand. Twelve partitions a year is much better than 365 partitions.

Although the logs in this example can be structured (converted into tables), because of the large amount of data, in fact, a relational database is not suitable, you can use ES or Hive for larger data volume to store.

Second, the retrieval problem

The log format and search criteria are fixed. If relational databases or Hive storage is used, indexes can be built on related fields to meet query requirements.

If ES is used for storage, its internal implementation with inverted tables, natural support for retrieval.

Third, scalability

Data volume expansion

Whether Oracle, ES, or Hive is used for storage, the only difference is the storage capacity of a single instance/cluster. If the amount of data is unlimited, the solution is essentially “horizontal shard”.

Note that, try not to use the own “partition table” to expand, and in the business layer itself split.

Voice-over:Why don’t Internet companies use partition tables?”.

Retrieval field extension

If the log is not standardized and the search field is not fixed, then it becomes a “search engine” problem.

At this time, it is more appropriate to use ES, but in combination with the infinite amount of data, it may eventually need to realize the storage in the search engine (similar to Baidu, the storage capacity is infinite, the search field is not fixed).

Voice-over:”Search” principle, architecture, implementation, practice!”.

Conclusion:

In this example, the log volume is large and the mode is fixed. Suggestions:

(1) It is most recommended to use Hive storage and index to achieve the requirements of log background retrieval;

(2) If the scalability requirements are slightly higher, ES can be used to realize storage and retrieval, and horizontal expansion can be used to store a larger amount of data;

I hope the above ideas are helpful to xing Guan. My experience is limited. I also welcome you to contribute more and better plans.

3 billion logs, retrieval + paging + background display, have you ever encountered more bizarre requirements?

Related Posts

🧡 data structure and algorithm 🧡 from zero to have a series of ten hash tables

In-depth understanding of object-oriented programming in Python

Very wow ES read scene, write scene performance optimization guide! You deserve it!