As data has gradually become a valuable corporate asset, big data teams have taken on an increasingly important role. They are typically responsible for maintaining the data platform, developing data products, and mining business value from those products, among other duties. For many big data engineers, then, the most common problem in daily work is how to choose appropriate big data components and design an appropriate architecture for the business requirements at hand. Here, drawing on Qiniu Cloud's log analysis work at the scale of hundreds of billions of log entries per day, we share some experience in big data technology architecture selection.

What do big data architects look at?

In a big data team, the architect's core concern is the selection of the technical architecture. What factors influence that selection? In our practice, architecture selection in the big data field is most influenced by the following:

Data magnitude. Data volume is a particularly important factor in big data. Ultimately, though, data magnitude is itself a proxy for the business scenario: differences in data volume usually indicate differences in business scenarios.

Business requirements. An experienced big data architect can extract the core technical points from complex business requirements and choose an appropriate architecture based on those abstracted points. Major requirements may include: the application's real-time demands, the dimensions and flexibility of queries, multi-tenancy, security and audit requirements, and so on.

Maintenance cost. A big data architect should clearly understand the strengths and weaknesses of the various big data technology stacks and optimize the architecture to meet business requirements. A reasonable architecture reduces maintenance cost and improves development efficiency.

Team. A big data architect should also have a clear picture of the team: the technical expertise and preferences of its members, so as to ensure that the chosen architecture will be recognized, understood, maintained, and developed in the best possible way.

Next, we will look at how these factors shape the selection of the architecture best suited to our team's business.

Selection of technical architecture

Business requirements are multifarious, and it is often not the details of a requirement that drive technology selection, but a few specific scenarios distilled from it. For example, a business requirement may say "build a log analysis system" or "build a user behavior analysis system"; what specific points should we focus on behind such requirements? This is an interesting question. In our big data practice, we find that our questions about such requirements usually come down to the following.

The magnitude of data is an important factor in our technology decisions, but beyond changes in data volume, the needs of various business scenarios also affect our choice of components.

Data magnitude. As mentioned above, data magnitude is a proxy for the business scenario and the single most influential factor in big data applications. For businesses at different data scales, we think about the problem in different ways.

When the data volume is around 10 GB, with tens of millions of records in total, the data is usually the most central business data, such as a user information database. Because of its core business value, this kind of data usually requires strong consistency and real-time access. At this scale, a traditional relational database such as MySQL can satisfy most business requirements. Of course, for problems that relational databases handle poorly, such as full-text indexing, the architect should still pick a search engine such as Solr or Elasticsearch according to the requirements.

If the data scale grows to 100 million to 1 billion records, we face a choice: stick with a traditional RDBMS plus well-designed indexes plus database/table sharding and similar strategies, or move to something like SQL on Hadoop, HTAP, or OLAP components? There is a lot of flexibility here. Our general experience: if the team has database and middleware specialists and wants to keep the architecture simple, continuing with a traditional relational database is a reasonable choice. Nevertheless, we still recommend big data components, because they offer better scalability for future services and support a wider range of business requirements in the foreseeable future.
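To make the "RDBMS plus sharding" option concrete, here is a minimal sketch of application-level database/table sharding. The database count, table count, naming scheme, and hash choice are all hypothetical, for illustration only:

```python
# A minimal sketch of application-level sharding ("sub-database,
# sub-table"): route each user row to one of N tables by hashing
# the user ID. All names and counts here are hypothetical.
import zlib

DB_COUNT = 4        # number of physical databases
TABLES_PER_DB = 16  # tables per database

def route(user_id: str) -> str:
    """Map a user ID to a concrete table like db_2.user_35."""
    h = zlib.crc32(user_id.encode("utf-8"))
    table_no = h % (DB_COUNT * TABLES_PER_DB)
    db_no = table_no % DB_COUNT
    return f"db_{db_no}.user_{table_no}"

print(route("user-10086"))  # -> something like "db_1.user_53"
```

The scheme is simple but rigid: re-sharding when DB_COUNT changes requires data migration, which is one reason teams at this scale start weighing big data components instead.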

When the data volume grows into the billions to tens of billions of records, especially beyond 10 TB, traditional relational databases usually drop out of the candidate architectures. At this point we must select components for specific scenarios in combination with the business. For example, we need to examine carefully: does the scenario require a large number of update operations? Random reads and writes? A full-text index?

These are rough results for some of the major analytics engines at various data scales. The figures in the chart are only general results for most scenarios (not exact benchmark numbers; for reference only). It is worth noting that although we would ideally want the shortest possible response time over the largest possible data volume, there is no silver bullet in big data that solves every problem: each technology component sacrifices some scenarios to stay ahead in its own field.

Real-time requirements are so important that we have to focus on them from the very beginning. "Real time" in business usually carries two meanings:

The first is the real-time nature of data ingestion: when business data changes, how long can our big data application wait before the change becomes visible? Ideally, of course, the more real-time the better, but weighing cost against technology, we generally distinguish real-time systems (millisecond latency), near-real-time systems (second latency), quasi-real-time systems (minute latency), and offline systems (hour- or day-level latency). In general, latency trades off against throughput and computational completeness: the higher the throughput and the more thorough the computation, the longer the latency.

The second is query latency: how long must a user wait after issuing a query before the server returns the result? This depends largely on the product form. If the product is shown to end users, such as a popularity ranking, a hot-search list, or product recommendations, with high QPS demands, latency must be kept at the sub-second level. In another scenario, if the product serves data analysts or operations staff for data exploration, queries often involve massive and unpredictable computation; an offline task model may be more suitable, users tolerate more, and minute-level or even hour-level output is acceptable.

As the figure shows, in the real-time domain one generally chooses HBase or Cassandra, which support high update throughput and random reads and writes, or HTAP components such as TiDB, Spanner, and Kudu, distributed databases that support both transactions and analytics.

For higher analytical performance, you can choose a specialized online analytical processing (OLAP) component such as Kylin or Druid. Both belong to the MOLAP (multidimensional OLAP) family: they build data cubes in advance and pre-aggregate the metrics, sacrificing some query flexibility in exchange for guaranteed real-time queries.
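As a toy illustration of the MOLAP idea (not Kylin's or Druid's actual implementation), the sketch below pre-aggregates a metric for every combination of a fixed dimension set at ingest time, so a query becomes a dictionary lookup instead of a scan; the field names are invented:

```python
# Toy MOLAP cube: materialize every subset of a fixed dimension set
# ahead of time; queries then become single lookups.
from collections import defaultdict
from itertools import combinations

events = [
    {"region": "east", "device": "ios", "requests": 3},
    {"region": "east", "device": "android", "requests": 5},
    {"region": "west", "device": "ios", "requests": 2},
]

DIMENSIONS = ("region", "device")
cube = defaultdict(int)

for e in events:
    # Pre-aggregate for every subset of dimensions (the "data cube").
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, r):
            key = tuple((d, e[d]) for d in dims)
            cube[key] += e["requests"]

# Query: total requests in the east, any device -- one lookup.
print(cube[(("region", "east"),)])  # -> 8
```

The flexibility cost is visible here too: only the dimension combinations materialized in advance are fast, which is exactly the trade-off the prose describes.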

Elasticsearch is one of the most flexible NoSQL query engines. It supports full-text indexing, which the other engines here do not, as well as a modest volume of updates, aggregation analysis, and search over detailed records, making it applicable to many near-real-time scenarios. However, because Elasticsearch is built on the Lucene storage engine, its resource cost is higher, and its analytical performance has no advantage over the other engines.
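For a feel of the mixed workload Elasticsearch covers, here is a hedged sketch combining a full-text query, an aggregation, and detail rows in a single request via the official Python client; the index name and field names are hypothetical, and the mapping is assumed to expose a keyword field for the terms aggregation:

```python
# Full-text search + aggregation + detail rows in one Elasticsearch
# request (elasticsearch-py, 7.x-style body= form). Index and fields
# are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(
    index="app-logs-*",
    body={
        "query": {"match": {"message": "timeout"}},        # full-text search
        "aggs": {
            "by_service": {"terms": {"field": "service"}}  # assumes keyword mapping
        },
        "size": 5,                                         # detail rows
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["message"])
for bucket in resp["aggregations"]["by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```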

In addition, if our data is archived offline or append-only and the product form relies on large-batch operations over the data, the product can usually tolerate high query latency. Here a series of products from the Hadoop ecosystem fit: Spark, the next-generation computing engine after MapReduce, and the SQL on Hadoop family, in which Drill, Impala, Presto, and others each have their own strengths and can be selected according to other business requirements.
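As one hedged example of the SQL on Hadoop style, the sketch below runs an interactive SQL query over data in Hive through Presto using the PyHive client; the host, schema, and table are assumptions:

```python
# Interactive SQL over Hadoop data via Presto (PyHive client).
# Host, catalog, schema, and table names are hypothetical.
from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080,
                      catalog="hive", schema="logs")
cur = conn.cursor()
cur.execute("""
    SELECT status, count(*) AS hits
    FROM access_log
    WHERE dt = '2018-06-01'
    GROUP BY status
""")
for status, hits in cur.fetchall():
    print(status, hits)
```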

Dimensions and flexibility of computation

The dimensions and flexibility of computation are two important selection factors. For example, if our product only produces a fixed set of indicators, we can compute them offline with Spark, import the result set into a business database such as MySQL, and serve the display layer from there.
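A minimal sketch of this "fixed indicators" pattern follows, assuming a hypothetical JSON log path, MySQL reporting table, and credentials; it computes a small daily result set with Spark and writes it out over JDBC:

```python
# Fixed-indicator pattern: offline Spark aggregation, small result set
# pushed to MySQL for the display layer. Paths, table, and credentials
# are hypothetical; requires a MySQL JDBC driver on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-indicators").getOrCreate()

logs = spark.read.json("hdfs:///logs/2018-06-01/*.json")
daily = (logs.groupBy("user_id")
             .agg(F.count("*").alias("requests"),
                  F.sum("bytes").alias("traffic")))

(daily.write
      .format("jdbc")
      .option("url", "jdbc:mysql://db.example.com:3306/report")
      .option("dbtable", "daily_indicators")
      .option("user", "report").option("password", "***")
      .mode("overwrite")
      .save())
```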

But if the query is interactive, letting users choose their own dimensions for aggregation, we cannot pre-compute every permutation and combination of dimensions. In that case we may need an OLAP component that can aggregate metrics along the specified dimensions at query time; such a choice improves the flexibility of result presentation while greatly reducing query latency.
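To illustrate query-time aggregation along user-chosen dimensions, here is a hedged sketch using Druid's native groupBy query over HTTP; the broker address, datasource, and field names are hypothetical:

```python
# Query-time aggregation along dimensions the user picks at runtime,
# via Druid's native groupBy query. Broker URL, datasource, and fields
# are hypothetical.
import requests

def group_by(dimensions, broker="http://druid-broker:8082"):
    query = {
        "queryType": "groupBy",
        "dataSource": "app_logs",
        "granularity": "day",
        "dimensions": dimensions,  # chosen by the user at query time
        "aggregations": [{"type": "longSum", "name": "requests",
                          "fieldName": "count"}],
        "intervals": ["2018-06-01/2018-06-02"],
    }
    return requests.post(f"{broker}/druid/v2/", json=query).json()

# The user decides to slice by region and device:
print(group_by(["region", "device"]))
```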

Furthermore, if users not only compute indicators but also query the original detailed records, an OLAP component may no longer be suitable, and more flexible components such as ES or SQL on Hadoop are needed: choose ES if there is a full-text search requirement, SQL on Hadoop if not.

Multi-tenancy

Multi-tenancy is another issue big data architects often need to consider. Multi-tenant requirements arise from serving many different users and are very common in a company's infrastructure department.

What are the considerations for multi-tenancy?

The first is resource isolation. From the perspective of saving resources, sharing resources among tenants allows them to be fully utilized, which is what we want in infrastructure. However, some tenants may have higher service levels or larger data volumes; if they share resources with ordinary tenants, resource contention may occur, so physical resource isolation should be considered for them.

The second is user security. On one hand, authentication is needed to prevent malicious or unauthorized access to data. On the other hand, an audit log must be recorded for every sensitive operation so that the source IP address and user of each operation can be traced.

Third, and most important, is data permissions. A multi-tenant system is not only about isolation but also about sharing resources more efficiently. Data permissions are often no longer limited to read/write permission on a file or a warehouse; more often, we need to authorize a subset of the data, or certain fields, so that each data owner can distribute resources safely to the tenants who need them. Making data flow and be used more efficiently is an important mission of a data platform or application.
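As a toy sketch (not Pandora's actual mechanism) of what subset- and field-level grants might look like, with all structures invented for illustration:

```python
# Toy field-level data authorization in a multi-tenant platform: a
# grant names the repository, a row filter (the data subset), and the
# visible fields. All structures here are hypothetical.
GRANTS = {
    ("tenant_b", "access_log"): {
        "filter": lambda row: row.get("product") == "cdn",  # data subset
        "fields": {"timestamp", "url", "status"},           # visible fields
    },
}

def read_as(tenant, repo, rows):
    grant = GRANTS.get((tenant, repo))
    if grant is None:
        raise PermissionError(f"{tenant} has no grant on {repo}")
    for row in rows:
        if grant["filter"](row):
            yield {k: v for k, v in row.items() if k in grant["fields"]}

rows = [{"timestamp": 1, "url": "/a", "status": 200,
         "product": "cdn", "user_ip": "10.0.0.1"}]
print(list(read_as("tenant_b", "access_log", rows)))
# -> user_ip is stripped; non-cdn rows would be filtered out entirely
```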

Maintenance costs

The cost of maintaining a big data platform is a critical metric for architects; an experienced architect chooses technical solutions suited to the characteristics of the team.

• Usage cost is directly proportional to the complexity of the technical components: generally, the more complex the components and the more of them there are, the higher the cost of operating the combination.

• Maintenance cost depends on the service provider and on component complexity. In general, a single technical component is cheaper to maintain than a complex combination, and a component provided as a cloud service is cheaper to maintain than a self-built one.

• Team requirements: generally, the more complex the technical components, the higher the demands on the team. On the other hand, team requirements also relate to the service provider: if a cloud provider takes over component operations, the business team can free more engineers from O&M to work on big data applications.

So in general, an architect's preference should be the technical architecture that is easiest to use and maintain while satisfying the business and data-volume requirements. On that basis, if you have a strong development and O&M team, you can build your own big data platform; if O&M and development resources are insufficient, a cloud service platform is recommended.

How Qiniu Cloud does architecture selection

Qiniu Cloud's big data team is called Pandora. Its main job is to support the big data platform requirements inside Qiniu Cloud, and also to productize the platform and provide professional big data services to external customers. You could say Qiniu Cloud is Pandora's first customer, and we have accumulated a great deal of technology selection experience by carrying the company's various internal needs.

Qiniu Cloud's characteristics and business challenges

A brief overview of the challenges in the Qiniu Cloud scenario. Besides Pandora, Qiniu Cloud has six product teams, including cloud storage, live-streaming cloud, CDN, intelligent multimedia API services, and container cloud. Business data and log data generated by all product teams are collected into Pandora's unified log store using logkit (Professional Edition), Pandora's own collection tool, and each department then builds its data applications on top of this data.

First, the commercial operations department carries the mission of Qiniu Cloud's overall revenue and growth. It needs the instrumentation and log data collected by every team to build a unified user view and user profiles on top of it, so as to provide customers with more attentive operational services and improve customer satisfaction.

In addition, the SRE team needs in-depth performance tracing of online systems, so we need to support the OpenTracing interface. With Qiniu Cloud's relatively unified technology stack, we can easily support full-link tracing, letting SRE track and monitor online service performance without depending on the R&D teams' own instrumentation, and locate service problems more easily.

The product R&D side raised the need for full-text indexing: the ability to quickly locate log entries by keyword within nearly 100 TB of daily logs and query the surrounding log context. They also need to parse key fields out of application logs, such as user ID, response time, and download traffic, in order to monitor per-user operational metrics and serve customers more precisely.

Of course, whatever a business department's requirements are, it needs an excellent and flexible reporting system to support business analysis, exploration, and decision-making: a reasonable architecture that supports complex business reporting and BI requirements.

How the architecture landed at Qiniu Cloud

Considering the product needs of all parties, we made the following product design:

We first developed logkit Professional Edition, which collects and synchronizes data from various open-source systems and log files. We also designed a data bus, Pipeline, matched to Qiniu Cloud's profile of very high data throughput with latency acceptable at the second level. Here we adopted multiple Kafka clusters plus Spark Streaming, along with a self-developed stream scheduling system, to export data efficiently to the downstream unified log storage product; meanwhile, Spark Streaming conveniently handles log parsing and field extraction.
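A minimal sketch of the parse-and-extract stage described above follows; it is written with Spark Structured Streaming rather than the DStream API the pipeline actually used, and the brokers, topic, and log format are all assumptions:

```python
# Parse-and-extract stage: read raw log lines from Kafka, pull out key
# fields with regexes, forward them downstream. Brokers, topic, and
# log format are hypothetical; needs the spark-sql-kafka package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-pipeline").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092")
       .option("subscribe", "app-logs")
       .load())

lines = raw.selectExpr("CAST(value AS STRING) AS line")
parsed = lines.select(
    F.regexp_extract("line", r"uid=(\w+)", 1).alias("user_id"),
    F.regexp_extract("line", r"rt=(\d+)ms", 1).cast("long").alias("response_ms"),
    F.regexp_extract("line", r"bytes=(\d+)", 1).cast("long").alias("traffic"),
)

# Console sink as a stand-in for the unified log store.
query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```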

On top of the unified log store, we support both our own and various third-party chart presentation systems. For the back-end data systems, we use a hybrid architecture whose body consists of three base products.

The log analysis platform builds a log storage and indexing system on Qiniu Cloud's customized version of ES; even at an ingest rate of 100W/s (one million entries per second), it guarantees second-level response for searches over billions of records.

Data Cube is an OLAP product based on a customized Druid. It provides high-performance multi-tenant queries, offering millisecond-level aggregation analysis even for our largest customers, who generate 30 TB+ of raw data per day.

The offline workflow, built on the storage layer and a Spark workflow platform, provides offline computing capability for large-scale calculation and analysis of PB-level data.

Architecture advantages

After all this big data practice, what does Pandora bring to internal and external users? Compared with outstanding commercial and open-source products in the industry, Qiniu Cloud has the following advantages:

Complete multi-tenant support

For multi-tenant resource isolation, Pandora supports multiple levels of isolation. This includes low-level namespace isolation, where we limit each customer's use of shared resources such as CPU and memory so that all customers can use the cluster safely. Moreover, to meet further customization requirements, we use multi-cluster dynamic scaling to provide cluster-level isolation between tenants, so that such users get independent resources.

In addition, security, permissions, and auditing, which matter greatly in multi-tenant scenarios, have been substantially strengthened. We can manage data permissions at the granularity of data subsets and fields, and authorize data to other tenants. At the same time, we audit every data operation, down to the source IP and operator, to guarantee the data security of the cloud service.

Support rich business scenarios

On Pandora's big data platform, rooted in the log domain, we support both real-time and offline computing models and provide a workflow interface for operating big data flows simply and conveniently. Products and tools such as log analysis and the data cube support a variety of business scenarios, including but not limited to:

• User behavior analysis
• Application performance monitoring
• System and device performance monitoring
• Non-invasive instrumentation supporting full-link tracing, to discover application bottlenecks in distributed systems
• IoT device data analysis and monitoring
• Security, auditing, and monitoring
• Machine learning: automatic system anomaly detection and attribution analysis

Validated by large-scale public cloud data

We already serve more than 200 named customers on the public cloud, with more than 250 TB of data flowing in every day, about 365 billion log entries per day. The daily volume of data involved in computation and analysis exceeds 3.2 PB. This public cloud scale verifies that Pandora's big data log analysis platform can provide customers with a stable computing platform and solid business support.

Users enjoy the lowest O&M cost

Pandora's product design philosophy is that a cloud service should be an all-in-one product. For customers, Pandora remains a single product component even though it adapts to a large number of application scenarios. As a result, the O&M cost of using Qiniu Cloud's big data service is minimal: a single development team can take care of every aspect of data development and operations, which greatly helps rapid business iteration and growth.

That is our experience with Pandora. Qiniu Cloud's intelligent log management platform is committed to reducing users' mental burden and helping customers upgrade digitally. Click "Read the original article" to learn about the product details and the free quota policy:

• New log data: 1 GB/month
• Existing log data: 1 GB/month
• Log warehouse: 1 month
• API calls: 1 million/month

Everyone is welcome to register and try it out.

Document site address: pandora-docs.qiniu.com

Great Talk

The Great Talk column is dedicated to discovering the thinking of technical people: technical practices, in-depth technical content, technical insights, growth advice, and anything else worth discovering. We hope to gather the best technical minds and uncover voices that are original, sharp, and of the moment.
