I. Introduction to the big data platform

1.1 Evolution of big data platform architecture

The figure shows the evolution of Meizu's big data platform architecture:

At the end of 2013, we began to practice big data and deployed a test cluster of only three nodes. Because we started relatively late, we skipped the Hadoop 1.0 era entirely: the cluster ran on YARN, with HA enabled by default.

In September 2014, the cluster grew to 20 nodes and the data volume increased by 30 GB.

In June 2015, Spark and HBase went into production, the cluster reached 100 nodes, and data grew by 10 TB per day.

In May 2016, we implemented remote data disaster recovery.

Right now we are focusing on big data security, including user authentication and authorization. The platform has reached nearly 1,000 servers and 30 PB of storage, growing by roughly 60 TB and running 20,000 computing tasks every day, covering search, advertising, recommendation, statistical analysis, user profiling, crash tracking and more. This year we will bring a new data center online dedicated to big data workloads; by then the platform will exceed 2,000 nodes.


1.2 Challenges of Big Data operation and maintenance

What are the challenges of operating and maintaining a platform of this scale?

First, look at the characteristics of big data operation and maintenance.

- Cluster scale and data volume are large and grow explosively. Big data literally means "big": a large amount of data, a large cluster, and explosive growth.

- There are many components, which are interrelated and complicated.

- Components are deployed, brought online, and taken offline in batches. Deployment at launch is usually done in batches, and traditional methods such as scripting tools are inefficient and error-prone.

The characteristics of big data operation and maintenance determine that the core problems of big data operation and maintenance are mainly reflected in two aspects:

One is how to manage a large-scale cluster;

The other is how to provide users with high-quality services.

Therefore, the goals of big data operation and maintenance are:

- Automation, with the primary goal of taming operational complexity. Automation improves stability: machines execute operations more reliably than people, and codifying operations for machines to run reduces human error and improves the stability of the production environment. Automation also improves efficiency: by handing most routine operations over to machines, it frees operation and maintenance staff from daily drudgery so that we have more time for more meaningful and valuable work.

- Intelligence, aimed at prediction and automatic decision making. The trend in big data operation and maintenance is shifting from automation that tames operational complexity to intelligence that predicts and decides automatically, so we need to get automation right before we can embrace intelligence.

1.3 Problems in big data operation and maintenance


- Complex deployment and operation.

- Duplicated monitoring, duplicated alarms, and scattered monitoring systems.

- No permission control or security auditing; tasks can be run arbitrarily. We do not limit user permissions: with one account, a user can access data across the whole cluster. In terms of security auditing, we cannot tell whether a user has accessed sensitive data. Moreover, Hadoop does not ship with a complete user management system or out-of-the-box security settings, so we have to explore and practice on our own.

- No way to query business resource usage or handle resource requests. We do not know how much storage and computing each business consumes every day; when a business wants to expand capacity, it simply claims that resources are insufficient, and operations has no solid grounds on which to push back.

II. Construction of the big data operation and maintenance platform

2.1 Platform Selection


Comparing commercial and open-source products:

- Commercial products: the advantages are very simple installation and deployment and complete interface functionality; the downsides are that updates are slow, that we cannot optimize them ourselves because they are not open source, and that they tend to consume a lot of system resources.

- Open-source products: the advantages are openness, zero cost, strong community support, and easy installation and deployment; the disadvantages are weaker stability and fixed functionality.

- There is a third option: in-house development. Self-developed products can be customized and have flexible, rich functionality, but they are time-consuming and carry high R&D costs.

Our approach is to combine the advantages of all three: with a modest investment, obtain a feature-rich, customizable, stable and continuously iterated management platform, and rely on a strong community to solve problems quickly when they arise.

2.2 Component version selection


2.3 Establishment of operation and maintenance specifications


- System specifications, which mainly constrain system versions, kernels, system accounts, etc.

- Deployment specifications, which constrain component versions, configurations, etc.

- Capacity expansion and upgrade specifications, mainly used to define the window periods for expansion and upgrades.

- Task operation specifications, for example that a single task may not run 24 hours a day.

- Monitoring specifications, so that we avoid, as far as possible, incidents caused by non-standard monitoring.

By formulating these specifications we regulate how people operate, minimizing operation and maintenance incidents caused by non-standard actions.

2.4 Platform construction process


- First, use the platform to codify and automate operations, and land standardization through the platform.

- Next, establish a unified monitoring and alarm platform.

- Third, establish comprehensive security protection.

- Finally, make resource usage visible and account for its cost.

2.4.1 Platform and automation


Installation platform & cloud platform

The first step is initialization, that is, installing the system, which differs by machine type: physical machine or VM. For physical machines we have an installation platform that installs operating systems in batches and standardizes system versions and accounts. For VMs we have a cloud platform that creates and initializes cloud hosts in batches, used for temporary tasks or experimental clusters.
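As a rough illustration of the batch-initialization idea, the sketch below submits a provisioning job for a list of hosts. The API endpoint, payload fields, and response shape are hypothetical, invented for this example; they are not Meizu's actual installation platform.

```python
# Hypothetical sketch of batch host initialization; the endpoint, payload
# fields and response shape are illustrative only, not a real platform API.
import requests

INSTALL_API = "http://install-platform.example.com/api/v1/provision"  # hypothetical

def provision_hosts(hostnames, os_image="centos-7.4", system_account="hadoop"):
    """Submit a batch OS-installation job with a standardized image and account."""
    payload = {
        "hosts": hostnames,
        "image": os_image,              # standardized system version
        "accounts": [system_account],   # standardized system account
    }
    resp = requests.post(INSTALL_API, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]        # assumed response shape

if __name__ == "__main__":
    job = provision_hosts([f"bd-node-{i:03d}" for i in range(1, 21)])
    print("submitted provisioning job:", job)
```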

Cluster management

In terms of cluster management, both the commercial and open-source products mentioned earlier generally manage a single cluster. With multiple clusters, multiple platforms must be deployed, which adds operation and maintenance complexity.

What we do is manage multiple clusters in one place, each with its own space and without interfering with the others. At the product level this delivers grouped cluster management, which reduces the difficulty and improves the efficiency of operation and maintenance.

Host management

On the platform, you can view host status, filter the host list, and perform host-level operations such as enabling or disabling hosts. You can also add and delete hosts and manage and display racks.


On the host list page, you can view host details, component status, health status, and resource usage, including CPU, memory, and disk information.

The health status of a host falls into four categories. Red means at least one master component on the host is down; orange means at least one slave component is down; yellow means the host has not reported a heartbeat for three minutes; blue means the host is running normally.
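As a minimal sketch, the logic below shows how the four health colors could be derived from component states and heartbeat age; the data structures are assumptions for illustration, not Ambari's actual model.

```python
import time

HEARTBEAT_TIMEOUT = 3 * 60  # seconds without a heartbeat before a host is flagged

def host_health(components, last_heartbeat_ts, now=None):
    """Map component states and heartbeat age to the four colors described above.

    components: list of dicts like {"role": "master"|"slave", "alive": bool}
    """
    now = now or time.time()
    if any(c["role"] == "master" and not c["alive"] for c in components):
        return "red"      # at least one master component is down
    if any(c["role"] == "slave" and not c["alive"] for c in components):
        return "orange"   # at least one slave component is down
    if now - last_heartbeat_ts > HEARTBEAT_TIMEOUT:
        return "yellow"   # no heartbeat for more than three minutes
    return "blue"         # host is running normally

# Example: a host whose DataNode (slave) has died
print(host_health(
    [{"role": "master", "alive": True}, {"role": "slave", "alive": False}],
    last_heartbeat_ts=time.time(),
))  # -> "orange"
```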

Component operations

For component operation and maintenance, you can view the overview and status of all components, start and stop them, add and delete them, modify their configurations, and perform other component-level operations. The complete lifecycle of a cluster also includes the decommissioning, or recycling, phase, which is easy to understand: it is essentially the reverse of the four steps above.
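For clusters managed with Ambari, component and service operations like these can be driven through Ambari's REST API. The sketch below stops and then starts a service; the cluster name, server URL, and credentials are placeholders.

```python
# Sketch of stopping and starting a service through the Ambari REST API.
# Cluster name, credentials, and URL are placeholders; adjust for your deployment.
import requests

AMBARI = "http://ambari-server.example.com:8080/api/v1"
AUTH = ("admin", "admin")                      # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}         # required by Ambari for write calls

def set_service_state(cluster, service, state, context):
    """state is 'INSTALLED' (stopped) or 'STARTED' (running)."""
    body = {
        "RequestInfo": {"context": context},
        "Body": {"ServiceInfo": {"state": state}},
    }
    url = f"{AMBARI}/clusters/{cluster}/services/{service}"
    resp = requests.put(url, json=body, auth=AUTH, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json() if resp.text else {}    # Ambari returns an async request id

# Usage: stop, then start HDFS on cluster "c1"
set_service_state("c1", "HDFS", "INSTALLED", "Stop HDFS from ops platform")
set_service_state("c1", "HDFS", "STARTED", "Start HDFS from ops platform")
```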

Through platformization we have achieved standardization, and automation has greatly reduced the difficulty of deploying big data clusters, improved operation and maintenance efficiency, and safeguarded cluster stability.

2.4.2 Unified Monitoring and Alarms

Monitoring data collection architecture (time-series metrics)

As mentioned earlier, traditional big data monitoring is relatively scattered. Our solution is the Ambari metrics monitoring system, which uniformly monitors the services and hosts on the platform and exposes their metrics, so that we can judge the health of the cluster. The whole process covers the collection, storage, aggregation and retrieval of monitoring metrics.

Data collection process


One metrics table stores per-host metrics, and another aggregation table stores cluster-level metrics.

Then comes metric retrieval. Ambari provides two ways: one is to fetch metrics directly from the metrics collector; the other is to go through the Ambari Server. The former is closer to the raw metric data; the latter is the more commonly used path, and it is how the Ambari UI itself gets its metrics. In essence, however, the latter still calls the former: it fetches the raw data from the store, applies processing and business logic, and finally returns the result to the web UI.
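A sketch of the two retrieval paths, assuming a default Ambari Metrics deployment; the host names, ports, credentials, and metric names are placeholders and may differ per environment.

```python
# Sketch of the two ways to read metrics described above.
import requests

# 1) Directly from the Metrics Collector: closer to the raw time-series data.
collector = "http://ams-collector.example.com:6188/ws/v1/timeline/metrics"
raw = requests.get(collector, params={
    "metricNames": "cpu_user",
    "hostname": "bd-node-001.example.com",
    "appId": "HOST",
}, timeout=30).json()

# 2) Through the Ambari Server: the same data after aggregation/processing,
#    which is the path the Ambari web UI consumes.
ambari = "http://ambari-server.example.com:8080/api/v1"
processed = requests.get(
    f"{ambari}/clusters/c1/hosts/bd-node-001.example.com",
    params={"fields": "metrics/cpu"},
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
    timeout=30,
).json()

print(raw.get("metrics", []), processed.get("metrics", {}))
```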

Unified alarm platform: receive alarms and route them to the responsible person based on rules

Where there is monitoring there is a need for alarms, but alarms cannot simply be forwarded as-is, or the recipients will drown in alarm messages. All alarm information should be taken over by a unified platform, which receives what is necessary and merges and dispatches it according to the configured rules.

The purpose of our unified alarm platform is to eliminate missed alarms, avoid disturbing off-duty staff, reduce alarm fatigue, and ensure that alarm, fault and reminder notifications reach the right people promptly, accurately and efficiently. By optimizing the existing alarm handling process, we introduced an on-duty mechanism, an alarm escalation mechanism, and alarm convergence rules to achieve precise fault notification. Together, the unified monitoring platform and alarm platform reduce the difficulty of configuring big data alarms and keep operation and maintenance staff responsive to them.
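A minimal sketch of what such convergence and routing rules might look like; the duty roster, thresholds, and notification channel are illustrative, not our platform's actual implementation.

```python
# Minimal sketch of alarm convergence and routing; rule formats, duty roster and
# notification channel are examples only.
import time
from collections import defaultdict

DUTY_ROSTER = {"hdfs": "alice", "yarn": "bob"}   # service -> on-duty engineer (example)
MERGE_WINDOW = 300                               # merge identical alarms within 5 minutes
ESCALATE_AFTER = 3                               # escalate after N repeats in the window

_last_sent = {}
_repeat_count = defaultdict(int)

def notify(person, message):
    print(f"notify {person}: {message}")         # stand-in for SMS/IM/phone call

def handle_alarm(service, level, message, now=None):
    """Merge duplicate alarms, route to the on-duty engineer, escalate if repeated."""
    now = now or time.time()
    key = (service, message)
    _repeat_count[key] += 1
    # Convergence: suppress identical alarms inside the merge window.
    if key in _last_sent and now - _last_sent[key] < MERGE_WINDOW:
        if _repeat_count[key] >= ESCALATE_AFTER:
            notify("oncall-leader", f"[ESCALATED][{level}] {service}: {message}")
        return
    _last_sent[key] = now
    _repeat_count[key] = 1
    notify(DUTY_ROSTER.get(service, "ops-team"), f"[{level}] {service}: {message}")

handle_alarm("hdfs", "CRITICAL", "NameNode RPC latency high")
handle_alarm("hdfs", "CRITICAL", "NameNode RPC latency high")  # merged, not re-sent
```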

2.4.3 Authentication, Authorization, and Audit Comprehensively Protect Hadoop


- The innermost layer is OS-level security, mainly account settings and the like.

- The second layer is permission control, which grants or denies specific users access to specific resources. Permission control builds on security authentication.

- The third layer is security authentication, which verifies a user's identity to ensure the user is who they claim to be. We use Kerberos.

- The outermost layer is network boundary security, including hardware firewalls, VLAN/subnet isolation, and so on.

Through these four layers we ensure that every user has passed security authentication and is authorized to operate on the cluster. Whether a user's operations are reasonable, what data they access, and whether they attempt to access sensitive data is left to auditing, which is equally crucial to data security. OS-level security and network security are fairly standard, so we do not expand on them here.

Security authentication

Hadoop security authentication is based on Kerberos, a network authentication protocol. A user submits authentication credentials; if authentication succeeds, the user obtains a ticket, which can then be used to access multiple Kerberized services.

It mainly addresses server-to-server and client-to-server authentication; it does not implement user-level control, that is, it cannot by itself restrict which users may submit jobs.

Hadoop itself does not maintain user credentials; instead it authenticates users through the Kerberos protocol. Let us briefly walk through how an MR task is submitted on YARN.

Before running a task, the client first authenticates itself to the KDC and obtains a TGT. Using the TGT, the client requests a service ticket from the KDC, which generates a session key and returns it to the client. The client then presents the ticket to the service to complete identity authentication.
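In practice the ticket-granting step is usually done with a keytab and kinit before the Hadoop client is invoked. A small sketch, with the principal and keytab path as placeholders:

```python
# Sketch of how a client obtains a Kerberos ticket before talking to a
# Kerberized service; principal and keytab path are placeholders.
import subprocess

PRINCIPAL = "datauser@EXAMPLE.COM"                      # placeholder principal
KEYTAB = "/etc/security/keytabs/datauser.keytab"        # placeholder keytab path

def kinit(principal=PRINCIPAL, keytab=KEYTAB):
    """Authenticate against the KDC and cache a TGT for later service tickets."""
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)

def list_hdfs(path="/"):
    """Subsequent Hadoop client calls reuse the cached TGT to obtain service tickets."""
    return subprocess.run(["hdfs", "dfs", "-ls", path],
                          check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    kinit()
    print(list_hdfs("/user"))
```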

Access control

Permission control is implemented with Apache Ranger, a security management framework for the Hadoop ecosystem. Ranger centrally defines and manages security policies; it ships with common policy conditions and policy models, and the security policies are enforced inside the Hadoop components it supports.

As shown in the figure above, let’s take a look at the policy execution process.

When a user requests a resource, all policies matching that resource are evaluated. First the deny lists are checked: if the user is on a deny list, access is refused. Then the allow lists are checked: if the user is on an allow list, access is granted; if the user is on neither list, access is refused, or the decision is delegated. Ranger can optionally delegate the decision to the access control layer of the underlying system itself.
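A minimal sketch of this deny-first evaluation order, using a deliberately simplified policy structure rather than Ranger's actual internal model:

```python
# Minimal sketch of deny-first policy evaluation; the policy structure is simplified.
def evaluate(user, resource, action, policies, delegate_to_native=False):
    """Return 'deny', 'allow', or 'delegate' for a (user, resource, action) request."""
    matching = [p for p in policies if p["resource"] == resource]
    # 1) Deny rules win outright.
    for p in matching:
        if user in p.get("deny_users", []) and action in p.get("actions", []):
            return "deny"
    # 2) Otherwise look for an explicit allow.
    for p in matching:
        if user in p.get("allow_users", []) and action in p.get("actions", []):
            return "allow"
    # 3) No matching rule: deny, or hand the decision to the component's own ACLs.
    return "delegate" if delegate_to_native else "deny"

policies = [{
    "resource": "/warehouse/sensitive",
    "actions": ["read"],
    "allow_users": ["analyst"],
    "deny_users": ["intern"],
}]
print(evaluate("analyst", "/warehouse/sensitive", "read", policies))  # allow
print(evaluate("intern", "/warehouse/sensitive", "read", policies))   # deny
```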

Ranger architecture


The first part is user synchronization, which periodically loads users and synchronizes them to the Ranger Admin.

The second part is the administration center (Ranger Admin), which provides the user interface and API and manages the policies that users define.

The third part is the plugins on the client side. These plugins are embedded in the components' request paths; they periodically pull policies from the Ranger Admin, enforce access decisions against those policies, and regularly log the accesses for auditing.
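For reference, the policies that the plugins pull can also be inspected centrally. A sketch assuming Ranger's public v2 REST API, with the Admin URL, credentials, and service name as placeholders:

```python
# Sketch of querying policies from the Ranger Admin REST API (the same policies
# the component plugins pull and enforce). URL, credentials and service name
# are placeholders.
import requests

RANGER_ADMIN = "http://ranger-admin.example.com:6080"
AUTH = ("admin", "admin")   # placeholder credentials

def list_policies(service_name):
    """Fetch all policies defined for a given Ranger service (e.g. an HDFS repo)."""
    resp = requests.get(
        f"{RANGER_ADMIN}/service/public/v2/api/service/{service_name}/policy",
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for policy in list_policies("hadoopdev_hdfs"):   # example service name
    print(policy.get("name"), policy.get("resources"))
```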

Security audit


The Apache Eagle framework consists of three main parts:

The first is the ingestion and storage of data streams. Eagle can receive any type of data stream into its policy engine. For example, HDFS audit logs can be collected through Kafka and sent to the policy engine for processing.

The second is a real-time data processing engine. Eagle provides an efficient stream processing API, independent of the underlying physical platform, to process the incoming data in real time.

The third is alerting: Eagle provides a flexible alerting framework.

Eagle's main application scenario is monitoring data behavior:

- Monitor data access in Hadoop.

- Detect illegal intrusions and security violations.

- Detect access to sensitive data and prevent data loss (a simplified sketch follows this list).

- Perform real-time, policy-based detection and alerting.

- Perform anomaly detection based on user behavior.
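As a simplified stand-in for the sensitive-data scenario (not Eagle's actual policy engine), the sketch below consumes HDFS audit log lines from Kafka and flags access to configured sensitive paths; the topic, brokers, and paths are examples.

```python
# Simplified stand-in for sensitive-data access detection: consume HDFS audit
# log lines from Kafka and flag access to configured sensitive paths.
import re
from kafka import KafkaConsumer   # pip install kafka-python

SENSITIVE_PREFIXES = ("/warehouse/user_profile", "/data/payments")   # example paths
AUDIT_RE = re.compile(r"ugi=(\S+).*?cmd=(\S+).*?src=(\S+)")

consumer = KafkaConsumer("hdfs_audit_log",                            # example topic
                         bootstrap_servers=["kafka01.example.com:9092"])

for record in consumer:
    line = record.value.decode("utf-8", errors="replace")
    m = AUDIT_RE.search(line)
    if not m:
        continue
    user, cmd, src = m.groups()
    if src.startswith(SENSITIVE_PREFIXES):
        # In a real deployment this would feed the alerting framework, not print.
        print(f"ALERT: {user} ran {cmd} on sensitive path {src}")
```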

Security is the lifeline of an Internet product. Through authentication, authorization and auditing we protect Hadoop in an all-round way and ensure the stable operation of the big data cluster. Cost is the ultimate yardstick of operation and maintenance efficiency: savings in labor and control over business resource usage are the most direct embodiment of operation and maintenance value.

2.4.4 Resource visualization and cost

Before we had resource visualization and cost accounting, we often could not tell whether a business's resource usage was reasonable or whether a capacity expansion was necessary. With business resources made visible and costed, we can measure what each business consumes, present clear consumption bills and task details, support elastic capacity planning for the business, and provide a solid basis for pushing businesses to optimize their computing tasks.


We also break consumption details down by user, so that charges are well-founded and later reconciliation and tracing are straightforward.

At the same time, you can see the detailed run status of tasks: for example, the task type, when it started, when it ended, how long it ran, and so on.
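One way to assemble such per-user consumption details is from the YARN ResourceManager REST API, which reports memory-seconds and vcore-seconds per application. A sketch, with the ResourceManager address and unit prices as placeholder assumptions:

```python
# Sketch of per-user compute cost accounting from the YARN ResourceManager REST
# API; the RM address and the unit prices are placeholders for illustration.
from collections import defaultdict
import requests

RM = "http://yarn-rm.example.com:8088/ws/v1/cluster"
MB_SECOND_PRICE = 0.000000001     # example price per MB-second of memory
VCORE_SECOND_PRICE = 0.000001     # example price per vcore-second

def per_user_cost(states="FINISHED"):
    resp = requests.get(f"{RM}/apps", params={"states": states}, timeout=30)
    apps = resp.json().get("apps") or {}
    cost = defaultdict(float)
    for app in apps.get("app", []):
        cost[app["user"]] += (app.get("memorySeconds", 0) * MB_SECOND_PRICE +
                              app.get("vcoreSeconds", 0) * VCORE_SECOND_PRICE)
    return dict(cost)

for user, amount in sorted(per_user_cost().items(), key=lambda kv: -kv[1]):
    print(f"{user}: {amount:.2f}")
```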

The above covers the methodology and technology we used to build the big data operation and maintenance platform.

III. Summary and outlook

3.1 Summary

- In terms of quality and efficiency, we explained the necessity of platformizing and automating big data operation and maintenance, and realized automated deployment and management of clusters, hosts and components;

- In terms of security, we solved the problems of who has permissions, what those permissions are, and what has been done, safeguarding the security of the platform;

- In terms of cost, spending is now planned and accounted for: capacity expansion has a basis and optimization has a target.

3.2 Outlook

The overall goal of big data operation and maintenance is to provide adequate service quality and user experience at the lowest possible cost. Network bandwidth, servers, and maintenance labor are the main sources of big data cost. We hope to use big data analytics to predict hardware failures and handle them automatically, bring manual machine management down to zero, maximize resource utilization, and reduce budgeted costs.

- Provide high-quality business operation and maintenance services. We hope users will be able to apply through the platform for automatically created and delivered clusters on which to run their business.

- At the same time, we hope the operation and maintenance team will make full use of big data analytics to improve prediction, discovery and automatic detection, forecast and allocate resources, scale clusters dynamically, and realize intelligent early warning and automatic repair, moving operation and maintenance toward intelligence.
