Abstract: Message queue Kafka is a distributed, high-throughput, highly scalable message queue service. It is widely used for log collection, monitoring data aggregation, streaming data processing, and online and offline analysis, and it is an indispensable part of the big data ecosystem. Alibaba Cloud offers it as a fully managed service: users do not need to deploy or operate anything, and the service is more professional, more reliable, and more secure. This article takes you inside message queue Kafka. The following content is based on the video and slides (PPT) of the talk.

Video of the talk

http://click.aliyun.com/m/1000012118/

PPT download link

http://click.aliyun.com/m/1000012119/

Message queue Kafka

Message queue Kafka is a distributed, high-throughput, highly scalable message queue service. Unlike Apache Kafka, which users run themselves, message queue Kafka is provided as a fully managed service. Apache Kafka is a distributed messaging system based on the publish-subscribe model. It is fast, scalable, and persistent. It is now an open source project of the Apache Software Foundation and is widely used in big data scenarios as part of the Hadoop ecosystem.

Message queue Kafka provides a fully managed service for Apache Kafka and resolves the long-standing pain points of the open source product. Users only need to focus on business development, with no deployment or O&M, at lower cost and with more flexibility and reliability. The defining feature of the product is that it is fully managed, which shows in two ways: compatibility and convenience. First, compatibility: message queue Kafka is 100% compatible with Apache Kafka. Users can connect seamlessly with open source clients in a variety of languages, and those who currently run open source Kafka only need to change the access point to use message queue Kafka. Message queue Kafka is also compatible with the entire Apache Kafka ecosystem. Second, convenience: message queue Kafka does not need to be deployed. After purchasing an instance and filling in the instance information, users can start using the service within 15 minutes.
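
To make the "only change the access point" claim concrete, here is a minimal Python sketch. The two endpoint strings and the `client_config` helper are hypothetical and exist only for illustration; `bootstrap.servers` is the standard Kafka client setting that holds the broker address list.

```python
# Hypothetical endpoints, for illustration only.
SELF_BUILT_ENDPOINT = "kafka-1.internal.example:9092"
MANAGED_ENDPOINT = "managed-instance.example:9092"


def client_config(bootstrap_servers: str) -> dict:
    """Build a Kafka client configuration; only the access point is a parameter."""
    return {
        "bootstrap.servers": bootstrap_servers,  # the access point
        "client.id": "order-service",            # application settings stay the same
        "acks": "all",
    }


before = client_config(SELF_BUILT_ENDPOINT)
after = client_config(MANAGED_ENDPOINT)

# Every key whose value differs between the two configurations.
changed = sorted(k for k in before if before[k] != after[k])
print(changed)
```

In this sketch the only setting that differs between the two configurations is the broker address; everything else in the application is left alone.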

That concludes the overview of message queue Kafka. The rest of this article is divided into three parts: pain points, advantages, and scenarios. We first share the user pain points that Alibaba Cloud has collected for message queue services, then the advantages of message queue Kafka in addressing those pain points, and finally the scenarios where message queue Kafka applies.

Pain points: the trouble with self-built Kafka

Apache Kafka is difficult to operate and maintain

From a user's perspective, Kafka is a very simple product that provides a publish-subscribe model. Yet it is very difficult to maintain, because operators must pay attention not only to the roles inside the cluster, such as brokers and the controller, but also to the products Kafka depends on, such as ZooKeeper. Operating these components involves parameter tuning as well as scaling capacity out and in as the business grows. Disk and network conditions also need attention. In short, the cost and difficulty of operating a self-built Kafka cluster are very high. Here are some specific examples.

Data corruption

Some users have reported data inconsistency when running their own Kafka clusters. A Kafka cluster has two roles: controller and broker. When the controller fails, one of the brokers is automatically elected as the new controller. However, if the failure was caused by a network anomaly, the original controller may come back to life. Once it does, the cluster has two controllers and is in a “split brain” state. Because the controller's main responsibility is to manage the state of partitions and replicas across the whole cluster, split brain leads to inconsistent data, which is unacceptable for users.
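
One common guard against this kind of split brain is epoch fencing, which Apache Kafka applies through its controller epoch: brokers ignore requests that carry an epoch older than the highest one they have seen. The sketch below is a simplified in-memory model of that idea, not Kafka's actual code; the `Broker` class and its method are hypothetical names.

```python
class Broker:
    """Tracks the highest controller epoch seen and rejects older ones."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.partition_leaders = {}  # partition name -> leader broker id

    def handle_leader_change(self, controller_epoch: int, partition: str, leader: int) -> bool:
        if controller_epoch < self.highest_epoch:
            return False  # request from a stale ("resurrected") controller
        self.highest_epoch = controller_epoch
        self.partition_leaders[partition] = leader
        return True


broker = Broker()
broker.handle_leader_change(1, "orders-0", leader=1)  # original controller, epoch 1
broker.handle_leader_change(2, "orders-0", leader=2)  # newly elected controller, epoch 2

# The original controller comes back and still believes it is in charge.
accepted = broker.handle_leader_change(1, "orders-0", leader=3)
print(accepted, broker.partition_leaders)
```

The stale request is rejected, so the partition state set by the newer controller is preserved.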

ZooKeeper unavailability

The entire Kafka cluster depends heavily on ZooKeeper (ZK), and operating ZooKeeper is a large and complex task. For example, O&M personnel who are not familiar with ZooKeeper may not know how to deploy it, or how to keep it available within a single data center or across multiple data centers. These are decisions that O&M personnel have to think through and weigh. ZooKeeper stores important Kafka data. If ZooKeeper becomes unavailable, the disaster recovery (DR) backup groups and the data stored in the entire cluster are affected.

Bandwidth concerns

For users, a self-built Kafka cluster involves not only external dependencies but also a common problem inside the cluster: bandwidth. Users often face a trade-off over the number of replicas. To improve reliability and disaster recovery capability, a cluster usually needs three replicas. More replicas mean more data replication between machines, which consumes more network bandwidth. Also, because brokers are peers, data needs to be synchronized from the Controller. The Controller therefore has to carry out its own tasks while also serving external requests. By design, these two kinds of work have no priority over each other, so network bandwidth becomes congested when the cluster is large. Alibaba Cloud message queue Kafka already solves these problems for users. Users do not need to make the replica trade-off themselves: Alibaba Cloud stores three replicas of the data and provides service availability of up to 99.9%.
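
The bandwidth cost of replication can be estimated with simple arithmetic: every byte a producer writes must be copied to each additional replica. The sketch below assumes a steady producer ingress rate and ignores compression and consumer traffic; the function name and the 100 MB/s figure are illustrative assumptions.

```python
def replication_traffic_mb_s(ingress_mb_s: float, replication_factor: int) -> float:
    """Inter-broker traffic needed to copy each write to the other replicas."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return ingress_mb_s * (replication_factor - 1)


# Hypothetical producer ingress of 100 MB/s at different replica counts.
for rf in (1, 2, 3):
    print(rf, replication_traffic_mb_s(100.0, rf))
```

Under these assumptions, three replicas add twice the producer ingress as extra traffic between brokers.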

Disk operations

Self-built Kafka has other problems too, such as disk operation and maintenance. In newer Kafka versions, consumer offsets are no longer stored only in ZooKeeper; they can be stored in the Kafka cluster itself as an ordinary topic. The retention policy for consumer offsets affects disk usage, so a wrongly set parameter can drive disk usage too high. Users also often find that a cluster with 100 TB of disk becomes unwritable after only a few dozen terabytes have been used. Producers have two common ways to partition data: hashing, which can cause hash skew, and round-robin, which can cause uneven disk usage. As a result, users may have bought plenty of disk, with much of it still free, and yet producers cannot write. Message queue Kafka takes care of these disk O&M details for users.
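
Hash skew is easy to reproduce in a small simulation. The sketch below uses a hypothetical workload in which one hot key accounts for 70 of 100 messages, and it uses `zlib.crc32` as a deterministic stand-in for the partitioner hash (Kafka's default partitioner uses murmur2). It counts messages, not bytes, so it does not model the uneven disk usage that round-robin can show when message sizes vary.

```python
import zlib
from collections import Counter
from itertools import cycle

NUM_PARTITIONS = 4


def hash_partition(key: str) -> int:
    # crc32 is a deterministic stand-in; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Hypothetical workload: one hot key produces 70 of 100 messages.
keys = ["hot-user"] * 70 + [f"user-{i}" for i in range(30)]

# Key hashing: every message with the same key lands on the same partition.
by_hash = Counter(hash_partition(k) for k in keys)

# Round-robin: messages are spread across partitions in turn, ignoring the key.
rr = cycle(range(NUM_PARTITIONS))
by_round_robin = Counter(next(rr) for _ in keys)

print("hash:       ", sorted(by_hash.items()))
print("round-robin:", sorted(by_round_robin.items()))
```

With key hashing, the partition that owns the hot key receives at least 70 messages, while round-robin places exactly 25 messages on each partition.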

Data loss

For users, the most distressing problem is data loss. Kafka gives users three strategies for acknowledging writes. The first is one-way sending, with no acknowledgment. The second acknowledges a message once a single replica has written it. The third acknowledges only after all replicas have written it. Choosing among the three is a trade-off between availability and performance. Under high network load or heavy disk writes, a disk write may fail. In addition, Kafka first writes data to the PageCache and flushes it to disk periodically, rather than flushing to disk for every successfully sent message. If power is lost or the machine fails, data still in memory is lost. Data can also be lost when a single batch exceeds the size limit. With message queue Kafka, users do not need to weigh these choices: once message queue Kafka reports that a message was sent successfully, the data is persisted and will not be lost. Thanks to these optimizations, the data reliability of message queue Kafka reaches eight nines (99.999999%).
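
In Apache Kafka these three strategies correspond to the producer setting `acks`, with the values `0`, `1`, and `all`. The toy model below illustrates why the choice matters when a leader crashes right after the producer believes a send succeeded. It is a deliberate simplification: it assumes at least one follower is in the in-sync replica set, and the function name is hypothetical.

```python
def survives_leader_crash(acks: str, followers_caught_up: bool) -> bool:
    """Toy model: the leader crashes right after the producer considers the send done.

    Returns True if the message still exists on a surviving follower.
    Assumes at least one follower is in the in-sync replica set.
    """
    if acks == "all":
        # The leader replies only after the in-sync followers have the message.
        return True
    if acks in ("0", "1"):
        # The producer does not wait for followers, so survival depends on
        # whether replication happened to finish before the crash.
        return followers_caught_up
    raise ValueError(f"unknown acks value: {acks}")


for acks in ("0", "1", "all"):
    for caught_up in (False, True):
        print(acks, caught_up, survives_leader_crash(acks, caught_up))
```

Only the `all` setting keeps the message in every case of this model, at the price of waiting for the followers on each send.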

Advantages: how message queue Kafka solves the pain points

The previous section covered the pain points of self-built Kafka. This section describes the advantages of Alibaba Cloud message queue Kafka and how it solves those pain points.

Out of the box

Alibaba Cloud message queue Kafka works out of the box and is 100% compatible with Apache Kafka. Users already running Apache Kafka only need to change the access point to connect seamlessly. Message queue Kafka supports the same variety of clients that the open source version supports, and it is compatible with the entire Apache Kafka ecosystem. There is nothing for users to deploy: after purchase, they fill in the instance information and can use the service within 15 minutes.

Fully managed

After users purchase message queue Kafka, Alibaba Cloud maintains the entire cluster as a managed service, so the maintenance cost for users is zero. How is this achieved? Alibaba Cloud message queue Kafka provides a health inspection and self-recovery system that works at second-level granularity, and Alibaba Cloud has a professional R&D and O&M team that keeps the health inspection and automatic maintenance systems running properly. Health inspection is the cornerstone of the managed service. So what does the health inspection cover? It can be divided into three levels: the machine level, the business level, and the performance level. At the machine level, Alibaba Cloud checks system-level O&M items, for example whether the network is abnormal and whether a disk has failed. At the business level, the inspection checks that production and consumption are working normally and that the Kafka service as a whole is operating normally. At the performance level, the system tracks indicators such as disk I/O request speed, predicts load from them, and raises alarms or takes automatic action. Together, these inspections show the health of the cluster.

High reliability and availability

Message queue Kafka promises 99.999999% data reliability and 99.9% service availability. It delivers on these promises not only through health inspection but also through many optimizations. Here are two that Alibaba Cloud has made. First, at the storage layer, storage is separated from computation. Second, Alibaba Cloud message queue Kafka provides automatic disaster recovery, which covers a very wide range of failures. One simple example: when a broker fails, a standby broker is started immediately and the traffic of the failed broker is automatically redistributed to the live brokers, so the business does not notice the failure at all. In this way, Alibaba Cloud message queue Kafka achieves high data reliability and service availability, and users do not need to worry about either.
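
An availability percentage is easier to reason about when converted into allowed downtime. The following sketch does that conversion for the 99.9% figure above; it assumes a 365-day year and that the target is measured over a full year, and the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours_per_year(availability_percent: float) -> float:
    """Downtime per year that an availability target still allows."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


print(round(max_downtime_hours_per_year(99.9), 2))
```

Under these assumptions, 99.9% availability allows about 8.76 hours of downtime per year.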

Business monitoring & reporting

At the system level, Alibaba Cloud operates and maintains the entire cluster for users, ensuring availability and reliability. On top of this, Alibaba Cloud provides business teams with a business monitoring and reporting system. Monitoring covers three dimensions. The first dimension is the instance. Users can think of an instance as a self-built cluster, and it really is one: each user gets a real small cluster when buying an instance. At the instance level, users care about exceptions such as the disk usage level and whether producer or consumer TPS exceeds a threshold. The second dimension is the topic. Alibaba Cloud provides queries on message accumulation, so users can see at a glance whether producers are producing messages normally. This is a pain point with open source Kafka, where, without such O&M tools, it is difficult for users to monitor producers directly. The last dimension, and the one users rely on most, is message accumulation per consumer group and topic. Accumulation is available today, and message delay will be added in the future. These three dimensions are what message queue Kafka monitoring currently provides, and more will be added based on user feedback.
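
Message accumulation for a consumer group is commonly computed as the gap between the latest offset written to each partition and the offset the group has committed. The sketch below shows that calculation on hypothetical offsets; treating a partition with no committed offset as starting from 0 is an assumption made here for simplicity.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: messages written but not yet consumed by the group."""
    return {
        # Assumption: a partition with no committed offset is read from offset 0.
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for one topic with three partitions.
log_end = {0: 1200, 1: 980, 2: 1500}
committed = {0: 1200, 1: 900, 2: 1000}

lag = consumer_lag(log_end, committed)
print(lag, "total:", sum(lag.values()))
```

A lag that keeps growing on one partition suggests that the consumer for that partition cannot keep up with the producers.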

Data security

Message queue Kafka provides a series of guarantees to keep data secure. The first is the Virtual Private Cloud (VPC). A VPC is an isolated network environment built on Alibaba Cloud, and VPCs are logically completely isolated from one another. A VPC is the user's own private network on the cloud. In theory, cloud servers deployed in a VPC are secure. Cloud servers of different users are deployed in different private networks and are isolated by tunnel IDs. Some users have further requirements. For example, government cloud users need encryption between components in addition to a VPC, and Alibaba Cloud supports such scenarios as well. Alibaba Cloud message queue Kafka also supports blacklists and authentication, so data security is ensured through multiple mechanisms.

Advantages of message queue Kafka

To summarize the advantages described above: first, message queue Kafka is fully compatible with Apache Kafka. The products across the Apache Kafka ecosystem, such as Flume upstream and Spark, Storm, Flink, and ES downstream, all work with message queue Kafka.

Second, message queue Kafka is a fully managed service. Any problem in the cluster, whether it involves disks, the network, Kafka itself, or the products Kafka depends on, is resolved by a team of professionals. What users see is a product with 99.9% availability and very stable behavior, while the underlying technical details are handled by the Alibaba Cloud team.

Third, high availability and high reliability are closely tied to the fully managed model. Data reliability is the most important property of any such product, because data loss can break the whole business logic and cause major failures. Alibaba Cloud promises that when users send messages with message queue Kafka, as long as the send is reported as successful, data reliability reaches eight nines. This is one more thing users no longer need to worry about.

Fourth, Alibaba Cloud message queue Kafka provides practical business reports and a flexible, comprehensive business monitoring system. Monitoring and reports are organized around the user's business dimensions, including the disk usage level of the whole cluster and the topic and consumer group indicators that users care about. All of this is available in the message queue Kafka console, where users can log in and see how their business is running.

Finally, data in message queue Kafka is very secure: VPC network isolation, authentication and encryption, and blacklists and whitelists together protect user data. Message queue Kafka has one more major advantage: every instance is exclusive to the user who bought it, so users do not affect one another and the system stays stable.

Scenarios: where message queue Kafka applies

With the advantages covered, this section describes the scenarios where message queue Kafka applies. Message queue Kafka and open source Apache Kafka suit the same scenarios. The difference is that message queue Kafka offers higher reliability and availability and requires no O&M from users.

Build a log analysis platform

Platforms such as Taobao and Tmall, like many other companies, generate large volumes of logs every day. O&M teams and decision makers need to analyze and aggregate all of this log data. Kafka performs very efficiently, and its characteristics make it well suited to serve as a “log collection center”: log collection is transparent to the business, Kafka is compatible with upstream systems, and messages can be encrypted through configuration alone. Sending log data to the Kafka cluster is almost non-invasive to the business. Downstream, Kafka connects directly to offline warehouse storage such as Hadoop and ODPS, and to Storm and Spark for real-time online analysis. Kafka thus lets users focus on the business logic of the pipeline, without much development work to implement statistics, analysis, and reporting.

Site activity tracking scenarios

Beyond data analysis and reporting, Kafka also supports website activity tracking. Kafka collects real-time data on website activity such as browsing, searching, and other user behavior. Message queue Kafka can separate the different data models of a business by topic. For example, user activity can be split into registration, login, and purchase topics, and each can be connected downstream to a different processing system for the scenario being tracked, such as real-time processing, real-time monitoring, or offline processing. Kafka is very convenient and easy to use in this scenario.

Data creates value as it flows

In the first two examples, message queue Kafka serves as the data ingestion stream in the overall solution. Kafka can do more than ingest data: it can also support stream processing, in fields such as stock market trend analysis, weather data monitoring and control, and website user behavior analysis. In these fields data is generated quickly, must be handled in real time, and arrives in large volumes, so it is difficult to collect and store all of it before processing, and traditional data processing architectures cannot meet the need. Stream processing engines such as Kafka Streams, Storm, Samza, and Spark can analyze the data according to business requirements and then save the results or distribute them to the components that need them.

Multi-way forwarding

We often encounter scenarios where different business dimensions require different computing methods. For example, an accounting system may require real-time stream processing, while statistical analysis can use batch computing. Kafka supports this through multi-way forwarding: a piece of data is produced once upstream, and multiple downstream nodes can each obtain it and process it in their own way.
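
In Kafka, this kind of forwarding relies on consumer groups: each group keeps its own read position, so every group sees the full stream. The sketch below is an in-memory model of that behavior for a single partition, not a real Kafka client; the `Topic` class and the group names are hypothetical.

```python
class Topic:
    """In-memory stand-in for a single-partition topic."""

    def __init__(self) -> None:
        self.log = []
        self.group_offsets = {}  # consumer group name -> next offset to read

    def produce(self, message: str) -> None:
        self.log.append(message)

    def poll(self, group: str) -> list:
        """Return the messages this group has not read yet and advance its offset."""
        start = self.group_offsets.get(group, 0)
        self.group_offsets[group] = len(self.log)
        return self.log[start:]


topic = Topic()
for message in ("order-1", "order-2", "order-3"):
    topic.produce(message)  # produced once upstream

billing = topic.poll("billing-stream")  # real-time processing group
stats = topic.poll("stats-batch")       # batch analysis group
print(billing == stats, topic.poll("billing-stream"))
```

Both groups receive all three messages, and a second poll by the same group returns nothing new because that group's offset has already advanced.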

Commercial release of message queue Kafka

Message queue Kafka was officially released for commercial use on July 1, 2018. It is currently available commercially in North China 1, North China 2, East China 1, East China 2, and South China 1. VPC deployment is supported now, and a non-VPC version is expected to be released in September. The non-VPC version mainly addresses access for users on the public network and the remaining issues for classic network users. In this early stage, message queue Kafka is focusing on stability and cost optimization, and resource alerts are scheduled to go live in August.

For users, stability always comes first. Finally, we hope users will remember the following from this talk: Alibaba Cloud message queue Kafka is very easy to use, users can switch from open source Kafka at zero cost, and data reliability and service availability are very high, so users no longer have to worry about Kafka problems bringing down the whole business. Its positioning relative to MQ, another Alibaba Cloud product, is also clear: use Kafka for big data scenarios and MQ for messaging between business systems. Kafka has very strong integration with its ecosystem, while MQ offers enhanced capabilities such as transactional messages, scheduled messages, and ordered messages. This month is the first month of the Alibaba Cloud message queue Kafka promotion: monthly subscriptions get a 15% discount and annual subscriptions a 20% discount.