Network-wide traffic collection and distribution scheme

Most large enterprises today run IT infrastructure across multiple data centers and hybrid clouds. From a network perspective, as shown in the figure below, their own data centers are interconnected over a private network and divided into business zones, and there may also be multiple branch networks. To keep resources elastic and bring services online quickly, public cloud resources are also widely used, often from several cloud service providers. Enterprises need a comprehensive, clear picture of this network for operations, maintenance, troubleshooting, operations management, business performance analysis, and more.



The goal of this scheme is to establish a unified, efficient network traffic collection and processing platform for the enterprise hybrid cloud: a single traffic collection abstraction layer across all kinds of resource pools, supporting both IPv4 and IPv6 environments, and providing filtering, deduplication, compression, truncation, and other processing functions, so that data can be supplied to traffic consumers such as the Network Operations Center (NOC), the Security Operations Center (SOC), and big data analytics platforms.

Network-wide traffic collection and processing can be planned by region and by resource pool. The plan below is described from the data center side, the public cloud side, and the control and management side.

Data center side

On the data center side, take the commonly used network partitions as examples: the Internet business zone, the external business zone, the core business zone, and so on. For deployment of the platform, a data center can be defined as a Region, which may contain multiple availability zones (AZs).

Network traffic within a region includes both the physical network of each availability zone and the traffic inside the resource pools, as shown in the figure of a typical region.





The physical network in scope includes not only the internal network of each availability zone but also all kinds of links, such as dedicated lines and Internet Service Provider (ISP) links. Traffic or flow information can be obtained by port mirroring, optical splitting (taps), sFlow, NetFlow/IPFIX, and so on. In a hybrid cloud environment, the real challenge lies in the resource pools: the network boundary there is made up mainly of virtual switches of many kinds, which are large in number and fluctuate considerably, and new technologies keep appearing.

The DeepFlow® TRIDENT collectors, available in forms suited to each environment, provide the basic capture capability for a network-wide traffic collection solution.

Network traffic collection in the resource pool

Different resource pools are equipped with different forms of collector to provide the best traffic capture capability: a VMware ESXi collector, a KVM collector, a KVM-DPDK collector, a Hyper-V collector, a container-on-VM collector, and a container-on-host collector. This avoids reconfiguring the virtual switches inside the pools. Each collector runs as an independent process, minimizing the impact on the production network and avoiding the risk of interfering with production configuration.
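As a minimal illustration of this mapping, the sketch below pairs resource-pool types with collector forms; the identifiers are hypothetical, not the product's actual configuration keys.

```python
# Minimal sketch: choosing a collector form per resource-pool type.
# Names are illustrative placeholders, not DeepFlow's real identifiers.
COLLECTOR_BY_POOL = {
    "vmware-esxi": "esxi-collector",
    "kvm": "kvm-collector",
    "kvm-dpdk": "kvm-dpdk-collector",
    "hyper-v": "hyperv-collector",
    "container-on-vm": "container-on-vm-collector",
    "container-on-host": "container-on-host-collector",
}

def pick_collector(pool_type: str) -> str:
    """Return the collector form suited to a resource pool."""
    try:
        return COLLECTOR_BY_POOL[pool_type]
    except KeyError:
        raise ValueError(f"no collector form registered for pool type {pool_type!r}")

print(pick_collector("kvm"))  # -> kvm-collector
```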

For bare-metal resource pools, in-pool traffic can be obtained via port mirroring on the leaf and access switches and aggregated to a TAP device, after which a dedicated server-type collector performs the packet processing; alternatively, a collector can be installed directly on each bare-metal machine whose traffic needs to be collected.

Physical network traffic collection

In the physical network, traffic is acquired mainly through port mirroring and optical splitting, and the collectors mainly filter, process, and distribute the captured packets. Typical collection points are the ISP Internet lines of the main business zones, communications dedicated lines, the egress lines between data center regions, and the links before and after firewalls and load balancers.

Physical network management also requires collecting and displaying information such as forwarding paths, port statistics, and telemetry data of the physical switching fabric; this is usually provided by the equipment vendor's own monitoring scheme. The DeepFlow® collector can also ingest standard data exported by network devices in the fabric, such as sFlow and NetFlow/IPFIX.
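As an illustration of ingesting such standard exports, the sketch below is a minimal NetFlow v5 listener; the field layout follows the published v5 header format, while the port choice and handling logic are illustrative assumptions, not DeepFlow's implementation.

```python
import socket
import struct

# NetFlow v5 header: version, count, SysUptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes total).
NETFLOW_V5_HEADER = struct.Struct("!HHIIIIBBH")

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 2055))  # 2055 is a common, not mandatory, export port

while True:
    datagram, exporter = sock.recvfrom(65535)
    if len(datagram) < NETFLOW_V5_HEADER.size:
        continue  # ignore truncated datagrams
    (version, count, sys_uptime, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, sampling) = \
        NETFLOW_V5_HEADER.unpack_from(datagram)
    if version != 5:
        continue  # this sketch only handles NetFlow v5
    print(f"{exporter[0]}: {count} flow records, sequence {flow_sequence}")
```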

Support in DPDK environments

In operator CT (Communications Technology) networks, NFV solutions are already in use. Their virtual networks are implemented with virtual switches such as OVS (Open vSwitch), which use the Data Plane Development Kit (DPDK) to improve packet processing performance. In a CT environment, the traffic between virtual network functions (VNF: Virtual Network Function), especially control signaling traffic, likewise becomes hard to capture once the network is virtualized.

In an enterprise environment, if a resource pool uses DPDK, the KVM-DPDK collector can be selected to collect the traffic in that pool.

Multi-region support

Most companies considering a unified traffic collection platform have IT resources in multiple data centers and multiple branches. As shown in the figure below, the traffic collection requirements of the various data centers and resource pools are met by the corresponding collectors.

Public cloud side

The public cloud provides tenants with VPC networks, and the Workload collector is deployed as user-space software on virtual machines, containers, and bare-metal instances. It supports Linux, Windows, and other mainstream operating systems, enabling traffic collection from all kinds of resources in the VPC.



On the public cloud side, the underlay network is maintained by the cloud provider, so the collector is installed on the workload operating system as a user process to capture network traffic. In environments where containers run on virtual machines, the container collector can additionally capture the traffic of container Pods.

Since collectors are installed on workload operating systems and are numerous, they can be preinstalled in machine images.

Control and management side

The previous two sections described how traffic is collected and acquired from the various resource pools. Because the collectors are numerous, the policy dimensions many, and churn high, the management and control of collectors is the focus when evaluating a scheme's capability.



Facing multi-data-center, multi-cloud, heterogeneous hybrid cloud infrastructure, a unified traffic management and scheduling platform must solve the problems of scale and manageability, so the design of the control plane is the core. The controller is the control hub that manages collectors and issues policies. Controllers are divided into master, standby, and slave controllers, selected according to deployment requirements.

Master controller: the control hub of the entire DeepFlow® platform and its interface for external interaction and services. There is only one master controller in a deployment, and the region where it resides is called the master region.

Standby controller: functionally identical to the master controller. When the master controller goes down or cannot provide service, the standby automatically takes over as master; without a standby controller, the DeepFlow® controller cluster has no high availability. There is only one standby controller in the entire platform, and it must be in the same region as the master controller, with which it shares a virtual IP address for external service.

Slave controller: responsible for controlling the collectors and data nodes within its region or availability zone, and for synchronizing the master controller's policies and cloud platform resource information to all of them. Apart from the region hosting the master and standby controllers, at least one slave controller is deployed in each region; multiple slave controllers in the same region can provide load balancing and high availability.

In a multi-site deployment, the master region is designated first, and the master controller lives in that region. When the master controller's high availability function is enabled, multiple controllers should be deployed in the master region; they keep state synchronized through heartbeats and initiate a master/standby election promptly when needed. Once a master controller is elected, it provides the control entry point for the whole traffic management platform. Controllers in regions other than the master region are slave controllers and do not participate in the master election.
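The sketch below illustrates the heartbeat-driven failover idea in simplified form; the timeout values and the take_over_vip() helper are assumptions for illustration, not DeepFlow's actual election protocol.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between master heartbeats (assumed)
FAILOVER_TIMEOUT = 5.0     # silence after which the standby promotes itself (assumed)

class StandbyController:
    """Watches the master's heartbeats; claims the shared virtual IP on failure."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.role = "standby"

    def on_heartbeat(self):
        """Called whenever a heartbeat arrives from the master."""
        self.last_heartbeat = time.monotonic()

    def tick(self):
        """Periodic check; promote if the master has gone silent."""
        if self.role == "standby" and \
                time.monotonic() - self.last_heartbeat > FAILOVER_TIMEOUT:
            self.role = "master"
            self.take_over_vip()

    def take_over_vip(self):
        # In a real deployment this would move the shared virtual IP
        # (e.g. via gratuitous ARP) so clients keep using one address.
        print("standby promoted to master, claiming virtual IP")
```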

A region can be divided into multiple availability zones (AZs). Usually the availability zone is the unit of control: each type of collector within an AZ is controlled by a single controller, which issues it collection, distribution, and preprocessing policies. Control communication across regions, mainly management and policy traffic, can be carried over the dedicated-line network.

Branches are usually more numerous than data centers, and their traffic consists mainly of service requests. With no servers on the branch side, the traffic data needed there mainly serves the network-wide picture and end-to-end analysis of business network performance. There is no need to deploy controllers in branches; their collectors can be placed under the management of a controller in a nearby region as circumstances dictate. In a public cloud environment, the controller is deployed in a virtual machine and manages the collectors in its scope.

The controller fully controls the state of the collectors, and all collector types share the same state machine, as shown in the figure below:



Each collector may be in one of several states, such as self-check, running, stopped, exception, and protection. The protection state ensures that the platform can cap the collector's use of CPU and memory. For example, if the collector's resource limit is configured as 1 vCPU and 1 GB of memory and the running process comes under pressure, the collector switches from the running state to the protection state, discarding captured packets and processing work so that the production environment is never affected; once resources are reallocated or the processing pressure drops, it switches back to the running state.
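A minimal sketch of the running/protection transition is shown below, using the 1 vCPU / 1 GB limits from the example above; how actual usage is sampled is platform-specific and left abstract here.

```python
from enum import Enum

class State(Enum):
    SELF_CHECK = "self-check"
    RUNNING = "running"
    PROTECTED = "protected"
    STOPPED = "stopped"
    EXCEPTION = "exception"

CPU_LIMIT_PCT = 100.0   # one vCPU, as in the example above
MEM_LIMIT_MB = 1024.0   # 1 GB, as in the example above

def next_state(state: State, cpu_pct: float, mem_mb: float) -> State:
    """Running <-> protected transitions driven by resource pressure."""
    over_limit = cpu_pct > CPU_LIMIT_PCT or mem_mb > MEM_LIMIT_MB
    if state is State.RUNNING and over_limit:
        return State.PROTECTED   # drop capture/processing to protect production
    if state is State.PROTECTED and not over_limit:
        return State.RUNNING     # pressure relieved, resume capture
    return state
```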

In addition, by integrating with virtualization resource pool controllers, the Configuration Management Database (CMDB), public cloud open APIs, and so on, the controller implements collection and distribution policies at multiple granularities, making them more flexible and closer to business applications in cloud and container environments.

A single controller can support up to 2,000 collectors, which is typically the number of collectors in one availability zone. The master and standby controllers work together with the slave controllers, with up to 50 controllers in total and the election mechanism running in the master region. The overall scheme can therefore reach a scale of 100,000 collectors (50 controllers × 2,000 collectors), meeting the traffic collection needs of a large enterprise's private IT, public cloud, and container environments.

Distributed processing of monitored traffic

The collector is no longer a simple pipe for network traffic, but a computing unit that processes the traffic it captures locally. Together, the many collectors and controllers form a distributed traffic processing system on the same scale as the cloud network itself.

Analyses such as network-wide access relationships, global load balancing, application service quality, and application security policy all require the captured traffic to be processed and stored before it can be analyzed and presented, and post-processing this mass of traffic centrally would demand a large amount of additional computing resources. In this scenario, the collectors apply patented pre-computation algorithms and process the traffic on demand, distributed across the resource pools, which effectively reduces the pressure of delivering data to the monitoring network and the back-end analysis tools.

The traffic collection and processing abstraction layer is realized by the various collector types. It mainly abstracts packet processing capabilities, including filtering, deduplication, packet truncation, compression, feature marking, and other functions.

Filtering

Filtering is the basis of efficient traffic collection and of accurately realizing the value of network traffic data: processing and storing all packets indiscriminately is not a workable scheme. Beyond filtering itself, the dimensions available as conditions are the next important factor. Filtering conditions based only on the network 5-tuple are far from sufficient for collection, distribution, and processing policies; in a pooled, multi-tenant, containerized environment they are simply outdated. Richer filtering dimensions such as business, host, service, and Pod need to be added.
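The sketch below shows what such a multi-dimensional filter rule could look like: the classic 5-tuple extended with business, host, service, and Pod dimensions. The field names are illustrative, not the platform's actual policy schema.

```python
from dataclasses import dataclass

@dataclass
class FilterRule:
    """A match-anything-unless-set rule; None means 'any' for that dimension."""
    src_ip: str | None = None
    dst_ip: str | None = None
    src_port: int | None = None
    dst_port: int | None = None
    protocol: str | None = None
    business: str | None = None   # beyond the 5-tuple: resource-pool dimensions
    host: str | None = None
    service: str | None = None
    pod: str | None = None

    def matches(self, flow: dict) -> bool:
        for dim, wanted in vars(self).items():
            if wanted is not None and flow.get(dim) != wanted:
                return False
        return True

rule = FilterRule(dst_port=443, pod="payment-gateway-0")
print(rule.matches({"dst_port": 443, "pod": "payment-gateway-0",
                    "src_ip": "10.0.0.8"}))  # True
```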

Deduplication, truncation, flow logging, compression, marking

Deduplication ensures the accuracy of the collected traffic data. Collectors exist at both ends of a network flow and may be distributed across different resource pools and regions, so the same packet may be captured more than once. At the same time, after capture, statistics can be compiled and the source and destination ends distinguished, supporting accurate analysis and visualization.

Truncation answers the data consumer's needs once the packets have been obtained, and it is also one of the foundations of compression: the collector can truncate a packet after its headers plus a specified offset length.
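A minimal sketch of header-plus-offset truncation for an IPv4/TCP packet in an Ethernet frame is shown below; real collectors must also handle VLAN tags, IPv6, UDP, and so on, which are omitted here.

```python
ETH_LEN = 14  # untagged Ethernet header

def truncate(packet: bytes, offset: int) -> bytes:
    """Keep the Ethernet+IPv4+TCP headers plus `offset` payload bytes."""
    if len(packet) < ETH_LEN + 20:
        return packet                              # too short to be IPv4
    ihl = (packet[ETH_LEN] & 0x0F) * 4             # IPv4 header length in bytes
    if packet[ETH_LEN + 9] != 6:                   # IP protocol 6 = TCP
        return packet                              # pass non-TCP through unchanged
    tcp_off = ETH_LEN + ihl
    if len(packet) < tcp_off + 13:
        return packet
    data_off = (packet[tcp_off + 12] >> 4) * 4     # TCP header length in bytes
    return packet[:tcp_off + data_off + offset]
```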

The flow log capability extracts network metadata from traffic packets and is the basis for network-wide mapping and backtracking queries. Around 80 types of metadata are supported: besides the basic source and destination MAC addresses, IP addresses, and ports, richer TCP and performance log information can be obtained on demand.
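For illustration, the sketch below models a small subset of such a flow-log record; the actual platform exports on the order of 80 fields, and these names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class FlowLog:
    """Illustrative subset of a flow-log record (field names assumed)."""
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    tcp_flags: int = 0            # accumulated TCP flags, if applicable
    bytes_sent: int = 0           # per-direction counters for performance analysis
    bytes_received: int = 0
    rtt_ms: float | None = None   # example of an on-demand performance field
```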

Compression ensures efficient use of transmission bandwidth and storage once the traffic data has been obtained. If the back-end consumer needs packet headers or flow log data, the compression ratio is about 100:1; if only telemetry statistics are needed, it is about 10,000:1. For example, 1 Gbps of monitored branch traffic reduces to roughly 10 Mbps of flow log data at 100:1. This also means that branches that only need flow log information do not require dedicated lines; a VPN over the Internet is sufficient.

The feature marking capability stamps a characteristic value into a reserved field of the encapsulating tunnel header during packet distribution, so that it can be recognized at decapsulation. This lets back-end analysis tools and operations platforms identify specific packets, which is useful for network diagnosis, locating collection points, and similar scenarios.
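The sketch below stamps a mark into the trailing reserved byte of a VXLAN header (RFC 7348: 8-bit flags, 24-bit reserved, 24-bit VNI, 8-bit reserved); using a reserved byte as a tag is the technique described here, while the specific encoding is an assumption.

```python
import struct

def vxlan_header(vni: int, mark: int) -> bytes:
    """Build an 8-byte VXLAN header with a feature mark in the last reserved byte."""
    flags_word = 0x08 << 24                  # I flag set, first reserved field zero
    vni_word = ((vni & 0xFFFFFF) << 8) | (mark & 0xFF)
    return struct.pack("!II", flags_word, vni_word)

hdr = vxlan_header(vni=100, mark=0x2A)       # 0x2A could identify a collection point
```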

Packet distribution

The packet distribution function addresses the requirements of full-packet analysis and security assurance. Its core is to preserve the packets exactly as captured, including their contents and order, while being able to distribute a single packet to multiple destinations.

Packet distribution is implemented over Layer 3 tunnels (ERSPAN, VXLAN). The controller issues the distribution policies centrally; the collector involved encapsulates the packets directly and sends them to the destination ends, with multi-destination sending supported.
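A minimal sketch of collector-side distribution is shown below: the original packet is VXLAN-encapsulated unchanged and sent to several destinations. It reuses the hypothetical vxlan_header() helper from the marking sketch above; 4789 is the IANA-assigned VXLAN port.

```python
import socket

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP port

def distribute(packet: bytes, vni: int, mark: int, destinations: list[str]) -> None:
    """Encapsulate one captured packet and send it to every destination."""
    frame = vxlan_header(vni, mark) + packet   # original packet bytes untouched
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for dst in destinations:
            sock.sendto(frame, (dst, VXLAN_PORT))
    finally:
        sock.close()

# Example: one packet, two consumers (addresses are illustrative).
distribute(b"\x00" * 60, vni=100, mark=0x2A,
           destinations=["198.51.100.10", "198.51.100.11"])
```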

In a hybrid cloud packet distribution scheme, the network plane used for distribution needs consideration. If the distributed traffic volume is large, an independent monitoring network plane can be reserved; if only a small number of core services are targeted, the existing physical network can be reused.

At the distribution destination, a decapsulation solution for the tunnel must be considered. In general, if the destination is a large platform such as the NOC, SOC, or a big data platform, the tunnel can be stripped by the VXLAN offload function of a physical switch. If a traditional analysis tool cannot decapsulate the tunnel itself, a dedicated server collector can run in tunnel-decapsulation mode in front of the tool, unpacking the packets before forwarding them. And of course, if the analysis tool can decapsulate VXLAN itself, it can receive the tunneled packets directly.

The overall distribution scheme is shown in the figure below. In a traditional physical network environment, NPB devices and schemes fulfill packet distribution requirements well; in a hybrid cloud environment, however, resource pools are numerous and heterogeneous. This scheme realizes on-demand distribution of monitored traffic with the distribution work spread across the collection points in a distributed architecture, avoiding single-point performance bottlenecks and adapting to logical networks that span multiple resource pools.

Data services

For data consumers that do not need the original packets, the platform provides an open data subscription mechanism. The processed packet headers, network metadata, and telemetry statistics are gathered over the network plane into a high-performance time-series database, which other data consumption platforms can access through APIs and message queues.

A high-performance time-series database can be deployed in each region and availability zone. A branch environment usually does not need its own time-series database; its data can be compressed and written to the database in the nearest region.

The master controller responds directly to API calls for network data, either by querying the local time-series database or by gathering the results returned by the database APIs of each region and replying to the requester.



Data subscription can be provided through message queues such as ZeroMQ. After a data-consuming platform initiates a subscription request to the database, the subscription service begins delivering data.
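A minimal consumer-side sketch using ZeroMQ's PUB/SUB pattern is shown below; the endpoint address, topic name, and two-frame message layout are assumptions for illustration, not the platform's documented interface.

```python
import zmq  # pyzmq

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.connect("tcp://timeseries-db.example.internal:5556")  # assumed endpoint
subscriber.setsockopt_string(zmq.SUBSCRIBE, "flow-log")          # assumed topic

while True:
    # Assumes the publisher sends [topic, payload] frame pairs.
    topic, payload = subscriber.recv_multipart()
    print(f"received {len(payload)} bytes on {topic.decode()}")
```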



Deployment

The overall scheme involves three main components: collectors, controllers, and the high-performance time-series database. Once the overall plan is settled, construction can proceed in stages by region and by resource pool, finally building a unified traffic monitoring and management platform for the enterprise's hybrid cloud IT infrastructure.

Since the traditional physical network usually already has a complete monitoring scheme, the KVM and container resource pools are usually chosen for the first stage of deployment, solving the "black box" invisibility of traffic in the virtual network and meeting the compliance audit requirements for virtual network traffic. The collected traffic is connected to existing monitoring and analysis tools, closing the operations and business analysis tool chains in the private cloud and container environments.

The second stage brings in more resource pools, keeping pace with newly built and expanded pools, and adds the switches' sFlow data from the physical network as well as tapped traffic from dedicated lines and other links, achieving traffic collection across the whole data center on a unified traffic monitoring plane. This stage connects to the network operations center, security operations center, and intelligent operations platforms, providing packet and flow data services to meet every platform's needs for displaying and analyzing live traffic data.

The third stage collects the traffic of workloads or instances running on the public cloud, completing the monitoring and traffic management of the whole hybrid cloud IT environment, with network-wide mapping, traffic distribution, and traffic data delivery to multiple platforms.

For a hybrid cloud environment already in operation, deployment must not affect production; the network planning then multiplexes the management, monitoring, and distribution planes of the DeepFlow® platform onto existing network planes, most often the existing management plane.

When planning from scratch, it is recommended to plan an independent monitoring network plane for the whole hybrid cloud, carrying its monitoring traffic uniformly and independently. The computing power required by the collectors can be planned according to the traffic to be processed and the available resources; the minimum resource footprint of a single collector is 1 vCPU and 128 MB of memory.

Advantages of the scheme

Advanced traffic collection

The DeepFlow® collection scheme centers on collector technology. The collectors support KVM, VMware, container, and other environments, and are installed as processes, which avoids touching the production switching plane to the greatest possible extent: there is no conflict with production-plane switching or traffic steering, and the collector inherits the operating system's process-level protections, achieving overall system stability.

Distributed processing system

After packets are collected, centralized processing is avoided: a distributed architecture processes them at the collection points, while the controller provides centralized management.

Large-scale scenarios

The overall scheme is based on a distributed design and multi-region management, which fully guarantees elastic expansion as the resource pools grow. The system as a whole can manage 100,000 collectors, covering virtual machine, container, and public cloud resource pools.

Manageability

The platform's master controller is the centralized entry point where the administrator configures collection and distribution policies, and it can manage the state of all collectors. All operations align with resource-pool characteristics, supporting dimensions such as virtual machine name, subnet, cluster, and container Pod. Resources migrate, recover, and get redeployed; policy following ensures that collection continues uninterrupted in this dynamic environment. The platform manages at the granularity of a single collector, and each collector's management actions, running state, and history can be traced.

Packet and flow data services

Data services are the vital link between traffic collection and the back-end platforms: complete traffic packets are distributed to multiple destinations, and the high-performance network time-series database provides flow data services through APIs and message queues such as ZeroMQ and Kafka. This also decouples collection from the various back-end analysis tools, preventing the traffic collectors from being locked into a single, siloed tool.

Conclusion

The DeepFlow® hybrid cloud network traffic monitoring, collection, and distribution solution provides complete, sustainable platform-level traffic management for enterprises as their IT infrastructure evolves toward hybrid cloud and cloud native, avoiding repeated investment and repeated installation while solving practical network monitoring problems. It also supplies live network traffic and flow logs to complement the enterprise's overall operations and security platforms. The solution is already in use in the IT environments of customers in finance, telecommunications, and other industries.