In the cloud era, a tenfold performance improvement for locally deployed enterprise-class software in a short period of time, without any capacity expansion, is indeed a remarkable achievement.

As we all know, the performance of x86 hardware has been squeezed nearly to its limit. With plenty of questions in mind, the SegmentFault community interviewed Xiang Yang, director of R&D at Spruce Networks. In the interview, Xiang Yang discussed in detail how various time-series databases perform in the enterprise data center network scenario, and explained from an engineering perspective how Spruce Networks achieves a tenfold performance improvement in x86 software.

Xiang Yang: Director of R&D and network architect at Spruce Networks. He manages the Spruce R&D team and is responsible for the architecture design and core feature implementation of DeepFlow.

In 2013 he received his PhD in computer science and technology from Tsinghua University under the supervision of Professor Jianping Wu, where he independently built the world's first BGP hijacking detection system based on correlation analysis. For this work he received the Community Contribution Award at the Internet Measurement Conference (IMC), a top international conference on network measurement. In 2015 he completed his postdoctoral research at Tsinghua University. His main research direction is cloud data center network architecture, and he holds a number of patents related to network security and cloud data centers.

SF: Could you please introduce your main work experience, your technical research focus, and your current work?

Xiang Yang: My work experience is actually fairly simple. Before joining Spruce Networks, I studied for a PhD at Tsinghua University under academician Jianping Wu, working on topics including inter-domain routing algorithms and architecture, and inter-domain routing security. After graduation I came into contact with the field of SDN, so joining Spruce was a natural step, and I have been engaged in SDN research and development ever since. I am now head of the company's R&D team and of the DeepFlow product line.


SF: As business needs grow and cloud computing and related technologies mature, enterprises have begun building their own cloud data centers, of which the network is a key component. In your opinion, what are the pain points in the network architecture of modern enterprise cloud data centers? How should companies respond to these challenges?

Xiang Yang: The main challenges enterprises face here fall into two aspects: how to build the network, and how to maintain it.

Construction mainly has to solve two pain points: network connectivity and network services.

To support the business, heterogeneous resources and hybrid clouds must first be connected and interconnected. On top of that, you can provide rich network services, including application delivery services, security services, and so on.

In parallel with these problems is the operation and maintenance of complex networks. Networks can be quite complex these days: typically the IT infrastructure spans different resource pools in the public cloud and the private cloud, plus a virtualized pool of container resources. On top of such infrastructure, a unified network has to be built to support the flexible needs of the business.

For example, a business may take a few Pods from the container resource pool, some virtual machines from the VM pool, and several physical machines from the bare-metal pool to implement a single business requirement. These resource pools are independent of each other, but from the perspective of the network (and the business) they form a whole. Connecting these networks and orchestrating them in a unified way is a challenge.

Even once the network is connected, maintaining it is very difficult. It cannot be maintained by manual labor alone: if the complexity of the network increases tenfold, at least ten times the manpower is needed, which is not sustainable.

SF: To help enterprises meet the challenges of network operation and maintenance, Spruce Networks launched its cloud network analysis product DeepFlow as early as 2016. Could you first introduce DeepFlow's core components, features, and application scenarios?

Xiang Yang: We divide DeepFlow's solution into two scenarios, one for collection and distribution and the other for analysis. These map onto the three core components of the product: the collector, the analyzer, and the controller.

From a component perspective, the controller is responsible for centralized control and large-scale management of collectors. Our collectors run in many heterogeneous environments: KVM virtualization, VMware virtualization, public clouds, private clouds, containers, Linux, and Windows. Another characteristic is the sheer number of collectors.

What the controller needs to do is centralized, large-scale control. What the collector needs to do is cover all of these heterogeneous environments, physical and virtual, achieving full network coverage; it also needs a high-performance policy matching algorithm.

If all traffic data were processed in full, the resource cost would be enormous. Some traffic we only want to watch; for some we want to record counters; some should be captured into PCAP files with detailed packet data; and some should be distributed directly to other tools.

Since different traffic has different requirements, we need a policy system to perform what amounts to orchestration: through matching, traffic that fits a certain intent is handed to the corresponding consumer on the back end. This requires a powerful policy matching engine on the collection side. Besides distributing traffic to third-party analysis tools, the traditional NPM and DPI appliances, we can also do some of the analysis ourselves; that is our analyzer.
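The idea of a policy engine that matches flows against intents and dispatches them to consumers can be sketched in a few lines. This is only a toy illustration of the concept; the field names, actions, and rule format below are hypothetical, not DeepFlow's actual policy model:

```python
# Minimal sketch of a traffic policy matcher: each policy describes which
# flows it applies to and what the back-end consumer should do with them.
# All field names and actions here are illustrative only.

def matches(policy, flow):
    """A flow matches when every constraint the policy sets is satisfied."""
    return all(flow.get(k) == v for k, v in policy["match"].items())

def dispatch(policies, flow):
    """Return the actions of every policy the flow matches."""
    return [p["action"] for p in policies if matches(p, flow)]

policies = [
    {"match": {"proto": "TCP", "dst_port": 443}, "action": "count"},
    {"match": {"proto": "UDP"}, "action": "pcap"},
    {"match": {"src_net": "10.0.0.0/8"}, "action": "mirror_to_npm"},
]

flow = {"proto": "TCP", "dst_port": 443, "src_net": "10.0.0.0/8"}
print(dispatch(policies, flow))  # ['count', 'mirror_to_npm']
```

A production engine would of course use compiled match structures rather than per-policy scans, which is exactly where the high-performance matching algorithm mentioned above comes in.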

One of the main features of our analyzer is that it stores all the state and statistics of the network in a distributed time-series database, which amounts to a panoramic description of the network. When a customer troubleshoots the network in a hybrid cloud scenario, they can connect everything together: the different networks, the different layers, the different Overlays and Underlays, for example a two-layer Overlay in a container scenario.

Back to the application scenarios. The first is traffic collection and distribution in a hybrid cloud. This scenario generally targets customers who already have a lot of traditional analysis equipment, such as DPI and NPM appliances, but are unable to capture traffic in virtual networks and container networks.

In the virtual network scenario, the scale of the network is very large. A single server can now host 10 virtual machines, and a virtual machine may host 10 Pods, which adds up to a very large number. So in a virtual network you cannot tap traffic directly with optical splitters or port mirroring the way you would in a traditional physical network. We provide full network coverage and deliver traffic on demand to the back-end analysis tools.

The other scenario is network diagnostics for the hybrid cloud, which is really a network analysis scenario. Today's cloud networks are complex enough, with heterogeneous resource pools and different levels of Overlay, that fault diagnosis and localization become hard. This requires the ability to analyze traffic data across the whole network, namely network analysis.

SF: Performance has improved with every DeepFlow release since version 5.0, and the performance of the core components has improved markedly since version 5.5.6. How has this been achieved? Is it related to programming languages and databases? Why does Spruce Networks focus so much on performance improvement?

Xiang Yang: First, how these performance improvements were realized. There are several aspects. On the collection side, we introduced new technologies such as DPDK and, on recent Linux kernels, XDP. XDP depends less on the customer's environment than DPDK, and traffic collection performance improved tenfold compared with our previous generation of technology.

The CentOS 8 kernel released in 2019 does not support XDP very well, so we also made some kernel-level improvements to XDP support. This lets us improve collector performance even on lower-version Linux environments.

The other aspect is the analysis side, where the optimization is mainly in data structures and algorithms. Most components of the DeepFlow platform are written in Go (I'll get to why we chose Go in a moment). We made some critical algorithm and data structure changes to Go's native data structures, such as Map, which improved performance by a factor of 10.

We hold patents on some of this work, such as improvements to object resource pools and memory pools in Go, and improvements to Map. These are the optimizations on the analysis side.

Then there is the storage-side optimization, where we redeveloped the kernel of the InfluxDB database, achieving a 10x increase in read and write performance and the ability to scale horizontally, and, most importantly, making it better suited to networking scenarios.

Going back to Go: it's not so much about the language itself. If we wanted extreme performance, we would probably choose a language like C. But there is another consideration, namely development efficiency and adaptability. With C we would depend more on the environment (such as the glibc version).

Go is a language of the cloud era; technologies like Docker are built on it, and it has very few dependencies. We chose Go for its advantages in dependencies and development efficiency, while working to overcome its shortcomings, such as garbage collection overhead and data structure performance, which we have improved.

Finally, why we focus so much on software efficiency: because we are a software company. Hardware companies that ship boxes can lean on dedicated hardware; we cannot. We build cloud-native platform software that must be able to run anywhere, in public clouds, private clouds, and containers. It should make no assumptions about the operating system or the physical machine it runs on, and it has to bring customers more benefits without placing requirements on their hardware environment.

In other words, since we cannot change the hardware, we have to pursue the ultimate performance of the software in order to bring value to customers.

SF: Spruce Networks used the open source time-series database InfluxDB in its product development. What was the basis for DeepFlow's database selection and development? Has the database selection changed over the three years of continuous DeepFlow iteration?

Xiang Yang: Three years ago, when we first started working on this product, we found that time-series databases were not yet well developed. The time-series databases of that era were built on traditional databases rather than designed directly for the time-series data scenario. ElasticSearch, for example, is really a search engine but was being used as a time-series database.

InfluxDB was still in its 0.x era at the time. We started out using both ElasticSearch and InfluxDB: on the one hand relying on ElasticSearch's stability and massive horizontal scaling capability, and on the other hand, since time-series data was a new data type that existing databases could not store directly, using InfluxDB in a small scope.

Later we made a major revision, because ElasticSearch, as a search engine, consumed a lot of resources and was not suited to the time-series scenario, let alone to storing data from large-scale network monitoring. So we switched entirely to InfluxDB, which was the second stage of our database selection. The open source version had a cluster solution for some time, but it was dropped after a release and the feature became part of the commercial offering.

This was part of the reason we changed databases, but a bigger part is that we found InfluxDB not well suited to networking scenarios. Consider using an ordinary time-series database to monitor 10,000 servers: each server has one CPU value at every time point, say every second, so the data volume is on the order of 10,000 points per second.

In other words, monitoring data volume is normally linear in the number of servers monitored. But in a virtual network scenario, the number of (virtual) machines is several orders of magnitude higher. Moreover, if we record the interactions between every pair of machines, the data volume approaches the square of the machine count. This access relationship also expands along other dimensions, such as protocol type (TCP/UDP) and port, whose count is directly related to the number of services. The data here is very high-dimensional, essentially a sparse matrix, which is quite different from the application scenarios of a classic time-series database.
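The difference in scale can be made concrete with a back-of-the-envelope calculation. The server and VM counts echo the figures in the interview; the protocol and port multipliers are illustrative assumptions:

```python
# Rough series-cardinality comparison: fixed-frequency host monitoring
# versus per-flow network monitoring. All multipliers are illustrative.

servers = 10_000
host_series = servers                    # one CPU series per server

vms_per_server = 10                      # virtualization multiplies endpoints
vms = servers * vms_per_server           # 100,000 endpoints
# Worst case: any pair of endpoints may talk to each other,
# and each pair splits further by protocol and port.
pairs = vms * (vms - 1) // 2
protocols = 2                            # TCP / UDP
ports_per_service = 100                  # illustrative assumption
flow_series_upper_bound = pairs * protocols * ports_per_service

print(host_series)               # 10000
print(flow_series_upper_bound)   # 999990000000, roughly 10^12
```

The upper bound is never fully realized, which is exactly the point: only a sparse subset of this enormous key space ever carries data, so the storage engine must handle a sparse matrix rather than a dense, fixed-frequency grid.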

In addition, networking scenarios have specific requirements, such as querying by network segment or matching against a network segment by prefix. Network traffic is also typically spread across different machines by load balancing, and at such data volumes InfluxDB cannot aggregate the partial results processed on different machines. InfluxDB simply does not support these scenarios. Therefore, starting from its core storage and query engine, we improved performance, added horizontal scaling and high availability, and added more support for querying, aggregating, and filtering network data. The result is the self-developed network time-series database we use today.

On the back end we've tested many long-term storage solutions for Prometheus, since Prometheus itself is only good for short-term data storage, usually just a day or two. It has many remote storage back ends, such as M3DB and VictoriaMetrics. In the open source community's rankings of time-series databases, InfluxDB is number one, yet other databases publish performance test data far better than InfluxDB's.

Such performance tests should be taken case by case: other databases often look better only in the specific scenario they were tested in, or they are not widely used. Taking this into account, we still chose InfluxDB as the basis for our own time-series data store.

On another level, existing time-series databases are all built for the same scenario: physical servers monitored at a fixed frequency. Network monitoring is different. Its objects are massive: one-dimensional IPs, two-dimensional IP pairs, three-dimensional IP-port tuples, and so on. This is clearly not fixed-frequency monitoring but sparse-matrix monitoring. Existing time-series databases all use TSM as the data storage structure, so in our next-generation product we are also evolving this storage structure to support sparse matrices, in order to better store and retrieve network data.

Simply put, InfluxDB is the most widely used, mature, and stable, and all time-series databases are virtually identical at the lowest algorithmic level. So differences in test performance likely come down to test methodology, usage scenario, or other factors, and we consider them negligible.

SF: Spruce Networks uses open source database components in its product development. What are your current thoughts on open source?

Xiang Yang: We are not an open source software driven company yet. In the future we may contribute some components back to the community. We have seen clustering removed from the community version of InfluxDB. We feel our clustering feature does a pretty good job: it is easy to operate, does not rely on consensus protocols like ZooKeeper's Zab or Raft, and is a highly available clustering approach well suited to time-series scenarios. But if we contributed our cluster implementation and scale-out query capability to the community, it would affect how the community operates, since the community has already moved some of those capabilities into a commercial version.

SF: DeepFlow opens up its development and data interfaces, allowing customers to build personalized applications and tools on top of DeepFlow. What programming languages does DeepFlow currently support?

Xiang Yang: For custom development we currently provide two ways. The first is the API, which can be called from any language because it is RESTful.

The other is a Python-based SDK. Python has many advantages, including a large user base, a low barrier to entry, and rich libraries, giving it natural strengths in data processing and network orchestration.
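Since the API is plain REST over HTTP, any language with an HTTP client can use it. A minimal Python sketch of the calling pattern follows; the base URL, resource path, query parameters, and response fields are all hypothetical stand-ins, not DeepFlow's published API:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://deepflow.example.com/api/v1"   # hypothetical endpoint

def build_query(resource, **params):
    """Build a GET request for a hypothetical statistics resource."""
    url = f"{API_BASE}/{resource}?{urlencode(sorted(params.items()))}"
    return Request(url, headers={"Accept": "application/json"})

def parse_response(body):
    """The server returns fully aggregated results; the caller only decodes them."""
    return json.loads(body)["data"]

req = build_query("stats", vm="vm-1", metric="bps", window="1h")
print(req.full_url)
# A canned response body, standing in for urlopen(req).read():
canned = '{"data": [{"t": 1700000000, "bps": 123456}]}'
print(parse_response(canned)[0]["bps"])  # 123456
```

The same two steps, build a request and decode JSON, exist in every language, which is why a RESTful interface needs no per-language bindings.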

SF: Will programming language support be extended in the future, for example to Java? Will these custom-built applications affect DeepFlow's performance?

Xiang Yang: Right now our plans in this area are mainly customer-driven. Applications that customers develop against our APIs have little impact on DeepFlow's performance. Above our product's storage engine sits a distributed query engine that does a great deal of distributed computing. API calls leave the computation to our analyzer cluster, where the bulk of the work is done. The efficiency of the API caller is therefore not the bottleneck of the query chain: the data finally handed to the caller is a fully aggregated result, far smaller in magnitude than the data we store on the platform (for example, a month's worth of network data). No matter what language a customer chooses for custom development, DeepFlow's performance is not affected.
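The point about where the computation happens can be illustrated with a toy reduction: the cluster side crunches the raw points, and only a small aggregate crosses the API boundary. The numbers and the aggregation function here are invented for illustration:

```python
# Toy illustration: aggregation happens server-side, so the API caller
# receives only a small, fully reduced result. Data is made up.

raw_points = [("vm-1", i % 50) for i in range(100_000)]   # raw samples, server side

def server_side_aggregate(points):
    """What the analyzer cluster would compute before answering the API call."""
    total = sum(v for _, v in points)
    return {"count": len(points), "avg": total / len(points)}

result = server_side_aggregate(raw_points)   # this dict is all that crosses the API
print(result)  # {'count': 100000, 'avg': 24.5}
```

However slow the caller's language, decoding a dict like this is trivial; the heavy lifting stays in the cluster.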

SF: In the current environment, most Internet companies choose agile development and rapid iteration in their product work. DeepFlow, on the other hand, decoupled its software architecture after version 5.0. What made you decide to do this? How do you see the balance between architectural stability and the ability to respond quickly to changing requirements?

Xiang Yang: As DeepFlow has developed, its iteration cycle has kept changing. Three years ago our cycle was six months, which is common for a to-B company. But our product runs in the cloud, where the ability to iterate quickly matters. We have to deliver quickly, yet the customer's environment is not a system we operate, and we cannot push updates many times a day. We had to make a trade-off: keep the iteration cycle as short as possible while keeping the product as stable as possible.

Against this background, we decoupled the DeepFlow platform and shortened the release cycle from six months three years ago to six weeks today, allowing us to respond quickly to customer needs, not through one-off projects but through standardized product releases.

We use the idea of productization to respond to the growing customer needs. The decoupled products are divided into two layers: the upper layer is the application layer and the lower layer is the platform layer. The platform layer pursues high performance and stability, while the application layer pursues flexibility and high efficiency. In the process of product iteration, we can selectively arrange the upper and lower iteration cycles. For example, in successive releases, the underlying platform does not need to be iterated, but at the same time, we can update the upper application with each release to meet the new needs of customers.

SF: Have there been other changes to DeepFlow's software architecture beyond decoupling? What was the reason?

Xiang Yang: One obvious change over the course of DeepFlow's development is this: when we first built the product we used a lot of open source components, which is common practice for similar products, but it was a big test for product stability. Open source components are operations-oriented and need someone to watch over them, and because the code is not fully under our control, fixes are slow when something goes wrong.

Then we slowly switched to another mode: developing a large number of components ourselves. Today, MySQL is the one open source component left in our product; it is so classic and stable that we don't need to change it. We developed all the other components ourselves, moving from making full use of open source components to building our own on top of open source libraries. Open source components such as ElasticSearch are stable at the core, but many of their surrounding parts, such as input/output and start/stop operations, are prone to problems.

We now rely on some very mature libraries, and we are gradually replacing the remaining components just as we are replacing InfluxDB. Right now we keep only InfluxDB's core storage and query engine, and we are slowly replacing even that, since InfluxDB is based on the classic TSM structure and is not well suited to networking scenarios.

SF: Do MySQL and InfluxDB play different roles within the product?

Xiang Yang: MySQL mainly stores business-level data, which we call metadata. InfluxDB stores time-series data: many resource objects constantly generate statistics, and we store those statistics along the time dimension in InfluxDB.

SF: Metadata and time-series data live in different places. Does that affect query performance?

Xiang Yang: Keeping the two kinds of data in the same place is not good for data maintenance. For example, how would you insert, delete, query, and update relational business data, such as a virtual machine's interfaces and their associated IPs?

If that information were stored at every time point, then whenever the virtual machine changed, its historical data would have to be changed accordingly. Time-series data is different: it is statistical data for a specific object at fixed time points, and historical data is essentially never modified. The only scenarios requiring changes are failover, or changes to the object itself. To give an easy-to-understand example, storing metadata is a bit like storing an e-commerce product page: an SKU's name, images, description, and other attributes (the specific object) rarely change (unless it goes offline), while real-time data such as price and inventory change frequently, even constantly.
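The split can be shown with a toy model: mutable object attributes are updated in place (the relational side), while measurements are append-only points (the time-series side). The structures and names below are purely illustrative:

```python
# Toy illustration of the metadata / time-series split.

# Metadata (the MySQL side): mutable attributes, updated in place.
metadata = {"vm-1": {"name": "web-server", "ips": ["10.0.0.5"]}}

# Time-series (the InfluxDB side): append-only (object, timestamp, value)
# points; history is never rewritten.
points = []

def record(obj_id, ts, value):
    points.append((obj_id, ts, value))

record("vm-1", 1700000000, {"bps": 100})
record("vm-1", 1700000001, {"bps": 120})

# Renaming the VM touches only its metadata row, not a single historical point.
metadata["vm-1"]["name"] = "web-server-renamed"

print(len(points))                # 2: history untouched by the rename
print(metadata["vm-1"]["name"])   # web-server-renamed
```

Keeping the two stores separate means a metadata update is one cheap row change, instead of a rewrite of every historical point that mentions the object.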

SF: What is the difference between DeepFlow's own applications calling the platform layer and the personalized applications that customers develop?

Xiang Yang: The difference is that we add an authentication mechanism for customers; otherwise there is no big difference. The system's native applications and the customer's own applications are completely equal. We are very focused on engaging customers across the whole product chain, including the applications they develop and the data they generate in our products.

SF: Spruce Networks' customers are mainly enterprises in traditional industries such as finance, telecommunications, and manufacturing. Our impression is that such enterprises usually move slowly, and their businesses don't change that rapidly. Why, then, does Spruce Networks pursue such extreme improvements in performance and such rapid iteration?

Xiang Yang: In fact, customers' overall technology environments have changed, especially their IT environments. Customers in banking and finance, for example, are already using virtualization, containers, microservices, and other technologies at scale in production. Customers actually adopt new technology fairly quickly. On the other hand, these customers really were buying "boxes" before, and now, in the cloud environment, they are buying software. If our software consumes too much of the customer's hardware resources, its value can hardly be demonstrated.

SF: What do you plan to bring to customers in the next phase of DeepFlow? Can you tell us about DeepFlow's next development focus?

Xiang Yang: In the previous stage we mainly productized DeepFlow's data collection capability and some of its data statistics capability. The next stage is to improve intelligent data analysis. Intelligent analysis shows up first in questions such as: how do you correlate the traffic before and after a network element, say a firewall or a load balancer, and determine which flows belong to the same session? A customer accesses the load balancer, and the load balancer accesses a host on the back end; how do you draw that access chain? These are all intelligent analysis capabilities.

Another layer is the correlation between different network layers: the data of the container network, the virtual network beneath it, and the physical network beneath that. In a hybrid cloud scenario, this correlation gives customers a complete, end-to-end, hop-by-hop diagnostic capability, which is an important direction in our product evolution. We have now collected a great deal of network data; how to build intelligent baselines, fault warning, and fault handling on top of it is what we need to do next.

At the level of customer usage, we will pay more attention to the data customers generate. As I just mentioned, the SDK is one means of customer engagement. DeepFlow is mainly a data platform, and we do not limit how customers use it. Customers can flexibly build views of their monitoring data on the platform and easily assemble their own monitoring dashboards from those views, which improves their ability to customize.

SF: Has DeepFlow made any industry-specific adjustments?

Xiang Yang: We mainly make adjustments at the solution level, combining upstream and downstream products into complete solutions for customers, facing different scenarios and solving different problems.

SF: What skills do you think a cloud network engineer will need to acquire to adapt to the cloud era?

Xiang Yang: The days when a network engineer was CLI-oriented and only occasionally did automation work, such as writing Expect scripts to capture SNMP data, are slowly becoming a thing of the past. Today, network engineers should improve themselves in two areas. (Allow me a job advertisement here: colleagues interested in SDN are welcome to join us.)

The first is automation. A network engineer now manages an order of magnitude more than ever before, so automation is definitely a must. Python, as I mentioned earlier, is a good programming language for automation, and the barrier to entry is not high.
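As a small taste of the kind of automation meant here, consider a script that sweeps a fleet of devices and reports the ones that fail a health check. The device names are made up, and the check itself is injected so it could be anything, an SNMP poll, an API call, or a ping:

```python
# Hypothetical automation sketch: sweep a fleet of devices and report
# the ones that fail a health check. Device names are illustrative.

def sweep(devices, is_healthy):
    """Return the devices that fail the injected health check."""
    return [d for d in devices if not is_healthy(d)]

devices = ["leaf-01", "leaf-02", "spine-01"]
down = {"leaf-02"}                      # pretend this one is unreachable
failed = sweep(devices, lambda d: d not in down)
print(failed)  # ['leaf-02']
```

Replacing the manual check of each box one CLI session at a time with a loop like this is precisely the order-of-magnitude leverage the answer describes.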

The second is the ability to analyze data in complex systems. People in different roles have different needs for a system's monitoring data, and you can't expect a production-grade tool to solve every problem with a click of the mouse. Network engineers need to be able to take the data they collect, apply a bit of processing, and get the results they want. Besides network data there may also be system data, such as logs; correlating and analyzing all of this still requires human creativity for now.