Netflix has developed a network observability sidecar called Flow Exporter, which uses eBPF tracepoints to capture TCP flows in near real time. This high-performance sidecar uses less than 1% of CPU and memory on an instance and supplies flow data at scale for network insight.

Challenges

The cloud network infrastructure that Netflix currently uses consists of AWS services, such as VPC, DirectConnect, VPC Peering, Transit Gateways, and NAT Gateways, together with Netflix-owned devices. Netflix’s software infrastructure is a large distributed ecosystem made up of specialized functional tiers that run on AWS and on Netflix-owned services. While we strive to keep the ecosystem simple, relying on such a variety of technologies inevitably brings challenges, including:

  • Application dependency and data flow mapping. As the number of microservices grows, service owners and centralized teams find it hard to identify systemic issues without visibility into application dependencies and data flows.
  • Path verification. The speed of change in Netflix’s production streaming and studio environments can leave services unable to communicate with other resources.
  • Service segmentation. The convenience of cloud deployment has led to the organic growth of multiple AWS accounts, deployment practices, interconnection practices, and more. Without network visibility, it is difficult to improve our reliability, security and capacity posture.
  • Network availability. The expected continued growth of our ecosystem makes it difficult to understand our network bottlenecks and the limits we may be approaching.

Cloud Network Insight is a suite of solutions that provides operational and analytical insight into the cloud network infrastructure to address these problems. By collecting, accessing, and analyzing network data from a variety of sources, such as VPC Flow Logs, ELB access logs, and eBPF flow logs on instances, we can provide network insight to users and central teams through data visualization tools such as Lumen and Atlas.

Flow Exporter

Flow Exporter is a sidecar that uses eBPF tracepoints to capture TCP flows in near real time on the instances that power Netflix’s microservices architecture.

What is BPF?

Berkeley Packet Filter (BPF) is an in-kernel execution engine that processes a virtual instruction set. It has been extended as eBPF, which provides a safe way to extend kernel functionality. In some ways, eBPF does for the kernel what JavaScript does for websites: it allows all sorts of new applications to be created.

An eBPF flow log record represents one or more network flows and contains TCP/IP statistics gathered over a variable aggregation interval.
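To make the idea of a flow log record concrete, here is a minimal sketch of folding per-flow observations into one record per aggregation interval. The field names (`bytes_sent`, `bytes_received`, `retransmits`) and the event tuple layout are assumptions chosen for illustration; the article does not describe the actual record schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """Identifies one TCP flow (hypothetical field names)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


@dataclass
class FlowLogRecord:
    """Statistics for one flow, aggregated over [start_ts, end_ts]."""
    key: FlowKey
    start_ts: float
    end_ts: float
    bytes_sent: int = 0
    bytes_received: int = 0
    retransmits: int = 0


def aggregate(events):
    """Fold (ts, key, sent, received, retransmits) events into one record per flow."""
    records = {}
    for ts, key, sent, received, retransmits in events:
        rec = records.get(key)
        if rec is None:
            # First observation of this flow opens a new aggregation interval.
            rec = records[key] = FlowLogRecord(key, start_ts=ts, end_ts=ts)
        rec.start_ts = min(rec.start_ts, ts)
        rec.end_ts = max(rec.end_ts, ts)
        rec.bytes_sent += sent
        rec.bytes_received += received
        rec.retransmits += retransmits
    return list(records.values())
```

Calling `aggregate` on the events seen during one interval yields the records an exporter would ship; the interval length is left to the caller, which matches the "variable aggregation interval" described above.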

The sidecar uses less than 1% of CPU and memory on any instance in our fleet by combining high-performance eBPF with carefully chosen transport protocols. The choice of transport protocol, such as gRPC, HTTPS, or UDP, is made at run time based on the characteristics of the instance’s placement.

The run-time behavior of Flow Exporter can be managed dynamically through configuration changes made via Fast Properties. Flow Exporter also publishes various operational metrics to Atlas. These metrics are visualized in Lumen, a self-service dashboarding infrastructure.

So how do we ingest and enrich these flows at scale?

Flow Collector is a regional service that ingests and enriches flows. IP addresses in the cloud can move from one EC2 instance or Titus container to another over time. We use Sonar to attribute an IP address to a specific application at a particular time. Sonar is an IPv4 and IPv6 address identity tracking service.

Flow Collector consumes two data streams: IP address change events from Sonar via Kafka, and eBPF flow log data from the Flow Exporter sidecars. It uses application metadata from Sonar to attribute flow data in real time. The attributed flows are pushed to Keystone, which routes them to the Hive and Druid data stores.
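Because an IP address can belong to different applications at different times, attribution is a lookup by both address and timestamp. The sketch below illustrates one way to do that with a time-sorted history per IP and a binary search. The class `IpAttribution`, its methods, and the flow dictionary keys are hypothetical names for illustration; this is not Sonar's API or the collector's actual implementation.

```python
import bisect
from collections import defaultdict


class IpAttribution:
    """Tracks which application owned an IP address over time (illustrative only)."""

    def __init__(self):
        # ip -> parallel, time-sorted lists of change timestamps and owners
        self._times = defaultdict(list)
        self._apps = defaultdict(list)

    def record_change(self, ip, ts, app):
        """Apply an IP address change event: `app` owns `ip` from `ts` onward."""
        i = bisect.bisect_right(self._times[ip], ts)
        self._times[ip].insert(i, ts)
        self._apps[ip].insert(i, app)

    def owner_at(self, ip, ts):
        """Return the application that owned `ip` at time `ts`, or None if unknown."""
        times = self._times.get(ip)
        if not times:
            return None
        # Index of the latest change event at or before `ts`.
        i = bisect.bisect_right(times, ts) - 1
        return self._apps[ip][i] if i >= 0 else None


def attribute(flow, index):
    """Return a copy of `flow` enriched with source and destination application names."""
    return {
        **flow,
        "src_app": index.owner_at(flow["src_ip"], flow["ts"]),
        "dst_app": index.owner_at(flow["dst_ip"], flow["ts"]),
    }
```

In this sketch, change events feed `record_change` and each incoming flow log passes through `attribute`. Flows whose addresses have no known owner at that time keep `None` rather than a guessed application.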

The attributed flow data drives a range of use cases within Netflix, such as network monitoring and network usage forecasting, both available through Lumen dashboards, and machine learning-based network segmentation. The data is also used by security and other partner teams for insight and incident analysis.
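One of the challenges listed earlier was mapping application dependencies. As a rough illustration of how attributed flows could feed that use case, the sketch below sums bytes per (source application, destination application) pair. The dictionary keys `src_app`, `dst_app`, and `bytes` are assumed names, and the article does not say how Netflix actually builds its dependency views.

```python
from collections import Counter


def dependency_edges(attributed_flows):
    """Sum bytes per (source app, destination app) pair, skipping unattributed flows."""
    edges = Counter()
    for flow in attributed_flows:
        src, dst = flow.get("src_app"), flow.get("dst_app")
        if src is None or dst is None:
            # Without both endpoints there is no edge to record.
            continue
        edges[(src, dst)] += flow.get("bytes", 0)
    return edges
```

The resulting counter can be read as a weighted directed graph, where each key is an edge between two applications and each value is the traffic volume observed on it.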

Conclusion

With eBPF and a highly scalable, efficient flow collection pipeline, eBPF flow logs can provide network insight into cloud network infrastructure at scale. After several architectural iterations and some tuning, the solution has proven able to scale.

We currently ingest and enrich billions of eBPF flow logs per hour and provide visibility into our cloud ecosystem. The enriched data lets us analyze the network across a variety of dimensions (such as availability, performance, and security) to ensure that applications can effectively deliver their data payloads across a globally distributed cloud-based ecosystem.

Original link: netflixtechblog.com/how-netflix…