From 24 to 26 June 2019, KubeCon + CloudNativeCon + Open Source Summit, hosted by the Cloud Native Computing Foundation (CNCF), will be held in Shanghai, China. Following KubeCon’s successful debut in China in 2018, this year’s event will bring together thousands of engineers from around the world for in-depth discussions and case studies covering all CNCF projects and topics, with talks from CNCF project maintainers and end users. The program committee of this year’s KubeCon + CloudNativeCon + Open Source Summit consisted of 75 experts who reviewed 618 proposals, and a total of 26 technical presentations from Alibaba were selected for KubeCon China 2019. At this KubeCon, Ding Yu (Shutong), head of Alibaba Cloud Intelligent Container Platform; Li Xiang, CNCF TOC member, etcd project author, and senior technical expert of Alibaba Cloud Container Platform; Zhang Lei, CNCF ambassador, Kubernetes project maintainer, and senior technical expert at Alibaba Cloud; and many other leading cloud native technologists will attend and give talks. They will also present the latest trends and progress in cloud native technology, including the open source Virtual Cluster design for strong multi-tenancy, the OpenKruise open source project, Cloud Native App Hub, and more. We look forward to meeting you at KubeCon China and to exchanging ideas and collaborating with the Alibaba Container Platform team.

Alibaba’s KubeCon + CloudNativeCon special page is online

The “KubeCon + CloudNativeCon Alibaba special page” has officially launched, presenting Alibaba Cloud’s talk topics at this KubeCon and its achievements in the cloud native ecosystem. There you can browse the topics of Alibaba’s KubeCon talks, follow updates to the course “CNCF X Alibaba Cloud Native Technology Open Class”, keep up with news on Alibaba’s cloud native products, and check the schedule of the salon on June 24. Click the link below or “read the original text” at the end of the article to go directly to the special page.

Special page link: yq.aliyun.com/promotion/8…

We recommend the following presentations:

Kubernetes is on the cusp of the cloud native future

Speaker: Ding Yu (Shutong), head of Alibaba Cloud Intelligent Container Platform

As a practitioner of cloud native applications, Alibaba Cloud not only supports the huge traffic of Double 11, but also carries the large-scale daily business of the Alibaba economy. This talk will share Alibaba Cloud’s thinking on succeeding with Kubernetes and look ahead to future trends in cloud native.

Keynote: Cloud native at Alibaba scale

Speaker: Li Xiang, senior technical expert of Alibaba Cloud Container Platform

Topic introduction: Alibaba Cloud has successfully adopted cloud native at scale. This talk shares concrete experience with the audience, covering scaling, reliability, development efficiency, migration strategy, and other aspects, and discusses optimizations for large-scale scenarios. Cloud native works for Alibaba. Cloud native works for (almost) everyone.

Highly available and scalable Prometheus and Thanos at Alibaba

Speakers: Qin Guoan (Yan Lie), senior technical expert, Alibaba Cloud Container Platform; Li Tao (Lv Feng), senior development engineer, Alibaba Cloud Container Platform

Topic introduction: Alibaba Group uses Kubernetes to support the world’s largest e-commerce business. Providing reliable, fine-grained monitoring and alerting services is a real challenge in terms of availability and scalability. This talk will share experience in building a highly available and scalable fine-grained monitoring system on the open source projects Prometheus and Thanos. The system mainly supports Alibaba’s cluster management system with 8 million TPS and 10K requests. Topics to be discussed:

  • How to use Prometheus to support large-scale scenarios?
  • How to use Thanos to solve data query problems caused by multiple Prometheus instances?
  • Lessons learned from configuring Prometheus and Thanos, such as target discovery, recording rule management, and alerting rules.

Manage microservices across regions and across clusters using Istio

Speaker: Xiaozhong Liu, backend architect, UniCareer

UniCareer is an e-learning career development platform designed to meet the diverse needs of students and working professionals worldwide, and it serves users in many regions of the world. Its applications are deployed on multiple Kubernetes clusters in different Alibaba Cloud regions to reduce service access latency in each region. To manage these microservices effectively, a multi-cluster service mesh is needed to control microservice traffic and secure service-to-service communication. Istio is a service mesh built on top of Kubernetes that supports multiple topologies for managing application traffic across multiple Kubernetes clusters. In this case study, we will share the deployment designs and technologies behind multi-cluster traffic management with the Istio service mesh, and discuss some of the challenges and practices that arise from the needs and limitations of the underlying platform.

Achieving efficient resource utilization by co-locating CPU and GPU workloads

This talk introduces how to co-locate AI training jobs and long-running services on a Kubernetes cluster. The main goal is to improve resource utilization and save resources by mixing workloads. We will describe how we achieve co-location and evaluate utilization along several dimensions, including QoS class, cgroups, scheduling, and so on. Over the past few months, we have built a hybrid GPU and CPU cluster of several hundred nodes, and we will cover best practices for mixing long-running services and AI batch jobs in a production cluster.

1-5-10: How to quickly recover from large-scale container faults

In the cloud era, the number of container-based applications in enterprises is surging. Because of manual operations and hardware failures, the likelihood of container failures increases significantly. Ensuring the reliability of containers at scale without increasing resource investment has therefore become a huge challenge for cloud platforms. Alibaba operates millions of containers and applies the 1-5-10 rule to container-related faults: MTTD (mean time to detect) of 1 minute, MTTI (mean time to identify) of 5 minutes, and MTTR (mean time to recover) of 10 minutes. In this session, we will discuss how to use 1-5-10 to improve the reliability of containers at scale:

  • How to set up an effective local agent and detect problems within 1 minute;
  • How to diagnose container problems intelligently with an expert knowledge base;
  • How to recover from container problems automatically in a failure-driven manner.
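
To make the 1-5-10 targets concrete, here is a minimal Python sketch that checks a set of incident records against them. It is an illustration only, not Alibaba's tooling: the record fields (`started`, `detected`, `identified`, `recovered`) are hypothetical, and the sketch assumes all three times are measured from the onset of the fault.

```python
from datetime import datetime, timedelta
from statistics import mean

# Targets from the 1-5-10 rule, in minutes.
TARGETS = {"MTTD": 1, "MTTI": 5, "MTTR": 10}


def mean_minutes(incidents, start_key, end_key):
    """Mean elapsed time in minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


def check_1_5_10(incidents):
    """Return {metric: (measured_minutes, meets_target)} for each metric.

    Assumption: every metric is measured from the start of the fault.
    """
    measured = {
        "MTTD": mean_minutes(incidents, "started", "detected"),
        "MTTI": mean_minutes(incidents, "started", "identified"),
        "MTTR": mean_minutes(incidents, "started", "recovered"),
    }
    return {
        name: (value, value <= TARGETS[name]) for name, value in measured.items()
    }


# Hypothetical incident records, made up for this example.
t0 = datetime(2019, 6, 24, 10, 0, 0)
incidents = [
    {
        "started": t0,
        "detected": t0 + timedelta(seconds=30),
        "identified": t0 + timedelta(minutes=4),
        "recovered": t0 + timedelta(minutes=9),
    },
    {
        "started": t0,
        "detected": t0 + timedelta(seconds=90),
        "identified": t0 + timedelta(minutes=5),
        "recovered": t0 + timedelta(minutes=12),
    },
]

report = check_1_5_10(incidents)
for name, (value, ok) in report.items():
    print(f"{name}: {value:.1f} min, target {TARGETS[name]} min, met: {ok}")
```

With these made-up records, detection and identification meet their targets on average, while recovery misses the 10-minute target because of the second incident.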

Understand the scalability and performance of Kubernetes Master

Currently, Kubernetes has a size limit of 5K nodes, so if you want to use it to manage web-scale clusters of, say, 10K nodes, you may not be able to. Are you wondering what the performance bottleneck is when Kubernetes manages more than 5K nodes? When you want to take its scalability to the next level, which component is holding you back: etcd, the API server, or the scheduler? Understanding these issues is key to operating a large Kubernetes cluster. At Alibaba, we ran into many issues, such as pod creation becoming very slow as clusters grew larger. In this talk, we will share how to run various benchmarks and analyses to find bottlenecks, and how to tune the control plane components to achieve performance improvements of more than 100x.

Intro: containerd

This talk will focus on the design philosophy behind containerd’s architecture and show the audience how to extend containerd through its plug-in capabilities to provide solutions for different image storage backends and strongly isolated container runtimes. It will also include a demo of integrating containerd with the gVisor and Firecracker container runtimes, to help the audience better understand the best ways to integrate with containerd.

Building serverless at Alibaba with K8s, Kata Containers, and bare-metal cloud

Serverless computing is a popular computing model today that greatly reduces the cost for developers of deploying, managing, and running applications. On a serverless platform, services from different users are often co-located on the same node, so it is necessary to provide a trusted runtime environment for multi-tenant scenarios. At Alibaba, we use Kata Containers as a secure container runtime to ensure hard multi-tenant isolation and service runtime performance at the storage, network, hardware, and other levels. In this talk, we will discuss in detail, based on our production practice, how to achieve both hard multi-tenancy and high service performance in multi-tenant scenarios.

Alibaba’s exploration of data-driven open source community operations

Speaker: Zhao Shengyu (Shengyu), senior community manager, Alibaba Open Source Governance Office

Topic introduction: Operating an open source community has always been a pain point in open source software development, especially for communities led purely by developers. How to manage an open source community effectively, how to identify active contributors, and how to use data to uncover problems in community management are all pressing questions. This presentation will cover:

  • How do you measure how active a developer is in the community?
  • How to measure the overall activity of the open source community?
  • What can be seen and gained from the current analysis of the world’s top open source projects under these models?
  • What role should community management tools play in the open source community?
  • Based on the above content, what did Alibaba try and what results did it gain?

Alibaba: Lessons learned from an e-commerce giant’s evolution to cloud native

Speakers: Zhang, senior technical expert, Alibaba Cloud Container Platform; Si-yu Wang, senior development engineer, Alibaba Cloud Container Platform

Topic introduction: Migrating an e-commerce giant like Alibaba to a cloud native platform worldwide is not easy. In this talk, we will share the experiences and lessons we drew from our work over the past year, from both technical and community perspectives, including:

  • What are the major obstacles to Alibaba’s move to cloud-native technology?
  • What are Alibaba’s major technical debts? How can we resolve them? Is our approach working?
  • What if applications in your organization are managed differently from the way Kubernetes manages them?
  • Why is predictability important for e-commerce? Does Kubernetes have predictability out of the box? If not, why not? How to solve this problem (possibly without solution)?
  • How do you validate scalability issues in a cluster of thousands of nodes?
  • Can a large team work with upstream communities for win-win outcomes?

Intro: Dragonfly

Topic introduction: As container technology is adopted more and more widely in industry, distributing images safely and efficiently has become a new challenge for engineers. Dragonfly is an open source, intelligent P2P-based image and file distribution system. The project aims to address all distribution issues in cloud native scenarios. Currently, the Dragonfly project focuses on:

  • Simple: A well-defined user-facing API (HTTP) that is non-invasive to all container engines
  • Efficient: CDN support and P2P-based file distribution to save enterprise bandwidth
  • Intelligent: Host detection enables rate limiting and intelligent flow control at the host level
  • Secure: Encrypted data block transfer and HTTPS connection support

In this talk, we will focus on distributing container images with Dragonfly. We will review the challenges organizations face, including large-scale distribution, secure transmission, and bandwidth costs, and present solutions. The presentation will also discuss practical use cases.

No more chaos: Massive Kubernetes audits and inspections

As we all know, accurate anomaly detection and fast problem analysis are key to ensuring the availability and stability of a Kubernetes cluster. But across the Kubernetes project there are countless monitoring metrics. In our Kubernetes clusters alone, we observe thousands of such monitoring data points generated every second. Making sensible use of this large and complex body of data and metrics, recording and analyzing it effectively, and turning it into easy-to-understand visualizations and accurate alerts is a very challenging task. In this talk, we will share our practice and experience with Kubernetes cluster monitoring, auditing, and inspection at Alibaba. First, we will talk about Kubernetes’ key stability statistics and metrics and how to interpret them. Next, we will use case studies to show how to integrate and analyze these data and metrics. Finally, we will share Alibaba’s best practices for efficient, real-time, automated inspection and analysis of this data.

Minimize GPU cost for deep learning running on Kubernetes

More and more data scientists run deep learning jobs on NVIDIA GPUs in Kubernetes. At the same time, they have found that idle GPUs in the cluster waste more than 40% of their cost. Improving GPU efficiency has therefore become an important challenge. We will introduce a GPU sharing solution based on native Kubernetes:

  • How to define a GPU sharing API
  • How to schedule shared GPUs without changing the core code of the scheduler
  • How to integrate a GPU isolation solution with Kubernetes
  • We will also demonstrate how TensorFlow users can run different jobs on the same GPU device in a Kubernetes cluster
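
To illustrate the general idea of scheduling several jobs onto one GPU, here is a small Python sketch of a bin-packing placement. It is not the solution presented in the talk: sharing by GPU memory, the GiB figures, and the job names are all assumptions made for this example. Each job goes to the GPU with the least free memory that can still hold it, so that other GPUs stay free for large jobs.

```python
def pick_gpu_binpack(free_mem_gib, request_gib):
    """Return the index of the fullest GPU that still fits the request.

    Returns None when no GPU has enough free memory.
    """
    candidates = [
        (free, idx) for idx, free in enumerate(free_mem_gib) if free >= request_gib
    ]
    if not candidates:
        return None
    # Smallest remaining free memory first; ties go to the lower index.
    _, idx = min(candidates)
    return idx


def place_jobs(gpu_capacity_gib, requests_gib):
    """Place jobs in order and return (placement, remaining free memory)."""
    free = list(gpu_capacity_gib)
    placement = {}
    for job, request in requests_gib.items():
        idx = pick_gpu_binpack(free, request)
        placement[job] = idx  # None means the job could not be placed
        if idx is not None:
            free[idx] -= request
    return placement, free


# Hypothetical node with two 16 GiB GPUs and four jobs (names are made up).
placement, free = place_jobs(
    [16, 16],
    {"train-a": 8, "train-b": 6, "infer-c": 4, "train-d": 12},
)
print(placement)
print(free)
```

In this example the two smaller training jobs share the first GPU, which leaves enough room on the second GPU for the 12 GiB job. A spreading policy would have split the first two jobs across both GPUs and left no GPU able to hold it.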

Three ways to speed up image distribution in the cloud native era

This talk will share practices and lessons for improving the efficiency of image distribution at the scale of Alibaba’s network. We use different image distribution methods for different scenarios. P2P distribution with CNCF/Dragonfly is the most direct way to relieve bandwidth pressure on the image registry and reduce distribution time. In addition, the remote filesystem snapshotter in CNCF/containerd stores the image remotely, letting the container engine read image content over the network so that almost no time is spent on distribution. This second approach depends on network stability, so how do you balance dynamically loading an image from remote to local storage based on read requests for image content? Finally, we will summarize how to choose a suitable image distribution method.

Dynamically adjust Pod resource limits in a web-scale cluster

Topic introduction: For a global e-commerce giant like Alibaba, the number and variety of applications are enormous. Managing the resources of these containers in a sound and principled way has always been a great challenge for us. In this talk, we will share our practical experience and technical results from multiple dimensions, including technology and community evolution. These include:

  • What is the current status of community resource management for containers?
  • What are the specific challenges of Alibaba’s large-scale application deployment?
  • How do we diagnose resource management problems?
  • How can we achieve a significant increase in resource utilization while ensuring a stable online service?
  • How to balance cloud native evolution with rapid delivery of work?
  • How can our experience help you and how can we feed back into the community to achieve win-win outcomes?


Author: Jessie Xiao Jiang

The original link

This article is original content from the Yunqi Community and may not be reproduced without permission.