Author: Wupeng | Source: Alibaba Cloud Native official account

Kubernetes Stability Assurance Manual series

  • Kubernetes Stability Assurance Manual – Minimal edition
  • Kubernetes Stability Assurance Manual – Log topics
  • Kubernetes Stability Assurance Manual – Observability topics

As attention to stability keeps growing and community observability projects gain popularity, observability has become a hot topic, and people understand it differently depending on their perspective.

Starting from the software development life cycle, this article tries to build a high-level understanding of observability, and then develops that understanding and its practice from the perspectives of SRE and Serverless.

Purpose

  • Improve understanding, and with it competitiveness, by grasping the overall picture
  • Open up possibilities for the future through sound design and practice

Goals

  • Reach a shared understanding of observability
  • Reach a shared understanding of the direction observability should take

What is observability?

Definition of Observability:

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Consider a physical system modeled in state-space representation. A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated using only the information from outputs (physically, this generally corresponds to information obtained by sensors). In other words, one can determine the behavior of the entire system from the system’s outputs. On the other hand, if the system is not observable, there are state trajectories that are not distinguishable by only measuring the outputs.

In simple terms, observability is the ability to infer the internal state of a system from its external outputs.
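For the linear time-invariant case, the state-space definition quoted above has a standard algebraic test. The block below states it as a reference point only; the symbols $A$, $B$, $C$, $D$, $x$, $u$, $y$, and $n$ are the usual control-theory notation and do not come from this article.

```latex
\[
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t), \\
y(t)       &= C\,x(t) + D\,u(t), \qquad x(t) \in \mathbb{R}^{n},
\end{aligned}
\]
\[
\mathcal{O} =
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n .
\]
```

Software systems are rarely linear, so the rank test does not apply to them directly. The analogy that carries over is that the outputs $y$ correspond to the observation data a system emits, and the question is whether that data is enough to tell internal states apart.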

The following diagram gives a simplified view of how a system is composed and how systems interact:

As the interaction diagram shows, system behavior takes the following forms:

  • Within a system

    • A component works in a closed loop and does not interact with other components or systems
    • Components interact with each other
  • Between systems

    • Systems interact with each other

Accordingly, the internal state of a system can be inferred from its external outputs through two kinds of information:

  • Closed-loop information from individual components
  • Information that flows between components or between systems

What is the problem domain of observability?

At its core, observability lets different groups of people understand the state of a system through observation data. We first abstract the life cycle of that data, as shown in the following figure:

Observation data is generated by the application, stored after intermediate processing, and then exposed through a query service.

Observation data serves different groups of people, such as product users, business staff, R&D, and SRE. Each group consumes the data in a different form, such as SLAs, SLOs, SLIs, and alerts.
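To make the SLI/SLO/alert forms mentioned above concrete, here is a minimal sketch of an availability objective. The class name `AvailabilitySLO`, its methods, and the 99.9% target are illustrative assumptions, not something the article prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability objective, e.g. 99.9% of requests succeed."""

    target: float  # objective as a fraction strictly between 0 and 1

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self, good: int, total: int) -> float:
        """SLI: the measured fraction of good events in the window."""
        if total <= 0:
            raise ValueError("total must be positive")
        return good / total

    def error_budget_remaining(self, good: int, total: int) -> float:
        """Unspent share of the error budget; negative once it is overspent."""
        if total <= 0:
            raise ValueError("total must be positive")
        allowed_bad = (1.0 - self.target) * total
        actual_bad = total - good
        return 1.0 - actual_bad / allowed_bad

    def should_alert(self, good: int, total: int) -> bool:
        """Alert when the measured SLI falls below the objective."""
        return self.sli(good, total) < self.target


slo = AvailabilitySLO(target=0.999)
print(slo.sli(99_950, 100_000))           # measured SLI for the window
print(slo.should_alert(99_800, 100_000))  # SLI below target, so alert
```

In this sketch the SLI is what gets measured, the SLO is the internal target it is compared against, and an alert is one way of consuming the comparison. An SLA would add contractual consequences on top, which is outside what code like this models.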

Following the life cycle of observation data, the problem domains of observability can be roughly summarized as:

  • Generation side

    • Data model of observation data
    • Generation of observation data
    • Export of observation data
  • Processing side

    • Collection of observation data
    • Processing of observation data
    • Export of observation data
  • Storage side

    • Storage of observation data
    • Query of observation data
    • Use of observation data
  • Consumption side

    • Consumption of observation data
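The generation, processing, storage, and consumption stages listed above can be sketched as a tiny in-memory pipeline. Everything here (`Observation`, `InMemoryStore`, `add_cluster_label`, `run_pipeline`, and the metric and label names) is hypothetical and only illustrates how the stages hand data to each other.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Observation:
    """Generation side: data model for one observation (hypothetical shape)."""

    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


Processor = Callable[[Observation], Observation]


def add_cluster_label(cluster: str) -> Processor:
    """Processing side: enrich each observation before it is stored."""

    def process(obs: Observation) -> Observation:
        obs.labels.setdefault("cluster", cluster)
        return obs

    return process


class InMemoryStore:
    """Storage side: keeps observations and answers simple label queries."""

    def __init__(self) -> None:
        self._rows: List[Observation] = []

    def write(self, obs: Observation) -> None:
        self._rows.append(obs)

    def query(self, name: str, **labels: str) -> List[Observation]:
        return [
            o for o in self._rows
            if o.name == name
            and all(o.labels.get(k) == v for k, v in labels.items())
        ]


def run_pipeline(generated: Iterable[Observation],
                 processors: List[Processor],
                 store: InMemoryStore) -> None:
    """Pass every generated observation through the processors, then store it."""
    for obs in generated:
        for process in processors:
            obs = process(obs)
        store.write(obs)


store = InMemoryStore()
run_pipeline(
    [Observation("http_requests_total", 1, {"app": "web"}),
     Observation("http_requests_total", 1, {"app": "api"})],
    [add_cluster_label("demo")],
    store,
)
# Consumption side: a dashboard or alert rule would issue queries like this.
web_rows = store.query("http_requests_total", app="web")
print(len(web_rows), web_rows[0].labels)
```

A real deployment would replace each piece with dedicated infrastructure, but the boundaries stay the same: the data model is fixed where data is generated, enrichment happens in processing, and consumers only ever see what the query interface exposes.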

What is the service goal of observability in the software development life cycle?

Viewed from the perspective of a project as a whole, the software development life cycle consists of the following processes:

Broken down further:

The software development life cycle involves four types of roles, and the service goal of observability differs for each of them:

Note:

  • Reliability and stability are not the same thing: reliability covers stability plus meeting functional requirements in a timely manner

Directions for SRE practice

Basic Services:

OpenTelemetry can serve as the foundation for implementing the above. For details, see “Brief Analysis of OpenTelemetry”.

In addition, a visual stability assurance service is worth exploring. It speeds up discovering, locating, and resolving problems from a global perspective by showing the health of each component and of the interactions between components in the cluster in a single graph, as shown in the following figure:

With this graph as the entry point, we can grasp the state of the cluster as a whole, correlate abnormal information, and handle problems with a clear target.
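One way to picture the single-graph view described above is a small data structure in which components are nodes and interactions are directed edges, each carrying a health flag. The class `ClusterHealthGraph` and its methods are hypothetical, and the health values in the usage example are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ClusterHealthGraph:
    """Components are nodes; interactions between components are directed edges."""

    component_ok: Dict[str, bool] = field(default_factory=dict)
    interaction_ok: Dict[Tuple[str, str], bool] = field(default_factory=dict)

    def set_component(self, name: str, healthy: bool) -> None:
        self.component_ok[name] = healthy

    def set_interaction(self, src: str, dst: str, healthy: bool) -> None:
        # Register both endpoints so an edge never points at an unknown node;
        # an already recorded component health is left unchanged.
        self.component_ok.setdefault(src, True)
        self.component_ok.setdefault(dst, True)
        self.interaction_ok[(src, dst)] = healthy

    def unhealthy_components(self) -> List[str]:
        return sorted(n for n, ok in self.component_ok.items() if not ok)

    def unhealthy_interactions(self) -> List[Tuple[str, str]]:
        return sorted(e for e, ok in self.interaction_ok.items() if not ok)

    def related_to(self, name: str) -> Set[str]:
        """Components that interact with `name` directly: where to look next."""
        related: Set[str] = set()
        for src, dst in self.interaction_ok:
            if src == name:
                related.add(dst)
            elif dst == name:
                related.add(src)
        return related


graph = ClusterHealthGraph()
graph.set_component("etcd", healthy=False)
graph.set_interaction("kube-apiserver", "etcd", healthy=False)
graph.set_interaction("kube-scheduler", "kube-apiserver", healthy=True)

print(graph.unhealthy_components())
print(graph.unhealthy_interactions())
print(graph.related_to("kube-apiserver"))
```

Keeping component health and interaction health as separate maps mirrors the distinction the text draws between the "component itself" and the "interaction between components", and `related_to` is the step that associates an anomaly with its neighbors.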

Observability in the Serverless scenario

Serverless is currently a promising form of cloud computing. Alibaba Cloud offers a fairly complete set of Serverless computing products, as follows:

One major difference between Serverless computing environments is how long the runtime environment lives. Taking this as the starting point, the core of observability in a Serverless computing environment can be abstracted and then broken down into corresponding solutions:

By the duration of the runtime environment, these environments fall roughly into three categories:

  • Day level
  • Hour level
  • Minute or second level

These runtime environments can be implemented with technologies such as virtual machines, containers, or WebAssembly; what distinguishes them is the lifetime that the business imposes on the runtime environment.

The core concerns of the platform and of users shift with the duration of the runtime environment:

  • At the day level, the platform’s core job is to provide a reliable runtime environment and let users manage their applications freely

    • For observability, the platform focuses on runtime environment reliability, while users focus on application environment stability and request/response performance
  • At the hour level, the platform’s core job is to provide management services around the application, while users focus on the business itself

    • For observability, the platform focuses on application stability and request/response performance, while users focus on business characteristics
  • At the minute or second level, the platform’s core job is fine-grained management of user business logic, while users focus on the sensitive characteristics of the business

    • For observability, the platform focuses on request/response reliability and business characteristics, while users focus on core business characteristics
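The mapping above can be written down as a small lookup. The focus strings paraphrase the list; the one-day and one-hour cut-offs in `classify` are illustrative assumptions, since the article names the three levels without giving exact boundaries, and all identifiers are hypothetical.

```python
from datetime import timedelta
from enum import Enum
from typing import Dict, Tuple


class DurationLevel(Enum):
    DAY = "day"
    HOUR = "hour"
    MINUTE_OR_SECOND = "minute-or-second"


# (platform focus, user focus) for observability at each level.
OBSERVABILITY_FOCUS: Dict[DurationLevel, Tuple[str, str]] = {
    DurationLevel.DAY: (
        "runtime environment reliability",
        "application environment stability and request/response performance",
    ),
    DurationLevel.HOUR: (
        "application stability and request/response performance",
        "business characteristics",
    ),
    DurationLevel.MINUTE_OR_SECOND: (
        "request/response reliability and business characteristics",
        "core business characteristics",
    ),
}


def classify(lifetime: timedelta) -> DurationLevel:
    """Bucket a runtime environment by how long it lives.

    The thresholds are assumptions made for this sketch only.
    """
    if lifetime >= timedelta(days=1):
        return DurationLevel.DAY
    if lifetime >= timedelta(hours=1):
        return DurationLevel.HOUR
    return DurationLevel.MINUTE_OR_SECOND


level = classify(timedelta(seconds=30))
platform_focus, user_focus = OBSERVABILITY_FOCUS[level]
print(level.value, "|", platform_focus, "|", user_focus)
```

Reading the table row by row shows the pattern in the list: as the runtime environment gets shorter-lived, what the user cared about at one level becomes the platform's responsibility at the next.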

For the FaaS scenario, Thundra’s demo offers a good reference. Three examples are selected:

  • function

  • application

  • architecture

Summary

A thorough understanding of the concept of observability, its problem domains, and the requirements at different levels yields a big picture of observability. On that basis, observability can be combined with the business to make the business more competitive. Refining this understanding iteratively also lets technology and business reinforce each other.

References

  • Wikipedia: Observability
  • Wikipedia: Service-level objective
  • Wikipedia: Service-level agreement
  • Wikipedia: Service level
  • Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
  • conprof – Continuous Profiling
  • OpenTelemetry Proposal issues: Adding profiling as a support event type
  • Kubernetes scalability and performance SLIs/SLOs
  • From DevOps to NoOps: a discussion of how to put Serverless technology into practice

Feel free to leave a comment to discuss stability problems you have run into while using Kubernetes, or stability tools and services you would like to see. You can also contact the author by email: [email protected].