Overview: Most cloud computing products today are tied to cloud native technology, which is reshaping the entire software lifecycle. But what exactly is cloud native? What are the biggest technological innovations and future opportunities it brings? Can we build a development and O&M (operation and maintenance) system on the cloud, create a new generation of R&D platform, and maximize R&D efficiency around the cloud?

Author | Shen Xiu    Source | Alibaba Technology official account

We invited Shen Xiu, head of the Alibaba Cloud Yunxiao R&D platform, to share the process and methodology behind building an efficient R&D and O&M system. The article has three parts. First, starting from the problems, it analyzes what a team may run into as its business gradually expands and how those problems affect team effectiveness. Second, with those problems in mind, it looks at what kind of efficiency system can meet a team's demand for better efficiency. Finally, it summarizes some of the efficiency improvement methods of the Alibaba Cloud Yunxiao team.

Part 1. Factors influencing team effectiveness

1. The impact of growing team size on effectiveness

First, consider how growth in headcount affects efficiency. When a company is just starting out, a full-function team has perhaps ten or twenty people. The division of labor is not yet clear, and everyone works in a very agile way, covering for one another: engineers take on some product work, and developers do testing and O&M. The team works closely together with little communication loss, and the bottleneck is usually individual ability. To deliver business requirements faster, a start-up team choosing efficiency tools cares most about single-point efficiency, such as a handy pipeline tool or testing tool, and the first consideration is how easy the tool is to pick up.

As the team grows, the division of labor becomes specialized and problems of cross-functional coordination begin to emerge. How to cooperate, how to assign rights and responsibilities, and what the collaboration process looks like become the team's main concerns. At this point individual ability no longer decides whether the product succeeds or fails; the key issue is how to raise the team's median ability. Teams therefore tend to choose efficiency tools that come with a solution, covering for example branch management, test environment management, and how DevOps is implemented. Such tools can greatly improve transparency between teams and make communication more efficient. For example, a branch management model is chosen to solve communication problems between the development and test teams, while the DevOps model hands most O&M work to the development team itself, improving efficiency by reducing communication.

As the business expands further and products with clear business boundaries emerge, the cost of communication and collaboration is magnified again, and people pay more attention to goals, consensus, and results. A campaign mode can carry goals, consensus, and results; it is a very good way to pool people and improve execution efficiency from the top down. It is also important to realize, however, that campaigns cannot solve every cross-product and cross-team coordination problem. The key is how to handle resource allocation and business-technology communication in day-to-day work.

2. The impact of software services architecture on R&D performance

Next, consider how service architecture affects R&D effectiveness. Service architecture is in fact strongly correlated with organizational structure. Under a flat architecture, for example, teams are independent and only loosely related to one another, but they have a high self-sufficiency rate, meaning the ability to fulfill a given requirement on their own.

Under a networked architecture, the organization is usually integrated: it is led by the same department head, and teams cooperate closely. At this stage the architecture is complex and lacks abstraction. Because the business process is still relatively simple, however, it is not much of a problem for teams to settle requirements through point-to-point communication: the decision chain is short and consensus is fast. On the other hand, technical debt accumulates, and the business modules eventually become so coupled that the cost of maintaining the debt begins to outweigh the cost of delivering new requirements. The middle platform architecture is one way to solve this problem.

In the middle platform model, business modules begin to be abstracted, and the technical side also needs to set up a technical middle platform so that the tools originally held by individual teams converge and processes are unified. However, as the front-end business teams and the middle platform divide the work and plan their roadmaps independently, new problems appear: departmental walls, a low self-sufficiency rate for front-end businesses, and difficulty reaching consensus on priorities and delivery dates.

By analyzing these three architectures across product, technology, and organization, you can understand the efficiency dilemmas that teams face as they evolve.

3. Efficiency changes brought about by technological evolution

Having said that, let's look at how the evolution of technology affects R&D effectiveness, starting with a quick look at some of the technological changes of recent years. Concepts such as microservices, continuous delivery, and DevOps emerged from around 2008 onward and remain in use today. Over the same period, Alibaba carried out a service-oriented transformation of its core e-commerce systems. It later found that as the number of services grew, so did the management problems, and only DevOps could remove the bottleneck and release productivity. There is an internal logic here: business drives technology change, technology drives architecture change, and architecture drives change in the R&D model.

The Kubernetes (K8s) ecosystem, which has flourished in recent years, follows roughly the same pattern. New technologies have created many new architecture models, such as Serverless and mini programs. These new architectures in turn pose major challenges to the existing R&D model. In the Function as a Service model, for example, how should code branches and environments be managed? Will testing tools and methods change? Will the test team's responsibilities change? You can also imagine a future in which the number of services explodes and architectural complexity grows beyond human control, and ask what tools will be needed to solve the efficiency problems of that time.

4. Constraints on R&D efficiency

Combining the analysis above from the three angles of people, architecture, and technology, and extracting the key factors, we get a loop. Its three key factors are cost, people, and collaboration between people. Cost cannot grow indefinitely, so it is the key constraint in the loop. Because people differ in ability, no architecture or organization can be perfect, and a great deal is lost to collaboration overhead. As mentioned earlier, technical debt accumulates and collaboration costs tend to rise over time, consuming more labor and, under a fixed cost constraint, leaving less labor for the business. The loop then falls into what this article calls negative feedback: things keep getting worse. Hence the need to address R&D effectiveness.

Technology is often used to equip people and raise the ceiling of what each individual can do, which I think is an important breakthrough point. Next, a collaboration process adapted to the team's current organization and architecture is needed to reduce waste. Note that this often brings only incremental improvement; it is difficult to change the situation fundamentally without changing the underlying architecture and organizational patterns. Finally, tools can make work more efficient: what used to be done by hand is now automated, freeing up more time to focus on delivering business value.

This three-pronged approach can effectively drive the loop into positive feedback: team efficiency rises, skills improve faster, collaboration becomes smoother, and as the business does well, more can be invested in people.

In its own practice, Alibaba found itself adjusting these elements continually: when it hit a bottleneck, it invested in improvement, broke out of negative feedback into rapid growth, and then hit the next bottleneck.

Improving or solving these problems systematically requires a suitable system of efficiency tools.

Part 2. Building an efficiency tool system

1. Three typical R&D teams

In our practice, R&D teams fall into the following three typical types.

  • The first is front-end and back-end application development, with e-commerce and SaaS as typical forms. This kind of business is easy to standardize on the engineering side, and its tooling is relatively mature. In particular, the development of cloud native technology lets the business shift its focus upward, while the underlying technology increasingly moves to the cloud and becomes a black box.
  • The second is the development of underlying foundational software, characterized by simple user interaction but great technical depth and complexity. This software tends to be stateful and depends heavily on hardware infrastructure, which makes it difficult to standardize on the O&M side. On the development side, the technology stack is complex and many people work on a single module at once, so iteration cannot easily be accelerated by decoupling services as in front-end and back-end applications. This also gives rise to additional problems such as branch management and binary version management. These differences in the development and operation stages lead to a different tool system.
  • The third is the development of large-scale software delivered offline, represented by hybrid cloud and industry software. Because the systems are tightly coupled and each customer's proprietary environment adds further variables, this type places high demands on multi-team collaboration and on the ability to deliver the O&M system itself. Compared with the first type, it pays special attention to version management, integration and upgrades, and remote O&M capabilities.

2. A layered efficiency system for complex collaboration scenarios

Faced with different R&D scenarios and different priorities, the efficiency system needs to be layered and abstracted. The whole system can be divided into four layers, from bottom to top: the base, the tool layer, the collaboration layer, and the scenario layer.

In the base layer, attention should go to accumulating data on the core product and R&D assets, to ensure data consistency across the whole system. The core objects of the R&D system are usually extracted and pushed down into this layer, such as teams, projects, applications, code, and products.

Above it is the tool layer, the most critical one, defined as automated means of solving single-point problems. Openness and ease of integration should be a tool's most important capabilities, which is what the often-cited API First principle refers to.

Above that is the collaboration layer, which focuses on information flow between people and on moving the collaboration process online in a standardized form. By abstracting the collaboration patterns of different domains and chaining single-point tools together, it lets users complete a whole piece of work online.

Generality, configurability, and user experience sometimes conflict, so scenario-level products are needed to provide a finely tuned user experience in each field. This has been the industry trend in recent years: general-purpose R&D platforms continue to mature and deepen, while scenario-specific R&D platforms continue to emerge and, by integrating the capabilities of lower-layer tools, quickly cover specialized R&D scenarios.
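The four layers can be sketched in a few lines of Python. This is only an illustration of the layering idea, not how any real platform is implemented: the class names, the three tools, and the two scenarios are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Base layer: shared core objects, so every tool works on the same data.
@dataclass
class Application:
    name: str
    team: str


@dataclass
class WorkContext:
    app: Application
    log: List[str] = field(default_factory=list)


# Tool layer: each tool solves one single-point problem and exposes a plain
# callable interface (a stand-in for the "API First" idea).
Tool = Callable[[WorkContext], WorkContext]


def build(ctx: WorkContext) -> WorkContext:
    ctx.log.append(f"build {ctx.app.name}")
    return ctx


def run_tests(ctx: WorkContext) -> WorkContext:
    ctx.log.append(f"test {ctx.app.name}")
    return ctx


def deploy(ctx: WorkContext) -> WorkContext:
    ctx.log.append(f"deploy {ctx.app.name}")
    return ctx


# Collaboration layer: chains single-point tools into one online workflow.
def run_workflow(steps: List[Tool], ctx: WorkContext) -> WorkContext:
    for step in steps:
        ctx = step(ctx)
    return ctx


# Scenario layer: preconfigured workflows for specific R&D scenarios.
# Both entries are illustrative assumptions.
SCENARIOS: Dict[str, List[Tool]] = {
    "web_app": [build, run_tests, deploy],
    "base_software": [build, run_tests],
}

ctx = run_workflow(SCENARIOS["web_app"], WorkContext(Application("shop", "team-a")))
print(ctx.log)
```

Because every tool takes and returns the same base-layer object, a new scenario is just a new list of existing tools, which is the sense in which scenario platforms can cover new ground quickly by integrating lower-layer capabilities.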

Yunxiao (Cloud Efficiency) is currently building its R&D tool system along these layered lines, and hopes to bring more developers into the system to build this complex ecosystem together.

3. Each team customizes its own efficiency plan

Beyond the standardized R&D process system that the platform provides, each team should have its own efficiency plan that fits its culture and habits. Customization can be offered at three levels.

The first is the team workbench, the team's knowledge base and collaboration space. It provides multiple views for browsing work status, to-do lists, progress, and so on, along with a number of management tools for the team leader.

The other two are the team's collaboration processes and its tools. Teams are encouraged to study efficiency improvement and team management methods in depth, tailor them to the team's current situation and build them into the system, and even create new tools better suited to their business, so as to gradually release the team's productivity potential.

A unified platform can hold the lower limit of team effectiveness, but the upper limit must be broken through by the team itself.

4. Suggestions for further efficiency improvement

Based on the above analysis, the author puts forward the following three suggestions:

  • First, the team needs to look at efficiency across the whole chain of goals, business, product, and R&D process. Consider a question: when testing becomes the delivery bottleneck, is the test team solely responsible? Not necessarily. The cause may be incomplete analysis of user flows on the requirements side, poor delivery quality from the development team, or an architecture with weak testability; all of these add to the test team's burden until it becomes the bottleneck. Team leaders therefore need to think end to end, master the methods, and keep a macro view, rather than plugging holes one at a time.
  • Second, the team needs to take responsibility for its own effectiveness and be the first party accountable. You know your team best and can take the most effective measures.
  • Third, improving the team's product design and technical ability, reducing technical debt, and building quality in are all very important for efficiency. An efficiency tool system can only provide the most basic guarantee; making team effectiveness healthier means starting from the most basic software engineering details and improving step by step. Here there is no silver bullet.

Part 3. The evolution of efficiency methods

1. From an emphasis on tools and processes to an emphasis on value delivery

When teams begin to specialize, they become more professional and, from an organizational perspective, use resources more efficiently. From the perspective of delivering business value, however, the cycle time is very long, with waiting between stages.

It follows that local efficiency does not mean efficient delivery of business requirements. There are many tools and means for improving local efficiency; it is a relatively well-bounded problem, and even overtime can make up for a shortfall. Efficiently delivering business value that users can perceive is not so easy, as the figure above illustrates. Nor does one efficient delivery mean that delivery will stay efficient, because nothing guarantees that the organization, architecture, and process in use are always the globally optimal match, and there may not even be a mechanism for finding bottlenecks. The question of business success naturally goes unanswered as well, because the business team is too far removed from the product and R&D team. This departmental wall keeps the product and R&D team from thinking about and understanding how their output relates to business success.
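The gap between local efficiency and delivery efficiency can be made concrete with flow efficiency, a standard lean metric that the article does not name itself: the share of total lead time during which someone is actually working on the requirement. The sketch below is a minimal illustration; the stage names and day counts are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    work_days: float  # time someone actively works on the requirement
    wait_days: float  # time the requirement queues before this stage starts


def lead_time(stages: List[Stage]) -> float:
    """Total elapsed time from request to delivery."""
    return sum(s.work_days + s.wait_days for s in stages)


def flow_efficiency(stages: List[Stage]) -> float:
    """Fraction of the lead time spent on actual work."""
    total = lead_time(stages)
    if total == 0:
        return 0.0
    return sum(s.work_days for s in stages) / total


# Hypothetical requirement: every stage is locally quick, yet most of the
# elapsed time is spent waiting between stages.
requirement = [
    Stage("design", work_days=2, wait_days=0),
    Stage("development", work_days=4, wait_days=3),
    Stage("testing", work_days=3, wait_days=7),
    Stage("release", work_days=1, wait_days=5),
]

print(lead_time(requirement), flow_efficiency(requirement))
```

In this made-up case the teams spend 10 days working while the requirement takes 25 days to deliver, so speeding up any single stage changes little compared with removing the waits.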

2. Achieve end-to-end visible business value

The author therefore believes that the first step in improving efficiency is to make business value visible end to end. There are several paths from the business team to the product and R&D team. The first is to build collaboration links from the perspective of the business value stream. In the past, project management software was used to solve collaboration within the product and R&D team, with requirements, defects, tasks, and so on organized by product or by team. The new system should include the business team and connect business value with product requirements and development tasks, so as to achieve end-to-end transparency and visibility.

The second is still fundamental: adopt automation tools widely on the product and R&D side, and link the data those tools produce to the value stream and to the data platform as far as possible. Simple measures can help, such as what percentage of work is done online and whether a unified data model is used to accumulate data.

After the first two steps, there remains the matter of aligning the goals of the business, product, and technical teams: what the priorities of the business requests are, what the deadlines are, and where the bottlenecks are, with real-time tracking along the way. The person in charge of each stage can then notice abnormal events and resource bottlenecks and start resolving them immediately, which is how efficiency is achieved.

The third step is to sustain efficiency. This requires quantitative analysis based on the accumulated data, and it is here that the data shows its value. Decisions become better informed: which team is accumulating debt, which team is accumulating assets, which team is the choke point, and whether to adjust the architecture or the organizational division of labor.
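As a minimal sketch of this kind of quantitative analysis, the snippet below averages the time work items spend in each stage and reports the slowest stage as the likely choke point. The record format and the numbers are assumptions made for illustration; a real value stream would need richer data and more careful statistics.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical record: (work item id, stage, days spent in that stage).
Record = Tuple[str, str, float]


def average_days_per_stage(records: List[Record]) -> Dict[str, float]:
    """Average time spent in each stage across all work items."""
    by_stage: Dict[str, List[float]] = defaultdict(list)
    for _item, stage, days in records:
        by_stage[stage].append(days)
    return {stage: mean(values) for stage, values in by_stage.items()}


def choke_point(records: List[Record]) -> str:
    """The stage with the highest average time, a candidate bottleneck."""
    averages = average_days_per_stage(records)
    return max(averages, key=averages.get)


records = [
    ("R1", "development", 4.0),
    ("R1", "testing", 9.0),
    ("R2", "development", 6.0),
    ("R2", "testing", 7.0),
    ("R2", "release", 2.0),
]

print(average_days_per_stage(records))
print(choke_point(records))
```

A result like this only says where time accumulates, not why. As the earlier testing example shows, the cause of a slow stage may lie upstream.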

3. ALPD: a new generation of lean product development

Based on the above analysis, a methodology can be constructed by combining lean thinking, cloud thinking, architecture design thinking, and other perspectives.

The blue part of the diagram is the focus of this article, and it has three parts. The first is full-link digital lean collaboration, which addresses collaboration between business, product, and technology. The second is technology practice with the domain at its core, which addresses increasingly complex architecture. The third is cloud native engineering practice, which further releases the dividends of cloud native to every business developer.

4. Full-link lean collaboration

First, full-link lean collaboration. It is called full-link because the method includes multiple roles: business, product, and technology. The key idea is layering into business, product, and technology, which correspond to business and goal management, requirements and product management, and the team delivery view, respectively.

Under this model, a series of efficient tools for moving work online lets as much work as possible be done online, keeps data connected and transparent along the core value stream, and ultimately achieves the goal of lean collaboration.

5. Domain-centric technology practices

Next, the domain-centric technology practices. There are three parts, covering analysis, architecture, and implementation: business-led domain modeling, domain-driven microservices architecture, and contract-oriented software implementation.

The design of the domain model is the core of product and architecture design. Good design readily solves the coupling of change, testing, and delivery across technical teams, improves the system's testability and maintainability, and, through anti-corruption design (anti-corruption layers), reduces the impact of technical debt on the whole system.
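An anti-corruption layer can be sketched as a small adapter that translates a legacy data format into the domain model, so that the rest of the code depends only on a contract. The example below is a hypothetical illustration: the `Order` model, the legacy field names `AMT` and `STS`, and the helper function are all invented for the sketch.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Protocol


# Domain model: the clean vocabulary the business code works with.
@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int
    paid: bool


# Contract: business code depends on this interface, not on any one system.
class OrderSource(Protocol):
    def get_order(self, order_id: str) -> Order: ...


class LegacyOrderAdapter:
    """Anti-corruption layer: the only place that knows the legacy format."""

    def __init__(self, legacy_rows: Dict[str, Dict[str, str]]) -> None:
        self._rows = legacy_rows

    def get_order(self, order_id: str) -> Order:
        row = self._rows[order_id]
        return Order(
            order_id=order_id,
            amount_cents=int(Decimal(row["AMT"]) * 100),  # "19.90" -> 1990
            paid=row["STS"] == "P",  # hypothetical legacy status code
        )


def outstanding_cents(source: OrderSource, order_id: str) -> int:
    """Business logic written purely against the domain model."""
    order = source.get_order(order_id)
    return 0 if order.paid else order.amount_cents


legacy = LegacyOrderAdapter({
    "A1": {"AMT": "19.90", "STS": "N"},
    "A2": {"AMT": "5.00", "STS": "P"},
})
print(outstanding_cents(legacy, "A1"), outstanding_cents(legacy, "A2"))
```

If the legacy system is later replaced, only the adapter changes; `outstanding_cents` and its tests stay as they are, which is how this kind of design limits the reach of technical debt.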

6. Cloud native engineering practice

Finally, cloud native engineering practice. The diagram divides engineering practice into three layers: immutable infrastructure at the bottom, the continuous delivery pipeline in the middle, and the quality assurance system at the top.

The focus is in the middle, shown in red: the GitOps Engine, the engine that fully implements an application-centric IaC (Infrastructure as Code) system. The author believes that IaC design is a major refactoring of the cloud's operation interface and of the way developers use it. By bringing more customization capabilities into code, in the form best suited to developers, the benefits of cloud native technology can be released further.
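The article does not describe how the GitOps Engine works internally. The general GitOps idea, though, is a reconciliation loop: the desired state is declared in a Git repository, and an engine repeatedly compares it with the actual state and applies the difference. The sketch below shows that idea under simplifying assumptions; state is modeled as plain dictionaries, and the resource names and fields are hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, Dict[str, object]]  # resource name -> declared spec
Action = Tuple[str, str]              # (verb, resource name)


def plan(desired: State, actual: State) -> List[Action]:
    """Compute the actions that would bring `actual` in line with `desired`."""
    actions: List[Action] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the plan and return the new actual state (inputs are not mutated)."""
    new_state = {name: dict(spec) for name, spec in actual.items()}
    for verb, name in plan(desired, actual):
        if verb == "delete":
            del new_state[name]
        else:
            new_state[name] = dict(desired[name])
    return new_state


# Desired state as it might be declared in a Git repository (hypothetical).
desired = {
    "web": {"image": "web:2", "replicas": 3},
    "worker": {"image": "worker:1", "replicas": 1},
}
# Actual state currently running (hypothetical).
actual = {
    "web": {"image": "web:1", "replicas": 3},
    "cron": {"image": "cron:1", "replicas": 1},
}

print(plan(desired, actual))
print(reconcile(desired, actual) == desired)
```

Because the loop is driven entirely by declared state, changing the environment becomes a code change that can be reviewed, versioned, and rolled back like any other commit.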

This article is original content from Alibaba Cloud and may not be reproduced without permission.