How to refactor an architecture while keeping the business running: lessons from my experience at Ali Games

What is the most painful thing in the world for a programmer?

Some people say: the product team changing the requirements while you are coding!

Some people say: reading other people's code!

Some people say: having to track down, every now and then, a once-in-a-century production bug!

Some people say: not being able to find a girlfriend (or boyfriend)!

However, I would say that none of these is the worst, because they only cost extra time (changing requirements, reading code) or extra effort (finding a partner, locating a difficult bug). The most painful job is one where you have to solve technical problems, coordinate resources, keep the business on track, and still meet the target within the specified time. In short, you have to master all eighteen martial arts.

That job is called "architecture refactoring", and even more painful than architecture refactoring is "refactoring the architecture while doing business"! Our product colleagues vividly describe it as "changing the engine of a speeding Ferrari". Why do they say that?

First, the business must not stop. Business development cannot be put on hold for the sake of architecture refactoring. If the Ferrari stops to have its engine changed, the other cars will pull away.

Second, the business must not break. The business cannot become unable to run because of the refactoring. If the Ferrari breaks down because of the repair, the other cars will also pull away.

Third, the problem has to be fixed at its root rather than patched. The job is not to add a little oil to the Ferrari's engine and clean it, but to put in a new engine.

As it happens, since I joined UC I have refactored the architecture of three systems (earning the nickname "Fire Captain"), and each system had different characteristics. Along the way I ran into all kinds of problems, fell into a few pits, and accumulated some experience, which I would like to share here.

Aim at a clear target

The first system I took over was a back-end system responsible for managing the game-related data of Ali Games (hereinafter referred to as the M system). The main reason for refactoring it was that the system coupled data specific to the P business with data common to all businesses, which resulted in poor scalability. Its general structure is as follows:

Take the simplest example: in one database table, some fields held "game data" common to all businesses, while other fields held data specific to the "P business system". Any change to this table during development involved very complicated code and logic, and efficiency was very low.

Given these problems in the M system, our refactoring goal was to split game data from business data and remove the coupling between them, so that both systems could evolve independently and quickly.
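To make the split concrete, here is a minimal sketch in Python. It is not the actual M system schema: every class and field name below (`CoupledGameRow`, `p_promotion_slot`, and so on) is a hypothetical stand-in, chosen only to show one row that mixes shared game data with P-business data being separated into two records linked by the game ID.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CoupledGameRow:
    """Before: one row mixes shared game data with P-business data (hypothetical fields)."""
    game_id: int
    name: str
    category: str
    p_promotion_slot: str      # only meaningful to the P business
    p_commission_rate: float   # only meaningful to the P business


@dataclass
class GameData:
    """After: data common to all businesses, owned by the game-data system."""
    game_id: int
    name: str
    category: str


@dataclass
class PBusinessData:
    """After: data specific to the P business, linked only by game_id."""
    game_id: int
    promotion_slot: str
    commission_rate: float


def split_row(row: CoupledGameRow) -> Tuple[GameData, PBusinessData]:
    """Split one coupled row into a shared record and a P-business record."""
    game = GameData(row.game_id, row.name, row.category)
    p_data = PBusinessData(row.game_id, row.p_promotion_slot, row.p_commission_rate)
    return game, p_data
```

After such a split, adding a P-specific field touches only `PBusinessData`, and other businesses that read `GameData` are unaffected.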

The refactoring plan is as follows:

The effect of the refactoring was obvious: the number of versions that the M system and the P business back-end system released per month was four times what it had been before.

The second system I took over was the core system responsible for game access (hereinafter referred to as the S system). If the S system failed, a large number of players could not log in to their games, and the S system had no multi-center capability: if the main machine room went down, the entire S system business became unavailable. The main database was also a global single point. If the main database became unavailable, the write services of both clusters became unavailable.

Given these problems in the S system, our refactoring goal was a dual-center design, in which either machine room can provide the complete service and, when one machine room fails, the other can take over all of the business.

The refactoring plan is as follows:

After the refactoring, the availability of the system rose from three nines to four nines. In the worst month before the refactoring there were 4 major online failures. After the refactoring we still went through a machine-room switch failure, carrier line failures, cabinet power outages, and other problems, but none of them had a major impact on the business.
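For readers less familiar with the "nines" shorthand, the difference can be worked out with simple arithmetic. The sketch below assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Maximum downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001, i.e. 99.9% available
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Three nines allows roughly 526 minutes (under 9 hours) of downtime per year, while four nines allows only about 53 minutes, so moving from one to the other cuts the downtime budget by a factor of ten.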

The third system I took over supported an innovation business (hereinafter referred to as the X system). Because the business was new, during its early trial-and-error period and its rapid growth the team did whatever was most convenient and fastest, and invested little time or energy in system design. Many things were put into the same system, to the point where it became almost impossible to change: adding a new feature or a new business required a lot of time to discuss and untangle all kinds of business logic, and it was easy to fall into a big pit if you were not careful. The architecture of the X system is as follows:

The X system's problem looks similar to the M system's, since both are scalability problems, but the root causes differ. The M system lacked scalability because it coupled the data of different businesses, whereas the X system lacked scalability because every business-related function lived in the same system. In addition, with all functions in one system, a failure in one function could make the whole site unavailable. For example, if one feature slows down the database, the entire site slows down.

Given these problems in the X system, our refactoring goal was to split the functions into separate subsystems and reduce the complexity of each individual system. The refactored architecture looks like this (only an illustration; the actual architecture is much more complex):

After the refactoring, the systems interact with one another through interfaces. Although this appears to add interface work, overall each system is developed and evolves much faster than before, each system is simpler, and a problem in one subsystem no longer causes problems for every business.

In retrospect, these three refactoring plans may seem self-evident, but the analysis and decisions at the time were far from simple.

Taking the M system as an example, we ran into many problems after taking it over, such as:

Data was often wrong;
The M system ran on a single machine, and when that machine went down no back-end operation could be carried out;
Performance was poor, and some operations took a long time;
The interface was ugly and the operations were not user-friendly;
The system had changed hands several times, so the code was rather chaotic;
Business data and game data were coupled, making development inefficient.

Picking the refactoring goal out of so many problems is not easy, and if we had tried to solve all of them, we would not have had enough people or time!

Therefore, the primary task of architecture refactoring is to identify, from a large number of tangled problems, the ones that truly need to be solved through architecture refactoring, and to focus on solving them quickly, rather than trying to solve every problem that way. Otherwise you end up with too few people doing too many things: the team works itself to exhaustion for the better part of a year, only to find that everything seems to have been touched but every problem is still there. Especially if you are an architect or technical lead who has just taken over a new system, it is important to resist the newcomer's urge to make a show of sweeping changes, and to avoid wholesale or campaign-style refactoring and optimization.

Does that mean we ignore the other problems we found? Of course not. Taking the M system as an example, after the refactoring was finished we started several optimization projects to address them. By then the optimization could mostly be completed within the team, with little dependence on other teams, so it went very fast. Had we optimized without refactoring first, each optimization would have required discussing the solution with a large number of related business teams, which is very inefficient!

Build vertical alliances

Architecture refactoring is a major operation that takes a long time and consumes a certain amount of R&D resources, including development and testing, so it inevitably affects the development of business functions. Therefore, it takes a lot of lobbying and communication to really get an architectural refactoring project off the ground. Note that I’m not talking about office politics here, but about communicating with stakeholders so that they can reach a consensus about refactoring and avoid unnecessary back and forth.

The idea is simple; the hard part is how to do it!

When typical technical colleagues talk about architecture refactoring, they throw out a string of technical terms: scalability, reliability, performance, coupling, messy code. In my experience, communicating with non-technical colleagues this way is like a chicken talking to a duck. Colleagues without a technical background find it hard to understand and may even suspect that we are trying to fool them. For example:

A technical colleague says: our system's scalability is too poor now; nothing can be changed any more!

The product colleague thinks: scalability? Is that related to chest-expansion exercises? What is there to extend, and why can't it be changed? It is just writing code in one more place.

A technical colleague says: our reliability is too poor; we only have three nines, while the industry has four!

The project manager thinks: what are three nines? Four nines minus three nines is just one nine. What does that have to do with reliability?

A technical colleague says: our system design is unreasonable; business A and business B are coupled!

The operations colleague thinks: coupling? Is that some kind of lotus root? (In Chinese the two words sound alike.) Business A and business B depend on each other anyway, so why is coupling unreasonable?

These examples are not meant to mock product, operations, or project colleagues for not knowing technology. They illustrate that some technical terms are hard to understand, which makes consensus difficult in cross-discipline communication.

Another common communication problem is speaking from feeling rather than from data. For example, a technical colleague says "system coupling makes our development inefficient" but offers no data and no examples. A bare statement like that gives other colleagues no concrete impression.

So in communication and coordination, the key is to turn technical language into plain language and to speak with facts and data!

For the M system, for example, we translated "scalability" into "versions are developed slowly, because every design has to consider whether it affects the portal and whether it affects other businesses". We then collected data on one month of versions and found that several versions spent one or two weeks in design discussion while the development itself took only 2 days, and that only 4 versions were released in the month. In the most extreme case, a version took 2 weeks of discussion and 2 days of development, and then waited a month to go live together with the portal system. When the project managers and product managers saw these numbers, they were shocked.

For the S system, we did not simply quote how many nines of reliability it had. Instead we listed the number of online failures, the duration of each one, the users affected, the feedback from customer service, and so on. Set against the same data from other systems, it was immediately clear to product, project, and operations colleagues alike that the system had a reliability problem.

Of course, if these techniques do not work, or you run into an extreme situation, it is time to consider stronger measures. For example, we once met a product colleague who believed that technical optimization and architecture refactoring were purely R&D matters. He paid them no attention and did not account for refactoring and optimization when allocating development resources. We had no choice but to escalate to senior management for coordination, and we even said bluntly: "If you do not agree to allocate resources for optimization, then the next time there is a failure, we will say that the product side did not provide the manpower for optimization and refactoring."

Build horizontal alliances

Besides the upstream and downstream communication discussed above, some refactoring also requires communication and coordination with other related or cooperating systems. Since everyone involved works in technology and shares a common language, this part is relatively easier. That does not mean we can push it through whenever we want, though. The main resistance comes from "what's in it for me?" and "this is not urgent for us right now".

Faced with "what's in it for me?", some people simply read it as selfishness and conclude that the other party ignores the bigger picture, so they raise the discussion to a matter of principle, saying things like "you should look at this from the department's perspective" or "this is good for the company as a whole". In practice this works very poorly. First, such high-minded appeals are usually empty and hard to pin down; different people understand them differently, so no consensus is reached. Second, if a plan benefits the company and the department but is useless or even harmful to a particular team, that may mean the current plan is not good enough and another plan should be considered.

So how can you push it forward effectively? Our strategy is "empathy, win-win cooperation, and a long-term view". Put simply, think from the other party's perspective: what the refactoring does for them, which of their problems it solves, and what benefits it brings.

Take the M system as an example. At the time, another system, the C system, connected directly to the same database as the M system and shared it. Our refactoring plan was to stop the two systems from operating on the same underlying database at once; instead, the C system would write to the database by calling the M system's interface. In the short term this clearly meant significant changes for the C system: more than ten places that read and wrote the database had to be changed into interface calls. At first the C system team felt that the refactoring did nothing for them. After analysis and communication, we learned that the C system was also suffering under the current architecture, mainly in three ways: "data errors frequently have to be investigated" (because the C and M systems operated on the same database and their logic was hard to keep consistent), "we have to develop in lockstep with the M system" (whenever the M system added a table or field, the C system, which read the database itself, had to understand the new logic), and "the C system connects to two databases, so problems are hard to trace" (because the C system also had its own database).

All of these problems could be solved by the M system refactoring. Although the C system faced a certain amount of development work in the short term, in the long run it would save a great deal of trouble: investigating data problems would become the M system's job, and the C system would obtain data through the M system's interface without having to care about the business logic behind the data. By communicating and coordinating in this way, the C system team became willing to cooperate with the refactoring, and the results proved that it greatly benefited both the C system and the M system.
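The change from "two systems writing one database" to "one owner behind an interface" can be sketched as follows. This is only an illustration under stated assumptions: the class names `GameDataService` and `CSystem`, their methods, and the in-memory dictionary standing in for the shared database are all hypothetical, not the real M or C system APIs.

```python
from typing import Dict


class GameDataService:
    """Stands in for the M system: the only component allowed to touch game data."""

    def __init__(self) -> None:
        self._games: Dict[int, Dict[str, object]] = {}  # stand-in for the database

    def save_game(self, game_id: int, name: str) -> None:
        # Validation and normalization live in one place, so every caller
        # gets the same rules instead of re-implementing them against raw tables.
        cleaned = name.strip()
        if not cleaned:
            raise ValueError("game name must not be empty")
        self._games[game_id] = {"id": game_id, "name": cleaned}

    def get_game(self, game_id: int) -> Dict[str, object]:
        return dict(self._games[game_id])  # return a copy, not internal state


class CSystem:
    """Stands in for the C system: it calls the interface, never the tables."""

    def __init__(self, game_data: GameDataService) -> None:
        self._game_data = game_data

    def rename_game(self, game_id: int, new_name: str) -> str:
        self._game_data.save_game(game_id, new_name)
        return str(self._game_data.get_game(game_id)["name"])
```

With this shape, a new table or field inside `GameDataService` does not force a matching change in `CSystem`, and a data error has a single owner to investigate.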

Of course, if something really does benefit the company or the department but harms a particular team, it is hard to push horizontally and may need to be coordinated through higher-level managers.

As for "this is not urgent for us right now", some people may think it is just an excuse, and I do not rule that out. But even if it is an excuse, it means we have not yet reached agreement, and the other party may simply be too embarrassed to refuse directly. Such cases can be handled in the same way as the "what's in it for me?" question above.

If the other party really does have more important business to deal with, pushing them hard at that moment is not a good idea. Again, put yourself in their shoes! Most system refactoring does not start when the fire is already burning; it is planned with some foresight. So if the other party really has more important things to do, waiting is a reasonable strategy, but there must be a clear, formal start date, such as "three months from now" or "starting in June". Never settle for an undefined point in time such as "later" or "when we are not so busy".

Besides being flexible about timing, we can also be flexible about the plan itself: skip the part of the refactoring that involves the other system and finish the rest first. Most systems that need refactoring have a lot of work to do anyway, and handling it in stages makes risk and scheduling more flexible and controllable.

About the author

Yunhua Li has more than ten years of software design and development experience in the telecom and mobile Internet industries. He previously worked at Huawei and UCWEB, serving successively as software development engineer, system analyst, architect, and technical leader. He now works as a senior software engineer in Alibaba Mobile Business Group (formerly UCWEB), leading several research and development teams and responsible for architecture design, architecture refactoring, technical team management, and technical training.

Technically, he focuses on open source technology, system analysis, and architecture design, and has studied the characteristics and development trends of Internet technology in depth. He has led the game access high-availability project, the Pigeon event publish and subscribe system, and the trading platform decoupling project, and has rich experience in system decoupling and in high-performance, high-availability architecture.

Baichuan.taobao.com is the wireless open platform of Alibaba Group. By opening up "technology, business and big data", it provides a cohesive, open, industry-leading matrix of technology products, mature business components, and a complete service system for mobile scenarios. It helps mobile developers build apps quickly, speeds up app commercialization, and supports mobile developers and mobile entrepreneurs across the board.
