There’s a quote from the classic book Refactoring:

In the beginning, the refactoring I did was all about the details. As the code gets simpler, I find myself seeing aspects of design that I wouldn’t have understood before, and that I wouldn’t have been able to reach without refactoring.

Refactoring is really exciting for programmers.

Earlier this year, our team completed a complex refactoring of over 300 files and 30,000 lines of code in the core engine of the advertising system.

It took only about a month from technical design to the final full launch, with no production incidents.

This is probably the biggest and most successful refactoring project I’ve ever done in my eight years of programming: fast enough, well planned, and of good quality.

01 Let’s talk about the historical baggage of the system

Our advertising engine went through about a year and a half of iteration before this refactoring. In the beginning it targeted only the search scenario, with a single line of business and a clear process.

Starting in 2019, the company’s advertising business began to expand rapidly, with revenue growing almost exponentially. During this period, our advertising engine faced two challenges:

1. Business scenarios became more complex. In addition to search ads, the engine had to support news feed recommendations and similar recommendation scenarios.

2. Advertising traffic began to grow rapidly, so beyond meeting functional requirements we also had to deliver good performance.

After reviewing the code, we found that most of the engine’s logic was common across scenarios, so we defined a main framework and abstracted its extension points. Each scenario could then implement a set of common interfaces according to its own business needs. For performance, we also sacrificed some code readability and parallelized part of the logic.

As the business grew and the search scenario began to iterate rapidly, with more and more new policies, our main framework became inflexible.

Changing the main framework would have meant refactoring every scenario other than search. With the business growing rapidly, the schedule did not allow for that, so we kept patching on top of the existing framework. This caused two obvious problems:

1. To accommodate the special logic of search, we had to add various if checks in the other scenarios to bypass that logic.

2. The number of advertising strategies kept growing, eventually reaching several dozen. Once the framework lost its clear structure, some strategies were implemented as one-off customizations, without layering or a pluggable abstraction.

As these changes accumulated, the code drifted away from its original design and the technical debt mounted. However, we never found the right time to refactor.

A turning point came at the end of 2019. Owing to the nature of the advertising business, traffic began to decline naturally. In addition, the product and operations teams were focused on planning for the following year, which gave us a very good window to start this refactoring.

We set the schedule at one month, and the launch ended up only one day later than planned. Two problems did surface in production, but both were found and fixed in time during the gray-release period, and neither caused a production incident.

Overall, this was a difficult but successful refactoring project, and here are some valuable lessons I learned from it.

02 What did we do to prepare for refactoring?

This refactoring touched a large amount of code, more than 30,000 lines, in the core engine of the advertising system. Before starting, we anticipated the following difficulties:

1. Business resistance: advertising is extremely business-driven. Although this refactoring would improve R&D efficiency in the long term, it would not directly increase revenue, and the development cycle was not short.

2. Technical concerns: the company penalizes production incidents, including any caused by refactoring, so how could we let everyone work without that burden? In addition, if heavy business iteration continued during the refactoring, nobody could guarantee the delivery date, and quality would be hard to control.

In response to the concerns of both sides, I think the following work played a key role.

1. Let everyone see the pain points

As mentioned earlier, as the business iterated, the main framework of our advertising engine became blurred, and dozens of advertising strategies ended up scattered across different business scenarios with disordered configurations.

To address these two pain points, we began mapping out the existing business logic a month in advance. We went through the old code while rereading the earlier requirement documents, and finally organized the core process and advertising strategies of each scenario into one clear table.

This table gave both the engineering and product teams their first clear view of the engine, the complexity of the business, and the current technical bottlenecks.

2. Define the goals and values of refactoring

After making everyone feel the pain points, we laid out two core goals for this refactoring:

1. Rebuild the main framework: modularize the main process and redefine the protocols between upper and lower layers so that the interfaces are clear. Each layer should also be well abstracted and easy to extend.

2. Flexible and configurable policies: advertising policies are classified and abstracted by business intent, their execution conditions can be configured dynamically, and policies can be inserted and removed freely.
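
The second goal can be illustrated with a minimal Python sketch of a pluggable policy pipeline. This is an illustration under assumptions, not the engine's actual code: the names Policy, PolicyPipeline, scenario_in, the two sample policies, and the ad and context fields are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Ad = Dict[str, Any]        # hypothetical ad record
Context = Dict[str, Any]   # hypothetical request context (scenario, query, ...)


@dataclass
class Policy:
    """One advertising policy: a name, an execution condition, and the logic."""
    name: str
    applies: Callable[[Context], bool]
    apply: Callable[[List[Ad], Context], List[Ad]]


@dataclass
class PolicyPipeline:
    """Runs policies in order; policies can be inserted or removed at any time."""
    policies: List[Policy] = field(default_factory=list)

    def insert(self, policy: Policy, index: Optional[int] = None) -> None:
        if index is None:
            self.policies.append(policy)
        else:
            self.policies.insert(index, policy)

    def remove(self, name: str) -> None:
        self.policies = [p for p in self.policies if p.name != name]

    def run(self, ads: List[Ad], ctx: Context) -> List[Ad]:
        for policy in self.policies:
            if policy.applies(ctx):      # condition decided per request
                ads = policy.apply(ads, ctx)
        return ads


def scenario_in(*scenarios: str) -> Callable[[Context], bool]:
    """Build an execution condition from configuration (a list of scenarios)."""
    allowed = set(scenarios)
    return lambda ctx: ctx.get("scenario") in allowed


def drop_low_bids(ads: List[Ad], ctx: Context) -> List[Ad]:
    return [ad for ad in ads if ad["bid"] >= ctx.get("min_bid", 0.0)]


def boost_query_match(ads: List[Ad], ctx: Context) -> List[Ad]:
    def score(ad: Ad) -> float:
        return ad["bid"] * (2.0 if ad.get("keyword") == ctx.get("query") else 1.0)
    return sorted(ads, key=score, reverse=True)


pipeline = PolicyPipeline()
pipeline.insert(Policy("min_bid_filter", scenario_in("search", "feed"), drop_low_bids))
pipeline.insert(Policy("query_boost", scenario_in("search"), boost_query_match))

ads = [
    {"id": 2, "bid": 0.8, "keyword": "hats"},
    {"id": 1, "bid": 0.5, "keyword": "shoes"},
    {"id": 3, "bid": 0.1, "keyword": "shoes"},
]
search_ctx = {"scenario": "search", "query": "shoes", "min_bid": 0.2}
feed_ctx = {"scenario": "feed", "min_bid": 0.2}

search_ids = [ad["id"] for ad in pipeline.run(ads, search_ctx)]
feed_ids = [ad["id"] for ad in pipeline.run(ads, feed_ctx)]
```

In this sketch the search-only boost never needs an if check inside the feed scenario: the condition lives with the policy, so other scenarios simply do not run it.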

We also spelled out the expected benefits of these two core goals:

1. Technical benefits: a clearer code structure that is easier to understand and maintain. With better extensibility, engine development efficiency improves further.

2. Business benefits: policies can be configured and extended at a finer granularity, which supports the business better. Higher R&D efficiency in turn speeds up business iteration.

Communicating the value of the refactoring to everyone raised the team’s enthusiasm and made people more motivated to take part.

3. Control of the overall rhythm

Controlling the overall pace is also very important, because it gives everyone a clear expectation of the timeline.

First, we set the schedule at one month. On the one hand, that was the longest period the business side would accept, and on the technical side we also wanted a quick, decisive effort. On the other hand, the Spring Festival was approaching, so we had to launch before the company’s release freeze and keep a buffer of 1-2 weeks for the unexpected.

In addition, we agreed with the business side that non-urgent requirements for the engine would be put on hold during the refactoring, which minimized parallel development and code conflicts and kept the team focused.

03 What lessons can be shared from the implementation process?

There are four lessons that I think are valuable in making this refactoring work so well.

1. A high-quality technical design

This comes from our standing rule that any project with a development cycle of more than three days needs a written technical design, and this refactoring was no exception.

The overall architecture of the framework, the protocol design between modules, and the extensibility design of the strategies were the key points of this design, and the team discussed it more than three times.

Once the overall design was finalized, the team further refined the shared parts such as the database, interface fields, cache structure, and log instrumentation points. Because several people were developing in parallel, the team agreed to use the document as the communication interface and to keep it in sync with the code at all times.

Under these high standards, the team produced a technical design document of more than 5,000 words, 36 pages in total, which laid a good foundation for overall quality.

2. Build the framework skeleton first

This PR was critical: it was the most important step in getting from the technical design to code. We laid out the new package structure, module boundaries, API definitions between layers, and the abstractions for the different advertising strategies, leaving implementation details aside for the moment.

With that, the skeleton of the main code was essentially in place and clearly outlined the framework we wanted. We then held several focused code reviews until we reached consensus.

This step helps avoid getting bogged down in implementation details too early, which would pull attention away from the main framework and leave it unstable, causing rework later.
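
As a rough illustration of what a skeleton-first PR can contain, here is a minimal Python sketch in which only the contracts between layers and the main flow are fixed, while the real implementations come later. The stage names (Recall, Ranker), the data classes, and the stub implementations are assumptions made for this example, not the actual modules of our engine.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AdRequest:
    scenario: str          # hypothetical field, e.g. "search" or "feed"
    user_id: str


@dataclass(frozen=True)
class AdCandidate:
    ad_id: int
    score: float = 0.0


class Recall(ABC):
    """Layer contract: produce candidate ads for a request."""

    @abstractmethod
    def recall(self, request: AdRequest) -> List[AdCandidate]:
        ...


class Ranker(ABC):
    """Layer contract: order candidates, best first."""

    @abstractmethod
    def rank(self, request: AdRequest,
             candidates: List[AdCandidate]) -> List[AdCandidate]:
        ...


class Engine:
    """Main flow. It depends only on the contracts above, so it can be
    reviewed before any real implementation exists."""

    def __init__(self, recall: Recall, ranker: Ranker) -> None:
        self._recall = recall
        self._ranker = ranker

    def serve(self, request: AdRequest, limit: int) -> List[AdCandidate]:
        candidates = self._recall.recall(request)
        return self._ranker.rank(request, candidates)[:limit]


# Throwaway stubs: just enough to exercise the skeleton during review.
class StubRecall(Recall):
    def recall(self, request: AdRequest) -> List[AdCandidate]:
        return [AdCandidate(1, 0.2), AdCandidate(2, 0.9), AdCandidate(3, 0.5)]


class ScoreRanker(Ranker):
    def rank(self, request: AdRequest,
             candidates: List[AdCandidate]) -> List[AdCandidate]:
        return sorted(candidates, key=lambda c: c.score, reverse=True)


engine = Engine(StubRecall(), ScoreRanker())
top_two = engine.serve(AdRequest(scenario="search", user_id="u1"), limit=2)
```

Reviewers can argue about the shape of AdRequest or the signature of rank at this stage, when changing them is still cheap.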

3. Frequent communication and a paired code review mechanism

Once you reach the detailed implementation phase, understanding the existing logic is essential. Over a year and a half of iteration, many people had worked on the engine code, but only three engineers took part in this refactoring.

Throughout the process, whenever we ran into code whose logic was unclear, we asked and verified repeatedly instead of guessing, which turned out to be very important.

In addition, we assigned code review module by module to colleagues familiar with that part of the business, pairing authors and reviewers flexibly.

4. Effective test plan

Before any refactoring starts, the tests come first. This principle is emphasized in Refactoring and was a focus of our technical design discussions, so I will cover it in detail here.

First, we agreed early on that we would not touch any of the old code and would build the refactored code in a completely new package. This makes it easy to compare results before and after the refactoring, and also to run a gray-release experiment in production.
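
One way to picture the benefit of keeping both packages side by side is a thin router that decides per request which engine serves it. The Python sketch below is an assumption-laden illustration, not our production routing: bucket, use_new_engine, serve, and the two stand-in engines are hypothetical names, and hashing the request ID is just one common bucketing scheme.

```python
import hashlib
from typing import Callable


def bucket(request_id: str, buckets: int = 100) -> int:
    """Map a request ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def use_new_engine(request_id: str, rollout_percent: int) -> bool:
    """True if this request falls inside the current rollout percentage.

    Because the bucket is deterministic, a request served by the new engine
    at a low percentage stays on the new engine as the percentage grows.
    """
    return bucket(request_id) < rollout_percent


def serve(request_id: str, rollout_percent: int,
          old_engine: Callable[[str], str],
          new_engine: Callable[[str], str]) -> str:
    engine = new_engine if use_new_engine(request_id, rollout_percent) else old_engine
    return engine(request_id)


# Stand-ins for the untouched old package and the new package.
def old_engine(request_id: str) -> str:
    return f"old:{request_id}"


def new_engine(request_id: str) -> str:
    return f"new:{request_id}"


request_ids = [f"req-{i}" for i in range(1000)]
share_on_new = sum(use_new_engine(r, 30) for r in request_ids) / len(request_ids)
```

Setting the percentage back to 0 sends all traffic to the old code again, which is what makes the old package a cheap rollback path.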

In terms of the test plan, the following four points are worth sharing:

1. End-to-end testing: this refactoring involved no functional changes, so the behavior of the external API did not change. End-to-end testing was therefore the most effective method and the main means of testing for both development and QA.

2. Smoke testing: QA provided the smoke cases and the developers ran them. All smoke cases had to pass before developers handed the build over for testing. This is unusual for most Internet companies, but it works for large projects.

3. Dual-run verification in a sandbox environment: because the code from before and after the refactoring was both retained, as mentioned above, we used a script to capture input parameters from the production environment as test cases, and then automatically compared the API’s return fields one by one.

4. Gray-release experiment in production: gray release is very important for refactoring. We used the existing ABTest platform to ramp up gray traffic gradually from 5%, to 10%, to 30%, and finally to 100%.
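
The dual-run verification in point 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming captured requests are plain dictionaries and both engines return JSON-like data; diff_fields, replay, and the two fake APIs are hypothetical names, not our actual tooling.

```python
from typing import Any, Callable, Dict, List


def diff_fields(old: Any, new: Any, path: str = "") -> List[str]:
    """Compare two JSON-like values field by field and describe differences."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs: List[str] = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else str(key)
            if key not in old:
                diffs.append(f"{child}: missing in old")
            elif key not in new:
                diffs.append(f"{child}: missing in new")
            else:
                diffs.extend(diff_fields(old[key], new[key], child))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [f"{path}: length {len(old)} != {len(new)}"]
        diffs = []
        for i, (o, n) in enumerate(zip(old, new)):
            diffs.extend(diff_fields(o, n, f"{path}[{i}]"))
        return diffs
    return [] if old == new else [f"{path}: {old!r} != {new!r}"]


def replay(cases: List[Dict[str, Any]],
           old_api: Callable[[Dict[str, Any]], Any],
           new_api: Callable[[Dict[str, Any]], Any]) -> Dict[int, List[str]]:
    """Run every captured case through both engines; keep only mismatches."""
    report: Dict[int, List[str]] = {}
    for index, params in enumerate(cases):
        diffs = diff_fields(old_api(params), new_api(params))
        if diffs:
            report[index] = diffs
    return report


# Fake engines for the example; the new one has a deliberate bug for "feed".
def old_api(params: Dict[str, Any]) -> Dict[str, Any]:
    return {"scenario": params["scenario"], "ads": [{"id": 1, "price": 0.5}]}


def new_api(params: Dict[str, Any]) -> Dict[str, Any]:
    price = 0.55 if params["scenario"] == "feed" else 0.5
    return {"scenario": params["scenario"], "ads": [{"id": 1, "price": price}]}


cases = [{"scenario": "search"}, {"scenario": "feed"}]
report = replay(cases, old_api, new_api)
```

An empty report over a large set of captured cases is the signal that the refactored code preserves behavior; any entry points straight at the field that diverged.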

04 Final thoughts

Looking back over the entire refactoring process, I would summarize the following seven key points:

1. Seize the right opportunity to refactor.

2. Map things out early and find the pain points first.

3. Get people excited by clarifying goals and value.

4. Do not let it become a protracted battle, and do not run it in parallel with business work.

5. Insist on a high-quality technical design.

6. Before refactoring starts, the tests come first.

7. Verify carefully and take responsibility for every line of code.

The most important factor, of course, is people. Refactoring a large project is a serious test of teamwork, and once everyone is on board, the refactoring is already half won.


About the author: master’s degree from a 985 university, former Amazon engineer, now a technical director at 58
