I plan to write a few articles summarizing my three years as technical director of Nuggets, covering my thinking and takeaways from three perspectives: technology, product, and management.

This is the first of them. It describes, in chronological order, the technical planning and execution of rebuilding Nuggets, a technical community then running on a Serverless platform, and migrating the whole site to the public cloud shortly after I joined in late 2016, along with some reflections.

The beginning of the story

When I first arrived at Nuggets, there were two Android engineers, two iOS engineers, and three front-end engineers. Yes, that was the entire development team. This staffing carried Nuggets through its first 200,000 registered users.

Nuggets was built on a Serverless platform; our server-side code was all Node.js, exposing the interfaces that every client called. The platform handled the low-level abstractions for us: the database driver, data-model storage, a basic ACL mechanism, content storage, and user-related functionality (a complete user-center solution). With its SDK you could stand up a small-scale application, and that is how Nuggets was born.

But soon, we ran into some problems.

Failures. Yes, and system-wide failures. Because the Serverless platform was so highly integrated, during a platform-wide failure we could not even open its operations panel. Had we built the system ourselves, we could at least degrade gracefully. But depending entirely on someone else's system, all we could do in the face of a systemic failure was wait for the platform to fix it.

Cost. It is well known that pay-per-request billing is very cost-effective at low request volumes, while fixed-price billing becomes more economical once requests exceed a certain level, much like mobile data plans. Our request volume had passed the point where pay-as-you-go made economic sense, and costs began to rise sharply.
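
To make that crossover concrete, here is a back-of-envelope TypeScript sketch; the prices are invented for illustration and are not the platform's real rates.

    // Hypothetical prices, purely for illustration; not the platform's real rates.
    const perRequestCost = 0.000005;   // $ per request on pay-as-you-go
    const fixedMonthlyCost = 1000;     // $ per month for fixed-capacity servers

    // Below this volume, pay-as-you-go is cheaper; above it, fixed capacity wins.
    const breakEven = fixedMonthlyCost / perRequestCost; // 200,000,000 requests/month
    console.log(`break-even at ${breakEven.toLocaleString()} requests per month`);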

Black box. The platform's architecture was completely opaque: beyond the SDK we were using, we knew nothing about its architecture or infrastructure. The outermost boundary we could operate on was our own domain name; the innermost was our own business-logic code. Everything in between was a black box. The selling point of Serverless is that developers need not worry about any of this, but in industrial practice Serverless is no panacea.

Architecture rot. Code complexity had ballooned beyond what the existing staff could handle. The philosophy of Serverless is to keep things as light as possible, yet the coupling in our architecture was so high that we regularly had to set aside dedicated time just to refactor. Refactoring may look like a way to organize code more smoothly, but it points to a deeper problem.

The need for frequent refactoring directly illustrates several issues:

  • The existing architecture conflicts badly with new requirements; every change fights the original design
  • The existing architecture is complex to develop against: it takes a deep understanding of its current layout and patterns before you can modify it, or the architecture rots
  • The existing architecture's horizontal complexity (the amount of business code) drives a dramatic rise in vertical complexity (the amount of code that organizes the code). A startup's business grows so fast, and its code with it, that constant refactoring becomes unavoidable. This is why large projects need design patterns: you need a unified way to manage the growth of the business's horizontal complexity, and that management itself adds vertical complexity. In plain terms, a large codebase needs suitable design patterns to organize it, and the more code there is, the more elaborate the patterns and organizing code must be; that is, vertical complexity rises. Horizontal complexity can be handled by adding hands, whereas vertical complexity demands engineers who are experienced, familiar with the current architecture, and able to anticipate how the business will evolve next.

The three requirements I just gave for handling vertical complexity well come from experience and reflection:

  • Experienced: represents an understanding of the current architecture
  • Familiar with the current architecture: represents the ability to change it quickly
  • Able to anticipate the next business evolution: represents changing the architecture in the right direction

For these reasons and considerations, I decided to retire the old architecture and design one for Nuggets that could serve one million registered users.

I was very cautious when I first considered this; I'm sure you've all worked on a refactoring project at some point in your career, and refactoring is about as risky as it gets.

But my main motivation was that I faced the same risks whether I made the change or not. If we made the change, there was a risk the effort would fail technically, and the time and cost consumed would sink the company. If we didn't, things were fine for now, but in the not-too-distant future we would be mired in endless bug fixes and maintenance fatigue, with no capacity to move the business forward.

The only certainty was that if Nuggets didn't reach one million registered users reasonably soon, there wasn't much point in carrying on. Only rapid growth gave us a chance of survival. So there was only one option: change.

Well, once you've decided to do something, do it as soon as possible. But then I hit a new problem: we didn't have enough hands.

Yes, we couldn't replenish our engineering headcount. Although I had done my best to lower the bar, perhaps one interviewee in ten could comfortably implement a linked list, never mind hiring engineers with real experience. An unknown startup is not very attractive in itself, and it isn't a problem you can solve by matching big-company salaries: to poach from a big company, a startup would most likely have to offer double, and we didn't have that kind of budget. In my experience, a top-tier company can generally land an ideal candidate for an entry-level position within two weeks; Nuggets might not see a satisfactory candidate in three.

This made me wonder whether Nuggets would be stuck with this engineer shortage for a long time.

As I write this, four years have passed since I made that conjecture, and it turned out to be correct: without spending a lot of money, startups cannot consistently attract and retain good engineers. That is a lesson paid for in blood.

So, back then: how do you build a system that can support one million users, ship new business quickly, and not cost too much, when engineer skill levels are modest and hands are short? Before going further we need some theoretical grounding. The theories and reflections below may still be superficial, but they proved out empirically and led me to my answer, so I want to share them with you.

The boring concepts section

The concept of service size

Service size is an interesting topic. Why not organize all the code into a single service? And how should you size your services?

Let's first distinguish service sizes. Service size here does not mean the amount of code in a service, but how the services that make up a business are organized. We can try to categorize service sizes by that organization.

Let's take a business of fixed size as our example: WordPress, a complete CMS-type business.

The upper limit of service size, then, is WordPress itself. We can start with a simple definition: a business that relies only on intra-process communication (no network communication at all), organized as a single repo, is the largest pattern of service size.

Next, imagine splitting WordPress all the way down into functions, so that each function constitutes a service (there is no point subdividing a reasonably sized, single-purpose function any further; it would only hurt performance and add complexity). A business composed of repos whose smallest unit of service is a single function is the smallest pattern of service size.

For simplicity, we'll call the smallest pattern Picoservice and the largest Polyservice. WordPress, as just mentioned, is a good example of a Polyservice. Polyservices are common in reality; traditional businesses often organize code this way, as I'm sure you've seen. Splitting every function into its own service, however, is rare. I counted roughly 10,000 functions in WordPress; imagine how terrifying it would be to split WordPress into Picoservices and then assemble them back into a business. In practice, only businesses running on FaaS platforms are composed of and run as individual functions.

Based on this definition of service size, we can order the existing service types by size:

Picoservice <= FaaS (or Serverless) < Microservice < Monoservice <= Polyservice

Note that service size is independent of deployment scale. Both a Picoservice and a Polyservice can be deployed across multiple machines to improve performance, as long as they are properly designed.

Why did the FaaS model, in which every function becomes its own service, appear? The big reason is that it reduces the cost of each write: if a program is small enough and its goal simple enough, the cost of writing or modifying it stays low. The idea resembles Unix: write programs that each do one thing well, and pipe them together for more complex work. The other reason is that hardware is now powerful enough to splurge, sacrificing some performance for convenience.
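
As a sketch of how small such a unit is, here is a single-purpose handler; the event shape and handler signature are invented, since every FaaS platform defines its own.

    // One function, one purpose: the whole "service" is this handler.
    // Event shape and signature are illustrative, not any platform's real API.
    type LikeEvent = { userId: string; articleId: string };

    export async function likeArticle(event: LikeEvent): Promise<{ ok: boolean }> {
      // Anything beyond recording the like (counters, notifications) would be
      // a separate function, composed over the network like a Unix pipe.
      console.log(`user ${event.userId} liked article ${event.articleId}`);
      return { ok: true };
    }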

Microservices apply the same idea at the granularity of business units, which is why I think microservices could also be called BaaS (Business as a Service).

In the same vein, the popular mid-platform concept can be seen as abstracting the common base of many microservices into Monoservices to improve performance and manageability (i.e. a dedicated department maintains them for high availability and the consistency everyone expects).

The concept of service dependency patterns

Once the concept of service size is introduced, we can begin to discuss the dependency patterns between services.

First, we know a Polyservice has no inherent relationship to low coupling and high cohesion. What about something closer to a Picoservice? We won't literally give every function an interface, but suppose our service does need to be split: how should we cut it? Obviously, along the easiest seams. Easy places to cut include the natural seams of the business logic, business sub-functions, external dependencies, boundaries around CPU- or memory-intensive logic, reused logic, and so on.

As before, let's start from the ideal model; we are talking about pure Data Coupling here.

ABC Serial Pattern

That is:

A -> B -> C

When splitting such a business, you can't go wrong as long as you don't put A and C together: their functions are completely unrelated, and combining them would produce Coincidental Cohesion.

AB Mutual Pattern

A <-> B

With only two nodes, splitting or not is a judgment call; absent other constraints, don't overthink it.

ABC Tree Pattern

This is the pattern:

ABC-Tree-Pattern:

 A
↓ ↓
B C

or

ABC-Turn-Tree-Pattern:

  A
 ↓ ↓
 B C
↓ ↓
C D

For these two patterns, you likewise can't go wrong as long as you avoid Coincidental Cohesion, namely putting B and C together.

ABCD Loop Pattern

That is:

A -> B
↑    ↓
D <- C

This pattern often appears with callbacks. It is generally better kept together in one place to improve performance.

These four minimal patterns are the most common building blocks of business calls. Now let's look at the invocation patterns of real-world applications:

Look at your application's call graph and you'll find it is basically a combination of these patterns. For a large business, the call pattern resembles a tree (see, for example, Kiali's call diagram for microservices).

From the basic patterns above we know which configurations are unsuitable for splitting, but we still have no clue which ones are suitable, so we need some additional reference conditions.

The smallest unit of a natural business

We all know that requirements create a business, and there is no need to write code without requirements. So what is the smallest unit of business?

We define the set of operations serving a single business function as the Minimal Business Unit: for example, an article service (creating, reading, updating, and deleting articles, among other operations). Each individual operation is a Minimal Operating Unit.
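
As an illustration, here is a minimal in-memory sketch of an article service as one Minimal Business Unit; the Map is a stand-in for a real database.

    // The article service as a Minimal Business Unit: one business function,
    // several operations. Each method is a Minimal Operating Unit.
    type Article = { id: string; title: string; body: string };
    const store = new Map<string, Article>(); // stand-in for a real database

    export const articleService = {
      create(a: Article) { store.set(a.id, a); },
      read(id: string) { return store.get(id) ?? null; },
      update(id: string, patch: Partial<Article>) {
        const current = store.get(id);
        if (current) store.set(id, { ...current, ...patch });
      },
      remove(id: string) { store.delete(id); },
    };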

Generally speaking, services are split down to the minimal business unit. Where performance demands it, you can locally go further and split by minimal operating unit; for extreme performance you may even split inside an operating unit. For example, when Weibo generates its feed, big-V (celebrity) accounts are handled differently from ordinary users.

My rule of thumb is to consider splitting off part of a program if that part exhibits one or more of the following behaviors (a rough sketch for spotting such candidates follows the list):

  • Logic that only gets called by other parts, i.e. logic at the bottom of the call graph (usually utility or storage code)
  • Logic called by many other parts
  • Logic that calls many other parts
  • Parts of the business whose requirements change very frequently
  • Parts that serve different groups of users
  • Parts whose requests take a long time
  • Parts whose code volume is significantly larger than the rest
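
To illustrate the fan-in and fan-out rules, here is a heuristic sketch that flags split candidates in a call graph; the threshold is arbitrary, and this is an illustration, not a tool we actually used.

    // Flag split candidates by fan-in/fan-out in a call graph.
    type Edge = { from: string; to: string };

    function splitCandidates(edges: Edge[], threshold = 3): string[] {
      const inDeg = new Map<string, number>();
      const outDeg = new Map<string, number>();
      for (const { from, to } of edges) {
        outDeg.set(from, (outDeg.get(from) ?? 0) + 1);
        inDeg.set(to, (inDeg.get(to) ?? 0) + 1);
      }
      const nodes = new Set(edges.flatMap(e => [e.from, e.to]));
      // High fan-in (called by many) or high fan-out (calls many) marks a seam.
      return [...nodes].filter(
        n => (inDeg.get(n) ?? 0) >= threshold || (outDeg.get(n) ?? 0) >= threshold
      );
    }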

Split with minimal business unit pattern

From the concept of service size we learned that the smaller the service, the easier it is to change, and of course the lower the performance. From the concept of service dependency patterns we know how to organize services so that inter-service dependencies stay manageable.

We talked above about how the mid-platform concept emerged: extracting the parts shared among microservices to improve performance. Naturally we can ask: is there an opposite concept, a "scattered platform", so to speak (reading the "mid" in mid-platform as "centralized" rather than "in the middle")? Is there a pattern that is not only easy to modify, but whose changes have little impact both on itself and on everything around it?

Okay, let's talk about a pattern whose service size sits between FaaS and microservices. I call it the Minimal Business Unit Pattern: the business unit is the pattern's basic unit. Its distinctive feature in practice is that it encourages reuse by replication, reducing the degree to which different businesses share the same logic. This minimizes the impact that changes in one business have on the others.

For example, suppose a business is called by many other businesses, and a new business now needs to call this logic again, but with many differences from the original. The traditional refactoring approach takes the original code and builds on top of it. The mid-platform approach extracts the logic into centralized infrastructure once it is large enough. This pattern instead simply writes some duplicate code and splits it off together with the new business.
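
A tiny sketch of reuse by replication, with invented file names: the new business copies the helper instead of importing the shared one, then diverges freely.

    // article-service/summary.ts: the original helper, used by many callers.
    export function summarize(body: string): string {
      return body.slice(0, 140);
    }

    // feed-service/summary.ts: a copy taken by the new business, already
    // diverging. Changing or deleting it cannot break article-service.
    export function summarizeForFeed(body: string): string {
      return body.replace(/\s+/g, " ").slice(0, 80) + "...";
    }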

This is an antipattern. Although it is still a form of reuse, it directly violates the DRY (Don't Repeat Yourself) principle, and the disadvantages are plain: code bloat, rising maintenance costs, harder maintenance. But this time let's look at what it buys you:

  • First, its biggest benefit is that it reduces the chance of a change introducing bugs, because old code that is never touched cannot introduce new problems.
  • Second, it trades an increase in vertical complexity (the management logic generated to organize code, design patterns being the prime example) for an increase in horizontal complexity (code bloat).
  • Third, its cost is very low. Modifying the original code requires familiarity with the original business, fitting the change into the existing organizational model, and re-testing the original business afterward (there is no guarantee it still works), and all of that is cost. A new business simply copies the logic it needs and adds its own.
  • Finally, it avoids centralizing failure, consistent with genetic variation in biology: mutations in DNA (read: logic) during replication shrink the infected area when a virus (read: a bug) breaks out widely. Once the logic is copied, one copy dying inside a single program does not affect the other copies, whereas if many businesses share one piece of code, every business using it goes down when that code goes down. The best example is the configuration center common at big companies: once the configuration center dies, all services die. Putting configuration into code, while not a good practice, does not face this problem.

However, be sure to figure out when this split antipattern applies:

  • Everything below rests on one rule: the business implementing this antipattern must be small enough.
  • This antipattern suits bursts of experimental new business, i.e. when the product manager is trying out a new idea. In that case, rather than modifying the existing business only to discover that no users want the new feature, leaving the new logic buried in the business as a time bomb (whether a bug or a security risk), you write a new business and copy over the parts you need. When the feature is no longer wanted, you take it offline without affecting the original business. The cost of change is far lower than developing on top of the original code.
  • Once this pattern is in place, later changes are better served by rewriting than by refactoring. As long as the business stays small, the cost of a rewrite approaches zero, and rewriting also reduces the chance of introducing new bugs.
  • This pattern suits teams whose maintainers change frequently. When a company is short-staffed or turnover is high, modifying the original code demands an understanding of its organizational model, whereas writing a fresh copy does not.
  • The other purpose of this antipattern is to eliminate inter-business calls. If you implement it and still call back into the source business, the implementation is not thorough.

The fight begins

Hands-on

Okay, enough abstraction. Let's recap what all of this says:

  • We started by discussing the characteristics of services of different sizes
  • Then we talked about how to split a service if you want it small, and when splitting is appropriate
  • Finally, we covered the minimal business unit pattern, which can be broken down very small, with very low cost per modification

In short, we analyzed service size and inter-service dependency patterns and arrived at a pattern with low per-write cost, low collateral impact, and, of course, low performance: the Minimal Business Unit Pattern.

Looking at our situation, we didn't have enough hands (no time to deal with vertical complexity) and we didn't have enough experience (no ability to deal with vertical complexity). So a pattern that lets horizontal complexity skyrocket while holding vertical complexity down suited us well.

What about performance, then? For overall throughput, web businesses have an inherent advantage: deploy more replicas and overall performance scales linearly with their number. For single-request latency, there are two ways to improve: subtraction, i.e. removing logic so performance naturally improves, and centralization, i.e. mid-platform style, centralizing the performance-sensitive parts.

So we arrived at our final answer to the staffing problem, present and future: rewrite all the business using the minimal business unit pattern, and apply the traditional centralized pattern in the few core spots where performance falls short.

Achilles and the tortoise

Refactoring doesn't happen overnight, and it is suicidal to shut down the business, write a new system, and switch all traffic at once. Switching in an orderly way, disturbing existing traffic as little as possible, raises the odds of success. The switching order is therefore exactly the opposite of writing a large business from scratch: switch the outermost edge businesses first, work inward, and switch the core last.

Everyone knows the story of Achilles and the tortoise, and refactoring a business feels much the same. Keeping the old business code from continuing to grow is one of the keys to a successful rewrite; otherwise the old code keeps expanding and the new system never catches up enough to go live.

Our solution was as simple as making the existing business code read-only. I agreed on a rule with the business team: past a certain point, no code would be added to the old platform, and all new services would be written from scratch on the new one.

Rebuilding a business is like moving fish from an old tank to a new one. Four things take part in the process: the tank (the original system), the fish (user traffic), the net (the bridge), and the new tank (the new system). Since we could no longer add code to the old system, how would the new system reach the functionality and data still living in the old one during the migration? The answer: the net.

Choosing the bridge between the old and new platforms was quite involved. Traffic has to be forwarded by rule at the choke point (the gateway) through which all traffic passes, so that changes stay minimal while every service is covered. The original Serverless platform's SDK made a good bridge: every client had to go through the SDK to talk to the server, and we modified it slightly. Businesses split out to the new platform along minimal business unit lines read their own databases, and read the data still left on the Serverless platform through the modified SDK. Likewise, businesses still on the Serverless platform reached businesses on the new platform through the modified SDK, which forwarded that traffic over. Each time we migrated a service, a change in the SDK redirected its traffic to the new platform. Once the switchover was complete, we removed the SDK and replaced it with native URI calls.
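
The modified SDK boiled down to a routing shim like the sketch below; the prefix list and host names are invented, and the real list grew one migrated service at a time.

    // Route each call to the new platform if its service has been migrated,
    // otherwise fall through to the old Serverless platform.
    const migratedPrefixes = ["/users", "/articles"]; // illustrative

    function resolveBase(path: string): string {
      return migratedPrefixes.some(p => path.startsWith(p))
        ? "https://api.new-platform.example"     // new platform (invented host)
        : "https://legacy-serverless.example";   // old platform (invented host)
    }

    export async function request(path: string, init?: RequestInit): Promise<Response> {
      return fetch(resolveBase(path) + path, init);
    }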

This is actually one of the limitations of the Serverless platform's design: your domain is CNAMEd over to the platform, and the URIs are completely out of your control. On a platform you build yourself there is a web gateway, and for a later migration you can make all the changes there with the services none the wiser. On the Serverless platform we couldn't touch any web gateway, so the SDK was the only place to hack. I have to say that the SDK being open source is what gave us the chance to migrate at all. Had it been closed source, the only ways to achieve the same effect would have been a high-risk one-shot migration or wrapping our own web gateway layer on top.

Obviously, having the SDK reach into both platforms could become a performance bottleneck. So I built a centralized traffic proxy gateway, API-Proxy, on the new platform: traffic forwarding implemented with OpenResty, with an in-process cache via lua-resty-lrucache to absorb the latency of the data-coupling hops. On an 8-core, 16GB machine this proxy gateway approached 80K QPS (average request time 20ms on cache hits). This is exactly what we discussed above: using the traditional centralized pattern to boost performance in the core where it falls short.
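
The real gateway was OpenResty with lua-resty-lrucache; this Node-style sketch only shows the shape of the idea: a small in-process cache in front of the forwarding step, so hot responses never leave the proxy. The upstream host is invented, and a TTL map stands in for the real LRU to keep the sketch short.

    import http from "node:http";

    // In-process TTL cache in front of the upstream call.
    const cache = new Map<string, { body: string; expires: number }>();
    const TTL_MS = 5_000;

    http.createServer(async (req, res) => {
      const key = req.url ?? "/";
      const hit = cache.get(key);
      if (hit && hit.expires > Date.now()) {
        res.end(hit.body); // cache hit: no upstream round trip
        return;
      }
      const upstream = await fetch("http://upstream.internal" + key); // invented host
      const body = await upstream.text();
      cache.set(key, { body, expires: Date.now() + TTL_MS });
      res.end(body);
    }).listen(8080);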

Ghost Dubbing

Switching databases was more complex. For data like articles and comments, we simply picked a low-traffic window, copied the databases, and switched traffic over. But user data posed a problem: user passwords were encrypted, and we had no idea how (it was internal functionality of the Serverless platform, implementation unknown).

How do you switch data like that? Here is the design we came up with.

New, unregistered users were simple: they just registered directly through the new user center's registration interface.

When an existing registered user logged in, we invoked API-Proxy to replay the login against the old platform. If the login succeeded there, the username and password were valid, so the password the user had just typed was fed straight into the new platform's password-update logic, completing the password update in the local database. Once a local user had a password, the system skipped API-Proxy and handled login entirely in the new user center. This re-encrypted user passwords and stored the results in the new database without our ever knowing the original encryption scheme.
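
The login flow reduces to the sketch below; every helper name is hypothetical, standing in for the new platform's user store, API-Proxy's replay of the old login, and the new hashing scheme.

    // Lazy password migration: verify locally if already migrated, otherwise
    // replay against the old platform via API-Proxy and re-hash locally.
    declare function findLocalUser(name: string): Promise<{ passwordHash?: string } | null>;
    declare function verifyLocal(pw: string, hash: string): Promise<boolean>;
    declare function legacyLoginViaApiProxy(name: string, pw: string): Promise<boolean>;
    declare function hashPassword(pw: string): Promise<string>;
    declare function storeHash(name: string, hash: string): Promise<void>;

    export async function login(name: string, pw: string): Promise<boolean> {
      const user = await findLocalUser(name);
      if (user?.passwordHash) {
        return verifyLocal(pw, user.passwordHash); // already migrated
      }
      const ok = await legacyLoginViaApiProxy(name, pw); // old platform says yes/no
      if (ok) {
        await storeHash(name, await hashPassword(pw)); // re-encrypt with our own scheme
      }
      return ok;
    }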

Inactive users would never trigger the API-Proxy login check, and API-Proxy had to be retired eventually. So we set a two-week transition period, after which we took API-Proxy offline. From then on, inactive users went straight into the password-reset flow: a reset link sent to their registered email address let them set a new password.

Deployment patterns and local optimizations

Finally, we gathered up the parts that were awkward to switch gradually (canary-style) and scheduled a single cutover for them. The launch itself was simple: we made an announcement in advance, removed the forwarding logic to the Serverless platform from the SDK and API-Proxy, and switched directly to the new environment. Thanks to thorough testing we hit almost no compatibility errors. But we did hit a performance problem.

After the switch, overall latency rose significantly, and at peak times the main site wouldn't even load. This was the performance cost of over-splitting, but the solution was already planned. Our deployment mode was all-in-one: every machine in production carried every service of the whole site. To add capacity we simply started a new machine in the public cloud, deployed all the services onto it, and added it to the web reverse proxy.

In the days before containerized platforms, this was a decent way to scale up quickly and easily. The architecture is simple, every machine is interchangeable, and losing one machine never takes out a specific service. But the drawbacks are just as significant: single-machine performance is limited, so with too many services one machine runs out of headroom. Adding machines relieves the pressure, but any CPU-intensive service in this deployment mode inevitably competes for clock cycles with every other service, driving latency up. In our practice, Nuggets had 100+ microservice repos, all of them on every machine. Our tests put the limit at around 8 repos per core, so a 16-core machine could carry roughly 128 of our finely split repos.
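
As a sanity check on those numbers, a one-line sketch using the figures from the text:

    // Capacity check: ~8 repos per core was our measured limit.
    const reposPerCore = 8;
    const cores = 16;
    const perMachine = reposPerCore * cores;  // 128 repos fit on one 16-core box
    const totalRepos = 100;                   // Nuggets had 100+ microservice repos
    console.log(perMachine >= totalRepos);    // true: one machine still holds them all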

To cut interface latency we also optimized API-Proxy itself. Because services follow the minimal business unit pattern, rendering data with many associations means calling many underlying storage interfaces, which hurts performance badly, especially when the data is context-dependent. Take user notifications: when a user is notified that someone liked their article, displaying that notification needs the article's abstract (read from the article system), the liker's profile (read from the user system), and the like count (read from the like system), i.e. three interface calls. So we designed a global cache in API-Proxy: a builder aggregates these pieces of data ahead of time, and any system that needs them reads the cache directly, filtering the payload down to only the fields it needs. A classic case of trading space for time.
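
The builder's job looks roughly like this sketch; the reader functions and cache shape are invented stand-ins for the three upstream systems and API-Proxy's global cache.

    // Pre-aggregate the three upstream reads once; consumers then read the
    // cached view and keep only the fields they need.
    type NotificationView = { articleTitle: string; likerName: string; likeCount: number };

    declare function getArticleTitle(articleId: string): Promise<string>; // article system
    declare function getUserName(userId: string): Promise<string>;        // user system
    declare function getLikeCount(articleId: string): Promise<number>;    // like system

    const viewCache = new Map<string, NotificationView>();

    export async function buildNotificationView(articleId: string, likerId: string) {
      const [articleTitle, likerName, likeCount] = await Promise.all([
        getArticleTitle(articleId),
        getUserName(likerId),
        getLikeCount(articleId),
      ]);
      viewCache.set(`${articleId}:${likerId}`, { articleTitle, likerName, likeCount });
    }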

In closing

Nuggets' rebuild and switchover lasted nearly six months, from December 2016 to May 2017. I started the refactoring with two front-end engineers, two back-end engineers, two iOS engineers, and two Android engineers. From December 2016 to February 2017 we worked in parallel: new business development and the rebuild proceeded at the same time. In March 2017 we suffered a serious loss of engineers: on the front and back end only one front-end engineer and I remained, the iOS engineers all left, and the Android engineers loyally stayed on. We spent most of that month recruiting and had hired three back-end engineers by the end of March. The period right after annual bonuses is normally the worst for attrition, but we lost even more than that for other reasons. That said, we clearly did a fine job of training engineers: every one of them landed at a big tech company after leaving. Small companies are hard in many ways, which is exactly why I had to swallow my pride and keep the architecture as simple as possible.

From April to May 2017 we halted new business development and spent all our time on the rebuild and migration. The final phase of a migration is extremely stressful for every team. The product team sits in the awkward, almost comic position of being unable to attract new users without new features, while the development team is at the tail of the Pareto curve: the closer completion gets, the more odds and ends are left waiting to be restored and polished. These are the moments when leadership matters. In the end I cut some of the old businesses, completed the switchover, and planned to work through the leftovers afterward. In fact, by the time I left Nuggets, that part of the business had never needed to be reimplemented.

I've heard countless stories along the lines of "if I'd known it would be this hard, I'd never have started." During the refactoring I even worked every Saturday for free (the company normally took Saturdays off), hoping to complete the switch as soon as possible. After the cutover, though, evaluating the result became a problem: since we wouldn't hit performance bottlenecks any time soon, and I had no time to write a long article arguing why we refactored, how it worked, and what it bought us, the value was largely invisible. Perhaps only if the old architecture had been allowed to collapse, and then been magically repaired in an instant, would the effect have been obvious. But there is no magic, and startups have no time for sentiment.

A strong architecture is no substitute for money, and money isn't everything; by that logic, a strong architecture isn't everything either (insert laugh track here). There are no good or bad architectures, only fits and misfits for the team and the business. Refactor with caution, and evaluate first.

See you next time.