Nowadays we constantly praise cloud-native architectures (containerization and microservices), yet the reality is that most companies still run monolithic systems. Why? Not because we are hopelessly unfashionable, but because distribution is genuinely hard. Still, it is the only way to build hyper-scale, truly resilient, responsive systems, so we have to get to grips with it.

Forget Conway’s Law; distributed systems follow Murphy’s Law: “Anything that can go wrong will go wrong.”

At the scale of distributed systems, statistics are not your friend. The more server instances you have, the higher the likelihood that one or more of them will fail, and quite possibly at the same time.

Servers will go down, networks will drop packets, disks will fail, and virtual machines will terminate unexpectedly, often before you even receive an alert email.

Some of the guarantees a monolithic architecture gives you simply disappear in a distributed system. Components (now services) no longer start and stop in a predictable order. A service may restart unexpectedly, changing its database state or even its version. As a result, no service can make assumptions about another; the system can no longer rely on guaranteed one-to-one communication.

Many of the traditional mechanisms for recovering from failure can actually make things worse in a distributed environment. Brute-force retries can flood your network with packets, and backup and restore are no longer straightforward. The old design patterns that address these issues need to be rethought and retested.
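
To make the retry point concrete, here is a minimal sketch (not from the original article) of a retry loop with exponential backoff and random jitter, which avoids the synchronized retry storms that naive "hard retry" loops create. The `call_remote_service` callable is a hypothetical placeholder for any flaky remote call.

```python
import random
import time

def call_with_backoff(call_remote_service, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a remote call with exponential backoff and full jitter.

    Naive retry loops hammer an already struggling service; spacing the
    retries out and randomizing them prevents many clients from retrying
    in lockstep and flooding the network.
    """
    for attempt in range(max_attempts):
        try:
            return call_remote_service()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, give up
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Hypothetical usage: wrap any network call that may fail transiently.
# result = call_with_backoff(lambda: fetch_order("order-42"))
```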

Distributed systems are easy as long as nothing goes wrong, and that optimism breeds a false sense of security. A distributed design has to be resilient enough to absorb every plausible failure without disrupting day-to-day business.

Communication will fail

In unreliable (that is, distributed) systems, there are two high-level approaches to application-level messaging:

  1. Reliable but slow: keep a copy of each message until you are sure the next process in the chain has taken full responsibility for it.
  2. Unreliable but fast: send multiple copies, potentially to multiple recipients, and accept that messages will sometimes be lost or duplicated.

The reliable and unreliable application-level communication we discuss here is different from network-level reliability, such as TCP versus UDP. Imagine two stateless services that exchange messages directly over TCP, as in typical RPC communication. Even though TCP is a reliable network protocol, this is not reliable application-level communication: either service can fall over and lose the messages it is processing, because a stateless service has nowhere safe to store in-flight data. (Banq note: this describes synchronous RPC frameworks such as Dubbo or Google gRPC.)

We can make this setup reliable at the application level by placing stateful queues between the services to hold each message until it has been fully processed (Banq note: this is where message queues come in). The downside is that it is somewhat slower, but that may be a price worth paying if it makes life easier, especially if we use a managed stateful queue service so we do not have to worry about sizing and elasticity ourselves.

The reliable approach is predictable but adds latency and complexity: lots of acknowledgement messages, and resilient retention of data (state) until the next service in the chain has confirmed that it has taken responsibility for it.

A reliable approach does not guarantee fast delivery, but it does ensure that every message is eventually delivered at least once. It suits environments where every message matters and loss cannot be tolerated, such as credit card transactions. AWS Simple Queue Service (SQS), Amazon’s managed queue service, is an example of a stateful service used in this reliable way. (Banq note: Apache Kafka offers exactly-once delivery semantics, which also suits transactional use cases such as credit card payments.)
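
As an illustration of the reliable pattern, here is a minimal sketch of an at-least-once SQS consumer using boto3: the message is deleted from the queue only after processing succeeds, so a crash mid-processing simply means the message becomes visible again and is retried. The queue URL and the `process` function are assumptions for the example.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical

def process(body: str) -> None:
    ...  # hypothetical business logic

def consume_forever() -> None:
    """At-least-once consumption: only delete a message once it is fully processed."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            try:
                process(msg["Body"])
            except Exception:
                # Do not delete: the message reappears after the visibility
                # timeout and will be retried, so nothing is lost.
                continue
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```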

In the second case, end-to-end communication is faster if we use unreliable methods (such as synchronous RPC), but services then have to expect duplicated and out-of-order messages, and some messages will simply be lost. Unreliable communication is appropriate when messages are time-sensitive (if they are not acted on quickly they are not worth acting on at all) or when later data simply overrides earlier data. For very large distributed systems, unreliable messaging is attractive because it is cheaper and much faster, but the microservice design must then cope with message loss and duplication.
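
With the fast-but-unreliable approach, that tolerance has to live in the consumer. A minimal sketch of one common defensive pattern (an illustration, not something the article prescribes): drop message IDs you have already seen, and let a higher version number simply override older data so late, out-of-order updates are ignored.

```python
# In production these would be bounded or persistent stores, not plain dicts.
seen_ids: set[str] = set()                 # message IDs already handled
latest: dict[str, tuple[int, str]] = {}    # key -> (version, value)

def handle(message: dict) -> None:
    """Tolerate duplicated, out-of-order, and lost messages."""
    if message["id"] in seen_ids:
        return                             # duplicate delivery: drop it
    seen_ids.add(message["id"])

    key, version, value = message["key"], message["version"], message["value"]
    current = latest.get(key)
    if current is None or version > current[0]:
        latest[key] = (version, value)     # newer data overrides older data
    # else: a stale update arrived late; ignore it
```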

Within each of these approaches there are many variations (for example, guaranteed versus non-guaranteed ordering), each trading off speed, complexity, and failure rates differently.

Some systems use a combination of these methods depending on the type of message being sent, or even on the current load on the system. If you have many services that each behave differently, this becomes hard to manage, so a service’s communication behavior needs to be clearly defined in its API. It often makes sense to constrain, or at least recommend, the communication behaviors that services in your system may use, to achieve some degree of consistency.

What time is it now?

There is no common, global clock in a distributed system. In a group chat, for example, my comments and those of my friends in Australia, Colombia, and Japan do not appear in one strict order. There is no guarantee that we all see the same timeline; an order does eventually emerge, but only once we stop talking for a while.

In a distributed system every machine has its own clock, and there is no single correct time for the system as a whole. Machine clocks can be synchronized, but even then transmission times vary and physical clocks run at slightly different rates, so everything immediately drifts out of sync again.

On a single machine, one clock can provide a common time to all threads and processes. In a distributed system, that is physically infeasible.

In this new world, clock time no longer provides an indisputable definition of order. There is no single “when” in the microservices world, so our designs should not rely on wall-clock timing or ordering of inter-service messages.
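
A tiny sketch of why wall-clock timestamps cannot define order across machines: if one service's clock is only slightly off, sorting events by local timestamps can put an effect before its cause. The two-second skew here is made up for illustration.

```python
# Service A records an order, then service B records its shipment half a
# second later, but B's clock happens to run two seconds slow.
events = [
    {"service": "A", "what": "order created", "local_ts": 100.0},        # A's clock is accurate
    {"service": "B", "what": "order shipped", "local_ts": 100.5 - 2.0},  # B's clock is 2 s slow
]

for event in sorted(events, key=lambda e: e["local_ts"]):
    print(event["what"])
# Prints "order shipped" before "order created": the shipment appears to
# precede the order that caused it.
```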

The truth is out there?

In a distributed system there is no global shared memory, and therefore no single version of the truth. Data is scattered across physical machines. Moreover, any given piece of data is more likely to be in transit between machines, relatively slow to reach and momentarily inaccessible, than it would be in a monolithic architecture. Operations therefore have to work from the local information currently available.

This means different parts of the system will not always agree. In theory they become consistent eventually, as messages propagate through the system, but if the data keeps changing we may never reach a perfectly consistent state short of switching off all new input and waiting. Services must therefore handle the fact that calls to other services may return “old” or inconsistent information.
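
One common way to defend against acting on stale data (an illustrative pattern, not one prescribed by the article) is optimistic concurrency: every record carries a version, and a write is rejected if the version it was based on is no longer current.

```python
class StaleWriteError(Exception):
    """Raised when an update was computed from out-of-date data."""

# Toy in-memory store; each record carries a version number.
store: dict[str, dict] = {"order-1": {"version": 3, "status": "pending"}}

def update(record_id: str, expected_version: int, changes: dict) -> None:
    """Apply changes only if the caller read the latest version of the record."""
    record = store[record_id]
    if record["version"] != expected_version:
        raise StaleWriteError(f"{record_id} changed since it was read")
    record.update(changes)
    record["version"] += 1  # bump so other stale writers are detected
```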

Speak faster!

In a monolithic application, most of the important communication happens within a single process, between one component and another. In-process communication is very fast, so passing lots of internal messages is not a problem. Once you split those components into separate services, often running on different machines, things get trickier.

Keep the following rules of thumb in mind:

  1. In the best case, sending a message from one machine to another takes roughly 100 times longer than passing it between components inside a single process.
  2. Many services communicate using text-based RESTful messages. REST is cross-platform and easy to use, read, and debug, but slow to transmit. Remote procedure call (RPC) messages paired with a binary protocol (gRPC, for example) are not human-readable and therefore harder to debug, but are much faster to send and receive, on the order of 20 times faster.

In a distributed environment, the result is:

  1. Send fewer messages. You may choose to send fewer messages between distributed microservices than you would between components inside a monolith, because every message adds latency. (See the sketch after this list.)
  2. Send messages more efficiently. For whatever you do send, you can speed the system up by using RPC rather than REST, or even raw UDP with the unreliability handled by hand. (Banq note: RPC communication is fast but unreliable.)
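
Here is a minimal sketch of the "send fewer messages" advice: instead of one remote call per item, group items and send them in a single request so the per-message latency is paid once per batch rather than once per item. The endpoint and payload shape are hypothetical.

```python
import json
import urllib.request

BATCH_SIZE = 100
ENDPOINT = "http://inventory-service.internal/api/updates"  # hypothetical service

def send_batch(updates: list[dict]) -> None:
    """One round trip carries many updates instead of one update each."""
    body = json.dumps({"updates": updates}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def flush(pending: list[dict]) -> None:
    # Network latency dominates, so amortize it across up to BATCH_SIZE items.
    for i in range(0, len(pending), BATCH_SIZE):
        send_batch(pending[i:i + BATCH_SIZE])
```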

Status report?

If your system can change at sub-second rates, which is the goal of a dynamically managed distributed architecture, then you need visibility at that rate too. Many traditional logging and monitoring tools were not designed to track changes this fast; make sure yours can.

Testing for failure

The only way to know whether your distributed system works, and can recover from unpredictable errors, is to keep engineering those errors and keep fixing the system. Netflix uses Chaos Monkey to cause deliberate, random failures. You need to test your system’s resilience and integrity, and it is just as important to test your logging, so that if an error does occur you can diagnose and fix it after the fact, even once your system is back online.

That sounds difficult. Do I really need it?

Building distributed, scalable, resilient systems is very hard, especially for stateful services that must persist changing data. Now is the time to decide whether you actually need it yet. Can your customers tolerate slower responses or a smaller-scale system for a while? If so, you can design a smaller, slower, simpler system first and add complexity gradually as you build expertise.

Cloud providers such as AWS, Google, and Azure are also building out many of these capabilities, especially resilient state (managed message queues and databases). Those services can look expensive, but so is building and maintaining complex distributed services yourself.

Any framework that handles some of this complexity for you, even at the cost of constraining you (such as Linkerd, Istio, or Azure’s Service Fabric), is well worth considering.

The key is not to underestimate how hard it is to build genuinely resilient, highly scalable services. If you decide you really do need them, educate everyone thoroughly, introduce useful constraints, proceed incrementally, and expect both setbacks and successes.