Introduce: Nuwa is flying in the distributed system to provide distributed coordination of the basic services, supporting ali cloud computing, network, storage and other almost all cloud products. In nuwa distributed collaborative service, consistency engine is the core basic module, supporting various consistency protocols such as Paxos, Raft and EPaxos, and supporting different business state machines according to business requirements. How to ensure the correctness of consistency library is a big challenge, we introduced TLA+, Jepsen and other tools to ensure the correctness of consistency library. This paper introduces the formal validation tool TLA+ from a programmer’s perspective.

The author | | auspicious light source ali technology to the public

A quote

Nuwa is the flying distributed system to provide distributed coordination of the basic services, supporting ali cloud computing, network, storage and other almost all cloud products. In nuwa distributed collaborative service, consistency engine is the core basic module, supporting various consistency protocols such as Paxos, Raft and EPaxos, and supporting different business state machines according to business requirements. How to ensure the correctness of consistency library is a big challenge, we introduced TLA+, Jepsen and other tools to ensure the correctness of consistency library. This paper introduces the formal validation tool TLA+ from a programmer’s perspective.

Theoretically proved the correctness of a program, or algorithm is often difficult, commonly used in engineering test to find the problem, but no amount of testing there is no guarantee that it covered all the behavior of those who didn’t cover behavior becomes a potential hidden trouble, once again exposed online, and often leads to unexpected results. Formal verification is precisely to solve such problems. It uses the powerful computing power of the computer to search all possible behaviors violently and check whether they meet the pre-set attributes. Any behavior that does not meet the expectations can be found to fundamentally ensure the correctness of the algorithm.

Two TLA + profile

TLA+(Temporal Logic of Actions) is a formal verification language developed by Leslie Lamport. It is used for program design, modeling, documentation and verification, especially for concurrent and distributed systems. TLA+ is designed to accurately describe the system using simple mathematical theories and formulas. TLA+ and related tools help eliminate basic errors in programs that are difficult to find and costly to correct.

To use TLA+ to formally verify the program, the first step is to use TLA+ to describe the program, which is called Specification. Once you have the Specification, you can run it using the TLC model checker, which iterates through all possible behaviors, checking properties specified in the Specification, and finding unexpected behaviors.

TLA+ is mathematically based and uses mathematical thinking, unlike any programming language. In order to lower the threshold of TLA+, Lamport developed the PlusCal language. PlusCal is similar to a programming language and can easily describe the program logic. Moreover, the tools provided by TLA+ can directly translate PlusCal into TLA+. Most engineers will find PlusCal to be the easiest way to start using TLA+, but the cost of simplicity is that PlusCal doesn’t have some of the features of TLA+ and sometimes can’t build as complex a model as TLA+, so PlusCal isn’t a replacement for TLA+ yet. TLA+ development can be simplified by completing the basic logic using the PlusCal programming language and then further modifying it based on the generated TLA+ code.

Three TLA + application

TLA+ has a wide range of applications in both academia and industry. TLA+ Examples show some distributed and concurrent algorithms validated using TLA+. In the research field of distributed algorithm and concurrent algorithm, to propose a new algorithm or improve an existing algorithm, TLA+ verification is basically standard. In addition to the introduction of non-formal demonstration, many distributed algorithm papers will attach the Specification of TLA+ to prove that their algorithms are formally verified. For those familiar with TLA+, it is even faster to understand the Specification of TLA+ by looking at it directly than by reading a long section of paper. When they do not understand the language description of the paper or feel there is ambiguity, they can compare the Specification of TLA+ with the understanding, which is sometimes a powerful tool to read the paper. Sometimes the algorithm details can only be seen in the Specification of TLA+. Since Specification is logical and impenetrable, it can be a better guide to implementation.

Lamport’s TLA+ home page lists a number of TLA+ applications in the industry. Take Amazon as an example, the core algorithms of some Amazon AWS systems use TLA+ for formal verification. Table 1 lists the problems found by TLA+ for some AWS systems, which cover some very core components. Once the problems of these core components are exposed online, the losses will be immeasurable. As such, it has become the industry standard for the core algorithms of distributed cloud services to use TLA+ to verify designs, so as a cloud service practitioner or interested students, familiarity with TLA+ is an absolute plus.

Four TLA + primer

You can start using TLA+ by installing the TLA+ plug-in in VS Code. Here’s a simple example to get started with TLA+.

Consider a single-bit clock that, because it has only one bit, can only be 0 or 1, and behaves in one of two ways:

0 -> 1 -> 0 -> 1 -> 0 -> 1 -> 0 -> 1 -> 0 -> 1 ->…

How do we describe this clock in terms of TLA+? To make it easier to get started, PlusCal is more convenient for engineers to get started.

Figure 1: PlusCal description of a single-bit clock

Figure 1 is the PlusCal description of the single-bit clock, I believe that students with programming background can easily understand. This PlusCal code can be translated directly into TLA+ code using the tools provided by TLA+ :

Figure 2: TLA+ representation of a single-bit clock

With the PlusCal foundation above, it is not difficult to understand the TLA+ section. The key is to understand the Spec. The Spec defines the behavior of the system. Figure 3 illustrates the behavior of a single-bit clock. Init initializes the clock to 0 or 1, Tick makes the clock jump back and forth between 0 and 1, and Stutter leaves the clock unchanged. And what TLA+ is doing is essentially traversing the graph.

Figure 3: Single-bit clock behavior

For this TLA+ to run, the TLA+ code above needs to be saved to the clock. TLA file, and a clock. CFG file, shown in Figure 4, needs to be written. The clock. CFG file is very simple and specifies which Specification to run. Which Invariant is to be checked.

Figure 4: Clock. CFG file contents

With these two files, you are ready to run with TLC, and when you finish, you get the result shown in Figure 5, which shows some statistics.

Figure 5: Run results

Five TLA + principle

To understand how TLA+ works and how it traverses, we can add some parameters at run time and let TLC output the state graph. For example, let’s run the TLA+ code shown in Figure 6. Figure 7 shows the REQUIRED CFG file to run. This example tries to find all the possible combinations of 1’s, 2’s, and 5’s to make $19.

Figure 6: money. Tla

Figure 7: money. CFG

After the operation, the state graph as shown in FIG. 8 can be obtained. The vertex in the graph is the state, with a total of 20 states, money=0 is the initial state, money=19 is the termination state, and the edge in the graph is the action, with 4 actions in total: Add(1), Add(2), Add(5) and Terminating.

Figure 8: State diagram

The operation of TLA+ is completely serial, and the running process is to traverse the graph on the state graph. Every time a state is traversed, it will check whether the current state meets the pre-set invariant. If so, the traversal will continue, and if not, an error will be reported immediately. TLA+ will try all traversal paths and not miss any behavior. We know that there are two traversal modes of graph: depth-first and breadth-first. TLA+ default breadth-first traversal can also be configured into depth-first mode or random behavior mode. Depth-first mode requires a maximum depth.

Now we know that the principle of TLA+ is actually a process of traversing and checking the state graph. Such a process seems simple, but it can cover all the paths of the algorithm without missing any behavior. In fact, we often use TLA+ to check the Safety and Liveness attributes of the algorithm.

Six TLA + concurrency

By now, I believe that readers have a preliminary understanding of the principle of TLA+, but careful readers may still have a big question in mind: TLA+ running process is completely serial, so how does serial TLA+ simulate concurrent algorithm or distributed algorithm?

For the serial algorithm, the actions in the algorithm are Totally Ordered, which itself is a serial state machine, and it is easy to construct the state graph. However, the actions in the concurrent algorithm or distributed algorithm are Partially Ordered, not a serial state machine, so how to construct the state graph?

If the actions in a concurrent algorithm or distributed algorithm can also become Totally Ordered, it can also be regarded as a serial state machine to construct a state graph.

In fact, Master Lamport has been working on this issue for a long Time. In his most cited paper, Time, Clocks and the Ordering of Events in a Distributed System, he provides a method for Ordering Events in a Distributed System. To put it simply, on the premise that the order of the events with Partially Ordered relationship is guaranteed, the order of the remaining disorderly events is artificially determined, so that all the events can be Ordered into Totally Ordered, and the ordering will not destroy the causal relationship.

In fact, it is in the areas of concurrent and distributed algorithms that TLA+ shines, because in these areas, the behavior of algorithms is so diverse that it is easy to miss. Therefore, TLA+ needs to thoroughly examine all the paths of algorithms and not miss any behavior.

Seven summarizes

TLA+ uses a computer’s powerful computing power to search algorithms for all possible behaviors to find unexpected behaviors. As computing power improves and software and hardware systems become more complex, TLA+ will be increasingly valued and become an essential skill for engineers.

Finally, if readers are interested in TLA+, here is a recommended introduction to TLA+ book “Practical TLA+”, which is suitable for getting started, and there is a free electronic version available online for direct download.

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.