In a microservice architecture, dependencies between services are very common. For example, the review service depends on the audit service, and the audit service in turn depends on the anti-spam service. When the review service calls the audit service, the audit service calls the anti-spam service. If the anti-spam service times out, the audit service's logic blocks waiting on it. Meanwhile, the review service keeps calling the audit service, so the audit service may be overwhelmed by the backlog of requests and go down.

Thus, a single failing link in a call chain can cause a series of problems in the upstream callers and can even bring down every service in the chain. Therefore, when a service calls another service, the caller needs to protect itself so that problems in the called service do not become problems in the caller. The common means of protection is circuit breaking.

Circuit breaker principle

The circuit breaker mechanism is modeled on the fuse that protects electrical circuits in daily life. When a circuit is overloaded, the fuse breaks automatically so that the appliances on the circuit are not damaged. In service governance, circuit breaking means that when the error rate returned by the called service exceeds a certain threshold, subsequent requests are not actually sent; instead, the caller returns an error directly.

In this pattern, the service caller maintains a state machine for each invoked service (invocation path), in which there are three states:

  • Closed: In this state, a counter records the number of failed calls and the total number of requests. If the failure rate reaches a preset threshold within a certain time window, the breaker switches to the open state and starts a timeout; when the timeout expires, it switches to the half-open state. The timeout gives the system a chance to correct the errors that caused calls to fail and to return to normal operation. In the closed state, the error counts are time-based and are reset at a fixed interval, which prevents occasional errors from tripping the breaker.
  • Open: In this state, every request immediately returns an error. A timeout timer is usually started on entering this state; when the timer expires, the state switches to half-open.
  • Half-open: In this state, the application is allowed to send a limited number of requests to the called service. If these calls succeed, the service is considered to have recovered, the breaker switches to the closed state, and the counters are reset. If any of these calls still fails, the called service is considered not yet recovered, the breaker switches back to the open state, and the counters are reset. The half-open state effectively prevents a recovering service from being knocked down again by a sudden flood of requests.

The introduction of circuit breakers in service governance makes the system more stable and resilient, provides stability when the system recovers from errors, and reduces the impact of errors on system performance. Service calls that may cause errors can be quickly rejected without waiting for the return of real errors

Introducing the circuit breaker

The previous section covered the principle of the circuit breaker. How should one be introduced into a system? One option is to add it to the business logic, but that is neither elegant nor generic. It is better to integrate it into the framework, and the zRPC framework has a circuit breaker built in.

As we know, the circuit breaker mainly protects the caller, and every request must pass through the breaker before it is sent. A client interceptor satisfies both conditions, so the zRPC framework implements the circuit breaker in a client interceptor. The corresponding code is:

func BreakerInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	// Use one breaker per target and request method
	breakerName := path.Join(cc.Target(), method)
	return breaker.DoWithAcceptable(breakerName, func() error {
		// Actually make the call
		return invoker(ctx, method, req, reply, cc, opts...)
		// codes.Acceptable decides which errors count as failures for the breaker
	}, codes.Acceptable)
}

Circuit breaker implementation

The circuit breaker in zRPC is based on the overload protection algorithm described by Google SRE. The caller tracks two values:

  • requests: The total number of requests made by the caller
  • accepts: The number of requests processed normally by the called service

Normally, these two values are equal. When the called service starts to fail and reject requests, accepts falls below requests. The caller can keep sending requests until requests = K * accepts. Once this limit is exceeded, the circuit breaker opens and each new request is dropped locally, returning an error, with a certain probability. The probability is calculated as follows:

  max(0, (requests - K * accepts) / (requests + 1))

By modifying the multiplier K in the algorithm, the sensitivity of the circuit breaker can be adjusted. Reducing K makes the adaptive algorithm more sensitive, and increasing K makes it less sensitive. For example, lowering the caller's request ceiling from requests = 2 * accepts to requests = 1.1 * accepts means that the called service rejects only about one request for every ten it accepts before the caller starts dropping requests locally.

The code is located in go-zero/core/breaker:

type googleBreaker struct {
	k     float64                   // multiplier K, 1.5 by default
	stat  *collection.RollingWindow // sliding time window that counts failed and successful requests
	proba *mathx.Proba              // probability helper used to decide whether to drop a request
}

Implementation of the adaptive circuit-breaking algorithm:

func (b *googleBreaker) accept() error {
	accepts, total := b.history() // number of accepted requests and total number of requests
	weightedAccepts := b.k * float64(accepts)
	// Calculate the probability of dropping the request.
	// protection is a small constant defined elsewhere in the package; it keeps
	// the breaker from tripping when the request volume is very low.
	dropRatio := math.Max(0, (float64(total-protection)-weightedAccepts)/float64(total+1))
	if dropRatio <= 0 {
		return nil
	}
	// Decide probabilistically whether to reject this request
	if b.proba.TrueOnProba(dropRatio) {
		return ErrServiceUnavailable
	}

	return nil
}

The doReq method is called every time a request is made. It first calls accept to determine whether the breaker rejects the request. The acceptable function passed to doReq decides which errors are included in the failure count. The Acceptable function used by the interceptor is defined as follows:

func Acceptable(err error) bool {
	switch status.Code(err) {
	case codes.DeadlineExceeded, codes.Internal, codes.Unavailable, codes.DataLoss: // errors that count as failures
		return false
	default:
		return true
	}
}

If the request succeeds, markSuccess increments both the request count and the accepted count. If the request fails, only the request count is incremented.

func (b *googleBreaker) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error {
	// Check whether the breaker rejects this request
	if err := b.accept(); err != nil {
		if fallback != nil {
			return fallback(err)
		}
		return err
	}

	// A panic inside req counts as a failure and is then re-raised
	defer func() {
		if e := recover(); e != nil {
			b.markFailure()
			panic(e)
		}
	}()

	// Make the actual call
	err := req()
	if acceptable(err) {
		// Count a normal request
		b.markSuccess()
	} else {
		// Count a failed request
		b.markFailure()
	}

	return err
}

Conclusion

With a circuit breaker, the caller can protect itself so that errors or long delays in downstream services do not affect its own business logic. Many full-featured microservice frameworks have circuit breakers built in. Circuit breakers are useful not only for calls between microservices; they can also be introduced when calling dependent resources such as MySQL and Redis.

Project Address:

Github.com/tal-tech/go…

If you like this article, please star the project on GitHub 🤝