Link tracing based on OpenTelemetry

The past life of link tracing

Distributed tracing (also known as distributed request tracing) is a method for analyzing and monitoring applications, especially those built using microservices architectures. Distributed tracing helps pinpoint where faults occur and the cause of poor performance.

The origin of

The term Distributed Tracing first appeared in the paper Dapper, published by Google. In a large-scale Distributed Systems Tracing Infrastructure, this paper still has a deep influence on the implementation of link Tracing and the design concept of Jaeger, Zipkin and other open source Distributed Tracing projects that emerged later.

Microservices architecture is a distributed architecture with many different services. Different services called each other before, if something went wrong because a request went through N services. As the business grows and there are more and more calls between services, without a tool to keep track of the call chain, the problem solving will be as clueless as the cat playing with the ball of wool in the picture belowSo you need a tool that knows exactly what services a request goes through, and in what order, so you can easily locate problems.

Hundred flowers

After Google released Dapper, there are more and more distributed link tracing tools. The following lists some commonly used link tracing systems

  • Skywalking
  • Ali eagle eye
  • Dianping CAT
  • Twitter Zipkin
  • Naver pinpoint
  • Uber Jaeger

Tit for tat?

As the number of link-tracking tools grows, the open source world is divided into two main groups, one is the OtherTechnical committee member of CNCFWill give priority toOpenTracing“, such as Jaeger ZipkinOpenTracingThe specification. Google is the initiator of the otherOpenCensusGoogle itself was the first company to come up with the concept of link tracking, and Microsoft later joined inOpenCensus

The birth of OpenTelemetry

OpenTelemetric is a set of apis, SDKS, modules, and integrations designed for creating and managing the astrologer telemetry data, such as tracking, metrics, and logging

After Microsoft joined OpenCensus, it directly broke the balance before, and indirectly led to the birth of OpenTelemetry. Google and Microsoft decided to end the chaos. The first problem was how to integrate the existing projects of the two communities. Compatible with OpenCensus and OpenTracing, users can access OpenTelemetry with little or no change

Kratos link tracing practices

Kratos is a set of lightweight Go microservices framework, including a number of microservices related frameworks and tools.

Tracing the middleware

There is a middleware named Tracing middleware provided by Kratos framework, which implements link tracing function of Kratos framework based on Opentelemetry. The middleware code can be seen from middleware/ Tracing.

Realize the principle of

Kratos link tracking middleware is composed of three files. Carrie, tracer. Go, go tracing. Go. The implementation principle of client and server is basically the same, this paper analyzes the principle of server implementation.

  1. First, when a request comes in, the tracing middleware is called. The NewTracer method in tracer.go is called first
// Server returns a new server middleware for OpenTelemetry.
func Server(opts ... Option) middleware.Middleware {
        // Calling NewTracer in tracer.go passes in a SpanKindServer and configuration item
	tracer := NewTracer(trace.SpanKindServer, opts...)
        / /... Omit code
}
Copy the code
  1. The NewTracer method in tracer.go returns a tracer as follows
func NewTracer(kind trace.SpanKind, opts ... Option) *Tracer {options := options{}
	for _, o := range opts {
		o(&options)
	}
        // Determine if the OTEL trace provider configuration exists and set it if it does
	ifoptions.TracerProvider ! = nil { otel.SetTracerProvider(options.TracerProvider) }*/ * Check if there is a Propagators set, Propagators if there is one, and set a default TextMapPropagator if there is none
	ifoptions.Propagators ! = nil { otel.SetTextMapPropagator(options.Propagators) }else {	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.Baggage{}, propagation.TraceContext{}))
	}
        
       
	var name string
        // Determine the current type of middleware, whether server or client
	if kind == trace.SpanKindServer {
		name = "server"
	} else if kind == trace.SpanKindClient {
		name = "client"
	} else {
		panic(fmt.Sprintf("unsupported span kind: %v", kind))
	}
        // Call the otel package Tracer method and pass in name to create a Tracer instance
	tracer := otel.Tracer(name)
	return &Tracer{tracer: tracer, kind: kind}
}
Copy the code
  1. Determine the current request type, process the data to be collected, and call the Start method in tracer.go
			var (
				component string
				operation string
				carrier   propagation.TextMapCarrier
			)
			// Determine the request type
			if info, ok := http.FromServerContext(ctx); ok {
				// HTTP
				component = "HTTP"
				// Retrieve the requested address
				operation = info.Request.RequestURI
				// Call HeaderCarrier in otel/ Propagation package. HTTP.Header is processed to satisfy TextMapCarrier Interface
				// TextMapCarrier is a text mapping carrier used to carry information
				carrier = propagation.HeaderCarrier(info.Request.Header)
				/ / otel. GetTextMapPropagator (). The Extract text () method is used to map the carrier, read into context
				ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(info.Request.Header))
			} else if info, ok := grpc.FromServerContext(ctx); ok {
				// Grpc
				component = "gRPC"
				operation = info.FullMethod
				//
				/ / call GRPC/metadata package metadata. FromIncomingContext (CTX) was introduced into CTX, convert GRPC metadata
				if md, ok := metadata.FromIncomingContext(ctx); ok {
					// Call MetadataCarrier in carrier.go to convert the MD into a text mapping carrier
					carrier = MetadataCarrier(md)
				}
			}
                        // Call the tracer.start method
			ctx, span := tracer.Start(ctx, component, operation, carrier)
                        / /... Omit code
		}
Copy the code
  1. Call the Start method in tracing. Go
func (t *Tracer) Start(ctx context.Context, component string, operation string, carrier propagation.TextMapCarrier) (context.Context, trace.Span) {
	// Determine if the current middleware is server and inject carrier into the context
	if t.kind == trace.SpanKindServer {
		ctx = otel.GetTextMapPropagator().Extract(ctx, carrier)
	}
	// Call the start method in the otel/tracer package to create a span
	ctx, span := t.tracer.Start(ctx,
		// Tracing. Go declared the request route as spanName
		operation,
		// Set the span property, set a component, component is the request type
		trace.WithAttributes(attribute.String("component", component)),
		// Set the span type
		trace.WithSpanKind(t.kind),
	)
	// Determine if the current middleware is a client and inject carrier into the request
	if t.kind == trace.SpanKindClient {
		otel.GetTextMapPropagator().Inject(ctx, carrier)
	}
	return ctx, span
}
Copy the code
  1. Defer declared a closure method
// Note here that closure is needed because defer's arguments are evaluated in real time and err will remain nil if an exception occurs
// https://github.com/go-kratos/kratos/issues/927
defer func(a) { tracer.End(ctx, span, err) }()
Copy the code
  1. Middleware continues execution
/ / tracing. Go 69 rows
reply, err = handler(ctx, req)
Copy the code
  1. The End method in tracer.go is executed after the closure in defer is called
func (t *Tracer) End(ctx context.Context, span trace.Span, err error) {
	// Check whether exceptions occur. If yes, set some exception information
	iferr ! =nil {
		// Record an exception
		span.RecordError(err)
		// Set the span property
		span.SetAttributes(
			// Set the event to exception
			attribute.String("event"."error"),
			// Set message to err.error ().
			attribute.String("message", err.Error()),
		)
		// Set the span state
		span.SetStatus(codes.Error, err.Error())
	} else {
		// If no exception occurs, the span state is OK
		span.SetStatus(codes.Ok, "OK")}/ / suspended span
	span.End()
}
Copy the code

How to use

Examples of tracing middleware can be found from Kratos /examples/traces, which simply implements link tracing across services. The following code snippet contains some examples.

// https://github.com/go-kratos/kratos/blob/7f835db398c9d0332e69b81bad4c652b4b45ae2e/examples/traces/app/message/main.go#L3 8
// First call the otel library method to get a TracerProvider
func tracerProvider(url string) (*tracesdk.TracerProvider, error) {
	Jaeger is used in examples/traces, and you can see opentElemetry for other methods
	exp, err := jaeger.NewRawExporter(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
	iferr ! =nil {
		return nil, err
	}
	tp := tracesdk.NewTracerProvider(
		tracesdk.WithSampler(tracesdk.AlwaysSample()),
		// Set up the Batcher and register the Jaeger exporter
		tracesdk.WithBatcher(exp),
		// Record some default information
		tracesdk.WithResource(resource.NewWithAttributes(
			semconv.ServiceNameKey.String(pb.User_ServiceDesc.ServiceName),
			attribute.String("environment"."development"),
			attribute.Int64("ID".1),)))return tp, nil
}
Copy the code

Used in GRPC/Server

// https://github.com/go-kratos/kratos/blob/main/examples/traces/app/message/main.go
	grpcSrv := grpc.NewServer(
		grpc.Address(": 9000"),
		grpc.Middleware(
			middleware.Chain(
				recovery.Recovery(),
				// Configuring tracing Middleware
				tracing.Server(
					tracing.WithTracerProvider(tp),
					),
				),
				logging.Server(logger),
			),
		))
Copy the code

Used in GRPC/Client

// https://github.com/go-kratos/kratos/blob/149fc0195eb62ee1fbc2728adb92e1bcd1a12c4e/examples/traces/app/user/main.go#L63
	conn, err := grpc.DialInsecure(ctx,
		grpc.WithEndpoint("127.0.0.1:9000"),
		grpc.WithMiddleware(middleware.Chain(
			tracing.Client(
				tracing.WithTracerProvider(s.tracer),
				tracing.WithPropagators(
					propagation.NewCompositeTextMapPropagator(propagation.Baggage{}, propagation.TraceContext{}),
				),
			),
			recovery.Recovery())),
		grpc.WithTimeout(2*time.Second),
	)
Copy the code

Used in HTTP/Server

// https://github.com/go-kratos/kratos/blob/main/examples/traces/app/user/main.go
	httpSrv := http.NewServer(http.Address(": 8000"))
	httpSrv.HandlePrefix("/", pb.NewUserHandler(s,
		http.Middleware(
			middleware.Chain(
				recovery.Recovery(),
				// Configuring tracing middleware
				tracing.Server(
					tracing.WithTracerProvider(tp),
					tracing.WithPropagators(
						propagation.NewCompositeTextMapPropagator(propagation.Baggage{}, propagation.TraceContext{}),
					),
				),
				logging.Server(logger),
			),
		)),
	)
Copy the code

Used in HTTP /client

	http.NewClient(ctx, http.WithMiddleware(
		tracing.Client(
			tracing.WithTracerProvider(s.tracer),
		),
	))
Copy the code

How to implement a tracing of other scenes

We can refer to the code of Kratos tracing middleware to realize the tracing of database. The author uses tracing middleware to realize the tracing of qMgo library to operate MongoDB database.

func mongoTracer(ctx context.Context,tp trace.TracerProvider, command interface{}) {
	otel.SetTracerProvider(tp)
	var (
		commandName string
		failure     string
		nanos       int64
		reply       bson.Raw
		queryId     int64
		eventName   string
	)
	reply = bson.Raw{}
	switch value := command.(type) {
	case *event.CommandStartedEvent:
		commandName = value.CommandName
		reply = value.Command
		queryId = value.RequestID
		eventName = "CommandStartedEvent"
	case *event.CommandSucceededEvent:
		commandName = value.CommandName
		nanos = value.DurationNanos
		queryId = value.RequestID
		eventName = "CommandSucceededEvent"
	case *event.CommandFailedEvent:
		commandName = value.CommandName
		failure = value.Failure
		nanos = value.DurationNanos
		queryId = value.RequestID
		eventName = "CommandFailedEvent"
	}
	duration, _ := time.ParseDuration(strconv.FormatInt(nanos, 10) + "ns")
	tracer := otel.Tracer("mongodb")
	kind := trace.SpanKindServer
	ctx, span := tracer.Start(ctx,
		commandName,
		trace.WithAttributes(
			attribute.String("event", eventName),
			attribute.String("command", commandName),
			attribute.String("query", reply.String()),
			attribute.Int64("queryId", queryId),
			attribute.String("ms", duration.String()),
		),
		trace.WithSpanKind(kind),
	)
	iffailure ! ="" {
		span.RecordError(errors.New(failure))
	}
	span.End()
}
Copy the code

reference

  • Dapper, a Large-scale Distributed Systems Tracing Infrastructure
  • OpenTelemetry website
  • KubeCon2019 OpenTelemetry share
  • Kratos framework
  • Traces of sample