STARTING
UP

The probability of occurrence of this situation is very low, running for more than a year without finding two instances of simultaneous problems, so multi-instance running can be avoided. The solution to the problem at the end of the article can be skipped to the end without spending time looking at the analysis process.

Environment Description:

Eureka-client: 1.7.2 Spring-boot: 1.5.12.RELEASE Spring-cloud: edgware.sr3

Problem reproduction

With the help of Btrace reconstruction, java -noverify -cp .:btrace-boot.jar -javaagent:btrace-agent.jar=script=

Train of thought

Before the main thread updates the local state of the instance (STARTING->UP), wait for the heartbeat thread to complete its first heartbeat and attempt to register the instance to obtain the current state. Triggered after the status of the main thread is updated

Btrace scripts

import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.Kind;
import com.sun.btrace.annotations.Location;
import com.sun.btrace.annotations.OnMethod;

import java.util.concurrent.atomic.AtomicBoolean;

import static com.sun.btrace.BTraceUtils.currentThread;
import static com.sun.btrace.BTraceUtils.println;

/ * * *@author Addo.Zhang
 * @dateThe 2019-07-31 * /
@BTrace(unsafe = true)
public class EurekaRequest {
    public static final AtomicBoolean heartbeatThreadRegistrationStarted = new AtomicBoolean(false);
    public static final AtomicBoolean replicatorThreadRegistrationCompleted = new AtomicBoolean(false);
    public static final AtomicBoolean statusUP = new AtomicBoolean(false);

    @OnMethod(clazz = "org.springframework.cloud.netflix.eureka.serviceregistry.EurekaServiceRegistry", location = @Location(value = Kind.LINE, line = 45))
    public static void waitHeartbeatExecution(a) {
        println(currentThread() + " is waiting heartbeatThreadRegistrationStarted thread executing first");
        while(! heartbeatThreadRegistrationStarted.get()) ; }@OnMethod(clazz = "org.springframework.cloud.netflix.eureka.serviceregistry.EurekaServiceRegistry", location = @Location(value = Kind.LINE, line = 46))
    public static void markStatusUp(a) {
        statusUP.set(true);
        println("Heartbeat thread executed and " + currentThread() + " continues procedure to change status to [UP]");
    }

    @OnMethod(clazz = "com.netflix.discovery.converters.EurekaJacksonCodec$InstanceInfoSerializer", location = @Location(value = Kind.LINE, line = 369))
    public static void continueRegistrationExecution(a) {
        doExecution();
    }

    @OnMethod(clazz = "com.logancloud.forge.discovery.converters.LoganEurekaJacksonCodec$LoganInstanceInfoSerializer", location = @Location(value = Kind.LINE, line = 117))
    public static void continueRegistrationExecution2(a) {
        doExecution();
    }

    private static void doExecution(a) {
        println(currentThread() + " started to proceed registration");
        if (Thread.currentThread().getName().contains("HeartbeatExecutor")) {
            heartbeatThreadRegistrationStarted.set(true);
            while(! statusUP.get() || ! replicatorThreadRegistrationCompleted.get()) { }try {
                Thread.sleep(500); //interval for replicator registration request completed.
            } catch(InterruptedException e) { Thread.currentThread().interrupt(); }}else if (Thread.currentThread().getName().contains("InstanceInfoReplicator")) {
            replicatorThreadRegistrationCompleted.set(true);
        }
        println(currentThread() + " thread registration completed"); }}Copy the code

  1. The heartbeat threadHeartbeatThreadSending a heartbeat request (PUT), the registry returns 404.
  2. Instance information synchronizes threadsInstanceInfoReplicatorSend a registration request (POST) : status toUP.lastDirtyTimestampfora
  3. Heartbeat thread sends instance registration request (POST) : status toSTARTING.lastDirtyTimestampfora

The service registry

Start by analyzing the registration logic of the service instance.

Initialization InstanceInfo

Through InstanceInfoFactory# the create () method to initialize ApplicationInfoManager. InstanceInfo instance, instance state is set to STARTING

Service Instance Registration

The real logic for registering a service instance is done in DiscoveryClient# Register (). However, this method invocation has two entrances, which can be interpreted as active and passive registration in the whole process.

First, sign up

EurekaAutoServiceRegistration implements SmartLifecycle interface in EurekaClientAutoConfiguration# EurekaAutoServiceRegistration () is instantiated.

EurekaAutoServiceRegistration# start () method will give EurekaServiceRegistry EurekaRegistration registration:

public void start(a) {...if (!this.running.get() && this.registration.getNonSecurePort() > 0) {
      // Call EurekaServiceRegistry to register
    	this.serviceRegistry.register(this.registration);
    	// Publish events registered with the instance
    	this.context.publishEvent(
    			new InstanceRegisteredEvent<>(this.this.registration.getInstanceConfig()));
    	this.running.set(true); }}Copy the code

Eurekaserviceregistryregister (): Set the state of the instance to UP(this can be modified by eureka.instance.initial-status, which is UP by default). This triggers StatusChangeListener#notify().


public void register(EurekaRegistration reg) {
	maybeInitializeClient(reg);
	if (log.isInfoEnabled()) {
		log.info("Registering application " + reg.getInstanceConfig().getAppname()
				+ " with eureka with status "
				+ reg.getInstanceConfig().getInitialStatus());
	}
	/ / 1
	reg.getApplicationInfoManager()
			.setInstanceStatus(reg.getInstanceConfig().getInitialStatus());
	if(reg.getHealthCheckHandler() ! =null) {
	   / / 2reg.getEurekaClient().registerHealthCheck(reg.getHealthCheckHandler()); }}Copy the code

The DiscoveryClient internal anonymous class provides an implementation of StatusChangeListener that calls InstanceInfoReplicator#onDemandUpdate()

statusChangeListener = new ApplicationInfoManager.StatusChangeListener() {
    @Override
    public String getId(a) {
        return "statusChangeListener";
    }
    @Override
    public void notify(StatusChangeEvent statusChangeEvent) {
        if (InstanceStatus.DOWN == statusChangeEvent.getStatus() ||
                InstanceStatus.DOWN == statusChangeEvent.getPreviousStatus()) {
            // log at warn level if DOWN was involved
            logger.warn("Saw local status change event {}", statusChangeEvent);
        } else {
            logger.info("Saw local status change event {}", statusChangeEvent); } instanceInfoReplicator.onDemandUpdate(); }};Copy the code

InstanceInfoReplicator is an implementation of Runnable instantiated in DiscoveryClient#initScheduledTasks(), after which a thread is scheduled using its internal scheduling thread pool. OnDemandUpdate () also schedules a thread using the scheduled thread pool.

Its #run() method calls DiscoveryClient#refreshInstanceInfo() to update the status. Status updates are implemented through the HealthCheckHandler, as described in Status checking. The DiscoveryClient# Register () method is then called to register.

Two. Passive registration

DiscoveryClient#initScheduledTasks() was mentioned above, where tasks have other threads besides InstanceInfoReplicator. One of these is the line thread HeartbeatThread. This thread sends a PUT HTTP request to the registry every once in a while to report the status of the instance (status, and the time when the status changed (lastDirtyTimestamp)).

There are two possible results: 404 and 200. The former indicates that the instance is not registered in the registry. The latter indicates that the status is reported successfully.

If it is 404, the registration action is initiated directly by calling the DiscoveryClient# Register () method.

State examination (mmse)

CloudEurekaClient HealthCheckHandler to check the health status of instance, look at the HealthCheckCallbackToHandlerBridge implementation: Callback returns the current status if callback is empty or the current status is STARTING or OUT_OF_SERVICE. We did not set callback, so the current state is always returned. For example, the initial state of the application startup is STARTING

public InstanceInfo.InstanceStatus getStatus(InstanceInfo.InstanceStatus currentStatus) {
    if (null == callback || InstanceInfo.InstanceStatus.STARTING == currentStatus
            || InstanceInfo.InstanceStatus.OUT_OF_SERVICE == currentStatus) { // Do not go to healthcheck handler if the status is starting or OOS.
        return currentStatus;
    }
    return callback.isHealthy() ? InstanceInfo.InstanceStatus.UP : InstanceInfo.InstanceStatus.DOWN;
}
Copy the code

Problem analysis

The phenomenon of

TCP caught

Status =UP&lastDirtyTimestamp=1545039481813 The HeartBeat request and Fetch request are normal

Heap information

The local status is UP. LastDirtyTimestamp is 1545039481813 and lastUpdatedTimestamp is 1545039472888

Instance information in the registry

The status is STARTING, lastDirtyTimestamp is 1545039481813, registrationTimestamp is 1545039481898, lastUpdatedTimestamp is 1545039481899

 <instance>
      <instanceId>xp-xtower-webapp-boot-6-txcxb:xp-xtower-webapp-boot:10100</instanceId>
      <hostName>10.128.41.74</hostName>
      <app>XP-XTOWER-WEBAPP-BOOT</app>
      <ipAddr>10.128.41.74</ipAddr>
      <status>STARTING</status>
      <overriddenstatus>UNKNOWN</overriddenstatus>
      <port enabled="true">10100</port>
      <securePort enabled="false">443</securePort>
      <countryId>1</countryId>
      <dataCenterInfo class="com.netflix.appinfo.InstanceInfo$DefaultDataCenterInfo">
        <name>MyOwn</name>
      </dataCenterInfo>
      <leaseInfo>
        <renewalIntervalInSecs>5</renewalIntervalInSecs>
        <durationInSecs>20</durationInSecs>
        <registrationTimestamp>1545039481898</registrationTimestamp>
        <lastRenewalTimestamp>1545950719063</lastRenewalTimestamp>
        <evictionTimestamp>0</evictionTimestamp>
        <serviceUpTimestamp>0</serviceUpTimestamp>
      </leaseInfo>
      <metadata>
        <forge>1.0.0</forge>
        <management.port>10100</management.port>
        <jmx.port>1099</jmx.port>
        <group>innovation</group>
      </metadata>
      <homePageUrl>http://10.128.41.74:10100/</homePageUrl>
      <statusPageUrl></statusPageUrl>
      <healthCheckUrl>http://10.128.41.74:10100/health</healthCheckUrl>
      <vipAddress>xp-xtower-webapp-boot</vipAddress>
      <secureVipAddress>xp-xtower-webapp-boot</secureVipAddress>
      <isCoordinatingDiscoveryServer>false</isCoordinatingDiscoveryServer>
      <lastUpdatedTimestamp>1545039481899</lastUpdatedTimestamp>
      <lastDirtyTimestamp>1545039481813</lastDirtyTimestamp>
      <actionType>ADDED</actionType>
    </instance>

Copy the code
Scope Status lastDirtyTimestamp lastUpdatedTimestamp registrationTimestamp
Request UP 1545039481813
Local UP 1545039481813 1545039472888
Remote STARTING 1545039481813 1545039481899 1545039481898

Taken together, the problem is that lastDirtyTimestamp was not updated, resulting in the status of the registry not being updated. The time of lastUpdatedTimestamp is 1545039481899, which differs from lastDirtyTimestamp by 86 milliseconds.

The server instanceresourcevalidateDirtyTimestamp () determines the status code 200, 404, or 409 of the response by comparing the locally stored instance with the heartbeat request

reasoning

The state of the instance in the registry is STARTING, which determines that the instance is [passively registered](# 2). Passive registration).

Here are a few points in time:

  • 1545039472888: InstanceInfoThe time at which an object is instantiated because of the local object’s#lastUpdatedTimestampFields are assigned only when they are instantiated and will not be modified thereafter. seeHeap information
  • 1545039481813: the state fromSTARTINGintoUPIs the time when the instance state was last updated. Subsequent heartbeat requests carry the latest state of the instance (UP) and when the status was last updated (1545039481813See you),TCP caught.
  • 1545039481898: The time when the registry receives the registration request for the instance. seeInstance information in the registry
  • 1545039481899: The time when the instance information in the registry is updated. This time is only after the registration time1 millisecondSee you.Instance information in the registry

In summary, passive registration when sending a request, get the old state of the instance STARTING, modify time is really latest 1545039481813. The status of the subsequent heartbeat instance is UP, and the change time is 1545039481813. However, since the last modification time is the same as the last modification time at the time of registration, the registry will not update the status to UP after receiving the heartbeat request, even if the status has changed to UP.

RenewLease () -> InstanceResource#validateDirtyTimestamp(): If lastDirtyTimestamp in the request is the same as that of the currently saved instance, OK is returned without updating the status of the instance saved in the registry.

private Response validateDirtyTimestamp(Long lastDirtyTimestamp,
                                        boolean isReplication) {
    InstanceInfo appInfo = registry.getInstanceByAppAndId(app.getName(), id, false);
    if(appInfo ! =null) {
        if((lastDirtyTimestamp ! =null) && (! lastDirtyTimestamp.equals(appInfo.getLastDirtyTimestamp()))) { Object[] args = {id, appInfo.getLastDirtyTimestamp(), lastDirtyTimestamp, isReplication};if (lastDirtyTimestamp > appInfo.getLastDirtyTimestamp()) {
                logger.debug(
                        "Time to sync, since the last dirty timestamp differs -"
                                + " ReplicationInstance id : {},Registry : {} Incoming: {} Replication: {}",
                        args);
                return Response.status(Status.NOT_FOUND).build();
            } else if (appInfo.getLastDirtyTimestamp() > lastDirtyTimestamp) {
                // In the case of replication, send the current instance info in the registry for the
                // replicating node to sync itself with this one.
                if (isReplication) {
                    logger.debug(
                            "Time to sync, since the last dirty timestamp differs -"
                                    + " ReplicationInstance id : {},Registry : {} Incoming: {} Replication: {}",
                            args);
                    return Response.status(Status.CONFLICT).entity(appInfo).build();
                } else {
                    returnResponse.ok().build(); }}}}return Response.ok().build();
}
Copy the code

Why is this happening?

Two threads trigger registration during application startup

  1. InstanceInfoReplicatorThread:DiscoveryClientIn theApplicationInfoManager.StatusChangeListenerListening for an instance state change, a new thread is created to register the instance with the registry
  2. DiscoveryClient$HeartbeatThreadThread: This thread is inDiscoveryClientDelay after instance initialization (same as heartbeat interval, default is30sIn order to improve the efficiency of instance discovery, zhongtai changed it to5sThe first heartbeat request is sent if the registry returns404The heartbeat thread starts the instance status update line first, and the instance is registered to the registry first.

Above two threads by calling AbstractJerseyEurekaHttpClient $register () method and using EurekaJacksonCodec $InstanceInfoSerializer serialized instance message. In the serialization process, the instance state is recorded first and then the last modification time (lastDirtyTimestamp) of the instance state is recorded. These two operations are not atomic operations.

In the very extreme case (narrowing the heartbeat interval increases the probability of occurrence, but is still extreme), the main thread between two operations (STARTING the heartbeat thread with the instance state) changed the instance state to UP and changed lastDirtyTimestamp, The registration of the InstanceInfoReplicator thread is triggered, and the last modification time of the instance obtained by the heartbeat thread is inconsistent with the STARTING state. The subsequent registration action overrides the state of the instance in the registry: UP -> STARTING.

The latest statuses UP and lastDirtyTimestamp carried by subsequent heartbeat requests do not update the status in the registry.

The solution

In the EurekaJacksonCodec$InstanceInfoSerializer#serialize() method, move the call to #autoMarshalEligible() to jgen.writestartobject (). In this way, lastDirtyTimestamp is obtained before status, which ensures that the status is consistent even if lastDirtyTimestamp is smaller than the real one at the time of registration. LastDirtyTimestamp will be updated in subsequent heartbeat requests.

addInstance-info.getStatus(): UP
addInstance-info.getLastDirtyTimestamp(): 1565164484429

addInstance-info.getStatus(): STARTING
addInstance-info.getLastDirtyTimestamp(): 1565164484415

renew-status-in-registry: UP
renew-lastDirtyTimestamp: 1565164484429
renew-appInfo.getLastDirtyTimestamp(): 1565164484429
Copy the code

The PR has been committed and merged, however it is unknown when a fix will be released for 1.7.x

Reference:

  • GitHub issue
  • PR

The article will be sent to the official account: Cloud Code (wechat id: Sevenfeet).