This article is available at github.com/chengxy-nds…

Hello, I’m Rich

Most developers are probably no strangers to Nacos. Born at Alibaba with a solid reputation, it handles dynamic service discovery and configuration management, and it is a very useful tool. But the more widely a technology is used, the more likely it is to come up in interviews.

Take the topic we are going to discuss today: when Nacos acts as a configuration center, is configuration data pushed by the server or actively pulled by the client?

Let me give the answer up front: the client actively pulls it!

Let’s take a look at the source code of Nacos and see how it is implemented.

Configuration center

Before talking about Nacos, briefly review the origin of the configuration center.

The job of a configuration center is to manage configuration in a unified way, so that after a configuration is modified, applications can pick up the change dynamically without being restarted.

In traditional projects, most configuration is static: the configuration is written in a YML or Properties file inside the application, and changing a configuration item usually requires restarting the application for it to take effect.

However, in some scenarios you need to modify a configuration item while the application is running, for example to toggle a feature on or off in real time, and restarting the application frequently is clearly unacceptable.

This is especially true in a microservices architecture, where application services are split at a very fine granularity, anywhere from dozens to hundreds of services, each with its own unique or shared configuration. If I want to change a shared configuration item, do I really have to change it in hundreds of services one by one? Obviously not. The configuration center was born to solve exactly this kind of problem.

Push and pull model

The client interacts with the configuration center in two ways: push or pull.

The push model

The client establishes a TCP long connection with the server. When configuration data on the server changes, the server immediately pushes the new data to the client over the established connection.

Advantages: the long connection gives real-time delivery; as soon as data changes, it is pushed to the client immediately. For the client this approach is also simpler: it only needs to establish a connection and receive data, without any logic to check whether data has changed.

Disadvantages: a long connection can become unusable due to network problems, commonly known as a half-dead connection: the connection status looks normal, but communication is impossible. A heartbeat (keep-alive) mechanism is therefore required to ensure the connection stays usable and configuration data is pushed successfully.

The pull model

The client actively sends requests to the server to pull configuration data. The common approach is polling, for example requesting configuration data from the server every 3 seconds.

Polling has the advantage of being simple to implement, but the drawbacks are just as obvious: it cannot guarantee real-time data. When should requests be sent? How often? These all have to be considered, and polling also puts considerable pressure on the server.

Long polling

Nacos uses an active client-side pull model and uses Long Polling to retrieve configuration data.

Huh? I had only heard of polling before. What is long polling? And how does it differ from traditional polling (let's call it short polling for comparison)?

Short polling

The client keeps sending requests for the configuration regardless of whether the server-side data has changed, much like the JavaScript that used to poll an order's payment status on a payment page.

The drawback is obvious: since configuration data does not change often, the constant requests put great pressure on the server, and real-time delivery suffers. For example, if the configuration is requested every 10s and an update lands at the 11s mark, the change sits waiting for the next request and arrives 9s late.
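To make that staleness concrete, here is a tiny sketch with my own illustrative numbers and method names (not Nacos code) that computes how late a change arrives under fixed-interval short polling:

```java
public class ShortPollingDelay {
    // Worst-case staleness: a change at changeTimeSec is only seen at the next poll tick.
    static long deliveryDelay(long pollIntervalSec, long changeTimeSec) {
        long nextPoll = ((changeTimeSec / pollIntervalSec) + 1) * pollIntervalSec;
        return nextPoll - changeTimeSec;
    }

    public static void main(String[] args) {
        // Polling every 10s, config updated at t=11s: seen at t=20s, i.e. 9s late.
        System.out.println(deliveryDelay(10, 11)); // prints 9
    }
}
```

The closer a change lands to the start of an interval, the longer it waits, up to a full interval in the worst case.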

To solve the problems of short polling, the long polling scheme emerged.

Long polling

Long polling is not a new technology. It is simply an optimization in which the server controls when it responds to the client's request, in order to reduce invalid requests from the client. For the client, it is not fundamentally different from short polling in how it is used.

After a client initiates a request, the server does not return a result immediately. Instead, it suspends the request and waits for a period of time; if the data on the server changes during this window, the server responds to the request right away.
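The mechanism can be sketched with plain JDK classes. This is a simplified model of the idea, not the Nacos implementation: the "server" holds each request in a CompletableFuture and completes it early when data changes, or with a no-change marker when the hold time expires.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

public class LongPollDemo {
    // Requests currently suspended by the "server", waiting for a change or a timeout.
    private final Queue<CompletableFuture<String>> pending = new ConcurrentLinkedQueue<>();

    // Client side: ask and wait; the server answers within timeoutMs at the latest.
    CompletableFuture<String> poll(long timeoutMs) {
        CompletableFuture<String> f = new CompletableFuture<>();
        pending.add(f);
        f.completeOnTimeout("NO_CHANGE", timeoutMs, TimeUnit.MILLISECONDS);
        return f;
    }

    // Server side: a config change completes every suspended request immediately.
    void publish(String newValue) {
        CompletableFuture<String> f;
        while ((f = pending.poll()) != null) {
            f.complete(newValue);
        }
    }

    public static void main(String[] args) {
        LongPollDemo server = new LongPollDemo();
        CompletableFuture<String> req = server.poll(30_000);
        server.publish("new-config");   // a change arrives while the request is held
        System.out.println(req.join()); // prints "new-config" immediately, no timeout
    }
}
```

The key point is that the response time is under the server's control: a quiet period costs one request per hold window, while a change is delivered almost instantly.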

Getting to know Nacos

I set up a local Nacos instance for the demos that follow. Note one small pitfall: Nacos starts in cluster mode by default, while a local setup is usually standalone, so you have to manually change the startup mode in the startup script startup.X.

Run /bin/startup.X directly; the default username and password are both nacos.

Several concepts

The core concepts of the Nacos configuration center are dataId, group, and namespace; their hierarchical relationship is described below:

DataId: the most basic unit in the configuration center. It is a key-value structure; the key is usually the name of a configuration file, such as application.yml or mybatis.

Configuration formats such as JSON, XML, and YAML are currently supported.

Group: groups dataId configurations. For example, if different branches within the same dev environment need different configuration data, you can isolate them with groups. The default group is DEFAULT_GROUP.

Namespace: a project typically has multiple environments such as dev, test, and prod; namespaces isolate the different environments. By default, all configurations are stored in the public namespace.

Architecture design

The following figure briefly describes the architectural flow of the Nacos configuration center.

Both the client and the console register configuration data with the server by sending HTTP requests, and the server persists the data to MySQL.

To pull configuration data, the client registers listeners on dataIds and initiates long polling requests in batches. If the server changes a configuration item, it responds to the request immediately; if nothing changes, the request is suspended for a period of time until the timeout is reached. To reduce server pressure and keep the configuration center available, the client also saves a snapshot of the pulled configuration in a local file and reads the snapshot preferentially.

I've left out many details here, such as authentication, load balancing, and the high-availability design (which is well worth studying and will be covered in another article), in order to focus on how data flows between client and server.

The analysis below is based on the Nacos 2.0.1 source code. Version 2.0 changed quite a lot, and much of the material online differs slightly from it.

Address: github.com/alibaba/nac…

Client source code analysis

The client source for the Nacos configuration center is in the Nacos-Client project, where the NacosConfigService implementation class is the core entry point for all operations.

One client data structure, cacheMap, is important to remember, because it runs through almost every operation of the Nacos client. cacheMap uses an AtomicReference atomic variable to ensure data consistency in multi-threaded scenarios.

/**
 * groupKey -> cacheData.
 */
private final AtomicReference<Map<String, CacheData>> cacheMap = new AtomicReference<Map<String, CacheData>>(new HashMap<>());


cacheMap is a Map structure. The key is the groupKey, a string joined from dataId, group, and tenant; the value is a CacheData object, and every dataId holds its own CacheData object.
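As a rough illustration of what a groupKey looks like, here is a simplified sketch that joins dataId, group, and tenant into one key. Treat it as an assumption-laden toy: the real Nacos GroupKey utility also escapes special characters such as '+' and '%', which this sketch omits.

```java
public class GroupKeySketch {
    // Simplified: join with '+', appending tenant only when it is present.
    static String groupKey(String dataId, String group, String tenant) {
        StringBuilder sb = new StringBuilder(dataId).append('+').append(group);
        if (tenant != null && !tenant.isEmpty()) {
            sb.append('+').append(tenant);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "application.yml+DEFAULT_GROUP+dev"
        System.out.println(groupKey("application.yml", "DEFAULT_GROUP", "dev"));
    }
}
```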

Access to the configuration

The logic for obtaining configuration data is simple: if the local snapshot file does not exist or is empty, the configuration for the dataId is pulled from the remote server over HTTP and saved to the local snapshot. The request is retried up to three times by default, with a 3s timeout.
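That flow can be sketched as follows. This is an illustrative outline under my own naming (httpGet is a stand-in for the real HTTP call, and the method shape is not the actual client API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfigGetSketch {
    // Illustrative: read the local snapshot if present, otherwise fetch and cache it.
    static String getConfig(Path snapshot, int retries) throws IOException {
        if (Files.exists(snapshot)) {
            String local = Files.readString(snapshot);
            if (!local.isEmpty()) {
                return local;                         // snapshot hit
            }
        }
        IOException last = null;
        for (int i = 0; i < retries; i++) {           // Nacos defaults to 3 tries
            try {
                String remote = httpGet();            // placeholder for the HTTP call
                Files.writeString(snapshot, remote);  // refresh the local snapshot
                return remote;
            } catch (IOException e) {
                last = e;
            }
        }
        throw last;
    }

    static String httpGet() throws IOException {
        return "server-side-content";                 // stubbed for the sketch
    }

    public static void main(String[] args) throws IOException {
        Path snapshot = Files.createTempFile("nacos-snapshot", ".txt");
        Files.writeString(snapshot, "");              // empty snapshot forces a fetch
        System.out.println(getConfig(snapshot, 3));   // prints "server-side-content"
    }
}
```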

getConfig() and getConfigAndSignListener() are both interfaces for fetching configuration, but getConfig() simply sends a plain HTTP request, while getConfigAndSignListener() additionally initiates long polling and registers a data change listener for the dataId via addTenantListenersWithContent().

@Override
public String getConfig(String dataId, String group, long timeoutMs) throws NacosException {
    return getConfigInner(namespace, dataId, group, timeoutMs);
}

@Override
public String getConfigAndSignListener(String dataId, String group, long timeoutMs, Listener listener)
        throws NacosException {
    String content = getConfig(dataId, group, timeoutMs);
    worker.addTenantListenersWithContent(dataId, group, content, Arrays.asList(listener));
    return content;
}

Registering a listener

The client registers to listen by getting the CacheData object corresponding to the dataId from cacheMap.

public void addTenantListenersWithContent(String dataId, String group, String content,
        List<? extends Listener> listeners) throws NacosException {
    group = blank2defaultGroup(group);
    String tenant = agent.getTenant();
    // 1. Obtain the CacheData for this dataId; if absent, send a long polling request to the server for the configuration
    CacheData cache = addCacheDataIfAbsent(dataId, group, tenant);
    synchronized (cache) {
        // 2. Register data change listeners for the dataId
        cache.setContent(content);
        for (Listener listener : listeners) {
            cache.addListener(listener);
        }
        cache.setSyncWithServer(false);
        agent.notifyListenConfig();
    }
}

If the CacheData does not exist yet, a long polling request is sent to the server to obtain the configuration; the default timeout is 30s. The returned configuration is backfilled into the content field of the CacheData object, and an MD5 value is generated from that content. Finally the listener is registered through addListener().

CacheData is another class that shows up very frequently. Besides the basic properties dataId, group, tenant, and content, it has several more important fields, such as listeners and md5 (the MD5 value calculated from the actual configuration content). Listener registration, data comparison, and server data change notification all happen here.

listeners is the set of listeners registered on a dataId. Each ManagerListenerWrap object holds a Listener along with a lastCallMd5 field, which is the key to determining whether server data has changed.

When a listener is added, the current MD5 value of the CacheData object is assigned to the lastCallMd5 property of the ManagerListenerWrap object.

public void addListener(Listener listener) {
    ManagerListenerWrap wrap =
            (listener instanceof AbstractConfigChangeListener) ? new ManagerListenerWrap(listener, md5, content)
                    : new ManagerListenerWrap(listener, md5);
    // Add the wrapper to the listener list (deduplicated)
    listeners.addIfAbsent(wrap);
}
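The md5 values being compared here are just hex digests of the configuration text. Below is a minimal sketch of the general technique using only the JDK; the real client uses its own MD5 utility class, so this shows the idea rather than the actual code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Sketch {
    // Hex MD5 of the configuration content, as used for change comparison.
    static String md5Hex(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        // Same content -> same md5; any edit changes it, which is what triggers listeners.
        System.out.println(md5Hex("timeout=30"));
    }
}
```

Comparing two short digests is far cheaper than comparing (or transmitting) full configuration bodies, which is why both client and server lean on MD5 for change detection.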

Is that all there is to setting up a dataId listener? We noticed that every operation revolves around the CacheData objects in the cacheMap structure, so it is a safe guess that there must be a background task dedicated to this data structure.

Change notification

How does the client sense that the server data has changed?

Starting from the beginning: a ClientWorker is initialized in the constructor of the NacosConfigService class, and the ClientWorker constructor in turn starts a thread pool that polls cacheMap.

The executeConfigListen() method compares the md5 field of each CacheData object in cacheMap with the lastCallMd5 value of its registered listeners. If they differ, the safeNotifyListener() method is triggered to send a data change notification.

void checkListenerMd5() {
    for (ManagerListenerWrap wrap : listeners) {
        if (!md5.equals(wrap.lastCallMd5)) {
            safeNotifyListener(dataId, group, content, type, md5, encryptedDataKey, wrap);
        }
    }
}

The safeNotifyListener() method starts a separate thread and pushes the changed content to every listener registered on the dataId.

On the client, you simply implement the receiveConfigInfo() method to receive the callback data and handle your own business logic.

configService.addListener(dataId, group, new Listener() {
    @Override
    public void receiveConfigInfo(String configInfo) {
        System.out.println("receive:" + configInfo);
    }

    @Override
    public Executor getExecutor() {
        return null;
    }
});

To make this more intuitive, I use a test demo: fetch the server configuration and set up a listener; whenever the server configuration changes, the client listener receives a notification. Let's see the effect.

public static void main(String[] args) throws NacosException, InterruptedException {
    String serverAddr = "localhost";
    String dataId = "test";
    String group = "DEFAULT_GROUP";
    Properties properties = new Properties();
    properties.put("serverAddr", serverAddr);
    ConfigService configService = NacosFactory.createConfigService(properties);
    String content = configService.getConfig(dataId, group, 5000);
    System.out.println(content);
    configService.addListener(dataId, group, new Listener() {
        @Override
        public void receiveConfigInfo(String configInfo) {
            System.out.println("Data change receive:" + configInfo);
        }

        @Override
        public Executor getExecutor() {
            return null;
        }
    });

    boolean isPublishOk = configService.publishConfig(dataId, group, "I am new configuration content ~");
    System.out.println(isPublishOk);

    Thread.sleep(3000);
    content = configService.getConfig(dataId, group, 5000);
    System.out.println(content);
}

The result is as expected: as soon as publishConfig changes the data on the server, the client senses it immediately. With the active pull model, the client achieves quasi-real-time push.

Data change receive:I am new configuration content ~
true
I am new configuration content ~

Server source code analysis

The server-side source of the Nacos configuration center is mainly in the ConfigController class of the nacos-config project. The server logic is a bit more complicated than the client's.

Handling long polling

/v1/cs/configs/listener (doPollingConfig)

The server distinguishes long polling from short polling by the Long-Pulling-Timeout attribute in the request header; here we only care about long polling. Now let's look at how the addLongPollingClient() method in the LongPollingService class (remember this service, it is important) handles a client's long polling request.

The client's default timeout is 30s, but here we find that the server quietly shaves 500ms off the request timeout, making it 29.5s. Why?

According to the official explanation, Nacos responds 500ms early to minimize the chance that the client times out due to network delay, and to allow for the time a request may spend in load balancing. After all, Nacos was originally designed around Alibaba-scale traffic.

Next, the MD5 values of the groupkeys submitted by the client are compared with the server's current MD5 values. If an MD5 differs, the configuration item has changed on the server; the groupkey is added to the changedGroupKeys list and returned to the client right away.

MD5Util.compareMd5(req, rsp, clientMd5Map)

If nothing has changed, the client request is suspended. This starts with creating a Runnable scheduling task named ClientLongPolling, which is submitted to a scheduled thread pool with a 29.5s delay.

ConfigExecutor.executeLongPolling(
                new ClientLongPolling(asyncContext, clientMd5Map, ip, probeRequestSize, timeout, appName, tag));

Each long polling task carries an asyncContext object, which lets each request delay its response until either the delay elapses or the configuration changes; calling asyncContext.complete() completes the response.

AsyncContext is a Servlet 3.0 feature for asynchronous processing: the Servlet thread no longer has to block until business processing finishes before writing the response. The thread and related resources the container allocated to the request can be released first, reducing the load on the system, and the response is deferred until the business operation completes.

When a ClientLongPolling task is submitted to the delayed thread pool, the server also stores it in an allSubs queue, which holds all pending ClientLongPolling request tasks. This is the server-side counterpart of the client registering to listen.

If the client's data has not changed during the delay period, the long polling task is removed from the allSubs queue once the delay elapses and the request is answered, which amounts to cancelling the listening. On receiving the response, the client initiates long polling again, and the cycle repeats.
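The suspend-then-cancel behavior described above can be modeled with a subscriber queue plus a scheduled timeout. This is a simplified model of the interplay between the suspended tasks and the allSubs queue, not the real Nacos classes; the change-side method corresponds to the data change handling discussed further down.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ServerLongPollSketch {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "long-poll-timer");
                t.setDaemon(true); // let the JVM exit in demos
                return t;
            });
    // allSubs equivalent: every suspended request is registered here.
    private final Queue<Sub> allSubs = new ConcurrentLinkedQueue<>();

    static class Sub {
        final String groupKey;
        final CompletableFuture<String> response = new CompletableFuture<>();
        Sub(String groupKey) { this.groupKey = groupKey; }
    }

    // Suspend a request; after timeoutMs with no change, answer "no change" and deregister.
    CompletableFuture<String> addLongPollingClient(String groupKey, long timeoutMs) {
        Sub sub = new Sub(groupKey);
        allSubs.add(sub);
        timer.schedule(() -> {
            if (sub.response.complete("NO_CHANGE")) {
                allSubs.remove(sub); // timed out: cancel listening
            }
        }, timeoutMs, TimeUnit.MILLISECONDS);
        return sub.response;
    }

    // Change side: answer every waiter on the changed key at once and remove them.
    void onDataChange(String groupKey) {
        allSubs.removeIf(sub -> sub.groupKey.equals(groupKey)
                && sub.response.complete("CHANGED:" + groupKey));
    }

    public static void main(String[] args) {
        ServerLongPollSketch server = new ServerLongPollSketch();
        CompletableFuture<String> resp = server.addLongPollingClient("test+DEFAULT_GROUP", 29_500);
        server.onDataChange("test+DEFAULT_GROUP"); // a change arrives during the hold
        System.out.println(resp.join());           // prints "CHANGED:test+DEFAULT_GROUP"
    }
}
```

Either path, timeout or change, completes the response exactly once and removes the task from the queue, which is the essence of what the real server does with its suspended requests.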

At this point we know how the server suspends client long polling requests. Now suppose a user modifies a configuration item through the management console, or the server receives a configuration change request from another client node.

How can a pending task be canceled immediately and the client be notified that the data has changed?

Data changes

Both the management console and the client modify configuration through the publishConfig method in ConfigController.

It is worth noting that publishConfig contains logic that triggers a data change event whenever a dataId's configuration is modified.

ConfigChangePublisher.notifyConfigChange(new ConfigDataChangeEvent(false, dataId, group, tenant, time.getTime()));

A close look at LongPollingService shows that its constructor subscribes to exactly this data change event and executes a data change scheduling task, DataChangeTask, when the event fires.

The main logic of DataChangeTask is to traverse the allSubs queue, which maintains the long polling request tasks of all clients, find the ClientLongPolling tasks whose groupkey matches the changed one, push the new data to those clients, and remove the tasks from the allSubs queue.

Looking at how the client's response is written, we see that asyncContext.complete() is called to end the asynchronous request.

Conclusion

This is just the tip of the iceberg of the Nacos configuration center; many important technical details are not covered here, and I recommend reading the source code yourself. To be honest, I had not paid much attention to today's topic until I was suddenly asked and found myself unsure, so I went straight to the source code, and the memory sticks much deeper that way (knowledge someone else chews and feeds you is never as good as chewing it yourself).

The Nacos source code feels fairly plain to me; the code is not showy and is relatively easy to read. Don't be afraid of reading source code. It's just business code written by someone else.


I'm Rich ~ If this was useful to you, your likes and follows are appreciated. See you next time ~

I have put together hundreds of technical e-books for those who need them. The tech group is almost full; if you want to join, add me as a friend and talk tech with the big shots.

Ebook address

Personal official account: programmer point matter. Comments and exchanges welcome.