Part one: Consistent Hash algorithm

Part two: Cluster clock synchronization issues

Part three: Distributed ID solutions

When table A grows too large, we split it into table A1 + table A2 (each with its own ID column). Each fragment then generates IDs independently, so a scheme is needed that can generate globally unique IDs in a distributed cluster architecture.

Part four: Distributed scheduling problems (distribution of scheduled tasks)

Part five: Session sharing (consistency) issues

Session sharing demo address: http://39.101.136.208/LoginProject/

Distributed and clustered

Distributed is not the same as clustered. A distributed system is necessarily a cluster, but a cluster is not necessarily distributed: a cluster is simply multiple instances working together, whereas a distributed system arises when one system is split into multiple subsystems. A cluster built by replication, which copies instances instead of splitting the system, is therefore not distributed.

The figure above illustrates a single architecture evolving into a distributed one. Each group of user-service instances (as long as the instance count is greater than 1) is a cluster. Distributed means a large system is split into multiple subsystems, each deployed on its own machines; the subsystems call one another and together form one whole system that serves the outside world.

Part 1 Consistent Hash Algorithms

Section 1. Application scenarios of the Hash algorithm in a distributed cluster architecture

Hash algorithms are used in many distributed cluster products, such as Redis, Hadoop, ElasticSearch, MySQL database-and-table splitting, and Nginx load balancing. The main application scenarios reduce to two: request load balancing and distributed storage. Take Nginx's ip_hash policy as a load-balancing example: as long as the client's IP address does not change, ip_hash always routes that client's requests to the same target server, achieving session stickiness and avoiding the session-sharing problem altogether.

Without the ip_hash policy, how would you implement session stickiness? You could maintain a mapping table that ties each client IP address or session ID to a specific target server, e.g. <ip, tomcat1>.

Disadvantages:

1) When there are many clients, the mapping table becomes very large and wastes memory space.

2) When clients or target servers go online or offline, the mapping table must be updated, so its maintenance cost is high.

With a hash algorithm, things get much simpler: compute a hash of the IP address (or session ID), then take that hash modulo the number of servers; the result is the index of the server the current request should be routed to. Requests from the same client IP are then always routed to the same target server, achieving session stickiness.

Distributed storage

Take the distributed in-memory database Redis as an example. Suppose there are three Redis servers in the cluster: redis1, redis2, and redis3. Compute hash(key1) % 3 = index and use the remainder index to pick the specific server node on which the key is stored.
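A minimal sketch of this modulo routing, which applies equally to the ip_hash case above (server names are examples, and Java's built-in hashCode() stands in for whatever hash function a real product uses):

```java
import java.util.List;

// Minimal sketch: route a key (a client IP or a cache key) to one of N servers
// by hashing it and taking the remainder.
public class ModuloRouter {
    private final List<String> servers;

    public ModuloRouter(List<String> servers) {
        this.servers = servers;
    }

    public String route(String key) {
        // Mask off the sign bit so the index is never negative
        int index = (key.hashCode() & 0x7fffffff) % servers.size();
        return servers.get(index);
    }

    public static void main(String[] args) {
        ModuloRouter router = new ModuloRouter(List.of("redis1", "redis2", "redis3"));
        System.out.println(router.route("key1")); // the same key always lands on the same node
    }
}
```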

Section 2. Problems with ordinary Hash algorithms

The ordinary hash algorithm has a problem. Take ip_hash as an example and assume each user's IP address is fixed. If a backend server goes down and the cluster shrinks from 3 servers to 2, the modulus changes, so almost every hash value now maps to a different index; expanding the cluster causes the same upheaval. In real production, with many backend servers and many clients, the impact is severe: a large share of user requests is suddenly routed to other target servers, and the sessions users held on their original servers are lost.
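A quick experiment makes the scale of the problem concrete. The sketch below (hypothetical keys, same modulo routing as above) counts how many keys move when the cluster grows from 3 to 4 nodes; roughly three quarters of them do:

```java
// Count how many of 10,000 keys map to a different node after the
// cluster grows from 3 to 4 servers under plain modulo routing.
public class RemapDemo {
    public static void main(String[] args) {
        int moved = 0, total = 10_000;
        for (int i = 0; i < total; i++) {
            int h = ("key-" + i).hashCode() & 0x7fffffff;
            if (h % 3 != h % 4) {
                moved++;
            }
        }
        System.out.println(moved + " of " + total + " keys were remapped");
    }
}
```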

Section 3. Consistent Hash Algorithms

First, picture a line whose points run from 0 to 2^32-1, each point acting as an address. Bend the line into a circle so the two ends meet; this closed loop is called a hash ring. Hash each server's IP address or host name to place the server at a point on the ring; then hash each client by its IP address, which lands the client on some point of the ring as well. How do we determine which server a client is routed to? Walk clockwise from the client's point and pick the nearest server node. If server 3 goes offline, only the clients originally routed to server 3 are re-routed, clockwise, to the nearest remaining node (server 4); all other clients are unaffected. Only a small fraction of requests migrate, the minimum possible, which is why this algorithm suits distributed clusters so well: it avoids mass request migration.

Hash skew: when the server nodes sit unevenly on the ring, as on the left of the figure above, few client requests fall in the arc between node 1 and node 2 while many fall in the arc between node 2 and node 1, so one machine is overloaded while the other sits mostly idle.

Solution: to fix this data skew, the consistent hash algorithm introduces a virtual node mechanism: compute multiple hashes for each service node and place the node at every resulting position on the ring; these placements are called virtual nodes.

In practice, append a number to the server's IP address or host name. For example, computing three virtual nodes per server means hashing "node1-ip#1", "node1-ip#2", "node1-ip#3", "node2-ip#1", "node2-ip#2", and "node2-ip#3", yielding six virtual nodes. When a client is routed to a virtual node, it is actually routed to the real node behind that virtual node.
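Putting the ring and the virtual nodes together, here is a minimal sketch (the MD5-based hash and the "ip#i" naming are assumptions; production libraries often prefer other hash functions such as MurmurHash):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash sketch with virtual nodes. The sorted map is the ring:
// keys are positions on the 0 .. 2^32-1 circle, values are real server IPs.
public class ConsistentHash {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHash(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String ip) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(ip + "#" + i), ip); // virtual node -> real node
        }
    }

    public void removeServer(String ip) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(ip + "#" + i));
        }
    }

    // Walk clockwise: the first virtual node at or after the key's hash wins;
    // if none exists, wrap around to the first node on the ring.
    public String route(String clientIp) {
        SortedMap<Long, String> tail = ring.tailMap(hash(clientIp));
        Long slot = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(slot);
    }

    private long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Take the first 4 bytes as an unsigned 32-bit value (0 .. 2^32-1)
            return ((long) (d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                 | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```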

Part 2 Cluster clock synchronization problems

Section 1. Problems caused by unsynchronized clocks

Clock here means server time. If the clocks of the servers in a cluster are inconsistent, a series of problems is bound to follow; think of it this way: "a cluster is a team of servers, and if the team members can't agree on what time it is, how could chaos not ensue?" For example, in an e-commerce site, adding a new order inserts a record into the order table, which has a field like "order time". Usually we either read the current system time in the program and insert it into the database, or take the time directly from the database server. But the order subsystem is deployed as a cluster, and the database may itself be sharded across a cluster, and their system clocks may disagree. If one server's clock is a day behind, orders handled by it are stamped with yesterday's time, and our data becomes a mess!

Cluster clock synchronization idea: if all server nodes in the distributed cluster can reach the Internet, each node can run the ntpdate command periodically, say every 10 minutes, to synchronize its clock from an NTP time server.
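A minimal sketch of that idea, assuming ntpdate is installed and the node can reach a public NTP server (the server name and binary path below are examples):

```
# One-off manual synchronization against an NTP time server
ntpdate us.pool.ntp.org

# Crontab entry: resynchronize every 10 minutes
*/10 * * * * /usr/sbin/ntpdate us.pool.ntp.org
```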

Part 3 Distributed ID solutions

1. UUID (usable)
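The appeal of UUID is that it needs no coordination between nodes at all; the downside is a 36-character unordered string, which indexes poorly as a database primary key. A minimal sketch:

```java
import java.util.UUID;

// UUID: globally unique with no coordination, but unordered and 36 chars long.
public class UuidDemo {
    public static void main(String[] args) {
        String id = UUID.randomUUID().toString();
        System.out.println(id); // e.g. "3f0c6a2e-..." (random each run)
    }
}
```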

2. Database auto-increment (an independent MySQL instance)

After table A is split into A1 and A2, each fragment keeps its own auto-increment counter, so both can contain a row with ID = 1, and a query like select * from table A where id = 1 no longer identifies a single record. Instead, we can create a separate MySQL database containing one table whose ID column is auto-increment. Whenever any other part of the system needs a globally unique ID, it inserts a record into this table; the ID auto-increments, and the new value is read back with select last_insert_id();.
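A sketch of that dedicated table and the two statements involved (table and column names are examples):

```sql
-- Dedicated ID table in a separate MySQL database; names are examples.
CREATE TABLE distributed_id (
    id   BIGINT  NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stub CHAR(1) NOT NULL
) ENGINE = InnoDB;

-- Whenever a globally unique ID is needed: insert a row, then read the ID back.
INSERT INTO distributed_id (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
```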

3.SnowFlake

The SnowFlake algorithm is a distributed ID generation strategy open-sourced by Twitter. SnowFlake is an algorithm, not a service: any process can embed it and generate IDs locally. Each generated ID is a Java long, i.e. 8 bytes = 64 bits, laid out as 1 unused sign bit, a 41-bit millisecond timestamp, a 10-bit machine ID, and a 12-bit per-millisecond sequence number. In addition, major Internet companies have wrapped the above schemes into distributed ID generators of their own, such as Didi's TinyID (database-based), Baidu's UidGenerator (SnowFlake-based), and Meituan's Leaf (based on both the database scheme and SnowFlake).
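A minimal SnowFlake sketch following that bit layout (the epoch constant is an arbitrary assumption, and Twitter's original implementation differs in details):

```java
// Minimal SnowFlake sketch: 1 sign bit | 41-bit timestamp | 10-bit machine id | 12-bit sequence.
public class SnowFlake {
    private static final long EPOCH = 1672531200000L; // custom epoch (2023-01-01), an assumption
    private static final long MACHINE_BITS  = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE  = ~(-1L << SEQUENCE_BITS); // 4095

    private final long machineId; // 0 .. 1023, must be unique per node
    private long sequence = 0L;
    private long lastTimestamp = -1L;

    public SnowFlake(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // This is exactly the clock-synchronization problem from Part 2
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastTimestamp) {
            // Same millisecond: bump the sequence; if exhausted, wait for the next ms
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}
```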

4. Use the Redis INCR command to obtain a globally unique ID (recommended):

The Redis INCR command increments the number stored at a key by one. If the key does not exist, its value is initialized to 0 before the INCR is performed, so the first call returns 1. Because Redis executes commands on a single thread, INCR is atomic and concurrent callers never receive the same value.
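A minimal sketch using the Jedis client (host, port, and key name are assumptions):

```java
import redis.clients.jedis.Jedis;

// Obtain a globally unique, monotonically increasing ID via Redis INCR.
public class RedisIdGenerator {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            long id = jedis.incr("order:id"); // creates the key at 0 if absent, then returns 1
            System.out.println("next id = " + id);
        }
    }
}
```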

Part 4 Distributed scheduling problems

Scheduling -> scheduled tasks; distributed scheduling -> scheduled tasks in a distributed cluster

Section 1. Scenarios for scheduled tasks

Scheduled tasks are executed at fixed intervals or at specific times. For example:

1. Order review and delivery

2. Automatically cancel the order and refund the payment when the order times out

3. Synchronizing, generating and issuing gift certificates

4. Logistics information push, grab operations, return and exchange operations

5. Perform data backlog monitoring, log monitoring, and service availability detection

6. Periodically back up data

7. Regular settlement of the financial system every day

8. Data archiving and clearing

9. Report and offline data analysis

Section 2. What is distributed task scheduling?

It has two meanings:

1) Scheduling tasks that run in a distributed cluster environment (when multiple copies of the same scheduled-task program are deployed, only one of them should actually execute the task)

2) Distributed scheduling of scheduled tasks -> task sharding, that is, splitting one large task into several small tasks that execute simultaneously

Section 3. Differences between scheduled tasks and message queues

What they have in common:

Asynchronous processing: for example, registration and order-placement events

Application decoupling: both scheduled tasks and MQ can act as a gear between two applications, decoupling them; the gear transfers the data. (A single standalone service, of course, has no need for this.)

Traffic peak clipping: on Double 11, both scheduled tasks and MQ can be used to absorb the traffic. The backend either processes orders in timed batches according to its capacity, or grabs orders from MQ, where each arriving order event triggers processing. Either way, frontend users simply see that their order succeeded, and no orders are affected.

Different in nature:

Scheduled task jobs are time-driven, while MQ is event-driven. Time-driven processing is irreplaceable: the daily interest settlement of a financial system, for example, is not triggered by an "interest arrived" event but is usually computed in batch by a scheduled task. As a result, scheduled tasks tend to process in batches, while MQ tends to process item by item.

Section 4. Implementation of scheduled tasks

Scheduled tasks can be implemented in various ways. In the early days, before scheduled-task frameworks existed, we used the JDK's Timer mechanism or a Runnable with thread sleep to run certain programs at a given time or interval. Then came scheduled-task frameworks, such as the well-known Quartz task scheduling framework, which uses cron time expressions (covering seconds, minutes, hours, day, month, week, and year) to configure when a task should be executed:
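For reference, a minimal Quartz sketch that fires a job every 10 seconds via a cron expression (job and trigger names are examples, and the quartz dependency must be on the classpath):

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzDemo {
    // The job body Quartz will instantiate and run on each firing
    public static class HelloJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("job fired at " + new java.util.Date());
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(HelloJob.class)
                .withIdentity("helloJob", "demo").build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("helloTrigger", "demo")
                // Quartz cron fields: seconds minutes hours day-of-month month day-of-week
                .withSchedule(CronScheduleBuilder.cronSchedule("0/10 * * * * ?"))
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```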

Section 5. The distributed scheduling framework Elastic-Job

Elastic-Job is an open-source distributed scheduling solution from Dangdang, developed on top of Quartz. It consists of two independent subprojects, Elastic-Job-Lite and Elastic-Job-Cloud. Elastic-Job-Lite is positioned as a lightweight, decentralized solution that provides distributed task coordination services as a plain jar package, while Elastic-Job-Cloud must be combined with Mesos and Docker for use in the cloud. Elastic-Job's GitHub address is github.com/elasticjob

Main functions:

Distributed scheduling coordination: in a distributed environment, tasks execute according to the configured scheduling policy, and repeated execution by multiple instances of the same task is prevented.

Rich scheduling policies: scheduled tasks are executed based on the mature Quartz cron expression.

Elastic scaling: when an instance joins the cluster, it can be elected to take on tasks; when an instance leaves the cluster, the tasks it was running are transferred to other instances.

Failover: if a task fails on one instance, it is transferred to another instance for execution.

Missed-execution job re-firing: if a job run is missed for some reason, the missed job is recorded automatically and triggered after the previous job completes.

Parallel scheduling with task sharding: task sharding means splitting one task into multiple small tasks that execute on multiple instances simultaneously.

Job shard consistency: once a task is sharded, each shard has exactly one execution instance in the distributed environment.

5.2 Elastic-Job-Lite application

Jar package (API) + installed Zookeeper (ZK) software

Zookeeper (version 3.4.6 or later) must be installed. We will not elaborate on Zookeeper here (it is studied in depth in Phase 3), but we do need to understand its two essential functions: storage + notification.

1. Install Zookeeper (standalone configuration) [storage + notification]: storage means znodes, and notification means clients listening on those nodes

1) We use version 3.4.10: on the Linux machine, decompress the downloaded zookeeper-3.4.10.tar.gz

2) Go to the conf directory and copy the sample configuration: cp zoo_sample.cfg zoo.cfg

3) Go to the bin directory and operate the ZK service:

Start: ./zkServer.sh start

Stop: ./zkServer.sh stop

Check status: ./zkServer.sh status

(Figure: the tree node structure of Zookeeper)

2. Import the jar package

3. Scheduled task instance
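A minimal Elastic-Job-Lite sketch of such a scheduled task instance, assuming the 2.x (com.dangdang) jar imported above and the local Zookeeper just installed (address, namespace, job name, and cron expression are examples):

```java
import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;
import com.dangdang.ddframe.job.config.JobCoreConfiguration;
import com.dangdang.ddframe.job.config.simple.SimpleJobConfiguration;
import com.dangdang.ddframe.job.lite.api.JobScheduler;
import com.dangdang.ddframe.job.lite.config.LiteJobConfiguration;
import com.dangdang.ddframe.job.reg.zookeeper.ZookeeperConfiguration;
import com.dangdang.ddframe.job.reg.zookeeper.ZookeeperRegistryCenter;

public class ElasticJobDemo {

    // The task body; Elastic-Job calls execute() per shard on the elected instance
    public static class MyJob implements SimpleJob {
        @Override
        public void execute(ShardingContext context) {
            // With shardingTotalCount = 2, each instance receives a different item (0 or 1)
            System.out.println("shard " + context.getShardingItem() + " running");
        }
    }

    public static void main(String[] args) {
        // Registry center: the Zookeeper installed above (address/namespace are examples)
        ZookeeperRegistryCenter regCenter = new ZookeeperRegistryCenter(
                new ZookeeperConfiguration("localhost:2181", "elastic-job-demo"));
        regCenter.init();

        // Job definition: name, cron (every 10 seconds), and shard count
        JobCoreConfiguration core = JobCoreConfiguration
                .newBuilder("myJob", "0/10 * * * * ?", 2).build();
        LiteJobConfiguration liteConfig = LiteJobConfiguration
                .newBuilder(new SimpleJobConfiguration(core, MyJob.class.getCanonicalName()))
                .build();

        new JobScheduler(regCenter, liteConfig).init();
    }
}
```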

Part 5 Session sharing problems

Session sharing is also known as session persistence or session consistency

SpringSession

The core annotation is @EnableRedisHttpSession. It imports RedisHttpSessionConfiguration, which in turn extends SpringHttpSessionConfiguration. When the container starts, this configuration creates a filter, SessionRepositoryFilter, whose doFilterInternal method wraps the native request and response in wrapper objects.

When request.getSession() is called for the first time, execution enters the getSession method of SessionRepositoryFilter's request wrapper: no existing session is found (null), so a new session is created and its session ID is saved to Redis; this path calls sessionRepository.save(session);
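A minimal Spring Session configuration sketch, assuming spring-session-data-redis is on the classpath and a Redis instance runs on localhost (both assumptions):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

// @EnableRedisHttpSession registers the SessionRepositoryFilter described above,
// swapping Tomcat's HttpSession for one backed by Redis, so every node in the
// cluster reads and writes the same shared session.
@Configuration
@EnableRedisHttpSession
public class SessionConfig {

    @Bean
    public LettuceConnectionFactory connectionFactory() {
        // Defaults to localhost:6379; an assumption for this sketch
        return new LettuceConnectionFactory();
    }
}
```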