Author | Dong Peng, Alibaba Technical Expert

Microservices

Benefits: decoupling across teams; higher concurrency (a single machine can currently only reach about C10K); no code copying, since basic services can be shared; better support for service governance; and a better fit with cloud computing platforms.

RPC

  • RPC: call remote functions as if they were local methods.
  • Client: usually a dynamic proxy generates an implementation class for the interface. This implementation serializes the interface name, method name, and parameters, sends them over the network, and then performs either a synchronous or an asynchronous call; an asynchronous call requires a callback function to be set.

The client also needs to handle load balancing, timeouts, and connection pool management. The connection pool maintains connections to multiple servers for load balancing and removes a connection when its server goes down. A request context maps each request ID to its callback function. When the reply to a timed-out request finally arrives, it is discarded because its request context can no longer be found.

  • Server: maintains connections; when a request arrives over the network, the server deserializes the interface name, method name, and parameters, invokes the method via reflection, and sends the result back to the client.
  • Serialization: one approach serializes only the field values and rebuilds the object, setting the values during deserialization; the other serializes the entire object structure directly to binary.

The former saves space, while the latter deserializes faster. Current serialization frameworks make a trade-off between deserialization time and space, somewhat like Huffman coding, or like how a database stores rows of data.
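The two approaches can be illustrated with a minimal sketch. JSON of the field values stands in for value-only serialization (the receiver must know the class and rebuild the object), and pickle stands in for serializing the whole object structure; the `User` class and field names are invented for illustration, and actual size/speed trade-offs depend on the framework used.

```python
import json
import pickle


class User:
    """Simple object to serialize (hypothetical example type)."""

    def __init__(self, uid, name, score):
        self.uid = uid
        self.name = name
        self.score = score


u = User(42, "alice", 99.5)

# Approach 1: serialize only the field values; the receiver rebuilds the
# object and sets each value during deserialization.
payload = json.dumps(u.__dict__).encode()
restored = User(**json.loads(payload))

# Approach 2: serialize the entire object structure directly to binary;
# deserialization reconstructs the object in one step.
blob = pickle.dumps(u)
restored2 = pickle.loads(blob)
```

Either way the round trip restores the same values; what differs is whether the wire format carries only values or the full object structure.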

The registry

There are generally three modes:

  • F5 as a centralized proxy;
  • A client-side embedded proxy, as in Dubbo;
  • A combination of the two, in which multiple clients share a proxy deployed as an independent process on the same physical machine as the client. This is Service Mesh.

Why ZooKeeper is not a good registry: ZooKeeper sacrifices availability for consistency, but a registry does not really require strong consistency. The consequence of an inconsistency is that a service goes offline and the client does not know about it, but the client can simply retry other nodes.

Additionally, under a network partition, if more than half of the nodes become unreachable, the remaining nodes stop serving, even though the nodes inside each machine room could still provide registration service for that room. For example, with three rooms holding 2, 2, and 1 nodes, if the network between the rooms breaks while each room's internal network is fine, no room holds a majority, so the registry becomes unavailable everywhere even though the nodes themselves are healthy.

ZooKeeper does not provide strict consistency either. It supports read/write separation: when a non-leader node receives a write request, it forwards the request to the leader, while every node serves read requests (which may therefore return stale data).

Configuration center

Configuration center requirements: high availability, real-time notification, grayscale publishing, permission control, one-click rollback, and environment isolation (development/test/production). Current open-source implementations include Nacos, Disconf, and Apollo.

  • Disconf's scan module scans annotations and registers listeners;
  • The store module saves the remotely fetched configuration locally; a local job checks whether the configuration has changed and notifies the listeners if so;
  • The fetch module retrieves the configuration remotely via HTTP;
  • The watch module monitors node changes on ZooKeeper and calls fetch when a change occurs.

Apollo has the following four modules:

– Portal: the management console, the entry point for administrators to perform operations. It has its own database.

– AdminService: provides the underlying service for configuration modification and publishing; AdminService and ConfigService share the same ConfigDB database. Each configuration change inserts a ReleaseMessage record into the database.

– ConfigService: a scheduled task scans the database for new ReleaseMessages and notifies clients when there are any; the client polls ConfigService on a schedule to check for new messages, and DeferredResult is used to respond asynchronously.

– Eureka: provides registration and discovery for AdminService and ConfigService. The client also writes the configuration to disk after obtaining it.

Task scheduling

  • The executor is the application itself, and the task unit is the thread that actually runs the task. The executor can register itself with the scheduler at startup and update its registration, for example removing tasks that have been deleted.
  • The scheduling center supports cluster deployment to avoid a single point of failure; a master node can be elected with the others as slaves.
  • The load balancing algorithm selects an executor for each task at random, supports retry after failure, and removes executors that run slowly or have disconnected.
  • Task concurrency is controlled, for example whether a task may be scheduled again before the previous run completes.
  • Task dependencies are supported, for example one task cannot run before another has finished, or can be triggered automatically when another finishes.
  • Task sharding is supported: a task is split by parameters across different executors that run it together.
  • A task can be cancelled.
  • A Glue mode is supported, so a task unit can be executed without a release.

Distributed locks

  • Redis: SETNX (or SET with the NX option) supports distributed locks, but it is best to store the lock's owner in the value and check it on release; otherwise the wrong lock may be released, i.e. A may release B's lock:

SET resource_name my_random_value NX PX 30000

  • ZooKeeper: create an ephemeral node; threads that fail to create it listen for the lock's status.
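The ownership check on release can be sketched as follows. This is a minimal in-memory stand-in so the example is self-contained; against a real Redis you would use SET ... NX PX for acquisition and an atomic Lua check-and-delete script for release, since a plain GET followed by DEL is not atomic.

```python
import uuid

# In-memory stand-in for a Redis instance (illustration only).
store = {}


def acquire(resource, owner):
    """Like SET resource owner NX: succeeds only if the key is absent."""
    if resource in store:
        return False
    store[resource] = owner
    return True


def release(resource, owner):
    """Delete the lock only if we still own it, so A cannot release B's lock."""
    if store.get(resource) == owner:
        del store[resource]
        return True
    return False


a, b = str(uuid.uuid4()), str(uuid.uuid4())
assert acquire("resource_name", a)      # A takes the lock
assert not acquire("resource_name", b)  # B fails while A holds it
assert not release("resource_name", b)  # B cannot release A's lock
assert release("resource_name", a)      # A releases its own lock
```

Storing a random value per owner is what prevents the "A releases B's lock" bug described above, for example when A's lock expired and B acquired it in the meantime.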

Unified monitoring

  • Collect and analyze logs; logs can be associated with RPC traces, and de-noised or compressed for storage;
  • Provide both an API mode and an interceptor mode; a JavaAgent can be used for zero-instrumentation collection;
  • Implement OpenTracing-style trace links;
  • A producer-consumer model based on the Disruptor ring buffer;
  • Mass data storage in Elasticsearch;
  • Report generation and monitoring-metric configuration;
  • Each node collects metrics and uploads them to the server for unified processing;
  • Monitored indicators: RPC traces, databases, CPU metrics, HTTP status, and various middleware;
  • Logs can be collected by adding interceptors directly in the logging framework, or with Flink + Kafka.

The cache

Clear the cache first or update the database first?

  • If the cache is updated rather than deleted: either order leaves the cache and database inconsistent in some failure case;
  • If the cache is deleted: delete the cache first, then update the database. If the database update fails, the impact is small, since the cache was cleared and will simply be reloaded. But cache penetration must be considered: what if heavy traffic arrives at that moment and overwhelms the database?

The above considers the case where one operation succeeds and the other fails in a distributed setting, but that probability is small after all. Delete-then-update suits scenarios with modest concurrency but high data-consistency requirements; if concurrency is very high, it is better to update the database first and then clear the cache.

If the cache is cleared first and the database updated afterwards, another transaction that queries before the database update will miss the cache, read the old value from the database, and write it back to the cache; the first transaction then updates the database, leaving the cache and database inconsistent. If the database is updated first and the cache cleared afterwards, the cache is stale in the window between the update and the clear, but becomes consistent once the cache is cleared.

Update-then-clear can still produce a permanently stale cache, though the probability is very small: a read request misses the cache (perhaps because another thread has just cleared it) and reads a value from the database, but before it writes that value to the cache, another thread updates the database to a new value; the read request then writes its now-stale value into the cache as dirty data.
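The update-database-then-clear-cache pattern (cache-aside) can be sketched in a few lines. The `db` and `cache` dicts here are stand-ins for a real database and a real cache such as Redis, and the function names are invented for illustration.

```python
# Minimal cache-aside sketch; db and cache are plain dicts standing in for
# a real database and cache.
db = {"item": 1}
cache = {}


def read(key):
    """Read path: try the cache first, fall back to the DB and fill the cache."""
    if key in cache:
        return cache[key]
    value = db[key]
    cache[key] = value
    return value


def write(key, value):
    """Write path: update the database first, then clear the cache entry."""
    db[key] = value
    cache.pop(key, None)


assert read("item") == 1   # miss: loaded from db, then cached
write("item", 2)           # db updated, cache entry invalidated
assert read("item") == 2   # next read reloads the new value
```

Deleting rather than updating the cache entry on write is what keeps the stale window short: the next read repopulates the cache from the freshly updated database.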

Redis uses a single-threaded model, which works well when it only performs I/O operations, but it also provides computing features such as sorting and aggregation, and while such a computation runs on the CPU it blocks all I/O operations.

Memcached requests a block of memory up front and splits it into chunks of varying sizes to store key-value pairs. This approach is efficient but may waste space. Redis simply wraps malloc and free.

Redis provides two ways to persist data: one writes a snapshot of all data to disk at given times, and the other logs writes incrementally.

Memcached provides CAS to ensure data consistency; Redis provides transactions that queue a series of commands and execute them together.

Memcached can only be clustered via consistent hashing on the client side, whereas Redis provides built-in clustering. The client performs routing to select the master node, and a master can have multiple slaves serving as replicas and read nodes.

Redis strings extend the C string structure with extra fields recording the allocated and used space. Because the size of the char array is already known, the length can be returned directly, avoiding a traversal; and when a string grows or shrinks, reallocation can often be avoided by using the free space, i.e. Redis pre-allocates extra space.

In addition, Redis uses two tables to store a hash, mainly for expansion, i.e. rehashing, so the two can be swapped when the table grows. Redis rehashes progressively: each operation touches both hash tables, and new entries are only added to the new table. The set data structure can be used to store, say, a total like count, while zset is an ordered collection backed by a skip list to speed up queries.

  • How to prevent cache avalanche: make the cache highly available, use multi-level caches, and stagger expiration times so that many keys do not expire at once;
  • How to prevent cache penetration: cache null results for keys that do not exist, or screen requests with a Bloom filter.

The message queue

How can the order of messages be guaranteed?

For strict ordering, only one producer may send to one broker, and then only one queue may feed one consumer, but this model has many drawbacks: a single exception blocks the whole pipeline. RocketMQ pushes this problem up to the application layer, meaning the sender chooses which queue to send to; for example, messages for the same order are sent to the same queue. But this scheme also has problems when one of the queues fails.
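The sender-side queue selection described above can be sketched as a deterministic hash of the ordering key, so all messages of one order land on the same queue. The function name is invented for illustration; real RocketMQ producers do this via a MessageQueueSelector.

```python
import zlib

def select_queue(order_id: str, queue_count: int) -> int:
    """Pick a queue index deterministically from the order ID (CRC32 mod N),
    so messages of the same order keep their relative order in one queue."""
    return zlib.crc32(order_id.encode()) % queue_count


# All messages of one order map to the same queue index.
q = select_queue("order-42", 4)
assert select_queue("order-42", 4) == q
assert 0 <= q < 4
```

This also shows the weakness mentioned above: if the queue that `order-42` hashes to goes down, messages for that order cannot simply be rerouted without breaking ordering.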

How do I ensure that messages are not processed repeatedly?

As long as transmission goes over the network, duplicates are possible, so the application layer must be idempotent, or keep a de-duplication table storing the ID of every processed message.
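The de-duplication table can be sketched as follows. The in-memory set stands in for a database table keyed by message ID, and the function names are invented for illustration; a real system would insert the ID and the business update in one transaction.

```python
# Sketch of a de-duplication table for an idempotent consumer.
processed = set()


def handle(message_id, payload, results):
    """Process a message exactly once, no matter how often it is redelivered."""
    if message_id in processed:
        return False            # duplicate delivery: skip
    processed.add(message_id)
    results.append(payload)     # stand-in for the real business logic
    return True


out = []
assert handle("msg-1", "pay order", out)
assert not handle("msg-1", "pay order", out)  # redelivered, ignored
assert out == ["pay order"]
```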

Message sending process

  • The routing information for the topic is retrieved (returned by the NameServer and cached on the client side: which brokers the topic maps to and how many queues each broker has);
  • If nothing is found, the topic may not exist and needs to be created automatically: the client asks the NameServer, the NameServer requests the broker, and the broker returns after creating it;
  • A queue is obtained according to the routing policy (one queue is selected from all queues, and the broker behind it is checked for health before it is returned). This step can make brokers highly available;
  • So the broker and queue a message goes to are determined at send time on the client, not after the commitlog is written; this is what allows a fixed queue to be specified;
  • When the message is sent, a request is constructed containing the message body, queue information, topic information, and so on, and a message ID is added to the message body;
  • If a message still fails after multiple retries, it enters a dead-letter queue, which is a fixed topic.

Message storage

Each commitlog file is 1 GB, so the second file's starting offset is 1 GB in bytes, and a file can be located from a global offset by dividing by 1 GB. The commitlog files are maintained in a file queue; each write goes to the last file in the queue, which therefore needs to be locked.
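The offset-to-file mapping described above is simple arithmetic; this sketch assumes the fixed 1 GB file size from the text.

```python
# Locate a commitlog file and the position inside it from a global offset,
# assuming fixed 1 GB files.
FILE_SIZE = 1 << 30  # 1 GB


def locate(offset: int):
    """Return (file index, position inside that file) for a global offset."""
    return offset // FILE_SIZE, offset % FILE_SIZE


assert locate(0) == (0, 0)
assert locate(FILE_SIZE) == (1, 0)       # the second file starts at 1 GB
assert locate(FILE_SIZE + 5) == (1, 5)
```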

After a file is created, it is warmed up: during warm-up, one zero byte is written into every 4 KB memory page, so that the system caches the pages and later page faults are avoided. The mechanism of mmap is that only a virtual address is recorded; the physical memory is populated on a page fault.

There are two ways to create a file:

  • FileChannel.map, which returns a MappedByteBuffer;
  • Alternatively, use an off-heap memory pool and flush explicitly.

Consumption of messages

A queue can only be consumed by one client.

When there are multiple queues but only one client, that client must consume all of the queues; when there is only one queue, only one client can receive messages. So the number of clients usually needs to match the number of queues. The client generally keeps the consumption position of each queue: because a queue is consumed by only one client, that client records, per queue and broker, the offset its group has consumed.

A traditional read copies data from the file system into the operating system's kernel cache, and then copies it again into user-space memory. Mmap instead maps a file (or a segment of it) into virtual memory without reading any data; when the user accesses the virtual address, a page fault is triggered, and the data is read from the underlying file system directly into user-space memory.

MappedByteBuffer, obtained through FileChannel's map method, wraps such a virtual address; internally it uses this address, together with Unsafe, to retrieve bytes.

When a page fault is triggered, the operating system reads the data from the file system and loads it into memory, typically prefetching 4 KB; the next access to that data does not fault because it is already in memory. We can warm up files mapped by MappedByteBuffer, for example by writing one byte per page of the page cache, so that no page faults occur when the data is actually written.
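The warm-up trick can be demonstrated with Python's mmap module (the text describes it for Java's MappedByteBuffer, but the mechanism is the same): touch one byte in every 4 KB page so each page is faulted in before real writes happen. The 4 KB page size and 1 MB file size are assumptions for the sketch.

```python
import mmap
import os
import tempfile

PAGE = 4096  # assumed 4 KB page size

# Create a 1 MB file and map it into memory.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 1 << 20)
with mmap.mmap(fd, 0) as mm:
    assert len(mm) == 1 << 20
    # Warm-up: write one zero byte into every page so the pages are
    # faulted in (and cached) up front.
    for pos in range(0, len(mm), PAGE):
        mm[pos] = 0
    assert mm[PAGE] == 0
os.close(fd)
os.remove(path)
```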

Database and table sharding

There are three general approaches: a MyBatis interceptor at the DAO/ORM layer; interception at the JDBC layer by rewriting and enhancing the JDBC interfaces; or a database proxy.

The JDBC-proxy approach implements DataSource, Connection, and PreparedStatement; Druid parses the SQL and generates the execution plan, and result sets are merged with a ResultSet merger (GROUP BY, ORDER BY, MAX, SUM).

Partitioning strategy: generally hashing. Make sure the database-sharding and table-sharding algorithms are uncorrelated, otherwise data will be distributed unevenly.
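A hash-based routing function can be sketched as follows. The shard counts and function name are assumptions for illustration; the point is that the database index and table index are derived from independent parts of the hash, so the two dimensions stay uncorrelated.

```python
import zlib

DB_COUNT, TABLE_COUNT = 4, 8  # assumed shard counts


def route(user_id: str):
    """Route a key to (database index, table index) using independent
    digits of one hash, so the two dimensions are uncorrelated."""
    h = zlib.crc32(user_id.encode())
    db = h % DB_COUNT
    table = (h // DB_COUNT) % TABLE_COUNT  # independent of the db index
    return db, table


db, table = route("user-1001")
assert 0 <= db < DB_COUNT and 0 <= table < TABLE_COUNT
assert route("user-1001") == (db, table)  # deterministic routing
```

If the table index were computed as `h % TABLE_COUNT` with correlated moduli, some (db, table) combinations would never be hit, which is the uneven distribution the text warns about.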

During data expansion, the configuration center can dynamically switch the write policy: first read the old table while writing to both the new and old tables; after data migration is complete, read the new table while still double-writing; and finally read and write only the new table.

Unique IDs

A database auto-increment ID has single-machine limits; internally it also uses a lock, but the lock is released at the end of the SQL statement, even if the transaction has not yet committed.

A variant of the Snowflake algorithm: a 15-digit timestamp, 4-digit auto-increment sequence, 2 digits distinguishing the order type, 7-digit machine ID, 2-digit database-shard suffix, and 2-digit table-shard suffix, 32 digits in total.

Auto-increment IDs can also be obtained using ZooKeeper's sequential nodes.

Distributed transaction

Two-phase commit: a transaction manager and resource managers; phase one prepares, phase two commits (the XA solution is business-neutral and supported by database vendors, but its performance is poor).

Compensation (TCC)

TCC has two phases: the first phase tries to lock the resource, and the second phase confirms or rolls back.

Design specification:

  • The business operation is split into two parts. For a transfer, the try phase freezes the balance, the confirm phase deducts from the frozen balance, and the rollback unfreezes it;
  • The transaction coordinator records the primary transaction log and the branch transaction logs, and supports compensation or reverse compensation to ensure eventual consistency after an exception in any step;
  • Concurrency control: reduce lock granularity to improve concurrency, and avoid exclusive locks between transactions. For example, because a transfer from a hot account only freezes an amount in the try phase, subsequent balance deductions by different transactions do not affect one another;
  • Allow empty rollback: the phase-one try may time out and phase two may initiate a rollback. The rollback must check whether phase one ever executed; if no try request was received, the rollback succeeds as a no-op;
  • Prevent suspension of phase-one operations: phase one may time out, phase two rolls back, and then the delayed try request finally arrives; at that point the try must be rejected;
  • Idempotence control, since the operations of both phases may be executed multiple times. It is also better for the operation interface to provide a status-query interface so a background compensation task can run correctly.

Framework transactions (Seata)

  • Phase one: the framework intercepts the business SQL, generates an undo log from the row image before the statement executes and a redo log from the image after, and creates a row lock from the table name and primary key;
  • Phase two: if the transaction commits, the undo log, redo log, and lock are deleted; if it rolls back, the undo-log SQL is executed to remove the intermediate data, with the redo log used for comparison. If a dirty write has occurred, the data can only be repaired manually (phase-two cleanup can run asynchronously).

When a transaction starts, a global transaction ID is requested from the TC. This ID is propagated to callees through an RPC-framework interceptor and stored in a ThreadLocal; the callee checks whether it is inside a global transaction when executing SQL.

The default isolation level is read-uncommitted, because in phase one the local transaction has committed while the global transaction has not finished; it may be rolled back later, and other transactions can see that state. Read-committed is provided via SELECT ... FOR UPDATE: on conflict, the statement waits until the lock is released.

  • TM asks TC to begin a global transaction, generating a globally unique XID;
  • The XID is propagated along the microservice call chain;
  • RM registers branch transactions with TC;
  • TM asks TC for a global commit or rollback decision;
  • TC sends the commit or rollback request to each RM.

Consistent (transactional) message queue: first send a half message; if that succeeds, execute the local transaction. If the local transaction succeeds, commit the half message; if it rolls back, discard the half message. If the message queue receives no confirmation or rollback for a long time, it checks back against the local transaction's state. The consumer executes the business logic on receiving the message, retries on failure, and acknowledges successful consumption when done.

MYCAT

CAP

  • C: Consistency
  • A: Availability
  • P: partition tolerance

A standalone MySQL is C; synchronous primary/secondary replication is CP; asynchronous primary/secondary replication is AP.

ZooKeeper chose P, but implements neither full C nor full A, choosing eventual consistency instead. Reads can be served by multiple nodes, but only one node accepts write requests; writes received by other nodes are forwarded to the leader and committed once more than half the nodes acknowledge success.

If a client is connected to a follower that has not yet applied a commit, the data read from that node is stale, so C is not fully implemented. Since a commit requires more than half the nodes to acknowledge, ZooKeeper becomes unavailable if more than half fail or do not respond, so A is not fully implemented either.

Of course, whether a system is CP or AP can be judged by whether it sacrifices more A or more C, and ZK in fact sacrifices A to satisfy C: when more than half of the nodes in the cluster go down, the system becomes unavailable, which is why ZK is not recommended as a registry.

CAP theory says that in a distributed environment, consistency, availability, and partition tolerance cannot all be satisfied at once. It does not force a strict "pick two of three": since network partitions are unavoidable in a distributed environment, the pursuit of high availability often means sacrificing strong consistency for weak or eventual consistency. That is the famous BASE theory, and BASE is really a counterpart to the ACID of traditional relational databases.

However, ACID was proposed for a single node. In a distributed environment, coordinating data consistency means making choices about isolation; even a single-node relational database defines isolation levels to improve performance, i.e. availability, breaking the strong consistency C in ACID. Of course, databases serve businesses, and some, perhaps most, businesses do not require strong consistency.

Seckill (flash sale) handling

  • Static/dynamic separation: Ajax avoids full page refreshes; caching; CDN;
  • Hotspot data discovery: flexible business processes can isolate hotspot services, and link monitoring can surface hot data over a time window;
  • Isolation: business isolation, database isolation;
  • Last-ditch measures: service degradation, rate limiting;
  • Traffic peak shaving: queueing, filtering invalid requests, quizzes or captchas, message queues;
  • Inventory deduction: deduct when the order is placed (must roll back if the user does not pay); deduct at payment time (may end with insufficient stock and require a refund); or reserve stock for a period after the order is placed and roll back afterwards.

Normal e-commerce uses the third method and flash sales use the first. Oversell control need not live in the application layer: a WHERE clause at the SQL layer can enforce it. But for the same row, i.e. decrementing the stock of the same product, MySQL will contend for the row lock under high concurrency, which lowers the database's TPS (deadlock detection traverses all connections waiting for locks, which is very CPU-intensive) and affects the sale of other goods. We can therefore queue the requests in the application layer, directly discarding the excess when the quota is small. Another option is to queue the requests in the database layer, which requires a MySQL patch.
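The SQL-layer oversell guard can be demonstrated with SQLite standing in for MySQL (the table and function names are invented for illustration): the WHERE clause ensures stock never goes negative, because a decrement that finds no qualifying row simply updates nothing.

```python
import sqlite3

# In-memory database standing in for the product table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 2)")  # only 2 units in stock


def buy(item_id: int) -> bool:
    """Attempt one purchase; fails once stock is exhausted, never oversells."""
    cur = conn.execute(
        "UPDATE items SET stock = stock - 1 WHERE id = ? AND stock > 0",
        (item_id,),
    )
    return cur.rowcount == 1  # 1 row updated means the purchase succeeded


results = [buy(1) for _ in range(4)]  # 4 buyers compete for 2 units
assert results == [True, True, False, False]
assert conn.execute("SELECT stock FROM items WHERE id = 1").fetchone()[0] == 0
```

Under real concurrency the same statement works because MySQL serializes the decrements on the row lock, which is exactly the contention the paragraph above describes.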

docker

namespace

Docker can specify a set of namespace parameters when creating the container process, so that the container sees only the resources, files, devices, network, users, and configuration specified by its namespaces, and cannot see the host or other unrelated processes. The PID namespace lets the process see only the processes within the namespace; the Mount namespace, only the mount points within it; the Network namespace, only its own network interfaces and configuration.

cgroup

Linux control groups (cgroups) limit the maximum amount of resources a process group may use, such as CPU, memory, and network. Cgroups can also set process priorities and suspend and resume processes. The interface cgroups expose to users is a file system: the /sys/fs/cgroup directory contains subdirectories such as cpuset and memory, one per manageable resource. So how is a resource limit set for a process?

Create a new folder under /sys/fs/cgroup; the system creates the standard set of control files in it by default. Then, after the Docker container starts, its process ID is written into the tasks file, and the corresponding resource files are modified according to the parameters passed to docker at startup.

chroot

chroot (change root file system) changes the root directory of a process to a given mount point. Typically a complete Linux file system is mounted via chroot, though not a Linux kernel. This way, a delivered Docker image contains not only the program to run but also the environment the program depends on; because the whole dependent Linux file system is packaged, for an application the image is the most complete set of dependency libraries.

Incremental layer

Docker's image design introduces the concept of layers: each revision a user makes while building an image adds a layer on top of the original rootfs, and the layers are then combined with union-fs technology. If two layers contain the same file, the outermost layer's copy overrides the original during the merge, for deduplication.

For example, when we pull a MySQL image from the registry and create a container from it, a new layer is added on top of the image's original layers. This layer keeps only incremental changes, including additions, deletions, and modifications of files, and is mounted into the same directory as the original layers through union fs. The added layer is read-write while the original layers are read-only, ensuring that all operations on the image are incremental.

A new image contains both the original layers and the new layer; only the original layers form a complete Linux FS. So how can a file in a read-only layer be deleted if that layer cannot be modified? A whiteout file is generated in the read-write (outermost) layer to shadow the original file.

Release and Deployment

Most companies today use the following deployment approach.

  • Create a pipeline specifying the project name, the corresponding tag, and dependent projects. A pipeline covers the full project lifecycle (developers push code to the repository; packaging; deployment to development; automated testing; deployment to test; deployment to production);
  • Pull the latest code from GitLab by project name and tag (executing a shell script via Java Runtime);
  • Package with Maven; a separate Maven workspace can be created at this point (shell script);
  • Build the image from the pre-written Dockerfile, copying the Maven artifact in, and push the image (shell script);
  • Release the upgrade to the test environment via the K8s API;
  • Release to production through grayscale schemes and the like.


