1. Separation of the application and static resources

At the beginning, the application and its static resources are deployed together. When concurrency reaches a certain level, the static resources should be moved to a dedicated server. Static resources mainly include images, video, JavaScript, CSS, and other resource files. Because these files are stateless, separating them is simple: just move them to a dedicated server and serve them from there, usually under a dedicated domain name.

Using a separate domain name lets the browser request static resources directly from the resource server rather than going through the application server.

2. Page caching

Page caching stores pages generated by the application so they do not have to be regenerated on every request, which saves a lot of CPU; if the cached pages are kept in memory, access is even faster. With Nginx you can use its built-in caching, or you can run a dedicated Squid server. The default invalidation mechanism for page caches is time-based, but you can also invalidate the corresponding cache manually after the underlying data changes.
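As a language-level illustration of the same idea (independent of Nginx or Squid), here is a minimal in-process page cache sketch in Java; the PageCache class, the 60-second TTL, and the renderer callback are assumptions invented for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal time-based page cache: stores rendered HTML and regenerates it
// only after the entry expires or is explicitly invalidated.
public class PageCache {

    private record Entry(String html, long expiresAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public PageCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Return the cached page if it is still fresh; otherwise render and cache it.
    public String getPage(String key, Supplier<String> renderer) {
        Entry e = cache.get(key);
        if (e != null && e.expiresAt() > System.currentTimeMillis()) {
            return e.html();
        }
        String html = renderer.get();                       // expensive page generation
        cache.put(key, new Entry(html, System.currentTimeMillis() + ttlMillis));
        return html;
    }

    // Manual invalidation after the underlying data changes.
    public void invalidate(String key) {
        cache.remove(key);
    }

    public static void main(String[] args) {
        PageCache cache = new PageCache(60_000);            // 60-second TTL
        String page = cache.getPage("/article/42",
                () -> "<html><body>Article 42</body></html>");
        System.out.println(page);
        cache.invalidate("/article/42");                    // e.g. after the article is edited
    }
}
```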

Page caching works best for pages that change infrequently. Many pages, however, are mostly stable but contain a small amount of data that changes very often. For example, an article page can normally be made static, but if the page also shows "like"/"dislike" buttons and their counts, that data changes frequently and gets in the way of making the page static. The problem can be solved by generating a static page and then reading and updating the frequently changing data via Ajax: the page cache (or static page) still serves the bulk of the content, while the volatile data is fetched and displayed in real time.
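One possible shape of such an Ajax endpoint is sketched below using the JDK's built-in com.sun.net.httpserver package; the /article/votes path, the single in-memory counter, and the JSON format are assumptions made for illustration. The static article page would simply fetch this URL with JavaScript and update the number in place.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Tiny JSON endpoint serving only the frequently changing data (vote counts),
// while the article page itself stays a cached/static HTML file.
public class VoteCountServer {

    private static final AtomicLong upVotes = new AtomicLong();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/article/votes", exchange -> {
            if ("POST".equals(exchange.getRequestMethod())) {
                upVotes.incrementAndGet();               // a reader clicked "like"
            }
            byte[] body = ("{\"up\":" + upVotes.get() + "}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();                                   // the static page calls this via Ajax
    }
}
```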

We all know that the most efficient and least resource-hungry pages are pure static HTML, so we should try to serve our pages as static pages wherever possible; the simplest approach really is the most effective one. However, for sites with a lot of content that is updated frequently, we cannot generate these pages by hand one by one, which is how content management systems (CMS) came about. The news channels of the portal sites we visit, and often their other channels as well, are managed and published through such systems. At a minimum, a CMS can automatically generate static pages from the information entered, and it usually also provides channel management, permission management, automatic crawling, and other features. For a large website, an efficient, manageable CMS is essential.

Besides portals and information-publishing sites, community sites with heavy interaction also rely on making content as static as possible to improve performance. Posts and articles in the community are rendered as static pages and regenerated when they are updated; this strategy is widely used, for example by Mop's hodgepodge section and by NetEase's community.

HTML staticization is also an efficient way to implement certain caching strategies. For data that is queried from the database frequently but updated rarely, consider generating static HTML for it. Take the public settings of a forum (BBS) as an example: mainstream forum software stores this information in the database and the front end reads it constantly, yet it changes very rarely. This kind of content can be regenerated as static files whenever it is updated in the admin back end, which avoids a large number of database requests.
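A minimal sketch of this "regenerate on update" idea is shown below; the settings map, the generated HTML fragment, and the output path are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// When the admin back end saves the forum's public settings, regenerate the
// static HTML fragment once instead of querying the database on every page view.
public class SettingsStaticizer {

    public static void publish(Map<String, String> settings, Path outputFile) throws Exception {
        StringBuilder html = new StringBuilder("<ul class=\"board-settings\">\n");
        for (Map.Entry<String, String> e : settings.entrySet()) {
            html.append("  <li>").append(e.getKey())
                .append(": ").append(e.getValue()).append("</li>\n");
        }
        html.append("</ul>\n");
        Files.createDirectories(outputFile.getParent());
        // Good enough for this sketch: write the whole file in one go.
        Files.writeString(outputFile, html.toString(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Called from the admin back end right after the settings are saved.
        publish(Map.of("Board name", "Tech Talk", "Posts per page", "20"),
                Path.of("public/settings.html"));
    }
}
```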

3. Clusters and distributed systems

In a cluster, every server provides the same functionality; a request can be handled by any one of them, and the main purpose is to spread the load.

Distributed means that different services are placed on different servers, so handling a single request may involve several servers; this can shorten the processing time of a request.

There are two kinds of clusters: static resource clusters and application clusters. A static resource cluster is simple; the core problem of an application cluster is Session synchronization.

Session synchronization can be handled in two ways: one is to automatically synchronize the Session to the other servers whenever it changes; the other is to manage Sessions centrally in a program so that all servers in the cluster use the same Session store. Tomcat uses the first approach by default, which can be enabled with simple configuration. For the second approach, you can run an efficient cache such as Memcached on dedicated servers to manage Sessions centrally, and then override the Request's getSession method in the application so that it fetches the Session from those servers.
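To make the second approach concrete, here is a simplified sketch of a centralized session store using the spymemcached client (it needs that library and a running Memcached instance); the 30-minute TTL, the key prefix, and the ID handling are assumptions. A full implementation would wrap HttpServletRequest so that getSession() returns a session backed by this store, which this sketch does not do.

```java
import net.spy.memcached.MemcachedClient;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.UUID;

// Simplified centralized session store backed by Memcached.
// Every node in the cluster reads and writes the same session data.
public class MemcachedSessionStore {

    private static final int SESSION_TTL_SECONDS = 30 * 60;   // 30 minutes
    private final MemcachedClient client;

    public MemcachedSessionStore(String host, int port) throws Exception {
        this.client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    @SuppressWarnings("unchecked")
    public HashMap<String, Serializable> load(String sessionId) {
        Object value = client.get("session:" + sessionId);
        return value != null ? (HashMap<String, Serializable>) value : new HashMap<>();
    }

    public void save(String sessionId, HashMap<String, Serializable> attributes) {
        client.set("session:" + sessionId, SESSION_TTL_SECONDS, attributes);
    }

    public String newSessionId() {
        return UUID.randomUUID().toString();   // would normally be sent back as a cookie
    }

    public static void main(String[] args) throws Exception {
        MemcachedSessionStore store = new MemcachedSessionStore("127.0.0.1", 11211);
        String id = store.newSessionId();
        HashMap<String, Serializable> session = store.load(id);
        session.put("userId", 42L);
        store.save(id, session);                       // visible to every node in the cluster
        System.out.println(store.load(id).get("userId"));
        store.client.shutdown();
    }
}
```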

The other core problem of clustering is load balancing, that is, deciding which server should handle a given request. It can be handled in software or with dedicated hardware (such as F5).
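As a tiny illustration of software load balancing (the real work is usually done by Nginx, LVS, HAProxy, or hardware such as F5), here is a round-robin selector sketch; the server list is made up.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin load balancing: spread requests evenly over identical servers.
public class RoundRobinBalancer {

    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinBalancer(List<String> servers) {
        this.servers = List.copyOf(servers);
    }

    public String pick() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(
                List.of("app-1:8080", "app-2:8080", "app-3:8080"));
        for (int i = 0; i < 6; i++) {
            System.out.println("request " + i + " -> " + lb.pick());
        }
    }
}
```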

4. Reverse proxy

A reverse proxy is one in which the server that the client accesses directly does not actually provide the service, but retrieves resources from another server and returns the results to the user.


4.1 Differences between reverse proxy servers and proxy servers

The job of a (forward) proxy server is to fetch the resources we want on our behalf and return the results to us; we actively tell the proxy which resources we want. For example, if we want to access Facebook but cannot reach it directly, we can let a proxy server access it and return the result to us.

With a reverse proxy server, we simply access a server in the normal way; that server itself calls on other servers for resources and returns them to us, without our being aware of it.

A proxy server is something we use actively and that serves us; it does not need its own domain name. A reverse proxy server is used by the server side, and we are usually unaware of it; it has its own domain name, and accessing it is no different from accessing an ordinary web address.

The reverse proxy server provides three functions:

  1. It can act as a front-end server, integrating with the actual servers that process the requests;

  2. It can perform load balancing;

  3. It can forward requests, for example routing requests for different types of resources to different servers (a minimal forwarding sketch follows this list).
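To make the idea concrete, here is a toy reverse proxy sketch in Java using the JDK's com.sun.net.httpserver and java.net.http.HttpClient; the port, the path-prefix rule, and the backend addresses (static-server, app-server) are assumptions, and a production deployment would normally use Nginx, HAProxy, or similar instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Toy reverse proxy: the client only ever talks to this server, while the
// proxy fetches the real content from different backends by path prefix.
public class TinyReverseProxy {

    private static final HttpClient client = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        HttpServer proxy = HttpServer.create(new InetSocketAddress(8080), 0);
        proxy.createContext("/", exchange -> {
            String path = exchange.getRequestURI().getPath();
            // Function 3: route different kinds of requests to different servers.
            String backend = path.startsWith("/static/")
                    ? "http://static-server:8081"
                    : "http://app-server:8082";
            HttpRequest request = HttpRequest.newBuilder(URI.create(backend + path)).GET().build();
            HttpResponse<byte[]> response;
            try {
                response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            } catch (Exception e) {
                exchange.sendResponseHeaders(502, -1);     // backend unreachable
                exchange.close();
                return;
            }
            exchange.sendResponseHeaders(response.statusCode(), response.body().length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(response.body());
            }
        });
        proxy.start();
    }
}
```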

5. CDN

A CDN is essentially a special cluster of page cache servers. Compared with an ordinary page-cache cluster, what makes it special is where its servers sit and how requests are assigned to them. CDN servers are distributed across the country; when a user request arrives, it is routed to the most suitable CDN node to fetch the data. For example, China Unicom users are routed to nodes on Unicom's network, and users in Shanghai are routed to nodes in Shanghai.

Each CDN node is itself a page cache server. If the requested resource is not in its cache, it fetches it from the origin server; otherwise it returns the cached page directly.

The CDN assigns requests (load balancing) during domain name resolution, using dedicated CDN DNS servers. The usual practice is to add a CNAME record at the DNS provider that points the site's domain name to a dedicated CDN domain name, which the CDN's own DNS servers then resolve to the appropriate CDN node.

Resolution reaches the CDN's DNS servers in the second step because the target domain of the CNAME record has NS records pointing to them. Each CDN node may itself be a cluster of several servers.

6. Low-level optimization

Everything described above is architecture built on top of the underlying infrastructure. In many places data has to be transmitted over the network, so anything that speeds up network transmission improves the whole system.

7. Database clusters and database/table hashing

Large sites have complex applications, and these applications all rely on databases. Under heavy traffic, the database quickly becomes the bottleneck and a single database can no longer keep up, so we need database clusters or database/table hashing (sharding).

For database clusters, many databases have their own solutions: Oracle, Sybase, and others all offer good ones, and MySQL's Master/Slave replication serves a similar purpose. Whichever DB you use, refer to its corresponding solution.

The database clusters mentioned above are constrained by the DB vendor in architecture, cost, and scalability, so we also need to improve the system from the application side, and database/table hashing is the most common and effective approach. We separate the database by application, or by functional module within the application, so that different modules use different databases or tables, and then hash a given table into smaller databases or tables according to some strategy, for example hashing the user table by user ID. This improves system performance at low cost and scales well. Sohu's forum uses such an architecture: user data and post data are split into separate databases, and posts and users are then hashed into databases and tables by board and by ID, so that adding a database to boost performance only requires a simple change to the configuration file.
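A minimal sketch of hashing users into databases and tables by ID is shown below; the shard counts and naming scheme (user_db_N, user_N) are assumptions for illustration, not the scheme used by Sohu.

```java
// Database/table hashing: route each user to a database and a table suffix
// derived from the user ID, so data spreads across 4 databases x 16 tables.
public class UserShardRouter {

    private static final int DB_COUNT = 4;       // user_db_0 .. user_db_3
    private static final int TABLES_PER_DB = 16; // user_0 .. user_15 in each database

    public static String databaseFor(long userId) {
        return "user_db_" + (userId % DB_COUNT);
    }

    public static String tableFor(long userId) {
        return "user_" + ((userId / DB_COUNT) % TABLES_PER_DB);
    }

    public static void main(String[] args) {
        long userId = 123_456L;
        // The application (or a sharding middleware) builds the final SQL target from these.
        System.out.println(databaseFor(userId) + "." + tableFor(userId));
        // Adding capacity means changing DB_COUNT/TABLES_PER_DB and migrating data,
        // or using consistent hashing to limit how much data has to move.
    }
}
```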

8. Summary

The whole evolution of website architecture revolves around big data and high concurrency. The solutions fall into two broad categories: using caches and using more resources. "More resources" mainly means more storage (including memory), more CPUs, and more network capacity, and it comes in two forms: a single resource handling a complete request, and multiple resources cooperating on one request. Clusters and distributed systems are examples on the storage and CPU side, while CDNs and static resource separation are examples on the network side. Once you understand this overall idea, you grasp the essence of architectural evolution and may be able to design better architectures yourself.

A few additional notes:

First of all, we should have a clear idea of the problem before reaching for a solution. If we only copy other people's solutions, we treat them as doctrine without real understanding and cannot generalize from them.

Large data volumes and high concurrency are often mentioned together, although they are completely different things: large data volume refers to the amount of data in the database, while high concurrency refers to heavy traffic hitting the database and the servers.

So the question becomes: given a database holding a huge amount of data, what do we do? To solve a problem, we first need to know what the problem actually is. So what problems does this massive amount of data cause?

The problems with massive data come down to create, read, update, and delete; what else could there be? It is hardly going to be a security issue.

1. Reads from the database are slow.

2. Inserts and updates are slow; this can only be solved by splitting databases and tables (sharding).

There are several ways to address slow database reads. Since database access is slow, why access the database at all when the business logic allows us not to?

1. Use a cache.

2. Use page staticization.

Since we cannot avoid the database entirely, let's optimize it:

3. Optimize the database (this covers a lot: parameter configuration, index optimization, SQL optimization, and so on).

4. Separate active data from the rest of the database.

5. Read/write separation.

6. Batch reads and delayed writes.

7. Use a search engine to query data in the database.

8. Use NoSQL and Hadoop.

9. Split services.

High concurrency solutions

In fact, this question has to be discussed together with the massive-data question above. When does high concurrency occur? Usually when traffic is high, and high traffic usually means the amount of stored data keeps growing, so the two go hand in hand. There are exceptions, of course, such as systems driven by hard demand like 12306, where concurrency is extreme even though the data volume is not that large by comparison. So how is heavy traffic usually handled? Since both the web servers and the database are involved, we have to optimize on both sides.


1. Increase the number of web servers, that is, build a cluster and perform load balancing. If one server cannot handle the load, use several; if a few servers are still not enough, scale out to a whole machine room.

Before moving on to the second point: apart from the database, is there anything else on the server side that can be optimized? Of course there is:

1.1 Page caching

1.2 CDN

1.3 Reverse proxy

1.4 Separation of the application and static resources (for example, downloadable resources are hosted separately so that server can be given high bandwidth)

2. Increase the number of database servers and perform clustering and load balancing.

Massive data solutions

1. Use a cache

Many of these measures complement each other. Caching is used mainly to deal with high concurrency: massive data makes access slow, and slow access in turn makes the high-concurrency problem worse, because the database is usually the bottleneck of web access. So wherever the business logic allows it, we avoid touching the database, and that is where the cache comes in. Keeping the necessary data in memory, instead of reading it from the database every time, avoids needless wasted work and speeds up access; that is the benefit of caching. So what should you pay attention to when using a cache, and when choosing software to manage it?
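As one illustration (not prescribed by the original article), here is a minimal cache-aside sketch using the spymemcached client, the same library as in the session example above; the key format, the 300-second TTL, and the loadUserNameFromDb/updateUserNameInDb placeholders are all assumptions, and a real DAO would use JDBC and a running Memcached instance.

```java
import net.spy.memcached.MemcachedClient;
import java.net.InetSocketAddress;

// Cache-aside pattern: read from the cache first and fall back to the database
// on a miss; update the database on writes and invalidate the cached copy.
public class UserCacheDao {

    private static final int TTL_SECONDS = 300;
    private final MemcachedClient cache;

    public UserCacheDao(MemcachedClient cache) {
        this.cache = cache;
    }

    public String findUserName(long userId) {
        String key = "user:name:" + userId;
        Object cached = cache.get(key);
        if (cached != null) {
            return (String) cached;                 // cache hit: no database work
        }
        String name = loadUserNameFromDb(userId);   // cache miss: one database read
        cache.set(key, TTL_SECONDS, name);
        return name;
    }

    public void renameUser(long userId, String newName) {
        updateUserNameInDb(userId, newName);
        cache.delete("user:name:" + userId);        // keep cache and database consistent
    }

    // Placeholders standing in for real JDBC access.
    private String loadUserNameFromDb(long userId) { return "user-" + userId; }
    private void updateUserNameInDb(long userId, String newName) { /* UPDATE ... */ }

    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
        UserCacheDao dao = new UserCacheDao(client);
        System.out.println(dao.findUserName(42));   // first call hits the database
        System.out.println(dao.findUserName(42));   // second call is served from the cache
        client.shutdown();
    }
}
```

Typical things to watch for with this pattern are expiry times, hit rate, keeping the cache consistent with the database on writes, and avoiding a stampede of database reads when a hot key expires.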

2. Page staticization; is there anything more that needs explaining here?

3. Database optimization

3.1 Database table structure design

3.2 Choice of data types

3.3 SQL optimization

3.4 Index optimization

3.5 Configuration optimization

There is so much to cover here that it really deserves a chapter of its own.

4. Separate active data from the rest of the database

Why separate it? Here is a problem I ran into in a real environment. A table had only a dozen or so fields and about 1.3 million rows, yet its data had grown to 5 GB, which in itself is unreasonable: so few rows taking up so much space means some fields store very large strings (article bodies and the like). Most queries against the table did not even need those large fields, yet they took a long time and produced many slow-query log entries. In a case like this we can vertically shard the table and split out the active data, which greatly speeds up access.

5. Read/write separation, sketched below.
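As a closing illustration (the article ends here without elaborating), this is a minimal read/write routing sketch assuming a MySQL-style primary with replicas; the ReadWriteRouter class and its method names are invented for the example, and in practice frameworks such as Spring's AbstractRoutingDataSource or a database proxy can play the same role.

```java
import javax.sql.DataSource;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Read/write separation: send writes to the primary (master) and spread reads
// over the replicas (slaves), which receive data through replication.
public class ReadWriteRouter {

    private final DataSource primary;
    private final List<DataSource> replicas;
    private final AtomicInteger next = new AtomicInteger();

    public ReadWriteRouter(DataSource primary, List<DataSource> replicas) {
        this.primary = primary;
        this.replicas = List.copyOf(replicas);
    }

    // INSERT / UPDATE / DELETE always go to the primary.
    public DataSource forWrite() {
        return primary;
    }

    // SELECTs are spread round-robin over the replicas; because of replication lag,
    // a read-your-own-write query may still need to go to the primary.
    public DataSource forRead() {
        if (replicas.isEmpty()) {
            return primary;
        }
        return replicas.get(Math.floorMod(next.getAndIncrement(), replicas.size()));
    }
}
```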