The challenges of large websites come from their huge user bases, high concurrent access, and massive volumes of data. Even a simple business becomes a hard problem once it has to handle tens of millions of records and hundreds of millions of users. Large website architectures are designed to solve exactly these problems. For more background, see the two companion articles: the summary of how major Internet companies' architectures evolved and the overview of large website architecture technology.

Much of this article comes from Technical Architecture for Large Web Sites, which is well worth reading and highly recommended.

Features of large web site systems

High concurrency and high traffic

Large sites must handle vast numbers of concurrent users and heavy traffic. Google serves an average of 3.5 billion page views per day with 300 million daily unique IP visits; Tencent QQ reached a peak of 140 million simultaneous online users (2011 data).

High availability

The system must provide uninterrupted service 24 hours a day, 7 days a week.

Huge amounts of data

Massive amounts of data must be stored and managed, which requires large numbers of servers. Facebook ingests nearly a billion photos a week, Baidu indexes tens of billions of web pages, and Google runs nearly one million servers to serve users around the world.

Users are widely distributed and the network situation is complex

Many of the largest Internet sites serve a global audience, so users are widely distributed and network conditions vary enormously. In China there is the added problem of poor interconnection between different carriers' networks.

Poor security environment

Because the Internet is open, websites are easy targets for attack, and large websites face hacking attempts almost every day.

Requirements change quickly and are released frequently

Unlike traditional software, Internet products are released at a high frequency in order to adapt quickly to the market and meet user needs. Large websites typically push a new product release every week, while small and medium-sized sites release even more often, sometimes dozens of times a day.

Incremental development

Almost all large Internet sites started as small sites and developed gradually. Facebook was created by Zuckerberg in his Harvard dorm room; Google's first server was deployed in a lab at Stanford University; Alibaba was born in Jack Ma's living room. Good Internet products are grown through operation rather than fully designed up front, and that growth is precisely the development and evolution of the website's architecture.


Evolution and development of large website architecture

As noted above, the technical challenges of large websites stem from huge user bases, high concurrency, and massive data; even a simple business becomes very difficult at that scale. Large website architecture evolves step by step to address these problems.

Initial site architecture

Large websites grow out of small ones, and the same is true of their architectures, which evolve gradually from small-site architectures. A brand-new small site has few visitors, so a single server is more than enough. The architecture at this stage is shown below:

Applications, databases, files, and all resources are on one server.


Application services and data services are separated

As the website's business grows, one server gradually becomes insufficient: more and more users degrade performance, and more and more data exhausts the storage space. At that point the application must be separated from the data. After the separation, the site uses three servers: an application server, a file server, and a database server. The three have different hardware requirements:

The application server handles a great deal of business logic, so it needs faster, more powerful CPUs;

The database server needs fast disk retrieval and data caching, so it needs faster disks and more memory;

The file server stores large numbers of user-uploaded files, so it needs larger hard disks.

At this point, the architecture of the website system is shown in the figure below:

After applications and data are separated, servers with different characteristics take on different roles, and both the concurrent processing capacity and the data storage space of the website improve greatly, supporting further business growth. But as users keep increasing, the site faces a new challenge: heavy database pressure causes access delays, which drag down the performance of the whole site and hurt the user experience. The architecture needs further optimization.


Use caching to improve site performance

Web access follows the same 80/20 rule as wealth distribution in the real world: 80% of business accesses concentrate on 20% of the data. Since most accesses hit a small fraction of the data, caching that fraction in memory reduces database pressure, speeds up data access across the whole site, and leaves the database more headroom for writes. Websites use two kinds of cache: a local cache on the application server and a remote cache on dedicated distributed cache servers (a cache-aside sketch follows the list below).

A local cache is faster to access, but it is limited by the application server's memory and competes with the application for it.

A remote distributed cache can be deployed as a cluster, with large-memory machines serving as dedicated cache servers. In theory, such a cache service is not limited by any single machine's memory capacity.
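
Below is a minimal cache-aside sketch of the read path described above. The in-process dictionary stands in for a real cache server (such as Memcached or Redis), and db.query_product is a hypothetical database accessor; the miss-then-populate pattern is the point.

    import time

    class Cache:
        """Tiny in-memory cache with per-entry expiration (stand-in for a cache server)."""
        def __init__(self):
            self._store = {}  # key -> (value, expires_at)

        def get(self, key):
            entry = self._store.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if time.time() > expires_at:   # an expired entry counts as a miss
                del self._store[key]
                return None
            return value

        def set(self, key, value, ttl=60):
            self._store[key] = (value, time.time() + ttl)

    def get_product(cache, db, product_id):
        """Cache-aside read: try the cache first, fall back to the database."""
        key = "product:%s" % product_id
        product = cache.get(key)
        if product is None:                       # cache miss or expired
            product = db.query_product(product_id)  # hypothetical DB accessor
            cache.set(key, product, ttl=300)        # populate for later reads
        return product

On a miss the value is loaded from the database and written back with a TTL, so hot data stays in memory and expired entries simply fall through to the database again.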

With caching in place, data access pressure is effectively relieved, but a single application server can handle only a limited number of requests and connections, so during peak hours the application server becomes the bottleneck of the whole site.


Use application server clusters to improve the concurrent processing capabilities of web sites

Clustering is the usual way to tackle high concurrency and massive data. When one server's processing capacity or storage runs out, do not try to swap in a more powerful machine: for a large website, no single server, however powerful, can keep up with continuously growing business demand. It is far more appropriate to add servers that share the original server's access and storage load. From an architectural standpoint, as long as adding one more server improves load capacity, we can keep adding servers the same way to improve system performance, achieving scalability. Clustering the application servers is a relatively simple and mature design for a scalable website architecture, as shown in the figure below:

A load-balancing scheduler distributes requests from users' browsers across the servers in the application server cluster. As the user base grows, more application servers can be added to the cluster, so application server capacity no longer bottlenecks the whole website.
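
The scheduling idea is easy to see in a minimal round-robin sketch. Real deployments use hardware load balancers or software such as nginx or LVS; the server addresses here are hypothetical.

    import itertools

    class RoundRobinBalancer:
        """Hand out application servers in strict rotation."""
        def __init__(self, servers):
            self._cycle = itertools.cycle(servers)

        def pick(self):
            """Return the next application server in rotation."""
            return next(self._cycle)

    balancer = RoundRobinBalancer(["app1:8080", "app2:8080", "app3:8080"])
    for _ in range(6):
        print(balancer.pick())  # app1, app2, app3, app1, app2, app3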


Database read/write separation

After the site adopts caching, most reads can be served without touching the database, but some reads still reach it (cache misses, cache expiration), and every write must go to the database. Once the site reaches a certain scale, the database again becomes the bottleneck under heavy load. Most mainstream databases provide master/slave hot backup: by configuring a master/slave relationship between two database servers, updates on one server are synchronized to the other. The website uses this feature to separate database reads from writes, relieving the database load. As shown below:

When writing data, the application server accesses the master database; the master synchronizes updates to the slave through the replication mechanism, so reads can be served from the slave. To keep database access convenient after the split, the application server usually uses a dedicated data access module that makes read/write separation transparent to application code, as sketched below.
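
A minimal sketch of such a data access module, assuming routing by statement type: writes always go to the master, while reads may be served by any slave. The connection strings are hypothetical placeholders.

    import random

    class ReadWriteRouter:
        """Route SQL statements: SELECTs to a slave, everything else to the master."""
        def __init__(self, master, slaves):
            self._master = master
            self._slaves = slaves

        def connection_for(self, sql):
            if sql.lstrip().lower().startswith("select"):
                return random.choice(self._slaves)  # reads spread across slaves
            return self._master                     # INSERT/UPDATE/DELETE hit the master

    router = ReadWriteRouter(master="db-master:3306",
                             slaves=["db-slave1:3306", "db-slave2:3306"])
    print(router.connection_for("SELECT * FROM orders"))      # a slave
    print(router.connection_for("UPDATE orders SET paid=1"))  # the master

Because master/slave replication is asynchronous, a real module must also handle replication lag, for example by reading freshly written data from the master.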


Use reverse proxy and CDN to speed web site response

As the website's business develops, its user base keeps growing. Because the network environment in China is complex, access speed varies dramatically between users in different regions. Studies show that access latency is positively correlated with user churn: the slower a site responds, the more likely users are to lose patience and leave. To deliver a better experience and retain users, websites need to accelerate access. The main means are CDNs and reverse proxies. As shown below:

CDN and reverse proxy are based on caching.

A CDN is deployed in the network providers' equipment rooms, so a user requesting website services can be answered with data from the provider's equipment room nearest to that user.

A reverse proxy is deployed in the website's central equipment room. When a user's request arrives there, the first server it reaches is the reverse proxy; if the reverse proxy has the requested resource in its cache, it returns the resource to the user directly.

The purpose of using CDN and reverse proxy is to return data to the user as soon as possible, on the one hand, to speed up the user access speed, on the other hand, to reduce the load on the back-end server.
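
The reverse-proxy half of this can be sketched in a few lines: serve from the proxy's cache on a hit, otherwise forward to a backend application server and cache the response. Here fetch_from_backend is a hypothetical placeholder for the upstream call; a real proxy would also honor Cache-Control headers and expiration.

    proxy_cache = {}

    def handle_request(path, fetch_from_backend):
        """Caching reverse-proxy logic: hit -> answer locally, miss -> forward and cache."""
        if path in proxy_cache:              # cache hit: no backend work at all
            return proxy_cache[path]
        response = fetch_from_backend(path)  # cache miss: forward to an app server
        proxy_cache[path] = response         # keep the response for later requests
        return response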


Use distributed file systems and distributed database systems

No single server, however powerful, can satisfy the continuously growing business needs of a large website. Read/write separation splits the database from one server onto two, but as the business develops further even that is not enough, so a distributed database is needed. The same goes for the file system, which must become a distributed file system. As shown below:

A distributed database is the last resort for splitting a website's database and is used only when the data volume of a single table is extremely large. The far more common splitting method is to deploy data belonging to different businesses on different physical servers, as in the sketch below.
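
A minimal sketch of that common splitting method: a routing table maps each business line to the physical database server that owns its data. All names are hypothetical.

    # Business-level (vertical) database splitting: each business line's
    # data lives on its own physical server.
    BUSINESS_DB = {
        "users":  "db-users.internal:3306",
        "orders": "db-orders.internal:3306",
        "items":  "db-items.internal:3306",
    }

    def db_for(business):
        """Return the database server that owns a business line's data."""
        return BUSINESS_DB[business]

    print(db_for("orders"))  # db-orders.internal:3306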


Use NoSQL and search engines

As website business becomes more and more complex, the demand for data storage and retrieval is also more and more complex, so websites need to adopt some non-relational database technologies such as NoSQL and non-database query technologies such as search engines. As shown below:

Both NoSQL and search engines are technologies born of the Internet, with better support for scalable, distributed deployments. The application server accesses all of these data sources through a unified data access module, relieving the application of managing many data sources itself; a sketch of such a module follows.
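
A minimal sketch of such a unified data access module, assuming three backends with simple query interfaces (all names hypothetical): application code asks for data by logical operation and never needs to know which store answers.

    class UnifiedDataAccess:
        """Single facade over a relational DB, a NoSQL store, and a search engine."""
        def __init__(self, sql_db, nosql_store, search_engine):
            self._sql = sql_db
            self._nosql = nosql_store
            self._search = search_engine

        def get_order(self, order_id):
            # transactional business data stays in the relational database
            return self._sql.query("SELECT * FROM orders WHERE id = %s", order_id)

        def get_session(self, session_id):
            # simple key/value data goes to the NoSQL store
            return self._nosql.get("session:" + session_id)

        def search_products(self, keywords):
            # full-text retrieval goes to the search engine
            return self._search.query(keywords)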


Business splitting

Large sites respond to increasingly complex business scenarios by dividing their entire web business into different product lines using a divide-and-conquer approach. For example, a large shopping and transaction website will split its home page, shops, orders, buyers, and sellers into different product lines, each the responsibility of a separate business team.

Technically, the website is divided into many different applications along product lines, each deployed and maintained independently. Applications can be related to one another through hyperlinks (each navigation link on the home page points to a different application's address), can distribute data to one another through message queues, and, most often, form one associated, complete system by accessing the same data storage system, as shown in the figure below:
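
Message-queue data distribution can be sketched minimally as a producer and a consumer. Python's in-process queue.Queue stands in for a real broker such as Kafka or RabbitMQ, and the event format is hypothetical.

    import json
    import queue

    order_events = queue.Queue()  # stand-in for a real message broker

    def order_app_place_order(order):
        """Producer: the order product line publishes an event."""
        order_events.put(json.dumps({"type": "order_created", "order": order}))

    def buyer_app_consume():
        """Consumer: the buyer product line processes events at its own pace."""
        event = json.loads(order_events.get())
        print("buyer app saw:", event["type"])

    order_app_place_order({"id": 42, "amount": 99.5})
    buyer_app_consume()  # -> buyer app saw: order_created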


Distributed service

As businesses are split into smaller and smaller applications and storage systems grow ever larger, the overall complexity of the system increases exponentially, and deployment and maintenance get harder. Because every application needs to connect to every database system, the number of connections grows multiplicatively: with n application servers each connecting to m database servers there are n × m connections, which on a site with tens of thousands of servers is effectively the square of the deployment size, exhausting database connection resources and causing denial of service.

Since every application performs many of the same service operations, such as user management and product management, these common services can be extracted and deployed independently. The reusable services connect to the database and provide shared services, while the application tier only needs to manage its user interface and call the shared services through a distributed service framework to complete concrete business operations. As shown below:
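
A minimal sketch of an application calling an extracted shared service rather than the database directly: only the service tier holds database connections, so the n × m connection explosion disappears. The HTTP endpoint and client here are hypothetical; production sites use full RPC frameworks (service registries, load balancing, failover) such as Dubbo or gRPC.

    import json
    import urllib.request

    class UserServiceClient:
        """Thin client for a hypothetical shared user service."""
        def __init__(self, base_url):
            self._base_url = base_url

        def get_user(self, user_id):
            url = "%s/users/%d" % (self._base_url, user_id)
            with urllib.request.urlopen(url) as resp:  # call the service, not the DB
                return json.loads(resp.read())

    # Application code depends only on the service interface:
    # client = UserServiceClient("http://user-service.internal:8080")
    # user = client.get_user(12345)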

As a large website's architecture evolves along this path, most technical problems, such as real-time data synchronization across data centers and issues specific to particular web businesses, can be solved by combining and improving on the existing technology architecture. For more on distributed systems, see the distributed series.