Taobao technology structure changes

Since its establishment in 2003, Taobao’s business has developed rapidly, growing at a rate of almost 100% every year. At the beginning, in order to quickly go online and seize the market, I chose the LAMP architecture, which was popular at that time, and used PHP as the website development language, Linux as the operating system, Apache as the Web server, and MySQL as the database. Taobao went online in less than three months. At that time, there were about 10 application servers in the whole website. The MySQL database adopted the deployment mode of read and write separation and one primary and two standby databases.

In 2004, driven by the business development of Taobao, we transformed LAMP architecture into database architecture and EMC storage mode of Oracle+IBM minicomet by referring to some enterprise solutions of telecom operators and banks (Figure 2). Although the solution is expensive, the performance is very good. At the same time, as the site traffic increases, the system appears to be somewhat overwhelmed. At that time, the most worried problem is that if the website traffic continues to increase, the transaction volume continues to increase, how to design the system architecture of the website? How do I select a database? How do I choose cache? How do you build business systems? … Later, REFERRING to the Internet design architecture of eBay, I designed a Technical scheme of Java and used many Java open source products. For example, JBoss, which was popular at the time, was chosen as the application server. Choose an open source IOC container, Spring, to manage business classes; A database access tool, IBatis, is packaged as an Object-Reletionship mapping tool for databases and Java classes. In addition, for commodity search function, ISearch search engine developed by ourselves is used to replace the search in Oracle database to reduce the pressure of database server. The method is relatively simple. Every night, we dump the data of Oracle minicompany and Build it into ISearch index. At that time, the commodity quantity was not large.




Since 2006, In order to improve user experience, Taobao began to establish its own CDN site. As the main traffic of Taobao comes from static data such as pictures and descriptions of various commodities, self-established CDN can make these resources closer to users, improve users’ access speed and improve their browsing experience.

In 2007, taobao’s annual transaction volume exceeded 40 billion yuan, averaging nearly 100 million a day, with more than 1 million transactions created every day. At that time, the main problems are: some of the system traffic is very large, such as commodity details, if the direct access to the database, will lead to the database pressure is very large; Such as user information, visit a page, need to query buyer information, seller information, show the buyer’s credit, seller’s service star, etc. At this point, Taobao uses distributed cache TDBM (the predecessor of Tair) to cache these hot static data in memory to improve access performance. In addition, their research and development of the distributed file system deployment of TFS in x86 server, replace the commercial NAS storage device to store all kinds of file information, taobao, like a commodity snapshot pictures, product description information, transaction information, to reduce the cost and improve the capacity and performance of the overall system, and can realize more flexible expansibility. About 200 TFS servers will go live in the first phase. In addition, the ISearch search engine was changed to a distributed architecture to support horizontal scaling with 48 nodes deployed. Figure 3 illustrates this architectural idea.




At the beginning of 2008, in order to solve the bottleneck problems of centralized Oracle database architecture (connection number limit, I/O performance), the system was divided into user domain, commodity domain, transaction domain, store domain and other business areas, and more than 20 business centers were established, such as commodity center, user center, transaction center, etc. All systems with user access requirements must be accessed through the remote interface provided by the service center instead of directly accessing the underlying MySQL database. HSF is used to invoke the service interface of the service center, and asynchronous Notify messaging middleware is used to invoke service systems. Figure 4 is the distributed architecture of Taobao.




Since 2010, Taobao has focused on the unified architecture system, considering the requirements of development efficiency, standardization of operation and maintenance, high performance, high scalability, high availability and low cost from the overall system level. The underlying infrastructure uniformly adopts Ali Cloud computing platform (FIG. 5). Ali cloud computing services such as SLB, ECS, RDS, OSS, ONS and CDN are used to realize dual machine room Dr And remote machine room uncentralized deployment through the high availability features provided by Ali Cloud services, providing stable, efficient and easy to maintain infrastructure support for Taobao services.




In the process of transferring from IOE architecture to cloud computing platform technology architecture, the following technical challenges are mainly faced.

■ Availability: Can the cloud computing platform based on DISTRIBUTED architecture of PC servers achieve high availability without the high redundancy mechanism of minicomputers and high-end storage?

■ Consistency: Oracle implements physical consistency based on RAC and shared storage. Can RDS for MySQL achieve the same effect?

■ High performance: High-end storage has a strong I/O capability. Can RDS based on PC servers provide the same or even higher I/O processing capability? Can MySQL and Oracle provide the same PERFORMANCE for SQL processing?

■ Expansibility: how to split the business logic, how to servize, how much data is divided into how many tables, what dimension is divided, how to split the second time more convenient, etc.

Based on ali cloud computing platform, through the adoption of appropriate technology strategy and best practices, including: application of stateless, effective use of the cache (the browser cache, the reverse proxy cache, page caching, partial page caching, object caching and separation, speaking, reading and writing), service atomization, database partition, asynchronous performance problems, minimize things unit, appropriate to give up. As well as automatic monitoring/operation and maintenance means, including monitoring and early warning, unified configuration management, basic server monitoring, URL monitoring, network monitoring, inter-module call monitoring, intelligent analysis monitoring, integrated fault management platform, capacity management. Can solve the above problems well, so as to achieve the overall system of high scalability, lower cost, higher performance and availability of the implementation effect.

Migrating cloud Architecture best practices

The technical architecture of Taobao is a process of gradual evolution with the gradual development of the business, in which many valuable architectural best practices are deposited. For the majority of enterprise customers, they can choose the appropriate technical architecture based on their own business scenarios to realize the internet-based design of the overall IT system. Migrating cloud architectures in different application scenarios, including file storage, application services, OLTP databases, and OLAP databases.

For file storage, OSS can directly replace EMC storage to store massive data files. The maximum capacity of OSS storage can be up to 40PB. In addition, because OSS is distributed storage, it can significantly improve data access performance through parallel read and write on multiple nodes. For large files, you can also use the Multipart Upload method to transfer and store large files in parallel to achieve high performance.

For application services, IBM minicomputer can be replaced by SLB+ multiple ECS instances combination (FIG. 6), or ali Cloud middleware cloud services such as ACE, ONS and OpenSearch can be deployed directly according to different application types.

The migration of OLTP applications is relatively complex. Currently, the maximum RDS instance of aliyun is 48GB memory, 14,000 iops, and 1TB storage capacity (SSD storage). It supports MySQL and SQL Server. This configuration can be used as a single database server to meet the database application requirements in many scenarios, and can directly replace the IBM minicomputer, Oracle database, and EMC storage in most scenarios.

For applications with higher performance requirements, open cache service OCS can be introduced to load part of query data into distributed cache to reduce RDS data query times, improve system data query concurrency efficiency and reduce response time, as shown in Figure 7.

For scenarios where read requests are much larger than write requests, multiple RDS databases can be used to achieve read and write separation in a distributed manner. Write transactions mainly occur in the master library, while read requests access the standby library. Read libraries can be expanded according to requirements to improve overall request performance.



For large database tables, data can be distributed across multiple RDS instances by horizontal shard, and performance and capacity can be improved by parallel distributed database operations.




In general, the original IOE architecture can be replaced by scale-out mode through migration to RDS, introduction of data cache, database and table, read and write separation and other ways, and better performance and scalability can be achieved.




For OLAP applications, ODPS+OTS+RDS/ADS can be used instead of minicomputer +Oracle DB+OLAP+RAC+EMC storage solution, as shown in Figure 11. Generally speaking, the general architecture scheme of cloud migration is shown in Figure 12. The cloud migration scheme for specific business systems needs to be analyzed and rationally selected based on the actual situation.