






Toutiao was founded in March 2012 and is only four years old. Its engineering team has grown from a dozen engineers to a hundred, and then to more than 200. The product line has expanded from Neihan Duanzi to Jinri Toutiao (Today's Headlines), Jinri Temai (Today's Sale), and Jinri Dianying (Today's Movie).

I. Product background

Toutiao is a client app that provides users with personalized news and information. Here are some current Toutiao figures (from both internal and public sources):

  • 500 million registered users

150 million in May 2014, 300 million in May 2015, and 500 million in May 2016, nearly doubling each year.

  • 48 million daily active users

Daily active users were 10 million in 2014 and 30 million in 2015.

  • 500 million PV per day

500 million article views and 100 million video views. Page requests exceed 3 billion.

  • Users stay for more than 65 minutes

1. Article crawling and analysis

About 10,000 original news items are produced every day across major news websites and local sites, along with novels, blog posts, and other articles. Writing a crawler to collect them is not difficult for an engineer.

Next, Toutiao manually reviews the articles and filters out sensitive ones. In addition, a large number of original articles from Toutiaohao (Toutiao's own publishing accounts) have joined the content selection queue.

Next, we run text analysis on each article, such as classification, tagging, and topic extraction, and compute signals such as the article's region, popularity, and weight.

2. User modeling

When a user starts using Toutiao, the logs of their actions are analyzed in real time. The tools used are as follows:

– Scribe

– Flume

– Kafka

We mine each user's interests by learning from every action the user takes. The main tools are:

– Hadoop

– Storm

As in most architectures, the resulting user model data is stored in MySQL/MongoDB (with read/write separation) and Memcache/Redis.
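One common way to combine a persistent store with a cache is the cache-aside pattern. The sketch below is illustrative only: plain dictionaries stand in for MySQL/MongoDB and Memcache/Redis, and the class and method names are hypothetical rather than Toutiao's actual code.

```python
from typing import Dict, Optional


class UserModelStore:
    """Cache-aside lookup: read the cache first, fall back to the database."""

    def __init__(self) -> None:
        # Assumption: dicts stand in for Memcache/Redis and MySQL/MongoDB.
        self.cache: Dict[str, dict] = {}
        self.database: Dict[str, dict] = {}
        self.db_reads = 0  # counts how often the database is actually hit

    def save(self, user_id: str, model: dict) -> None:
        # Write to persistent storage, then drop any stale cached copy.
        self.database[user_id] = model
        self.cache.pop(user_id, None)

    def load(self, user_id: str) -> Optional[dict]:
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        model = self.database.get(user_id)
        if model is not None:
            # Populate the cache so the next read skips the database.
            self.cache[user_id] = model
        return model
```

Repeated reads of the same user model touch the database only once until the model is saved again.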

As the user base keeps growing, the machine cluster that processes user models has become large. Before 2015 it had about 7,000 machines. The user recommendation model includes the following dimensions:

1. User subscriptions

2. Tags

3. Shuffled pushing of some articles

At this scale, recommendations have to be computed continuously.
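To make the subscription and tag dimensions concrete, here is a minimal scoring sketch. It assumes a user model with a set of subscribed channels and per-tag weights. The field names, the `subscription_bonus` parameter, and the additive formula are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def score_article(user: Dict, article: Dict, subscription_bonus: float = 1.0) -> float:
    """Score one article for one user from subscriptions and tag weights."""
    score = 0.0
    # Dimension 1: the article belongs to a channel the user subscribes to.
    if article["channel"] in user["subscriptions"]:
        score += subscription_bonus
    # Dimension 2: sum the user's learned weight for each tag on the article.
    for tag in article["tags"]:
        score += user["tag_weights"].get(tag, 0.0)
    return score


def rank_articles(user: Dict, articles: List[Dict]) -> List[Dict]:
    """Return the articles ordered from highest to lowest score."""
    return sorted(articles, key=lambda a: score_article(user, a), reverse=True)
```

A real system would add many more signals and re-rank as new actions arrive. This sketch only shows how the listed dimensions could feed a single score.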

3. “Cold Start” for New Users

Toutiao “identifies” a new user from the phone model, operating system, OS version, and similar signals. In addition, when a user logs in through a social account such as Sina Weibo, Toutiao builds a preliminary “portrait” of the user from their friends, followers, posts, reposts, and comments.

The main parameters for analyzing users are as follows:

– Following and follower relationships

– Relationships

– User tags

Beyond the phone's hardware, Toutiao also analyzes the apps a user has installed. For example, it analyzes phone model and apps together, distinguishing Xiaomi, Samsung, and Apple users, and it also looks at the bookmarks in the user's browser. Toutiao captures users' actions on the app's channels in real time. It also takes into account the channels the user subscribes to, such as movies, jokes, and merchandise.
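As a rough illustration of how such signals could seed a first profile, the sketch below turns installed apps and subscribed channels into normalized interest weights. The `APP_TAGS` mapping, the app names, and the equal weighting are hypothetical assumptions, not details from Toutiao.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical mapping from installed apps to interest tags.
APP_TAGS: Dict[str, List[str]] = {
    "stock_trader": ["finance"],
    "football_live": ["sports"],
    "movie_tickets": ["movies", "entertainment"],
}


def initial_portrait(installed_apps: List[str],
                     subscribed_channels: List[str]) -> Dict[str, float]:
    """Build a starting interest profile for a user with no reading history."""
    counts: Counter = Counter()
    for app in installed_apps:
        # Apps missing from the mapping contribute nothing.
        counts.update(APP_TAGS.get(app, []))
    # Each subscribed channel counts as one vote for that interest.
    counts.update(subscribed_channels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize so the weights sum to 1.
    return {tag: n / total for tag, n in counts.items()}
```

The resulting weights are only a starting point and would be overwritten quickly as real reading behavior comes in.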

4. Recommendation system

The recommendation system, also known as the recommendation engine, is a core part of Toutiao's technical architecture. It comes in two types, automatic and semi-automatic:

1. Automatic recommendation system

– Automatic candidate selection

– Automatic user matching, for example by user location and extracted user information

– Automatic generation of push tasks

This requires an efficient, highly concurrent push system that can deliver to hundreds of millions of users.

2. Semi-automatic recommendation system

– Automatic selection of candidate articles

– Selection based on users' on-site and off-site actions

On the technical side, Toutiao's channels are split into category channels, interest-tag channels, keyword channels, text analysis, and so on, each handled by a relatively independent development team. There are currently more than 300 classifiers, and new user models are still being added. Existing user models do not need to be retired and continue to play a role.

Before Toutiaohao launched, content came mainly from crawling articles on other platforms and then deduplicating them. The volume was a few million articles a year, which is not large. Most of the data comes from collecting user action logs, interests, and user models.

For a news app, technical metrics such as screen scrolling, whether the user has read an article, and dwell time all need special attention.
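The deduplication step can be sketched as follows. This is a deliberately simple assumption: it only removes articles whose text is identical after lowercasing and stripping punctuation and whitespace. The source does not say which method Toutiao uses, and a production crawler would more likely need near-duplicate detection.

```python
import hashlib
import re
from typing import Iterable, List


def fingerprint(text: str) -> str:
    """Hash the text after normalizing case, punctuation, and whitespace."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(articles: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct article, preserving order."""
    seen = set()
    unique = []
    for text in articles:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

At a few million articles a year, the set of fingerprints fits comfortably in memory, so this kind of exact matching stays cheap.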

5. Data storage

Toutiao uses MySQL or MongoDB for persistent storage plus Memcached (Redis) for caching. The data is split across many databases (with one large in-memory store), and SSD-based products have also been tried.

Toutiao stores images directly in the database, saves the files in a distributed fashion, and serves reads through a CDN.
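Splitting data across many databases requires a rule for deciding which database holds a given record. A minimal sketch of hash-based routing follows. The shard count, the `user_model_NN` naming scheme, and the use of CRC32 are assumptions for illustration, since the source does not describe the actual routing rule.

```python
import zlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user ID to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # CRC32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(user_id.encode("utf-8")) % shard_count


def database_name(user_id: str, shard_count: int = 16) -> str:
    """Return the hypothetical database name that stores this user's data."""
    return f"user_model_{shard_for(user_id, shard_count):02d}"
```

The same user ID always routes to the same database, so reads and writes for one user never need to consult the other shards.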

6. Message push

For users, message push means timely access to information. For operations, it raises user activity. For example, pushing can lift Toutiao's DAU by about 20%, while not pushing reduces DAU by about 10% (2015 data).

ROI metrics to track after a push are the click-through rate and the number of clicks. App uninstalls and push opt-outs can also be monitored.

The main content Toutiao pushes includes breaking news and hot topics, replies to comments, and notifications that off-site friends have registered.

At Toutiao, push is also personalized:

– Frequency personalization

– Content personalization

– Region

– Interest

For example:

By city: news about an event in Chaoyang, Liaoning is sent to local users in Chaoyang.

By interest: news that Jingdong acquired Yihaodian is sent to users who are interested in the Internet industry.
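The two targeting examples above can be expressed as simple filters over user attributes. The sketch below assumes each user record carries a city and a set of interests. The `User` and `PushTask` types and their fields are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class User:
    user_id: str
    city: str
    interests: Set[str] = field(default_factory=set)


@dataclass
class PushTask:
    title: str
    city: Optional[str] = None   # restrict to one city when set
    topic: Optional[str] = None  # restrict to one interest when set


def select_recipients(task: PushTask, users: List[User]) -> List[str]:
    """Return the IDs of users who match every filter set on the task."""
    recipients = []
    for user in users:
        if task.city is not None and user.city != task.city:
            continue
        if task.topic is not None and task.topic not in user.interests:
            continue
        recipients.append(user.user_id)
    return recipients
```

A task with neither filter set goes to everyone, which corresponds to a nationwide push.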

Push platform tools should be chosen against the following criteria:

– Channel: fast above all, but also controllable, reliable, and resource-efficient

– Push: fast delivery, policy support along multiple dimensions, traceability, and a developer-friendly interface

– Operations back end: fast feedback, including timeliness and popularity, with tools that are easy to operate

– Operations staff: a clear decision on whether to confirm a recommendation, including handling of the push copy

Therefore, the push back end should provide daily reports, a complete data dashboard, and support for A/B testing.

Part of the push system runs in Toutiao's own IDC. When the sending volume is especially large and bandwidth consumption is heavy, services similar to Aliyun can be used, which saves costs effectively.
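As an illustration of what A/B test support might involve, the sketch below assigns each user to a stable bucket and computes a click-through rate per push. Hashing the experiment name together with the user ID is a common approach, but the function names and the two-bucket default are assumptions, not Toutiao's implementation.

```python
import hashlib
from typing import Sequence


def ab_bucket(user_id: str, experiment: str,
              buckets: Sequence[str] = ("A", "B")) -> str:
    """Deterministically assign a user to one bucket of an experiment."""
    # Including the experiment name keeps assignments independent
    # from one experiment to the next.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]


def click_rate(clicks: int, delivered: int) -> float:
    """Click-through rate of a push; 0.0 when nothing was delivered."""
    return clicks / delivered if delivered else 0.0
```

Because the assignment depends only on the inputs, a user sees the same variant on every push in a given experiment, and the per-bucket click rates can be compared in the daily report.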

II. Toutiao system architecture

III. Toutiao microservice architecture

Toutiao splits its system into subsystems, breaks large applications down into small ones, and abstracts a common layer for code reuse.



The system's layering is fairly typical. The emphasis is on infrastructure: the goal is to use it to improve rapid iteration, disaster recovery, and related work, so that each business team can iterate on its business and adjust its architecture faster.

IV. Toutiao’s virtualization PaaS platform planning

The platform is implemented in three layers and managed uniformly through the PaaS platform. It provides common SaaS services along with a common app execution engine. The lowest level is the IaaS layer.



The IaaS layer manages all the machines and integrates the public cloud. Toutiao pushes some hot events nationwide, which demands a lot of network bandwidth, so the public cloud is used to help. The platform abstracts away what type of computing resources are needed. By combining infrastructure with a service-oriented approach, for example for logging and monitoring, business teams can use the capabilities the infrastructure provides without attending to the details.

V. Summary

The important parts of Toutiao's system are:

  • Data generation and collection
  • Data transfer. Kafka serves as the message bus that connects online and offline systems.
  • Data ingestion. Data warehouse and ETL (Extract, Transform, Load).
  • Data calculation. How efficiently the tables in the data warehouse can be queried matters a great deal, because it directly determines the efficiency of data analysis. Common query engines fall into three patterns, Batch, MPP, and Cube, and Toutiao uses all three.