
Yesterday, a netizen mentioned that he had recently interviewed with several companies. One question came up several times, and each time his answer was not very good.

Interviewer: For example, if there are 100,000 websites to collect from, what approach would you take to get the data quickly?

To answer this question well, you actually need a broad knowledge base and a deep technical reserve.

We have also been recruiting recently, interviewing more than a dozen candidates every week, and only one or two of them feel like a good fit. Most are in a similar situation to this netizen: they lack an overall view of the problem, even the veterans with three or four years of work experience. Their ability to solve specific problems is very strong, but they can rarely step back from a single point to see the whole picture and think about the problem comprehensively from a higher vantage point.

Collecting from 100,000 websites already covers a wider range than the data collection scope of most professional public opinion monitoring companies. To meet the requirement raised by the interviewer, we need to consider every aspect of website collection and data storage, and propose a suitable scheme that saves cost and improves efficiency.

Let’s take a quick walk through the whole pipeline, from collecting the websites all the way down to storing the data.

1. Where do 100,000 websites come from?

Generally speaking, the list of websites to collect is accumulated gradually as the company’s business develops.

Let’s assume for now that this is the requirement of a startup. The company has just been founded, so with this many websites it is essentially a cold start. How, then, do we gather these 100,000 websites? It can be done in the following ways:

1) Accumulation of historical business

Cold start or not, since there is a collection requirement, there must be a project or product behind it, and the people involved must have investigated some data sources and collected some important websites in the early stage. These can serve as the original seeds for our website gathering and collection.

2) Related websites

At the bottom of many websites there are usually links to related sites. Government websites in particular usually link to the official websites of their subordinate departments.
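
As a rough illustration (not the author’s code), the sketch below pulls the external links from such a page and keeps their home-page URLs as seed candidates; the URL in the example is a placeholder.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def related_sites(page_url: str) -> set:
    """Collect external domains linked from page_url as seed home pages."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    own_domain = urlparse(page_url).netloc
    seeds = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(page_url, a["href"])
        parsed = urlparse(href)
        if parsed.netloc and parsed.netloc != own_domain:
            seeds.add(f"{parsed.scheme}://{parsed.netloc}/")
    return seeds

# Placeholder URL for illustration only:
# print(related_sites("https://www.example.gov.cn/"))
```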

3) Website navigation

Some sites, for purposes such as driving traffic, collect websites and display them by category so that people can find them easily. These navigation sites can quickly provide us with a first batch of seed websites, from which we can obtain more sites through other means such as site association.

4) Search engines

We can also prepare some keywords related to the company’s business and search for them on Baidu, Sogou and other search engines. By processing the search results, we can extract the corresponding websites as our seed sites.
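
A minimal sketch of that processing step, assuming the search-result URLs have already been gathered (via an API, a browser export, or manual collection): it simply reduces result URLs to unique home-page seeds.

```python
from urllib.parse import urlparse

def to_seed_sites(result_urls):
    """Reduce search-result URLs to unique home-page seed sites."""
    seeds = set()
    for url in result_urls:
        parsed = urlparse(url)
        if parsed.scheme and parsed.netloc:
            seeds.add(f"{parsed.scheme}://{parsed.netloc}/")
    return sorted(seeds)

print(to_seed_sites([
    "https://news.example.com/2023/01/post.html",
    "https://news.example.com/about",
    "http://bbs.sample.org/thread-1.html",
]))
```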

5) Third-party platforms

For example, some third-party SaaS platforms offer a 7-15 day free trial. We can use this period to collect data related to our business and then extract the websites as our initial collection seeds.

Although this is the fastest and most efficient way to collect websites, the chance of obtaining 100,000 websites within a trial period is still small, so it needs to be combined with the related-website approach and the other methods above to quickly reach the required number.

Through the five approaches above, I believe we can quickly gather the 100,000 websites we need. But how do we manage that many websites? How do we know whether each of them is working normally?

2. How to manage 100,000 websites?

Once we have collected 100,000 websites, the first thing we face is how to manage them, how to configure their collection rules, and how to monitor whether they are working normally.

1) How to manage

A hundred thousand websites, without a dedicated system to manage them, would be a disaster.

At the same time, business needs such as intelligent recommendation require some pre-processing of the websites (such as labeling). For that, a website management system is required.
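
The record such a system keeps might look roughly like the sketch below; the field names are illustrative, not the author’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Website:
    site_id: int
    name: str
    home_url: str
    labels: list = field(default_factory=list)    # e.g. ["government", "finance"]
    status: str = "ok"                            # ok / revised / unreachable
    columns: list = field(default_factory=list)   # configured column entry URLs

sites = [
    Website(1, "Example News", "https://news.example.com/", labels=["news"]),
]
```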

2) How to configure collection rules

The 100,000 websites collected in the early stage are only home pages. If we take only the home page as the collection task, we can collect very little information from it and will miss a great deal.

If instead we crawl the whole site starting from the home page URL, the server resource consumption is large and the cost is too high. Therefore, we need to configure the columns (sections) we care about and collect only those.

But how do we configure columns for 100,000 websites quickly and efficiently? At present, we parse the HTML source automatically and configure the columns semi-automatically.

Of course, we have tried machine learning methods to deal with it, but the results are not so good.

When the number of sites to collect reaches the 100,000 level, you must not rely on a precise-location method such as XPath. Otherwise, by the time you finished configuring all 100,000 sites, it would be far too late.

Instead, data collection must use a generic crawler, with regular expressions matching the list data. When collecting the article pages themselves, attributes such as publish time and body text are extracted by algorithms.
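
A minimal sketch of what regex-based list extraction for one configured column might look like; the pattern is a generic illustration, and in practice each column would carry its own pattern.

```python
import re
from urllib.parse import urljoin

import requests

# Generic pattern: anchor tags whose visible text is long enough to be a headline.
LINK_RE = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>([^<]{6,})</a>', re.I)

def extract_list(column_url: str):
    """Return {url, title} items matched from one column (list) page."""
    html = requests.get(column_url, timeout=10).text
    return [
        {"url": urljoin(column_url, href), "title": title.strip()}
        for href, title in LINK_RE.findall(html)
    ]
```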

3) How to monitor

With 100,000 websites there will be site redesigns, column redesigns, and newly added or removed columns every day. Therefore, we need to analyze the state of each website based on the data being collected.

For example, if a site has produced no new data for several days, something is wrong: either the site has been redesigned, so the extraction regex no longer matches, or something is wrong with the site itself.

To improve collection efficiency, a separate service can check websites and columns at regular intervals: first, whether the website and its columns can still be accessed normally; second, whether the regular expressions for the column information still work. Operations staff can then maintain them in time.
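
A minimal health-check sketch, assuming each column record carries its entry URL and list-matching pattern (the field names and alerting are illustrative).

```python
import re
import time

import requests

def check_column(column: dict):
    """Return (reachable, regex_ok) for one column configuration."""
    try:
        resp = requests.get(column["url"], timeout=10)
        reachable = resp.status_code == 200
        regex_ok = reachable and bool(re.findall(column["pattern"], resp.text))
    except requests.RequestException:
        reachable, regex_ok = False, False
    return reachable, regex_ok

def monitor(columns, interval=3600):
    """Loop over all columns periodically and flag the broken ones."""
    while True:
        for col in columns:
            reachable, regex_ok = check_column(col)
            if not (reachable and regex_ok):
                print(f"ALERT {col['url']}: reachable={reachable}, regex_ok={regex_ok}")
        time.sleep(interval)
```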

3. Task caching

After columns have been configured for 100,000 websites, the number of entry URLs to collect reaches the millions. How can the collectors obtain these entry URLs efficiently?

If these URLs are stored in a relational database, whether MySQL or Oracle, fetching collection tasks will waste a lot of time and greatly reduce collection efficiency.

How do we solve this? An in-memory store such as Redis (MongoDB is another common choice) is preferred, and collection generally uses Redis as the cache. So, while a column is being configured, its information can be synchronized to Redis as a cached queue of collection tasks.
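
A minimal sketch of that cache, using the redis-py client; the key name is illustrative.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

def enqueue_columns(column_urls):
    """Push configured column entry URLs into the task queue when a column is saved."""
    if column_urls:
        r.lpush("crawler:column_tasks", *column_urls)

def next_task():
    """Collectors pop tasks from the other end of the list."""
    return r.rpop("crawler:column_tasks")
```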

4. How to collect the websites?

By analogy: if you want to earn an annual salary of one million, your best bet is to join a first-tier company such as Huawei, Alibaba or Tencent, and you also need to reach a certain level there. That road is not meant to be easy.

Similarly, if you need to collect millions of list URLs, the ordinary single-machine approach is definitely not feasible.

A distributed, multi-process, multi-threaded approach must be used, combined with the in-memory database Redis as a cache, so that collection tasks can be fetched efficiently and the collected information can be deduplicated.

At the same time, the parsing of fields such as publish time and body text must also be handled by algorithms, for example GNE (GeneralNewsExtractor), which is popular right now.

Some attributes that can already be obtained during list collection should not be re-parsed along with the body. Titles are one example: the title taken from the list page is generally far more accurate than the one an algorithm extracts from the article’s HTML source.

For special websites or special requirements, customized development can be used to handle them.
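
Putting the pieces of this section together, here is a minimal single-worker sketch (one of many processes and threads spread across machines), assuming the Redis keys from earlier and the GNE package; a real system would add retries, proxies, and scheduling on top.

```python
import hashlib

import redis
import requests
from gne import GeneralNewsExtractor

r = redis.Redis(decode_responses=True)
extractor = GeneralNewsExtractor()

def seen_before(url: str) -> bool:
    """Deduplicate collected URLs with a Redis set keyed by URL hash."""
    digest = hashlib.md5(url.encode()).hexdigest()
    return r.sadd("crawler:seen_urls", digest) == 0

def collect_article(url: str, list_title: str = ""):
    """Fetch one article URL, skip duplicates, and parse it with GNE."""
    if seen_before(url):
        return None
    html = requests.get(url, timeout=10).text
    result = extractor.extract(html)                      # title / publish_time / content ...
    result["url"] = url
    result["title"] = list_title or result.get("title")   # prefer the list-page title
    return result
```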

5. Unified data storage interface

To keep collection timely, collecting 100,000 websites may require ten or twenty servers or more. With N collectors plus some customized scripts deployed on each server, the total number of collectors reaches the hundreds.

If every collector or custom script implements its own data storage code, development and debugging will waste a lot of time, and subsequent operation and maintenance will be a headache, especially when the business changes and adjustments are needed. Therefore, the data storage interface must be unified.

With a unified storage interface, when the data needs special processing such as cleaning or correction, we do not need to modify the storage code of every collector; we only need to modify the interface and redeploy it.

Fast and convenient.
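
In practice, "unified" can be as simple as every collector and script calling one client function that posts to a single internal endpoint; the endpoint URL below is hypothetical.

```python
import requests

STORE_API = "http://storage.internal/api/v1/articles"   # hypothetical internal endpoint

def store(doc: dict) -> bool:
    """Send one parsed record to the unified storage service."""
    resp = requests.post(STORE_API, json=doc, timeout=10)
    return resp.status_code == 200
```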

6. Data collection and monitoring

Collecting from 100,000 websites yields more than 2 million records per day. No matter how accurate the parsing algorithms are, they will never reach 100% (reaching 90% is already good), so parsing is bound to produce anomalies: for example, a publish time later than the current time, or a body that has picked up text from a "related news" block.

However, since the data storage interface is unified, we can now run unified data quality checks at that interface, and then optimize the collectors and customized scripts based on the anomalies found.

At the same time, the volume of data collected from each website or column can be counted, so that we can judge in time whether a given website or column source is still being collected normally, and thereby ensure that there are always 100,000 valid collection sites.
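
A minimal sketch of the kind of checks the unified interface might run before saving a record; the rules and field names are illustrative.

```python
from datetime import datetime

def quality_issues(doc: dict) -> list:
    """Return a list of problems found in one parsed record."""
    issues = []
    publish_time = doc.get("publish_time")
    if publish_time:
        try:
            # Assumes ISO-formatted timestamps, e.g. "2023-05-01 12:00:00".
            if datetime.fromisoformat(publish_time) > datetime.now():
                issues.append("publish_time is in the future")
        except ValueError:
            issues.append("publish_time is not parseable")
    content = doc.get("content", "")
    if len(content) < 50:
        issues.append("body text is suspiciously short")
    return issues
```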

7. Data storage

Because a large amount of data is collected every day, ordinary databases such as MySQL and Oracle cannot cope well, and even NoSQL databases such as MongoDB are no longer suitable. At this point, distributed indexes such as Elasticsearch and Solr are currently the best choice.

As for whether to use big data platforms such as Hadoop and HBase, it depends on the specific situation. In the case of a small budget, a distributed index cluster can be built first, and a big data platform can be considered later.

To keep queries fast, try not to store the body text in the distributed index. Fields such as title, publish time and URL can be stored, which avoids secondary queries when displaying list data.

Before a big data platform is available, the body text can be saved to TXT or other files on the file system following a fixed data format. Once the big data platform comes online, the data can be saved to HBase.
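
A minimal sketch of that split, assuming a local Elasticsearch node and the elasticsearch-py client (8.x-style keyword arguments); the index name, fields and file path are illustrative.

```python
import json

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def save(doc: dict):
    """Index lightweight fields in Elasticsearch; append the body to a text file."""
    es.index(index="articles", document={
        "title": doc["title"],
        "publish_time": doc.get("publish_time"),
        "url": doc["url"],
    })
    # Body text stays on the file system until a big data platform (HBase) is available.
    with open("bodies.txt", "a", encoding="utf-8") as f:
        f.write(json.dumps({"url": doc["url"], "content": doc.get("content")},
                           ensure_ascii=False) + "\n")
```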

8. Automated operation and maintenance

Due to the large number of servers, collectors, and customized scripts, manual deployment, startup, update, and running status monitoring are cumbersome and prone to human error.

Therefore, an automated operations system is needed that can deploy, start, stop and monitor the collectors and scripts, so that changes can be responded to quickly.
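
As one possible flavor of such automation (Fabric is just one of many tools; the host names, paths and commands below are placeholders), a deployment run might look roughly like this.

```python
from fabric import Connection

HOSTS = ["crawler-01", "crawler-02"]          # hypothetical server names

def deploy(package: str = "collector.tar.gz"):
    """Copy a collector package to each host, unpack it, and restart the worker."""
    for host in HOSTS:
        with Connection(host) as conn:
            conn.put(package, f"/opt/crawler/{package}")
            conn.run(f"tar -xzf /opt/crawler/{package} -C /opt/crawler")
            conn.run("systemctl restart crawler-worker", warn=True)
```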

“Let’s say you have 100,000 websites to collect from. How do you get that data quickly?” If you can answer along these lines, getting a good offer should not be a problem.

Finally, I wish all the friends who are looking for a job a satisfactory offer and a good platform.
