Feed in the text of a web page (no XPath required), and the algorithm automatically outputs structured fields: title, publication time, body, author, source, and more.

Yes, the title is a bit of click-bait for traffic. But the algorithm really does work across multiple sources and sites, it has been deployed in a production environment, and the results are decent.

Experience it first

Go to Experience Address > Experience page. The experience page is fairly simple and is divided into three areas: experience notes, a parameter input area, and a parsing result display area.

Before you start, take a look at the instructions below.

① Open a news web page, for example "Yongfu: Forestry science and technology commissioner helps jatropha planting".

② Right-click on a blank part of the page and select the option to view the page source code.

The browser then opens a new window showing the raw HTML of the page.

③ Select all of the text and copy it, then find an online Base64 encoding tool.

④ Paste the copied page source into box 1 and click the encode button; the corresponding Base64 string appears in box 2. Click the copy button to copy it to the clipboard.

⑤ Go back to the experience page, paste the Base64 content into the web box in the parameter input area, and fill in the URL of the article.

⑥ Click the start-parsing button and wait a moment; the experience page will pop up a message about the outcome. You can then scroll down to the parsing result area to see the results.
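If you prefer to script this flow instead of copy-pasting by hand, here is a minimal sketch. The endpoint URL and the field names `web` and `url` are assumptions made for illustration; the real back-end API of the experience page may use different names.

```python
import base64
import requests

# Hypothetical endpoint; the real experience page's back-end API may differ.
PARSE_API = "https://example.com/api/parse"

def parse_article(article_url: str) -> dict:
    """Fetch a news page, Base64-encode its HTML, and submit it for parsing."""
    html_text = requests.get(article_url, timeout=10).text
    web_b64 = base64.b64encode(html_text.encode("utf-8")).decode("ascii")
    # The field names "web" and "url" mirror the input boxes on the experience
    # page, but they are assumptions, not the documented API.
    resp = requests.post(PARSE_API, json={"web": web_b64, "url": article_url}, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(parse_article("https://news.example.com/some-article.html"))
```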

The parsing result display area consists of interface information, parsing time statistics, and parsing results.

The interface information is mainly the information returned by the back-end interface.

The parsing time statistics record how long each stage takes, in milliseconds.

The parsing results show the output of the algorithm: article title, source, publication time, author, body, the HTML tag that holds the body, that tag's class attribute, and so on.

There are also the article category, article tags, and article summary computed from the body text. Named entity recognition and sentiment analysis are still in training, so they are not yet available on the experience page.
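To give a feel for the shape of the output, here is a hypothetical result; every field name and value below is invented for illustration and does not reflect the actual interface.

```python
# Hypothetical parse result; field names and values are made up for illustration only.
example_result = {
    "title": "Yongfu: Forestry science and technology commissioner helps jatropha planting",
    "publish_time": "...",
    "author": "...",
    "source": "...",
    "body": "Full extracted article text ...",
    "body_tag": "div",
    "body_class": "article-content",
    "category": "...",
    "tags": ["..."],
    "summary": "Short abstract computed from the body ...",
}
```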

I recommend finding a few other news pages, copying and pasting them into the experience page the same way, and seeing how well the algorithm does.

What’s the use of this algorithm

We have actually seen this kind of algorithm in consumer tools before: the reading mode that the 360 browser launched years ago is roughly this kind of algorithm. Reading mode strips out ads, sidebars, and footer columns so you can focus on the article or the novel.

At the research and development level, it also plays a large role. Let’s look at some business scenarios:

① Suppose a public-opinion monitoring company collects news articles, extracts the content for labeling and training, and finally packages it into a public-opinion product (such as Baidu public opinion or Sina public opinion).

② Or suppose a bidding company collects tender announcements and extracts structured fields from them (tender title, contract amount, bidder information, agent information, bidding requirements, and so on) to build a bidding product (such as Qianlima bidding).

Whether it is news sites or bidding-information sites, the number of sites involved is huge, usually in the tens of thousands. The common practice is to hire a group of crawler engineers plus a group of people (usually hard-luck interns) to write XPath rules, fill in the XPaths for each of the tens of thousands of sites, and have the crawlers look up the corresponding XPath to parse each page as it is collected.

Dozens or even a hundred sites are manageable, but filling in rules for tens of thousands of sites takes several months. On top of that, some sites change their page layout, which breaks the parsing, so the XPath rules need updating every day. Just think about the workload…
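To make the maintenance burden concrete, here is a rough sketch of the traditional per-site rule table; the site names and XPath expressions are made up.

```python
from lxml import html

# Made-up per-site rule table; a real system would carry tens of thousands of
# entries, each of which breaks whenever its site changes layout.
XPATH_RULES = {
    "news.site-a.com": {"title": "//h1/text()", "body": "//div[@id='content']//p//text()"},
    "www.site-b.com": {"title": "//title/text()", "body": "//article//p//text()"},
}

def parse_with_rules(domain: str, page_html: str) -> dict:
    """The traditional approach: look up hand-written XPath rules for the site."""
    rules = XPATH_RULES[domain]  # KeyError the moment an unknown site shows up
    tree = html.fromstring(page_html)
    return {
        field: " ".join(t.strip() for t in tree.xpath(xpath))
        for field, xpath in rules.items()
    }
```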

But with an algorithm like this, you do not need to fill in a single XPath.

Your team or company can collect a lot of data in a short period of time.

Is that a good algorithm

Regardless of whether it’s awesome or not, let’s take a look at where these algorithms or products are currently available.

1. As mentioned earlier, the 360 browser (and now other browsers) has such a product.

2. Microsoft seems to have similar capabilities and has opened up APIs.

3. Readability, a Python library.

4. GNE, a domestic open-source Python library.

5. Some domestic master's theses (which can be found in Baidu Library).

6. Other deep-learning-based libraries whose names I cannot remember; I recall one was written by Cui Qingcai, an engineer at Microsoft.

7. A foreign website whose name I have forgotten; it charges, and it is very, very expensive.

8. A Java project from abroad with "News" in its name; I have forgotten the exact name.

The algorithm you are experiencing right now was inspired by GNE. I read through the early GNE source code, had many exchanges with its original author, and learned a lot in the process. Later, my book "Python3 Web Crawler Bible" devoted a chapter to explaining the GNE algorithm's principles and source code. Thanks again to GNE's author, Qingnan.

I have tried the browser reading modes, read the Readability source code, and read through the relevant domestic papers I could find. I have not yet tested the deep-learning-based libraries or the paid interfaces.

An automatic parsing algorithm of this kind stands or falls on a few points: efficiency, extraction ability, and accuracy. Here are my comments on some of the algorithms I have worked with:

1. Readability scores nodes by the weight of their HTML tags: for example, p tags weigh more than div tags, and h tags weigh more than span tags. On a well-structured news site the results are OK, but in general they can be quite ridiculous.
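As a rough illustration of tag-weight scoring (a toy sketch, not Readability's actual code), you might score candidate containers like this:

```python
from lxml import html

# Toy tag weights; Readability's real heuristics are more elaborate.
TAG_WEIGHTS = {"p": 5, "h1": 3, "h2": 3, "h3": 2, "div": 1, "span": 0, "a": -1}

def score_node(node) -> int:
    """Sum the weights of all tags contained in a candidate node."""
    return sum(TAG_WEIGHTS.get(child.tag, 0) for child in node.iter())

def pick_body(page_html: str):
    """Return the candidate container with the highest tag-weight score."""
    tree = html.fromstring(page_html)
    candidates = tree.xpath("//div | //article | //section")
    return max(candidates, key=score_node, default=tree)
```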

2. Early GNE. Early GNE was based on text and punctuation density, and over 90% of web pages were parsed with no problem. In practice, though, several issues showed up: the body would sometimes be truncated, pages with little text would be misidentified, the extracted publication time could differ from what the page displays, and so on. The text-extraction problems all come from the density approach; the time problem is a separate matter, because time is extracted by priority rules and follows a different logic.
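A minimal sketch of a punctuation-density scorer, just to show the idea (this is not GNE's implementation):

```python
import re

PUNCT = set("，。！？；：、,.!?;:")

def density_score(block_text: str) -> float:
    """Toy score: longer blocks with more punctuation look more like body text."""
    text = re.sub(r"\s+", "", block_text)
    if not text:
        return 0.0
    punct = sum(ch in PUNCT for ch in text)
    return len(text) * (1 + punct / len(text))

# Pick the densest block out of a page's candidate text blocks.
blocks = [
    "Home | News | Sports | Login",
    "First paragraph of the article. Second paragraph, with punctuation, commas, and periods.",
    "Copyright 2021, all rights reserved",
]
print(max(blocks, key=density_score))
```

You can see how a page whose body is split across several blocks, or whose boilerplate is unusually long, would trip this kind of scoring up, which matches the truncation and misidentification problems described above.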

3. Domestic papers. Since I cannot read the foreign-language ones, I could only search for domestic papers. They are generally based on text density, punctuation density, position, distance, and similar features, and the results are not very good. Now you might ask: why do the results look so good in the papers themselves?

Because the test samples were chosen well!

4. Modern GNE. Modern GNE is built on human-vision-style rules plus news-page feature rules. The general logic is that the main content of a web page usually sits in the middle of the page, which lets you eliminate the noise on the left, right, top, and bottom. The noise in the middle can then be judged by block size, and in the end the extracted text is very accurate.

Without giving too much away, here is an example. The blue block in the image above is where an image is placed. From modern GNE's point of view, the width of the blue block differs from the width of the text below it, so it would be treated as noise and excluded.
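Here is a toy version of that geometric idea, keeping only blocks whose rendered width matches the dominant content width; it illustrates the rule described above, not modern GNE's actual code.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    width: float  # rendered width of the block in pixels

def filter_by_width(blocks: list[Block], tol: float = 0.1) -> list[Block]:
    """Keep blocks whose width matches the most common width; drop the rest as noise.

    As the example above shows, this can also drop legitimate content (such as
    images) whose width happens to differ from the body text.
    """
    widths = [round(b.width) for b in blocks]
    dominant = max(set(widths), key=widths.count)
    return [b for b in blocks if abs(b.width - dominant) <= tol * dominant]
```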

I will not comment on deep-learning algorithms that require large amounts of training data, because I have not tried them myself. However, I am fairly certain that deep learning based solely on classification and regression cannot achieve good results. I wonder whether anyone has trained a better BERT model by now.

Summary of the comparison: among the examples above, modern GNE is the best at extracting the body, but as I recall it requires browser rendering, and I have not found a good solution to the resulting efficiency problem.

As for the algorithm in this article, you can try it for yourself; after all, hands-on experience tells you whether it is good or bad. I think it ranks well on efficiency, accuracy, and extraction ability.

What is the logic of the algorithm

Sorry, I’m not going to talk about that right now. Next.

Which algorithms are referenced

As mentioned above, I have read the Readability and early-GNE source code, and read most of the relevant domestic papers.

I started by making improvements and modifications based on GNE.

After reading a lot of material on deep learning, I decided not to go down that route, because my results showed it did not achieve the effect I wanted.

Then one day, while watching an episode of "Nine Songs", it just came to me. After a quick coding test I found the idea was feasible, so I jumped in. That took 20 years…

Wrong.

It was 200 days.

Which areas can be extended horizontally

Right now it is mainly used to parse news data, and it can be extended to bidding pages, e-commerce pages, pharmaceutical pages, and so on.

From a deep-learning perspective, each of those domains might need different training, different samples, and different models. But from the perspective of my algorithm's principle, they are all the same: with appropriate modifications it becomes a parsing algorithm for another domain.

A digression

When I interviewed at a listed company called WHC, I told the interviewer about distributed crawlers and this automatic parsing algorithm.

The interviewer did not believe me, so I drew diagrams for him to explain.

I said that a message queue is better than Redis for data transfer, and went on to list message acknowledgement mechanisms, buffer layers, multiple subscriptions, and the relevant scenarios. In the end he still insisted that Redis was the best.

He could not beat me on the technical points, but I still did not make it to the second round. Can you believe it?