When solving natural language processing problems, you sometimes need to obtain large collections of text. The Internet is the largest source of text, but extracting text from arbitrary HTML pages can be an arduous and painful task.

Suppose we need to extract the full text from various web pages and strip out all the HTML tags. Typically, the default solution is to use the get_text method from the BeautifulSoup package, which uses lxml internally. This is a well-tested solution, but it can be very slow when dealing with thousands of HTML documents.

By replacing BeautifulSoup with Selectolax, you can get a 5-30x speedup for almost nothing!
Here is a simple benchmark that parses 10,000 HTML pages from Common Crawl (https://commoncrawl.org/):

# -*- coding: utf-8 -*-

from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)

        if not doc or not url:
            continue

        n_documents += 1

        if i > limit:
            break

    warc_file.close()

    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

>>> !wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents

Obviously, this is not the best way to benchmark something, but it gives you an idea that Selectolax can sometimes be almost 30 times faster than BeautifulSoup with lxml.

Selectolax is best suited for stripping HTML down to plain text. Suppose I have more than 10,000 HTML fragments that I need to index into Elasticsearch as plain text. Elasticsearch does have an html_strip character filter, but it's not something I want or need to use in this context.
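
For reference, here is roughly how that character filter would be wired into an index. This is a minimal sketch, assuming the official elasticsearch Python client; the index name ('fragments') and analyzer name ('plain_text') are made up for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index='fragments',
    body={
        'settings': {
            'analysis': {
                'analyzer': {
                    'plain_text': {
                        'type': 'custom',
                        # html_strip removes tags and decodes entities at index time
                        'char_filter': ['html_strip'],
                        'tokenizer': 'standard',
                    }
                }
            }
        }
    },
)

But I want the stripping to happen in Python, before indexing, and at this scale it has to be efficient. So, what's the most effective way?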

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

Regular expression

import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub(' ', html)

Results

I wrote a script that iterates through 10,000 files containing HTML snippets and measures how long each approach takes. Note that these snippets are not complete <html> documents (with <head>, <body>, and so on), but only small pieces of HTML. The average size is 10,314 bytes (the median is 5,138 bytes).
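
The harness itself is nothing fancy. Here is a minimal sketch of it, assuming the snippets are stored as individual files in a snippets/ directory (that directory name is made up for illustration) and that each extractor above takes an HTML string:

import statistics
from pathlib import Path
from time import perf_counter


def benchmark(extract, files):
    # Time the extractor on every snippet and report aggregate stats.
    timings_ms = []
    for path in files:
        html = path.read_text(encoding='utf-8', errors='ignore')
        t0 = perf_counter()
        extract(html)
        timings_ms.append((perf_counter() - t0) * 1000)
    print('  SUM:    %.2f seconds' % (sum(timings_ms) / 1000))
    print('  MEAN:   %.4f ms' % statistics.mean(timings_ms))
    print('  MEDIAN: %.4f ms' % statistics.median(timings_ms))


files = sorted(Path('snippets').glob('*.html'))

The results are as follows: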

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

I have run it many times and the results are very stable. The bottom line: Selectolax is roughly 6-7 times faster than PyQuery.

Regular expressions win? Really?

For the most basic HTML blobs, a regex probably works fine. But consider HTML entities: if the HTML is <p>Foo&amp;Bar</p>, I want the plain text conversion to be Foo&Bar, not Foo&amp;Bar.
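
A quick sanity check illustrates the entity problem; this just runs a naive tag-stripping regex like the one above next to a real parser:

import re

from selectolax.parser import HTMLParser

html = '<p>Foo&amp;Bar</p>'

clean_regex = re.compile(r'<.*?>')
print(clean_regex.sub('', html))  # Foo&amp;Bar - the entity survives
print(HTMLParser(html).text())    # Foo&Bar - the parser decodes it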

More importantly, PyQuery and Selectolax support something very specific but important to my use case: before continuing, I need to remove certain tags (and their contents). For example:

<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

Regular expressions can never do this.
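
To see why, consider what the naive tag-stripping regex does with the example above: it removes the tags themselves but keeps everything between them, so the hidden text leaks through:

import re

html = ('<h4 class="warning">This should get stripped.</h4>\n'
        '<p>Please keep.</p>\n'
        '<div style="display: none">This should also get stripped.</div>')

clean_regex = re.compile(r'<.*?>')
print(clean_regex.sub('', html))
# This should get stripped.
# Please keep.
# This should also get stripped.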

Version 2.0

So, my requirements may vary, but basically, I want to remove certain tags. For example: divs with class="warning" or class="hidden", and any div whose inline style contains display: none.

PyQuery

import re

from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

import re

from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

It actually works. When I now run the same benchmark on the 10,000 snippets, the new results are as follows:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, Selectolax is about 6 times faster than PyQuery.

Conclusion

Regular expressions are fast but weak. Selectolax’s efficiency is impressive.
