• Analysing 1.4 billion rows with Python
  • By Steve Stagg
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Ryden Sun
  • Proofreaders: luochen1992, Allen

Analysing 1.4 billion rows with Python

Using PyTubes, numpy, and matplotlib

The Google Ngram Viewer is an interesting and useful tool that uses Google’s vast trove of data scanned from books to map word usage over time. For example, the word Python (case sensitive):

The image, from books.google.com/ngrams/grap…, shows the usage of the word ‘Python’ over time.

It’s driven by Google’s n-gram dataset, which records how often a particular word or phrase was used in Google Books, for each year of publication. It isn’t complete (it doesn’t include every book ever published!), but the dataset covers millions of books, spanning the 16th century through 2008, and it can be downloaded for free.

I decided to use Python and my new data loading library PyTubes to see how easy it would be to recreate the diagram above.

Challenge

The 1-gram dataset expands to 27 GB of data on disk, which is a lot of data to read into Python. Python can easily handle gigabytes of data at a time, but once that data is pulled apart and processed, things get much slower and less memory efficient.

In total there are 1.4 billion rows (1,430,727,243) spread across 38 source files, covering 24 million (24,359,460) distinct words (and POS tags, see below), counted over the years 1505 to 2008.

When you’re processing a billion rows of data, things slow down quickly, and native Python is not optimised for this kind of work. Fortunately, numpy is really good at handling large quantities of data; with a few simple tricks, we can use numpy to make this analysis feasible.

Handling strings in Python/numpy is complicated. The memory overhead of strings in Python is significant, and numpy can only handle strings of known, fixed length. In this situation most words have different lengths, so that isn’t ideal.
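As a quick illustration of that fixed-length limitation (a minimal sketch, not part of the original analysis):

import numpy as np

# numpy string arrays use a single fixed-width dtype sized to the longest element,
# so every entry pays for the longest word's storage
words = np.array(["Python", "Pascal", "Perl"])
print(words.dtype)     # <U6: each element reserves space for 6 unicode characters
print(words.itemsize)  # 24 bytes per element (4 bytes per character)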

Loading the data

All the code/examples below were run on a 2016 MacBook Pro with 8 GB of RAM. Hardware or cloud instances with more RAM will perform better.

The 1-gram data is stored as tab-separated files and looks like this:

Python 1587 4 2
Python 1621 1 1
Python 1651 2 2
Python 1659 1 1

Each row contains the following fields:

1. Word
2. Year of Publication
3. Total number of times the word was seen
4. Total number of books containing the word

To generate the chart we want, we only need to know three things:

1. Is this the word we are interested in?
2. Year of publication
3. Total number of times the word was used

By extracting just this information, the extra cost of handling variable-length string data is largely avoided, but we still need to compare string values to work out which rows contain the field we are interested in. This is what PyTubes does:

import glob
from os import path

import tubes

FILES = glob.glob(path.expanduser("~/src/data/ngrams/1gram/googlebooks*"))
WORD = "Python"

# Set up the data load pipeline
one_grams_tube = (tubes.Each(FILES)
    .read_files()
    .split()
    .tsv(headers=False)
    .multi(lambda row: (
        row.get(0).equals(WORD.encode('utf-8')),
        row.get(1).to(int),
        row.get(2).to(int)
    ))
)

# Load the data into a numpy array. Giving an approximate estimate of the
# number of rows lets pytubes optimise its allocation pattern.
# fields=True is redundant here, but ensures the returned ndarray
# uses fields rather than a single multi-dimensional array
one_grams = one_grams_tube.ndarray(estimated_rows=500_000_000, fields=True)

After about 170 seconds (3 minutes), one_grams is a numpy array containing roughly 1.4 billion rows of data, looking something like this (table header added for illustration):

╒═══════════╤════════╤═════════╕
│   Is_Word │   Year │   Count │
╞═══════════╪════════╪═════════╡
│         0 │   1799 │       2 │
├───────────┼────────┼─────────┤
│         0 │   1804 │       1 │
├───────────┼────────┼─────────┤
│         0 │   1805 │       1 │
├───────────┼────────┼─────────┤
│         0 │   1811 │       1 │
├───────────┼────────┼─────────┤
│         0 │   1820 │     ... │
╘═══════════╧════════╧═════════╛

From here on, it’s just a matter of using numpy methods to compute things:

The total number of words used per year

Google shows the percentage of occurrences of each word (the number of times a word appears in a year divided by the total number of words that year), which is more useful than the raw counts. To calculate that percentage, we need to know the total number of words per year.

Fortunately, Numpy makes this fairly easy:


import numpy as np

last_year = 2008
YEAR_COL = '1'
COUNT_COL = '2'

# Sum the per-row counts into one bin per year (0 .. last_year)
year_totals, bins = np.histogram(
    one_grams[YEAR_COL], 
    density=False, 
    range=(0, last_year+1),
    bins=last_year + 1, 
    weights=one_grams[COUNT_COL]
)

Plotting this shows how many words Google has collected for each year:
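(The plotting code isn’t shown in the article; a minimal matplotlib sketch along these lines, reusing last_year and year_totals from the snippet above, would produce the chart. The starting year on the axis is my own choice.)

import matplotlib.pyplot as plt

# year_totals[year] holds the total word count for that year,
# so the array index doubles as the x-axis value
years = range(1500, last_year + 1)
plt.plot(years, year_totals[1500:last_year + 1])
plt.xlabel("Year")
plt.ylabel("Total words recorded")
plt.show()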

It is clear that before 1800 the total volume of data drops off quickly, which distorts the final results and hides the patterns we are interested in. To avoid this problem, we only import data from 1800 onwards:

one_grams_tube = (tubes.Each(FILES)
    .read_files()
    .split()
    .tsv(headers=False)
    .skip_unless(lambda row: row.get(1).to(int).gt(1799))
    .multi(lambda row: (
        row.get(0).equals(WORD.encode('utf-8')),
        row.get(1).to(int),
        row.get(2).to(int)
    ))
)

This returned 1.3 billion rows of data (only 3.7% before 1800).

Percentage of Python per year

Getting python’s percentage per year is now particularly easy.

A simple trick: we build arrays indexed by year, each with 2009 elements (one per year up to 2008), so the index for each year equals the year itself. Looking up 1995, for example, is just a matter of getting element 1995.

None of this even really needs numpy:

IS_WORD_COL = '0'   # fields are named by column index, like YEAR_COL and COUNT_COL above

# Select just the rows where the Is_Word column is true
word_rows = one_grams[IS_WORD_COL]
# Create an empty array to hold the per-year percentage values
word_counts = np.zeros(last_year + 1)
# Iterate over the matching rows (there should only be a few thousand for any one word)
for _, year, count in one_grams[word_rows]:
    # Add this row's contribution to the percentage for its year
    word_counts[year] += (100 * count) / year_totals[year]

Plotting word_counts gives:

The shape looks similar to Google’s version

The actual percentages don’t match, I think because the downloaded dataset tags some words differently (e.g. Python_VERB). This isn’t explained very well on the Google page and raises a few questions:

  • How do people use Python as a verb?
  • Does the total count for ‘Python’ include ‘Python_VERB’? And so on.

Fortunately, it’s clear that the method used here produces a graph that looks a lot like Google’s, and the relative trends are unaffected, so I’m not going to try to fix this issue.

Performance

Google generates its graph in about 1 second, compared with roughly 8 minutes for this script, but that is understandable: Google’s word counts come from an apparently pre-prepared view of the dataset.

For example, pre-computing the total word counts per year and storing them in a separate lookup table would save significant time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
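A rough sketch of the first idea, caching the per-year totals so they are only computed once (the cache file name and the np.save/np.load approach are my own illustration, not part of the original script; it reuses one_grams, YEAR_COL, COUNT_COL and last_year from above):

import os
import numpy as np

TOTALS_CACHE = "year_totals.npy"  # hypothetical cache file

if os.path.exists(TOTALS_CACHE):
    # Reuse the expensive per-year totals instead of re-scanning 1.4 billion rows
    year_totals = np.load(TOTALS_CACHE)
else:
    year_totals, _ = np.histogram(
        one_grams[YEAR_COL],
        density=False,
        range=(0, last_year + 1),
        bins=last_year + 1,
        weights=one_grams[COUNT_COL],
    )
    np.save(TOTALS_CACHE, year_totals)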

What this exploration really shows is that, using numpy, the still-young PyTubes library, standard commodity hardware, and Python, it is possible to load, process, and extract arbitrary statistics from a billion-row dataset in a reasonable amount of time.

Language wars

To demonstrate this concept with a slightly more complex example, I decided to compare three related programming languages: Python, Pascal, and Perl.

The source data is noisy (it contains every English word ever used, not just mentions of programming languages; python, for example, also has a common non-technical meaning!). To adjust for this, two things were done:

  1. Only the capitalised form of each name is matched (Python, not python)
  2. Each language’s total number of mentions has been normalised against its average percentage of mentions from 1800 to 1960, which should give a reasonable baseline given that Pascal was first mentioned in 1970 (a sketch of this step follows below).
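A minimal sketch of that normalisation step (assuming per-language word_counts arrays built the same way as the Python one above; the variable names here are hypothetical, not from the original code):

import numpy as np

def normalise(counts, start=1800, end=1960):
    # Divide each year's percentage by the language's average percentage over
    # the baseline window, so the three curves share a comparable scale
    baseline = counts[start:end + 1].mean()
    return counts / baseline

python_norm = normalise(python_counts)
pascal_norm = normalise(pascal_counts)
perl_norm = normalise(perl_counts)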

Results:

Comparison to Google (without any baseline adjustment):

Running time: just over 10 minutes

Code: gist.github.com/stestagg/91…

Future improvements to PyTubes

At this stage, PyTubes only has a single integer concept, which is 64-bit. This means the numpy arrays it generates use an i8 dtype for every integer field. In some places (like the ngrams data), 64-bit integers are overkill and waste memory (the full ndarray is 38 GB; smaller dtypes could easily reduce that by 60%). I plan to add 1-, 2- and 4-byte integer support (github.com/stestagg/py…)
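To put rough numbers on that, here is an illustrative back-of-the-envelope comparison of dtype sizes (the three-field layouts below are my own example, not PyTubes output):

import numpy as np

ROWS = 1_430_727_243

# Current behaviour: every field is a 64-bit integer
current = np.dtype([("is_word", "i8"), ("year", "i8"), ("count", "i8")])

# A tighter layout: 1-byte flag, 2-byte year, 4-byte count
compact = np.dtype([("is_word", "i1"), ("year", "i2"), ("count", "i4")])

print(current.itemsize * ROWS / 1e9)  # ~34 GB for the raw fields alone
print(compact.itemsize * ROWS / 1e9)  # ~10 GB, i.e. well over a 60% saving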

More filtering logic – tube.skip_unless() is a simple way to filter rows, but it lacks the ability to combine conditions (AND/OR/NOT). In some use cases this would make it possible to cut down the volume of loaded data even faster.

Better string matching – simple tests such as startswith, endswith, contains, and is_one_of could easily be added, significantly improving the usefulness of loading string data.

As always, patches are welcome!


The Nuggets Translation Project is a community that translates high-quality technical articles from around the internet and publishes them on Juejin (Nuggets). The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and other fields. For more high-quality translations, please follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.