• Using Node.js to Read Really, Really Large Files (Pt 1)
  • Originally written by Paige Niedringhaus
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: lucasleliane
  • Proofread by: Sunui, Jane Liao

Reading Large Files with Node.js (Part 1)

This post has an interesting origin. Last week, someone in my Slack channel posted a coding challenge that he had received while applying for a development job at an insurance technology company.

What piqued my interest was that the challenge involved reading very large files of Federal Election Commission data and displaying specific pieces of data from those files. Since I hadn't done much work with raw data and I'm always open to a new challenge, I decided to tackle the problem with Node.js to see whether I could complete the challenge and have some fun doing it.

Here are the four questions posed, along with a link to the data the program needs to parse.

  • Implement a program that prints out the total number of lines in the file.
  • Notice that the eighth column contains a person's name. Write a program that loads this data and creates an array with all of the name strings. Print out the 432nd and 43,243rd names.
  • Notice that the fifth column contains a date. Count how many donations occurred in each month and print out the results.
  • Notice that the eighth column contains a person's name. Create an array holding just each first name. Identify the most common first name in the data and how many times it occurs.

Links to data: www.fec.gov/files/bulk-…

When you're done unzipping the folder, you'll see a main .txt file that is 2.55 GB in size, as well as a folder containing a portion of the main file's data (which I used to test my solutions before running them against the main file).

Not terribly scary, right? It seems feasible, so let's see how I went about it.

I came up with two native Node.js solutions

Handling large files is nothing new to JavaScript; in fact, Node.js's core offers a number of standard solutions for reading and writing files.

The most straightforward of these is fs.readFile(), which reads the entire file into memory and then acts on it once Node has finished reading it. The second option is fs.createReadStream(), which streams the data in (and out), similar to the way languages like Python and Java handle it.
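Before getting into my actual code, here's a rough sketch (not the challenge code) of what the two approaches look like in their simplest form; input.txt is just a placeholder path:

    const fs = require('fs');
    const readline = require('readline');

    // Option 1: fs.readFile() reads the whole file into memory before the callback runs.
    fs.readFile('input.txt', 'utf8', (err, data) => {
      if (err) throw err;
      console.log(`Read ${data.length} characters in one go`);
    });

    // Option 2: fs.createReadStream() plus readline processes the file as a stream, line by line.
    const rl = readline.createInterface({
      input: fs.createReadStream('input.txt'),
    });

    rl.on('line', (line) => {
      // Each line arrives here without the whole file ever being held in memory.
    });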

Which solution I chose, and why

Since my solution involved counting the total number of lines and parsing each line to get the donor name and date, I chose the second method: fs.createReadStream(). Then, while iterating through the file, I could use the rl.on('line', ...) function to pull the necessary data out of each line.

To me, this was much easier than reading the entire file into memory and then going through it line by line afterwards.

Node.js createReadStream() and readFile() code implementation

Here is the code I implemented using Node.js's fs.createReadStream() function. I'll break it down below.

The first thing I had to do was import the required functions from Node.js: fs (file system), readline, and stream. With those imported, I could create an input stream and an output stream and call readline.createInterface(), which let me read the stream line by line and print data from it as I went.

I also added variables (and comments) to hold various bits of data: a lineCount, a names array, a donations array and object, and a firstNames array and a dupeNames object. You'll see these in action shortly.

In the rl.on('line', ...) function, I could analyze the data line by line. There, I incremented lineCount for every line of the data stream. I used JavaScript's split() method to parse out each name and added it to the names array. I further reduced each name down to just the first name and, with the help of JavaScript's trim(), includes(), and split() methods, accounted for middle initials while counting how many times each first name occurred. I then split the year and month out of the date column, reformatted it into a more readable YYYY-MM format, and tallied it in dateDonationCount.
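The original post showed this code as an image, so here is a minimal sketch of the setup, the variables, and the rl.on('line', ...) handler described above. The file name itcont.txt, the pipe ('|') delimiter, the column positions, and the date layout are my assumptions about the FEC data, not a copy of the original code:

    const fs = require('fs');
    const stream = require('stream');
    const readline = require('readline');

    const instream = fs.createReadStream('itcont.txt'); // assumed path to the main data file
    const outstream = new stream();                      // output stream (not strictly needed here)
    const rl = readline.createInterface(instream, outstream);

    // Variables to hold the data collected while streaming through the file
    let lineCount = 0;
    const names = [];             // every donor name (8th column)
    const firstNames = [];        // just the first names
    const dupeNames = {};         // first name -> number of occurrences
    const dateDonationCount = {}; // 'YYYY-MM' -> number of donations that month

    rl.on('line', (line) => {
      lineCount++;

      const columns = line.split('|');      // assumption: the file is pipe-delimited
      const name = columns[7];              // 8th column, e.g. "SMITH, JOHN A"
      names.push(name);

      // Reduce "LAST, FIRST MIDDLE" down to just the first name (layout assumed)
      if (name && name.includes(',')) {
        const firstName = name.split(',')[1].trim().split(' ')[0];
        firstNames.push(firstName);
        dupeNames[firstName] = (dupeNames[firstName] || 0) + 1;
      }

      // 5th column: the donation date; assumed to begin with the year and month
      const rawDate = (columns[4] || '').trim();
      const yearMonth = `${rawDate.slice(0, 4)}-${rawDate.slice(4, 6)}`;
      dateDonationCount[yearMonth] = (dateDonationCount[yearMonth] || 0) + 1;
    });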

In the rl.on('close', ...) function, I performed the final transformations on the data collected in those arrays and objects and printed all of the results to the user with the help of console.log.

No further work is needed to find lineCount or the names at the 432nd and 43,243rd positions. Finding the most common name and the number of donations per month is trickier.
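Continuing the sketch above, the close handler for those two easy answers might look like this (array indexes are zero-based, hence 431 and 43242):

    rl.on('close', () => {
      console.log(`Total line count: ${lineCount}`);
      console.log(`The 432nd name is: ${names[431]}`);
      console.log(`The 43,243rd name is: ${names[43242]}`);
    });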

For the most common name, I first needed a key-value object that stored each first name (as the key) and the number of times it occurred (as the value), and then I used the ES6 function Object.entries() to convert it into an array. From there, sorting the array and printing out the maximum is easy.
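As a standalone illustration of that trick (with a made-up dupeNames object), Object.entries() plus a sort is all it takes:

    // Example counts of the kind collected in dupeNames while streaming the file
    const dupeNames = { JOHN: 92, MARY: 77, ROBERT: 120 };

    // Object.entries() turns the object into [name, count] pairs, which are easy to sort
    const sorted = Object.entries(dupeNames).sort((a, b) => b[1] - a[1]);
    const [mostCommonName, occurrences] = sorted[0];

    console.log(`The most common first name is ${mostCommonName}, which occurs ${occurrences} times.`);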

Getting the number of donations per month requires a similar key-value object. I created a logDateElements() function that uses ES6 string interpolation to display the key and value for each month's donation count. Then I created a new Map() from the donation-count object, which gives me key/value entries to iterate over, and called logDateElements() for each one. Phew! It was not as easy as I first thought.
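Here's a small, self-contained sketch of that idea; the exact wording inside logDateElements() is my own, not the original's:

    // Example monthly totals of the kind collected in dateDonationCount
    const dateDonationCount = { '2017-11': 123, '2017-12': 456, '2018-01': 789 };

    // ES6 string interpolation to display each month's key and value
    const logDateElements = (value, key) => {
      console.log(`Donations for ${key}: ${value}`);
    };

    // new Map(Object.entries(...)) turns the object into key/value entries we can iterate over
    const donationMonthsMap = new Map(Object.entries(dateDonationCount));
    donationMonthsMap.forEach(logDateElements);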

At least it worked for the 400 MB files I tested…

After finishing the fs.createReadStream() implementation, I went back and also implemented my solution with fs.readFile() to see what the difference would be. Here is the code for that method, though I won't go into all the details here. It's very similar to the first snippet, except that it looks more synchronous (unless you use the fs.readFileSync() method, but don't worry, JavaScript executes this code just as asynchronously as any other asynchronous code).
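As a rough sketch of the fs.readFile() flavor (again with an assumed file name and delimiter), the main difference is that the whole file arrives as one giant string that then has to be split into lines:

    const fs = require('fs');

    fs.readFile('itcont.txt', 'utf8', (err, data) => { // assumed file name
      if (err) throw err;

      // The entire file is now in memory as one string; split it into lines
      const lines = data.split('\n');
      console.log(`Total line count: ${lines.length}`);

      // The per-line parsing is the same idea as before (pipe delimiter assumed)
      const names = lines.map((line) => line.split('|')[7]);
      console.log(`The 432nd name is: ${names[431]}`);
      console.log(`The 43,243rd name is: ${names[43242]}`);
    });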

If you want to see the full version of my code, you can find it here.

Initial results with Node.js

Using my solution, I replaced the file path passed into readFilestream.js with the 2.55 GB monster file and watched my Node server crash with a JavaScript heap out of memory error.

As it turns out, even though Node.js streams the file's input and output, in between it was still attempting to hold the entire file contents in memory, which it couldn't do for a file of that size. Node can hold up to about 1.5 GB in memory at one time, but no more.

Therefore, none of my current solutions could complete the entire challenge.

I needed a new solution: a Node-based solution that could handle larger data sets.

A new streaming solution

EventStream is a popular NPM module with more than 2 million downloads per week that promises to make creating and working with streams easy.

With the help of the EventStream documentation, I figured out how to read the data line by line once again, this time in a more CPU-friendly way.

EventStream code implementation

This is the new code I implemented using the EventStream NPM module.

The biggest change is the chain of pipe commands at the beginning of the file. As the EventStream documentation suggests, this syntax breaks the stream apart on the \n character at the end of each line of the .txt file.

The only other thing I changed was how I handle names. I have to be honest: trying to put 13 million names into an array still ran me out of memory. I got around the problem by collecting only the 432nd and 43,243rd names and adding them to an array of their own. Not for any other reason; I just wanted to be a little creative.
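Here is a minimal sketch of the event-stream pipeline described above. es.split() breaks the incoming stream into lines on '\n', and es.mapSync() runs a function for each line; the file name and the pipe delimiter are assumptions carried over from the earlier sketches:

    const fs = require('fs');
    const es = require('event-stream');

    let lineCount = 0;
    const names = []; // only hold the 432nd and 43,243rd names instead of all 13 million

    fs.createReadStream('itcont.txt')     // assumed path to the 2.55 GB file
      .pipe(es.split())                   // break the stream into lines on '\n'
      .pipe(
        es
          .mapSync((line) => {
            lineCount++;
            if (lineCount === 432 || lineCount === 43243) {
              names.push(line.split('|')[7]); // pipe-delimited, name in the 8th column (assumed)
            }
          })
          .on('error', (err) => console.error('Error while reading the file.', err))
          .on('end', () => {
            console.log(`Total line count: ${lineCount}`);
            console.log(`The 432nd and 43,243rd names are: ${names.join(' and ')}`);
          })
      );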

Node.js and EventStream implementation: Round 2

OK, with the new solution in place, I once again ran Node.js against the 2.55 GB file, fingers crossed, and this time it worked. Let's look at the results.

Success!

Conclusion

In the end, Node.js's plain file and big-data handling capabilities fell a little short of what I needed, but with one extra NPM module, EventStream, I was able to parse a massive data set without crashing the Node server.

Stay tuned for part 2 of this series, where I test and compare the performance of the three ways of reading data in Node.js to see which one outperforms the others. The results get dramatic, especially as the amount of data grows…

Thanks for reading, and I hope this article helped you understand how to handle large amounts of data with Node.js. Your likes and follows are appreciated!

If you enjoyed reading this, you may also enjoy some of my other blogs:

  • Postman vs. Insomnia: a comparison of API testing tools
  • How to use Netflix Eureka and Spring Cloud for service registration
  • Jib: getting expert Docker results without knowing Docker

References and further reading:

  • Node.js documentation, file system: nodejs.org/api/fs.html
  • Node.js documentation, Readline: nodejs.org/api/readlin…
  • Github, Read File Repo: github.com/paigen11/fi…
  • NPM, EventStream: www.npmjs.com/package/eve…

If you find any mistakes in this translation or other areas that could be improved, you are welcome to revise it and submit a PR to the Nuggets Translation Project, and you can earn corresponding reward points. The permanent link at the beginning of this article is the Markdown link to this article on GitHub.


The Nuggets Translation Project is a community that translates high-quality technical articles from around the internet, starting with English articles shared on Nuggets. The content covers Android, iOS, front end, back end, blockchain, product, design, artificial intelligence, and other fields. If you would like to see more high-quality translations, please follow the Nuggets Translation Project, its official Weibo account, and its Zhihu column.