This article aims to provide a quick introduction to the many options for reading files in the Go standard library.

In Go (and, for that matter, most low-level languages and some dynamic languages such as Node), file reads return byte streams. One benefit of not automatically converting everything to a string is that it avoids expensive string allocations, which increase GC pressure.

To keep this article simple, I’ll use string(arrayOfBytes) to convert a byte slice into a string. However, this should not be taken as a general recommendation for production code.
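As a quick illustration (not tied to any particular file), the conversion copies the bytes into a new string, which is convenient for printing but is an extra allocation:

package main

import "fmt"

func main() {
	buf := []byte("some file contents")

	// string(buf) copies the byte slice into a new immutable string.
	// Fine for examples; in hot paths you may want to keep working with bytes.
	s := string(buf)
	fmt.Println(s)
}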

1. Read the entire file into memory

First, the standard library provides a variety of functions and utilities for reading file data. We’ll start with the basics provided in the os package. This implies two prerequisites:

  • The file must fit in memory
  • We need to know the size of the file in advance so that we can instantiate a buffer large enough to hold it.

With a handle to the os.File object, we can query the size and instantiate a slice of bytes.

package main


import (
	"os"
	"fmt"
)
func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	fileinfo, err := file.Stat()
	if err != nil {
		fmt.Println(err)
		return
	}

	filesize := fileinfo.Size()
	buffer := make([]byte, filesize)

	bytesread, err := file.Read(buffer)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("bytes read: ", bytesread)
	fmt.Println("bytestream to string: ", string(buffer))
}

2. Read files in blocks

While most of the time you can read a file in one go, sometimes you want a more memory-efficient approach: read the file in chunks of a fixed size, process each chunk, and repeat until the end. In the following example, a buffer size of 100 bytes is used.

package main


import (
	"io"
	"os"
	"fmt"
)

const BufferSize = 100

func main() {
	
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	buffer := make([]byte, BufferSize)

	for {
		bytesread, err := file.Read(buffer)
		if err != nil {
			if err != io.EOF {
				fmt.Println(err)
			}
			break
		}
		fmt.Println("bytes read: ", bytesread)
		fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
	}
}


Compared to full file reading, the main differences are:

  • We read until we hit the EOF marker, so a specific check for err == io.EOF was added
  • We define the buffer size, so we control the “chunk” size we want. If the operating system caches the file being read, this can improve performance when used well.
  • If the file size is not an integer multiple of the buffer size, the last iteration will only add the remaining bytes to the buffer, hence the call to buffer[:bytesread]. Under normal circumstances, bytesread will be the same as the buffer size.

For each iteration of the loop, the internal file pointer is updated. The next read returns data starting from that offset, up to the buffer size. This pointer is not a construct of the language but of the operating system: on Linux, it is an attribute of the file descriptor created for the open file. All read calls (read in Ruby, Read in Go) are internally translated into system calls and sent to the kernel, which manages this pointer.
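To see this offset from Go, os.File exposes a Seek method (essentially a wrapper over lseek on Linux). Here is a minimal sketch, assuming the same filetoread.txt as above:

package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	// Reading advances the kernel-maintained offset by the number of bytes read.
	buffer := make([]byte, 10)
	if _, err := file.Read(buffer); err != nil {
		fmt.Println(err)
		return
	}

	// Seek(0, io.SeekCurrent) moves by zero bytes and reports the current offset.
	pos, err := file.Seek(0, io.SeekCurrent)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("offset after first read:", pos)
}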

3. Read file blocks concurrently

What if we want to speed up processing of the chunks above? One way is to use multiple goroutines! Compared to reading chunks serially, the extra thing we need is the offset for each goroutine. Note that ReadAt behaves slightly differently from Read when the size of the target buffer is greater than the number of bytes remaining.

Also note that I am not limiting the number of goroutines here; it is determined only by the file size and the buffer size. In practice, there may be an upper limit.
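If you do want to cap the number of goroutines in flight, one common pattern (not part of the example below; maxInFlight is an arbitrary number chosen for illustration) is a buffered channel used as a semaphore:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// Hypothetical limit on concurrently running goroutines; tune it for your workload.
	const maxInFlight = 8
	const totalChunks = 100

	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup

	for i := 0; i < totalChunks; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxInFlight goroutines are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			// In the real example this is where file.ReadAt would process chunk i.
			fmt.Println("processing chunk", i)
		}(i)
	}

	wg.Wait()
}

With that caveat noted, here is the full chunked, concurrent read: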

package main

import (
	"fmt"
	"os"
	"sync"
)

const BufferSize = 100

type chunk struct {
	bufsize int
	offset  int64
}

func main() {
	
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	fileinfo, err := file.Stat()
	if err != nil {
		fmt.Println(err)
		return
	}

	filesize := int(fileinfo.Size())
	// Number of go routines we need to spawn.
	concurrency := filesize / BufferSize
	// buffer sizes that each of the go routine below should use. ReadAt
	// returns an error if the buffer size is larger than the bytes returned
	// from the file.
	chunksizes := make([]chunk, concurrency)

	// All buffer sizes are the same in the normal case. Offsets depend on the
	// index. Second go routine should start at 100, for example, given our
	// buffer size of 100.
	for i := 0; i < concurrency; i++ {
		chunksizes[i].bufsize = BufferSize
		chunksizes[i].offset = int64(BufferSize * i)
	}

	// check for any left over bytes. Add the residual number of bytes as the
	// last chunk size.
	if remainder := filesize % BufferSize; remainder != 0 {
		c := chunk{bufsize: remainder, offset: int64(concurrency * BufferSize)}
		concurrency++
		chunksizes = append(chunksizes, c)
	}

	var wg sync.WaitGroup
	wg.Add(concurrency)

	for i := 0; i < concurrency; i++ {
		go func(chunksizes []chunk, i int) {
			defer wg.Done()

			chunk := chunksizes[i]
			buffer := make([]byte, chunk.bufsize)
			bytesread, err := file.ReadAt(buffer, chunk.offset)

			if err != nil {
				fmt.Println(err)
				return
			}

			fmt.Println("bytes read, string(bytestream): ", bytesread)
			fmt.Println("bytestream to string: ", string(buffer))
		}(chunksizes, i)
	}

	wg.Wait()
}

There are a few more things going on in this method than in any of the previous ones:

  • I create a specific number of goroutines, depending on the file size and the buffer size (100, in this case).
  • We need a way to make sure we are “waiting” for all the goroutines to finish. In this example, I use a WaitGroup.
  • Each goroutine signals that it is done from the inside, instead of breaking out of the for loop. Because we deferred the wg.Done() call, it runs whenever a goroutine returns.

Note: Always check the number of bytes returned, and re-slice the output buffer.
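For example, when the buffer is longer than what is left in the file, ReadAt fills only part of it and returns io.EOF along with the byte count. A minimal standalone sketch of the safe pattern (again using the hypothetical filetoread.txt):

package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	// If the file holds fewer than 100 bytes, ReadAt fills only part of the
	// buffer and returns io.EOF together with the number of bytes read.
	buffer := make([]byte, 100)
	bytesread, err := file.ReadAt(buffer, 0)
	if err != nil && err != io.EOF {
		fmt.Println(err)
		return
	}

	// Re-slice so we never treat the unfilled tail of the buffer as data.
	fmt.Println("valid bytes:", string(buffer[:bytesread]))
}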

Reading files with Read() goes a long way, but sometimes you need more convenience. Ruby, for example, commonly uses IO functions such as each_line, each_char, each_codepoint, and so on. We can achieve something similar with the Scanner type and its associated functions in the bufio package.

The bufio.Scanner type works with a “split” function and advances an internal pointer based on it. For example, on each iteration the built-in bufio.ScanLines split function advances the pointer until the next newline. At each step, the type also exposes methods to get the byte slice/string between the start and end positions.

package main

import (
	"fmt"
	"os"
	"bufio"
)

const BufferSize = 100

type chunk struct {
	bufsize int
	offset  int64
}

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()
	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)

	// Returns a boolean based on whether there's a next instance of `\n`
	// character in the IO stream. This step also advances the internal pointer
	// to the next position (after '\n') if it did find that token.
	for {
		read := scanner.Scan()
		if !read {
			break
		}
		fmt.Println("read byte array: ", scanner.Bytes())
		fmt.Println("read string: ", scanner.Text())
	}
}

So, to read the entire file line by line in this way, you can use something like this:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)

	// This is our buffer now
	var lines []string

	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}

	fmt.Println("read lines:")
	for _, line := range lines {
		fmt.Println(line)
	}
}

4. Scan word by word

The bufio package contains basic predefined split functions:

  • ScanLines (default)
  • ScanWords
  • ScanRunes (useful for iterating over UTF-8 code points rather than bytes; see the sketch after this list)
  • ScanBytes
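ScanRunes is the only one of these not demonstrated later in the article; here is a minimal sketch of iterating over code points (the input string is arbitrary):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// ScanRunes yields one UTF-8 code point per Scan() call, not one byte,
	// so the multi-byte 'é' comes back as a single token.
	scanner := bufio.NewScanner(strings.NewReader("héllo"))
	scanner.Split(bufio.ScanRunes)

	for scanner.Scan() {
		fmt.Printf("%q ", scanner.Text())
	}
	fmt.Println()
}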

So, to read the file and create a list of words in the file, you can use something like this:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords)

	var words []string

	for scanner.Scan() {
		words = append(words, scanner.Text())
	}

	fmt.Println("word list:")
	for _, word := range words {
		fmt.Println(word)
	}
}

The ScanBytes split function will give the same output as the earlier Read() example. One main difference between the two is the dynamic allocation in the scanner every time it needs to append to the byte/string slice. This can be mitigated by techniques such as pre-initializing the buffer to a specific length and growing it only when the previous limit is reached. Using the same example as above:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords)

	// initial size of our wordlist
	bufferSize := 50
	words := make([]string, bufferSize)
	pos := 0

	for scanner.Scan() {
		if err := scanner.Err(); err != nil {
			// This error is a non-EOF error. End the iteration if we encounter
			// an error
			fmt.Println(err)
			break
		}

		words[pos] = scanner.Text()
		pos++

		if pos >= len(words) {
			// expand the buffer by another bufferSize entries
			newbuf := make([]string, bufferSize)
			words = append(words, newbuf...)
		}
	}

	fmt.Println("word list:")
	// we are iterating only until the value of "pos" because our buffer size
	// might be more than the number of words, since we increase the length by
	// a constant value. Or the scanner loop might've terminated due to an
	// error prematurely. In this case "pos" contains the index of the last
	// successful update.
	for _, word := range words[:pos] {
		fmt.Println(word)
	}
}

As a result, we end up doing far fewer slice “grow” operations, but we may be left with some empty slots at the end, depending on the buffer size and the number of words in the file; this is a trade-off.

5. Break up long strings into words

bufio.NewScanner takes as a parameter a type that satisfies the io.Reader interface, which means it works with any type that defines a Read method. One of the string utility functions in the standard library that returns a reader type is strings.NewReader. We can combine the two when reading words from a string:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	longstring := "This is a very long string. Not."
	var words []string
	scanner := bufio.NewScanner(strings.NewReader(longstring))
	scanner.Split(bufio.ScanWords)

	for scanner.Scan() {
		words = append(words, scanner.Text())
	}

	fmt.Println("word list:")
	for _, word := range words {
		fmt.Println(word)
	}
}

6. Scan for comma-separated strings

Manually parsing CSV files/strings with the basic file.Read() or the Scanner type is cumbersome, because according to bufio.ScanWords a “word” is a run of runes delimited by Unicode spaces. Reading individual runes and keeping track of buffer sizes and positions (as is done in, say, lexical analysis) is too much work and fiddling.

But it can be avoided. We can define a new split function that reads characters until the reader encounters a comma, and then returns that chunk when Text() or Bytes() is called. The bufio.SplitFunc type has the following signature:

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

For simplicity, I’ve shown an example that reads a string instead of a file. A simple reader for a CSV string using the signature above could be:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func main() {
	csvstring := "name, age, occupation"

	// An anonymous function declaration to avoid repeating main()
	ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		commaidx := bytes.IndexByte(data, ',')
		if commaidx > 0 {
			// we need to return the next position
			buffer := data[:commaidx]
			return commaidx + 1, bytes.TrimSpace(buffer), nil
		}

		// if we are at the end of the string, just return the entire buffer
		if atEOF {
			// but only do that when there is some data. If not, this might mean
			// that we've reached the end of our input CSV string
			if len(data) > 0 {
				return len(data), bytes.TrimSpace(data), nil
			}
		}

		// when 0, nil, nil is returned, this is a signal to the interface to read
		// more data in from the input reader. In this case, this input is our
		// string reader and this pretty much will never occur.
		return 0, nil, nil
	}

	scanner := bufio.NewScanner(strings.NewReader(csvstring))
	scanner.Split(ScanCSV)

	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

7. ioutil

We’ve seen multiple ways to read files. But what if you just want to read the file into the buffer?

ioutil is a package in the standard library with functions that make some of these tasks one-liners.

Read the entire file

package main

import (
	"io/ioutil"
	"log"
	"fmt"
)

func main() {
	bytes, err := ioutil.ReadFile("filetoread.txt")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Bytes read: ", len(bytes))
	fmt.Println("String read: ", string(bytes))
}

This is closer to what we see in high-level scripting languages.

Read all files in a directory

Needless to say, do not run this script if the directory contains large files.

package main

import (
	"io/ioutil"
	"log"
	"fmt"
)

func main() {
	filelist, err := ioutil.ReadDir(".")
	if err != nil {
		log.Fatal(err)
	}

	for _, fileinfo := range filelist {
		if fileinfo.Mode().IsRegular() {
			bytes, err := ioutil.ReadFile(fileinfo.Name())
			if err != nil {
				log.Fatal(err)
			}

			fmt.Println("Bytes read: ", len(bytes))
			fmt.Println("String read: ", string(bytes))
		}
	}
}
