A couple of days ago I wrote a website crawler in Go for practice, but the content it fetched came out garbled. It turned out the source site was GBK-encoded, while Go's default encoding is UTF-8, so any non-UTF-8 content ends up as mojibake.

So I went looking for Go transcoding libraries. There are three in wide use: mahonia, iconv-go, and the official golang.org/x/text.

I tried all three and wasn't entirely satisfied with any of them. Let's look at GBK-to-UTF-8 conversion with each.

  • mahonia
package main

import "fmt"
import "github.com/axgle/mahonia"

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	testStr := string(testBytes)
	dec := mahonia.NewDecoder("gbk")
	res := dec.ConvertString(testStr)
	fmt.Println(res) // 你好，世界！ ("Hello, world!")
}
  • iconv-go
package main

import (
	"fmt"

	iconv "github.com/djimenez/iconv-go"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	// the output buffer must be pre-allocated; Convert reports how
	// many bytes it wrote
	res := make([]byte, len(testBytes)*2)
	_, n, _ := iconv.Convert(testBytes, res, "GBK", "UTF-8")
	fmt.Println(string(res[:n])) // 你好，世界！ ("Hello, world!")
}
  • golang.org/x/text
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	decoder := simplifiedchinese.GBK.NewDecoder()
	reader := transform.NewReader(bytes.NewReader(testBytes), decoder)
	res, _ := ioutil.ReadAll(reader)
	fmt.Println(string(res)) // 你好，世界！ ("Hello, world!")
}


The above shows the basic usage of all three libraries, and you can see that each has its problems:

  • mahonia has the simplest API, but it only accepts and returns string. In Go we usually handle data as []byte or io.Reader, so this is somewhat limiting.
  • iconv-go can read string, []byte, and io.Reader data, but underneath it is a cgo wrapper around the C iconv library, which causes problems in various environments: compile errors are hard to diagnose, and more than once I failed to get it installed :(
  • golang.org/x/text is the official library, but its API is too cumbersome. Pass.

transcode

After some thought I decided a chained-call API would be a good fit, so I built my own wheel: transcode. Here's how it works:

  • GBK → UTF-8
package main

import (
	"fmt"

	"github.com/piex/transcode"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	res := transcode.FromByteArray(testBytes).Decode("GBK").ToString()
	fmt.Println(res) // 你好，世界！ ("Hello, world!")
}
  • UTF-8 → GBK
package main

import (
	"bytes"
	"fmt"

	"github.com/piex/transcode"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	testStr := "你好，世界！"
	res := transcode.FromString(testStr).Encode("GBK").ToByteArray()
	fmt.Println(bytes.Equal(res, testBytes)) // true
}

Under the hood, the library wraps the golang.org/x/text transcoding API; since that API is too cumbersome to use directly, transcode puts a friendlier interface on top of it. The library currently supports string and []byte as input and output data types.

The chained calls work by having each intermediate method return the struct itself; the ToString and ToByteArray methods then terminate the chain and extract the result.
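The pattern is easy to reproduce. Below is a minimal sketch of such a chained API: the names mirror transcode's, but the code is illustrative rather than the library's actual source, and it implements only Latin-1 decoding so the example needs no external dependencies (the real library dispatches to golang.org/x/text decoders instead).

```go
package main

import "fmt"

// Transcoder carries the payload between chained calls.
type Transcoder struct {
	data []byte
}

func FromByteArray(b []byte) *Transcoder { return &Transcoder{data: b} }
func FromString(s string) *Transcoder    { return &Transcoder{data: []byte(s)} }

// Decode converts t.data from the named charset to UTF-8 and returns
// the receiver so further calls can be chained. Only Latin-1 is
// implemented in this sketch.
func (t *Transcoder) Decode(charset string) *Transcoder {
	if charset == "latin1" {
		runes := make([]rune, len(t.data))
		for i, b := range t.data {
			runes[i] = rune(b) // Latin-1 bytes map directly to code points
		}
		t.data = []byte(string(runes))
	}
	return t
}

// ToString and ToByteArray terminate the chain and extract the result.
func (t *Transcoder) ToString() string    { return string(t.data) }
func (t *Transcoder) ToByteArray() []byte { return t.data }

func main() {
	// 0xE9 is "é" in Latin-1
	fmt.Println(FromByteArray([]byte{0xE9}).Decode("latin1").ToString()) // é
}
```

The key design point is that Decode returns *Transcoder rather than a value, which is what makes `FromByteArray(...).Decode(...).ToString()` read as one expression.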

Repository: github.com/piex/transc… Feel free to read the source, it is quite simple. I plan to add io.Reader support later; if you're interested, PRs are welcome.