An overview

In the previous post we talked about how to convert numeric data (such as Short, Int, and Long) to binary data in JavaScript. If you’re interested, you can read the second post in this series, WebSocket series: how to convert numeric data to binary data in JavaScript. This time, let’s talk about how to handle string data. This article is the third part of the WebSocket series and mainly introduces how to convert between string data and binary data. The contents are as follows:

  • The basics of strings in JavaScript
  • How JavaScript converts strings to binary data
  • How JavaScript converts binary data back to strings

This article is not strongly related to WebSocket itself, but it is included in this series as a foundation for passing binary data over WebSocket. If you are not familiar with WebSocket, or don’t understand its usage scenarios and details, you can read the first post in this series, The Basics of WebSocket.

String basics

The string type should be familiar to anyone who knows JavaScript: it is one of the basic data types in the language. Before we get to it, though, let’s talk about DOMString.

DOMString is a UTF-16 string. Because JavaScript already uses UTF-16 strings, DOMString maps directly to a String. Passing null to a method or parameter that accepts a DOMString typically stringifies it to “null”.
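A minimal illustration of this, assuming a browser environment where document.title is a DOMString attribute:

document.title = 'hello';           // an ordinary JavaScript string works directly
console.log(typeof document.title); // "string"

document.title = null;              // null is stringified rather than stored as null
console.log(document.title);        // "null"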

DOMString is also the type used when transferring string data over WebSocket. However, based on the description of DOMString on MDN, we can treat it as an ordinary string in most everyday scenarios. So we just need to know the type exists and keep using it as a string.

Encoding

Since UTF-16 was mentioned above, let’s briefly introduce UTF-16 and UTF-8, the two encodings most commonly involved when talking to a back end. Why introduce encodings at all? Because if the two sides use different encodings when passing string data, they will end up with different bytes for the same string. Therefore, when we transmit a string as binary data, we need not only to convert the string to binary, but also to agree on the string encoding.
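A quick sketch of the problem, using the built-in TextEncoder (not the utfx library introduced later) for the UTF-8 side:

const ch = '中'; // U+4E2D

// UTF-8: three bytes
console.log(new TextEncoder().encode(ch)); // Uint8Array [ 228, 184, 173 ] => 0xE4 0xB8 0xAD

// UTF-16 (big-endian): one 16-bit code unit, i.e. two bytes
const codeUnit = ch.charCodeAt(0);
console.log([codeUnit >> 8, codeUnit & 0xff]); // [ 78, 45 ] => 0x4E 0x2D

// The same string produces different bytes, so sender and receiver must agree on the encoding.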

UTF-16

UTF-16 (16-bit Unicode Transformation Format) is the third layer of the five-level Unicode character encoding model: an implementation of a Character Encoding Form (also known as a “storage format”). It maps abstract code points from the Unicode character set to sequences of 16-bit integers (code units) for storage or transfer. A Unicode character requires one or two 16-bit code units, so UTF-16 is a variable-length encoding.

UTF-16 is the encoding JavaScript uses for its strings.
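A minimal illustration of the variable-length behavior: characters outside the Basic Multilingual Plane, such as an emoji, take two 16-bit code units (a surrogate pair), which is why .length reports 2:

console.log('A'.length);                       // 1 — one 16-bit code unit
console.log('😀'.length);                      // 2 — U+1F600 needs a surrogate pair
console.log('😀'.charCodeAt(0).toString(16));  // "d83d" — high surrogate
console.log('😀'.charCodeAt(1).toString(16));  // "de00" — low surrogate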

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and a prefix code. It can represent any character in the Unicode standard, and its encoding of the ASCII range is byte-for-byte compatible with ASCII, so software written for ASCII can usually keep working with little or no modification. UTF-8 uses one to four bytes per character (as re-specified in November 2003).

UTF-8 is a common encoding used by many languages and is very common in back-end applications.
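A small illustration of the one-to-four-byte range, again using the built-in TextEncoder rather than utfx:

const enc = new TextEncoder();
console.log(enc.encode('A').length);  // 1 byte  — ASCII
console.log(enc.encode('é').length);  // 2 bytes — Latin-1 supplement
console.log(enc.encode('中').length); // 3 bytes — Basic Multilingual Plane
console.log(enc.encode('😀').length); // 4 bytes — outside the BMP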

How to convert between UTF-16 and UTF-8 in JavaScript

dcodeIO/utfx on GitHub is “a compact library to encode, decode and convert UTF8 / UTF16 in JavaScript.” With this library we can convert strings between UTF-8 and UTF-16 encodings. The internals of the library and the two encodings will be covered in detail in a future post. For now, let’s briefly look at how it is used:

import utfx from './util/utfx';

let str = 'abcdefg';
let result = [];

// Returns a function that yields the string's UTF-16 code units one by one,
// and null when the string is exhausted (the source format utfx expects).
function stringSource(s) {
    let i = 0;
    return function () {
        return i < s.length ? s.charCodeAt(i++) : null;
    };
}

// Read UTF-16 code units from the source and receive the UTF-8 bytes in the callback.
utfx.encodeUTF16toUTF8(stringSource(str), function (b) {
    result.push(b);
});

The library also provides other methods:

  • decodeUTF8toUTF16, which converts UTF-8 data back to UTF-16 string data.
  • calculateUTF16asUTF8, which calculates the byte length a UTF-16 string will occupy once encoded as UTF-8.

We will use both of these methods in the examples below; a small sketch of calculateUTF16asUTF8 follows.
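A minimal sketch of calculateUTF16asUTF8, reusing the stringSource helper from above. Its return value is an array; the later examples read index 1 as the UTF-8 byte length, so that is what we log here:

import utfx from './util/utfx';

function stringSource(s) {
    let i = 0;
    return function () {
        return i < s.length ? s.charCodeAt(i++) : null;
    };
}

// 'abc' is 3 bytes in UTF-8 either way; '中' is 1 UTF-16 code unit but 3 UTF-8 bytes.
let lengths = utfx.calculateUTF16asUTF8(stringSource('abc中'));
console.log(lengths[1]); // 6 — the total UTF-8 byte length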

How does JavaScript convert string to binary data

Now that we know how strings are encoded in JavaScript and how to convert between UTF-8 and UTF-16, let’s look at how to turn a string into binary data. First, we assume the encoding agreed with the back end is UTF-8, since it covers the most scenarios; if you are exchanging UTF-16 instead, simply skip the encoding-conversion step. The idea is straightforward: given the string to convert, we first calculate its encoded byte length, allocate an ArrayBuffer of the right size, and then write the data into it. Let’s modify the earlier example slightly:

import utfx from './util/utfx';

function stringSource(s) {
    let i = 0;
    return function () {
        return i < s.length ? s.charCodeAt(i++) : null;
    };
}

let str = 'abcdefg';

let strCodes = stringSource(str);
let length = utfx.calculateUTF16asUTF8(strCodes)[1]; // UTF-8 byte length of the string
let buffer = new ArrayBuffer(length + 4); // Allocate the UTF-8 byte length plus 4 bytes for the length header
let view = new DataView(buffer);
let offset = 4;

view.setUint32(0, length); // Place the byte length at the head of the buffer

// Encode the UTF-16 string to UTF-8 and write each byte into the buffer.
utfx.encodeUTF16toUTF8(stringSource(str), function (b) {
    view.setUint8(offset++, b);
});

In the example above, we encoded the string into the ArrayBuffer as UTF-8 and stored its byte length as a four-byte unsigned integer at the head of the buffer.
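To make the layout concrete, we can dump the buffer as bytes (a sanity check rather than part of the original example). For 'abcdefg' the result is the 4-byte big-endian length 7 followed by the seven ASCII bytes:

console.log(new Uint8Array(buffer));
// Uint8Array [ 0, 0, 0, 7, 97, 98, 99, 100, 101, 102, 103 ]
//             |- length 7 -| |----- 'abcdefg' in UTF-8 -----|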

How does JavaScript convert binary data to string

Now that we know how to convert a string to binary, let’s look at how to read it back out of the binary data. Given the conversion process above, the reverse is not hard to work out. A concrete example follows:

import utfx from './util/utfx';
let str = 'abcdefg';

function stringSource(s) {
    let i = 0;
    return function () {
        return i < s.length ? s.charCodeAt(i++) : null;
    };
}

let strCodes = stringSource(str);
let length = utfx.calculateUTF16asUTF8(strCodes)[1];
let buffer = new ArrayBuffer(length + 4); // Allocate the UTF-8 byte length plus 4 bytes for the length header
let view = new DataView(buffer);
let offset = 4;

// String-to-binary process
view.setUint32(0, length); // Place the byte length at the head of the buffer

utfx.encodeUTF16toUTF8(stringSource(str), function (b) {
    view.setUint8(offset++, b);
});

// Binary-to-string process
let strLength = view.getUint32(0); // Read the byte length from the header
offset = 4;
let result = []; // UTF-16 code units decoded from the buffer
let end = offset + strLength;

utfx.decodeUTF8toUTF16(function () {
    return offset < end ? view.getUint8(offset++) : null; // Returning null tells utfx the input is exhausted
}, (char) => {
    result.push(char);
});

let strResult = result.reduce((prev, next) => {
    return prev + String.fromCharCode(next);
}, '');

As the example shows, we only need to read the string’s byte length from the first four bytes, then read that many bytes starting at offset 4. We end up with an array of UTF-16 code units, which String.fromCharCode turns back into a string.
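As an aside, and not part of the approach above, the built-in TextDecoder can serve as a cross-check on the same buffer layout (a 4-byte big-endian length header followed by UTF-8 bytes):

// Sanity check with the built-in TextDecoder on the buffer from the previous example.
let byteLength = new DataView(buffer).getUint32(0);
let decoded = new TextDecoder('utf-8').decode(new Uint8Array(buffer, 4, byteLength));
console.log(decoded === strResult); // true — both approaches recover 'abcdefg'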

Conclusion

Using ArrayBuffer and DataView, we can convert string data and binary data to and from each other. With this foundation for string conversion in place, you will be able to follow the processing logic when we transfer binary data over WebSocket. The next post in the WebSocket series will cover how to send binary data to the back end through WebSocket and how to handle binary data received from it. If you’re interested, stay tuned.