Contents

  1. Implementing a programming language in JavaScript – Preface
  2. Implementing a programming language in JavaScript – Language vision
  3. Implementing a programming language in JavaScript – Writing a parser
  4. Implementing a programming language in JavaScript – The character input stream
  5. Implementing a programming language in JavaScript – Lexical analysis

The Token Input Stream

The lexer (tokenizer) operates on top of the character input stream and exposes the same interface, but its peek() and next() return a special object, a token, instead of a character. A token has two properties: type and value. Here are some examples:

{ type: "punc", value: "(" }           // punctuation: parens, commas, semicolons etc.
{ type: "num", value: 5 }              // numbers
{ type: "str", value: "Hello World!" } // strings
{ type: "kw", value: "lambda" }        // keywords
{ type: "var", value: "a" }            // identifiers (variable names)
{ type: "op", value: "!=" }            // operators

Whitespace and comments are skipped directly, with no token returned.

To write the lexer we need to look at the syntax of the language in more detail. Depending on the current character, as returned by input.peek(), we decide what kind of token to read:

  • First, skip whitespace.
  • If the end of the input is reached, return null.
  • If it is a # sign, skip the comment (everything up to the end of the line) and try again.
  • If it is a quote, read a string.
  • If it is a digit, read a number.
  • If it can start a word (a letter, λ or _), read the word and return it as a keyword or a variable token.
  • If it is a punctuation character, return a punctuation token.
  • If it is an operator character, return an operator token.
  • If none of the above matches, report an error with input.croak().

Here is the core function of the lexer, read_next():

function read_next() {
    read_while(is_whitespace);
    if (input.eof()) return null;
    var ch = input.peek();
    if (ch == "#") {
        skip_comment();
        return read_next();
    }
    if (ch == '"') return read_string();
    if (is_digit(ch)) return read_number();
    if (is_id_start(ch)) return read_ident();
    if (is_punc(ch)) return {
        type  : "punc",
        value : input.next()
    };
    if (is_op_char(ch)) return {
        type  : "op",
        value : read_while(is_op_char)
    };
    input.croak("Can't handle character: " + ch);
}

This is the dispatcher function: next() calls it to fetch the next token. It relies on several helpers that each handle one kind of token, such as read_string() and read_number(). Folding their code into the dispatcher would only make it harder to read, even though they are not called from anywhere else.

Also note that we do not tokenize the whole input up front. Each time the parser asks for a token, only one token is read. This keeps error positions close to what the parser is actually looking at, and if the parser stops on a syntax error, the rest of the input is never read.
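
To make the one-token-at-a-time behavior concrete, here is a minimal sketch. It is not the TokenStream from this chapter: InputStream below is a compact stand-in for the character stream from the previous chapter (assumed to offer next, peek, eof and croak), and TinyTokenStream is a hypothetical, heavily reduced tokenizer that only knows integers and punctuation.

```javascript
// Compact stand-in for the character input stream from the previous chapter
// (assumed interface: next, peek, eof, croak).
function InputStream(input) {
    var pos = 0, line = 1, col = 0;
    return { next: next, peek: peek, eof: eof, croak: croak };
    function next() {
        var ch = input.charAt(pos++);
        if (ch == "\n") { line++; col = 0; } else { col++; }
        return ch;
    }
    function peek() { return input.charAt(pos); }
    function eof() { return peek() == ""; }
    function croak(msg) {
        throw new Error(msg + " (" + line + ":" + col + ")");
    }
}

// Hypothetical reduced token stream: integers and punctuation only.
function TinyTokenStream(input) {
    return { next: read_next };
    function read_while(predicate) {
        var str = "";
        while (!input.eof() && predicate(input.peek())) str += input.next();
        return str;
    }
    function is_digit(ch) { return /[0-9]/.test(ch); }
    function is_whitespace(ch) { return " \t\n".indexOf(ch) >= 0; }
    function read_next() {
        read_while(is_whitespace);
        if (input.eof()) return null;
        var ch = input.peek();
        if (is_digit(ch)) {
            return { type: "num", value: parseFloat(read_while(is_digit)) };
        }
        if (",;(){}[]".indexOf(ch) >= 0) {
            return { type: "punc", value: input.next() };
        }
        input.croak("Can't handle character: " + ch);
    }
}

var tokens = TinyTokenStream(InputStream("12 ; @"));
var first = tokens.next();   // consumes "12" only
var second = tokens.next();  // consumes ";" only
console.log(first, second);

// The "@" has not been examined yet, so nothing has failed so far.
var error_message = null;
try {
    tokens.next();           // only now does the tokenizer reach "@"
} catch (e) {
    error_message = e.message;
}
console.log(error_message);  // Can't handle character: @ (1:5)
```

A consumer that had stopped after the second token would never have seen the error at all, which is the point of reading lazily.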

The read_ident() function reads characters for as long as they are allowed in an identifier (is_id). An identifier must start with a letter, λ or _, and may continue with letters, digits, or any of ?!-<>=. So foo-bar is read as a single identifier (one var token), not as three tokens. The reason for this rule is that I want to be able to define names such as is-pair.

read_ident() also checks whether the name it has read is a keyword. If it is, a kw token is returned; otherwise a var token is returned.
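
The identifier rules above can be tried out in isolation. The sketch below uses the character classes and keyword list described in the text; read_ident_from is a hypothetical helper for this demo that reads from the start of a plain string rather than from the character input stream.

```javascript
// Character classes for identifiers, as described in the text.
function is_id_start(ch) {
    return /[a-zλ_]/i.test(ch);
}
function is_id(ch) {
    return is_id_start(ch) || "?!-<>=0123456789".indexOf(ch) >= 0;
}

// Keywords live in one space-delimited string. Padding the candidate with
// spaces makes indexOf match whole words only.
var keywords = " if then else lambda λ true false ";
function is_keyword(x) {
    return keywords.indexOf(" " + x + " ") >= 0;
}

// Hypothetical helper: read one identifier from the start of a string.
function read_ident_from(text) {
    var id = "";
    var i = 0;
    while (i < text.length && is_id(text.charAt(i))) {
        id += text.charAt(i++);
    }
    return { type: is_keyword(id) ? "kw" : "var", value: id };
}

console.log(read_ident_from("foo-bar 1"));  // { type: 'var', value: 'foo-bar' }
console.log(read_ident_from("lambda (x)")); // { type: 'kw', value: 'lambda' }
console.log(read_ident_from("lam"));        // { type: 'var', value: 'lam' }
```

The last call shows why the padding matters: without the surrounding spaces, "lam" would be found inside "lambda" and wrongly reported as a keyword.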

Here is all the code for TokenStream:

function TokenStream(input) {
    var current = null;
    // Space-delimited and padded, so is_keyword can match whole words.
    var keywords = " if then else lambda λ true false ";
    return {
        next  : next,
        peek  : peek,
        eof   : eof,
        croak : input.croak
    };
    function is_keyword(x) {
        return keywords.indexOf(" " + x + " ") >= 0;
    }
    function is_digit(ch) {
        return /[0-9]/i.test(ch);
    }
    function is_id_start(ch) {
        return /[a-zλ_]/i.test(ch);
    }
    function is_id(ch) {
        return is_id_start(ch) || "?!-<>=0123456789".indexOf(ch) >= 0;
    }
    function is_op_char(ch) {
        return "+-*/%=&|<>!".indexOf(ch) >= 0;
    }
    function is_punc(ch) {
        return ",;(){}[]".indexOf(ch) >= 0;
    }
    function is_whitespace(ch) {
        return " \t\n".indexOf(ch) >= 0;
    }
    // Consume characters for as long as the predicate accepts them.
    function read_while(predicate) {
        var str = "";
        while (!input.eof() && predicate(input.peek()))
            str += input.next();
        return str;
    }
    function read_number() {
        var has_dot = false;
        var number = read_while(function(ch){
            if (ch == ".") {
                if (has_dot) return false;
                has_dot = true;
                return true;
            }
            return is_digit(ch);
        });
        return { type: "num", value: parseFloat(number) };
    }
    function read_ident() {
        var id = read_while(is_id);
        return {
            type  : is_keyword(id) ? "kw" : "var",
            value : id
        };
    }
    function read_escaped(end) {
        var escaped = false, str = "";
        input.next();
        while (!input.eof()) {
            var ch = input.next();
            if (escaped) {
                str += ch;
                escaped = false;
            } else if (ch == "\\") {
                escaped = true;
            } else if (ch == end) {
                break;
            } else {
                str += ch;
            }
        }
        return str;
    }
    function read_string() {
        return { type: "str", value: read_escaped('"') };
    }
    function skip_comment() {
        read_while(function(ch){ return ch != "\n" });
        input.next();
    }
    function read_next() {
        read_while(is_whitespace);
        if (input.eof()) return null;
        var ch = input.peek();
        if (ch == "#") {
            skip_comment();
            return read_next();
        }
        if (ch == '"') return read_string();
        if (is_digit(ch)) return read_number();
        if (is_id_start(ch)) return read_ident();
        if (is_punc(ch)) return {
            type  : "punc",
            value : input.next()
        };
        if (is_op_char(ch)) return {
            type  : "op",
            value : read_while(is_op_char)
        };
        input.croak("Can't handle character: " + ch);
    }
    function peek() {
        return current || (current = read_next());
    }
    function next() {
        var tok = current;
        current = null;
        return tok || read_next();
    }
    function eof() {
        return peek() == null;
    }
}

  • next() does not always call read_next(): the token may already have been read by an earlier peek(). That is why we keep a current variable holding the peeked token; if it is set, next() returns it and clears it.
  • Only decimal numbers in the usual notation are supported: no scientific notation, no hexadecimal, no octal. If we ever need these, the changes are confined to read_number().
  • Unlike JavaScript, the only characters that have to be escaped with a backslash inside a string are the quote and the backslash itself. The usual escape sequences such as \n and \t are not interpreted.
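
The first note, about next() and the current variable, can be checked with a small experiment. The sketch below copies peek(), next() and eof() from TokenStream into a hypothetical wrapper, with_lookahead, and feeds it from a stand-in read_next that simply counts how often it is called; both names are assumptions made for this demo.

```javascript
// Hypothetical wrapper: adds peek/next/eof on top of any read_next function,
// using the same one-token buffer as TokenStream.
function with_lookahead(read_next) {
    var current = null;
    return { peek: peek, next: next, eof: eof };
    function peek() {
        return current || (current = read_next());
    }
    function next() {
        var tok = current;
        current = null;
        return tok || read_next();
    }
    function eof() {
        return peek() == null;
    }
}

// Stand-in token source that counts how often it is asked for a token.
var pending = [
    { type: "var", value: "a" },
    { type: "op", value: "+" },
    { type: "num", value: 1 }
];
var reads = 0;
function read_next() {
    reads++;
    return pending.length ? pending.shift() : null;
}

var stream = with_lookahead(read_next);
var peeked_once = stream.peek();   // reads the first token
var peeked_twice = stream.peek();  // served from the buffer, no new read
var taken = stream.next();         // hands over the buffered token
console.log(peeked_once === peeked_twice, peeked_once === taken, reads); // true true 1
console.log(stream.next().value, reads); // + 2
```

Two peeks followed by a next cost a single read, and the following next() reads again because the buffer was cleared.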

The AST is covered in the next section.

The original link: lisperator.net/pltut/parse…