Original address: mukulrathi.co.uk/create-your…

Mukulrathi.co.uk /

Published: October 3, 2020 -8 minutes to read

Series. Create the BOLT compiler

  • Part 1: Why write your own programming language? Why write your own programming language?
  • Part 2: So how do you structure a compiler project?
  • Part 3: Write a Lexer and parser using OCamllex and Menhir.
  • Part 4: An easy-to-understand introduction to type theory and implementing type checkers.
  • Part 5: Tutorial on liveliness and alias data flow analysis.
  • Part 6: Desugaring– Simplifying our high-level language!
  • Part 7: Protobuf tutorials for OCaml and C++.
  • Part 8: A complete guide to LLVM for programming language creators.
  • Part 9: Implementing concurrency and our runtime
  • Part 10: Inheritance and method rewriting in Bolt.
  • Part 11: Generality — Adding polymorphism to Bolt.

The forthcoming


Now that we have decomposed our language into a simple “Bolt IR”, close to LLVM IR, we want to generate LLVM IR, but there is a problem, our Bolt IR is an OCaml value, but we are using the native C++ API. But there is a problem, our Bolt IR is an OCaml value, but the LLVM API we use is the native C++ API. We can’t import this value directly into the C++ compiler background, so we need to serialize it into a language-independent representation of the data first.

Hey, everybody! If you’re reading this tutorial from Google and don’t care about the compiler, don’t worry! This tutorial does not cover anything related to the compiler (this is just an illustrative example). If you care about OCaml, the first half will be your thing, if you are coming for C++, you can skip the OCaml part.

Protocol buffer mode

A protocol buffer (also known as a protobuf) encodes data as a series of binary messages based on a given pattern. This pattern (written in the.proto file) captures your data structure.

Each message contains several fields.

The field is in the form of modifier type name = someIndex.

Each field has a modifier: required/optional/repeat. Repeated modifiers correspond to lists/vectors of that field, such as repeated PhoneNumber representing a value of type List .

For example, the following pattern defines a Person in the address book. Everyone must have a name and ID, and can choose to have an email address. They may have multiple phone numbers (and therefore duplicate PhoneNumber), such as their home, their cell phone and work.

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}
Copy the code

Notice that in the pattern of the Person message, we also number the fields (=1, =2, =3, =4). This allows us to add fields later without changing the existing field order (so you can continue parsing these fields without knowing about the new fields).

Every time we want to add a new “type” field, we define a schema (just like you define a class to add a new type of object). We can even define a schema in another schema. For example, in the schema for Person, we define the schema for the PhoneNumber field.

We can also define a schema that enumerates type PhoneType. This enumeration acts as a “label” for our phone number.

Now, just as you might want a field to be one of many options (specified by an enumeration type), you might want the message to contain exactly one of many fields.

For example, in our compiler, we might want to encode an identifier as one of two options. In this case, we don’t want to store data in index fields if we have already marked it as a variable.

type identifier =
    | Variable of string
    | ObjField of string * index (* name and field index *)
Copy the code

In our Protobuf, we can add fields for each of these options and use the keyword oneof to specify that only one field in the group is set at the same time.

Now, while this tells Protobuf that one of these values is set, we have no way of knowing which one is set. So we can introduce a label enumeration field and query for its value (Variable or ObjField) when deserializing the object.

message identifier { enum _tag { Variable_tag = 1; ObjField_tag = 2; } message _ObjField { required string objName = 1; required int64 fieldIndex = 2; } required _tag tag = 1; oneof value { string Variable = 2; _ObjField ObjField = 3; }}Copy the code

Note that since the string constructor Variable has only one parameter of type string, we can simply set that field to type String. We can also define a message _Variable with schema.

message identifier { enum _tag { Variable_tag = 1; ObjField_tag = 2; } message _ObjField { required string objName = 1; required int64 fieldIndex = 2; } required _tag tag = 1; oneof value { string Variable = 2; _ObjField ObjField = 3; }}Copy the code

Map OCaml type definitions to Protobuf schemas.

Now that we’ve looked at the identifier pattern, let’s flesh out the rest of the front end’s “Bolt IR “pattern.

If you notice, whenever we have an identifier-like variant type of OCaml, we have a straightforward formula to specify the corresponding schema. We are.

  • Adds a label field to specify the value’s constructor.
  • If a constructor has more than one argument, as in the case of ObjField, define a message schema. If the constructor has only one argument (Variable), we don’t need it.
  • Specify a oneof block containing the fields of each constructor.

We can write out all the patterns manually, but we have to rewrite them every time our language changes. However, we have a hidden weapon up our sleeve — a library that can help us with this task.

If you want to learn more about other message types, see the complete Guide to Protobuf Schema.

PPX derived Protobuf

OCaml allows libraries to extend their syntax using px hooks. The PPX library can use these hooks to preprocess files and extend language functionality. So we can mark our files with other information, such as which types need to be protobuf generated.

We can update our Dune build file and preprocess our IR_gen library with the protobuf library generated by PPX.

(library
 (name ir_gen)
 (public_name ir_gen)
 (libraries core fmt desugaring ast)
 (flags
  (:standard -w -39))
 (preprocess
  (pps ppx_deriving_protobuf bisect_ppx --conditional))
Copy the code

Note that the flags command suppresses the warning 39- “unused REC flag “- because code generated by the px_deriving_protobuf library causes many such warnings. I strongly encourage you to suppress these warnings as well.

In addition, Bisect_PPx is the tool we use to calculate test coverage. It has a — conditional flag, because if we don’t calculate test coverage, we don’t want to preprocess the file with coverage information.

It’s easy to tell PPX to derive a protobuf that we want to serialize a type definition — we just slap a [@@ Derived Protobuf] at the end of the type definition. For variant types, we must specify a [@key 1], [@key 2] for each variant.

For example, the identifier type definition in our.mli file.

type identifier =
  | Variable of string [@key 1]
  | ObjField of string * int [@key 2]
  [@@deriving protobuf]
Copy the code

We do the same thing in the.ml file, except we also specify the file to which the Protobuf pattern is to be written.

type identifier =
  | Variable of string [@key 1]
  | ObjField of string * int [@key 2]
[@@deriving protobuf {protoc= "../../frontend_ir.proto"}]
Copy the code

This path is relative to the SRC file frontend_ir.ml. Therefore, since the frontend_ir.ml file is in the repo SRC /frontend/ir_gen/ folder, the Protobuf mode file will be written to SRC /frontend_ir.proto. If you don’t specify a file path for the protoc argument, the.proto file will be written to the _build folder.

As an additional tip, px-derived protobuf libraries do not serialize lists of lists. Like you can’t have

type something = Foo of foo list list [@key 1] [@@deriving protobuf])
Copy the code

This is because if we have a message schema for Foo, it’s fairly straightforward to get a list of fields of type Foo — we use duplicate Foo in the field schema. But we can’t say that repeating foo over and over gives us a list of lists of type foo. So to solve this problem, you have to define another type here.

type list_of_foo = foo list [@@deriving protobuf]
(*note no @key since not a variant type *)

type something = Foo of list_of_foo list [@key 1] [@@deriving protobuf]
Copy the code

Finally, to serialize our Bolt IR using this pattern, PPX derived Protobuf provides us with the

_to_protobuf serialization function. We use this function to take a binary protobuf message and write it as a sequence of bytes to the output channel (out_chan).

let ir_gen_protobuf program out_chan = let protobuf_message = Protobuf.Encoder.encode_exn program_to_protobuf program in  output_bytes out_chan protobuf_messageCopy the code

In our overall Bolt compiler pipeline, if an.ir file is provided, I write the output as an.ir file, or as stdout.

. match compile_out_file with | Some file_name -> Out_channel.with_file file_name ~f:(fun file_oc -> ir_gen_protobuf program file_oc) | None -> ir_gen_protobuf program stdout )Copy the code

Automatically generated Protobuf Schema.

A detailed explanation of the mapping is provided in the README of the PpX-derived repository. We’ve already gone through an example of auto-generated patterns, so patterns shouldn’t be too unfamiliar. But I did make one small change. For (ObjField of String * int), I say the generated output is.

message _ObjField {
    required string objName = 1;
    required int64 fieldIndex = 2;
  }
Copy the code

Unfortunately, this is not entirely true. I added objName and fieldIndex for clarity, but the library doesn’t know what the semantics of string *int are, so it looks like this.

  message _ObjField {
    required string _0 = 1;
    required int64 _1 = 2;
  }
Copy the code

This is the disadvantage of auto-generated mode. Instead of objName and fieldIndex, you get generic field names _0, _1, 2, and so on. For type block_expr = expr list, you get the field name.

message block_expr {
  repeated expr _ = 1;
}
Copy the code

Decode Protobuf in C++

Ok, so far we have looked at the definition of a Protobuf pattern and how PPX Deriving Protobuf encodes messages and generates patterns. Now we need to decode it using C++. As with the encoding part, we don’t need to know the details of how a Protobuf represents a message in binary.

Generate a Protobuf deserialization file

The Protoc compiler automatically generates class and function definitions from the.proto file: this is exposed in the.pb.h header file. For frontend_ir.proto, the corresponding header file is frontend_ir.pb.h.

For Bolt, we use the build system Bazel to build the C++ compiler back end. Instead of running the Protoc compiler manually to get our.pb.h header file and then linking all the other generated files, we can take advantage of Bazel’s integration with Protoc and have Bazel run Protoc for us during build and link it in.

The Bazel build system works in many languages, such as Java, Dart, Python, and not just C++. So we specify two libraries — a language-independent proto_library and a specific C++ library (cc_proto) around that library.

load("@rules_cc//cc:defs.bzl", "cc_library")

# Convention:
# A cc_proto_library that wraps a proto_library named foo_proto
# should be called foo_cc_proto.
cc_proto_library(
    name = "frontend_ir_cc_proto",
    deps = [":frontend_ir_proto"],
    visibility = ["//src/llvm-backend/deserialise_ir:__pkg__"],

)

# Conventions:
# 1. One proto_library rule per .proto file.
# 2. A file named foo.proto will be in a rule named foo_proto.
proto_library(
    name = "frontend_ir_proto",
    srcs = ["frontend_ir.proto"],
)
Copy the code

We will cc_proto_library (…). Bazel will compile the Protobuf file for us as a build dependency file included in any file that needs to use the.bb.h file. In our case, this is our Deserialise_ir library.

Compilation process, our library is deserialise_ir library.

load("@rules_cc//cc:defs.bzl", "cc_library")

cc_library(
    name = "deserialise_ir",
    srcs =  glob(["*.cc"]),
    hdrs = glob(["*.h"]),
    visibility = ["//src/llvm-backend:__pkg__", "//src/llvm-backend/llvm_ir_codegen:__pkg__", "//tests/llvm-backend/deserialise_ir:__pkg__"],
    deps = ["//src:frontend_ir_cc_proto", "@llvm"]
)
Copy the code

Deserialization Protobuf serializes files

The frontend_ir.pb.h file defines a namespace frontend_ir, where each message is mapped to a class. Thus, for our Bolt program, represented by the program message in the frontend_ir.proto file, the corresponding Protobuf class is frontend_IR ::program.

To deserialize a message of a given type, we create an object of the corresponding class. We then call the ParseFromIstream method, which deserializes the message from the input stream and stores it in an object. This method returns a Boolean value indicating success/failure. We have defined our custom DeserialiseProtobufException to handle failure.

Frontend_ir::program deserialiseProtobufFile(std::string &filePath) { Frontend_ir::program program; std::fstream fileIn(filePath, std::ios::in | std::ios::binary); . if (! program.ParseFromIstream(&fileIn)) { throw DeserialiseProtobufException("Protobuf not deserialised from file."); } return program; }Copy the code

So we’re done!

Well, one thing to be aware of… This auto-generated class is a stupid data placeholder. We need to read the data from the program object.

Read Protobuf data from protobuf classes.

If you try to read the protoc generated file frontend_ir.pb.h, you will realize that it is a Garguantuan freak and not as good looking as the standard example in the official tutorial. So instead of trying to read methods from a file, explain the structure of the header file.

Tip: Make sure you have code completion set up in the IDE – this means you don’t need to manually search for the correct method in frontend_ir.pb.h.

We will use the schema defined below to deserialize the message.

package Frontend_ir; message expr { enum _tag { Integer_tag = 1; Boolean_tag = 2; Identifier_tag = 3; . IfElse_tag = 11; WhileLoop_tag = 12; Block_tag = 15; } message _Identifier { required identifier _0 = 1; optional lock_type _1 = 2; }... message _IfElse { required expr _0 = 1; required block_expr _1 = 2; required block_expr _2 = 3; } message _WhileLoop { required expr _0 = 1; required block_expr _1 = 2; }... required _tag tag = 1; oneof value { int64 Integer = 2; bool Boolean = 3; _Identifier Identifier = 4; . _IfElse IfElse = 12; _WhileLoop WhileLoop = 13; block_expr Block = 16; . } } message block_expr { repeated expr _ = 1; }Copy the code

Protobuf message classes and enumerations

Each Protobuf message definition maps to a class definition. Protobuf enumerations map to C++ enumerations. To give each class/enumeration a globally unique name, they are prefixed by the package name (Frontend_ir) and any classes they are nested with.

So message expr corresponds to the class Frontend_ir_expr, and the enum _tag nested in Message expr maps to Frontend_ir_expr__tag.

In REPO, bin_op messages also have a nested _tag enumeration that maps to Frontend_ir_bin_op__tag(note that this naming spacing means that it does not conflict with the _tag definition in expR messages).

This is true for any level of nesting, such as the nested message _IfElse in expR messages that maps to the class Frontend_ir_expr__IfElse.

A specific enumeration value for enumer_tag, such as Integer_tag can be called Frontend_ir_expr__tag_Integer_tag.

Protobuf does not write class \enums by concatenating package names with class names. Protobuf also has an alias for nested class types, so you can write these nested message classes as Frontend_ir::expr::_IfElse.

Access specific Protobuf fields

Each required field field_name of a Protobuf has a corresponding access method field_name()(where the field name is converted to lowercase). So the field tag in the EXPR message maps to the method tag() in the Frontend_ir::expr class, the field Integer maps to the method Integer (), IfElse maps to IfElse (), and so on.

NB: To reiterate, do not confuse field names, such as the type of field _IfElse in IfElse and Protobuf messages (note the underline before). This is only because OCaml PPX-derived Protobuf libraries give them these names — if we had written the Proto pattern manually, we could have chosen names that were less confusing.

For optional fields, we can access the field in the same way, but we also have a has_field_name() Boolean function to check if the field exists.

For duplicate fields, we have a field_name_size() function to query the number of items, and we can access item I using the field_name(I) method. So for the field name _1, the corresponding methods are _1_size() and _1(I). For the field name _ in the auto-generated schema, the corresponding methods are __size() and _(I).

Purify our front IR

Let’s do this by mapping our protobuf class directly to the C++ class defined by our OCaml type. Each OCaml expR variant maps to an abstract subclass of the ExprIR class.

std::unique_ptr<ExprIR> deserialiseExpr(const Frontend_ir::expr &expr){ switch (expr.tag()) { case Frontend_ir::expr__tag_IfElse_tag: return std::unique_ptr<ExprIR>(new ExprIfElseIR(expr.ifelse())); case Frontend_ir::expr__tag_WhileLoop_tag: return std::unique_ptr<ExprIR>(new ExprWhileLoopIR(expr.whileloop())); . }}Copy the code

I use STD ::unique_ptr all behind the compiler to avoid explicit memory management. You can also use standard Pointers — it’s just a matter of personal preference.

Remember how we had a specific label field to distinguish between the expR type variants. We have a switch statement on the value of the label field. This tag tells us which field in the oneof{} field is set. We then access the fields set accordingly, such as the IfElse_tag case of expr.ifelse().

std::unique_ptr<ExprIR> deserialiseExpr(const Frontend_ir::expr &expr){ switch (expr.tag()) { case Frontend_ir::expr__tag_IfElse_tag: return std::unique_ptr<ExprIR>(new ExprIfElseIR(expr.ifelse())); case Frontend_ir::expr__tag_WhileLoop_tag: return std::unique_ptr<ExprIR>(new ExprWhileLoopIR(expr.whileloop())); . }}Copy the code

View message modes for _IfElse and block_expr messages.

  message _IfElse {
    required expr _0 = 1;
    required block_expr _1 = 2;
    required block_expr _2 = 3;
  }

  message block_expr {
  repeated expr _ = 1;
}
Copy the code

Our constructor reads each field of the _IfElse message in turn, and then we can use our deserialiseExpr to directly cross-string the _0 field. Then we can use our deserialiseExpr to deserialize the _0 field directly. However, because the message block _expr has a repeating field _, we must iterate over the list of EXPR messages in that field in the for loop.

ExprIfElseIR::ExprIfElseIR(const Frontend_ir::expr::_IfElse &expr) { Frontend_ir::expr condMsg = expr._0(); Frontend_ir::block_expr thenBlockMsg = expr._1(); Frontend_ir::block_expr elseBlockMsg = expr._2(); condExpr = deserialiseExpr(condMsg); for (int i = 0; i < thenBlockMsg.__size(); i++) { thenExpr.push_back(deserialiseExpr(thenBlockMsg._(i))); } for (int i = 0; i < elseBlockMsg.__size(); i++) { elseExpr.push_back(deserialiseExpr(elseBlockMsg._(i))); }}Copy the code

Other Bolt pattern deserialization code follows the same pattern.

conclusion

In this article, we have transitioned from our OCaml foreground to our foreground.


Translation via www.DeepL.com/Translator (free version)