Original article: blog.yossarian.net/2021/08/10/…

Original author: blog.yossarian.net

Published: August 10, 2021

Preface

Previously in this series: part one, an overview of LLVM's bitcode format.

In my last article, I gave a high-level overview of LLVM's bitcode format (and the underlying bitstream container format). The end result of that article was the release of llvm-bitcursor, which provides a key first abstraction for a pure-Rust bitcode parser.

This article covers the more concrete business of parsing the actual LLVM bitstream, and comes with another release announcement: llvm-bitstream.

Put together, llvm-bitcursor and llvm-bitstream get us about two-thirds of the way toward a pure-Rust parser for LLVM IR. The main remaining component is a "mapper" from the block-and-record representation in the bitstream to an actual IR-level representation (corresponding to llvm::Module, llvm::Function, &c. in the official C++ API).

Review: the bitstream format

A quick review: LLVM's bitstream format has two kinds of entities, blocks and records. A block provides the basic state needed to parse its constituent records (and sub-blocks); a record is a self-contained sequence of fields. At the bitstream level, a record's fields are just unsigned integers¹.

Each block is identified by a "block ID" that is globally unique (that is, unique across the entire bitstream being parsed), while each record is identified by a "record code" that is unique only within the kind of block that contains it².

Neither blocks nor records are byte-aligned (hence the "bit" in bitstream), nor are they guaranteed to have consistent sizes within a given bitstream: emitters are allowed to vary their use of variable-width integers for compression purposes, and consumers are expected to handle this.

Every entity in the bitstream (records and blocks alike!) begins with an abbreviation ID, which tells the parser how to interpret the entity that follows it. Abbreviation IDs fall into one of two groups (reserved or defined) and can be modeled with a pleasant Rust enum:

/// Abbreviation IDs that are reserved by LLVM.
#[repr(u64)]
pub enum ReservedAbbrevId {
    /// Identifies an `END_BLOCK` record.
    EndBlock = 0,
    /// Identifies an `ENTER_SUBBLOCK` record.
    EnterSubBlock,
    /// Identifies a `DEFINE_ABBREV` record.
    DefineAbbrev,
    /// Identifies an `UNABBREV_RECORD` record.
    UnabbrevRecord,
}

/// An abbreviation ID, whether reserved or defined by the stream itself.
pub enum AbbrevId {
    /// A reserved abbreviation ID.
    Reserved(ReservedAbbrevId),
    /// An abbreviation ID that's been defined within the stream.
    Defined(u64),
}

Reserved abbreviation IDs can be thought of as "bootstrapping" markers: they provide the minimum amount of structure needed for scope demarcation (ENTER_SUBBLOCK and END_BLOCK) and record definition (DEFINE_ABBREV and UNABBREV_RECORD). DEFINE_ABBREV in particular stretches the "bootstrapping" analogy a bit: it signals an "abbreviation definition" record, which in turn yields a defined abbreviation ID within a particular scope. More on record formats and scoping rules below.

Parsing the bitstream

To successfully and correctly parse an LLVM bitstream, we need to handle three discrete aspects of the container format:

  1. The formats in which individual records can appear.
  2. The scoping rules that determine how individual records are parsed.
  3. The parser's state machine and advancement mechanism.

Let's take each in turn.

Record formats

As mentioned above, every record begins with an abbreviation ID. More specifically, every record begins with either UNABBREV_RECORD (i.e., ID 3) or a defined abbreviation ID that maps to a DEFINE_ABBREV visible in the record's scope.

Records in the UNABBREV_RECORD format are referred to as unabbreviated records, while all other records are abbreviated records (because their format is defined by an abbreviation definition elsewhere in the stream, as described above). Unabbreviated records are simpler, so we'll start with them.

Unabbreviated records

The UNABBREV_RECORD format is an inefficient, generic format for situations where a custom record definition would be too complicated or wouldn't yield a meaningful size improvement (such as within BLOCKINFO blocks). A record in the UNABBREV_RECORD format looks like this:

[code:VBR6, numops:VBR6, op0:VBR6, op1:VBR6, ...]

(Here and below, the suffixes :VBR<N> and :<N> denote variable-width and fixed-width integers of N bits, respectively. See the LLVM documentation and the previous post for more details on VBRs.)

To break that down:

  • We read the record's code as a 6-bit VBR.
  • We read the number of "operands" (i.e., fields) in the record as another 6-bit VBR.
  • For each of the numops operands, we read another 6-bit VBR; this is the concrete field value for that operand.

For example, consider the following bit representation of an UNABBREV_RECORD:

[0b000100, 0b000011, 0b010000, 0b011111, 0b000001, 0b100000]

This defines a new record with code 4 (0b000100) and three fields:

  • Field 0: 16 (0b010000)
  • Field 1: 31 (0b011111)
  • Field 2: 32 ([0b000001, 0b100000])

Note that field 2 is 12 bits rather than 6, because the value 32 can't be represented in a single VBR6 chunk!
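To make that concrete, here is a minimal sketch of how a parser might read an UNABBREV_RECORD. The cursor interface here is hypothetical (a read_fixed/read_vbr/align32 trio invented for these sketches); llvm-bitcursor's real API may differ.

/// A hypothetical bitstream cursor interface; llvm-bitcursor's real API may differ.
trait BitRead {
    /// Reads a fixed-width integer of `width` bits.
    fn read_fixed(&mut self, width: u64) -> Result<u64, ParseError>;
    /// Reads a variable-width (VBR) integer built from `width`-bit chunks.
    fn read_vbr(&mut self, width: u64) -> Result<u64, ParseError>;
    /// Skips ahead to the next 32-bit boundary.
    fn align32(&mut self) -> Result<(), ParseError>;
}

#[derive(Debug)]
struct ParseError;

/// An unabbreviated record: a code plus a flat list of integer fields.
#[derive(Debug)]
struct UnabbrevRecord {
    code: u64,
    fields: Vec<u64>,
}

/// Parses an UNABBREV_RECORD body (everything after the abbreviation ID itself).
fn parse_unabbrev_record(cur: &mut impl BitRead) -> Result<UnabbrevRecord, ParseError> {
    // [code:VBR6, numops:VBR6, op0:VBR6, op1:VBR6, ...]
    let code = cur.read_vbr(6)?;
    let numops = cur.read_vbr(6)?;
    let mut fields = Vec::with_capacity(numops as usize);
    for _ in 0..numops {
        // Each operand is itself a VBR6, so values of 32 or more span
        // multiple 6-bit chunks (like the 12-bit field 2 above).
        fields.push(cur.read_vbr(6)?);
    }
    Ok(UnabbrevRecord { code, fields })
}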

DEFINE_ABBREV and abbreviated records

Now that we know how to parse unabbreviated records, we can get to the heart of the bitstream: abbreviation definitions, and the records that use them.

As mentioned in the previous article, LLVM’s bitstream container is essentially a self-describing format: Emitters are strongly encouraged to use DEFINE_ABBREV to produce compact bitstreams, rather than relying on the more generic and less compact UNABBREV_RECORD encoding.

To parse an abbreviated record, we need its corresponding abbreviation definition, i.e., the DEFINE_ABBREV in the appropriate scope that defines the structure we're about to parse. The actual scoping rules for obtaining this DEFINE_ABBREV are described further below, so in this section we'll simply assume it's available.

So, once we actually have a DEFINE_ABBREV in hand, it looks like this:

[numabbrevops:VBR5, abbrevop0, abbrevop1, ...]

That looks a lot like the UNABBREV_RECORD structure above! A few salient differences:

  • DEFINE_ABBREV does not specify a separate field for the record code. Instead, the first abbreviation operand conventionally specifies the record code.

  • Abbreviation operands are not concrete fields: they are symbolic operands that tell the bitstream parser how to parse the actual fields of any concrete record that uses this DEFINE_ABBREV's structure.

An apt analogy is the difference between a struct definition and any particular instance of that struct: where the UNABBREV_RECORD format is self-describing, an abbreviated record needs to refer back to its DEFINE_ABBREV to be parsed correctly.

Parsing the abbreviation operands happens in a loop over numabbrevops, much like the loop for UNABBREV_RECORD. Each operand has the following bit form:

[is_literal:1, ...]

…where is_literal is a 1-bit flag indicating whether the operand is a "literal" operand. When is_literal is 1, the operand is a literal of the following form:

[litvalue:VBR8]

Literal operands are an exception to the "concrete vs. symbolic" distinction above³: when a DEFINE_ABBREV has a literal operand, the corresponding record field has the same value in every record that uses that abbreviation. The obvious application of this is the record code, since we generally want most records that share an abbreviation to have the same code. However, LLVM does not actually mandate that only the first operand may be literal (or that it must be literal), so nothing prevents a bitstream emitter from using literals for other record fields as well.

In contrast, when is_literal is 0, the operand is an "encoded" operand, taking one of the following forms:

[encoding:3]
[encoding:3, value:VBR5]

Which of the two forms applies is determined by the value of encoding:

  • Fixed (1) and VBR (2): [encoding:3, value:VBR5]
    • For these encodings, value is the "extra data" needed for parsing: for Fixed, value is the number of bits in each fixed-width value, and for VBR, it is the number of bits per VBR chunk.
  • Array (3), Char6 (4), Blob (5): [encoding:3]
    • Char6 represents a 6-bit value that maps the [a-zA-Z0-9._] character space onto the numbers 0 through 63. LLVM presumably uses this as a compact encoding for identifiers that can be safely normalized to a Latin alphanumeric string.
    • Blob represents a blob of 8-bit (byte) values, of the form [count:VBR6, <align32bits>, b0:8, b1:8, ..., <align32bits>]. Each DEFINE_ABBREV may have at most one Blob operand, and it must be the last operand in the operand list.
    • Array is the complicated one. Like Blob, an Array may appear at most once per DEFINE_ABBREV. Unlike Blob, an Array must be the second-to-last operand in the operand list: to complete its parsing, the Array "steals" the last operand in the list and uses it as its element type. Not all operands are valid Array elements; in particular, only Fixed, VBR, and Char6 are valid. This (fortunately) means that nested arrays and arrays of blobs cannot be naively represented in the bitstream format⁴.

Note that encoding values 0, 6, and 7 are currently undefined. New encodings may be added to LLVM in future releases, and current parsers may fail when they encounter them⁵.
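One natural way to model these operand forms is as an enum. This is just a sketch of the shape, not necessarily llvm-bitstream's actual types:

/// One operand of a DEFINE_ABBREV. A literal operand carries a concrete value;
/// the encoded operands describe how to parse a concrete record field.
/// Sketch only; llvm-bitstream's real types may differ.
#[derive(Debug, Clone)]
enum AbbrevOperand {
    /// A fixed value shared by every record that uses this abbreviation.
    Literal(u64),
    /// A fixed-width integer of the given bit width (encoding 1).
    Fixed(u64),
    /// A variable-width integer with the given chunk width (encoding 2).
    Vbr(u64),
    /// An array whose element type is "stolen" from the operand that follows
    /// it in the definition; only Fixed, VBR, and Char6 are valid (encoding 3).
    Array(Box<AbbrevOperand>),
    /// A 6-bit character drawn from the [a-zA-Z0-9._] alphabet (encoding 4).
    Char6,
    /// A length-prefixed, 32-bit-aligned blob of bytes; must be last (encoding 5).
    Blob,
}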

Once fully parsed, each DEFINE_ABBREV tells us exactly how to parse any abbreviated record that uses it. For example, consider the following symbolic DEFINE_ABBREV:

[DEFINE_ABBREV, 0b00010, 0b1, 0b00001000, 0b0, 0b010, 0b01110]

Broken down:

  • We have two abbreviation operands (0b00010).
    • Our first operand is a literal operand (0b1) with a value of 8 (0b00001000).
      • Because it is the first operand, we treat it as the record code: any record defined with this abbreviation will have a code of 8⁶.
    • Our second operand is an encoded operand (0b0) of the VBR form (2, i.e. 0b010).
      • The VBR form always requires an extra value, in this case 14 (0b01110), so our VBR is a VBR14. Any concrete field we parse for this operand will be parsed as a VBR14.

To see how this DEFINE_ABBREV relates to an actual abbreviated record, let's assume that its abbreviation ID is 16 and the current abbreviation ID width is 6. Then the following bit string:

[0b010000, 0b00000011111111]

…parses as follows:

  • Read the abbreviation ID: 0b010000 (16) maps to our DEFINE_ABBREV, so we begin parsing with that abbreviation.
  • The first operand is Literal(8); literal operands consume no bits from the record itself.
  • The second operand is VBR(14); it parses to 0b00000011111111 (0xFF).

The parse result is something like Record(code: 8, fields: [0x8, 0xFF]). Even in this trivial case, the savings add up: each such record takes only 20 bits, compared with more than 24 bits in the UNABBREV_RECORD form⁷.

DEFINE_ABBREV and abbreviated records are complicated! But most of the complexity is conceptual rather than in the implementation: under the hood, they use exactly the same primitives (VBRs and fixed-width fields) as the rest of the bitstream. The only real complication is the extra layer of indirection.
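To show how little work is left once the definition is in hand, here is a sketch of parsing a concrete abbreviated record against its operand list, reusing the hypothetical BitRead cursor, ParseError, and AbbrevOperand types from the earlier sketches (llvm-bitstream's real code is structured differently, and element-type validation for Array is elided):

/// A parsed field value. Sketch only.
#[derive(Debug)]
enum Field {
    Unsigned(u64),
    Char6(u8),
    Array(Vec<Field>),
    Blob(Vec<u8>),
}

/// Parses one concrete record according to a DEFINE_ABBREV's operand list.
fn parse_abbrev_record(
    cur: &mut impl BitRead,
    operands: &[AbbrevOperand],
) -> Result<Vec<Field>, ParseError> {
    operands.iter().map(|op| parse_operand(cur, op)).collect()
}

fn parse_operand(cur: &mut impl BitRead, op: &AbbrevOperand) -> Result<Field, ParseError> {
    Ok(match op {
        // Literal operands consume no bits: their value comes from the definition.
        AbbrevOperand::Literal(value) => Field::Unsigned(*value),
        AbbrevOperand::Fixed(width) => Field::Unsigned(cur.read_fixed(*width)?),
        AbbrevOperand::Vbr(width) => Field::Unsigned(cur.read_vbr(*width)?),
        AbbrevOperand::Char6 => Field::Char6(cur.read_fixed(6)? as u8),
        // Arrays are a VBR6 element count followed by that many elements,
        // each parsed with the "stolen" element operand.
        AbbrevOperand::Array(element) => {
            let count = cur.read_vbr(6)?;
            let elems = (0..count)
                .map(|_| parse_operand(cur, element))
                .collect::<Result<Vec<_>, _>>()?;
            Field::Array(elems)
        }
        // Blobs are a VBR6 byte count, padding to 32 bits, the raw bytes,
        // and then padding to 32 bits again.
        AbbrevOperand::Blob => {
            let count = cur.read_vbr(6)?;
            cur.align32()?;
            let mut bytes = Vec::with_capacity(count as usize);
            for _ in 0..count {
                bytes.push(cur.read_fixed(8)? as u8);
            }
            cur.align32()?;
            Field::Blob(bytes)
        }
    })
}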

Scope rules and BLOCKINFO

In discussing abbreviation definitions, I glossed over an important detail: actually mapping a defined abbreviation ID to its corresponding abbreviation definition. This is where the bitstream's scoping rules come into play.

In short, the abbreviation ID that determines which DEFINE_ABBREV an abbreviated record maps to is calculated as follows:

  • At the start of a new block, we begin at abbreviation ID 4, which is the first defined (i.e., non-reserved) abbreviation ID.
  • If the new block has any DEFINE_ABBREVs registered for it via BLOCKINFO, those come first. For example, if a new block with ID 17 has three DEFINE_ABBREVs in BLOCKINFO, those definitions get abbreviation IDs 4, 5, and 6, respectively.
  • Finally, we add any DEFINE_ABBREVs that appear in the block itself. These are registered only after any BLOCKINFO definitions.

The only other scoping rule is scope closure: the abbreviations (and abbreviation IDs) defined for a block apply only to that block; neither its parent nor its child blocks may use the same abbreviation definition list. In other words, each block must compute its own set of valid abbreviation definitions (along with their IDs).
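In code, resolving a defined abbreviation ID within a block's scope might look roughly like this (a sketch with hypothetical names, not llvm-bitstream's API):

/// The first non-reserved abbreviation ID (0 through 3 are reserved).
const FIRST_DEFINED_ABBREV_ID: u64 = 4;

/// Placeholder for a fully parsed DEFINE_ABBREV.
struct AbbrevDefinition;

/// Maps a defined abbreviation ID (4, 5, 6, ...) onto the current block's
/// abbreviations: BLOCKINFO-registered definitions are numbered first,
/// followed by the definitions made inside the block itself.
fn resolve_abbrev<'a>(
    blockinfo_abbrevs: &'a [AbbrevDefinition],
    local_abbrevs: &'a [AbbrevDefinition],
    abbrev_id: u64,
) -> Option<&'a AbbrevDefinition> {
    let index = abbrev_id.checked_sub(FIRST_DEFINED_ABBREV_ID)? as usize;
    blockinfo_abbrevs.iter().chain(local_abbrevs.iter()).nth(index)
}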

Oh, but wait: there's one slightly ugly exception, which is BLOCKINFO itself. Within a BLOCKINFO block, a DEFINE_ABBREV does not apply to the BLOCKINFO block; instead, it applies to whichever block ID was most recently set by a special SETBID record. This means we have to run a small state machine to collect the abbreviation definitions that a BLOCKINFO block provides.

(There are other records that can appear in a BLOCKINFO block, but SETBID and DEFINE_ABBREV are currently the only ones required for correct bitstream parsing.)
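For what it's worth, that state machine is tiny. A minimal sketch, assuming the SETBID and DEFINE_ABBREV records have already been parsed into suitable (hypothetical) types:

use std::collections::HashMap;

/// Placeholder for a fully parsed DEFINE_ABBREV (see the earlier sketches).
struct AbbrevDefinition;

/// The small amount of state needed to interpret a BLOCKINFO block:
/// each DEFINE_ABBREV applies to whichever block ID the most recent
/// SETBID named, not to BLOCKINFO itself. Sketch only.
#[derive(Default)]
struct BlockInfoState {
    /// The block ID named by the most recent SETBID record, if any.
    current_target: Option<u64>,
    /// block_id -> abbreviation definitions registered for that block.
    registrations: HashMap<u64, Vec<AbbrevDefinition>>,
}

impl BlockInfoState {
    /// Handles a SETBID record: subsequent DEFINE_ABBREVs apply to `block_id`.
    fn set_bid(&mut self, block_id: u64) {
        self.current_target = Some(block_id);
    }

    /// Handles a DEFINE_ABBREV record seen inside BLOCKINFO.
    fn define_abbrev(&mut self, def: AbbrevDefinition) -> Result<(), &'static str> {
        // A DEFINE_ABBREV before any SETBID has no target and is malformed.
        let target = self.current_target.ok_or("DEFINE_ABBREV before SETBID")?;
        self.registrations.entry(target).or_default().push(def);
        Ok(())
    }
}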

The bitstream state machine

Now that we know how to interpret individual records and how to connect abbreviation definitions to abbreviated record forms, we can consider the overall task of actually parsing an entire bitstream.

While working on my own bitstream parser, I found it useful to think of the parser as a state machine with three pieces of top-level state:

  • A cursor C into the underlying bitstream.
  • A scope stack S, containing all of the state relevant to the current block/scope:
    • the current abbreviation ID width (S.top.abbrev_width);
    • the current block ID (S.top.block_id), or none if we have only just started parsing;
    • the BLOCKINFO target block ID (S.top.blockinfo_block_id), or none if we are not in a BLOCKINFO block;
    • any abbreviation definitions (S.top.abbrevs) that are valid for the current scope.
  • A BLOCKINFO map B of block_id -> [abbrev_definition], containing the abbreviation definitions from every BLOCKINFO definition encountered so far.
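Translated into a (hypothetical) Rust sketch, that top-level state might look like:

use std::collections::HashMap;

/// Placeholders for the sketch; llvm-bitstream's real types differ.
struct Cursor;
struct AbbrevDefinition;

/// Per-block scope state: one entry on the stack per currently open block.
struct Scope {
    /// The width, in bits, of abbreviation IDs within this block.
    abbrev_width: u64,
    /// The current block's ID (None before the first ENTER_SUBBLOCK).
    block_id: Option<u64>,
    /// If this scope is a BLOCKINFO block, the block ID last set by SETBID.
    blockinfo_block_id: Option<u64>,
    /// Abbreviation definitions valid within this scope.
    abbrevs: Vec<AbbrevDefinition>,
}

/// The three pieces of top-level parser state described above.
struct StreamParser {
    /// C: a cursor into the underlying bitstream.
    cursor: Cursor,
    /// S: the scope stack; the top entry describes the current block.
    scopes: Vec<Scope>,
    /// B: block_id -> abbreviation definitions registered via BLOCKINFO.
    blockinfo: HashMap<u64, Vec<AbbrevDefinition>>,
}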

Given this state, we can define the bitstream parser's initialization:

  1. Before parsing anything, push an initial scope onto S:

 { abbrev_width: 2, block_id: None, blockinfo_block_id: None, abbrevs: [] }

  2. Advance C by S.top.abbrev_width bits. The resulting value is the first abbreviation ID, and it must be ENTER_SUBBLOCK; any other ID indicates an invalid state.

  3. Use the ENTER_SUBBLOCK format to populate a new scope and push it onto S:

 [blockid:VBR8, newabbrevlen:VBR4, <align32bits>, blocklen:32]

In particular, the new scope contains:

 { abbrev_width: newabbrevlen, block_id: Some(blockid), blockinfo_block_id: None, abbrevs: [] }

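A sketch of that step, which also doubles as the ENTER_SUBBLOCK case of the main loop described next, reusing the hypothetical BitRead, ParseError, and Scope types from the earlier sketches:

/// Reads an ENTER_SUBBLOCK header and pushes the corresponding new scope.
fn enter_subblock(cur: &mut impl BitRead, scopes: &mut Vec<Scope>) -> Result<(), ParseError> {
    // [blockid:VBR8, newabbrevlen:VBR4, <align32bits>, blocklen:32]
    let block_id = cur.read_vbr(8)?;
    let new_abbrev_width = cur.read_vbr(4)?;
    cur.align32()?;
    // The block's length in 32-bit words; useful for skipping blocks, unused here.
    let _block_len = cur.read_fixed(32)?;

    scopes.push(Scope {
        abbrev_width: new_abbrev_width,
        block_id: Some(block_id),
        blockinfo_block_id: None,
        abbrevs: Vec::new(),
    });
    Ok(())
}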

Then, the main advancement loop:

  1. Populate the current scope's S.top.abbrevs with any abbreviation definitions in B that match S.top.block_id.

  2. If we are in a BLOCKINFO block, use the BLOCKINFO-specific state machine to (further) populate B⁸.

  3. Advance C by S.top.abbrev_width bits.

  • If the next entry is a record, parse it with the appropriate strategy (abbreviated or unabbreviated). Go to 3.

  • If the next entry is a new block (ENTER_SUBBLOCK), populate a new scope and push it onto S, just as during initialization. Go to 1.

  • If the next entry is the end of a block (END_BLOCK), pop the current scope off of S. Go to 3.

That's more or less the whole loop (ignoring scope state management and BLOCKINFO).
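Here is a rough sketch of that loop, again reusing the hypothetical types from the earlier sketches. The BLOCKINFO handling, abbreviation registration, and record parsing are left as todo!()s, and llvm-bitstream's real loop is structured differently:

/// The top-level advancement loop. Assumes an initial scope has already been
/// pushed onto `scopes`, per the initialization steps above. (A real
/// implementation also checks for end-of-stream at the initial scope.)
fn advance(cur: &mut impl BitRead, scopes: &mut Vec<Scope>) -> Result<(), ParseError> {
    // Step 1 (populating S.top.abbrevs from B) and step 2 (the BLOCKINFO
    // state machine) are elided here; see the sketches above.
    while let Some(width) = scopes.last().map(|scope| scope.abbrev_width) {
        // Step 3: read the next abbreviation ID at the current scope's width.
        match cur.read_fixed(width)? {
            // END_BLOCK: align to 32 bits, then pop the current scope.
            0 => {
                cur.align32()?;
                scopes.pop();
            }
            // ENTER_SUBBLOCK: read the block header and push a new scope
            // (see the enter_subblock sketch above), then repeat from step 1.
            1 => enter_subblock(cur, scopes)?,
            // DEFINE_ABBREV: parse the definition and register it in the
            // current scope (or via BLOCKINFO, if that's where we are).
            2 => todo!("parse and register a DEFINE_ABBREV"),
            // UNABBREV_RECORD: parse with the generic, self-describing format.
            3 => todo!("parse an UNABBREV_RECORD"),
            // Anything else is a defined abbreviation ID: resolve it against
            // the current scope and parse the record with its DEFINE_ABBREV.
            _ => todo!("resolve the abbreviation ID and parse the record"),
        }
    }
    Ok(())
}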

Putting it all together

Which brings me to the actual implementation. As far as I know, this is only the third LLVM bitstream parser to be published, and only the second implementation that is completely independent of the LLVM codebase (the other being Galois's llvm-pretty-bc-parser)⁹.

I've implemented all of the above in a new Rust library: llvm-bitstream.

You can try it out today by building the dump-bitstream example program that comes with it:

$ git clone https://github.com/woodruffw/mollusc && cd mollusc
$ cargo build --examples
$ cargo run --example dump-bitstream some-input.bc

The output looks like this (excerpted):

Entered bitstream; magic: 0xDEC04342
BLOCK 13 {
  RECORD { code: 1, values: [Array([Char6(37), Char6(37), Char6(47), Char6(38), Char6(53), Char6(52), Char6(62), Char6(52), Char6(62), Char6(52)])] }
  RECORD { code: 2, values: [Unsigned(0)] }
}
BLOCK 8 {
  RECORD { code: 1, values: [Unsigned(2)] }
  BLOCK 17 {
    RECORD { code: 1, values: [Unsigned(6)] }
    RECORD { code: 7, values: [Unsigned(32)] }
    RECORD { code: 21, values: [Unsigned(0), Array([Unsigned(0), Unsigned(0), Unsigned(0)])] }
    RECORD { code: 8, values: [Unsigned(1), Unsigned(0)] }
    RECORD { code: 16, values: [] }
    RECORD { code: 8, values: [Unsigned(0), Unsigned(0)] }
    RECORD { code: 2, values: [] }
  }

... snip ...

The current output isn't pretty or especially useful, but it demonstrates full consumption of the stream itself, including the internal interpretation of BLOCKINFO blocks and the handling of abbreviations. None of those parsing details leak into the public API, which means users don't need to worry about them when consuming a bitstream: blocks and records are surfaced above their raw bit-level representations, preventing any accidental dependence on the particular layout decisions made by a bitstream emitter. All of this takes only about 600 lines of Rust!

From here, there's plenty of work left to do:

  • The records emitted by llvm-bitstream are essentially content- and context-unaware, meaning they don't know what they represent (a function, a basic block, and so on). That calls for another, higher-level library that maps individual bitstream blocks and records onto specific, specialized structures for each feature of LLVM's IR.

  • Similarly, the bitstream's other, non-IR features need to be mapped. These include other top-level bitstream blocks that contain important metadata, debugging information, and string tables.

  • Testing. LLVM's IR is quite large, and there are probably edge cases I haven't considered. In addition to unit tests for individual parser capabilities, two comprehensive testing strategies make sense:

    • Equivalence testing against the output of llvm-bcanalyzer --dump, which emits a pseudo-XML format that should be easy to replicate.

    • Fuzzing with random LLVM IR generated by llvm-stress and converted to bitcode with llvm-as.

Next steps: all of the above. Stay tuned for another post in this series, possibly next month!


  1. This isn't entirely true, since abbreviated records can convey higher-level semantic and structural information (such as a field being Char6 rather than a 6-bit fixed-width integer). But that information isn't strictly necessary for writing a correct bitstream parser; it just tells us a bit more about the expected structure of the final record.

  2. In other words: within a given bitstream, block ID #8 will always refer to the same "kind" of block, but record code #4 may refer to different kinds of records depending on the ID of the enclosing block.

  3. This is one of the few places where LLVM's bitstream format is conceptually very dense, arguably without much payoff.

  4. Another place where the bitstream format is surprisingly convoluted: why do Array operands "steal" the operand that follows them, instead of embedding their element operand with a different encoding? I assume the LLVM developers had a reason for this, but it made my head spin while writing the parser.

  5. A related wish: it would be nice if signed VBRs had their own encoding, since that would make their use in records genuinely self-describing, rather than requiring the consumer to know that a particular record code in a particular block happens to use them instead of "normal" VBRs. As it stands, the code that's ultimately responsible for mapping records to IR objects will need to special-case these. I assume there's a good historical or engineering reason why the bitstream format doesn't include this, though.

  6. This, too, is context-dependent: a code of 8 means one thing in one block ID and something else in another.

  7. The UNABBREV_RECORD form would be a six-bit code, a six-bit operand count, and then one or more six-bit VBR chunks for each concrete operand (since we can't use the better-suited 14-bit VBR form). In the naive case, we'd expect each such record to take roughly 24 to 30 bits, depending on the distribution of values.

  8. As far as I can tell, there's no rule against a bitstream containing more than one BLOCKINFO block. I can't think of a reason to do that, but it doesn't seem to be prohibited, and it has well-defined semantics (the abbreviations are presumably registered in order of appearance).

  9. I'd have loved to use that implementation as a point of comparison, but my Haskell is nowhere near good enough.

