X-Git-Url: http://demsky.eecs.uci.edu/git/?a=blobdiff_plain;f=docs%2FBitCodeFormat.html;h=4adf75e91b56f39649cf38f1b2bd57abf8c1385d;hb=95df6b3603e228cea714be21997fec82cb03011e;hp=0579a42115ebb53a70f2a4ccbf923e6afad62605;hpb=2c1ce4f28ec3b9d09b20b7cba83acf65756bd615;p=oota-llvm.git diff --git a/docs/BitCodeFormat.html b/docs/BitCodeFormat.html index 0579a42115e..4adf75e91b5 100644 --- a/docs/BitCodeFormat.html +++ b/docs/BitCodeFormat.html @@ -1,59 +1,651 @@ - + LLVM Bitcode File Format -
LLVM Bitcode File Format
  1. Abstract
  2. -
  3. Concepts
  4. +
  5. Overview
  6. +
  7. Bitstream Format +
      +
    1. Magic Numbers
    2. +
    3. Primitives
    4. +
    5. Abbreviation IDs
    6. +
    7. Blocks
    8. +
    9. Data Records
    10. +
    11. Abbreviations
    12. +
    13. Standard Blocks
    14. +
    +
  8. +
  9. LLVM IR Encoding +
      +
    1. Basics
    2. +
    +
-

Written by Reid Spencer and - Chris Lattner. +

Written by Chris Lattner + and Joshua Haberman.

+ -
Abstract
+
Abstract
+
-

This document describes the LLVM bitcode file format. It specifies -the binary encoding rules of the bitcode file format so that -equivalent systems can encode bitcode files correctly. The LLVM -bitcode representation is used to store the intermediate -representation on disk in a compacted form.

-

This document supercedes the LLVM bytecode file format for the 2.0 -release.

+ +

This document describes the LLVM bitstream file format and the encoding of +the LLVM IR into it.

+
+ -
Concepts
+
Overview
+
-

This section describes the general concepts of the bitcode file -format without getting into specific layout details. It is recommended -that you read this section thoroughly before interpreting the detailed -descriptions.

+ +

+What is commonly known as the LLVM bitcode file format (also, sometimes +anachronistically known as bytecode) is actually two things: a bitstream container format +and an encoding of LLVM IR into the container format.

+ +

+The bitstream format is an abstract encoding of structured data, very +similar to XML in some ways. Like XML, bitstream files contain tags, and nested +structures, and you can parse the file without having to understand the tags. +Unlike XML, the bitstream format is a binary encoding, and unlike XML it +provides a mechanism for the file to self-describe "abbreviations", which are +effectively size optimizations for the content.

+ +

This document first describes the LLVM bitstream format, then describes the +record structure used by LLVM IR files. +

+ +
+ + +
Bitstream Format
+ + +
+ +

+The bitstream format is literally a stream of bits, with a very simple +structure. This structure consists of the following concepts: +

+ + + +

Note that the llvm-bcanalyzer tool can be +used to dump and inspect arbitrary bitstreams, which is very useful for +understanding the encoding.

+ +
+ + +
Magic Numbers +
+ +
+ +

The first two bytes of a bitcode file are 'BC' (0x42, 0x43). +The second two bytes are an application-specific magic number. Generic +bitcode tools can look at only the first two bytes to verify the file is +bitcode, while application-specific programs will want to look at all four.
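
The two-level check described above can be sketched as follows. This is a minimal illustration, not LLVM's own code; the function names `looks_like_bitcode` and `app_magic` are hypothetical.

```python
def looks_like_bitcode(data: bytes) -> bool:
    """Generic check: examine only the first two bytes, 'B' and 'C'."""
    return len(data) >= 2 and data[0] == 0x42 and data[1] == 0x43


def app_magic(data: bytes) -> bytes:
    """Return the two application-specific magic bytes that follow 'BC'."""
    if len(data) < 4 or not looks_like_bitcode(data):
        raise ValueError("not a bitcode stream with a full 4-byte magic")
    return data[2:4]
```

A generic tool would stop after `looks_like_bitcode`; an application-specific reader would additionally compare `app_magic` against the value it expects.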

+ +
+ + +
Primitives +
+ +
+ +

+A bitstream literally consists of a stream of bits, which are read in order +starting with the least significant bit of each byte. The stream is made up of a +number of primitive values that encode a stream of unsigned integer values. +These +integers are encoded in two ways: either as Fixed +Width Integers or as Variable Width +Integers. +

+ +
+ + +
Fixed Width Integers +
+ +
+ +

Fixed-width integer values have their low bits emitted directly to the file. + For example, a 3-bit integer value encodes 1 as 001. Fixed width integers + are used when there are a well-known number of options for a field. For + example, boolean values are usually encoded with a 1-bit wide integer. +
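
As a rough sketch of reading fixed-width fields, the reader below consumes bits starting from the least significant bit of each byte. It assumes that the first bit read becomes the low bit of the resulting value; the class name `BitReader` and its methods are hypothetical, not an LLVM API.

```python
class BitReader:
    """Reads bits from a byte string, least significant bit of each byte first."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, counted in bits

    def read_fixed(self, width: int) -> int:
        """Read a fixed-width unsigned integer; earlier bits are lower bits."""
        value = 0
        for i in range(width):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (self.pos & 7)) & 1
            value |= bit << i
            self.pos += 1
        return value
```

For instance, with the single byte `0b00001001`, a 3-bit read followed by a 1-bit read consumes the low three bits and then the fourth bit.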

+ +
+ + +
Variable Width +Integers
+ +
+ +

Variable-width integer (VBR) values encode values of arbitrary size, +optimizing for the case where the values are small. Given a 4-bit VBR field, +any 3-bit value (0 through 7) is encoded directly, with the high bit set to +zero. Values larger than N-1 bits emit their bits in a series of N-1 bit +chunks, where all but the last set the high bit.

+ +

For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a +vbr4 value. The first set of four bits indicates the value 3 (011) with a +continuation piece (indicated by a high bit of 1). The next set of four bits indicates a +value of 24 (011 << 3) with no continuation. The sum (3+24) yields the value +27. +
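
The chunking rule can be sketched in a few lines. The helpers below work on lists of N-bit chunks rather than on a packed stream, purely to keep the example short; the names `vbr_chunks` and `vbr_value` are hypothetical.

```python
def vbr_chunks(value: int, width: int) -> list:
    """Split an unsigned value into width-bit VBR chunks, low bits first."""
    if value < 0 or width < 2:
        raise ValueError("VBR needs a non-negative value and width >= 2")
    payload_bits = width - 1
    high_bit = 1 << payload_bits
    chunks = []
    while value >= high_bit:
        # More bits remain: emit the low payload bits with the high bit set.
        chunks.append((value & (high_bit - 1)) | high_bit)
        value >>= payload_bits
    chunks.append(value)  # last chunk has the high bit clear
    return chunks


def vbr_value(chunks: list, width: int) -> int:
    """Reassemble the value from its VBR chunks."""
    payload_bits = width - 1
    mask = (1 << payload_bits) - 1
    value = 0
    for i, chunk in enumerate(chunks):
        value |= (chunk & mask) << (i * payload_bits)
    return value
```

With these definitions, `vbr_chunks(27, 4)` yields the two chunks `0b1011` and `0b0011`, matching the worked example.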

+ +
+ + +
6-bit characters
+ +
+ +

6-bit characters encode common characters into a fixed 6-bit field. They +represent the following characters with the following 6-bit values:

+ + + +

This encoding is only suitable for encoding characters and strings that +consist only of the above characters. It is completely incapable of encoding +characters not in the set.
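
The character table itself is not reproduced in this text, so the sketch below assumes the mapping used by LLVM's char6 encoding: 'a'-'z' map to 0-25, 'A'-'Z' to 26-51, '0'-'9' to 52-61, '.' to 62 and '_' to 63. This is consistent with the "abcd" example later in this document, where the encoded values are 0, 1, 2, 3. The names `CHAR6_ALPHABET`, `is_char6` and `encode_char6` are hypothetical.

```python
import string

# Assumed char6 alphabet: index in this string is the 6-bit value.
CHAR6_ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"
)


def is_char6(text: str) -> bool:
    """True if every character of text can be represented as char6."""
    return all(ch in CHAR6_ALPHABET for ch in text)


def encode_char6(text: str) -> list:
    """Return the 6-bit values for text, or raise if a character is outside the set."""
    if not is_char6(text):
        raise ValueError("string contains characters outside the char6 set")
    return [CHAR6_ALPHABET.index(ch) for ch in text]
```

A writer would typically test a string with `is_char6` first and fall back to a wider encoding when the test fails.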

+ +
+ + +
Word Alignment
+ +
+ +

Occasionally, it is useful to emit zero bits until the bitstream is a +multiple of 32 bits. This ensures that the bit position in the stream can be +represented as a multiple of 32-bit words.
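
The alignment step amounts to rounding the current bit position up to the next multiple of 32. A minimal sketch, with hypothetical helper names:

```python
def align_to_32(bit_pos: int) -> int:
    """Round a bit position up to the next multiple of 32 (no-op if aligned)."""
    return (bit_pos + 31) & ~31


def padding_bits(bit_pos: int) -> int:
    """Number of zero bits a writer would emit to reach 32-bit alignment."""
    return align_to_32(bit_pos) - bit_pos
```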

+ +
+ + + +
Abbreviation IDs +
+ +
+ +

+A bitstream is a sequential series of Blocks and +Data Records. Both of these start with an +abbreviation ID encoded as a fixed-bitwidth field. The width is specified by +the current block, as described below. The value of the abbreviation ID +specifies either a builtin ID (which has a special meaning, defined below) or +one of the abbreviation IDs defined by the stream itself. +

+ +

+The set of builtin abbrev IDs is: +

+ + + +

Abbreviation IDs 4 and above are defined by the stream itself, and specify +an abbreviated record encoding.

+ +
+ + +
Blocks +
+ +
+ +

+Blocks in a bitstream denote nested regions of the stream, and are identified by +a content-specific id number (for example, LLVM IR uses an ID of 12 to represent +function bodies). Block IDs 0-7 are reserved for standard blocks +whose meaning is defined by Bitcode; block IDs 8 and greater are +application specific. Nested blocks capture the hierarchical structure of the data +encoded in the stream, and various properties are associated with blocks as the file is +parsed. Block definitions allow the reader to efficiently skip blocks +in constant time if the reader wants a summary of blocks, or if it wants to +efficiently skip data it does not understand. The LLVM IR reader uses this +mechanism to skip function bodies, lazily reading them on demand. +

+ +

+When reading and encoding the stream, several properties are maintained for the +block. In particular, each block maintains: +

+ +
    +
  1. A current abbrev id width. This value starts at 2, and is set every time a + block record is entered. The block entry specifies the abbrev id width for + the body of the block.
  2. + +
  3. A set of abbreviations. Abbreviations may be defined within a block, in + which case they are only defined in that block (neither subblocks nor + enclosing blocks see the abbreviation). Abbreviations can also be defined + inside a BLOCKINFO block, in which case they are + defined in all blocks that match the ID that the BLOCKINFO block is describing. +
  4. +
+ +

As sub-blocks are entered, these properties are saved and the new sub-block +has its own set of abbreviations, and its own abbrev id width. When a sub-block +is popped, the saved values are restored.
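
The save-and-restore behaviour can be pictured as a small stack. The class below is a sketch under the rules stated above (initial width of 2, abbreviations not inherited by sub-blocks, BLOCKINFO abbreviations pre-loaded); the name `BlockScopes` and its methods are hypothetical.

```python
class BlockScopes:
    """Tracks the per-block abbrev id width and abbreviation list."""

    def __init__(self):
        self.abbrev_width = 2  # width in effect before any block is entered
        self.abbrevs = []      # abbreviations visible in the current block
        self._saved = []       # stack of (width, abbrevs) for enclosing blocks

    def enter(self, new_abbrev_width, blockinfo_abbrevs=()):
        """Enter a sub-block: save the current state and start a fresh one."""
        self._saved.append((self.abbrev_width, self.abbrevs))
        self.abbrev_width = new_abbrev_width
        # Only abbreviations registered via BLOCKINFO for this block ID carry in.
        self.abbrevs = list(blockinfo_abbrevs)

    def leave(self):
        """Leave the current block: restore the enclosing block's state."""
        self.abbrev_width, self.abbrevs = self._saved.pop()
```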

+ +
+ + +
ENTER_SUBBLOCK +Encoding
+ +
+ +

[ENTER_SUBBLOCK, blockid_vbr8, newabbrevlen_vbr4, + <align32bits>, blocklen_32]

+ +

+The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block record. +The blockid value is encoded as an 8-bit VBR identifier, and indicates +the type of block being entered (which can be a standard +block or an application-specific block). The +newabbrevlen value is a 4-bit VBR which specifies the +abbrev id width for the sub-block. The blocklen is a 32-bit aligned +value that specifies the size of the subblock, in 32-bit words. This value +allows the reader to skip over the entire block in one jump. +
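
The following sketch reads the fields of an ENTER_SUBBLOCK header and then skips the block body in one jump. It assumes the abbreviation ID has already been consumed by the caller, and that bits are read least significant first with earlier bits forming lower bits of each value. `BitReader`, `read_enter_subblock` and `skip_block` are hypothetical names, not LLVM APIs.

```python
class BitReader:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read_fixed(self, width: int) -> int:
        value = 0
        for i in range(width):
            bit = (self.data[self.pos >> 3] >> (self.pos & 7)) & 1
            value |= bit << i
            self.pos += 1
        return value

    def read_vbr(self, width: int) -> int:
        payload_bits = width - 1
        value = shift = 0
        while True:
            chunk = self.read_fixed(width)
            value |= (chunk & ((1 << payload_bits) - 1)) << shift
            if not chunk >> payload_bits:  # high bit clear: last chunk
                return value
            shift += payload_bits

    def align32(self):
        self.pos = (self.pos + 31) & ~31


def read_enter_subblock(reader):
    """Read blockid (vbr8), newabbrevlen (vbr4), align, then blocklen (32 bits)."""
    block_id = reader.read_vbr(8)
    new_abbrev_width = reader.read_vbr(4)
    reader.align32()
    block_len_words = reader.read_fixed(32)
    return block_id, new_abbrev_width, block_len_words


def skip_block(reader, block_len_words):
    """Jump past the block body without decoding it."""
    reader.pos += 32 * block_len_words
```

Because the length is stored right after the alignment point, a reader that does not care about a block needs only these three reads and one addition to move past it.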

+ +
+ + +
END_BLOCK +Encoding
+ +
+ +

[END_BLOCK, <align32bits>]

+ +

+The END_BLOCK abbreviation ID specifies the end of the current block record. +Its end is aligned to 32 bits to ensure that the size of the block is a +multiple of 32 bits.

+ +
+ + + + +
Data Records +
+ +
+

+Data records consist of a record code and a number of (up to) 64-bit integer +values. The interpretation of the code and values is application specific and +there are multiple different ways to encode a record (with an unabbrev record +or with an abbreviation). In the LLVM IR format, for example, there is a record +which encodes the target triple of a module. The code is MODULE_CODE_TRIPLE, +and the values of the record are the ASCII codes for the characters in the +string.

+ +
+ + +
UNABBREV_RECORD +Encoding
+ +
+ +

[UNABBREV_RECORD, code_vbr6, numops_vbr6, + op0_vbr6, op1_vbr6, ...]

+ +

An UNABBREV_RECORD provides a default fallback encoding, which is both +completely general and also extremely inefficient. It can describe an arbitrary +record, by emitting the code and operands as vbrs.

+ +

For example, emitting an LLVM IR target triple as an unabbreviated record +requires emitting the UNABBREV_RECORD abbrevid, a vbr6 for the +MODULE_CODE_TRIPLE code, a vbr6 for the length of the string (which is equal to +the number of operands), and a vbr6 for each character. Since there are no +letters with value less than 32, each letter would need to be emitted as at +least a two-part VBR, which means that each letter would require at least 12 +bits. This is not an efficient encoding, but it is fully general.
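
The cost estimate above can be checked with a short calculation. The sketch below counts bits only and does not produce a stream; the record code 2 for the triple and the abbrev id width of 3 are taken from the abbreviation example later in this document, and the function names are hypothetical.

```python
def vbr_bits(value: int, width: int) -> int:
    """Number of bits needed to emit value as a VBR of the given chunk width."""
    payload_bits = width - 1
    chunks = 1
    while value >= (1 << payload_bits):
        value >>= payload_bits
        chunks += 1
    return chunks * width


def unabbrev_record_bits(code: int, operands: list, abbrev_width: int) -> int:
    """Bits for [UNABBREV_RECORD, code, numops, op0, op1, ...], all vbr6."""
    total = abbrev_width
    total += vbr_bits(code, 6)
    total += vbr_bits(len(operands), 6)
    total += sum(vbr_bits(op, 6) for op in operands)
    return total
```

For the four-character string "abcd", this gives 3 + 6 + 6 + 4 * 12 = 63 bits, since every ASCII letter needs two vbr6 chunks.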

+ +
+ + +
Abbreviated Record +Encoding
+ +
+ +

[<abbrevid>, fields...]

+ +

An abbreviated record is an abbreviation id followed by a set of fields that +are encoded according to the abbreviation +definition. This allows records to be encoded significantly more densely +than records encoded with the UNABBREV_RECORD +type, and allows the abbreviation types to be specified in the stream itself, +which allows the files to be completely self describing. The actual encoding +of abbreviations is defined below. +

+ +
+ + +
Abbreviations +
+ +
+

+Abbreviations are an important form of compression for bitstreams. The idea is +to specify a dense encoding for a class of records once, then use that encoding +to emit many records. It takes space to emit the encoding into the file, but +the space is recouped (hopefully plus some) when the records that use it are +emitted. +

+ +

+Abbreviations can be determined dynamically per client, per file. Since the +abbreviations are stored in the bitstream itself, different streams of the same +format can contain different sets of abbreviations if the specific stream does +not need them. As a concrete example, LLVM IR files usually emit an abbreviation +for binary operators. If a specific LLVM module contained no or few binary +operators, the abbreviation does not need to be emitted. +

+
+ + +
DEFINE_ABBREV + Encoding
+ +
+ +

[DEFINE_ABBREV, numabbrevops_vbr5, abbrevop0, abbrevop1, + ...]

+ +

A DEFINE_ABBREV record adds an abbreviation to the list of currently +defined abbreviations in the scope of this block. This definition only +exists inside this immediate block -- it is not visible in subblocks or +enclosing blocks. +Abbreviations are implicitly assigned IDs +sequentially starting from 4 (the first application-defined abbreviation ID). +Any abbreviations defined in a BLOCKINFO record receive IDs first, in order, +followed by any abbreviations defined within the block itself. +Abbreviated data records reference this ID to indicate what abbreviation +they are invoking.
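
The ID assignment order described above can be sketched as follows. Abbreviations are represented here by arbitrary placeholder objects; the function name `assign_abbrev_ids` is hypothetical.

```python
FIRST_APPLICATION_ABBREV_ID = 4  # IDs 0-3 are the builtin abbreviation IDs


def assign_abbrev_ids(blockinfo_abbrevs, local_abbrevs):
    """Map abbreviation IDs to definitions for one block.

    Abbreviations registered through BLOCKINFO come first, in order,
    followed by those defined inside the block itself.
    """
    ordered = list(blockinfo_abbrevs) + list(local_abbrevs)
    return {
        FIRST_APPLICATION_ABBREV_ID + index: abbrev
        for index, abbrev in enumerate(ordered)
    }
```

One consequence is that the ID of a locally defined abbreviation depends on how many BLOCKINFO abbreviations exist for that block ID, so a reader has to process BLOCKINFO before decoding abbreviated records.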

+ +

An abbreviation definition consists of the DEFINE_ABBREV abbrevid followed +by a VBR that specifies the number of abbrev operands, then the abbrev +operands themselves. Abbreviation operands come in three forms. They all start +with a single bit that indicates whether the abbrev operand is a literal operand +(when the bit is 1) or an encoding operand (when the bit is 0).

+ +
    +
  1. Literal operands - [1_1, litvalue_vbr8] - +Literal operands specify that the value in the result +is always a single specific value. This specific value is emitted as a vbr8 +after the bit indicating that it is a literal operand.
  2. +
  3. Encoding info without data - [0_1, encoding_3] + - Operand encodings that do not have extra data are just emitted as their code. +
  4. +
  5. Encoding info with data - [0_1, encoding_3, +value_vbr5] - Operand encodings that do have extra data are +emitted as their code, followed by the extra data. +
  6. +
+ +

The possible operand encodings are:

+ + + +

For example, target triples in LLVM modules are encoded as a record of the +form [TRIPLE, 'a', 'b', 'c', 'd']. Consider if the bitstream emitted +the following abbrev entry:

+ + + +

When emitting a record with this abbreviation, the above entry would be +emitted as:

+ +

[4_abbrevwidth, 2_4, 4_vbr6, + 0_6, 1_6, 2_6, 3_6]

+ +

These values are:

+ +
    +
  1. The first value, 4, is the abbreviation ID for this abbreviation.
  2. +
  3. The second value, 2, is the code for TRIPLE in LLVM IR files.
  4. +
  5. The third value, 4, is the length of the array.
  6. +
  7. The rest of the values are the char6 encoded values for "abcd".
  8. +
+ +

With this abbreviation, the triple is emitted with only 37 bits (assuming an +abbrev id width of 3). Without the abbreviation, significantly more space would +be required to emit the target triple. Also, since the TRIPLE value is not +emitted as a literal in the abbreviation, the abbreviation can also be used for +any other string value. +
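
The 37-bit figure can be reproduced by listing each emitted field with its width. The sketch below assumes the field layout implied by the example record: the abbreviation ID at the abbrev id width, the record code as a 4-bit fixed field, the array length as a vbr6 (one chunk for lengths below 32), and each character as a 6-bit char6 value. It handles lowercase letters only, and the function names are hypothetical.

```python
def abbreviated_triple_fields(text, abbrev_id=4, abbrev_width=3, code=2):
    """Return (value, bit_width) pairs for the abbreviated triple record."""
    if not (text.islower() and text.isascii() and text.isalpha()):
        raise ValueError("this sketch only handles lowercase ASCII letters")
    if len(text) >= 32:
        raise ValueError("this sketch assumes the length fits in one vbr6 chunk")
    fields = [(abbrev_id, abbrev_width), (code, 4), (len(text), 6)]
    # 'a' encodes as 0, 'b' as 1, and so on, matching the values listed above.
    fields += [(ord(ch) - ord("a"), 6) for ch in text]
    return fields


def total_bits(fields):
    """Sum the widths of all fields."""
    return sum(width for _, width in fields)
```

For "abcd" this gives 3 + 4 + 6 + 4 * 6 = 37 bits.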

+ +
+ + +
Standard Blocks +
+ +
+ +

+In addition to the basic block structure and record encodings, the bitstream +also defines specific builtin block types. These block types specify how the +stream is to be decoded or other metadata. In the future, new standard blocks +may be added. Block IDs 0-7 are reserved for standard blocks. +

+ +
+ + +
#0 - BLOCKINFO +Block
+ +
+ +

The BLOCKINFO block allows the description of metadata for other blocks. The + currently specified records are:

+ + + +

+The SETBID record indicates which block ID is being described. SETBID +records can occur multiple times throughout the block to change which +block ID is being described. There must be a SETBID record prior to +any other records. +

+ +

+Standard DEFINE_ABBREV records can occur inside BLOCKINFO blocks, but unlike +their occurrence in normal blocks, the abbreviation is defined for blocks +matching the block ID we are describing, not the BLOCKINFO block itself. +The abbreviations defined in BLOCKINFO blocks receive abbreviation ids +as described in DEFINE_ABBREV. +

+ +

+Note that although the data in BLOCKINFO blocks is described as "metadata," the +abbreviations they contain are essential for parsing records from the +corresponding blocks. It is not safe to skip them. +

+ +
+ + +
LLVM IR Encoding
+ + +
+ +

LLVM IR is encoded into a bitstream by defining blocks and records. It uses +blocks for things like constant pools, functions, symbol tables, etc. It uses +records for things like instructions, global variable descriptors, type +descriptions, etc. This document does not describe the set of abbreviations +that the writer uses, as these are fully self-described in the file, and the +reader is not allowed to build in any knowledge of this.

+ +
+ + +
Basics +
+ + +
LLVM IR Magic Number
+ +
+ +

+The magic number for LLVM IR files is: +

+ +

[0x0_4, 0xC_4, 0xE_4, 0xD_4]

+ +

When combined with the bitcode magic number and viewed as bytes, this is "BC 0xC0DE".
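
To see how the byte view arises, the sketch below packs the 'B' and 'C' bytes followed by the four magic values, reading each magic entry as a 4-bit field with the value 0x0, 0xC, 0xE and 0xD respectively. That reading of the magic number as 4-bit fields is an assumption of this example. Fields are packed starting at the least significant bit of each byte; the helper name `pack_lsb_first` is hypothetical.

```python
def pack_lsb_first(fields):
    """Pack (value, bit_width) pairs into bytes, filling low bits first."""
    acc = 0
    nbits = 0
    for value, width in fields:
        acc |= (value & ((1 << width) - 1)) << nbits
        nbits += width
    return acc.to_bytes((nbits + 7) // 8, "little")


MAGIC_FIELDS = [
    (0x42, 8), (0x43, 8),                    # 'B', 'C'
    (0x0, 4), (0xC, 4), (0xE, 4), (0xD, 4),  # LLVM IR magic, as 4-bit fields
]
```

Packing `MAGIC_FIELDS` produces the bytes 0x42, 0x43, 0xC0, 0xDE: because the first 4-bit field lands in the low half of its byte, each pair of fields appears swapped when the byte is written in hexadecimal.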

+ +
+ + +
Signed VBRs
+ +
+ +

+Variable Width Integers are an efficient way to +encode arbitrary sized unsigned values, but are an extremely inefficient way to +encode signed values (as negative values are otherwise treated as maximally large +unsigned values).

+ +

As such, signed vbr values of a specific width are emitted as follows:

+ + + +

With this encoding, small positive and small negative values can both be +emitted efficiently.
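
The list of rules is not reproduced in this text. The sketch below assumes the convention used by LLVM's bitcode writer: the magnitude is shifted left by one bit and the low bit carries the sign, after which the result is emitted as an ordinary unsigned VBR. The function names are hypothetical.

```python
def to_signed_vbr_payload(value: int) -> int:
    """Map a signed value to the unsigned payload that is emitted as a VBR."""
    if value >= 0:
        return value << 1          # low bit 0 marks a non-negative value
    return ((-value) << 1) | 1     # low bit 1 marks a negative value


def from_signed_vbr_payload(payload: int) -> int:
    """Recover the signed value from a decoded unsigned payload."""
    magnitude = payload >> 1
    return -magnitude if payload & 1 else magnitude
```

Under this assumption, -3 becomes the payload 7 and fits in a single vbr4 chunk, instead of being treated as a very large unsigned number.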

+ +
+ + + +
LLVM IR Blocks
+ +
+ +

+LLVM IR is defined with the following blocks: +

+ + + +
+ + +
MODULE_BLOCK Contents +
+ +
+ +

+

+
+ +
-Reid Spencer and Chris Lattner
+ Chris Lattner
The LLVM Compiler Infrastructure
Last modified: $Date$