docs/BitCodeFormat.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3 <html>
   4 <head>
   5   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   6   <title>LLVM Bitcode File Format</title>
   7   <link rel="stylesheet" href="llvm.css" type="text/css">
   8 </head>
   9 <body>
  10 <div class="doc_title"> LLVM Bitcode File Format </div>
  11 <ol>
  12   <li><a href="#abstract">Abstract</a></li>
  13   <li><a href="#overview">Overview</a></li>
  14   <li><a href="#bitstream">Bitstream Format</a>
  15     <ol>
  16     <li><a href="#magic">Magic Numbers</a></li>
  17     <li><a href="#primitives">Primitives</a></li>
  18     <li><a href="#abbrevid">Abbreviation IDs</a></li>
  19     <li><a href="#blocks">Blocks</a></li>
  20     <li><a href="#datarecord">Data Records</a></li>
  21     <li><a href="#abbreviations">Abbreviations</a></li>
  22     </ol>
  23   </li>
  24   <li><a href="#llvmir">LLVM IR Encoding</a></li>
  25 </ol>
  26 <div class="doc_author">
  27   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>.
  28 </p>
  29 </div>
  30
  31 <!-- *********************************************************************** -->
  32 <div class="doc_section"> <a name="abstract">Abstract</a></div>
  33 <!-- *********************************************************************** -->
  34
  35 <div class="doc_text">
  36
  37 <p>This document describes the LLVM bitstream file format and the encoding of
  38 the LLVM IR into it.</p>
  39
  40 </div>
  41
  42 <!-- *********************************************************************** -->
  43 <div class="doc_section"> <a name="overview">Overview</a></div>
  44 <!-- *********************************************************************** -->
  45
  46 <div class="doc_text">
  47
  48 <p>
  49 What is commonly known as the LLVM bitcode file format (also, sometimes
  50 anachronistically known as bytecode) is actually two things: a <a
  51 href="#bitstream">bitstream container format</a>
  52 and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p>
  53
  54 <p>
  55 The bitstream format is an abstract encoding of structured data, very
  56 similar to XML in some ways.  Like XML, bitstream files contain tags, and nested
  57 structures, and you can parse the file without having to understand the tags.
  58 Unlike XML, the bitstream format is a binary encoding, and unlike XML it
  59 provides a mechanism for the file to self-describe "abbreviations", which are
  60 effectively size optimizations for the content.</p>
  61
  62 <p>This document first describes the LLVM bitstream format, then describes the
  63 record structure used by LLVM IR files.
  64 </p>
  65
  66 </div>
  67
  68 <!-- *********************************************************************** -->
  69 <div class="doc_section"> <a name="bitstream">Bitstream Format</a></div>
  70 <!-- *********************************************************************** -->
  71
  72 <div class="doc_text">
  73
  74 <p>
  75 The bitstream format is literally a stream of bits, with a very simple
  76 structure.  This structure consists of the following concepts:
  77 </p>
  78
  79 <ul>
  80 <li>A "<a href="#magic">magic number</a>" that identifies the contents of
  81     the stream.</li>
  82 <li>Encoding <a href="#primitives">primitives</a> like variable bit-rate
  83     integers.</li>
  84 <li><a href="#blocks">Blocks</a>, which define nested content.</li>
  85 <li><a href="#datarecord">Data Records</a>, which describe entities within the
  86     file.</li>
   87 <li><a href="#abbreviations">Abbreviations</a>, which specify compression optimizations for the file.</li>
  88 </ul>
  89
  90 <p>Note that the <a
  91 href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be
  92 used to dump and inspect arbitrary bitstreams, which is very useful for
  93 understanding the encoding.</p>
  94
  95 </div>
  96
  97 <!-- ======================================================================= -->
  98 <div class="doc_subsection"><a name="magic">Magic Numbers</a>
  99 </div>
 100
 101 <div class="doc_text">
 102
 103 <p>The first four bytes of the stream identify the encoding of the file.  This
 104 is used by a reader to know what is contained in the file.</p>
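<p>As a minimal sketch, a reader might look the first four bytes up in a table
of known magic numbers.  The helper <tt>identify_stream</tt> and the table are
hypothetical, and the byte value listed for LLVM IR is an assumption that this
document does not state.</p>

```python
# Hypothetical table mapping a four-byte magic number to a content kind.
# The single entry is an assumed value; this document does not give it.
KNOWN_MAGIC = {
    b"BC\xc0\xde": "LLVM IR bitcode (assumed magic value)",
}

def identify_stream(data):
    """Return a description for the stream's magic number, or None."""
    if len(data) >= 4:
        return KNOWN_MAGIC.get(bytes(data[:4]))
    return None  # too short to carry a magic number
```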
 105
 106 </div>
 107
 108 <!-- ======================================================================= -->
 109 <div class="doc_subsection"><a name="primitives">Primitives</a>
 110 </div>
 111
 112 <div class="doc_text">
 113
 114 <p>
 115 A bitstream literally consists of a stream of bits.  This stream is made up of a
 116 number of primitive values that encode a stream of integer values.  These
  117 integers are encoded in two ways: either as <a href="#fixedwidth">Fixed
 118 Width Integers</a> or as <a href="#variablewidth">Variable Width
 119 Integers</a>.
 120 </p>
 121
 122 </div>
 123
 124 <!-- _______________________________________________________________________ -->
 125 <div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a>
 126 </div>
 127
 128 <div class="doc_text">
 129
 130 <p>Fixed-width integer values have their low bits emitted directly to the file.
  131    For example, a 3-bit integer value encodes 1 as 001.  Fixed-width integers
  132    are used when there is a well-known number of options for a field.  For
 133    example, boolean values are usually encoded with a 1-bit wide integer.
 134 </p>
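<p>The following sketch illustrates the idea under one assumption: the bits are
shown as a string with the high bit first, matching the way the example above
is written.  It does not model the order in which bits are placed in the
stream.  The function name <tt>encode_fixed</tt> is hypothetical.</p>

```python
def encode_fixed(value, width):
    """Return the low `width` bits of `value` as a string, high bit first."""
    low_bits = value % (2 ** width)      # keep only the low `width` bits
    return format(low_bits, "0{}b".format(width))

print(encode_fixed(1, 3))     # 001
print(encode_fixed(True, 1))  # 1
```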
 135
 136 </div>
 137
 138 <!-- _______________________________________________________________________ -->
 139 <div class="doc_subsubsection"> <a name="variablewidth">Variable Width
 140 Integers</a></div>
 141
 142 <div class="doc_text">
 143
 144 <p>Variable-width integer (VBR) values encode values of arbitrary size,
 145 optimizing for the case where the values are small.  Given a 4-bit VBR field,
 146 any 3-bit value (0 through 7) is encoded directly, with the high bit set to
  147 zero.  For an N-bit VBR field, values that need more than N-1 bits are emitted as
  148 a series of (N-1)-bit chunks, low-order chunk first, where all but the last set the high bit.</p>
 149
 150 <p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a
 151 vbr4 value.  The first set of four bits indicates the value 3 (011) with a
  152 continuation piece (indicated by a high bit of 1).  The next set of four bits
  153 indicates a value of 24 (011 &lt;&lt; 3) with no continuation.  The sum (3+24) yields the value
 154 27.
 155 </p>
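<p>A small sketch of this scheme, reproducing the vbr4 example above.  The
function names are hypothetical, and each chunk is represented as a string of
bits (high bit first) rather than being packed into a real stream.</p>

```python
def encode_vbr(value, width):
    """Encode a non-negative integer as a list of `width`-bit VBR chunks."""
    payload = 2 ** (width - 1)           # number of values that fit in width-1 bits
    chunks = []
    while value >= payload:
        # Low width-1 bits of the value, with the high (continuation) bit set.
        chunks.append(format(payload + value % payload, "0{}b".format(width)))
        value //= payload
    # Last chunk: the remaining bits, with the high bit clear.
    chunks.append(format(value, "0{}b".format(width)))
    return chunks

def decode_vbr(chunks, width):
    """Inverse of encode_vbr: add each chunk's payload at its bit offset."""
    payload = 2 ** (width - 1)
    value = 0
    for index, chunk in enumerate(chunks):
        value += (int(chunk, 2) % payload) * payload ** index
    return value

print(encode_vbr(27, 4))                # ['1011', '0011']
print(decode_vbr(["1011", "0011"], 4))  # 27
```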
 156
 157 </div>
 158
 159 <!-- _______________________________________________________________________ -->
 160 <div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div>
 161
 162 <div class="doc_text">
 163
 164 <p>6-bit characters encode common characters into a fixed 6-bit field.  They
 165 represent the following characters with the following 6-bit values:</p>
 166
 167 <ul>
 168 <li>'a' .. 'z' - 0 .. 25</li>
  169 <li>'A' .. 'Z' - 26 .. 51</li>
  170 <li>'0' .. '9' - 52 .. 61</li>
 171 <li>'.' - 62</li>
 172 <li>'_' - 63</li>
 173 </ul>
 174
 175 <p>This encoding is only suitable for encoding characters and strings that
 176 consist only of the above characters.  It is completely incapable of encoding
 177 characters not in the set.</p>
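<p>One way to sketch the mapping is as a 64-character alphabet indexed by the
6-bit value.  This assumes the ranges are contiguous, so that 'A' .. 'Z'
occupies 26 .. 51 and '0' .. '9' occupies 52 .. 61.  The names below are
hypothetical.</p>

```python
import string

# Alphabet in value order: 'a'..'z', 'A'..'Z', '0'..'9', '.', '_'.
CHAR6_ALPHABET = (string.ascii_lowercase + string.ascii_uppercase
                  + string.digits + "._")

def encode_char6(text):
    """Return the 6-bit value of each character.

    Raises ValueError for a character outside the set.
    """
    return [CHAR6_ALPHABET.index(ch) for ch in text]

def decode_char6(values):
    """Map 6-bit values back to their characters."""
    return "".join(CHAR6_ALPHABET[v] for v in values)
```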
 178
 179 </div>
 180
 181 <!-- _______________________________________________________________________ -->
 182 <div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div>
 183
 184 <div class="doc_text">
 185
 186 <p>Occasionally, it is useful to emit zero bits until the bitstream is a
 187 multiple of 32 bits.  This ensures that the bit position in the stream can be
 188 represented as a multiple of 32-bit words.</p>
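<p>The number of zero bits to emit follows directly from the current bit
position.  A one-line sketch, with a hypothetical function name:</p>

```python
def padding_to_word(bit_position):
    """Number of zero bits needed to reach the next multiple of 32 bits."""
    return -bit_position % 32
```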
 189
 190 </div>
 191
 192
 193 <!-- ======================================================================= -->
 194 <div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a>
 195 </div>
 196
 197 <div class="doc_text">
 198
 199 <p>
 200 A bitstream is a sequential series of <a href="#blocks">Blocks</a> and
 201 <a href="#datarecord">Data Records</a>.  Both of these start with an
 202 abbreviation ID encoded as a fixed-bitwidth field.  The width is specified by
 203 the current block, as described below.  The value of the abbreviation ID
  204 specifies either a builtin ID (which has a special meaning, defined below) or
 205 one of the abbreviation IDs defined by the stream itself.
 206 </p>
 207
 208 <p>
 209 The set of builtin abbrev IDs is:
 210 </p>
 211
 212 <ul>
 213 <li>0 - <a href="#END_BLOCK">END_BLOCK</a> - This abbrev ID marks the end of the
 214     current block.</li>
 215 <li>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a> - This abbrev ID marks the
 216     beginning of a new block.</li>
 217 <li>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a> - This defines a new
 218     abbreviation.</li>
 219 <li>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a> - This ID specifies the
 220     definition of an unabbreviated record.</li>
 221 </ul>
 222
 223 <p>Abbreviation IDs 4 and above are defined by the stream itself, and specify
 224 an <a href="#abbrev_records">abbreviated record encoding</a>.</p>
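<p>A reader's top-level dispatch on the abbreviation ID could be sketched as
follows.  The function name and the returned strings are hypothetical; only the
four builtin IDs come from the list above.</p>

```python
BUILTIN_ABBREV_IDS = {
    0: "END_BLOCK",
    1: "ENTER_SUBBLOCK",
    2: "DEFINE_ABBREV",
    3: "UNABBREV_RECORD",
}

def classify_abbrev_id(abbrev_id):
    """Name a builtin ID, or report a stream-defined abbreviation (4 and above)."""
    if abbrev_id in BUILTIN_ABBREV_IDS:
        return BUILTIN_ABBREV_IDS[abbrev_id]
    return "stream-defined abbreviation {}".format(abbrev_id)
```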
 225
 226 </div>
 227
 228 <!-- ======================================================================= -->
 229 <div class="doc_subsection"><a name="blocks">Blocks</a>
 230 </div>
 231
 232 <div class="doc_text">
 233
 234 <p>
 235 Blocks in a bitstream denote nested regions of the stream, and are identified by
 236 a content-specific id number (for example, LLVM IR uses an ID of 12 to represent
  237 function bodies).  Nested blocks capture the hierarchical structure of the data
  238 encoded in the stream, and various properties are associated with blocks as the file is
  239 parsed.  Block definitions allow the reader to skip blocks
  240 in constant time if it wants only a summary of blocks, or if it wants to
  241 skip data it does not understand.  The LLVM IR reader uses this
 242 mechanism to skip function bodies, lazily reading them on demand.
 243 </p>
 244
 245 <p>
 246 When reading and encoding the stream, several properties are maintained for the
 247 block.  In particular, each block maintains:
 248 </p>
 249
 250 <ol>
 251 <li>A current abbrev id width.  This value starts at 2, and is set every time a
 252     block record is entered.  The block entry specifies the abbrev id width for
 253     the body of the block.</li>
 254
 255 <li>A set of abbreviations.  Abbreviations may be defined within a block, or
 256     they may be associated with all blocks of a particular ID.
 257 </li>
 258 </ol>
 259
  260 <p>As sub-blocks are entered, these properties are saved and the new sub-block
 261 has its own set of abbreviations, and its own abbrev id width.  When a sub-block
 262 is popped, the saved values are restored.</p>
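<p>The save and restore behaviour can be sketched with an explicit stack.  This
is an illustration only: the class and method names are hypothetical, and
abbreviations associated with all blocks of a particular ID are left out.</p>

```python
class BlockScopes:
    """Tracks the per-block reader state described above (illustrative only)."""

    def __init__(self):
        self.abbrev_width = 2   # the abbrev id width starts at 2
        self.abbrevs = []       # abbreviations defined in the current block
        self._saved = []        # (width, abbrevs) of the enclosing blocks

    def enter_block(self, new_abbrev_width):
        # Save the enclosing block's state, then start with a fresh set.
        self._saved.append((self.abbrev_width, self.abbrevs))
        self.abbrev_width = new_abbrev_width
        self.abbrevs = []

    def end_block(self):
        # Restore the state that was saved when the block was entered.
        self.abbrev_width, self.abbrevs = self._saved.pop()
```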
 263
 264 </div>
 265
 266 <!-- _______________________________________________________________________ -->
 267 <div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK
 268 Encoding</a></div>
 269
 270 <div class="doc_text">
 271
 272 <p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>,
 273      &lt;align32bits&gt;, blocklen<sub>32</sub>]</tt></p>
 274
 275 <p>
 276 The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block record.
  277 The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and indicates
 278 the type of block being entered (which is application specific).  The
 279 <tt>newabbrevlen</tt> value is a 4-bit VBR which specifies the
 280 abbrev id width for the sub-block.  The <tt>blocklen</tt> is a 32-bit aligned
 281 value that specifies the size of the subblock, in 32-bit words.  This value
 282 allows the reader to skip over the entire block in one jump.
 283 </p>
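<p>To illustrate how a reader could use <tt>blocklen</tt> to skip a block, the
sketch below computes bit positions for the fields listed above.  It assumes
that <tt>blocklen</tt> counts the 32-bit words that follow the
<tt>blocklen</tt> field itself, which this document does not spell out.  All
names are hypothetical.</p>

```python
def vbr_bits(value, width):
    """Number of bits used to emit `value` as a VBR with `width`-bit chunks."""
    payload = 2 ** (width - 1)
    chunks = 1
    while value >= payload:
        value //= payload
        chunks += 1
    return chunks * width

def enter_subblock_layout(start_bit, abbrev_width, block_id,
                          new_abbrev_len, block_len_words):
    """Return (first_body_bit, first_bit_after_block).

    Assumes blocklen counts the words after the blocklen field.
    """
    pos = start_bit + abbrev_width        # the ENTER_SUBBLOCK abbrev id
    pos += vbr_bits(block_id, 8)          # blockid as vbr8
    pos += vbr_bits(new_abbrev_len, 4)    # newabbrevlen as vbr4
    pos += -pos % 32                      # align32bits
    body = pos + 32                       # step over the 32-bit blocklen
    return body, body + 32 * block_len_words
```

<p>For instance, with an abbrev id width of 2, block ID 12, a new width of 4
and a three-word body, the header occupies bits 0 to 63 and the reader would
resume at bit 160.</p>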
 284
 285 </div>
 286
 287 <!-- _______________________________________________________________________ -->
 288 <div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK
 289 Encoding</a></div>
 290
 291 <div class="doc_text">
 292
 293 <p><tt>[END_BLOCK, &lt;align32bits&gt;]</tt></p>
 294
 295 <p>
 296 The END_BLOCK abbreviation ID specifies the end of the current block record.
  297 Its end is aligned to 32 bits to ensure that the size of the block is a whole
  298 multiple of 32 bits.</p>
 299
 300 </div>
 301
 302
 303
 304 <!-- ======================================================================= -->
 305 <div class="doc_subsection"><a name="datarecord">Data Records</a>
 306 </div>
 307
 308 <div class="doc_text">
 309 <p>
 310 Data records consist of a record code and a number of (up to) 64-bit integer
 311 values.  The interpretation of the code and values is application specific and
 312 there are multiple different ways to encode a record (with an unabbrev record
 313 or with an abbreviation).  In the LLVM IR format, for example, there is a record
 314 which encodes the target triple of a module.  The code is MODULE_CODE_TRIPLE,
  315 and the values of the record are the ASCII codes for the characters in the
 316 string.</p>
 317
 318 </div>
 319
 320 <!-- _______________________________________________________________________ -->
 321 <div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD
 322 Encoding</a></div>
 323
 324 <div class="doc_text">
 325
 326 <p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>,
 327        op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p>
 328
 329 <p>An UNABBREV_RECORD provides a default fallback encoding, which is both
 330 completely general and also extremely inefficient.  It can describe an arbitrary
 331 record, by emitting the code and operands as vbrs.</p>
 332
 333 <p>For example, emitting an LLVM IR target triple as an unabbreviated record
 334 requires emitting the UNABBREV_RECORD abbrevid, a vbr6 for the
 335 MODULE_CODE_TRIPLE code, a vbr6 for the length of the string (which is equal to
 336 the number of operands), and a vbr6 for each character.  Since there are no
 337 letters with value less than 32, each letter would need to be emitted as at
 338 least a two-part VBR, which means that each letter would require at least 12
 339 bits.  This is not an efficient encoding, but it is fully general.</p>
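<p>The cost estimate above can be checked with a short calculation.  In this
sketch the abbrev id width (3), the record code (2) and the string
<tt>"x86_64"</tt> are placeholders chosen for illustration and are not taken
from this document.</p>

```python
def vbr_bits(value, width):
    """Number of bits used to emit `value` as a VBR with `width`-bit chunks."""
    payload = 2 ** (width - 1)
    chunks = 1
    while value >= payload:
        value //= payload
        chunks += 1
    return chunks * width

def unabbrev_record_bits(abbrev_width, code, operands):
    """Total size in bits of [UNABBREV_RECORD, code, numops, op0, op1, ...]."""
    bits = abbrev_width                    # the UNABBREV_RECORD abbrev id
    bits += vbr_bits(code, 6)              # record code
    bits += vbr_bits(len(operands), 6)     # number of operands
    bits += sum(vbr_bits(op, 6) for op in operands)
    return bits

# Placeholder inputs: a 3-bit abbrev id, record code 2, a six-character string.
operands = [ord(ch) for ch in "x86_64"]
print(vbr_bits(ord("x"), 6))                   # 12
print(unabbrev_record_bits(3, 2, operands))    # 87
```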
 340
 341 </div>
 342
 343 <!-- _______________________________________________________________________ -->
 344 <div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record
 345 Encoding</a></div>
 346
 347 <div class="doc_text">
 348
 349 <p><tt>[&lt;abbrevid&gt;, fields...]</tt></p>
 350
  351 <p>An abbreviated record is an abbreviation ID followed by a set of fields that
 352 are encoded according to the <a href="#abbreviations">abbreviation
 353 definition</a>.  This allows records to be encoded significantly more densely
 354 than records encoded with the <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a>
 355 type, and allows the abbreviation types to be specified in the stream itself,
 356 which allows the files to be completely self describing.  The actual encoding
 357 of abbreviations is defined below.
 358 </p>
 359
 360 </div>
 361
 362 <!-- ======================================================================= -->
 363 <div class="doc_subsection"><a name="abbreviations">Abbreviations</a>
 364 </div>
 365
 366 <div class="doc_text">
 367 <p>
 368 Abbreviations are an important form of compression for bitstreams.  The idea is
 369 to specify a dense encoding for a class of records once, then use that encoding
 370 to emit many records.  It takes space to emit the encoding into the file, but
 371 the space is recouped (hopefully plus some) when the records that use it are
 372 emitted.
 373 </p>
 374
 375 <p>
 376 Abbreviations can be determined dynamically per client, per file.  Since the
 377 abbreviations are stored in the bitstream itself, different streams of the same
  378 format can contain different sets of abbreviations, and a stream can omit any
  379 it does not need.  As a concrete example, LLVM IR files usually emit an abbreviation
 380 for binary operators.  If a specific LLVM module contained no or few binary
 381 operators, the abbreviation does not need to be emitted.
 382 </p>
 383 </div>
 384
 385 <!-- _______________________________________________________________________ -->
 386 <div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV
 387  Encoding</a></div>
 388
 389 <div class="doc_text">
 390
 391 <p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1,
 392  ...]</tt></p>
 393
 394 <p>An abbreviation definition consists of the DEFINE_ABBREV abbrevid followed
 395 by a VBR that specifies the number of abbrev operands, then the abbrev
 396 operands themselves.  Abbreviation operands come in three forms.  They all start
 397 with a single bit that indicates whether the abbrev operand is a literal operand
 398 (when the bit is 1) or an encoding operand (when the bit is 0).</p>
 399
 400 <ol>
 401 <li>Literal operands - <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt> -
 402 Literal operands specify that the value in the result
 403 is always a single specific value.  This specific value is emitted as a vbr8
 404 after the bit indicating that it is a literal operand.</li>
 405 <li>Encoding info without data - <tt>[0<sub>1</sub>, encoding<sub>3</sub>]</tt>
 406  - blah
 407 </li>
 408 <li>Encoding info with data - <tt>[0<sub>1</sub>, encoding<sub>3</sub>,
 409 value<sub>vbr5</sub>]</tt> -
 410
 411 </li>
 412 </ol>
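<p>Of the three forms, only the literal operand is fully specified above, so
the sketch below covers that form alone.  The function names are hypothetical,
and each field is shown as a string of bits (high bit first) rather than being
packed into a stream.</p>

```python
def encode_vbr(value, width):
    """Encode a non-negative integer as a list of `width`-bit VBR chunks."""
    payload = 2 ** (width - 1)
    chunks = []
    while value >= payload:
        # Low width-1 bits of the value, with the continuation bit set.
        chunks.append(format(payload + value % payload, "0{}b".format(width)))
        value //= payload
    chunks.append(format(value, "0{}b".format(width)))
    return chunks

def encode_literal_operand(value):
    """Fields of a literal abbrev operand: a 1 bit, then the value as a vbr8."""
    return ["1"] + encode_vbr(value, 8)

print(encode_literal_operand(5))    # ['1', '00000101']
```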
 413
 414 </div>
 415
 416
 417 <!-- *********************************************************************** -->
 418 <div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div>
 419 <!-- *********************************************************************** -->
 420
 421 <div class="doc_text">
 422
 423 <p></p>
 424
 425 </div>
 426
 427
 428 <!-- *********************************************************************** -->
 429 <hr>
 430 <address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
 431  src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
 432 <a href="http://validator.w3.org/check/referer"><img
 433  src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
 434  <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
 435 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
 436 Last modified: $Date$
 437 </address>
 438 </body>
 439 </html>