docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <ul>
  17 <li><a href="index.html">Up to Tutorial Index</a></li>
  18 <li>Chapter 2
  19   <ol>
  20     <li><a href="#intro">Chapter 2 Introduction</a></li>
  21     <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
  22     <li><a href="#parserbasics">Parser Basics</a></li>
  23     <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
  24     <li><a href="#parserbinops">Binary Expression Parsing</a></li>
  25     <li><a href="#parsertop">Parsing the Rest</a></li>
  26     <li><a href="#driver">The Driver</a></li>
  27     <li><a href="#conclusions">Conclusions</a></li>
  28     <li><a href="#code">Full Code Listing</a></li>
  29   </ol>
  30 </li>
  31 <li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
  32 </ul>
  33
  34 <div class="doc_author">
  35   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  36 </div>
  37
  38 <!-- *********************************************************************** -->
  39 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
  40 <!-- *********************************************************************** -->
  41
  42 <div class="doc_text">
  43
  44 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
  45 with LLVM</a>" tutorial.  This chapter shows you how to use the <a
  46 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
  47 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  48 our Kaleidoscope language.  Once we have a parser, we'll define and build an <a
  49 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  50 Tree</a> (AST).</p>
  51
  52 <p>The parser we will build uses a combination of <a
  53 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  54 Parsing</a> and <a href=
  55 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  56 Parsing</a> to parse the Kaleidoscope language (the latter for binary expression
  57 and the former for everything else).  Before we get to parsing though, lets talk
  58 about the output of the parser: the Abstract Syntax Tree.</p>
  59
  60 </div>
  61
  62 <!-- *********************************************************************** -->
  63 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  64 <!-- *********************************************************************** -->
  65
  66 <div class="doc_text">
  67
  68 <p>The AST for a program captures its behavior in a way that it is easy for
  69 later stages of the compiler (e.g. code generation) to interpret.  We basically
  70 want one object for each construct in the language, and the AST should closely
  71 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  72 function object.  We'll start with expressions first:</p>
  73
  74 <div class="doc_code">
  75 <pre>
  76 /// ExprAST - Base class for all expression nodes.
  77 class ExprAST {
  78 public:
  79   virtual ~ExprAST() {}
  80 };
  81
  82 /// NumberExprAST - Expression class for numeric literals like "1.0".
  83 class NumberExprAST : public ExprAST {
  84   double Val;
  85 public:
  86   explicit NumberExprAST(double val) : Val(val) {}
  87 };
  88 </pre>
  89 </div>
  90
  91 <p>The code above shows the definition of the base ExprAST class and one
  92 subclass which we use for numeric literals.  The important thing about this is
  93 that the NumberExprAST class captures the numeric value of the literal in the
  94 class, so that later phases of the compiler can know what it is.</p>
  95
  96 <p>Right now we only create the AST,  so there are no useful accessor methods on
  97 them.  It would be very easy to add a virtual method to pretty print the code,
  98 for example.  Here are the other expression AST node definitions that we'll use
  99 in the basic form of the Kaleidoscope language.
 100 </p>
 101
 102 <div class="doc_code">
 103 <pre>
 104 /// VariableExprAST - Expression class for referencing a variable, like "a".
 105 class VariableExprAST : public ExprAST {
 106   std::string Name;
 107 public:
 108   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 109 };
 110
 111 /// BinaryExprAST - Expression class for a binary operator.
 112 class BinaryExprAST : public ExprAST {
 113   char Op;
 114   ExprAST *LHS, *RHS;
 115 public:
 116   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 117     : Op(op), LHS(lhs), RHS(rhs) {}
 118 };
 119
 120 /// CallExprAST - Expression class for function calls.
 121 class CallExprAST : public ExprAST {
 122   std::string Callee;
 123   std::vector&lt;ExprAST*&gt; Args;
 124 public:
 125   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 126     : Callee(callee), Args(args) {}
 127 };
 128 </pre>
 129 </div>
 130
 131 <p>This is all (intentially) rather straight-forward: variables capture the
 132 variable name, binary operators capture their opcode (e.g. '+'), and calls
 133 capture a function name and list of argument expressions.  One thing that is
 134 nice about our AST is that it captures the language features without talking
 135 about the syntax of the language.  Note that there is no discussion about
 136 precedence of binary operators, lexical structure etc.</p>
 137
 138 <p>For our basic language, these are all of the expression nodes we'll define.
 139 Because it doesn't have conditional control flow, it isn't Turing-complete;
 140 we'll fix that in a later installment.  The two things we need next are a way
 141 to talk about the interface to a function, and a way to talk about functions
 142 themselves:</p>
 143
 144 <div class="doc_code">
 145 <pre>
 146 /// PrototypeAST - This class represents the "prototype" for a function,
 147 /// which captures its name, and its argument names (thus implicitly the number
 148 /// of arguments the function takes).
 149 class PrototypeAST {
 150   std::string Name;
 151   std::vector&lt;std::string&gt; Args;
 152 public:
 153   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 154     : Name(name), Args(args) {}
 155 };
 156
 157 /// FunctionAST - This class represents a function definition itself.
 158 class FunctionAST {
 159   PrototypeAST *Proto;
 160   ExprAST *Body;
 161 public:
 162   FunctionAST(PrototypeAST *proto, ExprAST *body)
 163     : Proto(proto), Body(body) {}
 164 };
 165 </pre>
 166 </div>
 167
 168 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 169 Since all values are double precision floating point, the type of each argument
 170 doesn't need to be stored anywhere.  In a more aggressive and realistic
 171 language, the "ExprAST" class would probably have a type field.</p>
 172
 173 <p>With this scaffolding, we can now talk about parsing expressions and function
 174 bodies in Kaleidoscope.</p>
 175
 176 </div>
 177
 178 <!-- *********************************************************************** -->
 179 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 180 <!-- *********************************************************************** -->
 181
 182 <div class="doc_text">
 183
 184 <p>Now that we have an AST to build, we need to define the parser code to build
 185 it.  The idea here is that we want to parse something like "x+y" (which is
 186 returned as three tokens by the lexer) into an AST that could be generated with
 187 calls like this:</p>
 188
 189 <div class="doc_code">
 190 <pre>
 191   ExprAST *X = new VariableExprAST("x");
 192   ExprAST *Y = new VariableExprAST("y");
 193   ExprAST *Result = new BinaryExprAST('+', X, Y);
 194 </pre>
 195 </div>
 196
 197 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 198
 199 <div class="doc_code">
 200 <pre>
 201 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 202 /// token the parser it looking at.  getNextToken reads another token from the
 203 /// lexer and updates CurTok with its results.
 204 static int CurTok;
 205 static int getNextToken() {
 206   return CurTok = gettok();
 207 }
 208 </pre>
 209 </div>
 210
 211 <p>
 212 This implements a simple token buffer around the lexer.  This allows
 213 us to look one token ahead at what the lexer is returning.  Every function in
 214 our parser will assume that CurTok is the current token that needs to be
 215 parsed.</p>
 216
 217 <div class="doc_code">
 218 <pre>
 219
 220 /// Error* - These are little helper functions for error handling.
 221 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 222 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 223 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 224 </pre>
 225 </div>
 226
 227 <p>
 228 The <tt>Error</tt> routines are simple helper routines that our parser will use
 229 to handle errors.  The error recovery in our parser will not be the best and
 230 is not particular user-friendly, but it will be enough for our tutorial.  These
 231 routines make it easier to handle errors in routines that have various return
 232 types: they always return null.</p>
 233
 234 <p>With these basic helper functions implemented, we can implement the first
 235 piece of our grammar: we'll start with numeric literals.</p>
 236
 237 </div>
 238
 239 <!-- *********************************************************************** -->
 240 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 241  Parsing</a></div>
 242 <!-- *********************************************************************** -->
 243
 244 <div class="doc_text">
 245
 246 <p>We start with numeric literals, because they are the simplest to process.
 247 For each production in our grammar, we'll define a function which parses that
 248 production.  For numeric literals, we have:
 249 </p>
 250
 251 <div class="doc_code">
 252 <pre>
 253 /// numberexpr ::= number
 254 static ExprAST *ParseNumberExpr() {
 255   ExprAST *Result = new NumberExprAST(NumVal);
 256   getNextToken(); // consume the number
 257   return Result;
 258 }
 259 </pre>
 260 </div>
 261
 262 <p>This routine is very simple: it expects to be called when the current token
 263 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 264 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
 265 returns.</p>
 266
 267 <p>There are some interesting aspects of this.  The most important one is that
 268 this routine eats all of the tokens that correspond to the production, and
 269 returns the lexer buffer with the next token (which is not part of the grammar
 270 production) ready to go.  This is a fairly standard way to go for recursive
 271 descent parsers.  For a better example, the parenthesis operator is defined like
 272 this:</p>
 273
 274 <div class="doc_code">
 275 <pre>
 276 /// parenexpr ::= '(' expression ')'
 277 static ExprAST *ParseParenExpr() {
 278   getNextToken();  // eat (.
 279   ExprAST *V = ParseExpression();
 280   if (!V) return 0;
 281
 282   if (CurTok != ')')
 283     return Error("expected ')'");
 284   getNextToken();  // eat ).
 285   return V;
 286 }
 287 </pre>
 288 </div>
 289
 290 <p>This function illustrates a number of interesting things about the parser:
 291 1) it shows how we use the Error routines.  When called, this function expects
 292 that the current token is a '(' token, but after parsing the subexpression, it
 293 is possible that there is no ')' waiting.  For example, if the user types in
 294 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 295 occur, the parser needs a way to indicate that they happened: in our parser, we
 296 return null on an error.</p>
 297
 298 <p>Another interesting aspect of this function is that it uses recursion by
 299 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 300 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 301 recursive grammars, and keeps each production very simple.  Note that
 302 parentheses do not cause construction of AST nodes themselves.  While we could
 303 do this, the most important role of parens are to guide the parser and provide
 304 grouping.  Once the parser constructs the AST, parens are not needed.</p>
 305
 306 <p>The next simple production is for handling variable references and function
 307 calls:</p>
 308
 309 <div class="doc_code">
 310 <pre>
 311 /// identifierexpr
 312 ///   ::= identifier
 313 ///   ::= identifier '(' expression* ')'
 314 static ExprAST *ParseIdentifierExpr() {
 315   std::string IdName = IdentifierStr;
 316
 317   getNextToken();  // eat identifier.
 318
 319   if (CurTok != '(') // Simple variable ref.
 320     return new VariableExprAST(IdName);
 321
 322   // Call.
 323   getNextToken();  // eat (
 324   std::vector&lt;ExprAST*&gt; Args;
 325   if (CurTok != ')') {
 326     while (1) {
 327       ExprAST *Arg = ParseExpression();
 328       if (!Arg) return 0;
 329       Args.push_back(Arg);
 330
 331       if (CurTok == ')') break;
 332
 333       if (CurTok != ',')
 334         return Error("Expected ')'");
 335       getNextToken();
 336     }
 337   }
 338
 339   // Eat the ')'.
 340   getNextToken();
 341
 342   return new CallExprAST(IdName, Args);
 343 }
 344 </pre>
 345 </div>
 346
 347 <p>This routine follows the same style as the other routines (it expects to be
 348 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 349 recursion and error handling.  One interesting aspect of this is that it uses
 350 <em>look-ahead</em> to determine if the current identifier is a stand alone
 351 variable reference or if it is a function call expression.  It handles this by
 352 checking to see if the token after the identifier is a '(' token, and constructs
 353 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 354 </p>
 355
 356 <p>Now that we have all of our simple expression parsing logic in place, we can
 357 define a helper function to wrap it together into one entry-point.  We call this
 358 class of expressions "primary" expressions, for reasons that will become more
 359 clear <a href="LangImpl6.html#unary">later in the tutorial</a>.  In order to
 360 parse an arbitrary primary expression, we need to determine what sort of
 361 specific expression it is:</p>
 362
 363 <div class="doc_code">
 364 <pre>
 365 /// primary
 366 ///   ::= identifierexpr
 367 ///   ::= numberexpr
 368 ///   ::= parenexpr
 369 static ExprAST *ParsePrimary() {
 370   switch (CurTok) {
 371   default: return Error("unknown token when expecting an expression");
 372   case tok_identifier: return ParseIdentifierExpr();
 373   case tok_number:     return ParseNumberExpr();
 374   case '(':            return ParseParenExpr();
 375   }
 376 }
 377 </pre>
 378 </div>
 379
 380 <p>Now that you see the definition of this function, it makes it more obvious
 381 why we can assume the state of CurTok in the various functions.  This uses
 382 look-ahead to determine which sort of expression is being inspected, and parses
 383 it with a function call.</p>
 384
 385 <p>Now that basic expressions are handled, we need to handle binary expressions,
 386 which are a bit more complex.</p>
 387
 388 </div>
 389
 390 <!-- *********************************************************************** -->
 391 <div class="doc_section"><a name="parserbinops">Binary Expression
 392  Parsing</a></div>
 393 <!-- *********************************************************************** -->
 394
 395 <div class="doc_text">
 396
 397 <p>Binary expressions are significantly harder to parse because they are often
 398 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 399 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 400 mathematics, we expect the later parse, because "*" (multiplication) has
 401 higher <em>precedence</em> than "+" (addition).</p>
 402
 403 <p>There are many ways to handle this, but an elegant and efficient way is to
 404 use <a href=
 405 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 406 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 407 guide recursion.  To start with, we need a table of precedences:</p>
 408
 409 <div class="doc_code">
 410 <pre>
 411 /// BinopPrecedence - This holds the precedence for each binary operator that is
 412 /// defined.
 413 static std::map&lt;char, int&gt; BinopPrecedence;
 414
 415 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 416 static int GetTokPrecedence() {
 417   if (!isascii(CurTok))
 418     return -1;
 419
 420   // Make sure it's a declared binop.
 421   int TokPrec = BinopPrecedence[CurTok];
 422   if (TokPrec &lt;= 0) return -1;
 423   return TokPrec;
 424 }
 425
 426 int main() {
 427   // Install standard binary operators.
 428   // 1 is lowest precedence.
 429   BinopPrecedence['&lt;'] = 10;
 430   BinopPrecedence['+'] = 20;
 431   BinopPrecedence['-'] = 20;
 432   BinopPrecedence['*'] = 40;  // highest.
 433   ...
 434 }
 435 </pre>
 436 </div>
 437
 438 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 439 (this can obviously be extended by you, our brave and intrepid reader).  The
 440 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 441 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 442 new operators and makes it clear that the algorithm doesn't depend on the
 443 specific operators involved, but it would be easy enough to eliminate the map
 444 and do the comparisons in the <tt>GetTokPrecedence</tt> function (or just use
 445 a fixed-size array).</p>
 446
 447 <p>With the helper above defined, we can now start parsing binary expressions.
 448 The basic idea of operator precedence parsing is to break down an expression
 449 with potentially ambiguous binary operators into pieces.  Consider for example
 450 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 451 as a stream of primary expressions separated by binary operators.  As such,
 452 it will first parse the leading primary expression "a", then it will see the
 453 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 454 are primary expressions, the binary expression parser doesn't need to worry
 455 about nested subexpressions like (c+d) at all.
 456 </p>
 457
 458 <p>
 459 To start, an expression is a primary expression potentially followed by a
 460 sequence of [binop,primaryexpr] pairs:</p>
 461
 462 <div class="doc_code">
 463 <pre>
 464 /// expression
 465 ///   ::= primary binoprhs
 466 ///
 467 static ExprAST *ParseExpression() {
 468   ExprAST *LHS = ParsePrimary();
 469   if (!LHS) return 0;
 470
 471   return ParseBinOpRHS(0, LHS);
 472 }
 473 </pre>
 474 </div>
 475
 476 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 477 us.  It takes a precedence and a pointer to an expression for the part parsed
 478 so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 479 allowed to be empty, in which case it returns the expression that is passed into
 480 it. In our example above, the code passes the expression for "a" into
 481 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 482
 483 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 484 minimal operator precedence</em> that the function is allowed to eat.  For
 485 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 486 passed in a precedence of 40, it will not consume any tokens (because the
 487 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 488 with:</p>
 489
 490 <div class="doc_code">
 491 <pre>
 492 /// binoprhs
 493 ///   ::= ('+' primary)*
 494 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 495   // If this is a binop, find its precedence.
 496   while (1) {
 497     int TokPrec = GetTokPrecedence();
 498
 499     // If this is a binop that binds at least as tightly as the current binop,
 500     // consume it, otherwise we are done.
 501     if (TokPrec &lt; ExprPrec)
 502       return LHS;
 503 </pre>
 504 </div>
 505
 506 <p>This code gets the precedence of the current token and checks to see if if is
 507 too low.  Because we defined invalid tokens to have a precedence of -1, this
 508 check implicitly knows that the pair-stream ends when the token stream runs out
 509 of binary operators.  If this check succeeds, we know that the token is a binary
 510 operator and that it will be included in this expression:</p>
 511
 512 <div class="doc_code">
 513 <pre>
 514     // Okay, we know this is a binop.
 515     int BinOp = CurTok;
 516     getNextToken();  // eat binop
 517
 518     // Parse the primary expression after the binary operator.
 519     ExprAST *RHS = ParsePrimary();
 520     if (!RHS) return 0;
 521 </pre>
 522 </div>
 523
 524 <p>As such, this code eats (and remembers) the binary operator and then parses
 525 the following primary expression.  This builds up the whole pair, the first of
 526 which is [+, b] for the running example.</p>
 527
 528 <p>Now that we parsed the left-hand side of an expression and one pair of the
 529 RHS sequence, we have to decide which way the expression associates.  In
 530 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 531 To determine this, we look ahead at "binop" to determine its precedence and
 532 compare it to BinOp's precedence (which is '+' in this case):</p>
 533
 534 <div class="doc_code">
 535 <pre>
 536     // If BinOp binds less tightly with RHS than the operator after RHS, let
 537     // the pending operator take RHS as its LHS.
 538     int NextPrec = GetTokPrecedence();
 539     if (TokPrec &lt; NextPrec) {
 540 </pre>
 541 </div>
 542
 543 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 544 precedence of our current operator, then we know that the parentheses associate
 545 as "(a+b) binop ...".  In our example, since the next operator is "+" and so is
 546 our current one, we know that they have the same precedence.  In this case we'll
 547 create the AST node for "a+b", and then continue parsing:</p>
 548
 549 <div class="doc_code">
 550 <pre>
 551       ... if body omitted ...
 552     }
 553
 554     // Merge LHS/RHS.
 555     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 556   }  // loop around to the top of the while loop.
 557 }
 558 </pre>
 559 </div>
 560
 561 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 562 iteration of the loop, with "+" as the current token.  The code above will eat
 563 and remember it and parse "(c+d)" as the primary expression, which makes the
 564 current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the
 565 binop to the right of the primary.  In this case, the precedence of "*" is
 566 higher than the precedence of "+" so the if condition will be entered.</p>
 567
 568 <p>The critical question left here is "how can the if condition parse the right
 569 hand side in full"?  In particular, to build the AST correctly for our example,
 570 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 571 do this is surprisingly simple (code from the above two blocks duplicated for
 572 context):</p>
 573
 574 <div class="doc_code">
 575 <pre>
 576     // If BinOp binds less tightly with RHS than the operator after RHS, let
 577     // the pending operator take RHS as its LHS.
 578     int NextPrec = GetTokPrecedence();
 579     if (TokPrec &lt; NextPrec) {
 580       <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
 581       if (RHS == 0) return 0;</b>
 582     }
 583     // Merge LHS/RHS.
 584     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 585   }  // loop around to the top of the while loop.
 586 }
 587 </pre>
 588 </div>
 589
 590 <p>At this point, we know that the binary operator to the RHS of our primary
 591 has higher precedence than the binop we are currently parsing.  As such, we know
 592 that any sequence of pairs whose operators are all higher precedence than "+"
 593 should be parsed together and returned as "RHS".  To do this, we recursively
 594 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 595 precedence required for it to continue.  In our example above, this will cause
 596 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 597 of the '+' expression.</p>
 598
 599 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
 600 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 601 correctly handle fully general binary expression parsing in a very elegant way.
 602 This was a whirlwind tour of this code, and it is somewhat subtle.  I recommend
 603 running through it with a few tough examples to see how it works.
 604 </p>
 605
 606 <p>This wraps up handling of expressions.  At this point, we can point the
 607 parser at an arbitrary token stream and build an expression from them, stopping
 608 at the first token that is not part of the expression.  Next up we need to
 609 handle function definitions etc.</p>
 610
 611 </div>
 612
 613 <!-- *********************************************************************** -->
 614 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 615 <!-- *********************************************************************** -->
 616
 617 <div class="doc_text">
 618
 619 <p>
 620 The next thing missing is handling of function prototypes.  In Kaleidoscope,
 621 these are used both for 'extern' function declarations as well as function body
 622 definitions.  The code to do this is straight-forward and not very interesting
 623 (once you've survived expressions):
 624 </p>
 625
 626 <div class="doc_code">
 627 <pre>
 628 /// prototype
 629 ///   ::= id '(' id* ')'
 630 static PrototypeAST *ParsePrototype() {
 631   if (CurTok != tok_identifier)
 632     return ErrorP("Expected function name in prototype");
 633
 634   std::string FnName = IdentifierStr;
 635   getNextToken();
 636
 637   if (CurTok != '(')
 638     return ErrorP("Expected '(' in prototype");
 639
 640   // Read the list of argument names.
 641   std::vector&lt;std::string&gt; ArgNames;
 642   while (getNextToken() == tok_identifier)
 643     ArgNames.push_back(IdentifierStr);
 644   if (CurTok != ')')
 645     return ErrorP("Expected ')' in prototype");
 646
 647   // success.
 648   getNextToken();  // eat ')'.
 649
 650   return new PrototypeAST(FnName, ArgNames);
 651 }
 652 </pre>
 653 </div>
 654
 655 <p>Given this, a function definition is very simple, just a prototype plus
 656 an expression to implement the body:</p>
 657
 658 <div class="doc_code">
 659 <pre>
 660 /// definition ::= 'def' prototype expression
 661 static FunctionAST *ParseDefinition() {
 662   getNextToken();  // eat def.
 663   PrototypeAST *Proto = ParsePrototype();
 664   if (Proto == 0) return 0;
 665
 666   if (ExprAST *E = ParseExpression())
 667     return new FunctionAST(Proto, E);
 668   return 0;
 669 }
 670 </pre>
 671 </div>
 672
 673 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 674 well as to support forward declaration of user functions.  'externs' are just
 675 prototypes with no body:</p>
 676
 677 <div class="doc_code">
 678 <pre>
 679 /// external ::= 'extern' prototype
 680 static PrototypeAST *ParseExtern() {
 681   getNextToken();  // eat extern.
 682   return ParsePrototype();
 683 }
 684 </pre>
 685 </div>
 686
 687 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 688 evaluate them on the fly.  We will handle this by defining anonymous nullary
 689 (zero argument) functions for them:</p>
 690
 691 <div class="doc_code">
 692 <pre>
 693 /// toplevelexpr ::= expression
 694 static FunctionAST *ParseTopLevelExpr() {
 695   if (ExprAST *E = ParseExpression()) {
 696     // Make an anonymous proto.
 697     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 698     return new FunctionAST(Proto, E);
 699   }
 700   return 0;
 701 }
 702 </pre>
 703 </div>
 704
 705 <p>Now that we have all the pieces, lets build a little driver that will let us
 706 actually <em>execute</em> this code we've built!</p>
 707
 708 </div>
 709
 710 <!-- *********************************************************************** -->
 711 <div class="doc_section"><a name="driver">The Driver</a></div>
 712 <!-- *********************************************************************** -->
 713
 714 <div class="doc_text">
 715
 716 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 717 dispatch loop.  There isn't much interesting here, so I'll just include the
 718 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 719 Parsing" section.</p>
 720
 721 <div class="doc_code">
 722 <pre>
 723 /// top ::= definition | external | expression | ';'
 724 static void MainLoop() {
 725   while (1) {
 726     fprintf(stderr, "ready&gt; ");
 727     switch (CurTok) {
 728     case tok_eof:    return;
 729     case ';':        getNextToken(); break;  // ignore top level semicolons.
 730     case tok_def:    HandleDefinition(); break;
 731     case tok_extern: HandleExtern(); break;
 732     default:         HandleTopLevelExpression(); break;
 733     }
 734   }
 735 }
 736 </pre>
 737 </div>
 738
 739 <p>The most interesting part of this is that we ignore top-level semi colons.
 740 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 741 command line, the parser doesn't know that that is the end of what you will
 742 type.  For example, on the next line you could type "def foo..." in which case
 743 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 744 which would continue the expression.  Having top-level semicolons allows you to
 745 type "4+5;" and the parser will know you are done.</p>
 746
 747 </div>
 748
 749 <!-- *********************************************************************** -->
 750 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
 751 <!-- *********************************************************************** -->
 752
 753 <div class="doc_text">
 754
 755 <p>With just under 400 lines of commented code (240 lines of non-comment,
 756 non-blank code), we fully defined our minimal language, including a lexer,
 757 parser and AST builder.  With this done, the executable will validate
 758 Kaleidoscope code and tell us if it is gramatically invalid.  For
 759 example, here is a sample interaction:</p>
 760
 761 <div class="doc_code">
 762 <pre>
 763 $ <b>./a.out</b>
 764 ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
 765 Parsed a function definition.
 766 ready&gt; <b>def foo(x y) x+y y;</b>
 767 Parsed a function definition.
 768 Parsed a top-level expr
 769 ready&gt; <b>def foo(x y) x+y );</b>
 770 Parsed a function definition.
 771 Error: unknown token when expecting an expression
 772 ready&gt; <b>extern sin(a);</b>
 773 ready&gt; Parsed an extern
 774 ready&gt; <b>^D</b>
 775 $
 776 </pre>
 777 </div>
 778
 779 <p>There is a lot of room for extension here.  You can define new AST nodes,
 780 extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
 781 installment</a>, we will describe how to generate LLVM IR from the AST.</p>
 782
 783 </div>
 784
 785 <!-- *********************************************************************** -->
 786 <div class="doc_section"><a name="code">Full Code Listing</a></div>
 787 <!-- *********************************************************************** -->
 788
 789 <div class="doc_text">
 790
 791 <p>
 792 Here is the complete code listing for this and the previous chapter.
 793 Note that it is fully self-contained: you don't need LLVM or any external
 794 libraries at all for this (other than the C and C++ standard libraries of
 795 course).  To build this, just compile with:</p>
 796
 797 <div class="doc_code">
 798 <pre>
 799    # Compile
 800    g++ -g -O3 toy.cpp
 801    # Run
 802    ./a.out
 803 </pre>
 804 </div>
 805
 806 <p>Here is the code:</p>
 807
 808 <div class="doc_code">
 809 <pre>
 810 #include &lt;cstdio&gt;
 811 #include &lt;string&gt;
 812 #include &lt;map&gt;
 813 #include &lt;vector&gt;
 814
 815 //===----------------------------------------------------------------------===//
 816 // Lexer
 817 //===----------------------------------------------------------------------===//
 818
 819 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 820 // of these for known things.
 821 enum Token {
 822   tok_eof = -1,
 823
 824   // commands
 825   tok_def = -2, tok_extern = -3,
 826
 827   // primary
 828   tok_identifier = -4, tok_number = -5,
 829 };
 830
 831 static std::string IdentifierStr;  // Filled in if tok_identifier
 832 static double NumVal;              // Filled in if tok_number
 833
 834 /// gettok - Return the next token from standard input.
 835 static int gettok() {
 836   static int LastChar = ' ';
 837
 838   // Skip any whitespace.
 839   while (isspace(LastChar))
 840     LastChar = getchar();
 841
 842   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 843     IdentifierStr = LastChar;
 844     while (isalnum((LastChar = getchar())))
 845       IdentifierStr += LastChar;
 846
 847     if (IdentifierStr == "def") return tok_def;
 848     if (IdentifierStr == "extern") return tok_extern;
 849     return tok_identifier;
 850   }
 851
 852   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 853     std::string NumStr;
 854     do {
 855       NumStr += LastChar;
 856       LastChar = getchar();
 857     } while (isdigit(LastChar) || LastChar == '.');
 858
 859     NumVal = strtod(NumStr.c_str(), 0);
 860     return tok_number;
 861   }
 862
 863   if (LastChar == '#') {
 864     // Comment until end of line.
 865     do LastChar = getchar();
 866     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
 867
 868     if (LastChar != EOF)
 869       return gettok();
 870   }
 871
 872   // Check for end of file.  Don't eat the EOF.
 873   if (LastChar == EOF)
 874     return tok_eof;
 875
 876   // Otherwise, just return the character as its ascii value.
 877   int ThisChar = LastChar;
 878   LastChar = getchar();
 879   return ThisChar;
 880 }
 881
 882 //===----------------------------------------------------------------------===//
 883 // Abstract Syntax Tree (aka Parse Tree)
 884 //===----------------------------------------------------------------------===//
 885
 886 /// ExprAST - Base class for all expression nodes.
 887 class ExprAST {
 888 public:
 889   virtual ~ExprAST() {}
 890 };
 891
 892 /// NumberExprAST - Expression class for numeric literals like "1.0".
 893 class NumberExprAST : public ExprAST {
 894   double Val;
 895 public:
 896   explicit NumberExprAST(double val) : Val(val) {}
 897 };
 898
 899 /// VariableExprAST - Expression class for referencing a variable, like "a".
 900 class VariableExprAST : public ExprAST {
 901   std::string Name;
 902 public:
 903   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 904 };
 905
 906 /// BinaryExprAST - Expression class for a binary operator.
 907 class BinaryExprAST : public ExprAST {
 908   char Op;
 909   ExprAST *LHS, *RHS;
 910 public:
 911   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 912     : Op(op), LHS(lhs), RHS(rhs) {}
 913 };
 914
 915 /// CallExprAST - Expression class for function calls.
 916 class CallExprAST : public ExprAST {
 917   std::string Callee;
 918   std::vector&lt;ExprAST*&gt; Args;
 919 public:
 920   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 921     : Callee(callee), Args(args) {}
 922 };
 923
 924 /// PrototypeAST - This class represents the "prototype" for a function,
 925 /// which captures its name, and its argument names (thus implicitly the number
 926 /// of arguments the function takes).
 927 class PrototypeAST {
 928   std::string Name;
 929   std::vector&lt; Args;
 930 public:
 931   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 932     : Name(name), Args(args) {}
 933
 934 };
 935
 936 /// FunctionAST - This class represents a function definition itself.
 937 class FunctionAST {
 938   PrototypeAST *Proto;
 939   ExprAST *Body;
 940 public:
 941   FunctionAST(PrototypeAST *proto, ExprAST *body)
 942     : Proto(proto), Body(body) {}
 943
 944 };
 945
 946 //===----------------------------------------------------------------------===//
 947 // Parser
 948 //===----------------------------------------------------------------------===//
 949
 950 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 951 /// token the parser it looking at.  getNextToken reads another token from the
 952 /// lexer and updates CurTok with its results.
 953 static int CurTok;
 954 static int getNextToken() {
 955   return CurTok = gettok();
 956 }
 957
 958 /// BinopPrecedence - This holds the precedence for each binary operator that is
 959 /// defined.
 960 static std::map&lt;char, int&gt; BinopPrecedence;
 961
 962 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 963 static int GetTokPrecedence() {
 964   if (!isascii(CurTok))
 965     return -1;
 966
 967   // Make sure it's a declared binop.
 968   int TokPrec = BinopPrecedence[CurTok];
 969   if (TokPrec &lt;= 0) return -1;
 970   return TokPrec;
 971 }
 972
 973 /// Error* - These are little helper functions for error handling.
 974 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 975 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 976 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 977
 978 static ExprAST *ParseExpression();
 979
 980 /// identifierexpr
 981 ///   ::= identifier
 982 ///   ::= identifier '(' expression* ')'
 983 static ExprAST *ParseIdentifierExpr() {
 984   std::string IdName = IdentifierStr;
 985
 986   getNextToken();  // eat identifier.
 987
 988   if (CurTok != '(') // Simple variable ref.
 989     return new VariableExprAST(IdName);
 990
 991   // Call.
 992   getNextToken();  // eat (
 993   std::vector&lt;ExprAST*&gt; Args;
 994   if (CurTok != ')') {
 995     while (1) {
 996       ExprAST *Arg = ParseExpression();
 997       if (!Arg) return 0;
 998       Args.push_back(Arg);
 999
1000       if (CurTok == ')') break;
1001
1002       if (CurTok != ',')
1003         return Error("Expected ')'");
1004       getNextToken();
1005     }
1006   }
1007
1008   // Eat the ')'.
1009   getNextToken();
1010
1011   return new CallExprAST(IdName, Args);
1012 }
1013
1014 /// numberexpr ::= number
1015 static ExprAST *ParseNumberExpr() {
1016   ExprAST *Result = new NumberExprAST(NumVal);
1017   getNextToken(); // consume the number
1018   return Result;
1019 }
1020
1021 /// parenexpr ::= '(' expression ')'
1022 static ExprAST *ParseParenExpr() {
1023   getNextToken();  // eat (.
1024   ExprAST *V = ParseExpression();
1025   if (!V) return 0;
1026
1027   if (CurTok != ')')
1028     return Error("expected ')'");
1029   getNextToken();  // eat ).
1030   return V;
1031 }
1032
1033 /// primary
1034 ///   ::= identifierexpr
1035 ///   ::= numberexpr
1036 ///   ::= parenexpr
1037 static ExprAST *ParsePrimary() {
1038   switch (CurTok) {
1039   default: return Error("unknown token when expecting an expression");
1040   case tok_identifier: return ParseIdentifierExpr();
1041   case tok_number:     return ParseNumberExpr();
1042   case '(':            return ParseParenExpr();
1043   }
1044 }
1045
1046 /// binoprhs
1047 ///   ::= ('+' primary)*
1048 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1049   // If this is a binop, find its precedence.
1050   while (1) {
1051     int TokPrec = GetTokPrecedence();
1052
1053     // If this is a binop that binds at least as tightly as the current binop,
1054     // consume it, otherwise we are done.
1055     if (TokPrec &lt; ExprPrec)
1056       return LHS;
1057
1058     // Okay, we know this is a binop.
1059     int BinOp = CurTok;
1060     getNextToken();  // eat binop
1061
1062     // Parse the primary expression after the binary operator.
1063     ExprAST *RHS = ParsePrimary();
1064     if (!RHS) return 0;
1065
1066     // If BinOp binds less tightly with RHS than the operator after RHS, let
1067     // the pending operator take RHS as its LHS.
1068     int NextPrec = GetTokPrecedence();
1069     if (TokPrec &lt; NextPrec) {
1070       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1071       if (RHS == 0) return 0;
1072     }
1073
1074     // Merge LHS/RHS.
1075     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1076   }
1077 }
1078
1079 /// expression
1080 ///   ::= primary binoprhs
1081 ///
1082 static ExprAST *ParseExpression() {
1083   ExprAST *LHS = ParsePrimary();
1084   if (!LHS) return 0;
1085
1086   return ParseBinOpRHS(0, LHS);
1087 }
1088
1089 /// prototype
1090 ///   ::= id '(' id* ')'
1091 static PrototypeAST *ParsePrototype() {
1092   if (CurTok != tok_identifier)
1093     return ErrorP("Expected function name in prototype");
1094
1095   std::string FnName = IdentifierStr;
1096   getNextToken();
1097
1098   if (CurTok != '(')
1099     return ErrorP("Expected '(' in prototype");
1100
1101   std::vector&lt;std::string&gt; ArgNames;
1102   while (getNextToken() == tok_identifier)
1103     ArgNames.push_back(IdentifierStr);
1104   if (CurTok != ')')
1105     return ErrorP("Expected ')' in prototype");
1106
1107   // success.
1108   getNextToken();  // eat ')'.
1109
1110   return new PrototypeAST(FnName, ArgNames);
1111 }
1112
1113 /// definition ::= 'def' prototype expression
1114 static FunctionAST *ParseDefinition() {
1115   getNextToken();  // eat def.
1116   PrototypeAST *Proto = ParsePrototype();
1117   if (Proto == 0) return 0;
1118
1119   if (ExprAST *E = ParseExpression())
1120     return new FunctionAST(Proto, E);
1121   return 0;
1122 }
1123
1124 /// toplevelexpr ::= expression
1125 static FunctionAST *ParseTopLevelExpr() {
1126   if (ExprAST *E = ParseExpression()) {
1127     // Make an anonymous proto.
1128     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1129     return new FunctionAST(Proto, E);
1130   }
1131   return 0;
1132 }
1133
1134 /// external ::= 'extern' prototype
1135 static PrototypeAST *ParseExtern() {
1136   getNextToken();  // eat extern.
1137   return ParsePrototype();
1138 }
1139
1140 //===----------------------------------------------------------------------===//
1141 // Top-Level parsing
1142 //===----------------------------------------------------------------------===//
1143
1144 static void HandleDefinition() {
1145   if (FunctionAST *F = ParseDefinition()) {
1146     fprintf(stderr, "Parsed a function definition.\n");
1147   } else {
1148     // Skip token for error recovery.
1149     getNextToken();
1150   }
1151 }
1152
1153 static void HandleExtern() {
1154   if (PrototypeAST *P = ParseExtern()) {
1155     fprintf(stderr, "Parsed an extern\n");
1156   } else {
1157     // Skip token for error recovery.
1158     getNextToken();
1159   }
1160 }
1161
1162 static void HandleTopLevelExpression() {
1163   // Evaluate a top level expression into an anonymous function.
1164   if (FunctionAST *F = ParseTopLevelExpr()) {
1165     fprintf(stderr, "Parsed a top-level expr\n");
1166   } else {
1167     // Skip token for error recovery.
1168     getNextToken();
1169   }
1170 }
1171
1172 /// top ::= definition | external | expression | ';'
1173 static void MainLoop() {
1174   while (1) {
1175     fprintf(stderr, "ready&gt; ");
1176     switch (CurTok) {
1177     case tok_eof:    return;
1178     case ';':        getNextToken(); break;  // ignore top level semicolons.
1179     case tok_def:    HandleDefinition(); break;
1180     case tok_extern: HandleExtern(); break;
1181     default:         HandleTopLevelExpression(); break;
1182     }
1183   }
1184 }
1185
1186 //===----------------------------------------------------------------------===//
1187 // Main driver code.
1188 //===----------------------------------------------------------------------===//
1189
1190 int main() {
1191   // Install standard binary operators.
1192   // 1 is lowest precedence.
1193   BinopPrecedence['&lt;'] = 10;
1194   BinopPrecedence['+'] = 20;
1195   BinopPrecedence['-'] = 20;
1196   BinopPrecedence['*'] = 40;  // highest.
1197
1198   // Prime the first token.
1199   fprintf(stderr, "ready&gt; ");
1200   getNextToken();
1201
1202   MainLoop();
1203   return 0;
1204 }
1205 </pre>
1206 </div>
1207 </div>
1208
1209 <!-- *********************************************************************** -->
1210 <hr>
1211 <address>
1212   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1213   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1214   <a href="http://validator.w3.org/check/referer"><img
1215   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1216
1217   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1218   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1219   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1220 </address>
1221 </body>
1222 </html>