docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <ul>
  17 <li>Chapter 2
  18   <ol>
  19     <li><a href="#intro">Chapter 2 Introduction</a></li>
  20     <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
  21     <li><a href="#parserbasics">Parser Basics</a></li>
  22     <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
  23     <li><a href="#parserbinops">Binary Expression Parsing</a></li>
  24     <li><a href="#parsertop">Parsing the Rest</a></li>
  25     <li><a href="#driver">The Driver</a></li>
  26     <li><a href="#conclusions">Conclusions</a></li>
  27     <li><a href="#code">Full Code Listing</a></li>
  28   </ol>
  29 </li>
  30 </ul>
  31
  32 <div class="doc_author">
  33   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  34 </div>
  35
  36 <!-- *********************************************************************** -->
  37 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
  38 <!-- *********************************************************************** -->
  39
  40 <div class="doc_text">
  41
  42 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
  43 with LLVM</a>" tutorial.  This chapter shows you how to use the <a
  44 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
  45 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  46 our Kaleidoscope language and build an <a
  47 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  48 Tree</a> (AST).</p>
  49
  50 <p>The parser we will build uses a combination of <a
  51 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  52 Parsing</a> and <a href=
  53 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  54 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
  55 and the former for everything else).  Before we get to parsing though, lets talk
  56 about the output of the parser: the Abstract Syntax Tree.</p>
  57
  58 </div>
  59
  60 <!-- *********************************************************************** -->
  61 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  62 <!-- *********************************************************************** -->
  63
  64 <div class="doc_text">
  65
  66 <p>The AST for a program captures its behavior in a way that it is easy for
  67 later stages of the compiler (e.g. code generation) to interpret.  We basically
  68 want one object for each construct in the language, and the AST should closely
  69 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  70 function object.  We'll start with expressions first:</p>
  71
  72 <div class="doc_code">
  73 <pre>
  74 /// ExprAST - Base class for all expression nodes.
  75 class ExprAST {
  76 public:
  77   virtual ~ExprAST() {}
  78 };
  79
  80 /// NumberExprAST - Expression class for numeric literals like "1.0".
  81 class NumberExprAST : public ExprAST {
  82   double Val;
  83 public:
  84   explicit NumberExprAST(double val) : Val(val) {}
  85 };
  86 </pre>
  87 </div>
  88
  89 <p>The code above shows the definition of the base ExprAST class and one
  90 subclass which we use for numeric literals.  The important thing about this is
  91 that the NumberExprAST class captures the numeric value of the literal in the
  92 class, so that later phases of the compiler can know what it is.</p>
  93
  94 <p>Right now we only create the AST,  so there are no useful accessor methods on
  95 them.  It would be very easy to add a virtual method to pretty print the code,
  96 for example.  Here are the other expression AST node definitions that we'll use
  97 in the basic form of the Kaleidoscope language.
  98 </p>
  99
 100 <div class="doc_code">
 101 <pre>
 102 /// VariableExprAST - Expression class for referencing a variable, like "a".
 103 class VariableExprAST : public ExprAST {
 104   std::string Name;
 105 public:
 106   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 107 };
 108
 109 /// BinaryExprAST - Expression class for a binary operator.
 110 class BinaryExprAST : public ExprAST {
 111   char Op;
 112   ExprAST *LHS, *RHS;
 113 public:
 114   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 115     : Op(op), LHS(lhs), RHS(rhs) {}
 116 };
 117
 118 /// CallExprAST - Expression class for function calls.
 119 class CallExprAST : public ExprAST {
 120   std::string Callee;
 121   std::vector&lt;ExprAST*&gt; Args;
 122 public:
 123   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 124     : Callee(callee), Args(args) {}
 125 };
 126 </pre>
 127 </div>
 128
 129 <p>This is all (intentially) rather straight-forward: variables capture the
 130 variable name, binary operators capture their opcode (e.g. '+'), and calls
 131 capture a function name and list of argument expressions.  One thing that is
 132 nice about our AST is that it captures the language features without talking
 133 about the syntax of the language.  Note that there is no discussion about
 134 precedence of binary operators, lexical structure etc.</p>
 135
 136 <p>For our basic language, these are all of the expression nodes we'll define.
 137 Because it doesn't have conditional control flow, it isn't Turing-complete;
 138 we'll fix that in a later installment.  The two things we need next are a way
 139 to talk about the interface to a function, and a way to talk about functions
 140 themselves:</p>
 141
 142 <div class="doc_code">
 143 <pre>
 144 /// PrototypeAST - This class represents the "prototype" for a function,
 145 /// which captures its argument names as well as if it is an operator.
 146 class PrototypeAST {
 147   std::string Name;
 148   std::vector&lt;std::string&gt; Args;
 149 public:
 150   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 151     : Name(name), Args(args) {}
 152 };
 153
 154 /// FunctionAST - This class represents a function definition itself.
 155 class FunctionAST {
 156   PrototypeAST *Proto;
 157   ExprAST *Body;
 158 public:
 159   FunctionAST(PrototypeAST *proto, ExprAST *body)
 160     : Proto(proto), Body(body) {}
 161 };
 162 </pre>
 163 </div>
 164
 165 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 166 Since all values are double precision floating point, this fact doesn't need to
 167 be captured anywhere.  In a more aggressive and realistic language, the
 168 "ExprAST" class would probably have a type field.</p>
 169
 170 <p>With this scaffolding, we can now talk about parsing expressions and function
 171 bodies in Kaleidoscope.</p>
 172
 173 </div>
 174
 175 <!-- *********************************************************************** -->
 176 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 177 <!-- *********************************************************************** -->
 178
 179 <div class="doc_text">
 180
 181 <p>Now that we have an AST to build, we need to define the parser code to build
 182 it.  The idea here is that we want to parse something like "x+y" (which is
 183 returned as three tokens by the lexer) into an AST that could be generated with
 184 calls like this:</p>
 185
 186 <div class="doc_code">
 187 <pre>
 188   ExprAST *X = new VariableExprAST("x");
 189   ExprAST *Y = new VariableExprAST("y");
 190   ExprAST *Result = new BinaryExprAST('+', X, Y);
 191 </pre>
 192 </div>
 193
 194 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 195
 196 <div class="doc_code">
 197 <pre>
 198 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 199 /// token the parser it looking at.  getNextToken reads another token from the
 200 /// lexer and updates CurTok with its results.
 201 static int CurTok;
 202 static int getNextToken() {
 203   return CurTok = gettok();
 204 }
 205 </pre>
 206 </div>
 207
 208 <p>
 209 This implements a simple token buffer around the lexer.  This allows
 210 us to look one token ahead at what the lexer is returning.  Every function in
 211 our parser will assume that CurTok is the current token that needs to be
 212 parsed.</p>
 213
 214 <p>Again, we define these with global variables; it would be better design to
 215 wrap the entire parser in a class and use instance variables for these.
 216 </p>
 217
 218 <div class="doc_code">
 219 <pre>
 220
 221 /// Error* - These are little helper functions for error handling.
 222 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 223 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 224 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 225 </pre>
 226 </div>
 227
 228 <p>
 229 The <tt>Error</tt> routines are simple helper routines that our parser will use
 230 to handle errors.  The error recovery in our parser will not be the best and
 231 is not particular user-friendly, but it will be enough for our tutorial.  These
 232 routines make it easier to handle errors in routines that have various return
 233 types: they always return null.</p>
 234
 235 <p>With these basic helper functions implemented, we can implement the first
 236 piece of our grammar: we'll start with numeric literals.</p>
 237
 238 </div>
 239
 240 <!-- *********************************************************************** -->
 241 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 242  Parsing</a></div>
 243 <!-- *********************************************************************** -->
 244
 245 <div class="doc_text">
 246
 247 <p>We start with numeric literals, because they are the simplest to process.
 248 For each production in our grammar, we'll define a function which parses that
 249 production.  For numeric literals, we have:
 250 </p>
 251
 252 <div class="doc_code">
 253 <pre>
 254 /// numberexpr ::= number
 255 static ExprAST *ParseNumberExpr() {
 256   ExprAST *Result = new NumberExprAST(NumVal);
 257   getNextToken(); // consume the number
 258   return Result;
 259 }
 260 </pre>
 261 </div>
 262
 263 <p>This routine is very simple: it expects to be called when the current token
 264 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 265 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
 266 returns.</p>
 267
 268 <p>There are some interesting aspects of this.  The most important one is that
 269 this routine eats all of the tokens that correspond to the production, and
 270 returns the lexer buffer with the next token (which is not part of the grammar
 271 production) ready to go.  This is a fairly standard way to go for recursive
 272 descent parsers.  For a better example, the parenthesis operator is defined like
 273 this:</p>
 274
 275 <div class="doc_code">
 276 <pre>
 277 /// parenexpr ::= '(' expression ')'
 278 static ExprAST *ParseParenExpr() {
 279   getNextToken();  // eat (.
 280   ExprAST *V = ParseExpression();
 281   if (!V) return 0;
 282
 283   if (CurTok != ')')
 284     return Error("expected ')'");
 285   getNextToken();  // eat ).
 286   return V;
 287 }
 288 </pre>
 289 </div>
 290
 291 <p>This function illustrates a number of interesting things about the parser:
 292 1) it shows how we use the Error routines.  When called, this function expects
 293 that the current token is a '(' token, but after parsing the subexpression, it
 294 is possible that there is not a ')' waiting.  For example, if the user types in
 295 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 296 occur, the parser needs a way to indicate that they happened: in our parser, we
 297 return null on an error.</p>
 298
 299 <p>Another interesting aspect of this function is that it uses recursion by
 300 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 301 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 302 recursive grammars, and keeps each production very simple.  Note that
 303 parentheses do not cause construction of AST nodes themselves.  While we could
 304 do this, the most important role of parens are to guide the parser and provide
 305 grouping.  Once the parser constructs the AST, parens are not needed.</p>
 306
 307 <p>The next simple production is for handling variable references and function
 308 calls:</p>
 309
 310 <div class="doc_code">
 311 <pre>
 312 /// identifierexpr
 313 ///   ::= identifier
 314 ///   ::= identifier '(' expression* ')'
 315 static ExprAST *ParseIdentifierExpr() {
 316   std::string IdName = IdentifierStr;
 317
 318   getNextToken();  // eat identifier.
 319
 320   if (CurTok != '(') // Simple variable ref.
 321     return new VariableExprAST(IdName);
 322
 323   // Call.
 324   getNextToken();  // eat (
 325   std::vector&lt;ExprAST*&gt; Args;
 326   while (1) {
 327     ExprAST *Arg = ParseExpression();
 328     if (!Arg) return 0;
 329     Args.push_back(Arg);
 330
 331     if (CurTok == ')') break;
 332
 333     if (CurTok != ',')
 334       return Error("Expected ')'");
 335     getNextToken();
 336   }
 337
 338   // Eat the ')'.
 339   getNextToken();
 340
 341   return new CallExprAST(IdName, Args);
 342 }
 343 </pre>
 344 </div>
 345
 346 <p>This routine follows the same style as the other routines (it expects to be
 347 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 348 recursion and error handling.  One interesting aspect of this is that it uses
 349 <em>look-ahead</em> to determine if the current identifier is a stand alone
 350 variable reference or if it is a function call expression.  It handles this by
 351 checking to see if the token after the identifier is a '(' token, and constructs
 352 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 353 </p>
 354
 355 <p>Now that we have all of our simple expression parsing logic in place, we can
 356 define a helper function to wrap them up in a class.  We call this class of
 357 expressions "primary" expressions, for reasons that will become more clear
 358 later.  In order to parse a primary expression, we need to determine what sort
 359 of expression it is:</p>
 360
 361 <div class="doc_code">
 362 <pre>
 363 /// primary
 364 ///   ::= identifierexpr
 365 ///   ::= numberexpr
 366 ///   ::= parenexpr
 367 static ExprAST *ParsePrimary() {
 368   switch (CurTok) {
 369   default: return Error("unknown token when expecting an expression");
 370   case tok_identifier: return ParseIdentifierExpr();
 371   case tok_number:     return ParseNumberExpr();
 372   case '(':            return ParseParenExpr();
 373   }
 374 }
 375 </pre>
 376 </div>
 377
 378 <p>Now that you see the definition of this function, it makes it more obvious
 379 why we can assume the state of CurTok in the various functions.  This uses
 380 look-ahead to determine which sort of expression is being inspected, and parses
 381 it with a function call.</p>
 382
 383 <p>Now that basic expressions are handled, we need to handle binary expressions,
 384 which are a bit more complex.</p>
 385
 386 </div>
 387
 388 <!-- *********************************************************************** -->
 389 <div class="doc_section"><a name="parserbinops">Binary Expression
 390  Parsing</a></div>
 391 <!-- *********************************************************************** -->
 392
 393 <div class="doc_text">
 394
 395 <p>Binary expressions are significantly harder to parse because they are often
 396 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 397 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 398 mathematics, we expect the later parse, because "*" (multiplication) has
 399 higher <em>precedence</em> than "+" (addition).</p>
 400
 401 <p>There are many ways to handle this, but an elegant and efficient way is to
 402 use <a href=
 403 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 404 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 405 guide recursion.  To start with, we need a table of precedences:</p>
 406
 407 <div class="doc_code">
 408 <pre>
 409 /// BinopPrecedence - This holds the precedence for each binary operator that is
 410 /// defined.
 411 static std::map&lt;char, int&gt; BinopPrecedence;
 412
 413 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 414 static int GetTokPrecedence() {
 415   if (!isascii(CurTok))
 416     return -1;
 417
 418   // Make sure it's a declared binop.
 419   int TokPrec = BinopPrecedence[CurTok];
 420   if (TokPrec &lt;= 0) return -1;
 421   return TokPrec;
 422 }
 423
 424 int main() {
 425   // Install standard binary operators.
 426   // 1 is lowest precedence.
 427   BinopPrecedence['&lt;'] = 10;
 428   BinopPrecedence['+'] = 20;
 429   BinopPrecedence['-'] = 20;
 430   BinopPrecedence['*'] = 40;  // highest.
 431   ...
 432 }
 433 </pre>
 434 </div>
 435
 436 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 437 (this can obviously be extended by you, the reader).  The
 438 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 439 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 440 new operators and makes it clear that the algorithm doesn't depend on the
 441 specific operators involved, but it would be easy enough to eliminate the map
 442 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
 443
 444 <p>With the helper above defined, we can now start parsing binary expressions.
 445 The basic idea of operator precedence parsing is to break down an expression
 446 with potentially ambiguous binary operators into pieces.  Consider for example
 447 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 448 as a stream of primary expressions separated by binary operators.  As such,
 449 it will first parse the leading primary expression "a", then it will see the
 450 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 451 are primary expressions, the binary expression parser doesn't need to worry
 452 about nested subexpressions like (c+d) at all.
 453 </p>
 454
 455 <p>
 456 To start, an expression is a primary expression potentially followed by a
 457 sequence of [binop,primaryexpr] pairs:</p>
 458
 459 <div class="doc_code">
 460 <pre>
 461 /// expression
 462 ///   ::= primary binoprhs
 463 ///
 464 static ExprAST *ParseExpression() {
 465   ExprAST *LHS = ParsePrimary();
 466   if (!LHS) return 0;
 467
 468   return ParseBinOpRHS(0, LHS);
 469 }
 470 </pre>
 471 </div>
 472
 473 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 474 us.  It takes a precedence and a pointer to an expression for the part parsed
 475 so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 476 allowed to be empty, in which case it returns the expression that is passed into
 477 it. In our example above, the code passes the expression for "a" into
 478 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 479
 480 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 481 minimal operator precedence</em> that the function is allowed to eat.  For
 482 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 483 passed in a precedence of 40, it will not consume any tokens (because the
 484 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 485 with:</p>
 486
 487 <div class="doc_code">
 488 <pre>
 489 /// binoprhs
 490 ///   ::= ('+' primary)*
 491 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 492   // If this is a binop, find its precedence.
 493   while (1) {
 494     int TokPrec = GetTokPrecedence();
 495
 496     // If this is a binop that binds at least as tightly as the current binop,
 497     // consume it, otherwise we are done.
 498     if (TokPrec &lt; ExprPrec)
 499       return LHS;
 500 </pre>
 501 </div>
 502
 503 <p>This code gets the precedence of the current token and checks to see if if is
 504 too low.  Because we defined invalid tokens to have a precedence of -1, this
 505 check implicitly knows that the pair-stream ends when the token stream runs out
 506 of binary operators.  If this check succeeds, we know that the token is a binary
 507 operator and that it will be included in this expression:</p>
 508
 509 <div class="doc_code">
 510 <pre>
 511     // Okay, we know this is a binop.
 512     int BinOp = CurTok;
 513     getNextToken();  // eat binop
 514
 515     // Parse the primary expression after the binary operator.
 516     ExprAST *RHS = ParsePrimary();
 517     if (!RHS) return 0;
 518 </pre>
 519 </div>
 520
 521 <p>As such, this code eats (and remembers) the binary operator and then parses
 522 the following primary expression.  This builds up the whole pair, the first of
 523 which is [+, b] for the running example.</p>
 524
 525 <p>Now that we parsed the left-hand side of an expression and one pair of the
 526 RHS sequence, we have to decide which way the expression associates.  In
 527 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 528 To determine this, we look ahead at "binop" to determine its precedence and
 529 compare it to BinOp's precedence (which is '+' in this case):</p>
 530
 531 <div class="doc_code">
 532 <pre>
 533     // If BinOp binds less tightly with RHS than the operator after RHS, let
 534     // the pending operator take RHS as its LHS.
 535     int NextPrec = GetTokPrecedence();
 536     if (TokPrec &lt; NextPrec) {
 537 </pre>
 538 </div>
 539
 540 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 541 precedence of our current operator, then we know that the parentheses associate
 542 as "(a+b) binop ...".  In our example, since the next operator is "+" and so is
 543 our current one, we know that they have the same precedence.  In this case we'll
 544 create the AST node for "a+b", and then continue parsing:</p>
 545
 546 <div class="doc_code">
 547 <pre>
 548       ... if body omitted ...
 549     }
 550
 551     // Merge LHS/RHS.
 552     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 553   }  // loop around to the top of the while loop.
 554 }
 555 </pre>
 556 </div>
 557
 558 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 559 iteration of the loop, with "+" as the current token.  The code above will eat
 560 and remember it and parse "(c+d)" as the primary expression, which makes the
 561 current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the
 562 binop to the right of the primary.  In this case, the precedence of "*" is
 563 higher than the precedence of "+" so the if condition will be entered.</p>
 564
 565 <p>The critical question left here is "how can the if condition parse the right
 566 hand side in full"?  In particular, to build the AST correctly for our example,
 567 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 568 do this is surprisingly simple (code from the above two blocks duplicated for
 569 context):</p>
 570
 571 <div class="doc_code">
 572 <pre>
 573     // If BinOp binds less tightly with RHS than the operator after RHS, let
 574     // the pending operator take RHS as its LHS.
 575     int NextPrec = GetTokPrecedence();
 576     if (TokPrec &lt; NextPrec) {
 577       RHS = ParseBinOpRHS(TokPrec+1, RHS);
 578       if (RHS == 0) return 0;
 579     }
 580     // Merge LHS/RHS.
 581     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 582   }  // loop around to the top of the while loop.
 583 }
 584 </pre>
 585 </div>
 586
 587 <p>At this point, we know that the binary operator to the RHS of our primary
 588 has higher precedence than the binop we are currently parsing.  As such, we know
 589 that any sequence of pairs whose operators are all higher precedence than "+"
 590 should be parsed together and returned as "RHS".  To do this, we recursively
 591 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 592 precedence required for it to continue.  In our example above, this will cause
 593 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 594 of the '+' expression.</p>
 595
 596 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
 597 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 598 correctly handle fully general binary expression parsing in a very elegant way.
 599 </p>
 600
 601 <p>This wraps up handling of expressions.  At this point, we can point the
 602 parser at an arbitrary token stream and build an expression from them, stopping
 603 at the first token that is not part of the expression.  Next up we need to
 604 handle function definitions etc.</p>
 605
 606 </div>
 607
 608 <!-- *********************************************************************** -->
 609 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 610 <!-- *********************************************************************** -->
 611
 612 <div class="doc_text">
 613
 614 <p>
 615 The first basic thing missing is that of function prototypes.  In Kaleidoscope,
 616 these are used both for 'extern' function declarations as well as function body
 617 definitions.  The code to do this is straight-forward and not very interesting
 618 (once you've survived expressions):
 619 </p>
 620
 621 <div class="doc_code">
 622 <pre>
 623 /// prototype
 624 ///   ::= id '(' id* ')'
 625 static PrototypeAST *ParsePrototype() {
 626   if (CurTok != tok_identifier)
 627     return ErrorP("Expected function name in prototype");
 628
 629   std::string FnName = IdentifierStr;
 630   getNextToken();
 631
 632   if (CurTok != '(')
 633     return ErrorP("Expected '(' in prototype");
 634
 635   std::vector&lt;std::string&gt; ArgNames;
 636   while (getNextToken() == tok_identifier)
 637     ArgNames.push_back(IdentifierStr);
 638   if (CurTok != ')')
 639     return ErrorP("Expected ')' in prototype");
 640
 641   // success.
 642   getNextToken();  // eat ')'.
 643
 644   return new PrototypeAST(FnName, ArgNames);
 645 }
 646 </pre>
 647 </div>
 648
 649 <p>Given this, a function definition is very simple, just a prototype plus
 650 an expression to implement the body:</p>
 651
 652 <div class="doc_code">
 653 <pre>
 654 /// definition ::= 'def' prototype expression
 655 static FunctionAST *ParseDefinition() {
 656   getNextToken();  // eat def.
 657   PrototypeAST *Proto = ParsePrototype();
 658   if (Proto == 0) return 0;
 659
 660   if (ExprAST *E = ParseExpression())
 661     return new FunctionAST(Proto, E);
 662   return 0;
 663 }
 664 </pre>
 665 </div>
 666
 667 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 668 well as to support forward declaration of user functions.  'externs' are just
 669 prototypes with no body:</p>
 670
 671 <div class="doc_code">
 672 <pre>
 673 /// external ::= 'extern' prototype
 674 static PrototypeAST *ParseExtern() {
 675   getNextToken();  // eat extern.
 676   return ParsePrototype();
 677 }
 678 </pre>
 679 </div>
 680
 681 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 682 evaluate them on the fly.  We will handle this by defining anonymous nullary
 683 (zero argument) functions for them:</p>
 684
 685 <div class="doc_code">
 686 <pre>
 687 /// toplevelexpr ::= expression
 688 static FunctionAST *ParseTopLevelExpr() {
 689   if (ExprAST *E = ParseExpression()) {
 690     // Make an anonymous proto.
 691     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 692     return new FunctionAST(Proto, E);
 693   }
 694   return 0;
 695 }
 696 </pre>
 697 </div>
 698
 699 <p>Now that we have all the pieces, lets build a little driver that will let us
 700 actually <em>execute</em> this code we've built!</p>
 701
 702 </div>
 703
 704 <!-- *********************************************************************** -->
 705 <div class="doc_section"><a name="driver">The Driver</a></div>
 706 <!-- *********************************************************************** -->
 707
 708 <div class="doc_text">
 709
 710 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 711 dispatch loop.  There isn't much interesting here, so I'll just include the
 712 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 713 Parsing" section.</p>
 714
 715 <div class="doc_code">
 716 <pre>
 717 /// top ::= definition | external | expression | ';'
 718 static void MainLoop() {
 719   while (1) {
 720     fprintf(stderr, "ready&gt; ");
 721     switch (CurTok) {
 722     case tok_eof:    return;
 723     case ';':        getNextToken(); break;  // ignore top level semicolons.
 724     case tok_def:    HandleDefinition(); break;
 725     case tok_extern: HandleExtern(); break;
 726     default:         HandleTopLevelExpression(); break;
 727     }
 728   }
 729 }
 730 </pre>
 731 </div>
 732
 733 <p>The most interesting part of this is that we ignore top-level semi colons.
 734 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 735 command line, the parser doesn't know that that is the end of what you will
 736 type.  For example, on the next line you could type "def foo..." in which case
 737 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 738 which would continue the expression.  Having top-level semicolons allows you to
 739 type "4+5;" and the parser will know you are done.</p>
 740
 741 </div>
 742
 743 <!-- *********************************************************************** -->
 744 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
 745 <!-- *********************************************************************** -->
 746
 747 <div class="doc_text">
 748
 749 <p>With just under 400 lines of commented code, we fully defined our minimal
 750 language, including a lexer, parser and AST builder.  With this done, the
 751 executable will validate code and tell us if it is gramatically invalid.  For
 752 example, here is a sample interaction:</p>
 753
 754 <div class="doc_code">
 755 <pre>
 756 $ ./a.out
 757 ready&gt; def foo(x y) x+foo(y, 4.0);
 758 ready&gt; Parsed a function definition.
 759 ready&gt; def foo(x y) x+y y;
 760 ready&gt; Parsed a function definition.
 761 ready&gt; Parsed a top-level expr
 762 ready&gt; def foo(x y) x+y );
 763 ready&gt; Parsed a function definition.
 764 ready&gt; Error: unknown token when expecting an expression
 765 ready&gt; extern sin(a);
 766 ready&gt; Parsed an extern
 767 ready&gt; ^D
 768 $
 769 </pre>
 770 </div>
 771
 772 <p>There is a lot of room for extension here.  You can define new AST nodes,
 773 extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
 774 installment</a>, we will describe how to generate LLVM IR from the AST.</p>
 775
 776 </div>
 777
 778 <!-- *********************************************************************** -->
 779 <div class="doc_section"><a name="code">Full Code Listing</a></div>
 780 <!-- *********************************************************************** -->
 781
 782 <div class="doc_text">
 783
 784 <p>
 785 Here is the complete code listing for this and the previous chapter.
 786 Note that it is fully self-contained: you don't need LLVM or any external
 787 libraries at all for this (other than the C and C++ standard libraries of
 788 course).  To build this, just compile with:</p>
 789
 790 <div class="doc_code">
 791 <pre>
 792    # Compile
 793    g++ -g toy.cpp
 794    # Run
 795    ./a.out
 796 </pre>
 797 </div>
 798
 799 <p>Here is the code:</p>
 800
 801 <div class="doc_code">
 802 <pre>
 803 #include &lt;cstdio&gt;
 804 #include &lt;string&gt;
 805 #include &lt;map&gt;
 806 #include &lt;vector&gt;
 807
 808 //===----------------------------------------------------------------------===//
 809 // Lexer
 810 //===----------------------------------------------------------------------===//
 811
 812 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 813 // of these for known things.
 814 enum Token {
 815   tok_eof = -1,
 816
 817   // commands
 818   tok_def = -2, tok_extern = -3,
 819
 820   // primary
 821   tok_identifier = -4, tok_number = -5,
 822 };
 823
 824 static std::string IdentifierStr;  // Filled in if tok_identifier
 825 static double NumVal;              // Filled in if tok_number
 826
 827 /// gettok - Return the next token from standard input.
 828 static int gettok() {
 829   static int LastChar = ' ';
 830
 831   // Skip any whitespace.
 832   while (isspace(LastChar))
 833     LastChar = getchar();
 834
 835   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 836     IdentifierStr = LastChar;
 837     while (isalnum((LastChar = getchar())))
 838       IdentifierStr += LastChar;
 839
 840     if (IdentifierStr == "def") return tok_def;
 841     if (IdentifierStr == "extern") return tok_extern;
 842     return tok_identifier;
 843   }
 844
 845   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 846     std::string NumStr;
 847     do {
 848       NumStr += LastChar;
 849       LastChar = getchar();
 850     } while (isdigit(LastChar) || LastChar == '.');
 851
 852     NumVal = strtod(NumStr.c_str(), 0);
 853     return tok_number;
 854   }
 855
 856   if (LastChar == '#') {
 857     // Comment until end of line.
 858     do LastChar = getchar();
 859     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
 860
 861     if (LastChar != EOF)
 862       return gettok();
 863   }
 864
 865   // Check for end of file.  Don't eat the EOF.
 866   if (LastChar == EOF)
 867     return tok_eof;
 868
 869   // Otherwise, just return the character as its ascii value.
 870   int ThisChar = LastChar;
 871   LastChar = getchar();
 872   return ThisChar;
 873 }
 874
 875 //===----------------------------------------------------------------------===//
 876 // Abstract Syntax Tree (aka Parse Tree)
 877 //===----------------------------------------------------------------------===//
 878
 879 /// ExprAST - Base class for all expression nodes.
 880 class ExprAST {
 881 public:
 882   virtual ~ExprAST() {}
 883 };
 884
 885 /// NumberExprAST - Expression class for numeric literals like "1.0".
 886 class NumberExprAST : public ExprAST {
 887   double Val;
 888 public:
 889   explicit NumberExprAST(double val) : Val(val) {}
 890 };
 891
 892 /// VariableExprAST - Expression class for referencing a variable, like "a".
 893 class VariableExprAST : public ExprAST {
 894   std::string Name;
 895 public:
 896   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 897 };
 898
 899 /// BinaryExprAST - Expression class for a binary operator.
 900 class BinaryExprAST : public ExprAST {
 901   char Op;
 902   ExprAST *LHS, *RHS;
 903 public:
 904   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 905     : Op(op), LHS(lhs), RHS(rhs) {}
 906 };
 907
 908 /// CallExprAST - Expression class for function calls.
 909 class CallExprAST : public ExprAST {
 910   std::string Callee;
 911   std::vector&lt;ExprAST*&gt; Args;
 912 public:
 913   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 914     : Callee(callee), Args(args) {}
 915 };
 916
 917 /// PrototypeAST - This class represents the "prototype" for a function,
 918 /// which captures its argument names as well as if it is an operator.
 919 class PrototypeAST {
 920   std::string Name;
 921   std::vector&lt; Args;
 922 public:
 923   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 924     : Name(name), Args(args) {}
 925
 926 };
 927
 928 /// FunctionAST - This class represents a function definition itself.
 929 class FunctionAST {
 930   PrototypeAST *Proto;
 931   ExprAST *Body;
 932 public:
 933   FunctionAST(PrototypeAST *proto, ExprAST *body)
 934     : Proto(proto), Body(body) {}
 935
 936 };
 937
 938 //===----------------------------------------------------------------------===//
 939 // Parser
 940 //===----------------------------------------------------------------------===//
 941
 942 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 943 /// token the parser it looking at.  getNextToken reads another token from the
 944 /// lexer and updates CurTok with its results.
 945 static int CurTok;
 946 static int getNextToken() {
 947   return CurTok = gettok();
 948 }
 949
 950 /// BinopPrecedence - This holds the precedence for each binary operator that is
 951 /// defined.
 952 static std::map&lt;char, int&gt; BinopPrecedence;
 953
 954 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 955 static int GetTokPrecedence() {
 956   if (!isascii(CurTok))
 957     return -1;
 958
 959   // Make sure it's a declared binop.
 960   int TokPrec = BinopPrecedence[CurTok];
 961   if (TokPrec &lt;= 0) return -1;
 962   return TokPrec;
 963 }
 964
 965 /// Error* - These are little helper functions for error handling.
 966 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 967 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 968 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 969
 970 static ExprAST *ParseExpression();
 971
 972 /// identifierexpr
 973 ///   ::= identifier
 974 ///   ::= identifier '(' expression* ')'
 975 static ExprAST *ParseIdentifierExpr() {
 976   std::string IdName = IdentifierStr;
 977
 978   getNextToken();  // eat identifier.
 979
 980   if (CurTok != '(') // Simple variable ref.
 981     return new VariableExprAST(IdName);
 982
 983   // Call.
 984   getNextToken();  // eat (
 985   std::vector&lt;ExprAST*&gt; Args;
 986   while (1) {
 987     ExprAST *Arg = ParseExpression();
 988     if (!Arg) return 0;
 989     Args.push_back(Arg);
 990
 991     if (CurTok == ')') break;
 992
 993     if (CurTok != ',')
 994       return Error("Expected ')'");
 995     getNextToken();
 996   }
 997
 998   // Eat the ')'.
 999   getNextToken();
1000
1001   return new CallExprAST(IdName, Args);
1002 }
1003
1004 /// numberexpr ::= number
1005 static ExprAST *ParseNumberExpr() {
1006   ExprAST *Result = new NumberExprAST(NumVal);
1007   getNextToken(); // consume the number
1008   return Result;
1009 }
1010
1011 /// parenexpr ::= '(' expression ')'
1012 static ExprAST *ParseParenExpr() {
1013   getNextToken();  // eat (.
1014   ExprAST *V = ParseExpression();
1015   if (!V) return 0;
1016
1017   if (CurTok != ')')
1018     return Error("expected ')'");
1019   getNextToken();  // eat ).
1020   return V;
1021 }
1022
1023 /// primary
1024 ///   ::= identifierexpr
1025 ///   ::= numberexpr
1026 ///   ::= parenexpr
1027 static ExprAST *ParsePrimary() {
1028   switch (CurTok) {
1029   default: return Error("unknown token when expecting an expression");
1030   case tok_identifier: return ParseIdentifierExpr();
1031   case tok_number:     return ParseNumberExpr();
1032   case '(':            return ParseParenExpr();
1033   }
1034 }
1035
1036 /// binoprhs
1037 ///   ::= ('+' primary)*
1038 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1039   // If this is a binop, find its precedence.
1040   while (1) {
1041     int TokPrec = GetTokPrecedence();
1042
1043     // If this is a binop that binds at least as tightly as the current binop,
1044     // consume it, otherwise we are done.
1045     if (TokPrec &lt; ExprPrec)
1046       return LHS;
1047
1048     // Okay, we know this is a binop.
1049     int BinOp = CurTok;
1050     getNextToken();  // eat binop
1051
1052     // Parse the primary expression after the binary operator.
1053     ExprAST *RHS = ParsePrimary();
1054     if (!RHS) return 0;
1055
1056     // If BinOp binds less tightly with RHS than the operator after RHS, let
1057     // the pending operator take RHS as its LHS.
1058     int NextPrec = GetTokPrecedence();
1059     if (TokPrec &lt; NextPrec) {
1060       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1061       if (RHS == 0) return 0;
1062     }
1063
1064     // Merge LHS/RHS.
1065     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1066   }
1067 }
1068
1069 /// expression
1070 ///   ::= primary binoprhs
1071 ///
1072 static ExprAST *ParseExpression() {
1073   ExprAST *LHS = ParsePrimary();
1074   if (!LHS) return 0;
1075
1076   return ParseBinOpRHS(0, LHS);
1077 }
1078
1079 /// prototype
1080 ///   ::= id '(' id* ')'
1081 static PrototypeAST *ParsePrototype() {
1082   if (CurTok != tok_identifier)
1083     return ErrorP("Expected function name in prototype");
1084
1085   std::string FnName = IdentifierStr;
1086   getNextToken();
1087
1088   if (CurTok != '(')
1089     return ErrorP("Expected '(' in prototype");
1090
1091   std::vector&lt;std::string&gt; ArgNames;
1092   while (getNextToken() == tok_identifier)
1093     ArgNames.push_back(IdentifierStr);
1094   if (CurTok != ')')
1095     return ErrorP("Expected ')' in prototype");
1096
1097   // success.
1098   getNextToken();  // eat ')'.
1099
1100   return new PrototypeAST(FnName, ArgNames);
1101 }
1102
1103 /// definition ::= 'def' prototype expression
1104 static FunctionAST *ParseDefinition() {
1105   getNextToken();  // eat def.
1106   PrototypeAST *Proto = ParsePrototype();
1107   if (Proto == 0) return 0;
1108
1109   if (ExprAST *E = ParseExpression())
1110     return new FunctionAST(Proto, E);
1111   return 0;
1112 }
1113
1114 /// toplevelexpr ::= expression
1115 static FunctionAST *ParseTopLevelExpr() {
1116   if (ExprAST *E = ParseExpression()) {
1117     // Make an anonymous proto.
1118     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1119     return new FunctionAST(Proto, E);
1120   }
1121   return 0;
1122 }
1123
1124 /// external ::= 'extern' prototype
1125 static PrototypeAST *ParseExtern() {
1126   getNextToken();  // eat extern.
1127   return ParsePrototype();
1128 }
1129
1130 //===----------------------------------------------------------------------===//
1131 // Top-Level parsing
1132 //===----------------------------------------------------------------------===//
1133
1134 static void HandleDefinition() {
1135   if (FunctionAST *F = ParseDefinition()) {
1136     fprintf(stderr, "Parsed a function definition.\n");
1137   } else {
1138     // Skip token for error recovery.
1139     getNextToken();
1140   }
1141 }
1142
1143 static void HandleExtern() {
1144   if (PrototypeAST *P = ParseExtern()) {
1145     fprintf(stderr, "Parsed an extern\n");
1146   } else {
1147     // Skip token for error recovery.
1148     getNextToken();
1149   }
1150 }
1151
1152 static void HandleTopLevelExpression() {
1153   // Evaluate a top level expression into an anonymous function.
1154   if (FunctionAST *F = ParseTopLevelExpr()) {
1155     fprintf(stderr, "Parsed a top-level expr\n");
1156   } else {
1157     // Skip token for error recovery.
1158     getNextToken();
1159   }
1160 }
1161
1162 /// top ::= definition | external | expression | ';'
1163 static void MainLoop() {
1164   while (1) {
1165     fprintf(stderr, "ready&gt; ");
1166     switch (CurTok) {
1167     case tok_eof:    return;
1168     case ';':        getNextToken(); break;  // ignore top level semicolons.
1169     case tok_def:    HandleDefinition(); break;
1170     case tok_extern: HandleExtern(); break;
1171     default:         HandleTopLevelExpression(); break;
1172     }
1173   }
1174 }
1175
1176 //===----------------------------------------------------------------------===//
1177 // Main driver code.
1178 //===----------------------------------------------------------------------===//
1179
1180 int main() {
1181   // Install standard binary operators.
1182   // 1 is lowest precedence.
1183   BinopPrecedence['&lt;'] = 10;
1184   BinopPrecedence['+'] = 20;
1185   BinopPrecedence['-'] = 20;
1186   BinopPrecedence['*'] = 40;  // highest.
1187
1188   // Prime the first token.
1189   fprintf(stderr, "ready&gt; ");
1190   getNextToken();
1191
1192   MainLoop();
1193   return 0;
1194 }
1195 </pre>
1196 </div>
1197 </div>
1198
1199 <!-- *********************************************************************** -->
1200 <hr>
1201 <address>
1202   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1203   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1204   <a href="http://validator.w3.org/check/referer"><img
1205   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1206
1207   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1208   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1209   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1210 </address>
1211 </body>
1212 </html>