docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <ul>
  17 <li><a href="index.html">Up to Tutorial Index</a></li>
  18 <li>Chapter 2
  19   <ol>
  20     <li><a href="#intro">Chapter 2 Introduction</a></li>
  21     <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
  22     <li><a href="#parserbasics">Parser Basics</a></li>
  23     <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
  24     <li><a href="#parserbinops">Binary Expression Parsing</a></li>
  25     <li><a href="#parsertop">Parsing the Rest</a></li>
  26     <li><a href="#driver">The Driver</a></li>
  27     <li><a href="#conclusions">Conclusions</a></li>
  28     <li><a href="#code">Full Code Listing</a></li>
  29   </ol>
  30 </li>
  31 <li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
  32 </ul>
  33
  34 <div class="doc_author">
  35   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  36 </div>
  37
  38 <!-- *********************************************************************** -->
  39 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
  40 <!-- *********************************************************************** -->
  41
  42 <div class="doc_text">
  43
  44 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
  45 with LLVM</a>" tutorial.  This chapter shows you how to use the <a
  46 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
  47 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  48 our Kaleidoscope language and build an <a
  49 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  50 Tree</a> (AST).</p>
  51
  52 <p>The parser we will build uses a combination of <a
  53 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  54 Parsing</a> and <a href=
  55 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  56 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
  57 and the former for everything else).  Before we get to parsing though, lets talk
  58 about the output of the parser: the Abstract Syntax Tree.</p>
  59
  60 </div>
  61
  62 <!-- *********************************************************************** -->
  63 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  64 <!-- *********************************************************************** -->
  65
  66 <div class="doc_text">
  67
  68 <p>The AST for a program captures its behavior in a way that it is easy for
  69 later stages of the compiler (e.g. code generation) to interpret.  We basically
  70 want one object for each construct in the language, and the AST should closely
  71 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  72 function object.  We'll start with expressions first:</p>
  73
  74 <div class="doc_code">
  75 <pre>
  76 /// ExprAST - Base class for all expression nodes.
  77 class ExprAST {
  78 public:
  79   virtual ~ExprAST() {}
  80 };
  81
  82 /// NumberExprAST - Expression class for numeric literals like "1.0".
  83 class NumberExprAST : public ExprAST {
  84   double Val;
  85 public:
  86   explicit NumberExprAST(double val) : Val(val) {}
  87 };
  88 </pre>
  89 </div>
  90
  91 <p>The code above shows the definition of the base ExprAST class and one
  92 subclass which we use for numeric literals.  The important thing about this is
  93 that the NumberExprAST class captures the numeric value of the literal in the
  94 class, so that later phases of the compiler can know what it is.</p>
  95
  96 <p>Right now we only create the AST,  so there are no useful accessor methods on
  97 them.  It would be very easy to add a virtual method to pretty print the code,
  98 for example.  Here are the other expression AST node definitions that we'll use
  99 in the basic form of the Kaleidoscope language.
 100 </p>
 101
 102 <div class="doc_code">
 103 <pre>
 104 /// VariableExprAST - Expression class for referencing a variable, like "a".
 105 class VariableExprAST : public ExprAST {
 106   std::string Name;
 107 public:
 108   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 109 };
 110
 111 /// BinaryExprAST - Expression class for a binary operator.
 112 class BinaryExprAST : public ExprAST {
 113   char Op;
 114   ExprAST *LHS, *RHS;
 115 public:
 116   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 117     : Op(op), LHS(lhs), RHS(rhs) {}
 118 };
 119
 120 /// CallExprAST - Expression class for function calls.
 121 class CallExprAST : public ExprAST {
 122   std::string Callee;
 123   std::vector&lt;ExprAST*&gt; Args;
 124 public:
 125   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 126     : Callee(callee), Args(args) {}
 127 };
 128 </pre>
 129 </div>
 130
 131 <p>This is all (intentially) rather straight-forward: variables capture the
 132 variable name, binary operators capture their opcode (e.g. '+'), and calls
 133 capture a function name and list of argument expressions.  One thing that is
 134 nice about our AST is that it captures the language features without talking
 135 about the syntax of the language.  Note that there is no discussion about
 136 precedence of binary operators, lexical structure etc.</p>
 137
 138 <p>For our basic language, these are all of the expression nodes we'll define.
 139 Because it doesn't have conditional control flow, it isn't Turing-complete;
 140 we'll fix that in a later installment.  The two things we need next are a way
 141 to talk about the interface to a function, and a way to talk about functions
 142 themselves:</p>
 143
 144 <div class="doc_code">
 145 <pre>
 146 /// PrototypeAST - This class represents the "prototype" for a function,
 147 /// which captures its argument names as well as if it is an operator.
 148 class PrototypeAST {
 149   std::string Name;
 150   std::vector&lt;std::string&gt; Args;
 151 public:
 152   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 153     : Name(name), Args(args) {}
 154 };
 155
 156 /// FunctionAST - This class represents a function definition itself.
 157 class FunctionAST {
 158   PrototypeAST *Proto;
 159   ExprAST *Body;
 160 public:
 161   FunctionAST(PrototypeAST *proto, ExprAST *body)
 162     : Proto(proto), Body(body) {}
 163 };
 164 </pre>
 165 </div>
 166
 167 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 168 Since all values are double precision floating point, this fact doesn't need to
 169 be captured anywhere.  In a more aggressive and realistic language, the
 170 "ExprAST" class would probably have a type field.</p>
 171
 172 <p>With this scaffolding, we can now talk about parsing expressions and function
 173 bodies in Kaleidoscope.</p>
 174
 175 </div>
 176
 177 <!-- *********************************************************************** -->
 178 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 179 <!-- *********************************************************************** -->
 180
 181 <div class="doc_text">
 182
 183 <p>Now that we have an AST to build, we need to define the parser code to build
 184 it.  The idea here is that we want to parse something like "x+y" (which is
 185 returned as three tokens by the lexer) into an AST that could be generated with
 186 calls like this:</p>
 187
 188 <div class="doc_code">
 189 <pre>
 190   ExprAST *X = new VariableExprAST("x");
 191   ExprAST *Y = new VariableExprAST("y");
 192   ExprAST *Result = new BinaryExprAST('+', X, Y);
 193 </pre>
 194 </div>
 195
 196 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 197
 198 <div class="doc_code">
 199 <pre>
 200 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 201 /// token the parser it looking at.  getNextToken reads another token from the
 202 /// lexer and updates CurTok with its results.
 203 static int CurTok;
 204 static int getNextToken() {
 205   return CurTok = gettok();
 206 }
 207 </pre>
 208 </div>
 209
 210 <p>
 211 This implements a simple token buffer around the lexer.  This allows
 212 us to look one token ahead at what the lexer is returning.  Every function in
 213 our parser will assume that CurTok is the current token that needs to be
 214 parsed.</p>
 215
 216 <p>Again, we define these with global variables; it would be better design to
 217 wrap the entire parser in a class and use instance variables for these.
 218 </p>
 219
 220 <div class="doc_code">
 221 <pre>
 222
 223 /// Error* - These are little helper functions for error handling.
 224 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 225 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 226 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 227 </pre>
 228 </div>
 229
 230 <p>
 231 The <tt>Error</tt> routines are simple helper routines that our parser will use
 232 to handle errors.  The error recovery in our parser will not be the best and
 233 is not particular user-friendly, but it will be enough for our tutorial.  These
 234 routines make it easier to handle errors in routines that have various return
 235 types: they always return null.</p>
 236
 237 <p>With these basic helper functions implemented, we can implement the first
 238 piece of our grammar: we'll start with numeric literals.</p>
 239
 240 </div>
 241
 242 <!-- *********************************************************************** -->
 243 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 244  Parsing</a></div>
 245 <!-- *********************************************************************** -->
 246
 247 <div class="doc_text">
 248
 249 <p>We start with numeric literals, because they are the simplest to process.
 250 For each production in our grammar, we'll define a function which parses that
 251 production.  For numeric literals, we have:
 252 </p>
 253
 254 <div class="doc_code">
 255 <pre>
 256 /// numberexpr ::= number
 257 static ExprAST *ParseNumberExpr() {
 258   ExprAST *Result = new NumberExprAST(NumVal);
 259   getNextToken(); // consume the number
 260   return Result;
 261 }
 262 </pre>
 263 </div>
 264
 265 <p>This routine is very simple: it expects to be called when the current token
 266 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 267 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
 268 returns.</p>
 269
 270 <p>There are some interesting aspects of this.  The most important one is that
 271 this routine eats all of the tokens that correspond to the production, and
 272 returns the lexer buffer with the next token (which is not part of the grammar
 273 production) ready to go.  This is a fairly standard way to go for recursive
 274 descent parsers.  For a better example, the parenthesis operator is defined like
 275 this:</p>
 276
 277 <div class="doc_code">
 278 <pre>
 279 /// parenexpr ::= '(' expression ')'
 280 static ExprAST *ParseParenExpr() {
 281   getNextToken();  // eat (.
 282   ExprAST *V = ParseExpression();
 283   if (!V) return 0;
 284
 285   if (CurTok != ')')
 286     return Error("expected ')'");
 287   getNextToken();  // eat ).
 288   return V;
 289 }
 290 </pre>
 291 </div>
 292
 293 <p>This function illustrates a number of interesting things about the parser:
 294 1) it shows how we use the Error routines.  When called, this function expects
 295 that the current token is a '(' token, but after parsing the subexpression, it
 296 is possible that there is not a ')' waiting.  For example, if the user types in
 297 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 298 occur, the parser needs a way to indicate that they happened: in our parser, we
 299 return null on an error.</p>
 300
 301 <p>Another interesting aspect of this function is that it uses recursion by
 302 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 303 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 304 recursive grammars, and keeps each production very simple.  Note that
 305 parentheses do not cause construction of AST nodes themselves.  While we could
 306 do this, the most important role of parens are to guide the parser and provide
 307 grouping.  Once the parser constructs the AST, parens are not needed.</p>
 308
 309 <p>The next simple production is for handling variable references and function
 310 calls:</p>
 311
 312 <div class="doc_code">
 313 <pre>
 314 /// identifierexpr
 315 ///   ::= identifier
 316 ///   ::= identifier '(' expression* ')'
 317 static ExprAST *ParseIdentifierExpr() {
 318   std::string IdName = IdentifierStr;
 319
 320   getNextToken();  // eat identifier.
 321
 322   if (CurTok != '(') // Simple variable ref.
 323     return new VariableExprAST(IdName);
 324
 325   // Call.
 326   getNextToken();  // eat (
 327   std::vector&lt;ExprAST*&gt; Args;
 328   while (1) {
 329     ExprAST *Arg = ParseExpression();
 330     if (!Arg) return 0;
 331     Args.push_back(Arg);
 332
 333     if (CurTok == ')') break;
 334
 335     if (CurTok != ',')
 336       return Error("Expected ')'");
 337     getNextToken();
 338   }
 339
 340   // Eat the ')'.
 341   getNextToken();
 342
 343   return new CallExprAST(IdName, Args);
 344 }
 345 </pre>
 346 </div>
 347
 348 <p>This routine follows the same style as the other routines (it expects to be
 349 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 350 recursion and error handling.  One interesting aspect of this is that it uses
 351 <em>look-ahead</em> to determine if the current identifier is a stand alone
 352 variable reference or if it is a function call expression.  It handles this by
 353 checking to see if the token after the identifier is a '(' token, and constructs
 354 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 355 </p>
 356
 357 <p>Now that we have all of our simple expression parsing logic in place, we can
 358 define a helper function to wrap them up in a class.  We call this class of
 359 expressions "primary" expressions, for reasons that will become more clear
 360 later.  In order to parse a primary expression, we need to determine what sort
 361 of expression it is:</p>
 362
 363 <div class="doc_code">
 364 <pre>
 365 /// primary
 366 ///   ::= identifierexpr
 367 ///   ::= numberexpr
 368 ///   ::= parenexpr
 369 static ExprAST *ParsePrimary() {
 370   switch (CurTok) {
 371   default: return Error("unknown token when expecting an expression");
 372   case tok_identifier: return ParseIdentifierExpr();
 373   case tok_number:     return ParseNumberExpr();
 374   case '(':            return ParseParenExpr();
 375   }
 376 }
 377 </pre>
 378 </div>
 379
 380 <p>Now that you see the definition of this function, it makes it more obvious
 381 why we can assume the state of CurTok in the various functions.  This uses
 382 look-ahead to determine which sort of expression is being inspected, and parses
 383 it with a function call.</p>
 384
 385 <p>Now that basic expressions are handled, we need to handle binary expressions,
 386 which are a bit more complex.</p>
 387
 388 </div>
 389
 390 <!-- *********************************************************************** -->
 391 <div class="doc_section"><a name="parserbinops">Binary Expression
 392  Parsing</a></div>
 393 <!-- *********************************************************************** -->
 394
 395 <div class="doc_text">
 396
 397 <p>Binary expressions are significantly harder to parse because they are often
 398 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 399 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 400 mathematics, we expect the later parse, because "*" (multiplication) has
 401 higher <em>precedence</em> than "+" (addition).</p>
 402
 403 <p>There are many ways to handle this, but an elegant and efficient way is to
 404 use <a href=
 405 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 406 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 407 guide recursion.  To start with, we need a table of precedences:</p>
 408
 409 <div class="doc_code">
 410 <pre>
 411 /// BinopPrecedence - This holds the precedence for each binary operator that is
 412 /// defined.
 413 static std::map&lt;char, int&gt; BinopPrecedence;
 414
 415 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 416 static int GetTokPrecedence() {
 417   if (!isascii(CurTok))
 418     return -1;
 419
 420   // Make sure it's a declared binop.
 421   int TokPrec = BinopPrecedence[CurTok];
 422   if (TokPrec &lt;= 0) return -1;
 423   return TokPrec;
 424 }
 425
 426 int main() {
 427   // Install standard binary operators.
 428   // 1 is lowest precedence.
 429   BinopPrecedence['&lt;'] = 10;
 430   BinopPrecedence['+'] = 20;
 431   BinopPrecedence['-'] = 20;
 432   BinopPrecedence['*'] = 40;  // highest.
 433   ...
 434 }
 435 </pre>
 436 </div>
 437
 438 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 439 (this can obviously be extended by you, the reader).  The
 440 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 441 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 442 new operators and makes it clear that the algorithm doesn't depend on the
 443 specific operators involved, but it would be easy enough to eliminate the map
 444 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
 445
 446 <p>With the helper above defined, we can now start parsing binary expressions.
 447 The basic idea of operator precedence parsing is to break down an expression
 448 with potentially ambiguous binary operators into pieces.  Consider for example
 449 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 450 as a stream of primary expressions separated by binary operators.  As such,
 451 it will first parse the leading primary expression "a", then it will see the
 452 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 453 are primary expressions, the binary expression parser doesn't need to worry
 454 about nested subexpressions like (c+d) at all.
 455 </p>
 456
 457 <p>
 458 To start, an expression is a primary expression potentially followed by a
 459 sequence of [binop,primaryexpr] pairs:</p>
 460
 461 <div class="doc_code">
 462 <pre>
 463 /// expression
 464 ///   ::= primary binoprhs
 465 ///
 466 static ExprAST *ParseExpression() {
 467   ExprAST *LHS = ParsePrimary();
 468   if (!LHS) return 0;
 469
 470   return ParseBinOpRHS(0, LHS);
 471 }
 472 </pre>
 473 </div>
 474
 475 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 476 us.  It takes a precedence and a pointer to an expression for the part parsed
 477 so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 478 allowed to be empty, in which case it returns the expression that is passed into
 479 it. In our example above, the code passes the expression for "a" into
 480 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 481
 482 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 483 minimal operator precedence</em> that the function is allowed to eat.  For
 484 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 485 passed in a precedence of 40, it will not consume any tokens (because the
 486 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 487 with:</p>
 488
 489 <div class="doc_code">
 490 <pre>
 491 /// binoprhs
 492 ///   ::= ('+' primary)*
 493 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 494   // If this is a binop, find its precedence.
 495   while (1) {
 496     int TokPrec = GetTokPrecedence();
 497
 498     // If this is a binop that binds at least as tightly as the current binop,
 499     // consume it, otherwise we are done.
 500     if (TokPrec &lt; ExprPrec)
 501       return LHS;
 502 </pre>
 503 </div>
 504
 505 <p>This code gets the precedence of the current token and checks to see if if is
 506 too low.  Because we defined invalid tokens to have a precedence of -1, this
 507 check implicitly knows that the pair-stream ends when the token stream runs out
 508 of binary operators.  If this check succeeds, we know that the token is a binary
 509 operator and that it will be included in this expression:</p>
 510
 511 <div class="doc_code">
 512 <pre>
 513     // Okay, we know this is a binop.
 514     int BinOp = CurTok;
 515     getNextToken();  // eat binop
 516
 517     // Parse the primary expression after the binary operator.
 518     ExprAST *RHS = ParsePrimary();
 519     if (!RHS) return 0;
 520 </pre>
 521 </div>
 522
 523 <p>As such, this code eats (and remembers) the binary operator and then parses
 524 the following primary expression.  This builds up the whole pair, the first of
 525 which is [+, b] for the running example.</p>
 526
 527 <p>Now that we parsed the left-hand side of an expression and one pair of the
 528 RHS sequence, we have to decide which way the expression associates.  In
 529 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 530 To determine this, we look ahead at "binop" to determine its precedence and
 531 compare it to BinOp's precedence (which is '+' in this case):</p>
 532
 533 <div class="doc_code">
 534 <pre>
 535     // If BinOp binds less tightly with RHS than the operator after RHS, let
 536     // the pending operator take RHS as its LHS.
 537     int NextPrec = GetTokPrecedence();
 538     if (TokPrec &lt; NextPrec) {
 539 </pre>
 540 </div>
 541
 542 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 543 precedence of our current operator, then we know that the parentheses associate
 544 as "(a+b) binop ...".  In our example, since the next operator is "+" and so is
 545 our current one, we know that they have the same precedence.  In this case we'll
 546 create the AST node for "a+b", and then continue parsing:</p>
 547
 548 <div class="doc_code">
 549 <pre>
 550       ... if body omitted ...
 551     }
 552
 553     // Merge LHS/RHS.
 554     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 555   }  // loop around to the top of the while loop.
 556 }
 557 </pre>
 558 </div>
 559
 560 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 561 iteration of the loop, with "+" as the current token.  The code above will eat
 562 and remember it and parse "(c+d)" as the primary expression, which makes the
 563 current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the
 564 binop to the right of the primary.  In this case, the precedence of "*" is
 565 higher than the precedence of "+" so the if condition will be entered.</p>
 566
 567 <p>The critical question left here is "how can the if condition parse the right
 568 hand side in full"?  In particular, to build the AST correctly for our example,
 569 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 570 do this is surprisingly simple (code from the above two blocks duplicated for
 571 context):</p>
 572
 573 <div class="doc_code">
 574 <pre>
 575     // If BinOp binds less tightly with RHS than the operator after RHS, let
 576     // the pending operator take RHS as its LHS.
 577     int NextPrec = GetTokPrecedence();
 578     if (TokPrec &lt; NextPrec) {
 579       RHS = ParseBinOpRHS(TokPrec+1, RHS);
 580       if (RHS == 0) return 0;
 581     }
 582     // Merge LHS/RHS.
 583     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 584   }  // loop around to the top of the while loop.
 585 }
 586 </pre>
 587 </div>
 588
 589 <p>At this point, we know that the binary operator to the RHS of our primary
 590 has higher precedence than the binop we are currently parsing.  As such, we know
 591 that any sequence of pairs whose operators are all higher precedence than "+"
 592 should be parsed together and returned as "RHS".  To do this, we recursively
 593 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 594 precedence required for it to continue.  In our example above, this will cause
 595 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 596 of the '+' expression.</p>
 597
 598 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
 599 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 600 correctly handle fully general binary expression parsing in a very elegant way.
 601 </p>
 602
 603 <p>This wraps up handling of expressions.  At this point, we can point the
 604 parser at an arbitrary token stream and build an expression from them, stopping
 605 at the first token that is not part of the expression.  Next up we need to
 606 handle function definitions etc.</p>
 607
 608 </div>
 609
 610 <!-- *********************************************************************** -->
 611 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 612 <!-- *********************************************************************** -->
 613
 614 <div class="doc_text">
 615
 616 <p>
 617 The first basic thing missing is that of function prototypes.  In Kaleidoscope,
 618 these are used both for 'extern' function declarations as well as function body
 619 definitions.  The code to do this is straight-forward and not very interesting
 620 (once you've survived expressions):
 621 </p>
 622
 623 <div class="doc_code">
 624 <pre>
 625 /// prototype
 626 ///   ::= id '(' id* ')'
 627 static PrototypeAST *ParsePrototype() {
 628   if (CurTok != tok_identifier)
 629     return ErrorP("Expected function name in prototype");
 630
 631   std::string FnName = IdentifierStr;
 632   getNextToken();
 633
 634   if (CurTok != '(')
 635     return ErrorP("Expected '(' in prototype");
 636
 637   std::vector&lt;std::string&gt; ArgNames;
 638   while (getNextToken() == tok_identifier)
 639     ArgNames.push_back(IdentifierStr);
 640   if (CurTok != ')')
 641     return ErrorP("Expected ')' in prototype");
 642
 643   // success.
 644   getNextToken();  // eat ')'.
 645
 646   return new PrototypeAST(FnName, ArgNames);
 647 }
 648 </pre>
 649 </div>
 650
 651 <p>Given this, a function definition is very simple, just a prototype plus
 652 an expression to implement the body:</p>
 653
 654 <div class="doc_code">
 655 <pre>
 656 /// definition ::= 'def' prototype expression
 657 static FunctionAST *ParseDefinition() {
 658   getNextToken();  // eat def.
 659   PrototypeAST *Proto = ParsePrototype();
 660   if (Proto == 0) return 0;
 661
 662   if (ExprAST *E = ParseExpression())
 663     return new FunctionAST(Proto, E);
 664   return 0;
 665 }
 666 </pre>
 667 </div>
 668
 669 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 670 well as to support forward declaration of user functions.  'externs' are just
 671 prototypes with no body:</p>
 672
 673 <div class="doc_code">
 674 <pre>
 675 /// external ::= 'extern' prototype
 676 static PrototypeAST *ParseExtern() {
 677   getNextToken();  // eat extern.
 678   return ParsePrototype();
 679 }
 680 </pre>
 681 </div>
 682
 683 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 684 evaluate them on the fly.  We will handle this by defining anonymous nullary
 685 (zero argument) functions for them:</p>
 686
 687 <div class="doc_code">
 688 <pre>
 689 /// toplevelexpr ::= expression
 690 static FunctionAST *ParseTopLevelExpr() {
 691   if (ExprAST *E = ParseExpression()) {
 692     // Make an anonymous proto.
 693     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 694     return new FunctionAST(Proto, E);
 695   }
 696   return 0;
 697 }
 698 </pre>
 699 </div>
 700
 701 <p>Now that we have all the pieces, lets build a little driver that will let us
 702 actually <em>execute</em> this code we've built!</p>
 703
 704 </div>
 705
 706 <!-- *********************************************************************** -->
 707 <div class="doc_section"><a name="driver">The Driver</a></div>
 708 <!-- *********************************************************************** -->
 709
 710 <div class="doc_text">
 711
 712 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 713 dispatch loop.  There isn't much interesting here, so I'll just include the
 714 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 715 Parsing" section.</p>
 716
 717 <div class="doc_code">
 718 <pre>
 719 /// top ::= definition | external | expression | ';'
 720 static void MainLoop() {
 721   while (1) {
 722     fprintf(stderr, "ready&gt; ");
 723     switch (CurTok) {
 724     case tok_eof:    return;
 725     case ';':        getNextToken(); break;  // ignore top level semicolons.
 726     case tok_def:    HandleDefinition(); break;
 727     case tok_extern: HandleExtern(); break;
 728     default:         HandleTopLevelExpression(); break;
 729     }
 730   }
 731 }
 732 </pre>
 733 </div>
 734
 735 <p>The most interesting part of this is that we ignore top-level semi colons.
 736 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 737 command line, the parser doesn't know that that is the end of what you will
 738 type.  For example, on the next line you could type "def foo..." in which case
 739 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 740 which would continue the expression.  Having top-level semicolons allows you to
 741 type "4+5;" and the parser will know you are done.</p>
 742
 743 </div>
 744
 745 <!-- *********************************************************************** -->
 746 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
 747 <!-- *********************************************************************** -->
 748
 749 <div class="doc_text">
 750
 751 <p>With just under 400 lines of commented code, we fully defined our minimal
 752 language, including a lexer, parser and AST builder.  With this done, the
 753 executable will validate code and tell us if it is gramatically invalid.  For
 754 example, here is a sample interaction:</p>
 755
 756 <div class="doc_code">
 757 <pre>
 758 $ ./a.out
 759 ready&gt; def foo(x y) x+foo(y, 4.0);
 760 ready&gt; Parsed a function definition.
 761 ready&gt; def foo(x y) x+y y;
 762 ready&gt; Parsed a function definition.
 763 ready&gt; Parsed a top-level expr
 764 ready&gt; def foo(x y) x+y );
 765 ready&gt; Parsed a function definition.
 766 ready&gt; Error: unknown token when expecting an expression
 767 ready&gt; extern sin(a);
 768 ready&gt; Parsed an extern
 769 ready&gt; ^D
 770 $
 771 </pre>
 772 </div>
 773
 774 <p>There is a lot of room for extension here.  You can define new AST nodes,
 775 extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
 776 installment</a>, we will describe how to generate LLVM IR from the AST.</p>
 777
 778 </div>
 779
 780 <!-- *********************************************************************** -->
 781 <div class="doc_section"><a name="code">Full Code Listing</a></div>
 782 <!-- *********************************************************************** -->
 783
 784 <div class="doc_text">
 785
 786 <p>
 787 Here is the complete code listing for this and the previous chapter.
 788 Note that it is fully self-contained: you don't need LLVM or any external
 789 libraries at all for this (other than the C and C++ standard libraries of
 790 course).  To build this, just compile with:</p>
 791
 792 <div class="doc_code">
 793 <pre>
 794    # Compile
 795    g++ -g toy.cpp
 796    # Run
 797    ./a.out
 798 </pre>
 799 </div>
 800
 801 <p>Here is the code:</p>
 802
 803 <div class="doc_code">
 804 <pre>
 805 #include &lt;cstdio&gt;
 806 #include &lt;string&gt;
 807 #include &lt;map&gt;
 808 #include &lt;vector&gt;
 809
 810 //===----------------------------------------------------------------------===//
 811 // Lexer
 812 //===----------------------------------------------------------------------===//
 813
 814 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 815 // of these for known things.
 816 enum Token {
 817   tok_eof = -1,
 818
 819   // commands
 820   tok_def = -2, tok_extern = -3,
 821
 822   // primary
 823   tok_identifier = -4, tok_number = -5,
 824 };
 825
 826 static std::string IdentifierStr;  // Filled in if tok_identifier
 827 static double NumVal;              // Filled in if tok_number
 828
 829 /// gettok - Return the next token from standard input.
 830 static int gettok() {
 831   static int LastChar = ' ';
 832
 833   // Skip any whitespace.
 834   while (isspace(LastChar))
 835     LastChar = getchar();
 836
 837   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 838     IdentifierStr = LastChar;
 839     while (isalnum((LastChar = getchar())))
 840       IdentifierStr += LastChar;
 841
 842     if (IdentifierStr == "def") return tok_def;
 843     if (IdentifierStr == "extern") return tok_extern;
 844     return tok_identifier;
 845   }
 846
 847   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 848     std::string NumStr;
 849     do {
 850       NumStr += LastChar;
 851       LastChar = getchar();
 852     } while (isdigit(LastChar) || LastChar == '.');
 853
 854     NumVal = strtod(NumStr.c_str(), 0);
 855     return tok_number;
 856   }
 857
 858   if (LastChar == '#') {
 859     // Comment until end of line.
 860     do LastChar = getchar();
 861     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
 862
 863     if (LastChar != EOF)
 864       return gettok();
 865   }
 866
 867   // Check for end of file.  Don't eat the EOF.
 868   if (LastChar == EOF)
 869     return tok_eof;
 870
 871   // Otherwise, just return the character as its ascii value.
 872   int ThisChar = LastChar;
 873   LastChar = getchar();
 874   return ThisChar;
 875 }
 876
 877 //===----------------------------------------------------------------------===//
 878 // Abstract Syntax Tree (aka Parse Tree)
 879 //===----------------------------------------------------------------------===//
 880
 881 /// ExprAST - Base class for all expression nodes.
 882 class ExprAST {
 883 public:
 884   virtual ~ExprAST() {}
 885 };
 886
 887 /// NumberExprAST - Expression class for numeric literals like "1.0".
 888 class NumberExprAST : public ExprAST {
 889   double Val;
 890 public:
 891   explicit NumberExprAST(double val) : Val(val) {}
 892 };
 893
 894 /// VariableExprAST - Expression class for referencing a variable, like "a".
 895 class VariableExprAST : public ExprAST {
 896   std::string Name;
 897 public:
 898   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 899 };
 900
 901 /// BinaryExprAST - Expression class for a binary operator.
 902 class BinaryExprAST : public ExprAST {
 903   char Op;
 904   ExprAST *LHS, *RHS;
 905 public:
 906   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 907     : Op(op), LHS(lhs), RHS(rhs) {}
 908 };
 909
 910 /// CallExprAST - Expression class for function calls.
 911 class CallExprAST : public ExprAST {
 912   std::string Callee;
 913   std::vector&lt;ExprAST*&gt; Args;
 914 public:
 915   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 916     : Callee(callee), Args(args) {}
 917 };
 918
 919 /// PrototypeAST - This class represents the "prototype" for a function,
 920 /// which captures its argument names as well as if it is an operator.
 921 class PrototypeAST {
 922   std::string Name;
 923   std::vector&lt; Args;
 924 public:
 925   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 926     : Name(name), Args(args) {}
 927
 928 };
 929
 930 /// FunctionAST - This class represents a function definition itself.
 931 class FunctionAST {
 932   PrototypeAST *Proto;
 933   ExprAST *Body;
 934 public:
 935   FunctionAST(PrototypeAST *proto, ExprAST *body)
 936     : Proto(proto), Body(body) {}
 937
 938 };
 939
 940 //===----------------------------------------------------------------------===//
 941 // Parser
 942 //===----------------------------------------------------------------------===//
 943
 944 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 945 /// token the parser it looking at.  getNextToken reads another token from the
 946 /// lexer and updates CurTok with its results.
 947 static int CurTok;
 948 static int getNextToken() {
 949   return CurTok = gettok();
 950 }
 951
 952 /// BinopPrecedence - This holds the precedence for each binary operator that is
 953 /// defined.
 954 static std::map&lt;char, int&gt; BinopPrecedence;
 955
 956 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 957 static int GetTokPrecedence() {
 958   if (!isascii(CurTok))
 959     return -1;
 960
 961   // Make sure it's a declared binop.
 962   int TokPrec = BinopPrecedence[CurTok];
 963   if (TokPrec &lt;= 0) return -1;
 964   return TokPrec;
 965 }
 966
 967 /// Error* - These are little helper functions for error handling.
 968 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 969 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 970 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 971
 972 static ExprAST *ParseExpression();
 973
 974 /// identifierexpr
 975 ///   ::= identifier
 976 ///   ::= identifier '(' expression* ')'
 977 static ExprAST *ParseIdentifierExpr() {
 978   std::string IdName = IdentifierStr;
 979
 980   getNextToken();  // eat identifier.
 981
 982   if (CurTok != '(') // Simple variable ref.
 983     return new VariableExprAST(IdName);
 984
 985   // Call.
 986   getNextToken();  // eat (
 987   std::vector&lt;ExprAST*&gt; Args;
 988   while (1) {
 989     ExprAST *Arg = ParseExpression();
 990     if (!Arg) return 0;
 991     Args.push_back(Arg);
 992
 993     if (CurTok == ')') break;
 994
 995     if (CurTok != ',')
 996       return Error("Expected ')'");
 997     getNextToken();
 998   }
 999
1000   // Eat the ')'.
1001   getNextToken();
1002
1003   return new CallExprAST(IdName, Args);
1004 }
1005
1006 /// numberexpr ::= number
1007 static ExprAST *ParseNumberExpr() {
1008   ExprAST *Result = new NumberExprAST(NumVal);
1009   getNextToken(); // consume the number
1010   return Result;
1011 }
1012
1013 /// parenexpr ::= '(' expression ')'
1014 static ExprAST *ParseParenExpr() {
1015   getNextToken();  // eat (.
1016   ExprAST *V = ParseExpression();
1017   if (!V) return 0;
1018
1019   if (CurTok != ')')
1020     return Error("expected ')'");
1021   getNextToken();  // eat ).
1022   return V;
1023 }
1024
1025 /// primary
1026 ///   ::= identifierexpr
1027 ///   ::= numberexpr
1028 ///   ::= parenexpr
1029 static ExprAST *ParsePrimary() {
1030   switch (CurTok) {
1031   default: return Error("unknown token when expecting an expression");
1032   case tok_identifier: return ParseIdentifierExpr();
1033   case tok_number:     return ParseNumberExpr();
1034   case '(':            return ParseParenExpr();
1035   }
1036 }
1037
1038 /// binoprhs
1039 ///   ::= ('+' primary)*
1040 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1041   // If this is a binop, find its precedence.
1042   while (1) {
1043     int TokPrec = GetTokPrecedence();
1044
1045     // If this is a binop that binds at least as tightly as the current binop,
1046     // consume it, otherwise we are done.
1047     if (TokPrec &lt; ExprPrec)
1048       return LHS;
1049
1050     // Okay, we know this is a binop.
1051     int BinOp = CurTok;
1052     getNextToken();  // eat binop
1053
1054     // Parse the primary expression after the binary operator.
1055     ExprAST *RHS = ParsePrimary();
1056     if (!RHS) return 0;
1057
1058     // If BinOp binds less tightly with RHS than the operator after RHS, let
1059     // the pending operator take RHS as its LHS.
1060     int NextPrec = GetTokPrecedence();
1061     if (TokPrec &lt; NextPrec) {
1062       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1063       if (RHS == 0) return 0;
1064     }
1065
1066     // Merge LHS/RHS.
1067     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1068   }
1069 }
1070
1071 /// expression
1072 ///   ::= primary binoprhs
1073 ///
1074 static ExprAST *ParseExpression() {
1075   ExprAST *LHS = ParsePrimary();
1076   if (!LHS) return 0;
1077
1078   return ParseBinOpRHS(0, LHS);
1079 }
1080
1081 /// prototype
1082 ///   ::= id '(' id* ')'
1083 static PrototypeAST *ParsePrototype() {
1084   if (CurTok != tok_identifier)
1085     return ErrorP("Expected function name in prototype");
1086
1087   std::string FnName = IdentifierStr;
1088   getNextToken();
1089
1090   if (CurTok != '(')
1091     return ErrorP("Expected '(' in prototype");
1092
1093   std::vector&lt;std::string&gt; ArgNames;
1094   while (getNextToken() == tok_identifier)
1095     ArgNames.push_back(IdentifierStr);
1096   if (CurTok != ')')
1097     return ErrorP("Expected ')' in prototype");
1098
1099   // success.
1100   getNextToken();  // eat ')'.
1101
1102   return new PrototypeAST(FnName, ArgNames);
1103 }
1104
1105 /// definition ::= 'def' prototype expression
1106 static FunctionAST *ParseDefinition() {
1107   getNextToken();  // eat def.
1108   PrototypeAST *Proto = ParsePrototype();
1109   if (Proto == 0) return 0;
1110
1111   if (ExprAST *E = ParseExpression())
1112     return new FunctionAST(Proto, E);
1113   return 0;
1114 }
1115
1116 /// toplevelexpr ::= expression
1117 static FunctionAST *ParseTopLevelExpr() {
1118   if (ExprAST *E = ParseExpression()) {
1119     // Make an anonymous proto.
1120     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1121     return new FunctionAST(Proto, E);
1122   }
1123   return 0;
1124 }
1125
1126 /// external ::= 'extern' prototype
1127 static PrototypeAST *ParseExtern() {
1128   getNextToken();  // eat extern.
1129   return ParsePrototype();
1130 }
1131
1132 //===----------------------------------------------------------------------===//
1133 // Top-Level parsing
1134 //===----------------------------------------------------------------------===//
1135
1136 static void HandleDefinition() {
1137   if (FunctionAST *F = ParseDefinition()) {
1138     fprintf(stderr, "Parsed a function definition.\n");
1139   } else {
1140     // Skip token for error recovery.
1141     getNextToken();
1142   }
1143 }
1144
1145 static void HandleExtern() {
1146   if (PrototypeAST *P = ParseExtern()) {
1147     fprintf(stderr, "Parsed an extern\n");
1148   } else {
1149     // Skip token for error recovery.
1150     getNextToken();
1151   }
1152 }
1153
1154 static void HandleTopLevelExpression() {
1155   // Evaluate a top level expression into an anonymous function.
1156   if (FunctionAST *F = ParseTopLevelExpr()) {
1157     fprintf(stderr, "Parsed a top-level expr\n");
1158   } else {
1159     // Skip token for error recovery.
1160     getNextToken();
1161   }
1162 }
1163
1164 /// top ::= definition | external | expression | ';'
1165 static void MainLoop() {
1166   while (1) {
1167     fprintf(stderr, "ready&gt; ");
1168     switch (CurTok) {
1169     case tok_eof:    return;
1170     case ';':        getNextToken(); break;  // ignore top level semicolons.
1171     case tok_def:    HandleDefinition(); break;
1172     case tok_extern: HandleExtern(); break;
1173     default:         HandleTopLevelExpression(); break;
1174     }
1175   }
1176 }
1177
1178 //===----------------------------------------------------------------------===//
1179 // Main driver code.
1180 //===----------------------------------------------------------------------===//
1181
1182 int main() {
1183   // Install standard binary operators.
1184   // 1 is lowest precedence.
1185   BinopPrecedence['&lt;'] = 10;
1186   BinopPrecedence['+'] = 20;
1187   BinopPrecedence['-'] = 20;
1188   BinopPrecedence['*'] = 40;  // highest.
1189
1190   // Prime the first token.
1191   fprintf(stderr, "ready&gt; ");
1192   getNextToken();
1193
1194   MainLoop();
1195   return 0;
1196 }
1197 </pre>
1198 </div>
1199 </div>
1200
1201 <!-- *********************************************************************** -->
1202 <hr>
1203 <address>
1204   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1205   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1206   <a href="http://validator.w3.org/check/referer"><img
1207   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1208
1209   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1210   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1211   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1212 </address>
1213 </body>
1214 </html>