docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <div class="doc_author">
  17   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  18 </div>
  19
  20 <!-- *********************************************************************** -->
  21 <div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
  22 <!-- *********************************************************************** -->
  23
  24 <div class="doc_text">
  25
  26 <p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
  27 LLVM</a>" tutorial.  This chapter shows you how to use the <a
  28 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
  29 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  30 our Kaleidoscope language and build an <a
  31 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  32 Tree</a> (AST).</p>
  33
  34 <p>The parser we will build uses a combination of <a
  35 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  36 Parsing</a> and <a href=
  37 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  38 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
  39 and the former for everything else).  Before we get to parsing though, lets talk
  40 about the output of the parser: the Abstract Syntax Tree.</p>
  41
  42 </div>
  43
  44 <!-- *********************************************************************** -->
  45 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  46 <!-- *********************************************************************** -->
  47
  48 <div class="doc_text">
  49
  50 <p>The AST for a program captures its behavior in a way that it is easy for
  51 later stages of the compiler (e.g. code generation) to interpret.  We basically
  52 want one object for each construct in the language, and the AST should closely
  53 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  54 function object.  We'll start with expressions first:</p>
  55
  56 <div class="doc_code">
  57 <pre>
  58 /// ExprAST - Base class for all expression nodes.
  59 class ExprAST {
  60 public:
  61   virtual ~ExprAST() {}
  62 };
  63
  64 /// NumberExprAST - Expression class for numeric literals like "1.0".
  65 class NumberExprAST : public ExprAST {
  66   double Val;
  67 public:
  68   NumberExprAST(double val) : Val(val) {}
  69 };
  70 </pre>
  71 </div>
  72
  73 <p>The code above shows the definition of the base ExprAST class and one
  74 subclass which we use for numeric literals.  The important thing about this is
  75 that the NumberExprAST class captures the numeric value of the literal in the
  76 class, so that later phases of the compiler can know what it is.</p>
  77
  78 <p>Right now we only create the AST,  so there are no useful accessor methods on
  79 them.  It would be very easy to add a virtual method to pretty print the code,
  80 for example.  Here are the other expression AST node definitions that we'll use
  81 in the basic form of the Kaleidoscope language.
  82 </p>
  83
  84 <div class="doc_code">
  85 <pre>
  86 /// VariableExprAST - Expression class for referencing a variable, like "a".
  87 class VariableExprAST : public ExprAST {
  88   std::string Name;
  89 public:
  90   VariableExprAST(const std::string &amp;name) : Name(name) {}
  91 };
  92
  93 /// BinaryExprAST - Expression class for a binary operator.
  94 class BinaryExprAST : public ExprAST {
  95   char Op;
  96   ExprAST *LHS, *RHS;
  97 public:
  98   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
  99     : Op(op), LHS(lhs), RHS(rhs) {}
 100 };
 101
 102 /// CallExprAST - Expression class for function calls.
 103 class CallExprAST : public ExprAST {
 104   std::string Callee;
 105   std::vector&lt;ExprAST*&gt; Args;
 106 public:
 107   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 108     : Callee(callee), Args(args) {}
 109 };
 110 </pre>
 111 </div>
 112
 113 <p>This is all (intentially) rather straight-forward: variables capture the
 114 variable name, binary operators capture their opcode (e.g. '+'), and calls
 115 capture a function name and list of argument expressions.  One thing that is
 116 nice about our AST is that it captures the language features without talking
 117 about the syntax of the language.  Note that there is no discussion about
 118 precedence of binary operators, lexical structure etc.</p>
 119
 120 <p>For our basic language, these are all of the expression nodes we'll define.
 121 because it doesn't have conditional control flow, it isn't turing complete:
 122 we'll fix that in a later installment.  The two things we need next are a way
 123 to talk about the interface to a function, and a way to talk about functions
 124 themselves:</p>
 125
 126 <div class="doc_code">
 127 <pre>
 128 /// PrototypeAST - This class represents the "prototype" for a function,
 129 /// which captures its argument names as well as if it is an operator.
 130 class PrototypeAST {
 131   std::string Name;
 132   std::vector&lt;std::string&gt; Args;
 133 public:
 134   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 135     : Name(name), Args(args) {}
 136 };
 137
 138 /// FunctionAST - This class represents a function definition itself.
 139 class FunctionAST {
 140   PrototypeAST *Proto;
 141   ExprAST *Body;
 142 public:
 143   FunctionAST(PrototypeAST *proto, ExprAST *body)
 144     : Proto(proto), Body(body) {}
 145 };
 146 </pre>
 147 </div>
 148
 149 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 150 Since all values are double precision floating point, this fact doesn't need to
 151 be captured anywhere.  In a more aggressive and realistic language, the
 152 "ExprAST" class would probably have a type field.</p>
 153
 154 <p>With this scaffolding, we can now talk about parsing expressions and function
 155 bodies in Kaleidoscope.</p>
 156
 157 </div>
 158
 159 <!-- *********************************************************************** -->
 160 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 161 <!-- *********************************************************************** -->
 162
 163 <div class="doc_text">
 164
 165 <p>Now that we have an AST to build, we need to define the parser code to build
 166 it.  The idea here is that we want to parse something like "x+y" (which is
 167 returned as three tokens by the lexer) into an AST that could be generated with
 168 calls like this:</p>
 169
 170 <div class="doc_code">
 171 <pre>
 172   ExprAST *X = new VariableExprAST("x");
 173   ExprAST *Y = new VariableExprAST("y");
 174   ExprAST *Result = new BinaryExprAST('+', X, Y);
 175 </pre>
 176 </div>
 177
 178 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 179
 180 <div class="doc_code">
 181 <pre>
 182 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 183 /// token the parser it looking at.  getNextToken reads another token from the
 184 /// lexer and updates CurTok with its results.
 185 static int CurTok;
 186 static int getNextToken() {
 187   return CurTok = gettok();
 188 }
 189 </pre>
 190 </div>
 191
 192 <p>
 193 This implements a simple token buffer around the lexer.  This allows
 194 us to look one token ahead at what the lexer is returning.  Every function in
 195 our lexer will assume that CurTok is the current token that needs to be
 196 parsed.</p>
 197
 198 <p>Again, we define
 199 these with global variables: it would be better design to wrap the entire parser
 200 in a class and use instance variables for these.
 201 </p>
 202
 203 <div class="doc_code">
 204 <pre>
 205
 206 /// Error* - These are little helper functions for error handling.
 207 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 208 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 209 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 210 </pre>
 211 </div>
 212
 213 <p>
 214 The <tt>Error</tt> routines are simple helper routines that our parser will use
 215 to handle errors.  The error recovery in our parser will not be the best and
 216 are not particular user-friendly, but it will be enough for our tutorial.  These
 217 routines make it easier to handle errors in routines that have various return
 218 types: they always return null.</p>
 219
 220 <p>With these basic helper functions implemented, we can implement the first
 221 piece of our grammar: we'll start with numeric literals.</p>
 222
 223 </div>
 224
 225 <!-- *********************************************************************** -->
 226 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 227  Parsing</a></div>
 228 <!-- *********************************************************************** -->
 229
 230 <div class="doc_text">
 231
 232 <p>We start with numeric literals, because they are the simplest to process.
 233 For each production in our grammar, we'll define a function which parses that
 234 production.  For numeric literals, we have:
 235 </p>
 236
 237 <div class="doc_code">
 238 <pre>
 239 /// numberexpr ::= number
 240 static ExprAST *ParseNumberExpr() {
 241   ExprAST *Result = new NumberExprAST(NumVal);
 242   getNextToken(); // consume the number
 243   return Result;
 244 }
 245 </pre>
 246 </div>
 247
 248 <p>This routine is very simple: it expects to be called when the current token
 249 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 250 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
 251 returns.</p>
 252
 253 <p>There are some interesting aspects of this.  The most important one is that
 254 this routine eats all of the tokens that correspond to the production, and
 255 returns the lexer buffer with the next token (which is not part of the grammar
 256 production) ready to go.  This is a fairly standard way to go for recursive
 257 descent parsers.  For a better example, the parenthesis operator is defined like
 258 this:</p>
 259
 260 <div class="doc_code">
 261 <pre>
 262 /// parenexpr ::= '(' expression ')'
 263 static ExprAST *ParseParenExpr() {
 264   getNextToken();  // eat (.
 265   ExprAST *V = ParseExpression();
 266   if (!V) return 0;
 267
 268   if (CurTok != ')')
 269     return Error("expected ')'");
 270   getNextToken();  // eat ).
 271   return V;
 272 }
 273 </pre>
 274 </div>
 275
 276 <p>This function illustrates a number of interesting things about the parser:
 277 1) it shows how we use the Error routines.  When called, this function expects
 278 that the current token is a '(' token, but after parsing the subexpression, it
 279 is possible that there is not a ')' waiting.  For example, if the user types in
 280 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 281 occur, the parser needs a way to indicate that they happened: in our parser, we
 282 return null on an error.</p>
 283
 284 <p>Another interesting aspect of this function is that it uses recursion by
 285 calling <tt>ParseExpression</tt> (we will soon see that ParseExpression can call
 286 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 287 recursive grammars, and keeps each production very simple.  Note that
 288 parenthesis do not cause construction of AST nodes themselves.  While we could
 289 do this, the most important role of parens are to guide the parser and provide
 290 grouping.  Once the parser constructs the AST, parens are not needed.</p>
 291
 292 <p>The next simple production is for handling variable references and function
 293 calls:</p>
 294
 295 <div class="doc_code">
 296 <pre>
 297 /// identifierexpr
 298 ///   ::= identifer
 299 ///   ::= identifer '(' expression* ')'
 300 static ExprAST *ParseIdentifierExpr() {
 301   std::string IdName = IdentifierStr;
 302
 303   getNextToken();  // eat identifer.
 304
 305   if (CurTok != '(') // Simple variable ref.
 306     return new VariableExprAST(IdName);
 307
 308   // Call.
 309   getNextToken();  // eat (
 310   std::vector&lt;ExprAST*&gt; Args;
 311   while (1) {
 312     ExprAST *Arg = ParseExpression();
 313     if (!Arg) return 0;
 314     Args.push_back(Arg);
 315
 316     if (CurTok == ')') break;
 317
 318     if (CurTok != ',')
 319       return Error("Expected ')'");
 320     getNextToken();
 321   }
 322
 323   // Eat the ')'.
 324   getNextToken();
 325
 326   return new CallExprAST(IdName, Args);
 327 }
 328 </pre>
 329 </div>
 330
 331 <p>This routine follows the same style as the other routines (it expects to be
 332 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 333 recursion and error handling.  One interesting aspect of this is that it uses
 334 <em>look-ahead</em> to determine if the current identifier is a stand alone
 335 variable reference or if it is a function call expression.  It handles this by
 336 checking to see if the token after the identifier is a '(' token, and constructs
 337 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 338 </p>
 339
 340 <p>Now that we have all of our simple expression parsing logic in place, we can
 341 define a helper function to wrap them up in a class.  We call this class of
 342 expressions "primary" expressions, for reasons that will become more clear
 343 later.  In order to parse a primary expression, we need to determine what sort
 344 of expression it is:</p>
 345
 346 <div class="doc_code">
 347 <pre>
 348 /// primary
 349 ///   ::= identifierexpr
 350 ///   ::= numberexpr
 351 ///   ::= parenexpr
 352 static ExprAST *ParsePrimary() {
 353   switch (CurTok) {
 354   default: return Error("unknown token when expecting an expression");
 355   case tok_identifier: return ParseIdentifierExpr();
 356   case tok_number:     return ParseNumberExpr();
 357   case '(':            return ParseParenExpr();
 358   }
 359 }
 360 </pre>
 361 </div>
 362
 363 <p>Now that you see the definition of this function, it makes it more obvious
 364 why we can assume the state of CurTok in the various functions.  This uses
 365 look-ahead to determine which sort of expression is being inspected, and parses
 366 it with a function call.</p>
 367
 368 <p>Now that basic expressions are handled, we need to handle binary expressions,
 369 which are a bit more complex.</p>
 370
 371 </div>
 372
 373 <!-- *********************************************************************** -->
 374 <div class="doc_section"><a name="parserbinops">Binary Expression
 375  Parsing</a></div>
 376 <!-- *********************************************************************** -->
 377
 378 <div class="doc_text">
 379
 380 <p>Binary expressions are significantly harder to parse because they are often
 381 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 382 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 383 mathematics, we expect the later parse, because "*" (multiplication) has
 384 higher <em>precedence</em> than "+" (addition).</p>
 385
 386 <p>There are many ways to handle this, but an elegant and efficient way is to
 387 use <a href=
 388 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 389 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 390 guide recursion.  To start with, we need a table of precedences:</p>
 391
 392 <div class="doc_code">
 393 <pre>
 394 /// BinopPrecedence - This holds the precedence for each binary operator that is
 395 /// defined.
 396 static std::map&lt;char, int&gt; BinopPrecedence;
 397
 398 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 399 static int GetTokPrecedence() {
 400   if (!isascii(CurTok))
 401     return -1;
 402
 403   // Make sure it's a declared binop.
 404   int TokPrec = BinopPrecedence[CurTok];
 405   if (TokPrec &lt;= 0) return -1;
 406   return TokPrec;
 407 }
 408
 409 int main() {
 410   // Install standard binary operators.
 411   // 1 is lowest precedence.
 412   BinopPrecedence['&lt;'] = 10;
 413   BinopPrecedence['+'] = 20;
 414   BinopPrecedence['-'] = 20;
 415   BinopPrecedence['*'] = 40;  // highest.
 416   ...
 417 }
 418 </pre>
 419 </div>
 420
 421 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 422 (this can obviously be extended by you, the reader).  The
 423 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 424 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 425 new operators and makes it clear that the algorithm doesn't depend on the
 426 specific operators involved, but it would be easy enough to eliminate the map
 427 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
 428
 429 <p>With the helper above defined, we can now start parsing binary expressions.
 430 The basic idea of operator precedence parsing is to break down an expression
 431 with potentially ambiguous binary operators into pieces.  Consider for example
 432 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 433 as a stream of primary expressions separated by binary operators.  As such,
 434 it will first parse the leading primary expression "a", then it will see the
 435 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 436 are primary expressions that the binary expression parser doesn't need to worry
 437 about nested subexpressions like (c+d) at all.
 438 </p>
 439
 440 <p>
 441 To start, an expression is a primary expression potentially followed by a
 442 sequence of [binop,primaryexpr] pairs:</p>
 443
 444 <div class="doc_code">
 445 <pre>
 446 /// expression
 447 ///   ::= primary binoprhs
 448 ///
 449 static ExprAST *ParseExpression() {
 450   ExprAST *LHS = ParsePrimary();
 451   if (!LHS) return 0;
 452
 453   return ParseBinOpRHS(0, LHS);
 454 }
 455 </pre>
 456 </div>
 457
 458 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 459 us.  It takes a precedence and a pointer to an expression for the part parsed
 460 so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 461 allowed to be empty, in which case it returns the expression that is passed into
 462 it. In our example above, the code passes the expression for "a" into
 463 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 464
 465 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 466 minimal operator precedence</em> that the function is allowed to eat.  For
 467 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 468 passed in a precedence of 40, it will not consume any tokens (because the
 469 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 470 with:</p>
 471
 472 <div class="doc_code">
 473 <pre>
 474 /// binoprhs
 475 ///   ::= ('+' primary)*
 476 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 477   // If this is a binop, find its precedence.
 478   while (1) {
 479     int TokPrec = GetTokPrecedence();
 480
 481     // If this is a binop that binds at least as tightly as the current binop,
 482     // consume it, otherwise we are done.
 483     if (TokPrec &lt; ExprPrec)
 484       return LHS;
 485 </pre>
 486 </div>
 487
 488 <p>This code gets the precedence of the current token and checks to see if if is
 489 too low.  Because we defined invalid tokens to have a precedence of -1, this
 490 check implicitly knows that the pair-stream ends when the token stream runs out
 491 of binary operators.  If this check succeeds, we know that the token is a binary
 492 operator and that it will be included in this expression:</p>
 493
 494 <div class="doc_code">
 495 <pre>
 496     // Okay, we know this is a binop.
 497     int BinOp = CurTok;
 498     getNextToken();  // eat binop
 499
 500     // Parse the primary expression after the binary operator.
 501     ExprAST *RHS = ParsePrimary();
 502     if (!RHS) return 0;
 503 </pre>
 504 </div>
 505
 506 <p>As such, this code eats (and remembers) the binary operator and then parses
 507 the following primary expression.  This builds up the whole pair, the first of
 508 which is [+, b] for the running example.</p>
 509
 510 <p>Now that we parsed the left-hand side of an expression and one pair of the
 511 RHS sequence, we have to decide which way the expression associates.  In
 512 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 513 To determine this, we look ahead at "binop" to determine its precedence and
 514 compare it to BinOp's precedence (which is '+' in this case):</p>
 515
 516 <div class="doc_code">
 517 <pre>
 518     // If BinOp binds less tightly with RHS than the operator after RHS, let
 519     // the pending operator take RHS as its LHS.
 520     int NextPrec = GetTokPrecedence();
 521     if (TokPrec &lt; NextPrec) {
 522 </pre>
 523 </div>
 524
 525 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 526 precedence of our current operator, then we know that the parentheses associate
 527 as "(a+b) binop ...".  In our example, since the next operator is "+" and so is
 528 our current one, we know that they have the same precedence.  In this case we'll
 529 create the AST node for "a+b", and then continue parsing:</p>
 530
 531 <div class="doc_code">
 532 <pre>
 533       ... if body omitted ...
 534     }
 535
 536     // Merge LHS/RHS.
 537     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 538   }  // loop around to the top of the while loop.
 539 }
 540 </pre>
 541 </div>
 542
 543 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 544 iteration of the loop, with "+" as the current token.  The code above will eat
 545 and remember it and parse "(c+d)" as the primary expression, which makes the
 546 current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the
 547 binop to the right of the primary.  In this case, the precedence of "*" is
 548 higher than the precedence of "+" so the if condition will be entered.</p>
 549
 550 <p>The critical question left here is "how can the if condition parse the right
 551 hand side in full"?  In particular, to build the AST correctly for our example,
 552 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 553 do this is surprisingly simple (code from the above two blocks duplicated for
 554 context):</p>
 555
 556 <div class="doc_code">
 557 <pre>
 558     // If BinOp binds less tightly with RHS than the operator after RHS, let
 559     // the pending operator take RHS as its LHS.
 560     int NextPrec = GetTokPrecedence();
 561     if (TokPrec &lt; NextPrec) {
 562       RHS = ParseBinOpRHS(TokPrec+1, RHS);
 563       if (RHS == 0) return 0;
 564     }
 565     // Merge LHS/RHS.
 566     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 567   }  // loop around to the top of the while loop.
 568 }
 569 </pre>
 570 </div>
 571
 572 <p>At this point, we know that the binary operator to the RHS of our primary
 573 has higher precedence than the binop we are currently parsing.  As such, we know
 574 that any sequence of pairs whose operators are all higher precedence than "+"
 575 should be parsed together and returned as "RHS".  To do this, we recursively
 576 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 577 precedence required for it to continue.  In our example above, this will cause
 578 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 579 of the '+' expression.</p>
 580
 581 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
 582 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 583 correctly handle fully general binary expression parsing in a very elegant way.
 584 </p>
 585
 586 <p>This wraps up handling of expressions.  At this point, we can point the
 587 parser at an arbitrary token stream and build an expression from them, stopping
 588 at the first token that is not part of the expression.  Next up we need to
 589 handle function definitions etc.</p>
 590
 591 </div>
 592
 593 <!-- *********************************************************************** -->
 594 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 595 <!-- *********************************************************************** -->
 596
 597 <div class="doc_text">
 598
 599 <p>
 600 The first basic thing missing is that of function prototypes.  In Kaleidoscope,
 601 these are used both for 'extern' function declarations as well as function body
 602 definitions.  The code to do this is straight-forward and not very interesting
 603 (once you've survived expressions):
 604 </p>
 605
 606 <div class="doc_code">
 607 <pre>
 608 /// prototype
 609 ///   ::= id '(' id* ')'
 610 static PrototypeAST *ParsePrototype() {
 611   if (CurTok != tok_identifier)
 612     return ErrorP("Expected function name in prototype");
 613
 614   std::string FnName = IdentifierStr;
 615   getNextToken();
 616
 617   if (CurTok != '(')
 618     return ErrorP("Expected '(' in prototype");
 619
 620   std::vector&lt;std::string&gt; ArgNames;
 621   while (getNextToken() == tok_identifier)
 622     ArgNames.push_back(IdentifierStr);
 623   if (CurTok != ')')
 624     return ErrorP("Expected ')' in prototype");
 625
 626   // success.
 627   getNextToken();  // eat ')'.
 628
 629   return new PrototypeAST(FnName, ArgNames);
 630 }
 631 </pre>
 632 </div>
 633
 634 <p>Given this, a function definition is very simple, just a prototype plus
 635 and expression to implement the body:</p>
 636
 637 <div class="doc_code">
 638 <pre>
 639 /// definition ::= 'def' prototype expression
 640 static FunctionAST *ParseDefinition() {
 641   getNextToken();  // eat def.
 642   PrototypeAST *Proto = ParsePrototype();
 643   if (Proto == 0) return 0;
 644
 645   if (ExprAST *E = ParseExpression())
 646     return new FunctionAST(Proto, E);
 647   return 0;
 648 }
 649 </pre>
 650 </div>
 651
 652 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 653 well as to support forward declaration of user functions.  'externs' are just
 654 prototypes with no body:</p>
 655
 656 <div class="doc_code">
 657 <pre>
 658 /// external ::= 'extern' prototype
 659 static PrototypeAST *ParseExtern() {
 660   getNextToken();  // eat extern.
 661   return ParsePrototype();
 662 }
 663 </pre>
 664 </div>
 665
 666 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 667 evaluate them on the fly.  We will handle this by defining anonymous nullary
 668 (zero argument) functions for them:</p>
 669
 670 <div class="doc_code">
 671 <pre>
 672 /// toplevelexpr ::= expression
 673 static FunctionAST *ParseTopLevelExpr() {
 674   if (ExprAST *E = ParseExpression()) {
 675     // Make an anonymous proto.
 676     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 677     return new FunctionAST(Proto, E);
 678   }
 679   return 0;
 680 }
 681 </pre>
 682 </div>
 683
 684 <p>Now that we have all the pieces, lets build a little driver that will let us
 685 actually <em>execute</em> this code we've built!</p>
 686
 687 </div>
 688
 689 <!-- *********************************************************************** -->
 690 <div class="doc_section"><a name="driver">The Driver</a></div>
 691 <!-- *********************************************************************** -->
 692
 693 <div class="doc_text">
 694
 695 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 696 dispatch loop.  There isn't much interesting here, so I'll just include the
 697 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 698 Parsing" section.</p>
 699
 700 <div class="doc_code">
 701 <pre>
 702 /// top ::= definition | external | expression | ';'
 703 static void MainLoop() {
 704   while (1) {
 705     fprintf(stderr, "ready&gt; ");
 706     switch (CurTok) {
 707     case tok_eof:    return;
 708     case ';':        getNextToken(); break;  // ignore top level semicolons.
 709     case tok_def:    HandleDefinition(); break;
 710     case tok_extern: HandleExtern(); break;
 711     default:         HandleTopLevelExpression(); break;
 712     }
 713   }
 714 }
 715 </pre>
 716 </div>
 717
 718 <p>The most interesting part of this is that we ignore top-level semi colons.
 719 Why is this do you ask?  The basic reason is that if you type "4 + 5" at the
 720 command line, the parser doesn't know that that is the end of what you will
 721 type.  For example, on the next line you could type "def foo..." in which case
 722 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 723 which would continue the expression.  Having top-level semicolons allows you to
 724 type "4+5;" and the parser will know you are done.</p>
 725
 726 </div>
 727
 728 <!-- *********************************************************************** -->
 729 <div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
 730 <!-- *********************************************************************** -->
 731
 732 <div class="doc_text">
 733
 734 <p>With just under 400 lines of commented code, we fully defined our minimal
 735 language, including a lexer, parser and AST builder.  With this done, the
 736 executable will validate code and tell us if it is gramatically invalid.  For
 737 example, here is a sample interaction:</p>
 738
 739 <div class="doc_code">
 740 <pre>
 741 $ ./a.out
 742 ready> def foo(x y) x+foo(y, 4.0);
 743 ready> Parsed an function definition.
 744 ready> def foo(x y) x+y y;
 745 ready> Parsed an function definition.
 746 ready> Parsed a top-level expr
 747 ready> def foo(x y) x+y );
 748 ready> Parsed an function definition.
 749 ready> Error: unknown token when expecting an expression
 750 ready> extern sin(a);
 751 ready> Parsed an extern
 752 ready> ^D
 753 $
 754 </pre>
 755 </div>
 756
 757 <p>
 758 Here is the full code.  Note that it is fully self-contained: you don't even
 759 need LLVM for this.  In the <a href="LangImpl3.html">next installment</a>, we
 760 will describe how to generate LLVM IR from the AST.</p>
 761
 762 <div class="doc_code">
 763 <pre>
 764 // To build this:
 765 //  g++ -g toy.cpp
 766 //  ./a.out
 767
 768 #include &lt;cstdio&gt;
 769 #include &lt;string&gt;
 770 #include &lt;
 771 #include &lt;vector&gt;
 772
 773 //===----------------------------------------------------------------------===//
 774 // Lexer
 775 //===----------------------------------------------------------------------===//
 776
 777 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 778 // of these for known things.
 779 enum Token {
 780   tok_eof = -1,
 781
 782   // commands
 783   tok_def = -2, tok_extern = -3,
 784
 785   // primary
 786   tok_identifier = -4, tok_number = -5,
 787 };
 788
 789 static std::string IdentifierStr;  // Filled in if tok_identifier
 790 static double NumVal;              // Filled in if tok_number
 791
 792 /// gettok - Return the next token from standard input.
 793 static int gettok() {
 794   static int LastChar = ' ';
 795
 796   // Skip any whitespace.
 797   while (isspace(LastChar))
 798     LastChar = getchar();
 799
 800   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 801     IdentifierStr = LastChar;
 802     while (isalnum((LastChar = getchar())))
 803       IdentifierStr += LastChar;
 804
 805     if (IdentifierStr == "def") return tok_def;
 806     if (IdentifierStr == "extern") return tok_extern;
 807     return tok_identifier;
 808   }
 809
 810   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 811     std::string NumStr;
 812     do {
 813       NumStr += LastChar;
 814       LastChar = getchar();
 815     } while (isdigit(LastChar) || LastChar == '.');
 816
 817     NumVal = strtod(NumStr.c_str(), 0);
 818     return tok_number;
 819   }
 820
 821   if (LastChar == '#') {
 822     // Comment until end of line.
 823     do LastChar = getchar();
 824     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
 825
 826     if (LastChar != EOF)
 827       return gettok();
 828   }
 829
 830   // Check for end of file.  Don't eat the EOF.
 831   if (LastChar == EOF)
 832     return tok_eof;
 833
 834   // Otherwise, just return the character as its ascii value.
 835   int ThisChar = LastChar;
 836   LastChar = getchar();
 837   return ThisChar;
 838 }
 839
 840 //===----------------------------------------------------------------------===//
 841 // Abstract Syntax Tree (aka Parse Tree)
 842 //===----------------------------------------------------------------------===//
 843
 844 /// ExprAST - Base class for all expression nodes.
 845 class ExprAST {
 846 public:
 847   virtual ~ExprAST() {}
 848 };
 849
 850 /// NumberExprAST - Expression class for numeric literals like "1.0".
 851 class NumberExprAST : public ExprAST {
 852   double Val;
 853 public:
 854   NumberExprAST(double val) : Val(val) {}
 855 };
 856
 857 /// VariableExprAST - Expression class for referencing a variable, like "a".
 858 class VariableExprAST : public ExprAST {
 859   std::string Name;
 860 public:
 861   VariableExprAST(const std::string &amp;name) : Name(name) {}
 862 };
 863
 864 /// BinaryExprAST - Expression class for a binary operator.
 865 class BinaryExprAST : public ExprAST {
 866   char Op;
 867   ExprAST *LHS, *RHS;
 868 public:
 869   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 870     : Op(op), LHS(lhs), RHS(rhs) {}
 871 };
 872
 873 /// CallExprAST - Expression class for function calls.
 874 class CallExprAST : public ExprAST {
 875   std::string Callee;
 876   std::vector&lt;ExprAST*&gt; Args;
 877 public:
 878   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 879     : Callee(callee), Args(args) {}
 880 };
 881
 882 /// PrototypeAST - This class represents the "prototype" for a function,
 883 /// which captures its argument names as well as if it is an operator.
 884 class PrototypeAST {
 885   std::string Name;
 886   std::vector&lt; Args;
 887 public:
 888   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 889     : Name(name), Args(args) {}
 890
 891 };
 892
 893 /// FunctionAST - This class represents a function definition itself.
 894 class FunctionAST {
 895   PrototypeAST *Proto;
 896   ExprAST *Body;
 897 public:
 898   FunctionAST(PrototypeAST *proto, ExprAST *body)
 899     : Proto(proto), Body(body) {}
 900
 901 };
 902
 903 //===----------------------------------------------------------------------===//
 904 // Parser
 905 //===----------------------------------------------------------------------===//
 906
 907 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 908 /// token the parser it looking at.  getNextToken reads another token from the
 909 /// lexer and updates CurTok with its results.
 910 static int CurTok;
 911 static int getNextToken() {
 912   return CurTok = gettok();
 913 }
 914
 915 /// BinopPrecedence - This holds the precedence for each binary operator that is
 916 /// defined.
 917 static std::map&lt;char, int&gt; BinopPrecedence;
 918
 919 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 920 static int GetTokPrecedence() {
 921   if (!isascii(CurTok))
 922     return -1;
 923
 924   // Make sure it's a declared binop.
 925   int TokPrec = BinopPrecedence[CurTok];
 926   if (TokPrec &lt;= 0) return -1;
 927   return TokPrec;
 928 }
 929
 930 /// Error* - These are little helper functions for error handling.
 931 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 932 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 933 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 934
 935 static ExprAST *ParseExpression();
 936
 937 /// identifierexpr
 938 ///   ::= identifer
 939 ///   ::= identifer '(' expression* ')'
 940 static ExprAST *ParseIdentifierExpr() {
 941   std::string IdName = IdentifierStr;
 942
 943   getNextToken();  // eat identifer.
 944
 945   if (CurTok != '(') // Simple variable ref.
 946     return new VariableExprAST(IdName);
 947
 948   // Call.
 949   getNextToken();  // eat (
 950   std::vector&lt;ExprAST*&gt; Args;
 951   while (1) {
 952     ExprAST *Arg = ParseExpression();
 953     if (!Arg) return 0;
 954     Args.push_back(Arg);
 955
 956     if (CurTok == ')') break;
 957
 958     if (CurTok != ',')
 959       return Error("Expected ')'");
 960     getNextToken();
 961   }
 962
 963   // Eat the ')'.
 964   getNextToken();
 965
 966   return new CallExprAST(IdName, Args);
 967 }
 968
 969 /// numberexpr ::= number
 970 static ExprAST *ParseNumberExpr() {
 971   ExprAST *Result = new NumberExprAST(NumVal);
 972   getNextToken(); // consume the number
 973   return Result;
 974 }
 975
 976 /// parenexpr ::= '(' expression ')'
 977 static ExprAST *ParseParenExpr() {
 978   getNextToken();  // eat (.
 979   ExprAST *V = ParseExpression();
 980   if (!V) return 0;
 981
 982   if (CurTok != ')')
 983     return Error("expected ')'");
 984   getNextToken();  // eat ).
 985   return V;
 986 }
 987
 988 /// primary
 989 ///   ::= identifierexpr
 990 ///   ::= numberexpr
 991 ///   ::= parenexpr
 992 static ExprAST *ParsePrimary() {
 993   switch (CurTok) {
 994   default: return Error("unknown token when expecting an expression");
 995   case tok_identifier: return ParseIdentifierExpr();
 996   case tok_number:     return ParseNumberExpr();
 997   case '(':            return ParseParenExpr();
 998   }
 999 }
1000
1001 /// binoprhs
1002 ///   ::= ('+' primary)*
1003 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1004   // If this is a binop, find its precedence.
1005   while (1) {
1006     int TokPrec = GetTokPrecedence();
1007
1008     // If this is a binop that binds at least as tightly as the current binop,
1009     // consume it, otherwise we are done.
1010     if (TokPrec &lt; ExprPrec)
1011       return LHS;
1012
1013     // Okay, we know this is a binop.
1014     int BinOp = CurTok;
1015     getNextToken();  // eat binop
1016
1017     // Parse the primary expression after the binary operator.
1018     ExprAST *RHS = ParsePrimary();
1019     if (!RHS) return 0;
1020
1021     // If BinOp binds less tightly with RHS than the operator after RHS, let
1022     // the pending operator take RHS as its LHS.
1023     int NextPrec = GetTokPrecedence();
1024     if (TokPrec &lt; NextPrec) {
1025       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1026       if (RHS == 0) return 0;
1027     }
1028
1029     // Merge LHS/RHS.
1030     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1031   }
1032 }
1033
1034 /// expression
1035 ///   ::= primary binoprhs
1036 ///
1037 static ExprAST *ParseExpression() {
1038   ExprAST *LHS = ParsePrimary();
1039   if (!LHS) return 0;
1040
1041   return ParseBinOpRHS(0, LHS);
1042 }
1043
1044 /// prototype
1045 ///   ::= id '(' id* ')'
1046 static PrototypeAST *ParsePrototype() {
1047   if (CurTok != tok_identifier)
1048     return ErrorP("Expected function name in prototype");
1049
1050   std::string FnName = IdentifierStr;
1051   getNextToken();
1052
1053   if (CurTok != '(')
1054     return ErrorP("Expected '(' in prototype");
1055
1056   std::vector&lt;std::string&gt; ArgNames;
1057   while (getNextToken() == tok_identifier)
1058     ArgNames.push_back(IdentifierStr);
1059   if (CurTok != ')')
1060     return ErrorP("Expected ')' in prototype");
1061
1062   // success.
1063   getNextToken();  // eat ')'.
1064
1065   return new PrototypeAST(FnName, ArgNames);
1066 }
1067
1068 /// definition ::= 'def' prototype expression
1069 static FunctionAST *ParseDefinition() {
1070   getNextToken();  // eat def.
1071   PrototypeAST *Proto = ParsePrototype();
1072   if (Proto == 0) return 0;
1073
1074   if (ExprAST *E = ParseExpression())
1075     return new FunctionAST(Proto, E);
1076   return 0;
1077 }
1078
1079 /// toplevelexpr ::= expression
1080 static FunctionAST *ParseTopLevelExpr() {
1081   if (ExprAST *E = ParseExpression()) {
1082     // Make an anonymous proto.
1083     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1084     return new FunctionAST(Proto, E);
1085   }
1086   return 0;
1087 }
1088
1089 /// external ::= 'extern' prototype
1090 static PrototypeAST *ParseExtern() {
1091   getNextToken();  // eat extern.
1092   return ParsePrototype();
1093 }
1094
1095 //===----------------------------------------------------------------------===//
1096 // Top-Level parsing
1097 //===----------------------------------------------------------------------===//
1098
1099 static void HandleDefinition() {
1100   if (FunctionAST *F = ParseDefinition()) {
1101     fprintf(stderr, "Parsed a function definition.\n");
1102   } else {
1103     // Skip token for error recovery.
1104     getNextToken();
1105   }
1106 }
1107
1108 static void HandleExtern() {
1109   if (PrototypeAST *P = ParseExtern()) {
1110     fprintf(stderr, "Parsed an extern\n");
1111   } else {
1112     // Skip token for error recovery.
1113     getNextToken();
1114   }
1115 }
1116
1117 static void HandleTopLevelExpression() {
1118   // Evaluate a top level expression into an anonymous function.
1119   if (FunctionAST *F = ParseTopLevelExpr()) {
1120     fprintf(stderr, "Parsed a top-level expr\n");
1121   } else {
1122     // Skip token for error recovery.
1123     getNextToken();
1124   }
1125 }
1126
1127 /// top ::= definition | external | expression | ';'
1128 static void MainLoop() {
1129   while (1) {
1130     fprintf(stderr, "ready&gt; ");
1131     switch (CurTok) {
1132     case tok_eof:    return;
1133     case ';':        getNextToken(); break;  // ignore top level semicolons.
1134     case tok_def:    HandleDefinition(); break;
1135     case tok_extern: HandleExtern(); break;
1136     default:         HandleTopLevelExpression(); break;
1137     }
1138   }
1139 }
1140
1141 //===----------------------------------------------------------------------===//
1142 // Main driver code.
1143 //===----------------------------------------------------------------------===//
1144
1145 int main() {
1146   // Install standard binary operators.
1147   // 1 is lowest precedence.
1148   BinopPrecedence['&lt;'] = 10;
1149   BinopPrecedence['+'] = 20;
1150   BinopPrecedence['-'] = 20;
1151   BinopPrecedence['*'] = 40;  // highest.
1152
1153   // Prime the first token.
1154   fprintf(stderr, "ready&gt; ");
1155   getNextToken();
1156
1157   MainLoop();
1158   return 0;
1159 }
1160 </pre>
1161 </div>
1162 </div>
1163
1164 <!-- *********************************************************************** -->
1165 <hr>
1166 <address>
1167   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1168   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1169   <a href="http://validator.w3.org/check/referer"><img
1170   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1171
1172   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1173   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1174   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1175 </address>
1176 </body>
1177 </html>