docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <div class="doc_author">
  17   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  18 </div>
  19
  20 <!-- *********************************************************************** -->
  21 <div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
  22 <!-- *********************************************************************** -->
  23
  24 <div class="doc_text">
  25
  26 <p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
  27 LLVM</a>" tutorial.  This chapter shows you how to use the <a
  28 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
  29 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  30 our Kaleidoscope language and build an <a
  31 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  32 Tree</a> (AST).</p>
  33
  34 <p>The parser we will build uses a combination of <a
  35 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  36 Parsing</a> and <a href=
  37 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  38 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
  39 and the former for everything else).  Before we get to parsing though, lets talk
  40 about the output of the parser: the Abstract Syntax Tree.</p>
  41
  42 </div>
  43
  44 <!-- *********************************************************************** -->
  45 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  46 <!-- *********************************************************************** -->
  47
  48 <div class="doc_text">
  49
  50 <p>The AST for a program captures its behavior in a way that it is easy for
  51 later stages of the compiler (e.g. code generation) to interpret.  We basically
  52 want one object for each construct in the language, and the AST should closely
  53 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  54 function object.  We'll start with expressions first:</p>
  55
  56 <div class="doc_code">
  57 <pre>
  58 /// ExprAST - Base class for all expression nodes.
  59 class ExprAST {
  60 public:
  61   virtual ~ExprAST() {}
  62 };
  63
  64 /// NumberExprAST - Expression class for numeric literals like "1.0".
  65 class NumberExprAST : public ExprAST {
  66   double Val;
  67 public:
  68   NumberExprAST(double val) : Val(val) {}
  69 };
  70 </pre>
  71 </div>
  72
  73 <p>The code above shows the definition of the base ExprAST class and one
  74 subclass which we use for numeric literals.  The important thing about this is
  75 that the NumberExprAST class captures the numeric value of the literal in the
  76 class, so that later phases of the compiler can know what it is.</p>
  77
  78 <p>Right now we only create the AST,  so there are no useful accessor methods on
  79 them.  It would be very easy to add a virtual method to pretty print the code,
  80 for example.  Here are the other expression AST node definitions that we'll use
  81 in the basic form of the Kaleidoscope language.
  82 </p>
  83
  84 <div class="doc_code">
  85 <pre>
  86 /// VariableExprAST - Expression class for referencing a variable, like "a".
  87 class VariableExprAST : public ExprAST {
  88   std::string Name;
  89 public:
  90   VariableExprAST(const std::string &amp;name) : Name(name) {}
  91 };
  92
  93 /// BinaryExprAST - Expression class for a binary operator.
  94 class BinaryExprAST : public ExprAST {
  95   char Op;
  96   ExprAST *LHS, *RHS;
  97 public:
  98   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
  99     : Op(op), LHS(lhs), RHS(rhs) {}
 100 };
 101
 102 /// CallExprAST - Expression class for function calls.
 103 class CallExprAST : public ExprAST {
 104   std::string Callee;
 105   std::vector&lt;ExprAST*&gt; Args;
 106 public:
 107   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 108     : Callee(callee), Args(args) {}
 109 };
 110 </pre>
 111 </div>
 112
 113 <p>This is all (intentially) rather straight-forward: variables capture the
 114 variable name, binary operators capture their opcode (e.g. '+'), and calls
 115 capture a function name and list of argument expressions.  One thing that is
 116 nice about our AST is that it captures the language features without talking
 117 about the syntax of the language.  Note that there is no discussion about
 118 precedence of binary operators, lexical structure etc.</p>
 119
 120 <p>For our basic language, these are all of the expression nodes we'll define.
 121 Because it doesn't have conditional control flow, it isn't Turing-complete;
 122 we'll fix that in a later installment.  The two things we need next are a way
 123 to talk about the interface to a function, and a way to talk about functions
 124 themselves:</p>
 125
 126 <div class="doc_code">
 127 <pre>
 128 /// PrototypeAST - This class represents the "prototype" for a function,
 129 /// which captures its argument names as well as if it is an operator.
 130 class PrototypeAST {
 131   std::string Name;
 132   std::vector&lt;std::string&gt; Args;
 133 public:
 134   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 135     : Name(name), Args(args) {}
 136 };
 137
 138 /// FunctionAST - This class represents a function definition itself.
 139 class FunctionAST {
 140   PrototypeAST *Proto;
 141   ExprAST *Body;
 142 public:
 143   FunctionAST(PrototypeAST *proto, ExprAST *body)
 144     : Proto(proto), Body(body) {}
 145 };
 146 </pre>
 147 </div>
 148
 149 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 150 Since all values are double precision floating point, this fact doesn't need to
 151 be captured anywhere.  In a more aggressive and realistic language, the
 152 "ExprAST" class would probably have a type field.</p>
 153
 154 <p>With this scaffolding, we can now talk about parsing expressions and function
 155 bodies in Kaleidoscope.</p>
 156
 157 </div>
 158
 159 <!-- *********************************************************************** -->
 160 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 161 <!-- *********************************************************************** -->
 162
 163 <div class="doc_text">
 164
 165 <p>Now that we have an AST to build, we need to define the parser code to build
 166 it.  The idea here is that we want to parse something like "x+y" (which is
 167 returned as three tokens by the lexer) into an AST that could be generated with
 168 calls like this:</p>
 169
 170 <div class="doc_code">
 171 <pre>
 172   ExprAST *X = new VariableExprAST("x");
 173   ExprAST *Y = new VariableExprAST("y");
 174   ExprAST *Result = new BinaryExprAST('+', X, Y);
 175 </pre>
 176 </div>
 177
 178 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 179
 180 <div class="doc_code">
 181 <pre>
 182 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 183 /// token the parser it looking at.  getNextToken reads another token from the
 184 /// lexer and updates CurTok with its results.
 185 static int CurTok;
 186 static int getNextToken() {
 187   return CurTok = gettok();
 188 }
 189 </pre>
 190 </div>
 191
 192 <p>
 193 This implements a simple token buffer around the lexer.  This allows
 194 us to look one token ahead at what the lexer is returning.  Every function in
 195 our lexer will assume that CurTok is the current token that needs to be
 196 parsed.</p>
 197
 198 <p>Again, we define these with global variables; it would be better design to wrap the entire parser in a class and use instance variables for these.
 199 </p>
 200
 201 <div class="doc_code">
 202 <pre>
 203
 204 /// Error* - These are little helper functions for error handling.
 205 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 206 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 207 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 208 </pre>
 209 </div>
 210
 211 <p>
 212 The <tt>Error</tt> routines are simple helper routines that our parser will use
 213 to handle errors.  The error recovery in our parser will not be the best and
 214 are not particular user-friendly, but it will be enough for our tutorial.  These
 215 routines make it easier to handle errors in routines that have various return
 216 types: they always return null.</p>
 217
 218 <p>With these basic helper functions implemented, we can implement the first
 219 piece of our grammar: we'll start with numeric literals.</p>
 220
 221 </div>
 222
 223 <!-- *********************************************************************** -->
 224 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 225  Parsing</a></div>
 226 <!-- *********************************************************************** -->
 227
 228 <div class="doc_text">
 229
 230 <p>We start with numeric literals, because they are the simplest to process.
 231 For each production in our grammar, we'll define a function which parses that
 232 production.  For numeric literals, we have:
 233 </p>
 234
 235 <div class="doc_code">
 236 <pre>
 237 /// numberexpr ::= number
 238 static ExprAST *ParseNumberExpr() {
 239   ExprAST *Result = new NumberExprAST(NumVal);
 240   getNextToken(); // consume the number
 241   return Result;
 242 }
 243 </pre>
 244 </div>
 245
 246 <p>This routine is very simple: it expects to be called when the current token
 247 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 248 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
 249 returns.</p>
 250
 251 <p>There are some interesting aspects of this.  The most important one is that
 252 this routine eats all of the tokens that correspond to the production, and
 253 returns the lexer buffer with the next token (which is not part of the grammar
 254 production) ready to go.  This is a fairly standard way to go for recursive
 255 descent parsers.  For a better example, the parenthesis operator is defined like
 256 this:</p>
 257
 258 <div class="doc_code">
 259 <pre>
 260 /// parenexpr ::= '(' expression ')'
 261 static ExprAST *ParseParenExpr() {
 262   getNextToken();  // eat (.
 263   ExprAST *V = ParseExpression();
 264   if (!V) return 0;
 265
 266   if (CurTok != ')')
 267     return Error("expected ')'");
 268   getNextToken();  // eat ).
 269   return V;
 270 }
 271 </pre>
 272 </div>
 273
 274 <p>This function illustrates a number of interesting things about the parser:
 275 1) it shows how we use the Error routines.  When called, this function expects
 276 that the current token is a '(' token, but after parsing the subexpression, it
 277 is possible that there is not a ')' waiting.  For example, if the user types in
 278 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 279 occur, the parser needs a way to indicate that they happened: in our parser, we
 280 return null on an error.</p>
 281
 282 <p>Another interesting aspect of this function is that it uses recursion by
 283 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 284 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 285 recursive grammars, and keeps each production very simple.  Note that
 286 parenthesis do not cause construction of AST nodes themselves.  While we could
 287 do this, the most important role of parens are to guide the parser and provide
 288 grouping.  Once the parser constructs the AST, parens are not needed.</p>
 289
 290 <p>The next simple production is for handling variable references and function
 291 calls:</p>
 292
 293 <div class="doc_code">
 294 <pre>
 295 /// identifierexpr
 296 ///   ::= identifer
 297 ///   ::= identifer '(' expression* ')'
 298 static ExprAST *ParseIdentifierExpr() {
 299   std::string IdName = IdentifierStr;
 300
 301   getNextToken();  // eat identifer.
 302
 303   if (CurTok != '(') // Simple variable ref.
 304     return new VariableExprAST(IdName);
 305
 306   // Call.
 307   getNextToken();  // eat (
 308   std::vector&lt;ExprAST*&gt; Args;
 309   while (1) {
 310     ExprAST *Arg = ParseExpression();
 311     if (!Arg) return 0;
 312     Args.push_back(Arg);
 313
 314     if (CurTok == ')') break;
 315
 316     if (CurTok != ',')
 317       return Error("Expected ')'");
 318     getNextToken();
 319   }
 320
 321   // Eat the ')'.
 322   getNextToken();
 323
 324   return new CallExprAST(IdName, Args);
 325 }
 326 </pre>
 327 </div>
 328
 329 <p>This routine follows the same style as the other routines (it expects to be
 330 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 331 recursion and error handling.  One interesting aspect of this is that it uses
 332 <em>look-ahead</em> to determine if the current identifier is a stand alone
 333 variable reference or if it is a function call expression.  It handles this by
 334 checking to see if the token after the identifier is a '(' token, and constructs
 335 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 336 </p>
 337
 338 <p>Now that we have all of our simple expression parsing logic in place, we can
 339 define a helper function to wrap them up in a class.  We call this class of
 340 expressions "primary" expressions, for reasons that will become more clear
 341 later.  In order to parse a primary expression, we need to determine what sort
 342 of expression it is:</p>
 343
 344 <div class="doc_code">
 345 <pre>
 346 /// primary
 347 ///   ::= identifierexpr
 348 ///   ::= numberexpr
 349 ///   ::= parenexpr
 350 static ExprAST *ParsePrimary() {
 351   switch (CurTok) {
 352   default: return Error("unknown token when expecting an expression");
 353   case tok_identifier: return ParseIdentifierExpr();
 354   case tok_number:     return ParseNumberExpr();
 355   case '(':            return ParseParenExpr();
 356   }
 357 }
 358 </pre>
 359 </div>
 360
 361 <p>Now that you see the definition of this function, it makes it more obvious
 362 why we can assume the state of CurTok in the various functions.  This uses
 363 look-ahead to determine which sort of expression is being inspected, and parses
 364 it with a function call.</p>
 365
 366 <p>Now that basic expressions are handled, we need to handle binary expressions,
 367 which are a bit more complex.</p>
 368
 369 </div>
 370
 371 <!-- *********************************************************************** -->
 372 <div class="doc_section"><a name="parserbinops">Binary Expression
 373  Parsing</a></div>
 374 <!-- *********************************************************************** -->
 375
 376 <div class="doc_text">
 377
 378 <p>Binary expressions are significantly harder to parse because they are often
 379 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 380 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 381 mathematics, we expect the later parse, because "*" (multiplication) has
 382 higher <em>precedence</em> than "+" (addition).</p>
 383
 384 <p>There are many ways to handle this, but an elegant and efficient way is to
 385 use <a href=
 386 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 387 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 388 guide recursion.  To start with, we need a table of precedences:</p>
 389
 390 <div class="doc_code">
 391 <pre>
 392 /// BinopPrecedence - This holds the precedence for each binary operator that is
 393 /// defined.
 394 static std::map&lt;char, int&gt; BinopPrecedence;
 395
 396 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 397 static int GetTokPrecedence() {
 398   if (!isascii(CurTok))
 399     return -1;
 400
 401   // Make sure it's a declared binop.
 402   int TokPrec = BinopPrecedence[CurTok];
 403   if (TokPrec &lt;= 0) return -1;
 404   return TokPrec;
 405 }
 406
 407 int main() {
 408   // Install standard binary operators.
 409   // 1 is lowest precedence.
 410   BinopPrecedence['&lt;'] = 10;
 411   BinopPrecedence['+'] = 20;
 412   BinopPrecedence['-'] = 20;
 413   BinopPrecedence['*'] = 40;  // highest.
 414   ...
 415 }
 416 </pre>
 417 </div>
 418
 419 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 420 (this can obviously be extended by you, the reader).  The
 421 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 422 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 423 new operators and makes it clear that the algorithm doesn't depend on the
 424 specific operators involved, but it would be easy enough to eliminate the map
 425 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
 426
 427 <p>With the helper above defined, we can now start parsing binary expressions.
 428 The basic idea of operator precedence parsing is to break down an expression
 429 with potentially ambiguous binary operators into pieces.  Consider for example
 430 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 431 as a stream of primary expressions separated by binary operators.  As such,
 432 it will first parse the leading primary expression "a", then it will see the
 433 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 434 are primary expressions that the binary expression parser doesn't need to worry
 435 about nested subexpressions like (c+d) at all.
 436 </p>
 437
 438 <p>
 439 To start, an expression is a primary expression potentially followed by a
 440 sequence of [binop,primaryexpr] pairs:</p>
 441
 442 <div class="doc_code">
 443 <pre>
 444 /// expression
 445 ///   ::= primary binoprhs
 446 ///
 447 static ExprAST *ParseExpression() {
 448   ExprAST *LHS = ParsePrimary();
 449   if (!LHS) return 0;
 450
 451   return ParseBinOpRHS(0, LHS);
 452 }
 453 </pre>
 454 </div>
 455
 456 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 457 us.  It takes a precedence and a pointer to an expression for the part parsed
 458 so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 459 allowed to be empty, in which case it returns the expression that is passed into
 460 it. In our example above, the code passes the expression for "a" into
 461 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 462
 463 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 464 minimal operator precedence</em> that the function is allowed to eat.  For
 465 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 466 passed in a precedence of 40, it will not consume any tokens (because the
 467 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 468 with:</p>
 469
 470 <div class="doc_code">
 471 <pre>
 472 /// binoprhs
 473 ///   ::= ('+' primary)*
 474 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 475   // If this is a binop, find its precedence.
 476   while (1) {
 477     int TokPrec = GetTokPrecedence();
 478
 479     // If this is a binop that binds at least as tightly as the current binop,
 480     // consume it, otherwise we are done.
 481     if (TokPrec &lt; ExprPrec)
 482       return LHS;
 483 </pre>
 484 </div>
 485
 486 <p>This code gets the precedence of the current token and checks to see if if is
 487 too low.  Because we defined invalid tokens to have a precedence of -1, this
 488 check implicitly knows that the pair-stream ends when the token stream runs out
 489 of binary operators.  If this check succeeds, we know that the token is a binary
 490 operator and that it will be included in this expression:</p>
 491
 492 <div class="doc_code">
 493 <pre>
 494     // Okay, we know this is a binop.
 495     int BinOp = CurTok;
 496     getNextToken();  // eat binop
 497
 498     // Parse the primary expression after the binary operator.
 499     ExprAST *RHS = ParsePrimary();
 500     if (!RHS) return 0;
 501 </pre>
 502 </div>
 503
 504 <p>As such, this code eats (and remembers) the binary operator and then parses
 505 the following primary expression.  This builds up the whole pair, the first of
 506 which is [+, b] for the running example.</p>
 507
 508 <p>Now that we parsed the left-hand side of an expression and one pair of the
 509 RHS sequence, we have to decide which way the expression associates.  In
 510 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 511 To determine this, we look ahead at "binop" to determine its precedence and
 512 compare it to BinOp's precedence (which is '+' in this case):</p>
 513
 514 <div class="doc_code">
 515 <pre>
 516     // If BinOp binds less tightly with RHS than the operator after RHS, let
 517     // the pending operator take RHS as its LHS.
 518     int NextPrec = GetTokPrecedence();
 519     if (TokPrec &lt; NextPrec) {
 520 </pre>
 521 </div>
 522
 523 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 524 precedence of our current operator, then we know that the parentheses associate
 525 as "(a+b) binop ...".  In our example, since the next operator is "+" and so is
 526 our current one, we know that they have the same precedence.  In this case we'll
 527 create the AST node for "a+b", and then continue parsing:</p>
 528
 529 <div class="doc_code">
 530 <pre>
 531       ... if body omitted ...
 532     }
 533
 534     // Merge LHS/RHS.
 535     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 536   }  // loop around to the top of the while loop.
 537 }
 538 </pre>
 539 </div>
 540
 541 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 542 iteration of the loop, with "+" as the current token.  The code above will eat
 543 and remember it and parse "(c+d)" as the primary expression, which makes the
 544 current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the
 545 binop to the right of the primary.  In this case, the precedence of "*" is
 546 higher than the precedence of "+" so the if condition will be entered.</p>
 547
 548 <p>The critical question left here is "how can the if condition parse the right
 549 hand side in full"?  In particular, to build the AST correctly for our example,
 550 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 551 do this is surprisingly simple (code from the above two blocks duplicated for
 552 context):</p>
 553
 554 <div class="doc_code">
 555 <pre>
 556     // If BinOp binds less tightly with RHS than the operator after RHS, let
 557     // the pending operator take RHS as its LHS.
 558     int NextPrec = GetTokPrecedence();
 559     if (TokPrec &lt; NextPrec) {
 560       RHS = ParseBinOpRHS(TokPrec+1, RHS);
 561       if (RHS == 0) return 0;
 562     }
 563     // Merge LHS/RHS.
 564     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 565   }  // loop around to the top of the while loop.
 566 }
 567 </pre>
 568 </div>
 569
 570 <p>At this point, we know that the binary operator to the RHS of our primary
 571 has higher precedence than the binop we are currently parsing.  As such, we know
 572 that any sequence of pairs whose operators are all higher precedence than "+"
 573 should be parsed together and returned as "RHS".  To do this, we recursively
 574 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 575 precedence required for it to continue.  In our example above, this will cause
 576 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 577 of the '+' expression.</p>
 578
 579 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
 580 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 581 correctly handle fully general binary expression parsing in a very elegant way.
 582 </p>
 583
 584 <p>This wraps up handling of expressions.  At this point, we can point the
 585 parser at an arbitrary token stream and build an expression from them, stopping
 586 at the first token that is not part of the expression.  Next up we need to
 587 handle function definitions etc.</p>
 588
 589 </div>
 590
 591 <!-- *********************************************************************** -->
 592 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 593 <!-- *********************************************************************** -->
 594
 595 <div class="doc_text">
 596
 597 <p>
 598 The first basic thing missing is that of function prototypes.  In Kaleidoscope,
 599 these are used both for 'extern' function declarations as well as function body
 600 definitions.  The code to do this is straight-forward and not very interesting
 601 (once you've survived expressions):
 602 </p>
 603
 604 <div class="doc_code">
 605 <pre>
 606 /// prototype
 607 ///   ::= id '(' id* ')'
 608 static PrototypeAST *ParsePrototype() {
 609   if (CurTok != tok_identifier)
 610     return ErrorP("Expected function name in prototype");
 611
 612   std::string FnName = IdentifierStr;
 613   getNextToken();
 614
 615   if (CurTok != '(')
 616     return ErrorP("Expected '(' in prototype");
 617
 618   std::vector&lt;std::string&gt; ArgNames;
 619   while (getNextToken() == tok_identifier)
 620     ArgNames.push_back(IdentifierStr);
 621   if (CurTok != ')')
 622     return ErrorP("Expected ')' in prototype");
 623
 624   // success.
 625   getNextToken();  // eat ')'.
 626
 627   return new PrototypeAST(FnName, ArgNames);
 628 }
 629 </pre>
 630 </div>
 631
 632 <p>Given this, a function definition is very simple, just a prototype plus
 633 and expression to implement the body:</p>
 634
 635 <div class="doc_code">
 636 <pre>
 637 /// definition ::= 'def' prototype expression
 638 static FunctionAST *ParseDefinition() {
 639   getNextToken();  // eat def.
 640   PrototypeAST *Proto = ParsePrototype();
 641   if (Proto == 0) return 0;
 642
 643   if (ExprAST *E = ParseExpression())
 644     return new FunctionAST(Proto, E);
 645   return 0;
 646 }
 647 </pre>
 648 </div>
 649
 650 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 651 well as to support forward declaration of user functions.  'externs' are just
 652 prototypes with no body:</p>
 653
 654 <div class="doc_code">
 655 <pre>
 656 /// external ::= 'extern' prototype
 657 static PrototypeAST *ParseExtern() {
 658   getNextToken();  // eat extern.
 659   return ParsePrototype();
 660 }
 661 </pre>
 662 </div>
 663
 664 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 665 evaluate them on the fly.  We will handle this by defining anonymous nullary
 666 (zero argument) functions for them:</p>
 667
 668 <div class="doc_code">
 669 <pre>
 670 /// toplevelexpr ::= expression
 671 static FunctionAST *ParseTopLevelExpr() {
 672   if (ExprAST *E = ParseExpression()) {
 673     // Make an anonymous proto.
 674     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 675     return new FunctionAST(Proto, E);
 676   }
 677   return 0;
 678 }
 679 </pre>
 680 </div>
 681
 682 <p>Now that we have all the pieces, lets build a little driver that will let us
 683 actually <em>execute</em> this code we've built!</p>
 684
 685 </div>
 686
 687 <!-- *********************************************************************** -->
 688 <div class="doc_section"><a name="driver">The Driver</a></div>
 689 <!-- *********************************************************************** -->
 690
 691 <div class="doc_text">
 692
 693 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 694 dispatch loop.  There isn't much interesting here, so I'll just include the
 695 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 696 Parsing" section.</p>
 697
 698 <div class="doc_code">
 699 <pre>
 700 /// top ::= definition | external | expression | ';'
 701 static void MainLoop() {
 702   while (1) {
 703     fprintf(stderr, "ready&gt; ");
 704     switch (CurTok) {
 705     case tok_eof:    return;
 706     case ';':        getNextToken(); break;  // ignore top level semicolons.
 707     case tok_def:    HandleDefinition(); break;
 708     case tok_extern: HandleExtern(); break;
 709     default:         HandleTopLevelExpression(); break;
 710     }
 711   }
 712 }
 713 </pre>
 714 </div>
 715
 716 <p>The most interesting part of this is that we ignore top-level semi colons.
 717 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 718 command line, the parser doesn't know that that is the end of what you will
 719 type.  For example, on the next line you could type "def foo..." in which case
 720 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 721 which would continue the expression.  Having top-level semicolons allows you to
 722 type "4+5;" and the parser will know you are done.</p>
 723
 724 </div>
 725
 726 <!-- *********************************************************************** -->
 727 <div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
 728 <!-- *********************************************************************** -->
 729
 730 <div class="doc_text">
 731
 732 <p>With just under 400 lines of commented code, we fully defined our minimal
 733 language, including a lexer, parser and AST builder.  With this done, the
 734 executable will validate code and tell us if it is gramatically invalid.  For
 735 example, here is a sample interaction:</p>
 736
 737 <div class="doc_code">
 738 <pre>
 739 $ ./a.out
 740 ready> def foo(x y) x+foo(y, 4.0);
 741 ready> Parsed an function definition.
 742 ready> def foo(x y) x+y y;
 743 ready> Parsed an function definition.
 744 ready> Parsed a top-level expr
 745 ready> def foo(x y) x+y );
 746 ready> Parsed an function definition.
 747 ready> Error: unknown token when expecting an expression
 748 ready> extern sin(a);
 749 ready> Parsed an extern
 750 ready> ^D
 751 $
 752 </pre>
 753 </div>
 754
 755 <p>
 756 Here is the full code.  Note that it is fully self-contained: you don't even
 757 need LLVM for this.  In the <a href="LangImpl3.html">next installment</a>, we
 758 will describe how to generate LLVM IR from the AST.</p>
 759
 760 <div class="doc_code">
 761 <pre>
 762 // To build this:
 763 //  g++ -g toy.cpp
 764 //  ./a.out
 765
 766 #include &lt;cstdio&gt;
 767 #include &lt;string&gt;
 768 #include &lt;
 769 #include &lt;vector&gt;
 770
 771 //===----------------------------------------------------------------------===//
 772 // Lexer
 773 //===----------------------------------------------------------------------===//
 774
 775 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 776 // of these for known things.
 777 enum Token {
 778   tok_eof = -1,
 779
 780   // commands
 781   tok_def = -2, tok_extern = -3,
 782
 783   // primary
 784   tok_identifier = -4, tok_number = -5,
 785 };
 786
 787 static std::string IdentifierStr;  // Filled in if tok_identifier
 788 static double NumVal;              // Filled in if tok_number
 789
 790 /// gettok - Return the next token from standard input.
 791 static int gettok() {
 792   static int LastChar = ' ';
 793
 794   // Skip any whitespace.
 795   while (isspace(LastChar))
 796     LastChar = getchar();
 797
 798   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 799     IdentifierStr = LastChar;
 800     while (isalnum((LastChar = getchar())))
 801       IdentifierStr += LastChar;
 802
 803     if (IdentifierStr == "def") return tok_def;
 804     if (IdentifierStr == "extern") return tok_extern;
 805     return tok_identifier;
 806   }
 807
 808   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 809     std::string NumStr;
 810     do {
 811       NumStr += LastChar;
 812       LastChar = getchar();
 813     } while (isdigit(LastChar) || LastChar == '.');
 814
 815     NumVal = strtod(NumStr.c_str(), 0);
 816     return tok_number;
 817   }
 818
 819   if (LastChar == '#') {
 820     // Comment until end of line.
 821     do LastChar = getchar();
 822     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
 823
 824     if (LastChar != EOF)
 825       return gettok();
 826   }
 827
 828   // Check for end of file.  Don't eat the EOF.
 829   if (LastChar == EOF)
 830     return tok_eof;
 831
 832   // Otherwise, just return the character as its ascii value.
 833   int ThisChar = LastChar;
 834   LastChar = getchar();
 835   return ThisChar;
 836 }
 837
 838 //===----------------------------------------------------------------------===//
 839 // Abstract Syntax Tree (aka Parse Tree)
 840 //===----------------------------------------------------------------------===//
 841
 842 /// ExprAST - Base class for all expression nodes.
 843 class ExprAST {
 844 public:
 845   virtual ~ExprAST() {}
 846 };
 847
 848 /// NumberExprAST - Expression class for numeric literals like "1.0".
 849 class NumberExprAST : public ExprAST {
 850   double Val;
 851 public:
 852   NumberExprAST(double val) : Val(val) {}
 853 };
 854
 855 /// VariableExprAST - Expression class for referencing a variable, like "a".
 856 class VariableExprAST : public ExprAST {
 857   std::string Name;
 858 public:
 859   VariableExprAST(const std::string &amp;name) : Name(name) {}
 860 };
 861
 862 /// BinaryExprAST - Expression class for a binary operator.
 863 class BinaryExprAST : public ExprAST {
 864   char Op;
 865   ExprAST *LHS, *RHS;
 866 public:
 867   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 868     : Op(op), LHS(lhs), RHS(rhs) {}
 869 };
 870
 871 /// CallExprAST - Expression class for function calls.
 872 class CallExprAST : public ExprAST {
 873   std::string Callee;
 874   std::vector&lt;ExprAST*&gt; Args;
 875 public:
 876   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 877     : Callee(callee), Args(args) {}
 878 };
 879
 880 /// PrototypeAST - This class represents the "prototype" for a function,
 881 /// which captures its argument names as well as if it is an operator.
 882 class PrototypeAST {
 883   std::string Name;
 884   std::vector&lt; Args;
 885 public:
 886   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 887     : Name(name), Args(args) {}
 888
 889 };
 890
 891 /// FunctionAST - This class represents a function definition itself.
 892 class FunctionAST {
 893   PrototypeAST *Proto;
 894   ExprAST *Body;
 895 public:
 896   FunctionAST(PrototypeAST *proto, ExprAST *body)
 897     : Proto(proto), Body(body) {}
 898
 899 };
 900
 901 //===----------------------------------------------------------------------===//
 902 // Parser
 903 //===----------------------------------------------------------------------===//
 904
 905 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 906 /// token the parser it looking at.  getNextToken reads another token from the
 907 /// lexer and updates CurTok with its results.
 908 static int CurTok;
 909 static int getNextToken() {
 910   return CurTok = gettok();
 911 }
 912
 913 /// BinopPrecedence - This holds the precedence for each binary operator that is
 914 /// defined.
 915 static std::map&lt;char, int&gt; BinopPrecedence;
 916
 917 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 918 static int GetTokPrecedence() {
 919   if (!isascii(CurTok))
 920     return -1;
 921
 922   // Make sure it's a declared binop.
 923   int TokPrec = BinopPrecedence[CurTok];
 924   if (TokPrec &lt;= 0) return -1;
 925   return TokPrec;
 926 }
 927
 928 /// Error* - These are little helper functions for error handling.
 929 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 930 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 931 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 932
 933 static ExprAST *ParseExpression();
 934
 935 /// identifierexpr
 936 ///   ::= identifer
 937 ///   ::= identifer '(' expression* ')'
 938 static ExprAST *ParseIdentifierExpr() {
 939   std::string IdName = IdentifierStr;
 940
 941   getNextToken();  // eat identifer.
 942
 943   if (CurTok != '(') // Simple variable ref.
 944     return new VariableExprAST(IdName);
 945
 946   // Call.
 947   getNextToken();  // eat (
 948   std::vector&lt;ExprAST*&gt; Args;
 949   while (1) {
 950     ExprAST *Arg = ParseExpression();
 951     if (!Arg) return 0;
 952     Args.push_back(Arg);
 953
 954     if (CurTok == ')') break;
 955
 956     if (CurTok != ',')
 957       return Error("Expected ')'");
 958     getNextToken();
 959   }
 960
 961   // Eat the ')'.
 962   getNextToken();
 963
 964   return new CallExprAST(IdName, Args);
 965 }
 966
 967 /// numberexpr ::= number
 968 static ExprAST *ParseNumberExpr() {
 969   ExprAST *Result = new NumberExprAST(NumVal);
 970   getNextToken(); // consume the number
 971   return Result;
 972 }
 973
 974 /// parenexpr ::= '(' expression ')'
 975 static ExprAST *ParseParenExpr() {
 976   getNextToken();  // eat (.
 977   ExprAST *V = ParseExpression();
 978   if (!V) return 0;
 979
 980   if (CurTok != ')')
 981     return Error("expected ')'");
 982   getNextToken();  // eat ).
 983   return V;
 984 }
 985
 986 /// primary
 987 ///   ::= identifierexpr
 988 ///   ::= numberexpr
 989 ///   ::= parenexpr
 990 static ExprAST *ParsePrimary() {
 991   switch (CurTok) {
 992   default: return Error("unknown token when expecting an expression");
 993   case tok_identifier: return ParseIdentifierExpr();
 994   case tok_number:     return ParseNumberExpr();
 995   case '(':            return ParseParenExpr();
 996   }
 997 }
 998
 999 /// binoprhs
1000 ///   ::= ('+' primary)*
1001 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1002   // If this is a binop, find its precedence.
1003   while (1) {
1004     int TokPrec = GetTokPrecedence();
1005
1006     // If this is a binop that binds at least as tightly as the current binop,
1007     // consume it, otherwise we are done.
1008     if (TokPrec &lt; ExprPrec)
1009       return LHS;
1010
1011     // Okay, we know this is a binop.
1012     int BinOp = CurTok;
1013     getNextToken();  // eat binop
1014
1015     // Parse the primary expression after the binary operator.
1016     ExprAST *RHS = ParsePrimary();
1017     if (!RHS) return 0;
1018
1019     // If BinOp binds less tightly with RHS than the operator after RHS, let
1020     // the pending operator take RHS as its LHS.
1021     int NextPrec = GetTokPrecedence();
1022     if (TokPrec &lt; NextPrec) {
1023       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1024       if (RHS == 0) return 0;
1025     }
1026
1027     // Merge LHS/RHS.
1028     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1029   }
1030 }
1031
1032 /// expression
1033 ///   ::= primary binoprhs
1034 ///
1035 static ExprAST *ParseExpression() {
1036   ExprAST *LHS = ParsePrimary();
1037   if (!LHS) return 0;
1038
1039   return ParseBinOpRHS(0, LHS);
1040 }
1041
1042 /// prototype
1043 ///   ::= id '(' id* ')'
1044 static PrototypeAST *ParsePrototype() {
1045   if (CurTok != tok_identifier)
1046     return ErrorP("Expected function name in prototype");
1047
1048   std::string FnName = IdentifierStr;
1049   getNextToken();
1050
1051   if (CurTok != '(')
1052     return ErrorP("Expected '(' in prototype");
1053
1054   std::vector&lt;std::string&gt; ArgNames;
1055   while (getNextToken() == tok_identifier)
1056     ArgNames.push_back(IdentifierStr);
1057   if (CurTok != ')')
1058     return ErrorP("Expected ')' in prototype");
1059
1060   // success.
1061   getNextToken();  // eat ')'.
1062
1063   return new PrototypeAST(FnName, ArgNames);
1064 }
1065
1066 /// definition ::= 'def' prototype expression
1067 static FunctionAST *ParseDefinition() {
1068   getNextToken();  // eat def.
1069   PrototypeAST *Proto = ParsePrototype();
1070   if (Proto == 0) return 0;
1071
1072   if (ExprAST *E = ParseExpression())
1073     return new FunctionAST(Proto, E);
1074   return 0;
1075 }
1076
1077 /// toplevelexpr ::= expression
1078 static FunctionAST *ParseTopLevelExpr() {
1079   if (ExprAST *E = ParseExpression()) {
1080     // Make an anonymous proto.
1081     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1082     return new FunctionAST(Proto, E);
1083   }
1084   return 0;
1085 }
1086
1087 /// external ::= 'extern' prototype
1088 static PrototypeAST *ParseExtern() {
1089   getNextToken();  // eat extern.
1090   return ParsePrototype();
1091 }
1092
1093 //===----------------------------------------------------------------------===//
1094 // Top-Level parsing
1095 //===----------------------------------------------------------------------===//
1096
1097 static void HandleDefinition() {
1098   if (FunctionAST *F = ParseDefinition()) {
1099     fprintf(stderr, "Parsed a function definition.\n");
1100   } else {
1101     // Skip token for error recovery.
1102     getNextToken();
1103   }
1104 }
1105
1106 static void HandleExtern() {
1107   if (PrototypeAST *P = ParseExtern()) {
1108     fprintf(stderr, "Parsed an extern\n");
1109   } else {
1110     // Skip token for error recovery.
1111     getNextToken();
1112   }
1113 }
1114
1115 static void HandleTopLevelExpression() {
1116   // Evaluate a top level expression into an anonymous function.
1117   if (FunctionAST *F = ParseTopLevelExpr()) {
1118     fprintf(stderr, "Parsed a top-level expr\n");
1119   } else {
1120     // Skip token for error recovery.
1121     getNextToken();
1122   }
1123 }
1124
1125 /// top ::= definition | external | expression | ';'
1126 static void MainLoop() {
1127   while (1) {
1128     fprintf(stderr, "ready&gt; ");
1129     switch (CurTok) {
1130     case tok_eof:    return;
1131     case ';':        getNextToken(); break;  // ignore top level semicolons.
1132     case tok_def:    HandleDefinition(); break;
1133     case tok_extern: HandleExtern(); break;
1134     default:         HandleTopLevelExpression(); break;
1135     }
1136   }
1137 }
1138
1139 //===----------------------------------------------------------------------===//
1140 // Main driver code.
1141 //===----------------------------------------------------------------------===//
1142
1143 int main() {
1144   // Install standard binary operators.
1145   // 1 is lowest precedence.
1146   BinopPrecedence['&lt;'] = 10;
1147   BinopPrecedence['+'] = 20;
1148   BinopPrecedence['-'] = 20;
1149   BinopPrecedence['*'] = 40;  // highest.
1150
1151   // Prime the first token.
1152   fprintf(stderr, "ready&gt; ");
1153   getNextToken();
1154
1155   MainLoop();
1156   return 0;
1157 }
1158 </pre>
1159 </div>
1160 </div>
1161
1162 <!-- *********************************************************************** -->
1163 <hr>
1164 <address>
1165   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1166   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1167   <a href="http://validator.w3.org/check/referer"><img
1168   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1169
1170   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1171   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1172   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1173 </address>
1174 </body>
1175 </html>