<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                      "http://www.w3.org/TR/html4/strict.dtd">
<title>Kaleidoscope: Implementing a Parser and AST</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="author" content="Chris Lattner">
<link rel="stylesheet" href="../llvm.css" type="text/css">
<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
<ul>
<li><a href="index.html">Up to Tutorial Index</a></li>
<li><a href="#intro">Chapter 2 Introduction</a></li>
<li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
<li><a href="#parserbasics">Parser Basics</a></li>
<li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
<li><a href="#parserbinops">Binary Expression Parsing</a></li>
<li><a href="#parsertop">Parsing the Rest</a></li>
<li><a href="#driver">The Driver</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#code">Full Code Listing</a></li>
<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
</ul>
<div class="doc_author">
  <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">
<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
with LLVM</a>" tutorial. This chapter shows you how to use the <a
href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
our Kaleidoscope language. Once we have a parser, we'll define and build an <a
href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
Tree</a> (AST).</p>

<p>The parser we will build uses a combination of <a
href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
Parsing</a> and <a href=
"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
Parsing</a> to parse the Kaleidoscope language (the latter for binary expressions
and the former for everything else). Before we get to parsing, though, let's talk
about the output of the parser: the Abstract Syntax Tree.</p>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">
<p>The AST for a program captures its behavior in a way that is easy for
later stages of the compiler (e.g. code generation) to interpret. We basically
want one object for each construct in the language, and the AST should closely
model the language. In Kaleidoscope, we have expressions, a prototype, and a
function object. We'll start with expressions:</p>

<div class="doc_code">
<pre>
/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
  virtual ~ExprAST() {}
};

/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
  double Val;
public:
  explicit NumberExprAST(double val) : Val(val) {}
};
</pre>
</div>

<p>The code above shows the definition of the base ExprAST class and one
subclass which we use for numeric literals. The important thing about this is
that the NumberExprAST class captures the numeric value of the literal in the
class, so that later phases of the compiler can know what it is.</p>

<p>Right now we only create the AST, so there are no useful accessor methods on
the nodes. It would be very easy to add a virtual method to pretty print the code,
for example. Here are the other expression AST node definitions that we'll use
in the basic form of the Kaleidoscope language:</p>

<div class="doc_code">
<pre>
/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
  std::string Name;
public:
  explicit VariableExprAST(const std::string &name) : Name(name) {}
};

/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
  char Op;
  ExprAST *LHS, *RHS;
public:
  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    : Op(op), LHS(lhs), RHS(rhs) {}
};

/// CallExprAST - Expression class for function calls.
class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<ExprAST*> Args;
public:
  CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    : Callee(callee), Args(args) {}
};
</pre>
</div>

<p>This is all (intentionally) rather straightforward: variables capture the
variable name, binary operators capture their opcode (e.g. '+'), and calls
capture a function name and list of argument expressions. One thing that is
nice about our AST is that it captures the language features without talking
about the syntax of the language. Note that there is no discussion about
precedence of binary operators, lexical structure, etc.</p>
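
<p>As a rough sketch of the pretty-printing idea mentioned earlier, the nodes
could each gain a virtual method that renders them back to text. The
<tt>ToString</tt> method, the use of <tt>std::unique_ptr</tt> for ownership,
and the <tt>DemoPrettyPrint</tt> helper are all assumptions made for this
illustration; the tutorial's own classes use raw pointers and have no such
method.</p>

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical variant of the node classes with a ToString() method added.
class ExprAST {
public:
  virtual ~ExprAST() {}
  virtual std::string ToString() const = 0;
};

class NumberExprAST : public ExprAST {
  double Val;
public:
  explicit NumberExprAST(double val) : Val(val) {}
  std::string ToString() const override {
    char Buf[32];
    std::snprintf(Buf, sizeof(Buf), "%g", Val);  // shortest readable form
    return Buf;
  }
};

class VariableExprAST : public ExprAST {
  std::string Name;
public:
  explicit VariableExprAST(const std::string &name) : Name(name) {}
  std::string ToString() const override { return Name; }
};

class BinaryExprAST : public ExprAST {
  char Op;
  std::unique_ptr<ExprAST> LHS, RHS;
public:
  BinaryExprAST(char op, std::unique_ptr<ExprAST> lhs,
                std::unique_ptr<ExprAST> rhs)
    : Op(op), LHS(std::move(lhs)), RHS(std::move(rhs)) {}
  // Always parenthesize, so the tree shape is visible in the output.
  std::string ToString() const override {
    return "(" + LHS->ToString() + " " + Op + " " + RHS->ToString() + ")";
  }
};

class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<std::unique_ptr<ExprAST>> Args;
public:
  CallExprAST(const std::string &callee,
              std::vector<std::unique_ptr<ExprAST>> args)
    : Callee(callee), Args(std::move(args)) {}
  std::string ToString() const override {
    std::string Out = Callee + "(";
    for (std::size_t i = 0; i != Args.size(); ++i) {
      if (i != 0) Out += ", ";
      Out += Args[i]->ToString();
    }
    return Out + ")";
  }
};

// Builds the AST for "x + foo(y, 4)" by hand and renders it back to text.
std::string DemoPrettyPrint() {
  std::vector<std::unique_ptr<ExprAST>> Args;
  Args.push_back(std::make_unique<VariableExprAST>("y"));
  Args.push_back(std::make_unique<NumberExprAST>(4.0));
  auto Call = std::make_unique<CallExprAST>("foo", std::move(Args));
  BinaryExprAST Root('+', std::make_unique<VariableExprAST>("x"),
                     std::move(Call));
  return Root.ToString();
}
```

<p>Note that nothing in the printed form depends on how the expression was
originally written: the AST only records the structure.</p>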

<p>For our basic language, these are all of the expression nodes we'll define.
Because it doesn't have conditional control flow, it isn't Turing-complete;
we'll fix that in a later installment. The two things we need next are a way
to talk about the interface to a function, and a way to talk about functions
themselves:</p>

<div class="doc_code">
<pre>
/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
public:
  PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    : Name(name), Args(args) {}
};

/// FunctionAST - This class represents a function definition itself.
class FunctionAST {
  PrototypeAST *Proto;
  ExprAST *Body;
public:
  FunctionAST(PrototypeAST *proto, ExprAST *body)
    : Proto(proto), Body(body) {}
};
</pre>
</div>

<p>In Kaleidoscope, functions are typed with just a count of their arguments.
Since all values are double precision floating point, the type of each argument
doesn't need to be stored anywhere. In a more aggressive and realistic
language, the "ExprAST" class would probably have a type field.</p>

<p>With this scaffolding, we can now talk about parsing expressions and function
bodies in Kaleidoscope.</p>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>Now that we have an AST to build, we need to define the parser code to build
it. The idea here is that we want to parse something like "x+y" (which is
returned as three tokens by the lexer) into an AST that could be generated with
calls like this:</p>

<div class="doc_code">
<pre>
  ExprAST *X = new VariableExprAST("x");
  ExprAST *Y = new VariableExprAST("y");
  ExprAST *Result = new BinaryExprAST('+', X, Y);
</pre>
</div>

<p>In order to do this, we'll start by defining some basic helper routines:</p>

<div class="doc_code">
<pre>
/// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
/// token the parser is looking at.  getNextToken reads another token from the
/// lexer and updates CurTok with its results.
static int CurTok;
static int getNextToken() {
  return CurTok = gettok();
}
</pre>
</div>

<p>This implements a simple token buffer around the lexer. This allows
us to look one token ahead at what the lexer is returning. Every function in
our parser will assume that CurTok is the current token that needs to be
parsed.</p>

<div class="doc_code">
<pre>
/// Error* - These are little helper functions for error handling.
ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str); return 0; }
PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
</pre>
</div>

<p>The <tt>Error</tt> routines are simple helper routines that our parser will use
to handle errors. The error recovery in our parser will not be the best and
is not particularly user-friendly, but it will be enough for our tutorial. These
routines make it easier to handle errors in routines that have various return
types: they always return null.</p>
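
<p>To see the token buffer and the null-returning error convention in
isolation, here is a small sketch that can be compiled without the Chapter 1
lexer. The canned <tt>gettok</tt> stand-in, the <tt>CountTokens</tt> helper,
and the templated <tt>ErrorAs</tt> function are hypothetical names introduced
only for this example; the tutorial itself keeps three separate
<tt>Error</tt> functions.</p>

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the lexer: replays a canned list of tokens and
// then returns -1 (the value of tok_eof in the tutorial's Token enum).
static std::vector<int> FakeTokens;
static std::size_t FakePos = 0;
static int gettok() {
  return FakePos < FakeTokens.size() ? FakeTokens[FakePos++] : -1;
}

// Same one-token buffer as in the text above.
static int CurTok;
static int getNextToken() {
  return CurTok = gettok();
}

// One templated helper in place of Error/ErrorP/ErrorF (an assumption of this
// sketch): prints the message and returns a null pointer of the wanted type.
template <typename T> T *ErrorAs(const char *Str) {
  std::fprintf(stderr, "Error: %s\n", Str);
  return nullptr;
}

// Primes CurTok, then walks the buffer until end of input, counting tokens.
int CountTokens(const std::vector<int> &Tokens) {
  FakeTokens = Tokens;
  FakePos = 0;
  int N = 0;
  for (getNextToken(); CurTok != -1; getNextToken())
    ++N;
  return N;
}
```

<p>The loop in <tt>CountTokens</tt> follows the same convention as the parser
routines: <tt>CurTok</tt> always holds the token that still needs to be
handled.</p>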

<p>With these basic helper functions implemented, we can implement the first
piece of our grammar: we'll start with numeric literals.</p>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="parserprimexprs">Basic Expression
 Parsing</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>We start with numeric literals, because they are the simplest to process.
For each production in our grammar, we'll define a function which parses that
production. For numeric literals, we have:</p>

<div class="doc_code">
<pre>
/// numberexpr ::= number
static ExprAST *ParseNumberExpr() {
  ExprAST *Result = new NumberExprAST(NumVal);
  getNextToken(); // consume the number
  return Result;
}
</pre>
</div>

<p>This routine is very simple: it expects to be called when the current token
is a <tt>tok_number</tt> token. It takes the current number value, creates
a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
returns it.</p>

<p>There are some interesting aspects of this. The most important one is that
this routine eats all of the tokens that correspond to the production, and
returns the lexer buffer with the next token (which is not part of the grammar
production) ready to go. This is a fairly standard way to go for recursive
descent parsers. For a better example, the parenthesis operator is defined like
this:</p>

<div class="doc_code">
<pre>
/// parenexpr ::= '(' expression ')'
static ExprAST *ParseParenExpr() {
  getNextToken();  // eat (.
  ExprAST *V = ParseExpression();
  if (!V) return 0;

  if (CurTok != ')')
    return Error("expected ')'");
  getNextToken();  // eat ).
  return V;
}
</pre>
</div>

<p>This function illustrates a number of interesting things about the parser.
First, it shows how we use the Error routines. When called, this function expects
that the current token is a '(' token, but after parsing the subexpression, it
is possible that there is no ')' waiting. For example, if the user types in
"(4 x" instead of "(4)", the parser should emit an error. Because errors can
occur, the parser needs a way to indicate that they happened: in our parser, we
return null on an error.</p>

<p>Another interesting aspect of this function is that it uses recursion by
calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
recursive grammars, and keeps each production very simple. Note that
parentheses do not cause construction of AST nodes themselves. While we could
do this, the most important role of parentheses is to guide the parser and provide
grouping. Once the parser constructs the AST, parentheses are not needed.</p>

<p>The next simple production is for handling variable references and function
calls:</p>

<div class="doc_code">
<pre>
/// identifierexpr
///   ::= identifier
///   ::= identifier '(' expression* ')'
static ExprAST *ParseIdentifierExpr() {
  std::string IdName = IdentifierStr;

  getNextToken();  // eat identifier.

  if (CurTok != '(') // Simple variable ref.
    return new VariableExprAST(IdName);

  // Call.
  getNextToken();  // eat (
  std::vector<ExprAST*> Args;
  if (CurTok != ')') {
    while (1) {
      ExprAST *Arg = ParseExpression();
      if (!Arg) return 0;
      Args.push_back(Arg);

      if (CurTok == ')') break;

      if (CurTok != ',')
        return Error("Expected ')'");
      getNextToken();
    }
  }

  // Eat the ')'.
  getNextToken();

  return new CallExprAST(IdName, Args);
}
</pre>
</div>

<p>This routine follows the same style as the other routines (it expects to be
called if the current token is a <tt>tok_identifier</tt> token). It also has
recursion and error handling. One interesting aspect of this is that it uses
<em>look-ahead</em> to determine if the current identifier is a standalone
variable reference or if it is a function call expression. It handles this by
checking to see if the token after the identifier is a '(' token, and constructs
either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
</p>

<p>Now that we have all of our simple expression parsing logic in place, we can
define a helper function to wrap it together into one entry point. We call this
class of expressions "primary" expressions, for reasons that will become more
clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to
parse an arbitrary primary expression, we need to determine what sort of
specific expression it is:</p>

<div class="doc_code">
<pre>
/// primary
///   ::= identifierexpr
///   ::= numberexpr
///   ::= parenexpr
static ExprAST *ParsePrimary() {
  switch (CurTok) {
  default: return Error("unknown token when expecting an expression");
  case tok_identifier: return ParseIdentifierExpr();
  case tok_number:     return ParseNumberExpr();
  case '(':            return ParseParenExpr();
  }
}
</pre>
</div>

<p>Now that you have seen the definition of this function, it is more obvious
why we can assume the state of CurTok in the various functions. This uses
look-ahead to determine which sort of expression is being inspected, and parses
it with a function call.</p>

<p>Now that basic expressions are handled, we need to handle binary expressions,
which are a bit more complex.</p>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="parserbinops">Binary Expression
 Parsing</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>Binary expressions are significantly harder to parse because they are often
ambiguous. For example, when given the string "x+y*z", the parser can choose
to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
mathematics, we expect the latter parse, because "*" (multiplication) has
higher <em>precedence</em> than "+" (addition).</p>

<p>There are many ways to handle this, but an elegant and efficient way is to
use <a href=
"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
Parsing</a>. This parsing technique uses the precedence of binary operators to
guide recursion. To start with, we need a table of precedences:</p>

<div class="doc_code">
<pre>
/// BinopPrecedence - This holds the precedence for each binary operator that is
/// defined.
static std::map<char, int> BinopPrecedence;

/// GetTokPrecedence - Get the precedence of the pending binary operator token.
static int GetTokPrecedence() {
  if (!isascii(CurTok))
    return -1;

  // Make sure it's a declared binop.
  int TokPrec = BinopPrecedence[CurTok];
  if (TokPrec <= 0) return -1;
  return TokPrec;
}

int main() {
  // Install standard binary operators.
  // 1 is lowest precedence.
  BinopPrecedence['<'] = 10;
  BinopPrecedence['+'] = 20;
  BinopPrecedence['-'] = 20;
  BinopPrecedence['*'] = 40;  // highest.
  ...
}
</pre>
</div>

<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
(this can obviously be extended by you, our brave and intrepid reader). The
<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
or -1 if the token is not a binary operator. Having a map makes it easy to add
new operators and makes it clear that the algorithm doesn't depend on the
specific operators involved, but it would be easy enough to eliminate the map
and do the comparisons in the <tt>GetTokPrecedence</tt> function (or just use
a fixed-size array).</p>
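
<p>The fixed-size array alternative mentioned above might look roughly like
the following. This is only a sketch: the names <tt>PrecTable</tt>,
<tt>SetPrec</tt> and <tt>InstallStandardBinops</tt> are made up for the
example, and the function takes the token as a parameter instead of reading
the global <tt>CurTok</tt> so that it can be tried on its own.</p>

```cpp
// Hypothetical array-based precedence table, indexed by ASCII code.
// An entry of 0 means "not a binary operator".
static int PrecTable[128] = {0};

static void SetPrec(char Op, int Prec) {
  PrecTable[static_cast<unsigned char>(Op)] = Prec;
}

// Same four operators and precedences as the map-based version.
static void InstallStandardBinops() {
  SetPrec('<', 10);
  SetPrec('+', 20);
  SetPrec('-', 20);
  SetPrec('*', 40);  // highest.
}

// Returns the precedence of Tok, or -1 if it is not a declared binop.
// Negative tokens (tok_eof, tok_def, ...) and non-ASCII values are rejected
// before the array is indexed.
static int GetTokPrecedence(int Tok) {
  if (Tok < 0 || Tok >= 128)
    return -1;
  int TokPrec = PrecTable[Tok];
  if (TokPrec <= 0) return -1;
  return TokPrec;
}
```

<p>Compared with the map, the array never grows as a side effect of a lookup,
at the cost of hard-coding the range of valid operator characters.</p>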

<p>With the helper above defined, we can now start parsing binary expressions.
The basic idea of operator precedence parsing is to break down an expression
with potentially ambiguous binary operators into pieces. Consider for example
the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
as a stream of primary expressions separated by binary operators. As such,
it will first parse the leading primary expression "a", then it will see the
pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
are primary expressions, the binary expression parser doesn't need to worry
about nested subexpressions like (c+d) at all.
</p>

<p>
To start, an expression is a primary expression potentially followed by a
sequence of [binop,primaryexpr] pairs:</p>

<div class="doc_code">
<pre>
/// expression
///   ::= primary binoprhs
///
static ExprAST *ParseExpression() {
  ExprAST *LHS = ParsePrimary();
  if (!LHS) return 0;

  return ParseBinOpRHS(0, LHS);
}
</pre>
</div>

<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
us. It takes a precedence and a pointer to an expression for the part parsed
so far. Note that "x" is a perfectly valid expression: as such, "binoprhs" is
allowed to be empty, in which case it returns the expression that is passed into
it. In our example above, the code passes the expression for "a" into
<tt>ParseBinOpRHS</tt> and the current token is "+".</p>

<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
minimal operator precedence</em> that the function is allowed to eat. For
example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
passed in a precedence of 40, it will not consume any tokens (because the
precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
with:</p>

<div class="doc_code">
<pre>
/// binoprhs
///   ::= ('+' primary)*
static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
  // If this is a binop, find its precedence.
  while (1) {
    int TokPrec = GetTokPrecedence();

    // If this is a binop that binds at least as tightly as the current binop,
    // consume it, otherwise we are done.
    if (TokPrec < ExprPrec)
      return LHS;
</pre>
</div>

<p>This code gets the precedence of the current token and checks to see if it is
too low. Because we defined invalid tokens to have a precedence of -1, this
check implicitly knows that the pair-stream ends when the token stream runs out
of binary operators. If the precedence is not too low, we know that the token is a binary
operator and that it will be included in this expression:</p>

<div class="doc_code">
<pre>
    // Okay, we know this is a binop.
    int BinOp = CurTok;
    getNextToken();  // eat binop

    // Parse the primary expression after the binary operator.
    ExprAST *RHS = ParsePrimary();
    if (!RHS) return 0;
</pre>
</div>

<p>As such, this code eats (and remembers) the binary operator and then parses
the following primary expression. This builds up the whole pair, the first of
which is [+, b] for the running example.</p>
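
<p>To make the "leading primary plus a stream of pairs" view concrete, the
sketch below splits an expression string into exactly those pieces. It is not
part of the tutorial's parser: <tt>PairStream</tt>, <tt>ReadPrimary</tt> and
<tt>SplitIntoPairs</tt> are hypothetical helpers that work directly on
characters, assume the input contains no whitespace, and keep a parenthesized
group as raw text instead of parsing it.</p>

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// The leading primary expression plus the [binop, primary] pairs after it.
struct PairStream {
  std::string First;
  std::vector<std::pair<char, std::string>> Pairs;
};

// Reads one primary starting at Pos. Operators nested inside parentheses are
// skipped over, so "(c+d)" comes back as a single primary.
static std::string ReadPrimary(const std::string &S, std::size_t &Pos) {
  std::size_t Start = Pos;
  int Depth = 0;
  while (Pos < S.size()) {
    char C = S[Pos];
    if (C == '(')
      ++Depth;
    else if (C == ')')
      --Depth;
    else if (Depth == 0 && (C == '+' || C == '-' || C == '*' || C == '<'))
      break;  // a top-level binary operator ends the primary
    ++Pos;
  }
  return S.substr(Start, Pos - Start);
}

// Splits S into its leading primary and the following [binop, primary] pairs.
PairStream SplitIntoPairs(const std::string &S) {
  PairStream Result;
  std::size_t Pos = 0;
  Result.First = ReadPrimary(S, Pos);
  while (Pos < S.size()) {
    char Op = S[Pos++];  // eat the binop
    std::string Primary = ReadPrimary(S, Pos);
    Result.Pairs.push_back(std::make_pair(Op, Primary));
  }
  return Result;
}
```

<p>For the running example "a+b+(c+d)*e*f+g" this yields the leading primary
"a" followed by the five pairs listed earlier.</p>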

<p>Now that we have parsed the left-hand side of an expression and one pair of the
RHS sequence, we have to decide which way the expression associates. In
particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
To determine this, we look ahead at "binop" to determine its precedence and
compare it to BinOp's precedence (which is '+' in this case):</p>

<div class="doc_code">
<pre>
    // If BinOp binds less tightly with RHS than the operator after RHS, let
    // the pending operator take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
</pre>
</div>

<p>If the precedence of the binop to the right of "RHS" is lower than or equal to the
precedence of our current operator, then we know that the parentheses associate
as "(a+b) binop ...". In our example, since the next operator is "+" and so is
our current one, we know that they have the same precedence. In this case we'll
create the AST node for "a+b", and then continue parsing:</p>

<div class="doc_code">
<pre>
      ... if body omitted ...
    }

    // Merge LHS/RHS.
    LHS = new BinaryExprAST(BinOp, LHS, RHS);
  }  // loop around to the top of the while loop.
}
</pre>
</div>

<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
iteration of the loop, with "+" as the current token. The code above will eat
and remember it and parse "(c+d)" as the primary expression, which makes the
current pair be [+, (c+d)]. It will then reach the 'if' above with "*" as the
binop to the right of the primary. In this case, the precedence of "*" is
higher than the precedence of "+", so the body of the 'if' will be entered.</p>

<p>The critical question left here is "how can the body of the 'if' parse the right
hand side in full?" In particular, to build the AST correctly for our example,
it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
do this is surprisingly simple (code from the above two blocks duplicated for
context):</p>

<div class="doc_code">
<pre>
    // If BinOp binds less tightly with RHS than the operator after RHS, let
    // the pending operator take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
      if (RHS == 0) return 0;</b>
    }
    // Merge LHS/RHS.
    LHS = new BinaryExprAST(BinOp, LHS, RHS);
  }  // loop around to the top of the while loop.
}
</pre>
</div>

<p>At this point, we know that the binary operator to the RHS of our primary
has higher precedence than the binop we are currently parsing. As such, we know
that any sequence of pairs whose operators are all higher precedence than "+"
should be parsed together and returned as "RHS". To do this, we recursively
invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
precedence required for it to continue. In our example above, this will cause
it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
of the '+' expression.</p>

<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
and added to the AST. With this little bit of code (14 non-trivial lines), we
correctly handle fully general binary expression parsing in a very elegant way.
This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
running through it with a few tough examples to see how it works.
</p>

<p>This wraps up handling of expressions. At this point, we can point the
parser at an arbitrary token stream and build an expression from it, stopping
at the first token that is not part of the expression. Next up we need to
handle function definitions, etc.</p>
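
<p>For readers who want to run through a few tough examples, here is a
compact sketch of the same algorithm that works on a plain string and returns
a fully parenthesized string instead of an AST, so the chosen grouping is easy
to inspect. The <tt>MiniParser</tt> struct and the <tt>Parenthesize</tt>
function are assumptions of this example, not code from the tutorial: tokens
are single characters or alphanumeric names, whitespace is not handled, and an
empty string plays the role of the null pointer on error.</p>

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

struct MiniParser {
  std::string Src;
  std::size_t Pos = 0;
  std::map<char, int> BinopPrecedence{{'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

  explicit MiniParser(const std::string &S) : Src(S) {}

  // Current character, or '\0' at end of input.
  char Cur() const { return Pos < Src.size() ? Src[Pos] : '\0'; }

  int GetTokPrecedence() const {
    std::map<char, int>::const_iterator It = BinopPrecedence.find(Cur());
    return It == BinopPrecedence.end() ? -1 : It->second;
  }

  // primary ::= name | '(' expression ')'.  Returns "" on error.
  std::string ParsePrimary() {
    if (Cur() == '(') {
      ++Pos;  // eat (.
      std::string V = ParseExpression();
      if (V.empty() || Cur() != ')') return "";
      ++Pos;  // eat ).
      return V;  // the parentheses themselves produce no node
    }
    std::string Name;
    while (std::isalnum(static_cast<unsigned char>(Cur())))
      Name += Src[Pos++];
    return Name;
  }

  std::string ParseBinOpRHS(int ExprPrec, std::string LHS) {
    while (true) {
      int TokPrec = GetTokPrecedence();
      if (TokPrec < ExprPrec)
        return LHS;

      char BinOp = Cur();
      ++Pos;  // eat binop
      std::string RHS = ParsePrimary();
      if (RHS.empty()) return "";

      // If the next operator binds tighter, let it take RHS as its LHS.
      int NextPrec = GetTokPrecedence();
      if (TokPrec < NextPrec) {
        RHS = ParseBinOpRHS(TokPrec + 1, RHS);
        if (RHS.empty()) return "";
      }

      // Merge LHS/RHS.
      LHS = "(" + LHS + BinOp + RHS + ")";
    }
  }

  std::string ParseExpression() {
    std::string LHS = ParsePrimary();
    if (LHS.empty()) return "";
    return ParseBinOpRHS(0, LHS);
  }
};

// Returns S with every binary operation explicitly parenthesized, or "" if
// S does not start with a valid expression.
std::string Parenthesize(const std::string &S) {
  return MiniParser(S).ParseExpression();
}
```

<p>Feeding it the running example "a+b+(c+d)*e*f+g" shows the left-to-right
grouping of equal-precedence operators and the tighter binding of "*".</p>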

<!-- *********************************************************************** -->
<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>
The next thing missing is handling of function prototypes. In Kaleidoscope,
these are used both for 'extern' function declarations and for function body
definitions. The code to do this is straightforward and not very interesting
(once you've survived expressions):
</p>

<div class="doc_code">
<pre>
/// prototype
///   ::= id '(' id* ')'
static PrototypeAST *ParsePrototype() {
  if (CurTok != tok_identifier)
    return ErrorP("Expected function name in prototype");

  std::string FnName = IdentifierStr;
  getNextToken();

  if (CurTok != '(')
    return ErrorP("Expected '(' in prototype");

  // Read the list of argument names.
  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return ErrorP("Expected ')' in prototype");

  // success.
  getNextToken();  // eat ')'.

  return new PrototypeAST(FnName, ArgNames);
}
</pre>
</div>

<p>Given this, a function definition is very simple, just a prototype plus
an expression to implement the body:</p>

<div class="doc_code">
<pre>
/// definition ::= 'def' prototype expression
static FunctionAST *ParseDefinition() {
  getNextToken();  // eat def.
  PrototypeAST *Proto = ParsePrototype();
  if (Proto == 0) return 0;

  if (ExprAST *E = ParseExpression())
    return new FunctionAST(Proto, E);
  return 0;
}
</pre>
</div>

<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
well as to support forward declaration of user functions. These 'extern's are just
prototypes with no body:</p>

<div class="doc_code">
<pre>
/// external ::= 'extern' prototype
static PrototypeAST *ParseExtern() {
  getNextToken();  // eat extern.
  return ParsePrototype();
}
</pre>
</div>

<p>Finally, we'll also let the user type in arbitrary top-level expressions and
evaluate them on the fly. We will handle this by defining anonymous nullary
(zero argument) functions for them:</p>

<div class="doc_code">
<pre>
/// toplevelexpr ::= expression
static FunctionAST *ParseTopLevelExpr() {
  if (ExprAST *E = ParseExpression()) {
    // Make an anonymous proto.
    PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    return new FunctionAST(Proto, E);
  }
  return 0;
}
</pre>
</div>

<p>Now that we have all the pieces, let's build a little driver that will let us
actually <em>execute</em> this code we've built!</p>
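
<p>The prototype grammar above, <tt>id '(' id* ')'</tt>, has argument names
separated only by whitespace. The following sketch checks a string against
that shape without using the lexer. Everything in it (<tt>ParsedProto</tt>,
<tt>SkipSpaces</tt>, <tt>ReadIdentifier</tt>, <tt>ParsePrototypeText</tt>) is
a hypothetical helper written for this illustration.</p>

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Result of the sketch: Ok is false if the text is not a valid prototype.
struct ParsedProto {
  bool Ok = false;
  std::string Name;
  std::vector<std::string> Args;
};

static void SkipSpaces(const std::string &S, std::size_t &Pos) {
  while (Pos < S.size() && std::isspace(static_cast<unsigned char>(S[Pos])))
    ++Pos;
}

// Reads an identifier of the form [a-zA-Z][a-zA-Z0-9]*, or "" if none.
static std::string ReadIdentifier(const std::string &S, std::size_t &Pos) {
  SkipSpaces(S, Pos);
  std::string Id;
  if (Pos < S.size() && std::isalpha(static_cast<unsigned char>(S[Pos]))) {
    while (Pos < S.size() && std::isalnum(static_cast<unsigned char>(S[Pos])))
      Id += S[Pos++];
  }
  return Id;
}

// prototype ::= id '(' id* ')'
ParsedProto ParsePrototypeText(const std::string &S) {
  ParsedProto P;
  std::size_t Pos = 0;

  P.Name = ReadIdentifier(S, Pos);
  if (P.Name.empty()) return P;  // expected function name

  SkipSpaces(S, Pos);
  if (Pos >= S.size() || S[Pos] != '(') return P;  // expected '('
  ++Pos;

  // Read the list of argument names.
  for (std::string Arg = ReadIdentifier(S, Pos); !Arg.empty();
       Arg = ReadIdentifier(S, Pos))
    P.Args.push_back(Arg);

  SkipSpaces(S, Pos);
  if (Pos >= S.size() || S[Pos] != ')') return P;  // expected ')'

  P.Ok = true;
  return P;
}
```

<p>As in <tt>ParsePrototype</tt>, a comma between argument names is not part
of the grammar, so "foo(x, y)" is rejected while "foo(x y)" is accepted.</p>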

<!-- *********************************************************************** -->
<div class="doc_section"><a name="driver">The Driver</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>The driver for this simply invokes all of the parsing pieces with a top-level
dispatch loop. There isn't much interesting here, so I'll just include the
top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
Parsing" section.</p>

<div class="doc_code">
<pre>
/// top ::= definition | external | expression | ';'
static void MainLoop() {
  while (1) {
    fprintf(stderr, "ready> ");
    switch (CurTok) {
    case tok_eof:    return;
    case ';':        getNextToken(); break;  // ignore top-level semicolons.
    case tok_def:    HandleDefinition(); break;
    case tok_extern: HandleExtern(); break;
    default:         HandleTopLevelExpression(); break;
    }
  }
}
</pre>
</div>

<p>The most interesting part of this is that we ignore top-level semicolons.
Why is this, you ask? The basic reason is that if you type "4 + 5" at the
command line, the parser doesn't know whether that is the end of what you will
type. For example, on the next line you could type "def foo...", in which case
4+5 is the end of a top-level expression. Alternatively you could type "* 6",
which would continue the expression. Having top-level semicolons allows you to
type "4+5;" and the parser will know you are done.</p>
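
<p>The dispatch decision itself depends only on the current token, which the
following sketch isolates. The <tt>ClassifyTopLevel</tt> function and the
strings it returns are inventions for this example; the enum values are the
ones used by the lexer, with <tt>tok_eof</tt> as -1.</p>

```cpp
#include <string>

// Token values as used by the lexer; plain characters are returned as their
// (non-negative) character codes.
enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5
};

// Hypothetical helper: names the action MainLoop would take for a token.
std::string ClassifyTopLevel(int Tok) {
  switch (Tok) {
  case tok_eof:    return "stop";
  case ';':        return "skip";        // ignore top-level semicolons.
  case tok_def:    return "definition";
  case tok_extern: return "extern";
  default:         return "expression";  // anything else starts an expression
  }
}
```

<p>Because everything that is not <tt>def</tt>, <tt>extern</tt>, a semicolon
or end of file falls into the default case, a stray token such as ')' is
handed to the expression parser, which is where the error is reported.</p>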

<!-- *********************************************************************** -->
<div class="doc_section"><a name="conclusions">Conclusions</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>With just under 400 lines of commented code (240 lines of non-comment,
non-blank code), we fully defined our minimal language, including a lexer,
parser and AST builder. With this done, the executable will validate
Kaleidoscope code and tell us if it is grammatically invalid. For
example, here is a sample interaction:</p>

<div class="doc_code">
<pre>
ready> <b>def foo(x y) x+foo(y, 4.0);</b>
Parsed a function definition.
ready> <b>def foo(x y) x+y y;</b>
Parsed a function definition.
Parsed a top-level expr
ready> <b>def foo(x y) x+y );</b>
Parsed a function definition.
Error: unknown token when expecting an expression
ready> <b>extern sin(a);</b>
ready> Parsed an extern
</pre>
</div>

<p>There is a lot of room for extension here. You can define new AST nodes,
extend the language in many ways, etc. In the <a href="LangImpl3.html">next
installment</a>, we will describe how to generate LLVM IR from the AST.</p>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="code">Full Code Listing</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>
Here is the complete code listing for this and the previous chapter.
Note that it is fully self-contained: you don't need LLVM or any external
libraries at all for this (other than the C and C++ standard libraries of
course). To build this, just compile with:</p>
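
<div class="doc_code">
<pre>
   # Compile (assuming the listing below is saved as toy.cpp)
   g++ -g -O3 toy.cpp
   # Run
   ./a.out
</pre>
</div>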

<p>Here is the code:</p>

<div class="doc_code">
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <map>
#include <vector>

//===----------------------------------------------------------------------===//
// Lexer
//===----------------------------------------------------------------------===//

// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
// of these for known things.
enum Token {
  tok_eof = -1,

  // commands
  tok_def = -2, tok_extern = -3,

  // primary
  tok_identifier = -4, tok_number = -5,
};

static std::string IdentifierStr;  // Filled in if tok_identifier
static double NumVal;              // Filled in if tok_number

/// gettok - Return the next token from standard input.
static int gettok() {
  static int LastChar = ' ';

  // Skip any whitespace.
  while (isspace(LastChar))
    LastChar = getchar();

  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;

    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }

  if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');

    NumVal = strtod(NumStr.c_str(), 0);
    return tok_number;
  }

  if (LastChar == '#') {
    // Comment until end of line.
    do LastChar = getchar();
    while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

    if (LastChar != EOF)
      return gettok();
  }

  // Check for end of file.  Don't eat the EOF.
  if (LastChar == EOF)
    return tok_eof;

  // Otherwise, just return the character as its ascii value.
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}

//===----------------------------------------------------------------------===//
// Abstract Syntax Tree (aka Parse Tree)
//===----------------------------------------------------------------------===//

/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
  virtual ~ExprAST() {}
};

/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
  double Val;
public:
  explicit NumberExprAST(double val) : Val(val) {}
};

/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
  std::string Name;
public:
  explicit VariableExprAST(const std::string &name) : Name(name) {}
};

/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
  char Op;
  ExprAST *LHS, *RHS;
public:
  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    : Op(op), LHS(lhs), RHS(rhs) {}
};

/// CallExprAST - Expression class for function calls.
class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<ExprAST*> Args;
public:
  CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    : Callee(callee), Args(args) {}
};

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
public:
  PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    : Name(name), Args(args) {}
};

/// FunctionAST - This class represents a function definition itself.
class FunctionAST {
  PrototypeAST *Proto;
  ExprAST *Body;
public:
  FunctionAST(PrototypeAST *proto, ExprAST *body)
    : Proto(proto), Body(body) {}
};
946 //===----------------------------------------------------------------------===//
948 //===----------------------------------------------------------------------===//
950 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
951 /// token the parser it looking at. getNextToken reads another token from the
952 /// lexer and updates CurTok with its results.
954 static int getNextToken() {
955 return CurTok = gettok();
958 /// BinopPrecedence - This holds the precedence for each binary operator that is
960 static std::map<char, int> BinopPrecedence;
962 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
963 static int GetTokPrecedence() {
964 if (!isascii(CurTok))
967 // Make sure it's a declared binop.
968 int TokPrec = BinopPrecedence[CurTok];
969 if (TokPrec <= 0) return -1;
973 /// Error* - These are little helper functions for error handling.
974 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str); return 0; }
975 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
976 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
978 static ExprAST *ParseExpression();
982 /// ::= identifier '(' expression* ')'
983 static ExprAST *ParseIdentifierExpr() {
984 std::string IdName = IdentifierStr;
986 getNextToken(); // eat identifier.
988 if (CurTok != '(') // Simple variable ref.
989 return new VariableExprAST(IdName);
992 getNextToken(); // eat (
993 std::vector<ExprAST*> Args;
996 ExprAST *Arg = ParseExpression();
1000 if (CurTok == ')') break;
1003 return Error("Expected ')'");
1011 return new CallExprAST(IdName, Args);
1014 /// numberexpr ::= number
1015 static ExprAST *ParseNumberExpr() {
1016 ExprAST *Result = new NumberExprAST(NumVal);
1017 getNextToken(); // consume the number
1021 /// parenexpr ::= '(' expression ')'
1022 static ExprAST *ParseParenExpr() {
1023 getNextToken(); // eat (.
1024 ExprAST *V = ParseExpression();
1028 return Error("expected ')'");
1029 getNextToken(); // eat ).
1034 /// ::= identifierexpr
1037 static ExprAST *ParsePrimary() {
1039 default: return Error("unknown token when expecting an expression");
1040 case tok_identifier: return ParseIdentifierExpr();
1041 case tok_number: return ParseNumberExpr();
1042 case '(': return ParseParenExpr();
1047 /// ::= ('+' primary)*
1048 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1049 // If this is a binop, find its precedence.
1051 int TokPrec = GetTokPrecedence();
1053 // If this is a binop that binds at least as tightly as the current binop,
1054 // consume it; otherwise we are done.
1055 if (TokPrec < ExprPrec)
1058 // Okay, we know this is a binop.
1060 getNextToken(); // eat binop
1062 // Parse the primary expression after the binary operator.
1063 ExprAST *RHS = ParsePrimary();
1066 // If BinOp binds less tightly with RHS than the operator after RHS, let
1067 // the pending operator take RHS as its LHS.
1068 int NextPrec = GetTokPrecedence();
1069 if (TokPrec < NextPrec) {
1070 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1071 if (RHS == 0) return 0;
1075 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1080 /// ::= primary binoprhs
1082 static ExprAST *ParseExpression() {
1083 ExprAST *LHS = ParsePrimary();
1086 return ParseBinOpRHS(0, LHS);
1090 /// ::= id '(' id* ')'
1091 static PrototypeAST *ParsePrototype() {
1092 if (CurTok != tok_identifier)
1093 return ErrorP("Expected function name in prototype");
1095 std::string FnName = IdentifierStr;
1099 return ErrorP("Expected '(' in prototype");
1101 std::vector<std::string> ArgNames;
1102 while (getNextToken() == tok_identifier)
1103 ArgNames.push_back(IdentifierStr);
1105 return ErrorP("Expected ')' in prototype");
1108 getNextToken(); // eat ')'.
1110 return new PrototypeAST(FnName, ArgNames);
1113 /// definition ::= 'def' prototype expression
1114 static FunctionAST *ParseDefinition() {
1115 getNextToken(); // eat def.
1116 PrototypeAST *Proto = ParsePrototype();
1117 if (Proto == 0) return 0;
1119 if (ExprAST *E = ParseExpression())
1120 return new FunctionAST(Proto, E);
1124 /// toplevelexpr ::= expression
1125 static FunctionAST *ParseTopLevelExpr() {
1126 if (ExprAST *E = ParseExpression()) {
1127 // Make an anonymous proto.
1128 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
1129 return new FunctionAST(Proto, E);
1134 /// external ::= 'extern' prototype
1135 static PrototypeAST *ParseExtern() {
1136 getNextToken(); // eat extern.
1137 return ParsePrototype();
1140 //===----------------------------------------------------------------------===//
1141 // Top-Level parsing
1142 //===----------------------------------------------------------------------===//
1144 static void HandleDefinition() {
1145 if (FunctionAST *F = ParseDefinition()) {
1146 fprintf(stderr, "Parsed a function definition.\n");
1148 // Skip token for error recovery.
1153 static void HandleExtern() {
1154 if (PrototypeAST *P = ParseExtern()) {
1155 fprintf(stderr, "Parsed an extern\n");
1157 // Skip token for error recovery.
1162 static void HandleTopLevelExpression() {
1163 // Evaluate a top level expression into an anonymous function.
1164 if (FunctionAST *F = ParseTopLevelExpr()) {
1165 fprintf(stderr, "Parsed a top-level expr\n");
1167 // Skip token for error recovery.
1172 /// top ::= definition | external | expression | ';'
1173 static void MainLoop() {
1175 fprintf(stderr, "ready> ");
1177 case tok_eof: return;
1178 case ';': getNextToken(); break; // ignore top level semicolons.
1179 case tok_def: HandleDefinition(); break;
1180 case tok_extern: HandleExtern(); break;
1181 default: HandleTopLevelExpression(); break;
1186 //===----------------------------------------------------------------------===//
1187 // Main driver code.
1188 //===----------------------------------------------------------------------===//
1191 // Install standard binary operators.
1192 // 1 is lowest precedence.
1193 BinopPrecedence['<'] = 10;
1194 BinopPrecedence['+'] = 20;
1195 BinopPrecedence['-'] = 20;
1196 BinopPrecedence['*'] = 40; // highest.
1198 // Prime the first token.
1199 fprintf(stderr, "ready> ");
1209 <!-- *********************************************************************** -->
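<p>To see how the precedence table installed by the driver shapes the tree that
<tt>ParseBinOpRHS</tt> builds, here is a minimal stand-alone sketch of the same
loop. It is not part of the tutorial code: the <tt>demo</tt> namespace, the
<tt>Cursor</tt> struct, and the <tt>parenthesize</tt> function are hypothetical
names invented for illustration. It assumes that every operand is a single
letter or digit and that the input contains no whitespace, and it returns a
fully parenthesized string instead of allocating AST nodes.</p>

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-alone demo; none of these names appear in the tutorial.
namespace demo {

// Same precedence values the driver installs in BinopPrecedence.
static const std::map<char, int> Prec = {
    {'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

// A trivial "lexer": every character of Src is one token.
struct Cursor {
  std::string Src;
  std::string::size_type Pos;
  char peek() const { return Pos < Src.size() ? Src[Pos] : '\0'; }
};

// Precedence of the pending operator, or -1 if it is not a known binop.
static int tokPrec(const Cursor &C) {
  std::map<char, int>::const_iterator It = Prec.find(C.peek());
  return It == Prec.end() ? -1 : It->second;
}

// primary ::= a single letter or digit (an assumption of this sketch).
static std::string parsePrimary(Cursor &C) {
  assert(std::isalnum(static_cast<unsigned char>(C.peek())) &&
         "expected an operand");
  return std::string(1, C.Src[C.Pos++]);
}

// binoprhs ::= (op primary)*
static std::string parseBinOpRHS(Cursor &C, int ExprPrec, std::string LHS) {
  while (true) {
    int TokPrec = tokPrec(C);
    // Stop when the pending token binds less tightly than required.
    if (TokPrec < ExprPrec)
      return LHS;

    char Op = C.Src[C.Pos++];          // eat the operator
    std::string RHS = parsePrimary(C); // operand after the operator

    // If the next operator binds more tightly, it takes RHS as its LHS.
    if (TokPrec < tokPrec(C))
      RHS = parseBinOpRHS(C, TokPrec + 1, RHS);

    // Merge LHS/RHS, as BinaryExprAST does in the real parser.
    LHS = "(" + LHS + Op + RHS + ")";
  }
}

// expression ::= primary binoprhs
std::string parenthesize(const std::string &Src) {
  Cursor C = {Src, 0};
  std::string LHS = parsePrimary(C);
  return parseBinOpRHS(C, 0, LHS);
}

} // namespace demo
```

<p>Under these assumptions, <tt>demo::parenthesize("a+b*c")</tt> yields
<tt>(a+(b*c))</tt> because <tt>'*'</tt> has higher precedence than
<tt>'+'</tt>, while <tt>demo::parenthesize("a-b-c")</tt> yields
<tt>((a-b)-c)</tt> because operators of equal precedence group to the left.</p>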
1217 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1218 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1219 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $