1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
19 <li><a href="#intro">Chapter 2 Introduction</a></li>
20 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
21 <li><a href="#parserbasics">Parser Basics</a></li>
22 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
23 <li><a href="#parserbinops">Binary Expression Parsing</a></li>
24 <li><a href="#parsertop">Parsing the Rest</a></li>
25 <li><a href="#driver">The Driver</a></li>
26 <li><a href="#conclusions">Conclusions</a></li>
27 <li><a href="#code">Full Code Listing</a></li>
32 <div class="doc_author">
33 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
36 <!-- *********************************************************************** -->
37 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
38 <!-- *********************************************************************** -->
40 <div class="doc_text">
42 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
43 with LLVM</a>" tutorial. This chapter shows you how to use the <a
44 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
45 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
46 our Kaleidoscope language and build an <a
47 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
50 <p>The parser we will build uses a combination of <a
51 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
52 Parsing</a> and <a href=
53 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
54 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
55 and the former for everything else). Before we get to parsing though, lets talk
56 about the output of the parser: the Abstract Syntax Tree.</p>
60 <!-- *********************************************************************** -->
61 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
62 <!-- *********************************************************************** -->
64 <div class="doc_text">
66 <p>The AST for a program captures its behavior in a way that it is easy for
67 later stages of the compiler (e.g. code generation) to interpret. We basically
68 want one object for each construct in the language, and the AST should closely
69 model the language. In Kaleidoscope, we have expressions, a prototype, and a
70 function object. We'll start with expressions first:</p>
72 <div class="doc_code">
74 /// ExprAST - Base class for all expression nodes.
80 /// NumberExprAST - Expression class for numeric literals like "1.0".
81 class NumberExprAST : public ExprAST {
84 explicit NumberExprAST(double val) : Val(val) {}
89 <p>The code above shows the definition of the base ExprAST class and one
90 subclass which we use for numeric literals. The important thing about this is
91 that the NumberExprAST class captures the numeric value of the literal in the
92 class, so that later phases of the compiler can know what it is.</p>
94 <p>Right now we only create the AST, so there are no useful accessor methods on
95 them. It would be very easy to add a virtual method to pretty print the code,
96 for example. Here are the other expression AST node definitions that we'll use
97 in the basic form of the Kaleidoscope language.
100 <div class="doc_code">
102 /// VariableExprAST - Expression class for referencing a variable, like "a".
103 class VariableExprAST : public ExprAST {
106 explicit VariableExprAST(const std::string &name) : Name(name) {}
109 /// BinaryExprAST - Expression class for a binary operator.
110 class BinaryExprAST : public ExprAST {
114 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
115 : Op(op), LHS(lhs), RHS(rhs) {}
118 /// CallExprAST - Expression class for function calls.
119 class CallExprAST : public ExprAST {
121 std::vector<ExprAST*> Args;
123 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
124 : Callee(callee), Args(args) {}
129 <p>This is all (intentially) rather straight-forward: variables capture the
130 variable name, binary operators capture their opcode (e.g. '+'), and calls
131 capture a function name and list of argument expressions. One thing that is
132 nice about our AST is that it captures the language features without talking
133 about the syntax of the language. Note that there is no discussion about
134 precedence of binary operators, lexical structure etc.</p>
136 <p>For our basic language, these are all of the expression nodes we'll define.
137 Because it doesn't have conditional control flow, it isn't Turing-complete;
138 we'll fix that in a later installment. The two things we need next are a way
139 to talk about the interface to a function, and a way to talk about functions
142 <div class="doc_code">
144 /// PrototypeAST - This class represents the "prototype" for a function,
145 /// which captures its argument names as well as if it is an operator.
148 std::vector<std::string> Args;
150 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
151 : Name(name), Args(args) {}
154 /// FunctionAST - This class represents a function definition itself.
159 FunctionAST(PrototypeAST *proto, ExprAST *body)
160 : Proto(proto), Body(body) {}
165 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
166 Since all values are double precision floating point, this fact doesn't need to
167 be captured anywhere. In a more aggressive and realistic language, the
168 "ExprAST" class would probably have a type field.</p>
170 <p>With this scaffolding, we can now talk about parsing expressions and function
171 bodies in Kaleidoscope.</p>
175 <!-- *********************************************************************** -->
176 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
177 <!-- *********************************************************************** -->
179 <div class="doc_text">
181 <p>Now that we have an AST to build, we need to define the parser code to build
182 it. The idea here is that we want to parse something like "x+y" (which is
183 returned as three tokens by the lexer) into an AST that could be generated with
186 <div class="doc_code">
188 ExprAST *X = new VariableExprAST("x");
189 ExprAST *Y = new VariableExprAST("y");
190 ExprAST *Result = new BinaryExprAST('+', X, Y);
194 <p>In order to do this, we'll start by defining some basic helper routines:</p>
196 <div class="doc_code">
198 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
199 /// token the parser it looking at. getNextToken reads another token from the
200 /// lexer and updates CurTok with its results.
202 static int getNextToken() {
203 return CurTok = gettok();
209 This implements a simple token buffer around the lexer. This allows
210 us to look one token ahead at what the lexer is returning. Every function in
211 our parser will assume that CurTok is the current token that needs to be
214 <p>Again, we define these with global variables; it would be better design to
215 wrap the entire parser in a class and use instance variables for these.
218 <div class="doc_code">
221 /// Error* - These are little helper functions for error handling.
222 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
223 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
224 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
229 The <tt>Error</tt> routines are simple helper routines that our parser will use
230 to handle errors. The error recovery in our parser will not be the best and
231 is not particular user-friendly, but it will be enough for our tutorial. These
232 routines make it easier to handle errors in routines that have various return
233 types: they always return null.</p>
235 <p>With these basic helper functions implemented, we can implement the first
236 piece of our grammar: we'll start with numeric literals.</p>
240 <!-- *********************************************************************** -->
241 <div class="doc_section"><a name="parserprimexprs">Basic Expression
243 <!-- *********************************************************************** -->
245 <div class="doc_text">
247 <p>We start with numeric literals, because they are the simplest to process.
248 For each production in our grammar, we'll define a function which parses that
249 production. For numeric literals, we have:
252 <div class="doc_code">
254 /// numberexpr ::= number
255 static ExprAST *ParseNumberExpr() {
256 ExprAST *Result = new NumberExprAST(NumVal);
257 getNextToken(); // consume the number
263 <p>This routine is very simple: it expects to be called when the current token
264 is a <tt>tok_number</tt> token. It takes the current number value, creates
265 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
268 <p>There are some interesting aspects of this. The most important one is that
269 this routine eats all of the tokens that correspond to the production, and
270 returns the lexer buffer with the next token (which is not part of the grammar
271 production) ready to go. This is a fairly standard way to go for recursive
272 descent parsers. For a better example, the parenthesis operator is defined like
275 <div class="doc_code">
277 /// parenexpr ::= '(' expression ')'
278 static ExprAST *ParseParenExpr() {
279 getNextToken(); // eat (.
280 ExprAST *V = ParseExpression();
284 return Error("expected ')'");
285 getNextToken(); // eat ).
291 <p>This function illustrates a number of interesting things about the parser:
292 1) it shows how we use the Error routines. When called, this function expects
293 that the current token is a '(' token, but after parsing the subexpression, it
294 is possible that there is not a ')' waiting. For example, if the user types in
295 "(4 x" instead of "(4)", the parser should emit an error. Because errors can
296 occur, the parser needs a way to indicate that they happened: in our parser, we
297 return null on an error.</p>
299 <p>Another interesting aspect of this function is that it uses recursion by
300 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
301 <tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
302 recursive grammars, and keeps each production very simple. Note that
303 parentheses do not cause construction of AST nodes themselves. While we could
304 do this, the most important role of parens are to guide the parser and provide
305 grouping. Once the parser constructs the AST, parens are not needed.</p>
307 <p>The next simple production is for handling variable references and function
310 <div class="doc_code">
314 /// ::= identifier '(' expression* ')'
315 static ExprAST *ParseIdentifierExpr() {
316 std::string IdName = IdentifierStr;
318 getNextToken(); // eat identifier.
320 if (CurTok != '(') // Simple variable ref.
321 return new VariableExprAST(IdName);
324 getNextToken(); // eat (
325 std::vector<ExprAST*> Args;
327 ExprAST *Arg = ParseExpression();
331 if (CurTok == ')') break;
334 return Error("Expected ')'");
341 return new CallExprAST(IdName, Args);
346 <p>This routine follows the same style as the other routines (it expects to be
347 called if the current token is a <tt>tok_identifier</tt> token). It also has
348 recursion and error handling. One interesting aspect of this is that it uses
349 <em>look-ahead</em> to determine if the current identifier is a stand alone
350 variable reference or if it is a function call expression. It handles this by
351 checking to see if the token after the identifier is a '(' token, and constructs
352 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
355 <p>Now that we have all of our simple expression parsing logic in place, we can
356 define a helper function to wrap them up in a class. We call this class of
357 expressions "primary" expressions, for reasons that will become more clear
358 later. In order to parse a primary expression, we need to determine what sort
359 of expression it is:</p>
361 <div class="doc_code">
364 /// ::= identifierexpr
367 static ExprAST *ParsePrimary() {
369 default: return Error("unknown token when expecting an expression");
370 case tok_identifier: return ParseIdentifierExpr();
371 case tok_number: return ParseNumberExpr();
372 case '(': return ParseParenExpr();
378 <p>Now that you see the definition of this function, it makes it more obvious
379 why we can assume the state of CurTok in the various functions. This uses
380 look-ahead to determine which sort of expression is being inspected, and parses
381 it with a function call.</p>
383 <p>Now that basic expressions are handled, we need to handle binary expressions,
384 which are a bit more complex.</p>
388 <!-- *********************************************************************** -->
389 <div class="doc_section"><a name="parserbinops">Binary Expression
391 <!-- *********************************************************************** -->
393 <div class="doc_text">
395 <p>Binary expressions are significantly harder to parse because they are often
396 ambiguous. For example, when given the string "x+y*z", the parser can choose
397 to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
398 mathematics, we expect the later parse, because "*" (multiplication) has
399 higher <em>precedence</em> than "+" (addition).</p>
401 <p>There are many ways to handle this, but an elegant and efficient way is to
403 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
404 Parsing</a>. This parsing technique uses the precedence of binary operators to
405 guide recursion. To start with, we need a table of precedences:</p>
407 <div class="doc_code">
409 /// BinopPrecedence - This holds the precedence for each binary operator that is
411 static std::map<char, int> BinopPrecedence;
413 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
414 static int GetTokPrecedence() {
415 if (!isascii(CurTok))
418 // Make sure it's a declared binop.
419 int TokPrec = BinopPrecedence[CurTok];
420 if (TokPrec <= 0) return -1;
425 // Install standard binary operators.
426 // 1 is lowest precedence.
427 BinopPrecedence['<'] = 10;
428 BinopPrecedence['+'] = 20;
429 BinopPrecedence['-'] = 20;
430 BinopPrecedence['*'] = 40; // highest.
436 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
437 (this can obviously be extended by you, the reader). The
438 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
439 or -1 if the token is not a binary operator. Having a map makes it easy to add
440 new operators and makes it clear that the algorithm doesn't depend on the
441 specific operators involved, but it would be easy enough to eliminate the map
442 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
444 <p>With the helper above defined, we can now start parsing binary expressions.
445 The basic idea of operator precedence parsing is to break down an expression
446 with potentially ambiguous binary operators into pieces. Consider for example
447 the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
448 as a stream of primary expressions separated by binary operators. As such,
449 it will first parse the leading primary expression "a", then it will see the
450 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
451 are primary expressions, the binary expression parser doesn't need to worry
452 about nested subexpressions like (c+d) at all.
456 To start, an expression is a primary expression potentially followed by a
457 sequence of [binop,primaryexpr] pairs:</p>
459 <div class="doc_code">
462 /// ::= primary binoprhs
464 static ExprAST *ParseExpression() {
465 ExprAST *LHS = ParsePrimary();
468 return ParseBinOpRHS(0, LHS);
473 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
474 us. It takes a precedence and a pointer to an expression for the part parsed
475 so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
476 allowed to be empty, in which case it returns the expression that is passed into
477 it. In our example above, the code passes the expression for "a" into
478 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
480 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
481 minimal operator precedence</em> that the function is allowed to eat. For
482 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
483 passed in a precedence of 40, it will not consume any tokens (because the
484 precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
487 <div class="doc_code">
490 /// ::= ('+' primary)*
491 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
492 // If this is a binop, find its precedence.
494 int TokPrec = GetTokPrecedence();
496 // If this is a binop that binds at least as tightly as the current binop,
497 // consume it, otherwise we are done.
498 if (TokPrec < ExprPrec)
503 <p>This code gets the precedence of the current token and checks to see if if is
504 too low. Because we defined invalid tokens to have a precedence of -1, this
505 check implicitly knows that the pair-stream ends when the token stream runs out
506 of binary operators. If this check succeeds, we know that the token is a binary
507 operator and that it will be included in this expression:</p>
509 <div class="doc_code">
511 // Okay, we know this is a binop.
513 getNextToken(); // eat binop
515 // Parse the primary expression after the binary operator.
516 ExprAST *RHS = ParsePrimary();
521 <p>As such, this code eats (and remembers) the binary operator and then parses
522 the following primary expression. This builds up the whole pair, the first of
523 which is [+, b] for the running example.</p>
525 <p>Now that we parsed the left-hand side of an expression and one pair of the
526 RHS sequence, we have to decide which way the expression associates. In
527 particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
528 To determine this, we look ahead at "binop" to determine its precedence and
529 compare it to BinOp's precedence (which is '+' in this case):</p>
531 <div class="doc_code">
533 // If BinOp binds less tightly with RHS than the operator after RHS, let
534 // the pending operator take RHS as its LHS.
535 int NextPrec = GetTokPrecedence();
536 if (TokPrec < NextPrec) {
540 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
541 precedence of our current operator, then we know that the parentheses associate
542 as "(a+b) binop ...". In our example, since the next operator is "+" and so is
543 our current one, we know that they have the same precedence. In this case we'll
544 create the AST node for "a+b", and then continue parsing:</p>
546 <div class="doc_code">
548 ... if body omitted ...
552 LHS = new BinaryExprAST(BinOp, LHS, RHS);
553 } // loop around to the top of the while loop.
558 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
559 iteration of the loop, with "+" as the current token. The code above will eat
560 and remember it and parse "(c+d)" as the primary expression, which makes the
561 current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
562 binop to the right of the primary. In this case, the precedence of "*" is
563 higher than the precedence of "+" so the if condition will be entered.</p>
565 <p>The critical question left here is "how can the if condition parse the right
566 hand side in full"? In particular, to build the AST correctly for our example,
567 it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
568 do this is surprisingly simple (code from the above two blocks duplicated for
571 <div class="doc_code">
573 // If BinOp binds less tightly with RHS than the operator after RHS, let
574 // the pending operator take RHS as its LHS.
575 int NextPrec = GetTokPrecedence();
576 if (TokPrec < NextPrec) {
577 RHS = ParseBinOpRHS(TokPrec+1, RHS);
578 if (RHS == 0) return 0;
581 LHS = new BinaryExprAST(BinOp, LHS, RHS);
582 } // loop around to the top of the while loop.
587 <p>At this point, we know that the binary operator to the RHS of our primary
588 has higher precedence than the binop we are currently parsing. As such, we know
589 that any sequence of pairs whose operators are all higher precedence than "+"
590 should be parsed together and returned as "RHS". To do this, we recursively
591 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
592 precedence required for it to continue. In our example above, this will cause
593 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
594 of the '+' expression.</p>
596 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
597 and added to the AST. With this little bit of code (14 non-trivial lines), we
598 correctly handle fully general binary expression parsing in a very elegant way.
601 <p>This wraps up handling of expressions. At this point, we can point the
602 parser at an arbitrary token stream and build an expression from them, stopping
603 at the first token that is not part of the expression. Next up we need to
604 handle function definitions etc.</p>
608 <!-- *********************************************************************** -->
609 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
610 <!-- *********************************************************************** -->
612 <div class="doc_text">
615 The first basic thing missing is that of function prototypes. In Kaleidoscope,
616 these are used both for 'extern' function declarations as well as function body
617 definitions. The code to do this is straight-forward and not very interesting
618 (once you've survived expressions):
621 <div class="doc_code">
624 /// ::= id '(' id* ')'
625 static PrototypeAST *ParsePrototype() {
626 if (CurTok != tok_identifier)
627 return ErrorP("Expected function name in prototype");
629 std::string FnName = IdentifierStr;
633 return ErrorP("Expected '(' in prototype");
635 std::vector<std::string> ArgNames;
636 while (getNextToken() == tok_identifier)
637 ArgNames.push_back(IdentifierStr);
639 return ErrorP("Expected ')' in prototype");
642 getNextToken(); // eat ')'.
644 return new PrototypeAST(FnName, ArgNames);
649 <p>Given this, a function definition is very simple, just a prototype plus
650 an expression to implement the body:</p>
652 <div class="doc_code">
654 /// definition ::= 'def' prototype expression
655 static FunctionAST *ParseDefinition() {
656 getNextToken(); // eat def.
657 PrototypeAST *Proto = ParsePrototype();
658 if (Proto == 0) return 0;
660 if (ExprAST *E = ParseExpression())
661 return new FunctionAST(Proto, E);
667 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
668 well as to support forward declaration of user functions. 'externs' are just
669 prototypes with no body:</p>
671 <div class="doc_code">
673 /// external ::= 'extern' prototype
674 static PrototypeAST *ParseExtern() {
675 getNextToken(); // eat extern.
676 return ParsePrototype();
681 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
682 evaluate them on the fly. We will handle this by defining anonymous nullary
683 (zero argument) functions for them:</p>
685 <div class="doc_code">
687 /// toplevelexpr ::= expression
688 static FunctionAST *ParseTopLevelExpr() {
689 if (ExprAST *E = ParseExpression()) {
690 // Make an anonymous proto.
691 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
692 return new FunctionAST(Proto, E);
699 <p>Now that we have all the pieces, lets build a little driver that will let us
700 actually <em>execute</em> this code we've built!</p>
704 <!-- *********************************************************************** -->
705 <div class="doc_section"><a name="driver">The Driver</a></div>
706 <!-- *********************************************************************** -->
708 <div class="doc_text">
710 <p>The driver for this simply invokes all of the parsing pieces with a top-level
711 dispatch loop. There isn't much interesting here, so I'll just include the
712 top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
713 Parsing" section.</p>
715 <div class="doc_code">
717 /// top ::= definition | external | expression | ';'
718 static void MainLoop() {
720 fprintf(stderr, "ready> ");
722 case tok_eof: return;
723 case ';': getNextToken(); break; // ignore top level semicolons.
724 case tok_def: HandleDefinition(); break;
725 case tok_extern: HandleExtern(); break;
726 default: HandleTopLevelExpression(); break;
733 <p>The most interesting part of this is that we ignore top-level semi colons.
734 Why is this, you ask? The basic reason is that if you type "4 + 5" at the
735 command line, the parser doesn't know that that is the end of what you will
736 type. For example, on the next line you could type "def foo..." in which case
737 4+5 is the end of a top-level expression. Alternatively you could type "* 6",
738 which would continue the expression. Having top-level semicolons allows you to
739 type "4+5;" and the parser will know you are done.</p>
743 <!-- *********************************************************************** -->
744 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
745 <!-- *********************************************************************** -->
747 <div class="doc_text">
749 <p>With just under 400 lines of commented code, we fully defined our minimal
750 language, including a lexer, parser and AST builder. With this done, the
751 executable will validate code and tell us if it is gramatically invalid. For
752 example, here is a sample interaction:</p>
754 <div class="doc_code">
757 ready> def foo(x y) x+foo(y, 4.0);
758 ready> Parsed a function definition.
759 ready> def foo(x y) x+y y;
760 ready> Parsed a function definition.
761 ready> Parsed a top-level expr
762 ready> def foo(x y) x+y );
763 ready> Parsed a function definition.
764 ready> Error: unknown token when expecting an expression
765 ready> extern sin(a);
766 ready> Parsed an extern
772 <p>There is a lot of room for extension here. You can define new AST nodes,
773 extend the language in many ways, etc. In the <a href="LangImpl3.html">next
774 installment</a>, we will describe how to generate LLVM IR from the AST.</p>
778 <!-- *********************************************************************** -->
779 <div class="doc_section"><a name="code">Full Code Listing</a></div>
780 <!-- *********************************************************************** -->
782 <div class="doc_text">
785 Here is the complete code listing for this and the previous chapter.
786 Note that it is fully self-contained: you don't need LLVM or any external
787 libraries at all for this (other than the C and C++ standard libraries of
788 course). To build this, just compile with:</p>
790 <div class="doc_code">
799 <p>Here is the code:</p>
801 <div class="doc_code">
803 #include <cstdio>
804 #include <string>
806 #include <vector>
808 //===----------------------------------------------------------------------===//
810 //===----------------------------------------------------------------------===//
812 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
813 // of these for known things.
818 tok_def = -2, tok_extern = -3,
821 tok_identifier = -4, tok_number = -5,
824 static std::string IdentifierStr; // Filled in if tok_identifier
825 static double NumVal; // Filled in if tok_number
827 /// gettok - Return the next token from standard input.
828 static int gettok() {
829 static int LastChar = ' ';
831 // Skip any whitespace.
832 while (isspace(LastChar))
833 LastChar = getchar();
835 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
836 IdentifierStr = LastChar;
837 while (isalnum((LastChar = getchar())))
838 IdentifierStr += LastChar;
840 if (IdentifierStr == "def") return tok_def;
841 if (IdentifierStr == "extern") return tok_extern;
842 return tok_identifier;
845 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
849 LastChar = getchar();
850 } while (isdigit(LastChar) || LastChar == '.');
852 NumVal = strtod(NumStr.c_str(), 0);
856 if (LastChar == '#') {
857 // Comment until end of line.
858 do LastChar = getchar();
859 while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
865 // Check for end of file. Don't eat the EOF.
869 // Otherwise, just return the character as its ascii value.
870 int ThisChar = LastChar;
871 LastChar = getchar();
875 //===----------------------------------------------------------------------===//
876 // Abstract Syntax Tree (aka Parse Tree)
877 //===----------------------------------------------------------------------===//
879 /// ExprAST - Base class for all expression nodes.
882 virtual ~ExprAST() {}
885 /// NumberExprAST - Expression class for numeric literals like "1.0".
886 class NumberExprAST : public ExprAST {
889 explicit NumberExprAST(double val) : Val(val) {}
892 /// VariableExprAST - Expression class for referencing a variable, like "a".
893 class VariableExprAST : public ExprAST {
896 explicit VariableExprAST(const std::string &name) : Name(name) {}
899 /// BinaryExprAST - Expression class for a binary operator.
900 class BinaryExprAST : public ExprAST {
904 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
905 : Op(op), LHS(lhs), RHS(rhs) {}
908 /// CallExprAST - Expression class for function calls.
909 class CallExprAST : public ExprAST {
911 std::vector<ExprAST*> Args;
913 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
914 : Callee(callee), Args(args) {}
917 /// PrototypeAST - This class represents the "prototype" for a function,
918 /// which captures its argument names as well as if it is an operator.
921 std::vector< Args;
923 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
924 : Name(name), Args(args) {}
928 /// FunctionAST - This class represents a function definition itself.
933 FunctionAST(PrototypeAST *proto, ExprAST *body)
934 : Proto(proto), Body(body) {}
938 //===----------------------------------------------------------------------===//
940 //===----------------------------------------------------------------------===//
942 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
943 /// token the parser it looking at. getNextToken reads another token from the
944 /// lexer and updates CurTok with its results.
946 static int getNextToken() {
947 return CurTok = gettok();
950 /// BinopPrecedence - This holds the precedence for each binary operator that is
952 static std::map<char, int> BinopPrecedence;
954 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
955 static int GetTokPrecedence() {
956 if (!isascii(CurTok))
959 // Make sure it's a declared binop.
960 int TokPrec = BinopPrecedence[CurTok];
961 if (TokPrec <= 0) return -1;
965 /// Error* - These are little helper functions for error handling.
966 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
967 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
968 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
970 static ExprAST *ParseExpression();
974 /// ::= identifier '(' expression* ')'
975 static ExprAST *ParseIdentifierExpr() {
976 std::string IdName = IdentifierStr;
978 getNextToken(); // eat identifier.
980 if (CurTok != '(') // Simple variable ref.
981 return new VariableExprAST(IdName);
984 getNextToken(); // eat (
985 std::vector<ExprAST*> Args;
987 ExprAST *Arg = ParseExpression();
991 if (CurTok == ')') break;
994 return Error("Expected ')'");
1001 return new CallExprAST(IdName, Args);
1004 /// numberexpr ::= number
1005 static ExprAST *ParseNumberExpr() {
1006 ExprAST *Result = new NumberExprAST(NumVal);
1007 getNextToken(); // consume the number
1011 /// parenexpr ::= '(' expression ')'
1012 static ExprAST *ParseParenExpr() {
1013 getNextToken(); // eat (.
1014 ExprAST *V = ParseExpression();
1018 return Error("expected ')'");
1019 getNextToken(); // eat ).
1024 /// ::= identifierexpr
1027 static ExprAST *ParsePrimary() {
1029 default: return Error("unknown token when expecting an expression");
1030 case tok_identifier: return ParseIdentifierExpr();
1031 case tok_number: return ParseNumberExpr();
1032 case '(': return ParseParenExpr();
1037 /// ::= ('+' primary)*
1038 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1039 // If this is a binop, find its precedence.
1041 int TokPrec = GetTokPrecedence();
1043 // If this is a binop that binds at least as tightly as the current binop,
1044 // consume it, otherwise we are done.
1045 if (TokPrec < ExprPrec)
1048 // Okay, we know this is a binop.
1050 getNextToken(); // eat binop
1052 // Parse the primary expression after the binary operator.
1053 ExprAST *RHS = ParsePrimary();
1056 // If BinOp binds less tightly with RHS than the operator after RHS, let
1057 // the pending operator take RHS as its LHS.
1058 int NextPrec = GetTokPrecedence();
1059 if (TokPrec < NextPrec) {
1060 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1061 if (RHS == 0) return 0;
1065 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1070 /// ::= primary binoprhs
1072 static ExprAST *ParseExpression() {
1073 ExprAST *LHS = ParsePrimary();
1076 return ParseBinOpRHS(0, LHS);
1080 /// ::= id '(' id* ')'
1081 static PrototypeAST *ParsePrototype() {
1082 if (CurTok != tok_identifier)
1083 return ErrorP("Expected function name in prototype");
1085 std::string FnName = IdentifierStr;
1089 return ErrorP("Expected '(' in prototype");
1091 std::vector<std::string> ArgNames;
1092 while (getNextToken() == tok_identifier)
1093 ArgNames.push_back(IdentifierStr);
1095 return ErrorP("Expected ')' in prototype");
1098 getNextToken(); // eat ')'.
1100 return new PrototypeAST(FnName, ArgNames);
1103 /// definition ::= 'def' prototype expression
1104 static FunctionAST *ParseDefinition() {
1105 getNextToken(); // eat def.
1106 PrototypeAST *Proto = ParsePrototype();
1107 if (Proto == 0) return 0;
1109 if (ExprAST *E = ParseExpression())
1110 return new FunctionAST(Proto, E);
1114 /// toplevelexpr ::= expression
1115 static FunctionAST *ParseTopLevelExpr() {
1116 if (ExprAST *E = ParseExpression()) {
1117 // Make an anonymous proto.
1118 PrototypeAST *Proto = new PrototypeAST("", std::vector<());
1119 return new FunctionAST(Proto, E);
1124 /// external ::= 'extern' prototype
1125 static PrototypeAST *ParseExtern() {
1126 getNextToken(); // eat extern.
1127 return ParsePrototype();
1130 //===----------------------------------------------------------------------===//
1131 // Top-Level parsing
1132 //===----------------------------------------------------------------------===//
1134 static void HandleDefinition() {
1135 if (FunctionAST *F = ParseDefinition()) {
1136 fprintf(stderr, "Parsed a function definition.\n");
1138 // Skip token for error recovery.
1143 static void HandleExtern() {
1144 if (PrototypeAST *P = ParseExtern()) {
1145 fprintf(stderr, "Parsed an extern\n");
1147 // Skip token for error recovery.
1152 static void HandleTopLevelExpression() {
1153 // Evaluate a top level expression into an anonymous function.
1154 if (FunctionAST *F = ParseTopLevelExpr()) {
1155 fprintf(stderr, "Parsed a top-level expr\n");
1157 // Skip token for error recovery.
1162 /// top ::= definition | external | expression | ';'
1163 static void MainLoop() {
1165 fprintf(stderr, "ready> ");
1167 case tok_eof: return;
1168 case ';': getNextToken(); break; // ignore top level semicolons.
1169 case tok_def: HandleDefinition(); break;
1170 case tok_extern: HandleExtern(); break;
1171 default: HandleTopLevelExpression(); break;
1176 //===----------------------------------------------------------------------===//
1177 // Main driver code.
1178 //===----------------------------------------------------------------------===//
1181 // Install standard binary operators.
1182 // 1 is lowest precedence.
1183 BinopPrecedence['<'] = 10;
1184 BinopPrecedence['+'] = 20;
1185 BinopPrecedence['-'] = 20;
1186 BinopPrecedence['*'] = 40; // highest.
1188 // Prime the first token.
1189 fprintf(stderr, "ready> ");
1199 <!-- *********************************************************************** -->
1202 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1203 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1204 <a href="http://validator.w3.org/check/referer"><img
1205 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1207 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1208 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1209 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $