1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
16 <div class="doc_author">
17 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
20 <!-- *********************************************************************** -->
21 <div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
22 <!-- *********************************************************************** -->
24 <div class="doc_text">
26 <p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
27 LLVM</a>" tutorial. This chapter shows you how to use the <a
28 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
29 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
30 our Kaleidoscope language and build an <a
31 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
34 <p>The parser we will build uses a combination of <a
35 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
36 Parsing</a> and <a href=
37 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
38 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
39 and the former for everything else). Before we get to parsing though, lets talk
40 about the output of the parser: the Abstract Syntax Tree.</p>
44 <!-- *********************************************************************** -->
45 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
46 <!-- *********************************************************************** -->
48 <div class="doc_text">
50 <p>The AST for a program captures its behavior in a way that it is easy for
51 later stages of the compiler (e.g. code generation) to interpret. We basically
52 want one object for each construct in the language, and the AST should closely
53 model the language. In Kaleidoscope, we have expressions, a prototype, and a
54 function object. We'll start with expressions first:</p>
56 <div class="doc_code">
58 /// ExprAST - Base class for all expression nodes.
64 /// NumberExprAST - Expression class for numeric literals like "1.0".
65 class NumberExprAST : public ExprAST {
68 NumberExprAST(double val) : Val(val) {}
73 <p>The code above shows the definition of the base ExprAST class and one
74 subclass which we use for numeric literals. The important thing about this is
75 that the NumberExprAST class captures the numeric value of the literal in the
76 class, so that later phases of the compiler can know what it is.</p>
78 <p>Right now we only create the AST, so there are no useful accessor methods on
79 them. It would be very easy to add a virtual method to pretty print the code,
80 for example. Here are the other expression AST node definitions that we'll use
81 in the basic form of the Kaleidoscope language.
84 <div class="doc_code">
86 /// VariableExprAST - Expression class for referencing a variable, like "a".
87 class VariableExprAST : public ExprAST {
90 VariableExprAST(const std::string &name) : Name(name) {}
93 /// BinaryExprAST - Expression class for a binary operator.
94 class BinaryExprAST : public ExprAST {
98 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
99 : Op(op), LHS(lhs), RHS(rhs) {}
102 /// CallExprAST - Expression class for function calls.
103 class CallExprAST : public ExprAST {
105 std::vector<ExprAST*> Args;
107 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
108 : Callee(callee), Args(args) {}
113 <p>This is all (intentially) rather straight-forward: variables capture the
114 variable name, binary operators capture their opcode (e.g. '+'), and calls
115 capture a function name and list of argument expressions. One thing that is
116 nice about our AST is that it captures the language features without talking
117 about the syntax of the language. Note that there is no discussion about
118 precedence of binary operators, lexical structure etc.</p>
120 <p>For our basic language, these are all of the expression nodes we'll define.
121 Because it doesn't have conditional control flow, it isn't Turing-complete;
122 we'll fix that in a later installment. The two things we need next are a way
123 to talk about the interface to a function, and a way to talk about functions
126 <div class="doc_code">
128 /// PrototypeAST - This class represents the "prototype" for a function,
129 /// which captures its argument names as well as if it is an operator.
132 std::vector<std::string> Args;
134 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
135 : Name(name), Args(args) {}
138 /// FunctionAST - This class represents a function definition itself.
143 FunctionAST(PrototypeAST *proto, ExprAST *body)
144 : Proto(proto), Body(body) {}
149 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
150 Since all values are double precision floating point, this fact doesn't need to
151 be captured anywhere. In a more aggressive and realistic language, the
152 "ExprAST" class would probably have a type field.</p>
154 <p>With this scaffolding, we can now talk about parsing expressions and function
155 bodies in Kaleidoscope.</p>
159 <!-- *********************************************************************** -->
160 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
161 <!-- *********************************************************************** -->
163 <div class="doc_text">
165 <p>Now that we have an AST to build, we need to define the parser code to build
166 it. The idea here is that we want to parse something like "x+y" (which is
167 returned as three tokens by the lexer) into an AST that could be generated with
170 <div class="doc_code">
172 ExprAST *X = new VariableExprAST("x");
173 ExprAST *Y = new VariableExprAST("y");
174 ExprAST *Result = new BinaryExprAST('+', X, Y);
178 <p>In order to do this, we'll start by defining some basic helper routines:</p>
180 <div class="doc_code">
182 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
183 /// token the parser it looking at. getNextToken reads another token from the
184 /// lexer and updates CurTok with its results.
186 static int getNextToken() {
187 return CurTok = gettok();
193 This implements a simple token buffer around the lexer. This allows
194 us to look one token ahead at what the lexer is returning. Every function in
195 our lexer will assume that CurTok is the current token that needs to be
198 <p>Again, we define these with global variables; it would be better design to wrap the entire parser in a class and use instance variables for these.
201 <div class="doc_code">
204 /// Error* - These are little helper functions for error handling.
205 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
206 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
207 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
212 The <tt>Error</tt> routines are simple helper routines that our parser will use
213 to handle errors. The error recovery in our parser will not be the best and
214 are not particular user-friendly, but it will be enough for our tutorial. These
215 routines make it easier to handle errors in routines that have various return
216 types: they always return null.</p>
218 <p>With these basic helper functions implemented, we can implement the first
219 piece of our grammar: we'll start with numeric literals.</p>
223 <!-- *********************************************************************** -->
224 <div class="doc_section"><a name="parserprimexprs">Basic Expression
226 <!-- *********************************************************************** -->
228 <div class="doc_text">
230 <p>We start with numeric literals, because they are the simplest to process.
231 For each production in our grammar, we'll define a function which parses that
232 production. For numeric literals, we have:
235 <div class="doc_code">
237 /// numberexpr ::= number
238 static ExprAST *ParseNumberExpr() {
239 ExprAST *Result = new NumberExprAST(NumVal);
240 getNextToken(); // consume the number
246 <p>This routine is very simple: it expects to be called when the current token
247 is a <tt>tok_number</tt> token. It takes the current number value, creates
248 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
251 <p>There are some interesting aspects of this. The most important one is that
252 this routine eats all of the tokens that correspond to the production, and
253 returns the lexer buffer with the next token (which is not part of the grammar
254 production) ready to go. This is a fairly standard way to go for recursive
255 descent parsers. For a better example, the parenthesis operator is defined like
258 <div class="doc_code">
260 /// parenexpr ::= '(' expression ')'
261 static ExprAST *ParseParenExpr() {
262 getNextToken(); // eat (.
263 ExprAST *V = ParseExpression();
267 return Error("expected ')'");
268 getNextToken(); // eat ).
274 <p>This function illustrates a number of interesting things about the parser:
275 1) it shows how we use the Error routines. When called, this function expects
276 that the current token is a '(' token, but after parsing the subexpression, it
277 is possible that there is not a ')' waiting. For example, if the user types in
278 "(4 x" instead of "(4)", the parser should emit an error. Because errors can
279 occur, the parser needs a way to indicate that they happened: in our parser, we
280 return null on an error.</p>
282 <p>Another interesting aspect of this function is that it uses recursion by
283 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
284 <tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
285 recursive grammars, and keeps each production very simple. Note that
286 parenthesis do not cause construction of AST nodes themselves. While we could
287 do this, the most important role of parens are to guide the parser and provide
288 grouping. Once the parser constructs the AST, parens are not needed.</p>
290 <p>The next simple production is for handling variable references and function
293 <div class="doc_code">
297 /// ::= identifer '(' expression* ')'
298 static ExprAST *ParseIdentifierExpr() {
299 std::string IdName = IdentifierStr;
301 getNextToken(); // eat identifer.
303 if (CurTok != '(') // Simple variable ref.
304 return new VariableExprAST(IdName);
307 getNextToken(); // eat (
308 std::vector<ExprAST*> Args;
310 ExprAST *Arg = ParseExpression();
314 if (CurTok == ')') break;
317 return Error("Expected ')'");
324 return new CallExprAST(IdName, Args);
329 <p>This routine follows the same style as the other routines (it expects to be
330 called if the current token is a <tt>tok_identifier</tt> token). It also has
331 recursion and error handling. One interesting aspect of this is that it uses
332 <em>look-ahead</em> to determine if the current identifier is a stand alone
333 variable reference or if it is a function call expression. It handles this by
334 checking to see if the token after the identifier is a '(' token, and constructs
335 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
338 <p>Now that we have all of our simple expression parsing logic in place, we can
339 define a helper function to wrap them up in a class. We call this class of
340 expressions "primary" expressions, for reasons that will become more clear
341 later. In order to parse a primary expression, we need to determine what sort
342 of expression it is:</p>
344 <div class="doc_code">
347 /// ::= identifierexpr
350 static ExprAST *ParsePrimary() {
352 default: return Error("unknown token when expecting an expression");
353 case tok_identifier: return ParseIdentifierExpr();
354 case tok_number: return ParseNumberExpr();
355 case '(': return ParseParenExpr();
361 <p>Now that you see the definition of this function, it makes it more obvious
362 why we can assume the state of CurTok in the various functions. This uses
363 look-ahead to determine which sort of expression is being inspected, and parses
364 it with a function call.</p>
366 <p>Now that basic expressions are handled, we need to handle binary expressions,
367 which are a bit more complex.</p>
371 <!-- *********************************************************************** -->
372 <div class="doc_section"><a name="parserbinops">Binary Expression
374 <!-- *********************************************************************** -->
376 <div class="doc_text">
378 <p>Binary expressions are significantly harder to parse because they are often
379 ambiguous. For example, when given the string "x+y*z", the parser can choose
380 to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
381 mathematics, we expect the later parse, because "*" (multiplication) has
382 higher <em>precedence</em> than "+" (addition).</p>
384 <p>There are many ways to handle this, but an elegant and efficient way is to
386 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
387 Parsing</a>. This parsing technique uses the precedence of binary operators to
388 guide recursion. To start with, we need a table of precedences:</p>
390 <div class="doc_code">
392 /// BinopPrecedence - This holds the precedence for each binary operator that is
394 static std::map<char, int> BinopPrecedence;
396 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
397 static int GetTokPrecedence() {
398 if (!isascii(CurTok))
401 // Make sure it's a declared binop.
402 int TokPrec = BinopPrecedence[CurTok];
403 if (TokPrec <= 0) return -1;
408 // Install standard binary operators.
409 // 1 is lowest precedence.
410 BinopPrecedence['<'] = 10;
411 BinopPrecedence['+'] = 20;
412 BinopPrecedence['-'] = 20;
413 BinopPrecedence['*'] = 40; // highest.
419 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
420 (this can obviously be extended by you, the reader). The
421 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
422 or -1 if the token is not a binary operator. Having a map makes it easy to add
423 new operators and makes it clear that the algorithm doesn't depend on the
424 specific operators involved, but it would be easy enough to eliminate the map
425 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
427 <p>With the helper above defined, we can now start parsing binary expressions.
428 The basic idea of operator precedence parsing is to break down an expression
429 with potentially ambiguous binary operators into pieces. Consider for example
430 the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
431 as a stream of primary expressions separated by binary operators. As such,
432 it will first parse the leading primary expression "a", then it will see the
433 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
434 are primary expressions that the binary expression parser doesn't need to worry
435 about nested subexpressions like (c+d) at all.
439 To start, an expression is a primary expression potentially followed by a
440 sequence of [binop,primaryexpr] pairs:</p>
442 <div class="doc_code">
445 /// ::= primary binoprhs
447 static ExprAST *ParseExpression() {
448 ExprAST *LHS = ParsePrimary();
451 return ParseBinOpRHS(0, LHS);
456 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
457 us. It takes a precedence and a pointer to an expression for the part parsed
458 so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
459 allowed to be empty, in which case it returns the expression that is passed into
460 it. In our example above, the code passes the expression for "a" into
461 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
463 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
464 minimal operator precedence</em> that the function is allowed to eat. For
465 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
466 passed in a precedence of 40, it will not consume any tokens (because the
467 precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
470 <div class="doc_code">
473 /// ::= ('+' primary)*
474 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
475 // If this is a binop, find its precedence.
477 int TokPrec = GetTokPrecedence();
479 // If this is a binop that binds at least as tightly as the current binop,
480 // consume it, otherwise we are done.
481 if (TokPrec < ExprPrec)
486 <p>This code gets the precedence of the current token and checks to see if if is
487 too low. Because we defined invalid tokens to have a precedence of -1, this
488 check implicitly knows that the pair-stream ends when the token stream runs out
489 of binary operators. If this check succeeds, we know that the token is a binary
490 operator and that it will be included in this expression:</p>
492 <div class="doc_code">
494 // Okay, we know this is a binop.
496 getNextToken(); // eat binop
498 // Parse the primary expression after the binary operator.
499 ExprAST *RHS = ParsePrimary();
504 <p>As such, this code eats (and remembers) the binary operator and then parses
505 the following primary expression. This builds up the whole pair, the first of
506 which is [+, b] for the running example.</p>
508 <p>Now that we parsed the left-hand side of an expression and one pair of the
509 RHS sequence, we have to decide which way the expression associates. In
510 particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
511 To determine this, we look ahead at "binop" to determine its precedence and
512 compare it to BinOp's precedence (which is '+' in this case):</p>
514 <div class="doc_code">
516 // If BinOp binds less tightly with RHS than the operator after RHS, let
517 // the pending operator take RHS as its LHS.
518 int NextPrec = GetTokPrecedence();
519 if (TokPrec < NextPrec) {
523 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
524 precedence of our current operator, then we know that the parentheses associate
525 as "(a+b) binop ...". In our example, since the next operator is "+" and so is
526 our current one, we know that they have the same precedence. In this case we'll
527 create the AST node for "a+b", and then continue parsing:</p>
529 <div class="doc_code">
531 ... if body omitted ...
535 LHS = new BinaryExprAST(BinOp, LHS, RHS);
536 } // loop around to the top of the while loop.
541 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
542 iteration of the loop, with "+" as the current token. The code above will eat
543 and remember it and parse "(c+d)" as the primary expression, which makes the
544 current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
545 binop to the right of the primary. In this case, the precedence of "*" is
546 higher than the precedence of "+" so the if condition will be entered.</p>
548 <p>The critical question left here is "how can the if condition parse the right
549 hand side in full"? In particular, to build the AST correctly for our example,
550 it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
551 do this is surprisingly simple (code from the above two blocks duplicated for
554 <div class="doc_code">
556 // If BinOp binds less tightly with RHS than the operator after RHS, let
557 // the pending operator take RHS as its LHS.
558 int NextPrec = GetTokPrecedence();
559 if (TokPrec < NextPrec) {
560 RHS = ParseBinOpRHS(TokPrec+1, RHS);
561 if (RHS == 0) return 0;
564 LHS = new BinaryExprAST(BinOp, LHS, RHS);
565 } // loop around to the top of the while loop.
570 <p>At this point, we know that the binary operator to the RHS of our primary
571 has higher precedence than the binop we are currently parsing. As such, we know
572 that any sequence of pairs whose operators are all higher precedence than "+"
573 should be parsed together and returned as "RHS". To do this, we recursively
574 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
575 precedence required for it to continue. In our example above, this will cause
576 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
577 of the '+' expression.</p>
579 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
580 and added to the AST. With this little bit of code (14 non-trivial lines), we
581 correctly handle fully general binary expression parsing in a very elegant way.
584 <p>This wraps up handling of expressions. At this point, we can point the
585 parser at an arbitrary token stream and build an expression from them, stopping
586 at the first token that is not part of the expression. Next up we need to
587 handle function definitions etc.</p>
591 <!-- *********************************************************************** -->
592 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
593 <!-- *********************************************************************** -->
595 <div class="doc_text">
598 The first basic thing missing is that of function prototypes. In Kaleidoscope,
599 these are used both for 'extern' function declarations as well as function body
600 definitions. The code to do this is straight-forward and not very interesting
601 (once you've survived expressions):
604 <div class="doc_code">
607 /// ::= id '(' id* ')'
608 static PrototypeAST *ParsePrototype() {
609 if (CurTok != tok_identifier)
610 return ErrorP("Expected function name in prototype");
612 std::string FnName = IdentifierStr;
616 return ErrorP("Expected '(' in prototype");
618 std::vector<std::string> ArgNames;
619 while (getNextToken() == tok_identifier)
620 ArgNames.push_back(IdentifierStr);
622 return ErrorP("Expected ')' in prototype");
625 getNextToken(); // eat ')'.
627 return new PrototypeAST(FnName, ArgNames);
632 <p>Given this, a function definition is very simple, just a prototype plus
633 and expression to implement the body:</p>
635 <div class="doc_code">
637 /// definition ::= 'def' prototype expression
638 static FunctionAST *ParseDefinition() {
639 getNextToken(); // eat def.
640 PrototypeAST *Proto = ParsePrototype();
641 if (Proto == 0) return 0;
643 if (ExprAST *E = ParseExpression())
644 return new FunctionAST(Proto, E);
650 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
651 well as to support forward declaration of user functions. 'externs' are just
652 prototypes with no body:</p>
654 <div class="doc_code">
656 /// external ::= 'extern' prototype
657 static PrototypeAST *ParseExtern() {
658 getNextToken(); // eat extern.
659 return ParsePrototype();
664 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
665 evaluate them on the fly. We will handle this by defining anonymous nullary
666 (zero argument) functions for them:</p>
668 <div class="doc_code">
670 /// toplevelexpr ::= expression
671 static FunctionAST *ParseTopLevelExpr() {
672 if (ExprAST *E = ParseExpression()) {
673 // Make an anonymous proto.
674 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
675 return new FunctionAST(Proto, E);
682 <p>Now that we have all the pieces, lets build a little driver that will let us
683 actually <em>execute</em> this code we've built!</p>
687 <!-- *********************************************************************** -->
688 <div class="doc_section"><a name="driver">The Driver</a></div>
689 <!-- *********************************************************************** -->
691 <div class="doc_text">
693 <p>The driver for this simply invokes all of the parsing pieces with a top-level
694 dispatch loop. There isn't much interesting here, so I'll just include the
695 top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
696 Parsing" section.</p>
698 <div class="doc_code">
700 /// top ::= definition | external | expression | ';'
701 static void MainLoop() {
703 fprintf(stderr, "ready> ");
705 case tok_eof: return;
706 case ';': getNextToken(); break; // ignore top level semicolons.
707 case tok_def: HandleDefinition(); break;
708 case tok_extern: HandleExtern(); break;
709 default: HandleTopLevelExpression(); break;
716 <p>The most interesting part of this is that we ignore top-level semi colons.
717 Why is this, you ask? The basic reason is that if you type "4 + 5" at the
718 command line, the parser doesn't know that that is the end of what you will
719 type. For example, on the next line you could type "def foo..." in which case
720 4+5 is the end of a top-level expression. Alternatively you could type "* 6",
721 which would continue the expression. Having top-level semicolons allows you to
722 type "4+5;" and the parser will know you are done.</p>
726 <!-- *********************************************************************** -->
727 <div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
728 <!-- *********************************************************************** -->
730 <div class="doc_text">
732 <p>With just under 400 lines of commented code, we fully defined our minimal
733 language, including a lexer, parser and AST builder. With this done, the
734 executable will validate code and tell us if it is gramatically invalid. For
735 example, here is a sample interaction:</p>
737 <div class="doc_code">
740 ready> def foo(x y) x+foo(y, 4.0);
741 ready> Parsed an function definition.
742 ready> def foo(x y) x+y y;
743 ready> Parsed an function definition.
744 ready> Parsed a top-level expr
745 ready> def foo(x y) x+y );
746 ready> Parsed an function definition.
747 ready> Error: unknown token when expecting an expression
748 ready> extern sin(a);
749 ready> Parsed an extern
756 Here is the full code. Note that it is fully self-contained: you don't even
757 need LLVM for this. In the <a href="LangImpl3.html">next installment</a>, we
758 will describe how to generate LLVM IR from the AST.</p>
760 <div class="doc_code">
766 #include <cstdio>
767 #include <string>
769 #include <vector>
771 //===----------------------------------------------------------------------===//
773 //===----------------------------------------------------------------------===//
775 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
776 // of these for known things.
781 tok_def = -2, tok_extern = -3,
784 tok_identifier = -4, tok_number = -5,
787 static std::string IdentifierStr; // Filled in if tok_identifier
788 static double NumVal; // Filled in if tok_number
790 /// gettok - Return the next token from standard input.
791 static int gettok() {
792 static int LastChar = ' ';
794 // Skip any whitespace.
795 while (isspace(LastChar))
796 LastChar = getchar();
798 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
799 IdentifierStr = LastChar;
800 while (isalnum((LastChar = getchar())))
801 IdentifierStr += LastChar;
803 if (IdentifierStr == "def") return tok_def;
804 if (IdentifierStr == "extern") return tok_extern;
805 return tok_identifier;
808 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
812 LastChar = getchar();
813 } while (isdigit(LastChar) || LastChar == '.');
815 NumVal = strtod(NumStr.c_str(), 0);
819 if (LastChar == '#') {
820 // Comment until end of line.
821 do LastChar = getchar();
822 while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
828 // Check for end of file. Don't eat the EOF.
832 // Otherwise, just return the character as its ascii value.
833 int ThisChar = LastChar;
834 LastChar = getchar();
838 //===----------------------------------------------------------------------===//
839 // Abstract Syntax Tree (aka Parse Tree)
840 //===----------------------------------------------------------------------===//
842 /// ExprAST - Base class for all expression nodes.
845 virtual ~ExprAST() {}
848 /// NumberExprAST - Expression class for numeric literals like "1.0".
849 class NumberExprAST : public ExprAST {
852 NumberExprAST(double val) : Val(val) {}
855 /// VariableExprAST - Expression class for referencing a variable, like "a".
856 class VariableExprAST : public ExprAST {
859 VariableExprAST(const std::string &name) : Name(name) {}
862 /// BinaryExprAST - Expression class for a binary operator.
863 class BinaryExprAST : public ExprAST {
867 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
868 : Op(op), LHS(lhs), RHS(rhs) {}
871 /// CallExprAST - Expression class for function calls.
872 class CallExprAST : public ExprAST {
874 std::vector<ExprAST*> Args;
876 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
877 : Callee(callee), Args(args) {}
880 /// PrototypeAST - This class represents the "prototype" for a function,
881 /// which captures its argument names as well as if it is an operator.
884 std::vector< Args;
886 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
887 : Name(name), Args(args) {}
891 /// FunctionAST - This class represents a function definition itself.
896 FunctionAST(PrototypeAST *proto, ExprAST *body)
897 : Proto(proto), Body(body) {}
901 //===----------------------------------------------------------------------===//
903 //===----------------------------------------------------------------------===//
905 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
906 /// token the parser it looking at. getNextToken reads another token from the
907 /// lexer and updates CurTok with its results.
909 static int getNextToken() {
910 return CurTok = gettok();
913 /// BinopPrecedence - This holds the precedence for each binary operator that is
915 static std::map<char, int> BinopPrecedence;
917 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
918 static int GetTokPrecedence() {
919 if (!isascii(CurTok))
922 // Make sure it's a declared binop.
923 int TokPrec = BinopPrecedence[CurTok];
924 if (TokPrec <= 0) return -1;
928 /// Error* - These are little helper functions for error handling.
929 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
930 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
931 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
933 static ExprAST *ParseExpression();
937 /// ::= identifer '(' expression* ')'
938 static ExprAST *ParseIdentifierExpr() {
939 std::string IdName = IdentifierStr;
941 getNextToken(); // eat identifer.
943 if (CurTok != '(') // Simple variable ref.
944 return new VariableExprAST(IdName);
947 getNextToken(); // eat (
948 std::vector<ExprAST*> Args;
950 ExprAST *Arg = ParseExpression();
954 if (CurTok == ')') break;
957 return Error("Expected ')'");
964 return new CallExprAST(IdName, Args);
967 /// numberexpr ::= number
968 static ExprAST *ParseNumberExpr() {
969 ExprAST *Result = new NumberExprAST(NumVal);
970 getNextToken(); // consume the number
974 /// parenexpr ::= '(' expression ')'
975 static ExprAST *ParseParenExpr() {
976 getNextToken(); // eat (.
977 ExprAST *V = ParseExpression();
981 return Error("expected ')'");
982 getNextToken(); // eat ).
987 /// ::= identifierexpr
990 static ExprAST *ParsePrimary() {
992 default: return Error("unknown token when expecting an expression");
993 case tok_identifier: return ParseIdentifierExpr();
994 case tok_number: return ParseNumberExpr();
995 case '(': return ParseParenExpr();
1000 /// ::= ('+' primary)*
1001 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1002 // If this is a binop, find its precedence.
1004 int TokPrec = GetTokPrecedence();
1006 // If this is a binop that binds at least as tightly as the current binop,
1007 // consume it, otherwise we are done.
1008 if (TokPrec < ExprPrec)
1011 // Okay, we know this is a binop.
1013 getNextToken(); // eat binop
1015 // Parse the primary expression after the binary operator.
1016 ExprAST *RHS = ParsePrimary();
1019 // If BinOp binds less tightly with RHS than the operator after RHS, let
1020 // the pending operator take RHS as its LHS.
1021 int NextPrec = GetTokPrecedence();
1022 if (TokPrec < NextPrec) {
1023 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1024 if (RHS == 0) return 0;
1028 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1033 /// ::= primary binoprhs
1035 static ExprAST *ParseExpression() {
1036 ExprAST *LHS = ParsePrimary();
1039 return ParseBinOpRHS(0, LHS);
1043 /// ::= id '(' id* ')'
1044 static PrototypeAST *ParsePrototype() {
1045 if (CurTok != tok_identifier)
1046 return ErrorP("Expected function name in prototype");
1048 std::string FnName = IdentifierStr;
1052 return ErrorP("Expected '(' in prototype");
1054 std::vector<std::string> ArgNames;
1055 while (getNextToken() == tok_identifier)
1056 ArgNames.push_back(IdentifierStr);
1058 return ErrorP("Expected ')' in prototype");
1061 getNextToken(); // eat ')'.
1063 return new PrototypeAST(FnName, ArgNames);
1066 /// definition ::= 'def' prototype expression
1067 static FunctionAST *ParseDefinition() {
1068 getNextToken(); // eat def.
1069 PrototypeAST *Proto = ParsePrototype();
1070 if (Proto == 0) return 0;
1072 if (ExprAST *E = ParseExpression())
1073 return new FunctionAST(Proto, E);
1077 /// toplevelexpr ::= expression
1078 static FunctionAST *ParseTopLevelExpr() {
1079 if (ExprAST *E = ParseExpression()) {
1080 // Make an anonymous proto.
1081 PrototypeAST *Proto = new PrototypeAST("", std::vector<());
1082 return new FunctionAST(Proto, E);
1087 /// external ::= 'extern' prototype
1088 static PrototypeAST *ParseExtern() {
1089 getNextToken(); // eat extern.
1090 return ParsePrototype();
1093 //===----------------------------------------------------------------------===//
1094 // Top-Level parsing
1095 //===----------------------------------------------------------------------===//
1097 static void HandleDefinition() {
1098 if (FunctionAST *F = ParseDefinition()) {
1099 fprintf(stderr, "Parsed a function definition.\n");
1101 // Skip token for error recovery.
1106 static void HandleExtern() {
1107 if (PrototypeAST *P = ParseExtern()) {
1108 fprintf(stderr, "Parsed an extern\n");
1110 // Skip token for error recovery.
1115 static void HandleTopLevelExpression() {
1116 // Evaluate a top level expression into an anonymous function.
1117 if (FunctionAST *F = ParseTopLevelExpr()) {
1118 fprintf(stderr, "Parsed a top-level expr\n");
1120 // Skip token for error recovery.
1125 /// top ::= definition | external | expression | ';'
1126 static void MainLoop() {
1128 fprintf(stderr, "ready> ");
1130 case tok_eof: return;
1131 case ';': getNextToken(); break; // ignore top level semicolons.
1132 case tok_def: HandleDefinition(); break;
1133 case tok_extern: HandleExtern(); break;
1134 default: HandleTopLevelExpression(); break;
1139 //===----------------------------------------------------------------------===//
1140 // Main driver code.
1141 //===----------------------------------------------------------------------===//
1144 // Install standard binary operators.
1145 // 1 is lowest precedence.
1146 BinopPrecedence['<'] = 10;
1147 BinopPrecedence['+'] = 20;
1148 BinopPrecedence['-'] = 20;
1149 BinopPrecedence['*'] = 40; // highest.
1151 // Prime the first token.
1152 fprintf(stderr, "ready> ");
1162 <!-- *********************************************************************** -->
1165 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1166 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1167 <a href="http://validator.w3.org/check/referer"><img
1168 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1170 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1171 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1172 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $