Re-entrant flex/bison parser
to make Abstract Syntax Tree

Download source files - 13 Kb

Introduction

To do parsing in C++, the most elegant and modern tool is Spirit. (It's part of Boost, which is the second standard library after STL). But Spirit's abstract syntax trees require at least Visual Studio .NET 2003. Anyway, flex and bison are more widespread. So this article gives an example of their use.

This document assumes you know what an abstract syntax tree is, and you know how to write grammars in flex/bison. It's goal is, starting from this base, to show how to achieve ASTs, re-entrancy and C++.

An Abstract Syntax Tree (AST) is an in-memory tree structure that corresponds to the grammatical structure of the file that has just been parsed. By building the entire AST first, it's easier to compute from it and manipulate it and match up symbols. (The alternative is to do computation and manipulation 'on-the-fly' while parsing happens, which is how most flex/bison examples work.)

Re-entrant, in the context of flex/bison, means that global variables aren't used. Hence, you can have one thread parsing one document while another thread parses another one. It also means that the parser takes input from a local structure (rather than stdin) and outputs into a local structure (rather than stdout).

Modern C++ is the goal of this article: to give an example that uses the STL and modern C++ idioms. I'm not an expert, and if anyone can advise me how to make the C++ more elegant, then I'd love to hear it. Note: I deliberately avoided "turning everything into an object". I think that parsing is inherently functional, and should be coded as functions. Use of C++ was therefore limited to just the STL vector class.

Installation

Old versions of flex/bison (pre-2002) don't support re-entrancy. And new versions aren't available "native" for Windows. Instead, you must install Cygwin, making sure to install flex, bison and M4. I invoke the Cygwin utilities via custom build steps, as part of the Visual Studio build process. For reference, the actual command lines are these:

\cygwin\bin\bison.exe --defines --verbose --debug gram.ypp
\cygwin\bin\bash -c "PATH="/bin" export PATH;
                     flex -ogram.l.cpp gram.l"

Flex generates code that tries to #include the <unistd.h> library, whatever that is. It's not part of Visual Studio. So I add my own directory 'inc' to the list of includes (under Project > Settings > C++ > include directories), and put an empty file 'unistd.h' in that directory. In this way we avoid compile-time errors.

The files in the accompanying source code are as follows.

re-ast.cpp   --   the "main" file
gram.h       --   defines the abstract syntax tree
gram.ypp     --   the parser source file
  gram.tab.cpp    (autogenerated by bison)
  gram.tab.h      (autogenerated by bison)
gram.l       --   the lexer source file
  gram.l.cpp      (autogenerated by flex)

Overview

We take these grammar rules as an example:

stmt : LET ID = exp ;    // assignment
     | id ( exp ) ;      // function call
     | { stmt }          // nesting

exp  : ID                // variable name
     | exp + exp         // addition
     | INT               // literal integer

// The following text will be parsed
LET x = y + 3 ;

When the lexer recognises an 'ID' or other token, or when the parser recognises a complete rule, they both return two things, value and type:

The value of the entity that has been recognised.
For ID, the lexer will return the text string "x".
For INT, the lexer will return the integer 3.
For (exp+exp), the parser will return an AST 'addition' node.
For (LET x=y+3;), the parser will return an AST 'assignment' node.
The type of the thing that has been recognised.
This will be ID, INT, exp or stmt.
Token types are always integers, and are auto-generated by bison, one per grammar entity.

The distinction between value and type is important, and they will be used repeatedly in this article.

(We talk about returning AST nodes, but they are not anything built in. It's the programmer's job to define appropriate structs/classes for every kind of AST node in his/her grammar. Incidentally, the name 'ID' is already used internally, so it really has to be called IDENTIFIER to avoid clashes.)

In a beautiful parser like JavaCC or Spirit, each grammar rule would return an value of a different type, be it AstAssign or AstAddition or AstId. But in bison, every rule must return exactly the same type. C people typically use a union{} for the value, and they union over all Ast node types. But C++ classes can't go in unions. So we use a different solution.

We use a single C++ type AstNode to cover all possible grammar-types in the abstract syntax tree:

struct AstNode
{ enum AstType {atLet, atInvoke, atId, atInt, atAdd};
  AstType type;
  string s; int i;
  int child, child2;
};

vector<AstNode> nodes; int root;

// The fields s,i,child,child2 contain the parameters
// for all types:
//
// atLet:    "LET [child] = [child2] ;"
// atInvoke: "[child] ( [child2] ) ;"
// atId:     "s"
// atInt:    "i"
// atAdd:    "[child] + [child2]"

We use a flat vector of nodes, indexed by integer. One particular node is the root of the AST. Where it needs to refer to child nodes, it stores the integer index of that child node. By using integer indexing, we're able to use the nice 'vector' container from STL. No pointers.

Here are some examples.

When the lexer recognises the identifier "x",
it creates a new AstNode(atId, s="x") in the vector nodes[].
For its return value, it returns the index of this node.
For its return type, it returns the typecode ID.
When the parser recognises the rule "LET x = exp ;",
it has been given the value $2 = index of AstNode(atId),
and the value $4 = index of AstNode(atExp).
It creates a new AstNode(atAssign, child=$2, child2=$4).
For its return value, it returns the index of this new node.
For its return type, it return the typecode 'stmt'.
When the lexer recognises the keyword "LET",
it doesn't need any AstNode.
Instead it merely returns the typecode LET.
When the parser recognises the rule "{ stmt }",
it doesn't need any new AstNodes.
It can just return the value $2, ie. the same AstNode index as for the stmt inside. That's because braces are used purely to disambiguate the grammar, not as real nodes in the abstract syntax tree.

In summary, the solution is: (1) When the lexer or parser recognises something, it creates an AstNode for it, and sticks this AstNode into the vector nodes[]. (2) The return value is the index of the newly created node. (3) The return type is the typecode, ID/INT/stmt/exp. (Actually, bison returns typecodes automatically, but flex requires the programmer to supply them manually.)

Header file, gram.h

struct AstNode
{ ...
};

struct ErrorInfo {string msg; int line;};

// parse_x: this parses the specified buffer.
// It appends all the generated AstNodes into 'nodes'.
// It returns the index of the root of the generated AST.
// But if there was a parse error, it returns -1,
// and fills out 'einfo'.
//
int parse_x(const char *buf, vector<AstNode> &nodes,
            ErrorInfo *einfo);

The function parse_x is implemented in the bison file gram.ypp, discussed below. You'd invoke it like this:

vector<AstNode> nodes;
int root = parse_x("LET x = 5;",nodes);

Re-entrancy involves some additional definitions in the header file gram.h. This definitions are shared between gram.l (the lexer source file) and gram.ypp (the parser source file), but by no one else.

Conceptually, our 'state' consists of the input buffer and of what has been generated so far for the AST vector. And this state must be passed as an argument to the parser (which will use it to append more AST nodes) and to the lexer (which will also use it to fetch more input). Actually, the lexer likes to have its own state, void *yyscanner, and allows the user to get/set a single pointer within this state, yyset_extra(yyscanner,dat) or yyget_extra(yyscanner).

Therefore, we define a 'state' structure that we will parse to the parser. And within this structure, we have a pointer 'yyscanner' to the lexer's state. And within the lexer's state, there is a pointer back to the whole structure.

struct TParseParm
{ TParseParm(const char *buf_, vector<AstNode> &nodes_)
     : nodes(nodes_), buf(buf_), pos(0), length(strlen(buf)) {}
  // a dummy assignment operator to stop the compiler whingeing.
  operator=(const TParseParm &) {exit(0);}
  ErrorInfo *einfo;        // in case there was an error
  //
  void *yyscanner;         // the state of the lexer
  const char *buf;         // This buffer is for the source text
  unsigned int pos,length; 
  vector<AstNode> &nodes;  // all AST nodes are stored in here
  int res;                 // the most recent AstNode produced
};

Curiously, bison uses return values internally, but when you invoke bison it doesn't extrude the information. That's part of the bison philosophy of using side-effects rather than pure functions. Yuck. We however need to know the top-level returned value, ie. the root of the AST. This will necessarily be the last rule that was matched. So, every time we match a rule, we update the 'res' variable.

Bison, gram.ypp

%pure_parser

This bison directive, pure_parser, will make it generate a parser that doesn't use global variables. Instead, every parser function accepts an additional formal parameter void* YYPARSE_PARM. And every time it invokes the lexer (to fetch the next token), it uses the additional argument YYLEX_PARAM. Recall that the lexer wants to see its own state 'yyscanner', not our TParseParam state.

#define YYPARSE_PARAM parm
#define YYLEX_PARAM ((TParseParm*)parm)->yyscanner

The implementation of parse_x, the function that does the parsing, is as follows.

int parse_x(const char *buf, vector<AstNode> &nodes,
            ErrorInfo *einfo)
{ TParseParm pp(buf,nodes); pp.einfo=einfo;
  yylex_init(&pp.yyscanner);       // to init lexer's state.
  yyset_extra(&pp, pp.yyscanner);
  int res=yyparse(&pp);            // parse!
  yylex_destroy(pp.yyscanner);
  if (res==1) return -1; else return pp.res;
}

The grammar rules

For illustration, we use a slightly more complex grammar than in the initial example. This grammar has a sequence of terms. (The terms are actually terms in the 'pi calculus', in case you're interested...)

terms ::=  empty  |  terms term  |  terms error
term  ::=  rcv x;  |  snd x;  |  {terms}

(This grammar demonstrates two aspects of good bison style: using left-recursion to allow multiple terms, and adding an 'error' at the end of a load of terms, to catch errors in the most appropriate place.)

Hopefully, it will be clear by this stage how one might construct an AST corresponding to this grammar, where

rcv x; snd x; snd x;

becomes

AstSeq( Seq( Seq( Seq(Empty,Rcv), Snd), Snd), Snd)

But this looks a little ugly. It'd be nicer to have a single AstSeq node which admits a list of children:

AstSeq( Rcv, Snd, Snd )

The idea is as follows. The invariant to remember: every time the grammar matches 'terms', it returns the index of an AstSeq node.

let the struct AstNode include vector children.
When we match an 'empty', construct an AstSeq node. (This is the base case of the invariant).
When we match 'terms term', $1 will be an AstSeq and $2 will be the term. So, stick $2 onto the end of the AstSeq, and return $1 again. (This is the induction case of the invariant).

Programmatically, the bison grammar looks like this:

terms
  : /* empty */ {
       TParseParm *pp=(TParseParm*)parm;
       AstNode a; a.type=atSeq;  
       $$ = pp->res = pp->nodes.size();
       pp->nodes.push_back(a);
       // We've created a new 'Seq' node,
       // and returned its index as the value $$
       }
  | terms term  {
       TParseParm *pp=(TParseParm*)parm;
       AstNode &sofar = pp->nodes[$1];
       sofar.child.push_back($2);
       $$ = pp->res = $1;
       // We've reused the existing Seq node,
       // and returned its index as the value $$
       }
  | terms error {
       TParseParm *pp=(TParseParm*)parm;
       pp->einfo->line=yyget_lineno(pp->yyscanner);
       // to record the line-number where the error happened
       YYABORT;
       // to stop parsing, and return -1.
       }

Flex, gram.l

%option reentrant
%option noyywrap
%option yylineno

The first option creates a re-entrent lexer (ie. no global variables). The second option means that, when the lexer reaches the end of the input we give it, it won't try to "spill over" into an additional input file. The third option makes it keep track of line-numbers, of use in error reporting by the parser.

// Whenever flex wants to get more input, it invokes the
// YY_INPUT macro.
// We're going to give it data from our TParseParm state...
// The function returns as many bytes as it can, up to a
// maximum of 'size'.

#define YY_INPUT(buf,result,max_size) \
      result=yy_input_proc(buf,max_size,yyscanner)

int yy_input_proc(char *buf,int size,yyscan_t yyscanner)
{ TParseParm *parm = (TParseParm*)yyget_extra(yyscanner);
  if (parm->pos >= parm->length) return 0;
  int num = parm->length - parm->pos; if (num>size) num=size;
  memcpy(buf,parm->buf+parm->pos,num);
  parm->pos += num;
  return num;
}

The lexer rules

The first few lexer rules are easy:

snd { return SEND; }
rcv { return RECEIVE; }

These match the keywords "snd" and "rcv". They return the types SEND and RECEIVE. Recall that 'types' (also called tokens) are defined as integers in the auto-generated parser header file gram.hpp. There is no need to return any value (ie. any AstNode), since these two have no additional information.

[_]*[a-zA-Z][a-zA-Z0-9_]* {
      TParseParm *pp = (TParseParm*)yyget_extra(yyscanner);
      // yytext is the piece of text that has
      // just been matched by this lex.
      AstNode a; a.type=astId; a.s=yytext;
      *yylval_param = pp->res = pp->nodes.size();
      pp->nodes.push_back(a);
      return IDENTIFIER;
      }

This lexer rule matches identifiers. It needs to return as value a new AstNode containing the textual string of the identifier. And it returns IDENTIFIER as type. Note the odd syntax that flex uses to return values: *yylval_param = val. (And recall that 'values' in our parser are always integer indexes into the vector of AstNodes).

Conclusions

Building abstract syntax trees is not discussed adequately in the flex/bison manuals, and there are few examples online. There are also few examples of it in C++. Hopefully this article will contribute.

Essentially, flex/bison date from the 'C' age when programmers hunted wooly mammoths with global variables, and they communicated via side effects which were indistinguishable from magic, and that's probably why so many got burnt at the stake. Trying to fit it around modern C++ is like acute kidney failure, only not as fun. This is my growl at the designers: I learnt JavaCC and Spirit in twenty minutes each, but it took me four whole days to make something acceptable out of flex/bison.

Re-entrant flex/bison parser to make Abstract Syntax Tree