notes-computer-programming-programmingLanguagesBook-programmingLanguagesChParsingImpl

Table of Contents for Programming Languages: a survey

Chapter : parsing (and lexing)

The goal of these stages is to start with a text file and to end up either with a parse tree, or an informative syntax error.

todo intro to parse trees

The syntax of a programming language is usually represented by a grammar. A grammar specifies not only which textual strings are syntactically valid, but also how they should be parsed into a parse tree. todo intro grammars

The usual notation for a grammar is EBNF. todo intro BNF, EBNF. Todo explain the two interpretations of EBNF, see section 2.3.2.3, page 35 of http://dickgrune.com/Books/PTAPG_1st_Edition/

The general pipeline goes: source text -> lexing (tokenization) -> token stream -> parsing -> parse tree.

In some languages, this sort of thing is repeated in multiple passes. For example, languages with macros that are expanded at parse time need to first go thru the source to identify and expand the macros, and then to lex and parse the transformed source code.

A token is usually taken to be a pair consisting of a token type and a token value. For example, the integer literal 3 might be processed to a token containing (INTEGER_LITERAL, 3). In practice tokens may also carry other information, such as a pointer to where in the source code they occur, to aid later error reporting and debugging. The part of the source code corresponding to a given token is called a 'lexeme'.
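As a concrete illustration, here is a minimal sketch in Python (the field names are my own invention, not a standard):

    from typing import NamedTuple

    class Token(NamedTuple):
        type: str      # the token type, e.g. 'INTEGER_LITERAL'
        value: object  # the token value produced by the lexeme evaluator, e.g. 3
        pos: int       # where the lexeme occurs in the source, for error reporting

    # the lexeme "3" in the source "x = 3" might become:
    # Token(type='INTEGER_LITERAL', value=3, pos=4)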

Tokenization usually proceeds by scanning the input to delimit and classify lexemes, and also running some lexemes through a lexeme evaluator to produce the token values.

The scanner is usually specified in terms of regular expressions (regexps), or some sort of extended regular expressions.
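For instance, here is a hedged sketch of a regexp-driven scanner in Python; the toy lexical syntax and the token type names are invented for illustration:

    import re

    # toy lexical syntax: each token type is specified as a regexp
    TOKEN_SPEC = [
        ('INTEGER_LITERAL', r'\d+'),
        ('IDENTIFIER',      r'[A-Za-z_]\w*'),
        ('OPERATOR',        r'[-+*/^=()]'),
        ('WHITESPACE',      r'\s+'),  # delimits lexemes; produces no token
        ('MISMATCH',        r'.'),    # anything else is a lexical error
    ]
    MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

    def tokenize(source):
        tokens = []
        for m in MASTER_RE.finditer(source):
            kind, lexeme = m.lastgroup, m.group()
            if kind == 'WHITESPACE':
                continue
            if kind == 'MISMATCH':
                raise SyntaxError('unexpected %r at offset %d' % (lexeme, m.start()))
            # the 'lexeme evaluator': compute the token value from the lexeme
            value = int(lexeme) if kind == 'INTEGER_LITERAL' else lexeme
            tokens.append((kind, value, m.start()))  # (type, value, position)
        return tokens

    # tokenize('x = 3') == [('IDENTIFIER', 'x', 0), ('OPERATOR', '=', 2),
    #                       ('INTEGER_LITERAL', 3, 4)]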

Highly non-local properties, such as matched parentheses, are put off until the parsing stage.

The parsing stage takes a stream of tokens as input, and, using a grammar over tokens, produces a parse tree as output.

Code that does parsing is called a 'parser', and it is usually governed by a specification of valid parse trees (and hence of valid strings) called a 'grammar'. There are different classes of grammars; various classes have various expressive power, and various types of parsers with various efficiencies and properties are available for the different classes.

Most common grammar formalisms and parsers are expressive enough to do lexical analysis, too. However, in practice, lexical analysis is usually done as a separate stage, for reasons of both efficiency and simplicity.

When designing a programming language, you may want to choose the syntax in such a way as to make it efficient and easy to tokenize and to parse. todo what proportion of compiler/interpreter time is usually spent in lexing and parsing -- i've heard the parse time can be significant? The part of syntax corresponding to lexical analysis is often called 'lexical syntax'. To make lexical analysis efficient and easy, you should choose a lexical syntax amenable to scanning with regular expressions (best) or extended regular expressions (not as good, but still good).

To make it efficient and easy to parse, you should choose a type of grammar over tokens for which a standard and efficient type of parser is available.

Note that many popular languages do not follow these prescriptions, and use ad-hoc algorithms for lexing and parsing that are more complex than simply scanning with regexes, running token evaluators on lexemes, and then using standard parsing algorithms with a common grammar type.

What kind of grammar over tokens should you choose for your programming language? What sorts of types of grammars are there for which standard and efficient parsers are known, and what are their trade-offs and limitations? I've found it surprisingly hard to find a concise overview of answers to these questions on the web (please send me any if you find them).

The clearest authoritative recommendation that I've found is:

"If one has the luxury of being in a position to design the grammar oneself, the choice is simple: design the grammar to be LL(1) and use a predictive recursive descent parser. It can be generated automatically, with good error recovery, and allows semantic routines to be included in-line. This can be summarized as: parsing is a problem only if someone else is in charge of the grammar." -- http://dickgrune.com/Books/PTAPG_2nd_Edition/

I was surprised to find that recommendation, since there are so many other grammar classes that are more expressive than LL(1). When I initially started looking this stuff up, I had gotten the impression that we had moved beyond LL(1), which is one of the most constraining types of grammar, and that it would be a good idea to design LALR(1) languages. But apparently not. Other commentators, although not as authoritative as the authors of PTAPG, often seemed to agree; a common sentiment was that working with extant LALR and LR parser generators is not ideal (and writing such a parser oneself is a long and difficult task, at least once one includes the necessity of error reporting), and that it's better to at least have the option of avoiding them.

Another somewhat common sentiment is that mixing operator precedence parsing with predictive recursive descent LL(1) parsing is okay. So if you want your recursive descent parser to be more efficient, or to represent operator precedence more clearly than a BNF grammar does, you can mix in operator precedence parsing, e.g. Top Down Operator Precedence, also known as Pratt parsing.

If you are writing your compiler or interpreter in a functional language such as Haskell or Scala, you may be able to use parser combinators instead of parser generators (see http://stackoverflow.com/questions/5507665/are-there-any-ll-parser-generators-for-functional-languages-such-as-haskell-or-s ). Whether or not parser combinators are available, you could write your own predictive recursive descent parser.
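To make this concrete, here is a minimal sketch of a hand-written predictive recursive descent parser in Python, for a toy expression grammar of my own (not taken from any of the sources above). Each nonterminal becomes one function, and one token of lookahead (peek) decides which production to use:

    # toy LL(1) grammar, in EBNF:
    #   expr   -> term   { ('+' | '-') term }
    #   term   -> factor { ('*' | '/') factor }
    #   factor -> INTEGER | '(' expr ')'

    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens  # e.g. ['1', '+', '2', '*', '3']
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def eat(self, expected):
            if self.peek() != expected:
                raise SyntaxError('expected %r, got %r' % (expected, self.peek()))
            self.pos += 1

        def expr(self):
            node = self.term()
            while self.peek() in ('+', '-'):
                op = self.peek(); self.eat(op)
                node = (op, node, self.term())  # parse tree as nested tuples
            return node

        def term(self):
            node = self.factor()
            while self.peek() in ('*', '/'):
                op = self.peek(); self.eat(op)
                node = (op, node, self.factor())
            return node

        def factor(self):
            tok = self.peek()
            if tok == '(':
                self.eat('('); node = self.expr(); self.eat(')')
                return node
            if tok is not None and tok.isdigit():
                self.eat(tok)
                return int(tok)
            raise SyntaxError('unexpected token %r' % (tok,))

    # Parser(['1', '+', '2', '*', '3']).expr() == ('+', 1, ('*', 2, 3))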

However, you should still write down an LL(1) grammar in BNF (todo: will EBNF do?) and run your grammar thru an LL(1) parser generator first, (a) to make sure that it is LL(1) (which also tells you that it is unambiguous), and (b) so that you have a platform-agnostic representation of the grammar.

Random note: minus ('-') is often a special case, because in traditional arithmetic notation it can be either unary negation or binary infix subtraction. With some parsing techniques, distinguishing the two may be impossible (without extra passes or some other complication), and with some it may be simpler to have all unary operators bind tighter than all infix operators, which does not match traditional arithmetic, in which -2^4 = -(2^4), not (-2)^4. Some languages even go so far as to make the programmer use a different symbol for unary negation than for binary subtraction.
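Here is a minimal sketch of such a mix: a recursive descent expression parser driven by a table of binding powers (Pratt-style / precedence climbing). The binding-power values are my own invention; note how unary minus is given a power above '*' but below '^', so that -2^4 parses as -(2^4) while -2*3 parses as (-2)*3:

    # binding powers; higher binds tighter (values invented for this sketch)
    INFIX_BP    = {'+': 10, '-': 10, '*': 20, '/': 20, '^': 40}
    PREFIX_BP   = {'-': 30}  # unary minus: tighter than '*', looser than '^'
    RIGHT_ASSOC = {'^'}

    def parse_primary(tokens, pos):
        tok = tokens[pos]; pos += 1
        if tok == '-':  # unary negation
            operand, pos = parse_expr(tokens, pos, PREFIX_BP['-'])
            return ('neg', operand), pos
        if tok == '(':
            inner, pos = parse_expr(tokens, pos, 0)
            assert tokens[pos] == ')'
            return inner, pos + 1
        return int(tok), pos

    def parse_expr(tokens, pos=0, min_bp=0):
        left, pos = parse_primary(tokens, pos)
        while (pos < len(tokens) and tokens[pos] in INFIX_BP
               and INFIX_BP[tokens[pos]] >= min_bp):
            op = tokens[pos]; pos += 1
            # left-associative operators demand strictly higher power on the right
            next_min = INFIX_BP[op] + (0 if op in RIGHT_ASSOC else 1)
            right, pos = parse_expr(tokens, pos, next_min)
            left = (op, left, right)
        return left, pos

    # parse_expr(['-', '2', '^', '4'])[0] == ('neg', ('^', 2, 4))
    # parse_expr(['-', '2', '*', '3'])[0] == ('*', ('neg', 2), 3)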

Another note: even if you won't constrain yourself to LL(k), you may want to avoid requiring a symbol table at parse time. E.g. in C, you need a symbol table to parse, because you need to know whether certain identifiers are type names or not in order to parse some lines containing those identifiers. This apparently makes parsing significantly more complex, and some languages such as golang proudly trumpeted that they avoid it.

Some tutorials and articles on recursive descent parsing and Pratt parsing:

Advanced topics in choice of grammar

Some common types of grammars, in declining order of expressiveness, are:

Van Wijngaarden grammars, also called W-grammars, which are two-level grammars in which a meta-grammar is specified that determines the actual phrase-structure grammar. PTAPG 1st ed. says that, unlike unrestricted one-level grammars, Van Wijngaarden grammars can be written in a clear manner, but that no good parsing techniques for them were available (at that time).

The other grammars that we will consider are called phrase structure grammars.

Unrestricted (phrase structure) grammars are simply the class that includes all (phrase structure) grammars. They are equivalent to Chomsky Type 0 grammars (by which we mean that the set of languages recognized by unrestricted grammars equals the set of languages recognized by Chomsky Type 0 grammars), which have the restriction that the left side of each production rule contains only nonterminals.

Context-sensitive grammars are grammars with production rules of the form aXb -> ayb, where a and b are any strings of terminals and/or nonterminals (and may be empty), X is exactly one nonterminal, and y is a non-empty string of terminals and/or nonterminals; plus, optionally, a rule S -> epsilon (where epsilon represents the empty string), provided S does not appear on the right-hand side of any production. They are equivalent to noncontracting grammars: grammars whose production rules all obey the constraint that the string length of the left side of the rule is not longer than the string length of the right side. This noncontraction property is the key difference between context-sensitive grammars and unrestricted grammars, and it allows context-sensitive grammars to be recognized by linear-bounded automata (restricted Turing machines which have only an amount of tape linear in the length of the input).
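For example, here is the classic noncontracting grammar for the language a^n b^n c^n (n >= 1), which no context-free grammar can generate (S and B are nonterminals; a, b, c are terminals):

    S -> a b c
    S -> a S B c
    c B -> B c
    b B -> b b

A sample derivation: S -> aSBc -> aabcBc -> aabBcc -> aabbcc.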

Context-free grammars are grammars with production rules of the form X -> y, where X is exactly one nonterminal, and y is any string of terminals and/or nonterminals, possibly empty. The lack of any context around the nonterminal being consumed/expanded is the key difference between context-free grammars and context-sensitive grammars.
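For example, the language of balanced parentheses, which (as noted below) no regular expression can express, has this simple context-free grammar:

    S -> '(' S ')' S
    S -> epsilon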

Right regular grammars are grammars with all production rules of one of the three following forms: X -> a, X -> aY, X -> epsilon, where X is exactly one nonterminal, a is exactly one terminal, and Y is exactly one nonterminal. Right regular grammars are equivalent to left regular grammars (where the second form is replaced by X -> Ya), so we speak simply of "regular grammars". Regular grammars are equivalent to regular expressions.
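For example, here is a right regular grammar for nonempty binary numerals, equivalent to the regular expression [01][01]*:

    N -> '0' N
    N -> '1' N
    N -> '0'
    N -> '1'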

PTAPG 1st ed., chapter 11, page 239 says: "Unrestricted (Type 0) and context-sensitive (Type 1) grammars are hardly used since, first, they are user-unfriendly in that it is next to impossible to construct a clear and readable Type 0 or Type 1 grammar and, second, all known parsers for them have exponential time requirements."

Regular expressions are not expressive enough to express desired syntactic properties such as matched parentheses.

Therefore we will restrict further attention to grammars which are less expressive than context-sensitive grammars and more expressive than regular expressions.

A further desirable property is linear-time parsing. Let's restrict our attention to grammars for which linear-time parsers are known (if you want something more general, you can use an Earley or GLR (Tomita) parser, which are O(n^3); see the free online old edition of PTAPG, section 11.2, page 250; PTAPG says that Tomita (GLR) is the best of these). Linear-time-parsable grammar classes include LR(k), LALR(k), LL(k), operator-precedence grammars, and PEGs, all discussed below.

note: the stuff above with k's forms a hierarchy only when the k is the same, e.g. LL(1) < LALR(1), but LL(k) is incomparable with LALR(1); i think (todo)

The k's represent the amount of lookahead; e.g. LL(1) is LL with 1 token of lookahead.
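To illustrate what a lookahead restriction means in practice, consider this toy example of my own. The first grammar below is not LL(1): upon seeing the token ID, the parser cannot decide with one token of lookahead which production to use. Left-factoring it yields the second grammar, which is LL(1):

    expr -> ID '(' args ')'     (a call)
    expr -> ID                  (a variable)

    expr -> ID rest
    rest -> '(' args ')'
    rest -> epsilon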

Note that all of the above are unambiguous grammars, with two exceptions. First, some operator-precedence grammars are ambiguous; however, every language recognized by an operator-precedence grammar is a deterministic context-free language, which means that there exists some unambiguous context-free grammar that recognizes it. Second, some PEG grammars would have been ambiguous, except that the PEG formalism resolves all ambiguity by choosing the first alternative.

Parser generators exist that try to transform a BNF or EBNF grammar that you give them into a more restricted, linear-time-parsable form like LALR(1), generating a linear-time parser if they can do so.

http://en.wikipedia.org/wiki/Comparison_of_parser_generators shows that LR(1), LALR(1), LL(*), LL(k), LL(1), and packrat parser generators are all popular.

The popular ANTLR uses something they made up called LL(*), which is O(n^2) time according to http://www.antlr.org/papers/LL-star-PLDI11.pdf , but which "in practice" is often quite efficient.

on operator precedence

Operator precedence can be mixed with other parsing techniques.

Vaughan Pratt's "Top Down Operator Precedence", sometimes called Pratt parsing, is a parsing technique that combines Recursive Descent and Floyd's Operator Precedence. See also http://javascript.crockford.com/tdop/tdop.html , http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/ , https://news.ycombinator.com/item?id=2344837 , http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/ , http://en.wikipedia.org/wiki/Pratt_parser , http://effbot.org/zone/simple-top-down-parsing.htm . See also http://en.wikipedia.org/wiki/Operator-precedence_parser

" Operator-precedence parsers are very easy to construct (often even by hand) and very efficient to use; operator-precedence is the method of choice for all parsing prob- lems that are simple enough to allow it. That only a skeleton parse tree is obtained, is often not an obstacle, since operator grammars often have the property that the seman- tics is attached to the operators rather than to the right-hand sides; the operators are identified correctly. It is surprising how many grammars are (almost) operator-precedence. Almost all formula-like computer input is operator-precedence. Also, large parts of the grammars of many computer languages are operator-precedence. " -- PTAPG 1st ed. section 9.2.1, page 190

from http://en.wikipedia.org/wiki/Operator-precedence_parser , it seems like if you are only using operator precedence parsing for non-ambiguous CFGs, like actual operator precedence, then you can already express operator precedence within an LL(1) grammar, and the only advantages of mixing in an operator precedence parser over plain recursive descent are efficiency (you avoid a chain of recursive calls, one per precedence level) and the ability to treat precedences as data (e.g. for user-defined custom precedence levels). So, in the absence of custom precedence levels, this is an optimization.

What that page calls the "precedence climbing method" seems to correspond to Pratt parsing.

prioritized choice (i.e. resolving ambiguity by choosing the first alternative in the grammar specification)

PEGs and some parser combinators use this method of resolving ambiguity.

Some commentators think this is dangerous, because the author of the grammar may never be made aware of the underlying ambiguity:

" >> parser combinators don't warn you about ambiguity, LL/LR do.

> Parser combinators don't normally warn you about "ambiguity" because they normally have unambiguous semantics (different choices are always tried in order, by design). In my opinion, this is not better or worse, just different.

I strongly believe (and make a case for such in the article) that they are not "just different", but that making everything prioritized choice is actively harmful. Prioritized choice hides real language issues like the "dangling else" problem and the C++ "most vexing parse" where two interpretations of a program can look equally valid to a user, but one is declared "correct" arbitrarily. With an abstraction like PEGs or parser combinators, you are flying blind about whether your grammar contains such issues. "
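For example, the dangling-else problem mentioned above silently disappears in a PEG (a sketch in PEG notation, where '/' is prioritized choice):

    stmt <- 'if' cond 'then' stmt 'else' stmt
          / 'if' cond 'then' stmt
          / other

Because the first alternative is tried first, an 'else' always binds to the nearest 'if'. The grammar author gets no warning that the corresponding CFG would have been ambiguous.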

more on parser combinators

http://stackoverflow.com/questions/5507665/are-there-any-ll-parser-generators-for-functional-languages-such-as-haskell-or-s says that "The major reason for this is that most LL(k) parsers that are written in functional languages are just implemented using parser combinators, because the easiest path to generate a parser combinator library is recursive descent."

In other words, if you have an LL(k) language, and your compiler or interpreter is written in a language with sufficiently powerful functional features, then you can (and should) just use a parser combinator library rather than a parser generator.
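As a rough sketch of the idea (hand-rolled for illustration; this is not the API of any real combinator library such as Parsec): represent a parser as a function from (text, position) to (value, new position), or None on failure, and build large parsers out of small ones with combinator functions:

    # a parser is a function: (text, pos) -> (value, new_pos), or None on failure

    def lit(s):  # parse a literal string
        def p(text, pos):
            if text.startswith(s, pos):
                return s, pos + len(s)
            return None
        return p

    def seq(*parsers):  # sequencing: run each parser in turn
        def p(text, pos):
            values = []
            for q in parsers:
                r = q(text, pos)
                if r is None:
                    return None
                v, pos = r
                values.append(v)
            return values, pos
        return p

    def alt(*parsers):  # prioritized choice: first parser that succeeds wins
        def p(text, pos):
            for q in parsers:
                r = q(text, pos)
                if r is not None:
                    return r
            return None
        return p

    # balanced parentheses: parens <- '(' parens ')' parens | epsilon
    def parens(text, pos):
        return alt(seq(lit('('), parens, lit(')'), parens),
                   lit(''))(text, pos)

    # succeeds at position 0 and consumes all 6 characters; a real parser
    # would also check that the whole input was consumed
    assert parens('(())()', 0)[1] == 6

Note that alt here is exactly the prioritized choice discussed in the previous section.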

"One of my biggest problems with parser combinators is that they tightly couple the grammar to the host language. I believe that one of the biggest problems with parsers today is that they are usually not very reusable. By writing a parser combinator in (say) Haskell, it can only be used from Haskell (or, awkwardly, by embedding Haskell). If you have to write a different parser for every host language, then you have to write O(N2) parsers to parse N languages from N languages." -- http://www.reddit.com/r/programming/comments/1lx0oc/ll_and_lr_in_context_why_parsing_tools_are_hard/

" I have no objections to your problems with prioritized choice. I want to add, however, that I don't think prioritized choice is a fundamental part of parser combinators. In particular, context-free grammars specified using applicative parser combinators are still amenable to these desirable static checks, although I don't know of any implementations that actually do it. "

" Also, some real-life ambiguities (like C's type/variable ambiguity) are not resolved the same way every time, but are resolved based on semantic context. How can parser combinators do this?

Monadic parser combinators can be arbitrarily context-sensitive and allow you to backtrack. This is simultaneously the source of their power and their inefficiencies. You only have to suffer the inefficiencies if you use the extra power, though.

Since it's embedded in a host language, you can even patch up holes later without necessarily having to backtrack, but still within the parser, if you can't come up with a cleaner way to do it. It could be through some sort of knot-tying trick or perhaps something in a monad transformer stack that provides the power for you. "

Recommendations

" The initial demands on a CF parsing technique are obvious: it should be general (i.e., able to handle all CF grammars), it should be fast (i.e., have linear time requirements) and preferably it should be easy to program. There are two serious obstacles to this naive approach to choosing a parser. The first is that the automatic generation of a linear-time parser is possible only for a subset of the CF grammars. The second is that, although this subset is often described as “very large” (especially for LR(1) and LALR(1)), experience shows that a grammar that is designed to best describe the language without concern for parsing is virtually never in this set. What is true, though, is that for almost any arbitrary grammar a slightly different grammar can be found that generates the same language and that does allow linear-time parsing; finding such a grammar, however, almost always requires human intervention and cannot be automated. Using a modified grammar has the disadvantage that the resulting parse trees will differ to a certain extent from the ones implied by the original grammar. " -- -- free online old edition of PTAPG , section 11.1, page 252

" If one is in a position to design the grammar along with the parser, there is little doubt that LL(1) is to be preferred: not only will parsing and performing semantic actions be easier, text that conforms to an LL(1) grammar is also clearer to the human reader. A good example is the design of Modula-2 by Wirth (see Programming in Modula-2 (Third, corrected edition) by Niklaus Wirth, Springer-Verlag, Berlin, 1985). " -- free online old edition of PTAPG , section 11.3.2, page 252

" The book (mentioned by P), Parsing Techniques - A Practical Guide by Grune & Jacobs, is very good. The final chapter is titled "Practical Parser Writing and Usage." After 546 pages of surveying practically every known parsing algorithm, the authors give this very simple and useful advice:

    If one has the luxury of being in a position to design the grammar oneself, the choice is simple: design the grammar to be LL(1) and use a predictive recursive descent parser. It can be generated automatically, with good error recovery, and allows semantic routines to be included in-line. This can be summarized as: parsing is a problem only if someone else is in charge of the grammar.

I would go further and say that, given the current state of parser generators, it may be strongly preferable to use a hand-written LL(1) recursive descent parser. "

The following book is often recommended (I haven't read it personally, however): Dick Grune, Ceriel J.H. Jacobs, Parsing Techniques: A Practical Guide -- free online old edition of PTAPG

Case studies

misc

todo: other topics: http://en.wikipedia.org/wiki/Attribute_grammar , http://en.wikipedia.org/wiki/Syntax-directed_translation , http://en.wikipedia.org/wiki/Boolean_grammar , semantic rules in parsing , http://stackoverflow.com/questions/1857022/limitations-of-peg-grammar-parser-generators , http://blogs.ethz.ch/copton/2009/07/08/parsing-expression-grammars/ , http://compilers.iecc.com/comparch/article/92-05-059 , http://www.garshol.priv.no/download/text/bnf.html#id4.3. , http://programmers.stackexchange.com/questions/19541/what-are-the-main-advantages-and-disadvantages-of-ll-and-lr-parsing , http://stackoverflow.com/questions/6480634/examples-of-ll1-lr1-lr0-lalr1-grammars , http://nathansuniversity.com/pegs.html , http://nathansuniversity.com/pegs10.html , http://stackoverflow.com/questions/15673216/why-some-compilers-prefer-hand-crafted-parser-over-parser-generators , http://www.reddit.com/r/programming/comments/1lx0oc/ll_and_lr_in_context_why_parsing_tools_are_hard/ , http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html , http://www.lclnet.nl/publications/pure-and-declarative-syntax-definition.pdf , http://www.cs.virginia.edu/~cs415/lectures/weimer-pl-07.pdf , semantic and syntactic predicates

https://groups.google.com/forum/#!topic/comp.compilers/tTUVATHiK2I

Abstract Syntax Trees (AST)

http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/