Handout 3 - Grammar Factoids


Some Definitions

grammar - A formal definition of the syntactic structure of a language, normally given in terms of production rules which specify the order of constituents and their sub-constituents in a sentence (a well-formed string in the language). Each rule has a left-hand side symbol naming a syntactic category (e.g. "noun-phrase" for a natural language grammar) and a right-hand side which is a sequence of zero or more symbols. Each symbol may be either a terminal symbol or a non-terminal symbol. A terminal symbol corresponds to one "lexeme" - a part of the sentence with no internal syntactic structure (e.g. an identifier or an operator in a computer language). A non-terminal symbol is the left-hand side of some rule.

One rule is normally designated as the top-level rule which gives the structure for a whole sentence.

A grammar can be used either to parse a sentence (see parser) or to generate one. Parsing assigns a terminal syntactic category to each input token and a non-terminal category to each appropriate group of tokens, up to the level of the whole sentence. Parsing is usually preceded by lexical analysis. Generation starts from the top-level rule and chooses one alternative production wherever there is a choice.


Backus-Naur Form - (BNF) A formal meta-syntax used to express context-free grammars. Backus Normal Form was
renamed Backus-Naur Form at the suggestion of Donald Knuth.

BNF is one of the most commonly used metasyntactic notations for specifying the syntax of programming languages, command sets, and the like. It is widely used for language descriptions but seldom documented anywhere (how do you document a metasyntax?), so that it must usually be learned by osmosis.

Consider this BNF for a US postal address:

<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part>  ::= <name> | <initial> "."
<name-part>      ::= <personal-part> <last-name> [<jr-part>] <EOL> | <personal-part> <name-part>
<street-address> ::= [<apt>] <house-num> <street-name> <EOL>
<zip-part>       ::= <town-name> "," <state-code> <ZIP-code> <EOL>


This translates into English as: "A postal-address consists of a name-part, followed by a street-address part, followed by a zip-code part. A personal-part consists of either a first name or an initial followed by a dot. A name-part consists of either: a personal-part followed by a last name followed by an optional "jr-part" (Jr., Sr., or dynastic number) and end-of-line, or a personal part followed by a name part (this rule illustrates the use of recursion in BNFs, covering the case of people who use multiple first and middle names and/or initials). A street address consists of an optional apartment specifier, followed by a street number, followed by a street name. A zip-part consists of a town-name, followed by a comma, followed by a state code, followed by a ZIP-code followed by an end-of-line."

Note that many things (such as the format of a personal-part, apartment specifier, or ZIP-code) are left unspecified. These lexical details are presumed to be obvious from context or specified somewhere nearby.

LL - A class of language grammars, which can be parsed without backtracking. The first L stands for Left-to-right scan, the second for Leftmost derivation.

Often found in the form LL(k) where k is the number of tokens of look-ahead required when parsing a sentence of the language. In particular, LL(1) is a fairly restrictive class of grammar, but allows simple top-down parsing (e.g. recursive-descent) to be used without wasteful backtracking. A number of programming languages are LL(1) (or close).

recursive descent parser - A "top-down" parser built from a set of mutually-recursive procedures or a non-recursive equivalent where each such procedure usually implements one of the productions of the grammar. The structure
of the resulting program closely mirrors that of the grammar it recognises.


Checking if a Grammar is LL(1)

The only true was to determine is a grammar is LL(1) is to generate the first, follow and select sets for the grammar which is not a process for the faint of heart. However, we can remove some common problems.


Removing Left recursion

The following is a left recursive grammar:

A ::= Aa | B | C

To remove the left recursion, we can rewrite the grammar using an additional nonterminal as follows:

A ::= BX | CX

X ::= aX | lambda

where lambda is null

Common Left Factors

The following example shows the case where a production may begin with a choice of two terminals as:

A ::= xy | xz

A grammar of this form does not meet the requirements of LL(1). We can rewrite the grammar as:

A ::= xB

B ::= y | z


A Grammar with Issues

<stmt>        ::= <if> | <assign> | <for> | <print>
<if>          ::= IF <cond> THEN <stmt> END |
                  IF <cond> THEN <stmt> ELSE <stmt> END
<assign>      ::= <id> := <exp>
<for>         ::= FOR <id> IN <exp> DO <stmt> END
<print>       ::= PRINT <id>
<cond>        ::= <num> IN <exp> | <exp> <op> <exp>
<op>          ::= # | = | <= | >=
<exp>         ::= <exp> + <term> | <exp> - <term> | <term> |
                  INCL(<exp> , <elem>) | EXCL(<exp> , <elem>)
<term>        ::= <term> * <factor> | <factor>
<factor>      ::= (<exp>) | <set>
<set>         ::= <id> | {} | {<setlist>}
<setlist>     ::= <num> | <num> , <setlist>
<elem>        ::= <id> | <num>
<id>          ::= A | B | C | D | ... | X | Y | Z
<num>         ::= 1 | 2 | 3 | 4 | ... | 256

Revised Grammar (revised form, eliminating left recursion and common left factors)

<stmt>        ::= <if> | <assign> | <for> | <print>
<if>          ::= IF <cond> THEN <stmt> <if_tail>
<if_tail>     ::= END | ELSE <stmt> END
<assign>      ::= <id> := <exp>
<for>         ::= FOR <id> IN <exp> DO <stmt> END
<print>       ::= PRINT <id>
<cond>        ::= <num> IN <exp> | <exp> <op> <exp>
<op>          ::= # | = | <= | >= 
<exp>         ::= <term> <exp_tail>  | INCL(<exp> , <elem>) | EXCL(<exp> , <elem>) 
<exp_tail>    ::=  + <term> <exp_tail> |  - <term> <exp_tail> | lambda 
<term>        ::= <factor> <term_tail> 
<term_tail>   ::= * <factor> | lambda 
<factor>      ::= (<exp>) | <set>
<set>         ::= <id> | { <set_tail>
<set_tail>    ::= } | <setlist> }
<setlist>     ::= <num> <setlist_tail>
<setlist_tail>::= , <setlist> | lambda
<elem>        ::= <id> | <num>
<id>          ::= A | B | C | D | ... | X | Y | Z
<num>         ::= 1 | 2 | 3 | 4 | ... | 256


Consider the grammar segment:

<exp> ::= <exp> + <term> | <exp> - <term> | <term> | INCL(<exp> , <elem>) | EXCL(<exp> , <elem>)

eliminating left recursion

<exp>      ::= <term> <exp_tail> | INCL(<exp> , <elem>) | EXCL(<exp> , <elem>)
<exp_tail> ::= + <term> <exp_tail> | - <term> <exp_tail> | lambda

Note that lambda is the empty element.


Consider the grammar segment:

<term>   ::= <term> * <factor> | <factor>

eliminating left recursion

<term>      ::= <factor> <term_tail>
<term_tail> ::= * <factor> | lambda


Consider the grammar segment:

<if>          ::= IF <cond> THEN <stmt> END |
                  IF <cond> THEN <stmt> ELSE <stmt> END
eliminating common left factors:
<if>          ::= IF <cond> THEN <stmt> <if_tail>
<if_tail>     ::= END | ELSE <stmt> END