CS 441 Project 1 –Yacc-

Afusat Abolanle Oyelaja, Braden Padget, Adam Patton, Rie Maeda

1. YACC History

YACC, Yet Another Compiler-Compiler, was written by Steve Johnson in 1975. YACC was developed at Bell Laboratories, and has been a standard UNIX utility since the 7th UNIX Edition. YACC was first created for the UNIX operating system and is associated with another tool called LEX. LEX generates a scanner later used by YACC, but not necessary. This is later described. Other programs were then written, namely Berkeley YACC, GNU Bison, MKS YACC and Abraxas YACC. Each offer slight improvements and additional features over the original YACC, but the concept has remained the same.

There were many projects developed with YACC, here are a couple of examples. In 1977, the first partial machine grammar of Loglan uses YACC by Doug Landauer. In 1983, AT&T releases System V UNIX, which includes the Source Code Control System (SCCS), allowing many users to work on the same source code while each user is always able to obtain the latest version of each file. System V includes yet another compiler compiler (YACC), which facilitates the creation of parsers for computer languages. System V UNIX was partially developed at UC Berkeley, however lawsuits with AT&T delayed Berkeley's release (FreeBSD) long enough for Linux to take over.

2. YACC Overview

LEX and YACC are tools designed for writers of compilers and interpreters, but they are also used for other purposes like syntax checkers, or string processors. YACC produces a parser based on LR parser theory. YACC is used as a stream parser. The streams, which are transmitted over the network, must conform to a pre-defined grammar. Once an entity has received a data stream, the stream parser verifies it and provides the data for further processing. The output of LEX checks whether it corresponds to a given grammar or not. YACC does not recognize just regular expressions; moreover, it recognizes entire grammars. If the given data stream matches the grammar, a semantic action can be applied, in order to process the data in a certain way.

Grammars for YACC are described using a variant of Backus Naur Form (BNF). This technique was pioneered by John Backus and Peter Naur, and used to describe

ALGOL60. A BNF grammar can be used to express context-free languages. Most constructs in modern programming languages can be represented in BNF. A grammar can have a set of rules beginning with a start-rule. A rule consists of a series of terminal and non-terminal symbols. Each non-terminal symbol must have a separate rule, to split it up in just terminal symbols. For guideline reasons, terminal symbols should be capitalized.

Each rule consists of a single name on the left-hand side followed by the ``:'' operator, a list of symbols and action code on the right-hand side. A semicolon indicates the end of a rule. To produce more complex grammars, the ``|'' operator can be used to provide alternatives.

3. YACC Data Object

The Yacc code is a list of codes for building new objects out of old ones. Any integer coming in from the Lexer is an expression. A variable identifies an array element in which to store the value of an expression, and expressions can be constructed from previously constructed expressions.

Objects from the Lexer are normally identified in Yacc by character constants or uppercase names and objects from parser are identified by lowercase. Codes are attached to new construction by placing the codes in brackets after the objects that was used to build the code. The value of the new object is represented by $$ sign, and regular expressions begin with $ sign and terminate with white space, the value of old objects are represented by $1, $2, …EST.

For example the equation a=3+4*c, is processed as,

expression:

INTEGER

| VARIABLE = {$$ = variables [$1];}

| expression '*' expression = {$$ = $1 * $3;}

| expression '+' expression = {$$ = $1 + $3;}

| VARIABLE '=' expression = {variables [$1] = $3;}

| '(' expression ')' = {$$ = $2;}

When there are several codes to the left side of the rule a bar (|) is used to separate the codes.

4. YACC Sequence Control

The sequence control of Yacc can be considered in terms of its expressions, statements, and subprograms. Examples of these are precedence, conditional and iteration statements, and subprogram calls respectively.

The precedence of all operators and associativity of binary operators are specified by the user due to the disambiguating rules that are used to decide which choice to make in a given situation. This information is sufficient to allow Yacc to resolve the parsing conflicts in accordance with these rules and to construct a parser that realized the desired precedence and associativities. The precedence and associativities are attached to tokens in the declaration section in a series or lines beginning with a keyword, %left, %right, or %nonassoc, followed by a list of tokens (%left means left associative, %right means right associative, and %nonassoc means non-associative). All the tokens on the same line are considered to have the same precedence, and these lines are listed in the increasing order of precedence. For example, the order of mathematical operation can be expressed as following.

%left ‘+’ ‘-’

%left ‘*’ ‘/’

This expression tells the addition and subtraction operators are left associative and have the same precedence that is lower than the precedence of the multiplication and division operators that are also left associative. A more advanced example is following.

%right ‘=’

%left ‘+’ ‘-’

%left ‘*’ ‘/’

expr : expr ‘=’ expr

| expr ‘+’ expr

| expr ‘-’ expr

| expr ‘*’ expr

| expr ‘/’ expr

| NAME

;

This description can be used to build an input,

A = B = C * D + E – F * G / H

as A = ( B = ( ( C * D ) + E ) – ( ( F * G ) / H ) ).

However, Yacc has he keyword, %prec that changes the precedence level associated with a particular grammar rule. An example of use of %proc is to make unary operation whose minus operation has the same precedence as multiplication. This is expressed as following.

%left ‘+’ ‘-’

%left ‘*’ ‘/’

expr : expr ‘+’ expr

| expr ‘-’ expr

| expr ‘*’ expr

| expr ‘/’ expr

| ‘-’ expr %prec ‘*’

| NAME

;

Since Yacc generates C/C++ code, it has both conditional and iterative statements. The action statements in Yacc are only translated with $$, $1, etc, being rewritten appropriately. IF, THEN, ELSE, WHILE, and such statement keywords are input as tokens to Yacc, and it builds table-driven parsers that process sequence of tokens read from an input stream. A parser generated by Yacc gets its tokens by successive calls to a C function named yylex( ), which is the direct equivalent of the KeyPress event handler. The yylex( ) function returns a numeric token that is equivalent to the n-event parameter, but it also copies the full text of the actual word it recognized into a global variable named yytext. This variable is available to any code in the Yacc generated program. Furthermore, the parser can do two legal thins, a shift or a reduction when there is some input string that can be structured in two or more different ways, and it has no way to decide between these. This is called a shift/reduce conflict. If the parser has a choice of two legal reductions, this is called a reduce/reduce conflict, and a shift/shift conflict does not exist. When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser by selecting one of the valid steps following the disambiguating rules. In fact, Yacc invokes two disambiguating rules by default. First, the default is to do the shift in a shift/reduce conflict. Second, the default is to reduce by the earlier grammar rule in the input sequence in a reduce/reduce conflict.

The third category of sequence control is subprograms, and Yacc does have subprograms. This is explained in the Yacc subprogram and memory management section.

5. YACC Subroutines and Memory Management

A Yacc subroutine is one that is defined by the user. The subroutine is a lexical analyzer named yylex that is used to accept tokens from the input stream. The lexical analyzer then analyzes the token and sends the appropriate action to the parser according to the grammar rules. The value that is sent to the parser is stored in an external variable named yylval. Another Yacc subroutine is yyerror. This C function is defined by the user and determines the sequence of events to occur upon the occurrence of an error. The function is the main body of the lexical analyzer and can call other user defined subroutines. If there is no yyerror routine defined then the default routine yyparse handles the errors. The user must be careful not to use C keywords in tokens or it could cause conflicts between the subroutines and the parser. Below is an example of a lexical analyzer to handle terminal symbols written by Stephen C. Johnson of AT&T Bell Laboratories.

#  define  BSZ  50        /*  buffer  size  for  floating  point  numbers  */

        /*  lexical  analysis  */

yylex(){

        register  c;

        while(  (c=getchar())  ==  ' '  ){  /*  skip  over  blanks  */  }

        if(  isupper(  c  )  ){

                yylval.ival  =  c  -  'A';

                return(  VREG  );

        if(  islower(  c  )  ){

                yylval.ival  =  c  -  'a';

                return(  DREG  );

        if(  isdigit(  c  )  ||  c=='.'  ){

                /*  gobble  up  digits,  points,  exponents  */

                char  buf[BSZ+1],  *cp  =  buf;

                int  dot  =  0,  exp  =  0;

                for(  ;  (cp-buf)=  BSZ  )  printf(  "constant  too  long:  truncated\n"  );

                else  ungetc(  c,  stdin  );    /*  push  back  last  char  read  */

                yylval.dval  =  atof(  buf  );

                return(  CONST  );

        return(  c  );

Yacc in its parser implementation uses a stack and look-ahead features to parse a machine readable grammar. Yacc does not perform automatic garbage collection. This unfortunately introduces the dreaded memory leak unless there are explicitly defined subroutines to perform garbage collection. Yacc pops elements off the stack in two instances. The first case is when Yacc does a reduction and pops the tokens off the stack. The elements which were on the stack are not freed and returned to the heap. The second instance is when an error occurs. The user can explicitly free the memory by keeping track of the tokens as they are added to the stack and once a reduction occurs then the tokens can be freed and returned to the heap.

6. YACC Data Encapsulation and Abstraction

Yacc accepts word items and, given a list of rules describing how these

items form larger entities, deduces which items or combinations of items satisfy a rule. This can also be thought of as grammatical analysis. Once a rule is satisfied, a programmer's code is applied to the result.

The Yacc grammar describes the specification grammar of the parser and other data structures and it can be used either to parse a sentence or to generate one or more parser, grammar could also be declared as the pathname of a file containing instructions, for which a parser is to be created. Parsing assigns a terminal syntactic category to each input token and a non-terminal category to each appropriate group of tokens, up to the level of the whole sentence. It reads the context free grammar in file.

Grammar:

The language accepted by Yacc, as a grammar is described using the Yacc input language, and the input will ensure that the file contain this specification in order: declaration, grammar and program. The grammar section consists of a sequence of definitions, which is called grammar rule.

Grammar rule:

The rules in this section define the context free grammar to be accepted by the function Yacc generates. The grammar is described below.

For example if we have a grammar such as this;

A: BODY;

A represents a non-terminal name and BODY represents a sequence of Zero or more (names, literals, semantic actions and precedence rule (optional)). Only the name and the literals participate in the formation of a grammar. The colon (;) and semicolon (:) are Yacc punctuation.

NAME: name is either a token name or a non-terminal.
LITERAL: literal stands for itself during a grammar formation.
SEMANTIC ACTION: this allows user to associate actions to be performed with each grammar rule. Action can occur in the middle of a rule as well as the end. These actions could be to return a value or to obtain values returned by a previous action. These values are kept in an object type of YYSTYPE. The result of the action will be kept on the parser stack on the left of the rule, to be accessed by other reductions on the right side of the rule. An action placed in the middle of a rule is the same as a non-terminal symbol placed on the left side of the rule. However an action is one or more C statements enclosed in curly braces, these are macros known only to Yacc.
PRECEDENCE: it is used to change the precedence level associated with a particular rule. It is represented as %prec, within the rule. It can appear after a body and can be followed by a literal or a token name.

If a program section follows, the program section can include the definition of the lexical analyzer or and the code generator and any other functions

Apart from the grammar, we also have a set of variable known as the environment variables and these variables affect the execution of Yacc. Listed below are the variables and there description.

· Lang: provides a default value for the global variables that are unset or null

· LC_ALL: overrides the value of other global variables if set to a non-empty string.

· LC_CTYPE: Determine the location for the interpretation of sequences of bytes of text data as characters.

· LC_MESSAGES: Determine the location that should be used to affect the format and contents of diagnostic messages written to standard error.

· NLSPATH: Determine the location of message catalogs for the processing of LC_MESSAGES.

The program generated by Yacc will also be affected by the content of these variables at runtime.

Comparison of Yacc and Lisp Sequence Control

Both Yacc and Lisp support conditional expressions and recursion, but iterative processes are specified with recursion so that loops would be unnecessary in Lisp. Also, List (pure Lisp) has only two kinds of data structures such as atoms and lists. Atoms are either symbols that have the form of identifiers or numeric literals. When determining precedence, Yacc uses the keywords, %left, %right, and %nonassoc, but Lisp associates with parentheses to delimit elements. For example, a simple list has the form, ( A B C D ) where A, B, C, and D are atoms, and a nested list is ( A ( B C ) D ( E ( F G ) ) ). The first is the atom A. The second is the sublist (B C). The third is the atom D. The fourth is the sublist (E (F G)), which has as its second element, the sublist (F G).

Also, Yacc supports most of conditional and iterative statements in C/C++ since it generates C/C++ code, but Lisp does not. However, Lisp can implement these in other methods because it supports symbolic computing and artificial intelligence programming that reasons about symbolic expressions that are made up of words and groups of words. For example, if there are two expressions:

(Bach is a man)

(All men are mortal)

Lisp implements these statements performing pattern matching and logical inference and can deduce a new fact,

(Bach is mortal)

A Lisp program can reach new facts and add them to the database by applying generic rules of inference to a database of expression.

Comparison of Yacc and Lisp memory management and subroutines

Lisp and Yacc are similar in that they both implement subroutines. The use of subroutines makes the code easier to program and debug due to its modularity. Yacc subroutines looks like <function name>(<type>parameter1,<type>parameter2,…..)

{ body} and Lisp user defined subroutines are in the form of (defun <name> <parameter-list> <body>) and are derived from Lambda Calculus. Yacc subroutines are written in C language and are compiled. The functions are predefined and have a fixed amount of arguments at runtime. Lisp however can have a fixed amount of arguments or an undetermined amount of arguments at runtime. This is implemented through the use of the &rest, &optional, &key parameters. Another feature of Lisp that is not available in Yacc is the ability to use functions as parameters to other functions. This is made possible by the lexical closure feature. The lexical closure keeps track of the functions lexical content when it is passed to another function as an argument. Both Lisp and Yacc make use of a set of Macros and primitive types. Finally Lisp functions are both compiled and interpreted where Yacc functions are only compiled.

Yacc and Lisp memory management techniques are similar in that they both allow dynamic memory allocation. In Yacc the dynamic allocation is done through the C routines using the new operator to create a new instance of that variable during execution.

Lisp allows dynamic memory allocation through the list structure. The list structure is composed of con cells. Each con cell has two parts. The left part of the con cell is the CAR part and the right half is the CDR part. The CAR half hold the address portion of the register and the CDR half holds the decrement portion of the register. Lists can be implemented in Yacc using C but there is the possibility of memory leaks or running out of memory if care is not taken in freeing up dangling pointers and returning them to the heap. Automatic garbage collection is one of Lisp’s features that make it useful for AI applications. The user does not have to worry about freeing up memory. Lisp does automatic garbage collection in 2 phases. The first phase is a marking phase. The marking phase is where the program will go in and mark all lists that are not reachable from the base register. Once all the unused cells are marked a sweep phase is performed. The sweep phase collects all the con cells that were marked during the marking phase and deallocates the memory. The garbage collecting feature of Lisp has at least one disadvantage in that it sometimes slows down the overall runtime of an application.

Comparison of Yacc and Lisp object types

As stated above Yacc code is a list of codes for building new objects out of old ones. Any code coming in from the Lexer is an expression and a variable that identifies an array element in which to store the value of the expression, and the expressions can be constructed from previously constructed expressions.

While Lisp, Atoms are represented as sequences of characters of reasonable length. Lists are recursively constructed from atoms. This means that a given list may contain either atoms or other lists as members.

Examples:

ATOMS                LISTS

A ()

John (a)

34 (a john 34 c3po)

C3po ((john 34) a ((c3po)))

(Atoms and lists are the two most important objects of Lisp)

Comparison of Yacc and Lisp data Abstraction

With Yacc, it is difficult to add actions to a grammar without introducing

conflicts. Lisp features many improvements over Yacc in this category.

Many applications of LISP involve storing a variety of values associated with a symbolic name. Association lists and property lists provide two ways of doing this, unique to LISP. Arrays, vectors and strings are also used for data storage and manipulation.