Phase 1: The lexical scanner
The scanner is a procedure which is called each time a new token is needed. The following pieces of information
are provided by the scanner for each token:
- its type
- its value (in some cases)
- the token string itself (useful for debugging)
- location (row and column) of the token in the source text.
This information is packed in a TOK_INFO structure, defined in the compile.h
file. The file also contains an 'enum' definition for the various token types as well as a symbol table structure.
You should write a function called 'get_token' which will find the next token, classify it, determine its value
if necessary, and save its start location. The 'get_token' function takes a single parameter: the file stream pointer
to the source file. The following files are provided for you:
| compile.h |
a header file with struct definitions and token types |
- symbol_table.c
|
a very simple symbol table function |
- test_scanner.c
|
a wrapper program which will drive your scanner by repeatedly calling "get_token" and printing the token
information |
- scan_test.c--
|
a C-- nonsense program which will test the scanner's ability to deal with various situations |
You should also implement an error report procedure, which will be used by all stages of the compiler and which
will take an error type, a severity indicator, the calling module, and the location (row and column) as arguments,
and produce an error message. The details of this procedure are up to you. Probably the only error conditions which
can occur in the scanner will be illegal characters, invalid final states, and dangling comments (see below).
Implementation hints
In the original implementation instructions for this phase, the authors recommend
that you implement the scanner as a table driven finite state machine. Due to level of complexity (or lack thereof)
of C--'s syntax you may want to use several smaller tables to aid in scanning. The choice is yours.
Three token types in the compile.h file are only there
for internal use in the scanner and should not be returned as tokens to the parser: TOK_COMMENT,
TOK_SPACE, and TOK_ERR. You may or may not find these three token types useful as you code your scanner, but in
any event they should never be seen by the world outside of the 'get_token' function. One strategy however is to
return these tokens to the parser, and have the parser ignore them.
It is necessary to store the characters into a string as they are scanned; this makes it possible to compute
a value from the string:
- a TOK_INTNUM is an integer number type; the string should be translated into its integer value and that value
stored into the appropriate field of the TOK_INFO structure; it is easy to use C's library function 'sscanf' or
'atoi' to do the conversion;
- a TOK_FLOATNUM is a floating point number type; the string should be translated to its float value and stored;
in this project it is easy to use C's library function 'sscanf' or 'atof' to do the conversion;
- a TOK_IDENT is an identifier; its value is obtained by calling the function 'handle_ident' with the string
as its argument; the identifier's value is obtained from the SYM_ENTRY struct returned by the function call; (but
note the discussion below on the handling of C-- keywords;). Take a look at how keywords are added in test_scanner.c.
There are a few minor complications to consider in implementing the scanner:
- You will create the state transition table by first drawing a state transition diagram for C--. Note that it
could include the complete state diagram for comments (i.e. treat a comment just as if you were scanning any other
token). But in the case of long comments the text string could overflow. Instead, consider recognizing the '/*'
as a token and then, instead of returning it, entering a loop which chews up the comment without saving it;
when the end of the comment is reached, reset the scanner and begin scanning the next token. If you take this approach,
be sure that your scanner doesn't crash or hang if the program has a 'dangling comment' (i.e. comment never closes).
- It is recommend that you treat keywords as identifiers in the scanner, but have the symbol table handle them
as a special case. In this approach, the keywords are preloaded into the symbol table before any scanning is done.
The 'handle_ident' function is set up to return one of the eight keyword tokens in the special case, or else the
TOK_IDENT type if the identifier is not a keyword. Note that the 'main' program in test_scanner.c
does this preloading.
- Since the scanner reports the location (row and column) of each token, it must keep track of where it is in
the input stream. The easiest way to handle this is to let the character fetching and unfetching functions handle
this. Two functions have been provided 'get_next_char (fp)' which returns the next character and transparently
adjusts the globals 'line_cnt' and 'column_cnt', and 'unget_next_char (c, fp)' which puts back a character with
transparent adjustments. The 'unget_next_char' function is useful since the scanner often must look ahead one character
before it can determine that the end of a token has been reached.
What You Will Turn In
All project phases will be submitted via the online homework submission system. For this phase you will submit
a file called phase1.zip. This WinZip compatable archive file
should contain the following items:
- In a directory called phase1, the source code for you implementation. Include all .cpp as well
as .h files and project files. Do not include the Debug directory. Make sure that it is possible
for your instructor to compile your program from the source and project files. Make sure that your .exe will function
from the command line.
- A one page descrption (called test.txt, and not placed in the phase1 directory)
of the testing performed to insure that your program is operating properly. List any limitations that your program
has. Admitting program limitations, rather than having your instructor find them will result in fewer points subtracted.
* Based on a project from here.