Phase 1: The lexical scanner


The scanner is a procedure which is called each time a new token is needed. The following pieces of information are provided by the scanner for each token:

  1. its type
  2. its value (in some cases)
  3. the token string itself (useful for debugging)
  4. location (row and column) of the token in the source text.

This information is packed in a TOK_INFO structure, defined in the compile.h file. The file also contains an 'enum' definition for the various token types as well as a symbol table structure.

You should write a function called 'get_token' which will find the next token, classify it, determine its value if necessary, and save its start location. The 'get_token' function takes a single parameter: the file stream pointer to the source file. The following files are provided for you:
compile.h a header file with struct definitions and token types
symbol_table.c
a very simple symbol table function
test_scanner.c
a wrapper program which will drive your scanner by repeatedly calling "get_token" and printing the token information
scan_test.c--
a C-- nonsense program which will test the scanner's ability to deal with various situations
You should also implement an error report procedure, which will be used by all stages of the compiler and which will take an error type, a severity indicator, the calling module, and the location (row and column) as arguments, and produce an error message. The details of this procedure are up to you. Probably the only error conditions which can occur in the scanner will be illegal characters, invalid final states, and dangling comments (see below).

Implementation hints

In the original implementation instructions for this phase, the authors recommend that you implement the scanner as a table driven finite state machine. Due to level of complexity (or lack thereof) of C--'s syntax you may want to use several smaller tables to aid in scanning. The choice is yours.

Three token types in the compile.h file are only there for internal use in the scanner and should not be returned as tokens to the parser: TOK_COMMENT, TOK_SPACE, and TOK_ERR. You may or may not find these three token types useful as you code your scanner, but in any event they should never be seen by the world outside of the 'get_token' function. One strategy however is to return these tokens to the parser, and have the parser ignore them.

It is necessary to store the characters into a string as they are scanned; this makes it possible to compute a value from the string:

There are a few minor complications to consider in implementing the scanner:

What You Will Turn In

All project phases will be submitted via the online homework submission system. For this phase you will submit a file called phase1.zip. This WinZip compatable archive file should contain the following items:

  1. In a directory called phase1, the source code for you implementation. Include all .cpp as well as .h files and project files. Do not include the Debug directory. Make sure that it is possible for your instructor to compile your program from the source and project files. Make sure that your .exe will function from the command line.
  2. A one page descrption (called test.txt, and not placed in the phase1 directory) of the testing performed to insure that your program is operating properly. List any limitations that your program has. Admitting program limitations, rather than having your instructor find them will result in fewer points subtracted.


* Based on a project from here.