Assignment 2: Lexical Analysis

Due: Friday, Sept 25th by 11:59 PM

Your task is to implement a lexical analyzer for a subset of the Scheme programming language.

Please read this assignment description carefully.

Getting Started

Download CS340_Assign2.zip and import it into your Eclipse workspace (File->Import->General->Existing projects into workspace->Archive file).

You should see a project called YCP_Scheme in the Package Explorer.

Your Task

Your task is to complete the LexicalAnalyzerImpl class.

You can use the JUnit test class called LexicalAnalyzerImplTest to test your implementation. Once all of the tests pass, you can have a fairly high degree of confidence that your implementation works correctly.

You can also test your implementation by running the main method in the Main class. When you run the program, you can type text into the Eclipse Console window, and the program will use your lexical analyzer implementation to divide the text into tokens.

The LexicalAnalyzer interface

The LexicalAnalyzer interface, which LexicalAnalyzerImpl implements, is defined as follows:

/**
 * An object implementing the LexicalAnalyzer interface
 * returns a sequence of tokens read from the source text.
 * It supports the capability of looking ahead one token.
 */
public interface LexicalAnalyzer {
        /**
         * Look ahead for the next token to be returned by the next() method.
         *
         * @return the next Token to be returned by the next() method,
         *         or null if the end of input has been reached
         * @throws IOException
         * @throws LexicalAnalysisException if an invalid sequence of
         *         characters is encountered
         */
        public Token peek() throws IOException, LexicalAnalysisException;

        /**
         * Read one token from the input text and return it.
         * Throws a LexicalAnalysisException if the end of input has been
         * reached.
         *
         * @return a Token
         * @throws IOException
         * @throws LexicalAnalysisException if an invalid sequence of
         *         characters is encountered, or if the lexical analyzer
         *         is at the end of input
         */
        public Token next() throws IOException, LexicalAnalysisException;
}

The idea is that your (eventual) Scheme interpreter will use an object implementing this interface to read tokens of input.

The interface defines two methods:

  1. peek looks ahead to the next token in the input file without consuming it, and returns a Token object representing that token. Because peek does not advance the "cursor" that keeps track of where in the input file the lexical analyzer is positioned, calling peek repeatedly returns the same Token each time. If the lexical analyzer is at the end of input, null is returned.
  2. next reads the next token and returns a Token object representing it. This method does consume the token from the input. So, if the next method is called repeatedly, a sequence of Token objects representing the tokens in the input file will be returned.

An invariant can be defined describing the relationship between peek and next: if lexer is a reference to an object implementing the LexicalAnalyzer interface, then

lexer.peek() == null || lexer.peek().equals(lexer.next())

In other words, unless lexer is at the end of the input file, the Token returned by peek is the same one that the next call to the next method will return.
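
One common way to satisfy this invariant is to buffer a single token of lookahead. The following is a minimal sketch of that idea, not the required implementation. LookaheadSketch is a hypothetical class, plain String values stand in for Token objects, an Iterator stands in for scanning characters from the Reader, and IllegalStateException stands in for LexicalAnalysisException.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of one-token lookahead; not part of the provided project.
class LookaheadSketch {
    // Stands in for "scan the next token from the Reader" (assumption for this sketch).
    private final Iterator<String> source;
    // The buffered token, or null if nothing is currently buffered.
    private String lookahead;

    LookaheadSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    // Return the next token without consuming it, or null at end of input.
    String peek() {
        if (lookahead == null && source.hasNext()) {
            lookahead = source.next();
        }
        return lookahead;
    }

    // Consume and return the next token; fail if the input is exhausted.
    String next() {
        String token = peek();
        if (token == null) {
            // A real implementation would throw LexicalAnalysisException here.
            throw new IllegalStateException("unexpected end of input");
        }
        lookahead = null; // the buffered token has now been consumed
        return token;
    }
}
```

With this structure, next is written in terms of peek, so the two methods cannot disagree about which token comes next.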

Exceptions

Note that both the peek and next methods are defined as throwing IOException and LexicalAnalysisException. An IOException could be thrown if the Reader object used to read the characters of input from the input file encounters an I/O error. You will not need to throw IOException explicitly. A LexicalAnalysisException indicates one of two possibilities:

  1. a sequence of characters not corresponding to any legal token was encountered
  2. the next method was called, but the lexical analyzer is at the end of the input file

Token class, TokenType enumeration

An instance of the Token class represents a single token of input. A Token instance has three important pieces of information:

  1. the token type
  2. the lexeme
  3. the line number

The token type is a member of the TokenType enumeration. You will recognize the members of this enumeration as corresponding to the kinds of terminal symbols in Assignment 1.

The lexeme is the token's sequence of characters as they appear in the input file. The lexeme is significant because some kinds of tokens---for example, identifiers---are represented by many possible lexemes. For example, the strings "a", "+", and "eq?" are all identifiers, and should have TokenType.IDENTIFIER as their token type.

The line number is an integer indicating the line of input on which the token appears. It is useful for producing a meaningful error message when a syntax error is found by the parser. (Parsing will be the subject of the next assignment.)

LexicalAnalyzerImpl constructor

The LexicalAnalyzerImpl class should have a single constructor which takes a Reader object as its parameter. The lexical analyzer will use the Reader object to read characters of input from the input file.
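
Deciding where a token ends usually requires looking at one character beyond it. As a hedged sketch of one way to do that, the standard java.io.PushbackReader can wrap the Reader passed to the constructor. CharSourceSketch and its method names are hypothetical and are not part of the provided project.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Hypothetical helper showing single-character lookahead over a Reader.
class CharSourceSketch {
    private final PushbackReader in;

    CharSourceSketch(Reader reader) {
        // The one-argument constructor gives a pushback buffer of exactly one character.
        this.in = new PushbackReader(reader);
    }

    // Read and consume one character, or return -1 at end of input.
    int read() throws IOException {
        return in.read();
    }

    // Return the next character without consuming it, or -1 at end of input.
    int peekChar() throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c); // put the character back so the next read sees it again
        }
        return c;
    }
}
```

Any IOException raised by the underlying Reader simply propagates, which is consistent with never throwing IOException explicitly.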

Lexical structure

The subset of Scheme which your lexical analyzer should handle has the following lexical structure:

  • an LPAREN token has the lexeme "("
  • an RPAREN token has the lexeme ")"
  • an INTEGER_LITERAL token has a lexeme which is any sequence of one or more digits ('0' .. '9')
  • a BOOLEAN_LITERAL token has the lexeme "#t" or "#f"
  • a STRING_LITERAL token has any lexeme formed by a single double-quote (") character, followed by any sequence of zero or more non-double-quote characters, followed by a single double-quote (") character
  • a QUOTE_KEYWORD token has the lexeme "quote"
  • a LAMBDA_KEYWORD token has the lexeme "lambda"
  • an IF_KEYWORD token has the lexeme "if"
  • a DEFINE_KEYWORD token has the lexeme "define"
  • an AND_KEYWORD token has the lexeme "and"
  • an OR_KEYWORD token has the lexeme "or"
  • a NOT_KEYWORD token has the lexeme "not"
  • an IDENTIFIER token has any lexeme formed by one identifier-character, followed by any sequence of zero or more identifier-character-or-digit characters, where the entire lexeme would not match any other token type. (For example, the lexeme "quote" is a QUOTE_KEYWORD token, not an IDENTIFIER token.)
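
The keyword rule in the last bullet can be handled by first scanning an identifier-shaped lexeme and then checking the complete lexeme against a keyword table. The sketch below assumes that approach. KeywordSketch and its nested Kind enum are hypothetical stand-ins (Kind mirrors a subset of the TokenType names listed above), so adapt the idea to the real TokenType enumeration in the project.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of separating keywords from identifiers.
class KeywordSketch {
    // Stand-in for the project's TokenType enumeration (subset, assumed names).
    enum Kind {
        QUOTE_KEYWORD, LAMBDA_KEYWORD, IF_KEYWORD, DEFINE_KEYWORD,
        AND_KEYWORD, OR_KEYWORD, NOT_KEYWORD, IDENTIFIER
    }

    private static final Map<String, Kind> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("quote", Kind.QUOTE_KEYWORD);
        KEYWORDS.put("lambda", Kind.LAMBDA_KEYWORD);
        KEYWORDS.put("if", Kind.IF_KEYWORD);
        KEYWORDS.put("define", Kind.DEFINE_KEYWORD);
        KEYWORDS.put("and", Kind.AND_KEYWORD);
        KEYWORDS.put("or", Kind.OR_KEYWORD);
        KEYWORDS.put("not", Kind.NOT_KEYWORD);
    }

    // Classify a complete lexeme that has already been scanned as identifier-shaped.
    static Kind classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, Kind.IDENTIFIER);
    }
}
```

Because the whole lexeme is looked up, "quote" is classified as a keyword while a longer lexeme such as "quoted" remains an identifier.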

An identifier-character is a letter or any of the following characters:

! $ % & * + - . / : < = > ? @ ^ _ ~

An identifier-character-or-digit is a character that is either an identifier-character or a digit ('0' .. '9').
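
These two character classes translate directly into small predicate methods. The following sketch is one possible encoding. The class and method names are hypothetical, and it assumes that "letter" means whatever Character.isLetter accepts, which is broader than the ASCII letters a-z and A-Z.

```java
// Hypothetical helper predicates for the two character classes defined above.
class IdentifierCharsSketch {
    // The punctuation characters from the list above, copied verbatim.
    private static final String SPECIAL = "!$%&*+-./:<=>?@^_~";

    // True if c is a letter or one of the listed punctuation characters.
    // Takes an int so the result of Reader.read() (including -1) can be passed directly.
    static boolean isIdentifierChar(int c) {
        if (c < 0) {
            return false; // -1 signals end of input
        }
        // Assumption: Character.isLetter is an acceptable definition of "letter".
        return Character.isLetter(c) || SPECIAL.indexOf(c) >= 0;
    }

    // True if c is an identifier-character or a digit '0' .. '9'.
    static boolean isIdentifierCharOrDigit(int c) {
        return isIdentifierChar(c) || (c >= '0' && c <= '9');
    }
}
```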

Note that whitespace characters (space, tab, newline, etc.) are not significant, except when they occur within a string literal. So, the input text

foo bar

is two tokens (both IDENTIFIERs), while

"foo bar"

is a single token, a STRING_LITERAL.
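
Skipping whitespace between tokens is also a natural place to keep the line number up to date. The sketch below shows one way to do both. WhitespaceSketch is a hypothetical class, and it assumes that line numbers start at 1 and that a line ends at each '\n' character. Whitespace inside a string literal would be handled by the string-scanning code instead and is not covered here.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Hypothetical illustration of skipping whitespace while counting lines.
class WhitespaceSketch {
    private final PushbackReader in;
    private int line = 1; // assumption: the first line of input is line 1

    WhitespaceSketch(Reader reader) {
        this.in = new PushbackReader(reader);
    }

    // Consume whitespace and return the first non-whitespace character
    // without consuming it, or -1 if the input ends first.
    int skipWhitespace() throws IOException {
        int c;
        while ((c = in.read()) != -1 && Character.isWhitespace(c)) {
            if (c == '\n') {
                line++; // each newline moves to the next line
            }
        }
        if (c != -1) {
            in.unread(c); // leave the start of the next token unread
        }
        return c;
    }

    // The line on which the next token would start.
    int currentLine() {
        return line;
    }
}
```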

Example

Consider the following input text:

(define fact
  (lambda (n)
    (if (= n 1)
        1
        (* n (fact (- n 1))))))

When reading this input, your lexical analyzer should emit the following sequence of tokens:

Token type        Lexeme
LPAREN            (
DEFINE_KEYWORD    define
IDENTIFIER        fact
LPAREN            (
LAMBDA_KEYWORD    lambda
LPAREN            (
IDENTIFIER        n
RPAREN            )
LPAREN            (
IF_KEYWORD        if
LPAREN            (
IDENTIFIER        =
IDENTIFIER        n
INTEGER_LITERAL   1
RPAREN            )
INTEGER_LITERAL   1
LPAREN            (
IDENTIFIER        *
IDENTIFIER        n
LPAREN            (
IDENTIFIER        fact
LPAREN            (
IDENTIFIER        -
IDENTIFIER        n
INTEGER_LITERAL   1
RPAREN            )
RPAREN            )
RPAREN            )
RPAREN            )
RPAREN            )
RPAREN            )

Submitting

Export your completed Eclipse project to a zip file by right-clicking on the name of the project (YCP_Scheme) and choosing Export->General->Archive File.

Upload the zip file to the submission server as assign2. The URL of the server is

https://camel.ycp.edu:8443/

IMPORTANT: after uploading, you should download a copy of your submission and double-check it to make sure that it contains the correct files. You are responsible for making sure your submission is correct. You may receive a grade of 0 for an incorrectly submitted assignment.