Assignment 1: Lexical Analysis

Due: Tuesday, Sept 14th by 11:59 PM

Update 9/2: Correction to TokenRecognizerImpl section

In this assignment you will implement a lexical analyzer for a subset of the Scheme programming language.

Please read this assignment description carefully.

Getting Started

Download CS340_Assign1.zip and import it into your Eclipse workspace (File->Import->General->Existing projects into workspace->Archive file.)

You should see a project called YCP_Scheme in the Package Explorer.

Your Task

Your task is to complete the TokenRecognizerImpl class. The idea is to add regular expression patterns to match all valid Scheme token types. The token types that you must recognize are described below in the Lexical structure section.

You can use the JUnit test class called LexicalAnalyzerImplTest to test your implementation. Once all of the tests pass, you can have a reasonable degree of confidence that your implementation works correctly. However, I strongly encourage you to add some tests of your own.

You can also test your implementation by running the main method in the Main class. When you run the program, you can type text into the Eclipse Console window, and the program will use your lexical analyzer implementation to divide the text into tokens.

The TokenRecognizer interface

The TokenRecognizer interface, which is implemented by the TokenRecognizerImpl class, is defined as follows:

/**
 * An instance of TokenRecognizer can be used to recognize
 * a valid token at the beginning of a line of text.
 */
public interface TokenRecognizer {
        /**
         * Recognize a token at the beginning of the given line of text.
         *
         * @param text    a line of text
         * @param lineNum the line number of the line of text within its source file
         * @return a Token whose lexeme consists of one or more characters
         *         at the beginning of the given line of text
         * @throws LexicalAnalysisException if the characters at the
         *         beginning of the line of text do not form a valid token
         */
        public Token recognizeToken(String text, int lineNum) throws LexicalAnalysisException;
}

Objects implementing this interface can be used to recognize legal Scheme tokens at the beginning of a line of text. The lexical analyzer will use an instance of this interface to do the work of dividing the characters in an input source file into tokens.

The recognizeToken method is called with a line of text (as a string) and a line number as parameters, and returns an instance of the Token class. The returned Token object specifies the lexeme, token type, and line number of the token.

If the characters at the beginning of the line of text passed to recognizeToken do not form a token of any legal token type, then the method should throw a LexicalAnalysisException.
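To make the calling convention concrete, here is a minimal sketch of how a lexical analyzer might drive a recognizer across one line of text. It is not the code in the assignment skeleton: LexemeFinder, splitLine, and LineSplitSketch are hypothetical names, and the sketch returns bare lexeme strings rather than Token objects so that it is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitSketch {
    // Hypothetical stand-in for TokenRecognizer: returns the lexeme found at
    // the start of text, or null when no token starts there.
    interface LexemeFinder {
        String lexemeAtStart(String text);
    }

    // Repeatedly asks the finder for the next lexeme, then drops that lexeme
    // and any whitespace after it from the front of the remaining text.
    static List<String> splitLine(String line, int lineNum, LexemeFinder finder) {
        List<String> lexemes = new ArrayList<>();
        String rest = line.trim();
        while (!rest.isEmpty()) {
            String lexeme = finder.lexemeAtStart(rest);
            if (lexeme == null || lexeme.isEmpty()) {
                throw new IllegalArgumentException(
                        "line " + lineNum + ": no valid token at \"" + rest + "\"");
            }
            lexemes.add(lexeme);
            rest = rest.substring(lexeme.length()).trim();
        }
        return lexemes;
    }

    public static void main(String[] args) {
        // Toy finder for demonstration only: every single character is a "token".
        LexemeFinder oneChar = text -> text.substring(0, 1);
        System.out.println(splitLine("( a )", 1, oneChar));
    }
}
```

The point of the sketch is that the recognizer only ever looks at the front of the remaining text; the caller is responsible for consuming the lexeme and skipping whitespace.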

TokenRecognizerImpl class

The TokenRecognizerImpl class is a concrete implementation of the TokenRecognizer interface.

It works by matching given input strings against a series of TokenPattern objects. Each TokenPattern object specifies a regular expression (as a Java Pattern object) and a TokenType value. If the input text matches the regular expression, then it is a valid token, and the TokenType value specifies what kind of token it is.
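The matching strategy described above can be sketched roughly as follows. Kind and KindPattern are hypothetical, reduced stand-ins for the skeleton's TokenType and TokenPattern, and the recognize method returns a plain string instead of a Token, so treat this as an illustration of the idea rather than the actual class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternListSketch {
    // Hypothetical, reduced stand-in for the TokenType enumeration.
    enum Kind { LPAREN, RPAREN, LAMBDA_KEYWORD }

    // Hypothetical stand-in for TokenPattern: a compiled regex plus its kind.
    static final class KindPattern {
        final Pattern pattern;
        final Kind kind;

        KindPattern(Pattern pattern, Kind kind) {
            this.pattern = pattern;
            this.kind = kind;
        }
    }

    // Patterns are tried in order, so more specific patterns should come first.
    static final List<KindPattern> PATTERNS = Arrays.asList(
            new KindPattern(Pattern.compile("^\\("), Kind.LPAREN),
            new KindPattern(Pattern.compile("^\\)"), Kind.RPAREN),
            new KindPattern(Pattern.compile("^lambda\\b"), Kind.LAMBDA_KEYWORD));

    // Returns "KIND:lexeme" for the first pattern that matches at the start
    // of text, or null if no pattern matches.
    static String recognize(String text) {
        for (KindPattern kp : PATTERNS) {
            Matcher m = kp.pattern.matcher(text);
            if (m.find()) {
                return kp.kind + ":" + m.group();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(recognize("(lambda (n) n)"));
        System.out.println(recognize("lambda (n) n)"));
        System.out.println(recognize("]"));
    }
}
```

Because each pattern begins with ^, find() can only succeed at the start of the text. In a real implementation, the null case would correspond to throwing a LexicalAnalysisException.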

Update 9/2: There is a problem with the example regular expression patterns as distributed in the assignment skeleton. You can correct the problem either by re-downloading the assignment skeleton (a corrected version has been posted), or by adding \\b to the end of each regular expression pattern. E.g.: change

new TokenPattern(Pattern.compile("^lambda"), TokenType.LAMBDA_KEYWORD),

to

new TokenPattern(Pattern.compile("^lambda\\b"), TokenType.LAMBDA_KEYWORD),

This fix ensures that strings like "lambdafoo" are not matched as "lambda" followed by "foo".
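If you want to see the effect of the word boundary for yourself, a small standalone check along these lines should do it. The class and method names are made up for this illustration and are not part of the skeleton.

```java
import java.util.regex.Pattern;

public class WordBoundarySketch {
    static final Pattern WITHOUT_BOUNDARY = Pattern.compile("^lambda");
    static final Pattern WITH_BOUNDARY = Pattern.compile("^lambda\\b");

    // True if the pattern matches at the start of text (both patterns use ^).
    static boolean matchesAtStart(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesAtStart(WITHOUT_BOUNDARY, "lambdafoo")); // true
        System.out.println(matchesAtStart(WITH_BOUNDARY, "lambdafoo"));    // false
        System.out.println(matchesAtStart(WITH_BOUNDARY, "lambda (n)"));   // true
        System.out.println(matchesAtStart(WITH_BOUNDARY, "lambda"));       // true
    }
}
```

Keep in mind that \b treats only letters, digits, and underscore as word characters, so with this pattern an input such as "lambda-foo" still matches at the start.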

Token class, TokenType enumeration

An instance of the Token class represents a single token of input. A Token instance has three important pieces of information:

  1. the token type
  2. the lexeme
  3. the line number

The token type is a member of the TokenType enumeration. You will recognize the members of this enumeration as corresponding to the kinds of terminal symbols in Assignment 1.

The lexeme is the token's sequence of characters as they appear in the input file. The lexeme is significant because some kinds of tokens---for example, identifiers---are represented by many possible lexemes. For example, the strings "a", "+", and "eq?" are all identifiers, and should have TokenType.IDENTIFIER as their token type.

The line number is an integer indicating the line of input on which the token appears. It is useful for producing a meaningful error message when a syntax error is found by the parser. (Parsing will be the subject of the next assignment.)

Lexical structure

The subset of Scheme which your lexical analyzer should handle has the following lexical structure:

  • an LPAREN token has the lexeme "("
  • an RPAREN token has the lexeme ")"
  • an INTEGER_LITERAL token has a lexeme which is any sequence of one or more digits ('0' .. '9')
  • a BOOLEAN_LITERAL token has the lexeme "#t" or "#f"
  • a STRING_LITERAL token has any lexeme formed by a single double-quote (") character, followed by any sequence of zero or more non-double-quote characters, followed by a single double-quote (") character
  • a QUOTE_KEYWORD token has the lexeme "quote"
  • a LAMBDA_KEYWORD token has the lexeme "lambda"
  • an IF_KEYWORD token has the lexeme "if"
  • a DEFINE_KEYWORD token has the lexeme "define"
  • an AND_KEYWORD token has the lexeme "and"
  • an OR_KEYWORD token has the lexeme "or"
  • a NOT_KEYWORD token has the lexeme "not"
  • an IDENTIFIER token has any lexeme formed by one identifier-character, followed by any sequence of zero or more identifier-character-or-digit characters, where the entire lexeme would not match any other token type. (For example, the lexeme "quote" is a QUOTE_KEYWORD token, not an IDENTIFIER token.)

An identifier-character is a letter or any of the following characters:

! $ % & * + - . / : < = > ? @ ^ _ ~

An identifier-character-or-digit is a character that is either an identifier-character or a digit ('0' .. '9').
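One possible way to turn the identifier definition into a Java regular expression is sketched below. It assumes that "letter" means an ASCII letter (A-Z or a-z), and the names IdentifierPatternSketch, ID_START, ID_REST, and lexemeAtStart are invented for the example. It does not exclude keywords; that would have to be handled separately, for instance by trying the keyword patterns first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierPatternSketch {
    // Assumption: "letter" means an ASCII letter. The hyphen and caret are
    // escaped so they are taken literally inside the character class.
    static final String ID_START = "[A-Za-z!$%&*+\\-./:<=>?@\\^_~]";
    static final String ID_REST = "[A-Za-z0-9!$%&*+\\-./:<=>?@\\^_~]";

    // One identifier-character, then zero or more identifier-character-or-digit.
    static final Pattern IDENTIFIER = Pattern.compile("^" + ID_START + ID_REST + "*");

    // Returns the identifier lexeme at the start of text, or null if none.
    static String lexemeAtStart(String text) {
        Matcher m = IDENTIFIER.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lexemeAtStart("eq? a b")); // eq?
        System.out.println(lexemeAtStart("+ 1 2"));   // +
        System.out.println(lexemeAtStart("foo bar")); // foo
        System.out.println(lexemeAtStart("42abc"));   // null
    }
}
```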

Note that space characters (space, tab, newline, etc.) are not significant, except when they occur within a string literal. So, the input text

foo bar

is two tokens (both IDENTIFIERs), while

"foo bar"

is a single token, a STRING_LITERAL.
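The literal token types, including the string-literal case just shown, could be written as regular expressions along the following lines. This is only a sketch: the class name and the lexemeAtStart helper are hypothetical, and you may need to adjust the patterns to fit the skeleton.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralPatternSketch {
    // One or more digits.
    static final Pattern INTEGER = Pattern.compile("^[0-9]+");
    // "#t" or "#f".
    static final Pattern BOOLEAN = Pattern.compile("^#[tf]");
    // A double quote, zero or more non-double-quote characters, a double quote.
    static final Pattern STRING = Pattern.compile("^\"[^\"]*\"");

    // Returns the lexeme matched by p at the start of text, or null if none.
    static String lexemeAtStart(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lexemeAtStart(STRING, "\"foo bar\" baz")); // "foo bar"
        System.out.println(lexemeAtStart(INTEGER, "123)"));           // 123
        System.out.println(lexemeAtStart(BOOLEAN, "#t x"));           // #t
        System.out.println(lexemeAtStart(STRING, "\"unterminated"));  // null
    }
}
```

Note that the space inside "foo bar" is matched by the string pattern itself, which is why it stays part of a single lexeme.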

Example

Consider the following input text:

(define fact
  (lambda (n)
    (if (= n 1)
        1
        (* n (fact (- n 1))))))

When reading this input, your lexical analyzer should emit the following sequence of tokens:

Token type        Lexeme
LPAREN            (
DEFINE_KEYWORD    define
IDENTIFIER        fact
LPAREN            (
LAMBDA_KEYWORD    lambda
LPAREN            (
IDENTIFIER        n
RPAREN            )
LPAREN            (
IF_KEYWORD        if
LPAREN            (
IDENTIFIER        =
IDENTIFIER        n
INTEGER_LITERAL   1
RPAREN            )
INTEGER_LITERAL   1
LPAREN            (
IDENTIFIER        *
IDENTIFIER        n
LPAREN            (
IDENTIFIER        fact
LPAREN            (
IDENTIFIER        -
IDENTIFIER        n
INTEGER_LITERAL   1
RPAREN            )
RPAREN            )
RPAREN            )
RPAREN            )
RPAREN            )
RPAREN            )

Submitting

Export your completed Eclipse project to a zip file by right-clicking on the name of the project (YCP_Scheme) and choosing Export->General->Archive File.

Upload the zip file to the submission server as assign1. The URL of the server is

https://camel.ycp.edu:8443/

IMPORTANT: after uploading, you should download a copy of your submission and double-check it to make sure that it contains the correct files. You are responsible for making sure your submission is correct. You may receive a grade of 0 for an incorrectly submitted assignment.