CS 340 - Assignment 2

Lexical analysis

Lexical analysis is the process of reading in the stream of characters making up the source code of a program and dividing the input into tokens.

In this assignment, you will use regular expressions to implement a lexical analyzer ("lexer") for a small programming language called YCPL.

Valid tokens in YCPL

YCPL programs are composed of the following kinds of tokens:

Token type Format
Keyword One of the strings func, if, then, else, true, false
Identifier Sequence of 1 or more of the following characters: a letter (A-Z or a-z) or one of + - * / ? = < >
Integer literal Sequence of 1 or more digits, optionally preceded by a minus sign ("-")
Left parenthesis (
Right parenthesis )
Comma ,
Assignment ::=
Left brace {
Right brace }
Semicolon ;

For example, consider the following YCPL source code:

fact ::= func(n)
{
if =(n, 1)
then 1
else *(n, fact(-(n, 1)));
};

This source code fragment consists of 34 tokens:

Order Token Token type
1 fact identifier
2 ::= assignment
3 func keyword
4 ( left parenthesis
5 n identifier
6 ) right parenthesis
7 { left brace
8 if keyword
9 = identifier
10 ( left parenthesis
11 n identifier
12 , comma
13 1 integer literal
14 ) right parenthesis
15 then keyword
16 1 integer literal
17 else keyword
18 * identifier
19 ( left parenthesis
20 n identifier
21 , comma
22 fact identifier
23 ( left parenthesis
24 - identifier
25 ( left parenthesis
26 n identifier
27 , comma
28 1 integer literal
29 ) right parenthesis
30 ) right parenthesis
31 ) right parenthesis
32 ; semicolon
33 } right brace
34 ; semicolon

Getting Started

The assignment is designed to be completed within the Eclipse Java IDE.

Download CS340_Assign2.zip.  Within Eclipse, choose File->Import...->Existing Projects into Workspace.  Click Select archive file, Browse..., choose CS340_Assign2.zip, and click Finish.  You should see a new project called ycpl in the Package Explorer.

Your Task

Your task is to define regular expressions that match each kind of token, and add them to the class LexerImpl (in LexerImpl.java), which implements the lexical analyzer for YCPL.  Each regular expression is associated with a particular kind of token.

Find the definition of the array field called TOKEN_PATTERN_LIST.  The array is initialized with a single entry:

new TokenPattern("(func|if|then|else|true|false)", TokenType.KEYWORD),

This means that when the regular expression

(func|if|then|else|true|false)

matches a sequence of characters in a YCPL source file, that character sequence is a keyword token.

You should add other TokenPatterns to the array for the other kinds of tokens supported in YCPL.  The TokenType enumeration defines the kinds of tokens.
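To illustrate how a list of (regular expression, token kind) pairs can drive a lexer, here is a minimal sketch.  It is not the code in LexerImpl.java, which may work differently.  The class MiniLexerSketch, its RULES table, and the tokenize method are hypothetical names invented for this example.  The sketch assumes that whitespace separates tokens and is skipped, and it covers only keywords, parentheses, and commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; LexerImpl in the ycpl project may work differently.
class MiniLexerSketch {
    // Each row is {regular expression, token kind}.  Only a few kinds are covered.
    static final String[][] RULES = {
        {"(func|if|then|else|true|false)", "keyword"},
        {"\\(", "left parenthesis"},
        {"\\)", "right parenthesis"},
        {",", "comma"},
    };

    // Splits the input into "text : kind" strings, trying the rules in order
    // at the current position.  Assumption: whitespace is skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            boolean matched = false;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[0]).matcher(input);
                m.region(pos, input.length());
                // lookingAt() succeeds only if the match starts at the region start.
                if (m.lookingAt()) {
                    tokens.add(m.group() + " : " + rule[1]);
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No token at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("if (true, false)")) {
            System.out.println(token);
        }
    }
}
```

Because the rules are tried in order, the position of an entry in such a table can affect which kind of token is reported when more than one pattern matches at the same position.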

Regular expressions in Java

Regular expressions in Java are more or less the same as the regular expressions we have discussed in class.  There are a few differences you need to be aware of, however.

The definitive reference on Java regular expressions is the API documentation for the java.util.regex.Pattern class.

No epsilon.  There is no explicit epsilon symbol (indicating a zero-length string) in Java regular expressions.  Instead, you can use an empty alternative with the disjunction ("|") operator.  For example, the Java regular expression

(|abc)

matches the empty string or the string abc.  Another possibility is to use the ? suffix operator, which matches zero or one occurrences of a regular expression.  For example, the following Java regular expression is equivalent to the one above:

(abc)?
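The equivalence of the two forms can be checked with a short program.  This is only a sketch: the class EpsilonDemo and its helper method are hypothetical names, and Pattern.matches tests whether the whole input string matches the regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the ycpl project.
class EpsilonDemo {
    // Returns true if the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"", "abc", "ab"}) {
            boolean emptyAlternative = matchesWhole("(|abc)", input);
            boolean optional = matchesWhole("(abc)?", input);
            // Both forms accept "" and "abc" and reject "ab".
            System.out.println("\"" + input + "\": " + emptyAlternative + " " + optional);
        }
    }
}
```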

Metacharacters must be quoted.  One thing you will note is that some of the characters that are legal in YCPL tokens are treated as operators (not literal symbols) in a regular expression.  For example, we want plus ("+") to be a legal character in an identifier, but we also want it to be available as the "one or more" repetition operator.

If you want a regular expression to match a character that would otherwise be a metacharacter (operator or other special character), it needs to be preceded by a backslash ("\").  The backslash serves to "quote" the metacharacter so it is matched literally.

Of the characters that appear in YCPL tokens, the following are metacharacters and need to be quoted by preceding them with a backslash:

+ * ? ( ) { }

Important note: in a Java string literal, you need to use two backslashes to generate a single "real" backslash.  For example, the following Java string constant is a regular expression pattern matching a single literal plus ("+") character:

"\\+"
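The difference between a quoted and an unquoted plus can be seen in a small experiment.  This is a sketch only; the class QuotingDemo and its helper method are hypothetical names, not part of the provided project.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the ycpl project.
class QuotingDemo {
    // Returns true if the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unquoted: + is the "one or more" repetition operator.
        System.out.println(matchesWhole("a+", "aaa"));   // true
        System.out.println(matchesWhole("a+", "a+"));    // false
        // Quoted: \\+ in the string literal is the regex \+, a literal plus.
        System.out.println(matchesWhole("\\+", "+"));    // true
        System.out.println(matchesWhole("a\\+", "a+"));  // true
        System.out.println(matchesWhole("a\\+", "aaa")); // false
    }
}
```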

Character ranges.  You can match a range of characters by using a character class.  For example, the following Java regular expression matches any upper-case letter:

[A-Z]

Character ranges are useful for matching letters and digits.
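As a brief illustration, the following sketch tries a few character classes against sample strings.  The class CharClassDemo and its helper method are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the ycpl project.
class CharClassDemo {
    // Returns true if the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[A-Z]", "Q"));     // true
        System.out.println(matchesWhole("[A-Z]", "q"));     // false: lower-case letter
        System.out.println(matchesWhole("[A-Za-z]", "q"));  // true: two ranges in one class
        System.out.println(matchesWhole("[0-9]+", "340"));  // true: one or more digits
    }
}
```

A character class matches exactly one character, so a repetition operator is needed to match a longer sequence.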

For a complete example of a Java regular expression (encoded in a Java string constant), see the definition of the constant called ID_CHAR_PATTERN in LexerImpl.java.  This string constant specifies a regular expression that matches any character that may be used in a YCPL identifier.

Testing

In the junit folder of the ycpl project, expand the edu.ycp.cs340.ycpl package.  Right click on LexerTest.java, and choose Run As->JUnit Test.  This runs a series of unit tests that use your lexer implementation to tokenize character strings containing YCPL tokens.

You are done when all tests pass.

Submitting

When you are done:

Export your project to a zip file.  In Eclipse, right-click on the project ycpl in the Package Explorer, and choose Export...->Archive File.  Enter the name/path of the zip file you want to save your project in.  Click Finish.

Upload your saved zip file to the Marmoset server as Project 2.