Assignment 1: Lexical Analysis

Preliminary assignment description: may be updated

Due: Monday, September 29th by 11:59 PM

Lexical analysis

Lexical analysis is the process of reading in the stream of characters making up the source code of a program and dividing the input into tokens.

In this assignment, you will use regular expressions to implement a lexical analyzer ("lexer") for a small programming language called YCPL.

Valid tokens in YCPL

YCPL programs are composed of the following kinds of tokens:

Token type         Format
Keyword            One of the strings func, if, then, else, true, false
Identifier         Sequence of 1 or more of the following characters:
                     • letter (A-Z or a-z)
                     • - * / ? = < >
Integer literal    Sequence of 1 or more digits, optionally preceded by a minus sign ("-")
Left parenthesis   (
Right parenthesis  )
Comma              ,
Assignment         ::=
Left brace         {
Right brace        }
Semicolon          ;
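
As a rough illustration, the multi-character token formats above could be written as Java regular expressions along the following lines. This is only a sketch: the class name TokenPatternsSketch and the constant names are hypothetical and are not part of the assignment.

```java
import java.util.regex.Pattern;

// Hypothetical holder class; all names here are illustrative assumptions.
class TokenPatternsSketch {
    // Keyword: exactly one of the six reserved strings.
    static final Pattern KEYWORD = Pattern.compile("func|if|then|else|true|false");

    // Identifier: one or more letters or the characters - * / ? = < >
    // The '-' is placed last in the character class so it is taken literally.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z*/?=<>-]+");

    // Integer literal: an optional minus sign followed by one or more digits.
    static final Pattern INT_LITERAL = Pattern.compile("-?[0-9]+");

    // Assignment: the three-character sequence ::=
    static final Pattern ASSIGN = Pattern.compile("::=");

    public static void main(String[] args) {
        System.out.println(IDENTIFIER.matcher("fact").matches());  // true
        System.out.println(IDENTIFIER.matcher("=").matches());     // true
        System.out.println(IDENTIFIER.matcher("x1").matches());    // false: digits are not identifier characters
        System.out.println(INT_LITERAL.matcher("-42").matches());  // true
        System.out.println(KEYWORD.matcher("func").matches());     // true
        System.out.println(IDENTIFIER.matcher("func").matches());  // true: keywords also look like identifiers
    }
}
```

Note that every keyword also matches the identifier pattern, so a lexer built on these patterns needs some rule for preferring the keyword (for example, checking keywords first).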

For example, consider the following YCPL source code:

fact ::= func(n)
  {
    if =(n, 1)
      then 1
      else *(n, fact(-(n, 1)));
  };

This source code fragment consists of 34 tokens:

Order  Token  Token type
1      fact   identifier
2      ::=    assignment
3      func   func keyword
4      (      left parenthesis
5      n      identifier
6      )      right parenthesis
7      {      left brace
8      if     if keyword
9      =      identifier
10     (      left parenthesis
11     n      identifier
12     ,      comma
13     1      integer literal
14     )      right parenthesis
15     then   then keyword
16     1      integer literal
17     else   else keyword
18     *      identifier
19     (      left parenthesis
20     n      identifier
21     ,      comma
22     fact   identifier
23     (      left parenthesis
24     -      identifier
25     (      left parenthesis
26     n      identifier
27     ,      comma
28     1      integer literal
29     )      right parenthesis
30     )      right parenthesis
31     )      right parenthesis
32     ;      semicolon
33     }      right brace
34     ;      semicolon

Note that whitespace has no significance in YCPL other than to separate tokens. (foobar is one token, while foo bar is two tokens.)

Your Task

Your task is to write a program that reads an input text file, and constructs a list of the YCPL tokens in that file.

Assuming that the input file contains the YCPL program shown above (starting with fact ::= func(n)), the output of the program should be similar to the following:

Read which file? example.ycpl
IDENTIFIER "fact" [line 1]
ASSIGN "::=" [line 1]
FUNC "func" [line 1]
LPAREN "(" [line 1]
IDENTIFIER "n" [line 1]
RPAREN ")" [line 1]
LBRACE "{" [line 2]
IF "if" [line 3]
IDENTIFIER "=" [line 3]
LPAREN "(" [line 3]
IDENTIFIER "n" [line 3]
COMMA "," [line 3]
INT_LITERAL "1" [line 3]
RPAREN ")" [line 3]
THEN "then" [line 4]
INT_LITERAL "1" [line 4]
ELSE "else" [line 5]
IDENTIFIER "*" [line 5]
LPAREN "(" [line 5]
IDENTIFIER "n" [line 5]
COMMA "," [line 5]
IDENTIFIER "fact" [line 5]
LPAREN "(" [line 5]
IDENTIFIER "-" [line 5]
LPAREN "(" [line 5]
IDENTIFIER "n" [line 5]
COMMA "," [line 5]
INT_LITERAL "1" [line 5]
RPAREN ")" [line 5]
RPAREN ")" [line 5]
RPAREN ")" [line 5]
SEMICOLON ";" [line 5]
RBRACE "}" [line 6]
SEMICOLON ";" [line 6]
END_OF_INPUT "" [line 7]
Done reading tokens

Requirements

Your program may be written in C, C++, Java, or C#. (If you would like to use a different programming language, please discuss it with me.)

Your program should create an object (or struct instance) for each token read. Here is an example of a Token class (in Java) you could use to represent occurrences of tokens:

public class Token {
        // Kind of token (identifier, integer literal, etc.)
        private final TokenType tokenType;
        // Exact text of the token as it appeared in the input
        private final String lexeme;
        // Line of the input file on which the token appeared
        private final int lineNumber;

        public Token(TokenType tokenType, String lexeme, int lineNumber) {
                this.tokenType = tokenType;
                this.lexeme = lexeme;
                this.lineNumber = lineNumber;
        }

        public TokenType getTokenType() {
                return tokenType;
        }

        public String getLexeme() {
                return lexeme;
        }

        public int getLineNumber() {
                return lineNumber;
        }

        // Produces a line such as: IDENTIFIER "fact" [line 1]
        @Override
        public String toString() {
                return tokenType + " \"" + lexeme + "\" [line " + lineNumber + "]";
        }
}

This class assumes that there is an enumeration type called TokenType defining the different types of tokens - identifiers, integer literals, etc.
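
One possible shape for that enumeration is sketched below. The constant names are taken from the sample output shown earlier; TRUE and FALSE do not appear in that output, so those two names are assumptions.

```java
// Sketch of the enumeration the Token class assumes.
// An enum constant's default toString() is its name, which is why
// Token.toString() can print lines such as: IDENTIFIER "fact" [line 1]
enum TokenType {
    // Keywords
    FUNC, IF, THEN, ELSE,
    TRUE, FALSE,            // assumed names: not shown in the sample output
    // Identifiers and literals
    IDENTIFIER, INT_LITERAL,
    // Punctuation
    LPAREN, RPAREN, COMMA, ASSIGN, LBRACE, RBRACE, SEMICOLON,
    // Marker emitted once the input is exhausted
    END_OF_INPUT
}
```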

Hints

Define a set of regular expressions

Both Java and C# have built-in classes for matching a regular expression against a string. Using these classes will greatly simplify your task!

Built-in regular expression support classes:

In Java                  In .NET (C#)                          Purpose
java.util.regex.Pattern  System.Text.RegularExpressions.Regex  Represents a regular expression
java.util.regex.Matcher  System.Text.RegularExpressions.Match  Records the part(s) of a text string that matched a regular expression

Similar functionality is available in C and C++ through the use of libraries (for example, the POSIX <regex.h> functions in C, or std::regex in C++11 and later).
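
To make the hint concrete, here is a minimal sketch of how Pattern and Matcher could drive a scanning loop over one line of input. It is not a complete solution: it does not track line numbers or read a file, and the class name LexerSketch, the method tokenizeLine, and the "longest match wins, earlier pattern wins ties" rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; class and method names are illustrative assumptions.
class LexerSketch {
    // Patterns are tried in insertion order at each position. The longest
    // match wins; on a tie the earlier entry wins, so keywords beat identifiers.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("FUNC", Pattern.compile("func"));
        PATTERNS.put("IF", Pattern.compile("if"));
        PATTERNS.put("THEN", Pattern.compile("then"));
        PATTERNS.put("ELSE", Pattern.compile("else"));
        PATTERNS.put("TRUE", Pattern.compile("true"));    // assumed type name
        PATTERNS.put("FALSE", Pattern.compile("false"));  // assumed type name
        PATTERNS.put("INT_LITERAL", Pattern.compile("-?[0-9]+"));
        PATTERNS.put("IDENTIFIER", Pattern.compile("[A-Za-z*/?=<>-]+"));
        PATTERNS.put("ASSIGN", Pattern.compile("::="));
        PATTERNS.put("LPAREN", Pattern.compile("\\("));
        PATTERNS.put("RPAREN", Pattern.compile("\\)"));
        PATTERNS.put("COMMA", Pattern.compile(","));
        PATTERNS.put("LBRACE", Pattern.compile("\\{"));
        PATTERNS.put("RBRACE", Pattern.compile("\\}"));
        PATTERNS.put("SEMICOLON", Pattern.compile(";"));
    }

    // Splits one line into strings of the form: TYPE "lexeme"
    static List<String> tokenizeLine(String line) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            // Whitespace only separates tokens, so skip it.
            if (Character.isWhitespace(line.charAt(pos))) {
                pos++;
                continue;
            }
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
                Matcher m = entry.getValue().matcher(line);
                m.region(pos, line.length());
                // lookingAt() only succeeds if the match starts at pos.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = entry.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + line.charAt(pos) + "' at index " + pos);
            }
            tokens.add(bestType + " \"" + line.substring(pos, bestEnd) + "\"");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenizeLine("fact ::= func(n)")) {
            System.out.println(token);
        }
    }
}
```

Run on the first line of the example program, this sketch prints IDENTIFIER "fact", ASSIGN "::=", FUNC "func", LPAREN "(", IDENTIFIER "n", and RPAREN ")", one per line. Preferring the longest match means an input such as iffy is reported as a single identifier and not as the keyword if followed by fy.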

Submitting

Submit a zip file containing your complete project (all source files, along with whatever other files are needed to compile them) to the submission server as assign1. The URL of the server is

https://camel.ycp.edu:8443/