Project 3: Regular Expression Matcher

In this project you and your teammate must implement a program that performs regular expression pattern matching. The program will be invoked from the command line as follows:

progName pattern file file ...

The first argument is a regular expression pattern, and each additional argument is the name of a file. The program should match the regular expression against each line of each file and print out all of the matching lines. The output should look similar to the output generated by the grep program: if more than one file argument was specified, then each line should be preceded by the name of the file and a colon:

% myGrep StateMachine *.java
RaftServer.java:    public static interface StateMachine {
RaftServer.java:    StateMachine stateMachine;
RaftServer.java:    public RaftServer(String[] servers, int id, StateMachine stateMachine) {
ServerMain.java:    protected static class StateMachine implements RaftServer.StateMachine {
%

If only one file is specified, then matching lines should be printed with no file name prefix.
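The output rule above can be sketched as a small driver loop. This is only an illustration, not a required structure: the names format_matches and main are hypothetical, and a plain substring test stands in for the real regular expression matcher.

```python
from typing import Callable, Iterator, List


def format_matches(matches: Callable[[str], bool],
                   paths: List[str]) -> Iterator[str]:
    """Yield each matching line, prefixed with "name:" only when several files were given."""
    show_name = len(paths) > 1
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                text = line.rstrip("\n")  # match against the line without its newline
                if matches(text):
                    yield f"{path}:{text}" if show_name else text


def main(argv: List[str]) -> None:
    """argv[1] is the pattern; every later argument is a file name."""
    pattern, files = argv[1], argv[2:]
    # Placeholder predicate (an assumption): substring search instead of a regex match.
    for out in format_matches(lambda s: pattern in s, files):
        print(out)
```

A script would call main(sys.argv) after importing sys; swapping the placeholder predicate for the real matcher leaves the formatting logic unchanged.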

Regular Expression Syntax

Your matcher must implement a subset of typical regular expression syntax. The required features are summarized below; you can consult various online resources, such as Wikipedia, for additional information about regular expressions.

A regular expression consists of one or more atoms. The simplest atom is a single character, which matches that character, but there are several other atoms, including some that match multiple characters or zero characters. A pattern matches a sequence of characters if the first atom in the pattern matches the first character(s) in the sequence, and each subsequent atom matches the next character(s) in the sequence. A pattern matches a line if it matches any substring within the line (the entire pattern must be matched, but not the entire line).
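To make the "any substring" rule concrete, here is a minimal sketch that assumes a simplified pattern in which every atom consumes exactly one character. The helper names (literal, any_char, match_at, search) are hypothetical, and a full matcher would also need atoms that consume zero or several characters.

```python
from typing import Callable, List

Atom = Callable[[str], bool]  # simplification: each atom consumes exactly one character


def literal(c: str) -> Atom:
    return lambda ch: ch == c


def any_char() -> Atom:
    return lambda ch: True


def match_at(atoms: List[Atom], line: str, start: int) -> bool:
    """True if every atom matches consecutive characters beginning at `start`."""
    if start + len(atoms) > len(line):
        return False
    return all(atom(line[start + i]) for i, atom in enumerate(atoms))


def search(atoms: List[Atom], line: str) -> bool:
    """True if the whole pattern matches some substring of `line`."""
    # Try every start position, including len(line), so an empty pattern can match.
    return any(match_at(atoms, line, s) for s in range(len(line) + 1))
```

For example, search([literal("a"), any_char(), literal("c")], "xxabcxx") succeeds because the pattern matches the substring abc, even though it does not match the whole line.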

Your regular expression matcher must support the following features:

Atom    Description
^       Matches the starting position in the line (the match does not include any of the line's characters).
$       Matches the ending position in the line (the match does not include any of the line's characters).
.       Matches any single character.
[ ]     Matches a single character if that character is listed between the brackets; "-" can be used to specify a range. For example, [xy] matches x or y, and [0-9] matches any decimal digit.
[^ ]    Matches a single character as long as that character is not listed between the brackets.
+       Matches one or more consecutive occurrences of the preceding atom. For example, [0-9]+ matches any positive decimal integer.
*       Matches zero or more consecutive occurrences of the preceding atom.
?       Matches zero or one occurrence of the preceding atom. For example, -?[0-9]+ matches a positive or negative decimal integer.
( )     Delimits a subexpression; matches anything that matches the pattern between the parentheses.
|       Matches either the pattern before the | or the pattern after the |. For example, abc|def matches either abc or def.
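As one illustration of turning an atom from the table into something executable, the sketch below parses a bracket atom into a predicate on single characters. The function name parse_class is hypothetical. Details the table does not specify, such as how to include a literal "]" or treating a "-" next to a bracket as an ordinary character, are assumptions made here for the sake of the example.

```python
from typing import Callable, Set, Tuple


def parse_class(pattern: str, pos: int) -> Tuple[Callable[[str], bool], int]:
    """Parse a bracket atom whose '[' is at pattern[pos].

    Returns (predicate, index just past the closing ']').
    """
    assert pattern[pos] == "["
    pos += 1
    negated = pos < len(pattern) and pattern[pos] == "^"
    if negated:
        pos += 1
    chars: Set[str] = set()
    while pos < len(pattern) and pattern[pos] != "]":
        lo = pattern[pos]
        # Assumption: "a-z" is a range only when '-' sits between two characters.
        if pos + 2 < len(pattern) and pattern[pos + 1] == "-" and pattern[pos + 2] != "]":
            hi = pattern[pos + 2]
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            pos += 3
        else:
            chars.add(lo)
            pos += 1
    if pos >= len(pattern):
        raise ValueError("unterminated '[' in pattern")
    # Membership is inverted for [^ ] atoms.
    return (lambda ch: (ch in chars) != negated), pos + 1
```

Calling parse_class("[0-9]+", 0) returns a predicate that accepts decimal digits together with the index 5, which is where the parser would find the "+" that follows the atom.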

Two-Phase Implementation

Your regular expression matcher should work in two phases. In the first phase, it compiles the regular expression into an internal representation such as a deterministic finite automaton (DFA) or nondeterministic finite automaton (NFA). In the second phase, it uses this internal representation to quickly check each line for a match. For an introduction to techniques for implementing regular expressions, I recommend Russ Cox's document "Implementing Regular Expressions".
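The two phases might be organized along the following lines. This is a deliberately reduced sketch, not a complete solution: it handles only literal characters, ".", and the suffixes *, +, and ? applied to a single character, and it omits brackets, parentheses, |, ^, and $. The class and function names (NFA, compile_pattern, closure, line_matches) are hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

Pred = Callable[[str], bool]


class NFA:
    """Phase-one output: integer states, each with epsilon edges and at most one character edge."""

    def __init__(self) -> None:
        self.eps: List[List[int]] = []
        self.edge: List[Optional[Tuple[Pred, int]]] = []

    def new_state(self) -> int:
        self.eps.append([])
        self.edge.append(None)
        return len(self.eps) - 1


def compile_pattern(pattern: str) -> Tuple[NFA, int, int]:
    """Phase 1: build (nfa, start, accept) for literals, '.', and the suffixes * + ?."""
    nfa = NFA()
    start = last = nfa.new_state()
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c in "*+?":
            raise ValueError(f"nothing to repeat before {c!r}")
        pred: Pred = (lambda ch: True) if c == "." else (lambda ch, c=c: ch == c)
        enter, leave = nfa.new_state(), nfa.new_state()
        nfa.eps[last].append(enter)
        nfa.edge[enter] = (pred, leave)
        i += 1
        if i < len(pattern) and pattern[i] in "*+?":
            op = pattern[i]
            if op in "*?":
                nfa.eps[enter].append(leave)  # zero occurrences allowed
            if op in "*+":
                nfa.eps[leave].append(enter)  # the atom may repeat
            i += 1
        last = leave
    return nfa, start, last


def closure(nfa: NFA, states: Set[int]) -> Set[int]:
    """All states reachable from `states` through epsilon edges alone."""
    seen, stack = set(states), list(states)
    while stack:
        for t in nfa.eps[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def line_matches(compiled: Tuple[NFA, int, int], line: str) -> bool:
    """Phase 2: scan the line once, tracking the set of live NFA states."""
    nfa, start, accept = compiled
    current = closure(nfa, {start})
    if accept in current:
        return True
    for ch in line:
        nxt = {nfa.edge[s][1] for s in current
               if nfa.edge[s] is not None and nfa.edge[s][0](ch)}
        nxt.add(start)  # a match may begin at any position in the line
        current = closure(nfa, nxt)
        if accept in current:
            return True
    return False
```

The pattern is compiled once and the result is reused for every line, so the per-line cost depends only on the line length and the number of NFA states.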

Note: if you are not careful in your implementation, you can easily end up with infinite loops. For example, consider the regular expression (y?)*x. If this expression is applied to a string that doesn't contain an x, the matcher could attempt to match the expression (y?) over and over, each time choosing the "match" where no character is consumed from the input. Thus, you will need a mechanism for pruning redundant searches.
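One possible pruning mechanism, shown here for a backtracking search, is to remember every (state, position) pair that has already been explored and refuse to explore it again. The NFA below for (y?)*x is built by hand, and its state numbering and table layout are illustrative assumptions; the sketch only checks for a match beginning at one given position.

```python
from typing import Dict, List, Set, Tuple

# Hand-built NFA for (y?)*x; the state numbers are purely illustrative.
# EPS[s] lists epsilon successors; CHAR[s] is a (character, target) edge.
# States 0 -> 1 -> 2 -> 0 form a cycle that consumes no input.
EPS: Dict[int, List[int]] = {0: [1, 3], 1: [2], 2: [0]}
CHAR: Dict[int, Tuple[str, int]] = {1: ("y", 2), 3: ("x", 4)}
ACCEPT = 4


def match_from(text: str, pos: int) -> bool:
    """Backtracking search that never revisits a (state, position) pair."""
    visited: Set[Tuple[int, int]] = set()

    def explore(state: int, i: int) -> bool:
        if (state, i) in visited:
            return False  # already explored or in progress: nothing new to find
        visited.add((state, i))
        if state == ACCEPT:
            return True
        if state in CHAR and i < len(text) and text[i] == CHAR[state][0]:
            if explore(CHAR[state][1], i + 1):
                return True
        return any(explore(t, i) for t in EPS.get(state, []))

    return explore(0, pos)
```

Without the visited set, match_from("yyy", 0) would cycle through states 0, 1, and 2 at the same position indefinitely. With it, the search does at most one unit of work per (state, position) pair. The recursion depth grows with the length of the line, so an explicit stack would be a safer choice for long inputs.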

Additional Notes and Requirements

Submitting Your Project

To submit your project, push all of your changes to GitHub (on the project3 branch) and then create a pull request on GitHub. The base for the pull request should be your master branch (which has nothing on it except your initial commit) and the comparison branch should be the head of your project3 branch. Use "Project 3" as the title for your pull request. If your project is not completely functional at the time you submit, describe what is and isn't working in the comments for the pull request.

If you are planning to use late days for this project, please send me an email before the project deadline so that I know how many late days you plan to use.