Project 3: A Simple Bash-Like Shell

In this project you and your teammate will implement a simple shell called clash ("class shell"). The behavior of clash is modeled after that of bash, but clash only supports a subset of the features in bash and the behaviors described below may not exactly duplicate bash. If the behavior for a particular situation is not defined below, you may pick any reasonable behavior.

Overview

Clash executes commands, where each command consists of a single line of text containing one or more words (you need not handle multi-line commands). Clash reads a line of input and parses it according to the syntax rules described below to produce the words for a command. Then it executes the command. Typically, the first word of a command selects an executable file; clash spawns a new process to execute that file and passes all of the words to the process as its argv arguments. Some commands are executed directly by clash (built-in commands), so they do not result in the creation of a new process. Processes can be grouped into pipelines, where data flows from one process to the next in the pipeline.

Syntax

Command separators: commands are separated by semicolons (";") and/or newline characters. The pipeline character ("|") also serves as a command separator.

Word separators: space and tab characters are treated as word separators. The characters ";", "<", ">", "|", and newline also act as word separators.

Backslash substitution: a backslash quotes the character that follows it, so that the following character becomes part of the current word and receives no special treatment; the backslash is dropped. For example, backslash followed by a space character causes the space character to appear in the current word; it does not act as a word separator. "\$" causes "$" to be inserted in the current word; it will not trigger variable substitution. It is a syntax error for a backslash to appear as the last character in a line.

Single quotes: when a single-quote character appears, all of the characters between that single-quote and the next single-quote will be incorporated literally into the current word; none of these characters will be given any special treatment (i.e. backslashes inside single-quotes do not invoke backslash substitution; they will appear in the word). The single quotes do not appear in the word.

Double quotes: when a double-quote character appears, all of the characters between that double-quote and the next unquoted double-quote will be incorporated into the current word. Command substitution, variable substitution, and backslash substitution are performed on the information between double quotes, but no other characters are treated specially. For example, spaces are not treated as word separators and ";" is not treated as a command separator. Backslash substitution between double quotes is slightly different than described above: backslash substitution is only performed if the character following the backslash is "$", "`" (backquote), double-quote, or backslash; in all other cases, the backslash and the following character are simply copied to the current word.
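
The backslash, single-quote, and double-quote rules above can be sketched as one scanning function. This is only an illustration, not a required design: the name removeQuoting is hypothetical, and the sketch assumes that variable substitution, command substitution, and separators are handled elsewhere (so "$" and "`" are simply copied through).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: applies the backslash, single-quote, and double-quote
// rules to `in` and returns the resulting text. Assumption: substitutions
// and separators are handled elsewhere, so they are not recognized here.
std::string removeQuoting(const std::string& in)
{
    std::string out;
    size_t i = 0;
    while (i < in.size()) {
        char c = in[i];
        if (c == '\\') {
            // Outside quotes a backslash quotes any following character.
            if (i + 1 >= in.size()) {
                throw std::runtime_error("backslash at end of line");
            }
            out += in[i + 1];
            i += 2;
        } else if (c == '\'') {
            // Everything up to the next single quote is literal.
            size_t end = in.find('\'', i + 1);
            if (end == std::string::npos) {
                throw std::runtime_error("unmatched single quote");
            }
            out.append(in, i + 1, end - (i + 1));
            i = end + 1;
        } else if (c == '"') {
            i++;
            while (i < in.size() && in[i] != '"') {
                if (in[i] == '\\' && i + 1 < in.size()) {
                    char next = in[i + 1];
                    // Only these four characters lose their backslash.
                    if (next != '$' && next != '`' && next != '"'
                            && next != '\\') {
                        out += '\\';
                    }
                    out += next;
                    i += 2;
                } else {
                    out += in[i++];
                }
            }
            if (i >= in.size()) {
                throw std::runtime_error("unmatched double quote");
            }
            i++;  // Skip the closing double quote.
        } else {
            out += c;
            i++;
        }
    }
    return out;
}
```

Note the asymmetry the sketch makes visible: outside quotes the backslash is always dropped, while inside double quotes it is dropped only before one of the four listed characters.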

Variable substitution: when a dollar sign appears ("$"), it triggers variable substitution. The characters following the dollar sign are treated as the name of a variable, and the value of that variable is inserted in the current word in place of the name. The variable name can take four forms:

If the character after the $ is a letter, then the variable name consists of all characters up to the first one that is neither a letter nor a digit.
If the character after the $ is a digit, then the variable name consists of all characters up to the first one that is not a digit.
If the character after the $ is one of the characters "*", "#", or "?", then the variable name is that one character.
If the character after the $ is a "{" (open brace) then the variable name consists of all the characters between that open brace and the next "}" (close brace). No substitutions are performed on the characters between the braces.

If there is no variable by the given name, then clash should behave as if the variable existed with an empty value. White space in the value of a variable (spaces, tabs, or newlines) results in additional word breaks (unless the variable substitution occurs between double quotes).
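
The forms of variable name listed above can be recognized with a small scanning helper. The following is a minimal sketch under stated assumptions: the name scanVariableName and its interface are hypothetical, and the choice to return an empty name when no valid name follows the "$" is an assumption, since that case is not defined above.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper. On entry `pos` is the index just after a "$" in
// `text`; on return it has been advanced past the variable name (and past
// the closing brace for the ${...} form). Assumption: if no valid name
// follows the "$", an empty name is returned and `pos` is left unchanged.
std::string scanVariableName(const std::string& text, size_t& pos)
{
    assert(pos <= text.size());
    if (pos == text.size()) {
        return "";
    }
    size_t start = pos;
    unsigned char first = static_cast<unsigned char>(text[pos]);
    if (std::isalpha(first)) {
        while (pos < text.size()
                && std::isalnum(static_cast<unsigned char>(text[pos]))) {
            pos++;
        }
    } else if (std::isdigit(first)) {
        while (pos < text.size()
                && std::isdigit(static_cast<unsigned char>(text[pos]))) {
            pos++;
        }
    } else if (first == '*' || first == '#' || first == '?') {
        pos++;
    } else if (first == '{') {
        // No substitutions are performed between the braces.
        size_t close = text.find('}', pos + 1);
        if (close == std::string::npos) {
            throw std::runtime_error("missing close brace");
        }
        pos = close + 1;
        return text.substr(start + 1, close - start - 1);
    }
    return text.substr(start, pos - start);
}
```

A caller would look the returned name up in its variable table and substitute an empty string if no such variable exists.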

Command substitution: when a backquote character appears ("`"), it triggers command substitution. The characters between the backquote character and the next unquoted backquote character are treated as a shell script (which can contain one or more pipelines) and executed by clash. The standard output generated by this execution is captured, the trailing newline is removed, if there is one, and the remainder is substituted into the current word in place of the command. While scanning for the terminating backquote character, no substitutions are performed except that "\`" is replaced with "`". Once the closing backquote has been found and the command has been executed, the result is copied into the current word; spaces, tabs, and newlines result in additional word breaks. This process is described in more detail below.
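
Capturing the standard output of a command substitution might look roughly like the sketch below. It is not a prescribed design: captureOutput is a hypothetical name, and the `body` callback is an assumption standing in for whatever code executes the script between the backquotes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `body` in a child process whose standard output is a pipe, and
// returns everything the child wrote, minus one trailing newline.
// Assumption: `body` stands in for executing the substituted script.
std::string captureOutput(const std::function<void()>& body)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return "";
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return "";
    }
    if (pid == 0) {
        // Child: make the write end of the pipe its standard output.
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        body();
        _exit(0);
    }
    // Parent: close the write end, or read() would never see end-of-file.
    close(fds[1]);
    std::string result;
    char buffer[4096];
    ssize_t count;
    while ((count = read(fds[0], buffer, sizeof(buffer))) > 0) {
        result.append(buffer, static_cast<size_t>(count));
    }
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    if (!result.empty() && result.back() == '\n') {
        result.pop_back();
    }
    return result;
}
```

Only the final newline is stripped here; any newlines remaining inside the result are left for the word-breaking step to handle.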

Pipelines and Redirection

If two commands are separated by "|" characters, then they form a pipeline: the standard output of the first command is written to a pipe, which serves as the standard input for the second command. A pipeline may contain any number of commands.
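
One way to wire up a pipeline with real pipes is sketched below. The names Command and runPipeline are hypothetical, error handling is minimal, and execvp (which performs its own PATH search) is used only as a stand-in for clash's own executable lookup.

```cpp
#include <cassert>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

using Command = std::vector<std::string>;  // The argv words for one stage.

// Runs all stages concurrently, connected by pipes, and returns the exit
// status of the last stage (-1 if it did not exit normally).
int runPipeline(const std::vector<Command>& commands)
{
    std::vector<pid_t> children;
    int input = -1;  // Read end of the pipe feeding the next stage.
    for (size_t i = 0; i < commands.size(); i++) {
        assert(!commands[i].empty());
        bool last = (i + 1 == commands.size());
        int fds[2] = {-1, -1};
        if (!last && pipe(fds) != 0) {
            break;  // Assumption: give up on the remaining stages.
        }
        pid_t pid = fork();
        if (pid == 0) {
            if (input != -1) {
                dup2(input, STDIN_FILENO);
                close(input);
            }
            if (!last) {
                close(fds[0]);
                dup2(fds[1], STDOUT_FILENO);
                close(fds[1]);
            }
            std::vector<char*> argv;
            for (const std::string& word : commands[i]) {
                argv.push_back(const_cast<char*>(word.c_str()));
            }
            argv.push_back(nullptr);
            execvp(argv[0], argv.data());
            _exit(127);  // Only reached if exec failed.
        }
        // Parent: close its copies so readers eventually see end-of-file.
        if (input != -1) {
            close(input);
        }
        if (!last) {
            close(fds[1]);
        }
        input = last ? -1 : fds[0];
        if (pid > 0) {
            children.push_back(pid);
        }
    }
    if (input != -1) {
        close(input);
    }
    int status = 0;
    for (pid_t pid : children) {
        waitpid(pid, &status, 0);  // The last wait is for the last stage.
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent closes every pipe descriptor as soon as the relevant children have been forked; a descriptor left open in the shell would keep a downstream command from ever seeing end-of-file.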

If an unquoted ">" character appears in a command, then the word following the ">" is treated as the name of a file and the standard output of the command will be redirected to that file. If the command is part of a pipeline, then the redirection applies to that command; this means that the standard input for the following command in the pipeline will be empty (i.e. immediate end-of-file condition). If an unquoted "<" character appears in a command, then the following word is treated as the name of a file and the standard input of the command will be read from that file. If the command is part of a pipeline, then any output from the previous pipeline stage will be discarded.

Note: file names following "<" or ">" are processed for substitutions just like normal words of a command.
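
Redirection is typically applied in the child process between fork and exec. The sketch below shows one possible shape. The function names are hypothetical, the use of empty strings to mean "no redirection" is an assumption, as are the output open flags and mode (truncate the file, create it with mode 0666), and execvp stands in for clash's own executable lookup.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper, meant to run in the child between fork and exec.
// An empty file name means "no redirection" (an assumption of this sketch).
bool applyRedirections(const std::string& inFile, const std::string& outFile)
{
    if (!inFile.empty()) {
        int fd = open(inFile.c_str(), O_RDONLY);
        if (fd < 0) {
            return false;
        }
        dup2(fd, STDIN_FILENO);
        close(fd);
    }
    if (!outFile.empty()) {
        int fd = open(outFile.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0) {
            return false;
        }
        dup2(fd, STDOUT_FILENO);
        close(fd);
    }
    return true;
}

// Runs one command with optional redirection and returns its exit status
// (-1 if the child could not be created or did not exit normally).
int runRedirected(const std::vector<std::string>& words,
        const std::string& inFile, const std::string& outFile)
{
    assert(!words.empty());
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        if (!applyRedirections(inFile, outFile)) {
            _exit(1);
        }
        std::vector<char*> argv;
        for (const std::string& word : words) {
            argv.push_back(const_cast<char*>(word.c_str()));
        }
        argv.push_back(nullptr);
        execvp(argv[0], argv.data());
        _exit(127);  // Only reached if exec failed.
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the descriptors are changed only in the child, the shell's own standard input and output are unaffected by the redirection.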

Parsing Order

The syntax rules described above are fairly simple when considered in isolation, but they combine in ways that create a lot of complexity. The exact order in which the rules are applied is crucial to the behavior of clash. Your implementation of clash must behave as if the substitutions are performed in the following order (your implementation needn't necessarily follow this order, but it must produce the same results as if it did):

Command breaking: divide the input into pipelines and commands. Also, identify I/O redirection and separate out the words containing the file names for redirection. These words are expanded and substituted using the same steps described below for the rest of the command.
A single left-to-right scan to process backslashes, single quotes, double quotes, variable substitutions, and command substitutions.
Word breaking. Since this is performed after variable and command substitution, white space introduced by these substitutions will result in new words. It is also possible for words to disappear in this step. For example, if variable x has an empty value, then the command
```
echo $x $x
```
will have only a single word ("echo"). However, a word cannot be removed if it originally contained single or double quotes. For example, the command
```
echo "" ""
```
will have three words, of which the last two will be empty. When breaking words in this step, white space that has been quoted (it appeared in single quotes or double quotes, or it was backslashed) does not trigger word breaks. For example, the command
```
echo 'a b c'
```
has only two words.
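
The word-breaking step can be sketched as follows, assuming the left-to-right scan produces a sequence of text pieces, each tagged as quoted or unquoted. The Piece type and the breakWords name are hypothetical; your design may represent the intermediate result quite differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical intermediate form produced by the left-to-right scan: a run
// of text plus a flag saying whether it came from a quoted context (single
// quotes, double quotes, or a backslash).
struct Piece {
    std::string text;
    bool quoted;
};

// Breaks the scanned pieces of one command into words.
std::vector<std::string> breakWords(const std::vector<Piece>& pieces)
{
    std::vector<std::string> words;
    std::string current;
    bool inWord = false;  // True once the current word must be emitted.
    for (const Piece& piece : pieces) {
        if (piece.quoted) {
            // Quoted text never splits, and even an empty quoted piece
            // forces a word to exist.
            current += piece.text;
            inWord = true;
            continue;
        }
        for (char c : piece.text) {
            if (c == ' ' || c == '\t' || c == '\n') {
                if (inWord) {
                    words.push_back(current);
                }
                current.clear();
                inWord = false;
            } else {
                current += c;
                inWord = true;
            }
        }
    }
    if (inWord) {
        words.push_back(current);
    }
    return words;
}
```

With x set to ' a b ', the pieces for .$x. are all unquoted and break into four words, while the pieces for ."$x". contain one quoted piece and stay together as a single word.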

Command Execution

Once a command has been parsed into words and all of the substitutions and expansions have been performed, the command is executed. In the normal case this is done by creating one subprocess for each command, using fork and exec. If the first word of a command contains a "/" character, then the contents of that word are used as the name of the executable file; otherwise, the PATH variable is used to select the executable as described below. Between the calls to fork and exec for each subprocess, you must modify file descriptors for the child in order to implement I/O redirection and pipelines. The environment variables passed to each child must reflect clash's variable bindings, as discussed below.

There are two situations in which clash executes a command directly, rather than spawning a subprocess: variable assignments and built-in commands. If a command contains a single word, and if the word starts with a valid variable name (an initial letter followed by any number of letters or digits) followed by "=", then the text following the "=" is used as the new value for the given variable. In this case, clash performs the variable assignment instead of creating a subprocess. All the usual substitutions are performed before executing a variable assignment. For example,

```
output="`grep auto raftServer.cc`"
```

will set the value of the variable output to the standard output produced by the grep command.
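
Recognizing a variable assignment can be sketched as below. The name parseAssignment is hypothetical, and the sketch assumes the caller has already determined that the command consists of a single word.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns true and fills in `name` and `value` if
// `word` has the form NAME=VALUE, where NAME is a letter followed by any
// number of letters or digits. Otherwise returns false and leaves the
// output arguments untouched.
bool parseAssignment(const std::string& word, std::string& name,
        std::string& value)
{
    size_t eq = word.find('=');
    if (eq == std::string::npos || eq == 0) {
        return false;
    }
    if (!std::isalpha(static_cast<unsigned char>(word[0]))) {
        return false;
    }
    for (size_t i = 1; i < eq; i++) {
        if (!std::isalnum(static_cast<unsigned char>(word[i]))) {
            return false;
        }
    }
    name = word.substr(0, eq);
    value = word.substr(eq + 1);  // May be empty, and may contain "=".
    return true;
}
```

Only the first "=" separates the name from the value, so a word such as a1=b=c assigns the value b=c to a1.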

In addition to variable assignments, clash supports a collection of built-in commands, which it executes directly. Here are the built-in commands you must support:

cd dirName
Changes the current working directory of the shell, using dirName as the path to the new working directory. If there is no dirName argument, you may treat this as a syntax error.
exit status
Causes clash to exit, returning status as its exit status. If no status argument is provided, then the exit status will be 0.
export varName varName varName ...
Each argument is the name of a variable. If a variable exists by that name, it is marked for export, which means that it will be passed to subprocesses as an environment variable.
unset varName varName varName ...
Each argument is the name of a variable. If a variable exists by that name, it is deleted.

Reading Commands

If clash is invoked with no command-line arguments, it reads commands from its standard input, which can be either a terminal or a file. If standard input is a terminal (which can be determined with the isatty function), then clash outputs a "% " prompt before reading each line of input. Each line of input can be parsed and executed separately: it contains either a single command, a pipeline, or a collection of commands or pipelines separated by semicolons. You do not need to handle multi-line commands such as

```
echo x y "abc
def"
```

In this case, clash should generate a syntax error for the first line since the double-quote is not closed.

If the standard input for clash is not a terminal, clash still reads commands from its standard input and processes them as described above. However, in this case clash does not output prompts.
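
The read loop described above might be structured along these lines. This is a sketch only: commandLoop and readFromStandardInput are hypothetical names, and the `execute` callback is an assumption standing in for the code that parses and runs one line.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <unistd.h>

// Reads lines from `in` until end-of-file and hands each one to `execute`.
// When `interactive` is true, a "% " prompt is written to `out` before
// each read. All names here are hypothetical.
void commandLoop(std::istream& in, std::ostream& out, bool interactive,
        const std::function<void(const std::string&)>& execute)
{
    std::string line;
    while (true) {
        if (interactive) {
            out << "% " << std::flush;
        }
        if (!std::getline(in, line)) {
            break;
        }
        execute(line);
    }
}

// The no-argument case: prompt only if standard input is a terminal.
void readFromStandardInput(
        const std::function<void(const std::string&)>& execute)
{
    commandLoop(std::cin, std::cout, isatty(STDIN_FILENO) != 0, execute);
}
```

Passing the streams as parameters is a design choice of the sketch: the same loop can then read from a file, and it can be exercised in a unit test with std::istringstream and std::ostringstream.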

Command-Line Arguments

If clash is invoked with command-line arguments, then it doesn't read commands from its standard input. If the first command-line argument is "-c", then the second argument is a shell script (one or more commands separated by semicolons). In this case, clash executes that script and then exits.

If the first command-line argument is not "-c", then it must be the name of a file. In this case, clash reads commands from that file instead of standard input; clash exits when it reaches the end of the file.

Variables and Environment Variables

Clash maintains a table of variable bindings, which are used for variable substitutions. Each variable has a string name and a string value. The following variable names have special meanings defined by clash:

PATH: a list of directories in which to search for executables; see below for more information.
$0: if clash is invoked with no arguments, this is the value of its argv[0] (the program name under which clash was invoked). If clash is invoked with the "-c" option, this is the argument just after the argument containing the script, if there is one (if there are no arguments after the script, then $0 holds argv[0]). If clash is invoked with the name of a file to read commands from, $0 holds that file name.
$1, $2, etc.: the command-line arguments following $0.
$#: the number of command-line arguments after $0.
$*: the values of all the command-line arguments after $0, concatenated together and separated by spaces.
$?: the exit status returned by the last command; 0 indicates a normal return. In the case of a pipeline, $? reflects the exit status of the last command in the pipeline.
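
The rules for $0, $1, $#, and $* can be summarized in a short sketch that classifies the command line. The names Mode, Invocation, parseInvocation, and starValue are hypothetical, and treating a missing script after "-c" as an empty script is an assumption, since that case is not defined above.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Mode { Stdin, Script, File };

// Hypothetical summary of how clash was invoked.
struct Invocation {
    Mode mode;
    std::string source;             // Script text or file name, if any.
    std::string name;               // Value of $0.
    std::vector<std::string> args;  // Values of $1, $2, ...
};

// `argv` holds the command-line words of clash itself, including argv[0].
Invocation parseInvocation(const std::vector<std::string>& argv)
{
    assert(!argv.empty());
    Invocation inv{Mode::Stdin, "", argv[0], {}};
    if (argv.size() < 2) {
        return inv;
    }
    size_t firstArg;
    if (argv[1] == "-c") {
        inv.mode = Mode::Script;
        // Assumption: a missing script after "-c" is treated as empty.
        if (argv.size() > 2) {
            inv.source = argv[2];
        }
        if (argv.size() > 3) {
            inv.name = argv[3];  // Otherwise $0 keeps argv[0].
        }
        firstArg = 4;
    } else {
        inv.mode = Mode::File;
        inv.source = argv[1];
        inv.name = argv[1];
        firstArg = 2;
    }
    for (size_t i = firstArg; i < argv.size(); i++) {
        inv.args.push_back(argv[i]);
    }
    return inv;
}

// Value of $*: the arguments after $0, separated by single spaces.
std::string starValue(const Invocation& inv)
{
    std::string result;
    for (size_t i = 0; i < inv.args.size(); i++) {
        if (i > 0) {
            result += ' ';
        }
        result += inv.args[i];
    }
    return result;
}
```

In this representation the value of $# is simply inv.args.size().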

When clash starts, it creates one variable for each environment variable received from its parent. When clash creates subprocesses, it passes some of its variables to the subprocess as environment variables. These variables are referred to as exported. When clash creates initial variable bindings from its environment, it marks each of these variables as exported. The export built-in command can be used to mark additional variables as exported.
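
A variable table with an export flag could be sketched as follows. The class name, its methods, and the choice of std::map are all hypothetical; the sketch only illustrates the bookkeeping described above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Variable {
    std::string value;
    bool exported;
};

// Hypothetical variable table; names and interface are illustrative only.
class VariableTable {
public:
    // Creates one exported variable per "NAME=value" entry in a
    // null-terminated environment array such as the one passed to main.
    void importEnvironment(char* const* envp)
    {
        for (; *envp != nullptr; envp++) {
            std::string entry(*envp);
            size_t eq = entry.find('=');
            if (eq == std::string::npos) {
                continue;
            }
            Variable& var = vars_[entry.substr(0, eq)];
            var.value = entry.substr(eq + 1);
            var.exported = true;
        }
    }

    // Sets a value; a new variable starts out unexported, and an existing
    // variable keeps its export flag.
    void set(const std::string& name, const std::string& value)
    {
        vars_[name].value = value;
    }

    // A missing variable behaves as if it had an empty value.
    std::string get(const std::string& name) const
    {
        auto it = vars_.find(name);
        return (it == vars_.end()) ? std::string() : it->second.value;
    }

    void markExported(const std::string& name)
    {
        auto it = vars_.find(name);
        if (it != vars_.end()) {
            it->second.exported = true;
        }
    }

    void unset(const std::string& name) { vars_.erase(name); }

    // Returns "NAME=value" strings for the exported variables only.
    std::vector<std::string> environment() const
    {
        std::vector<std::string> env;
        for (const auto& entry : vars_) {
            if (entry.second.exported) {
                env.push_back(entry.first + "=" + entry.second.value);
            }
        }
        return env;
    }

private:
    std::map<std::string, Variable> vars_;
};
```

To call execve, the strings returned by environment() would still need to be converted into a null-terminated array of char pointers.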

The PATH Variable and its Cache

The PATH variable is used by clash to locate executables for commands. It consists of any number of directory names separated by colons. To execute a command, clash searches each of the directories in the PATH variable to see if it contains a file whose name is the same as the first word of the command. If so, the first matching file is used.

However, searching through all the directories in the PATH variable is too expensive to perform for each command execution. Instead, you must keep a cache of mappings from command names to executable file names. You must ensure that the cache is updated whenever the PATH variable changes; you do not need to update the cache when files are created or deleted in directories in the path.

If no PATH variable exists, then use "/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin" as the default.

Note: the PATH mechanism is only used if the first word of a command contains no "/" characters. If the first word contains a "/" character, then the word is used directly as the name of the executable, bypassing the PATH mechanism.
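
Building the cache might look like the sketch below, which scans each directory once and records the first match for every command name. PathCache and buildPathCache are hypothetical names; skipping empty PATH entries and requiring a regular file are assumptions of the sketch. Rebuilding the cache when PATH changes is left to the caller.

```cpp
#include <cassert>
#include <dirent.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>
#include <unordered_map>

// Maps a command name to the full path of the executable that runs it.
using PathCache = std::unordered_map<std::string, std::string>;

// Hypothetical helper: scans the colon-separated directories in `path`
// in order and records each executable regular file it finds.
PathCache buildPathCache(const std::string& path)
{
    PathCache cache;
    size_t start = 0;
    while (start <= path.size()) {
        size_t end = path.find(':', start);
        if (end == std::string::npos) {
            end = path.size();
        }
        std::string dir = path.substr(start, end - start);
        start = end + 1;
        if (dir.empty()) {
            continue;  // Assumption: empty entries are skipped.
        }
        DIR* handle = opendir(dir.c_str());
        if (handle == nullptr) {
            continue;  // Unreadable or missing directories are ignored.
        }
        while (struct dirent* entry = readdir(handle)) {
            std::string full = dir + "/" + entry->d_name;
            struct stat info;
            if (stat(full.c_str(), &info) == 0 && S_ISREG(info.st_mode)
                    && access(full.c_str(), X_OK) == 0) {
                // emplace does not overwrite, so the first directory in
                // PATH that contains a name wins.
                cache.emplace(entry->d_name, full);
            }
        }
        closedir(handle);
    }
    return cache;
}
```

A lookup is then a single hash-table probe; a command name that is absent from the cache has no executable in any PATH directory as of the last rebuild.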

Exit Status for Clash

If clash encounters an end-of-file condition on its standard input at a point where it has no partially-assembled command, then it exits. It returns the value of the $? variable (the exit status of the most recently executed command) as its exit status, unless the exit is caused by the exit built-in command. In that case, the exit status is determined by the argument to exit.

Error Handling

There are many error conditions that can occur in clash, such as trying to "cd" to a non-existent directory, or the presence of a ">" without a file to redirect to. You must detect these errors and handle them appropriately. In most cases, the right behavior is to abort the current command, print an appropriate message, and continue processing more commands.

Additional Notes and Requirements

You must implement the project in C++.
I will create a private GitHub repository for your team to use and send you information about this repository. Create a branch in this repo named project3 and use this branch for all of your work on the project.
The parsing rules for clash are complex and intertwined; your challenge for this project is to find a clean way of decomposing the implementation. This project is particularly prone to special cases and information leakage; try to minimize these.
Because of the complexity of the shell syntax, it's unlikely you will be able to get the design right on the first try: you probably won't see all of the issues until you begin to write code. I suggest that you plan on dividing your design time: spend some time up front, but plan on a major redesign phase after you start coding.
You must write your own parser; you may not use a parser generator or any existing libraries to help with parsing. In general, you should be building from scratch, but you may use any of the C++ std:: classes as well as C library functions such as those listed below. If you have any questions about what existing packages it is OK to use, please check with me.
When implementing pipelines, you must use real pipes so that all the commands in the pipeline run simultaneously and communicate directly with each other using pipes. No stdin/stdout/stderr data should pass through the shell, except during command substitution.
During variable substitution, the contents of a variable should not be processed for backslashes, back-quotes, etc. The only parsing that occurs on variable contents is additional word breaks from white space.
Here are some functions that you will probably find useful in your implementation:
- isatty: determine whether a file descriptor is associated with a console terminal.
- fork, execve, waitpid: create subprocesses and wait for them to complete.
- opendir, readdir, closedir: read the list of file names in a given directory.
- access: determine whether a file is executable.
- open, dup, close, pipe: useful for implementing I/O redirection.
For at least one non-trivial source file, write the top-level declarations and interface comments before you fill in any of the method bodies. Create a commit on the project3 branch whose only change is the skeletal version of this file, and tag that commit commentsBeforeCode3. Make sure that the commit message also includes the name of this file. I recommend that you write comments before code for all your files, but I will only require it for one file.
As in the past, please use 4-space indents and keep lines to no more than 80 characters in length.
Use the compiler options "-Wall -Werror", which cause the compiler to check a variety of things more strictly and refuse to compile your program until you fix the issues. For standard Makefiles, you can do this by adding the following line to your Makefile:
```
CFLAGS += -Wall -Werror
```
The repo for this project will contain a file words.py. This script will print the elements of its argv, so that you can see exactly what is in each word. You may find this useful during testing.

Parsing Examples

The table below provides examples that illustrate some of the parsing rules. Be sure to consider these cases as you are designing your clash implementation. The first column shows one or more commands and the second column shows the output that should be generated as a result. Many of the examples use the program words.py discussed above, which just prints out the values of all its argument words except $0.

Command	Output	Feature(s) exercised
x=abc; words.py $x "$x" '$x' "\$x"	$1: abc	Variable substitution and quotes
	$2: abc
	$3: $x
	$4: $x

x=foo; echo file1 > zfoo.txt
cat < z$x.txt	file1	Variable in file name for redirection

y='a\nb'; words.py \"$y\"	$1: "a\nb"	Backslash in variable value

x=' a b '; words.py .$x.	$1: .	Whitespace in variables
	$2: a
	$3: b
	$4: .

x=' a '; words.py $x$x	$1: a	Whitespace in variables
	$2: a

x=' a b '; words.py ."$x".	$1: . a b .	Whitespace in variables

x=''; words.py $x $x		Empty variable

words.py "a `echo x y` \$x"	$1: a x y $x	Double quotes

x=""; words.py "" $x""	$1:	Empty double quotes
	$2:

x=abc; words.py '$x `echo z`'	$1: $x `echo z`	Single quotes

words.py `echo a; echo b c`d	$1: a	Command substitution
	$2: b
	$3: cd

x=abc; words.py 1"$x"2'$x'3`echo foo`	$1: 1abc2$x3foo	Mixed substitutions

echo>foo abc; cat foo	abc	Redirection (no spaces)

words.py "<"'>'\< `echo \<`	$1: <><	Quoted redirection
	$2: <

x=\;; words.py "a$x b; c\|d"	$1: a; b; c\|d	Quoted command separators

Submitting Your Project

To submit your project, push all of your changes to GitHub (on the project3 branch) and then create a pull request on GitHub. The base for the pull request should be your master branch (which has nothing on it except your initial commit) and the comparison branch should be the head of your project3 branch. Use "Project 3" as the title for your pull request. If your project is not completely functional at the time you submit, describe what is and isn't working in the comments for the pull request.

If you are planning to use late days for this project (or any project) please send me an email before the project deadline so that I know how many late days you plan to use.

CS 190: Software Design Studio (Winter 2021)