\documentclass[11pt]{article}
\usepackage{geometry}
\geometry{letterpaper}
%\geometry{landscape}
\usepackage[parfill]{parskip}
\usepackage{graphicx}
\usepackage{epstopdf}
\usepackage{algorithmic}
\usepackage{algorithm}
\usepackage{cite}

% math packages
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amstext}
\usepackage{dsfont}

% some math commands
\newcommand{\T}{^T}
\newcommand{\reals}{\mathds{R}}
\newcommand{\imags}{\mathds{C}}

% some formatting commands
\newcommand{\fname}[1]{\texttt{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\server}[1]{\texttt{#1}}


\title{CME 212: Basic Statistics}
%\author{Nick Henderson}
%\date{}

\usepackage[pdftex,colorlinks]{hyperref}

\begin{document}
\maketitle


In this assignment you will implement several basic statistics functions
to compute the mean, variance, median and a histogram of a given real
number array. The due date for this assignment is February 1, 2012.

\vspace{-6pt}

\section{Assignment summary}

The major components of the assignment are

\begin{itemize}
\item \fname{stats.h} and \fname{stats.c}: header and implementation files for
  computing statistics.
\item \fname{runstats.c}: a driver file to compute statistics for input data.
\item \fname{makefile}: a GNU Make file to build the executable and compile the
  documentation.
\item \fname{git}: we expect you to use \href{http://git-scm.com/}{git} for
  version control.
\item \fname{readme.tex} and \fname{readme.pdf}: \LaTeX\ documentation for the
  assignment with answers to assignment questions.
\end{itemize}

\section{Assessment}

The assignment will be graded out of 100 points. A total of $50$ points will
be given for perfect functionality of code; $40$ points will be
awarded for
\begin{itemize}
\item design and implementation choices,
\item code readability and commenting,
\item well formatted and written readme with answers to questions,
\item and correctly following \code{make} and \code{git} guidelines.
\end{itemize}
The final $10$ points are assigned to the challenge part of the assignment.

\section{Basic statistics}

Your task is to implement code computing the mean, variance, and median of an
array of doubles.  In addition you are to construct a histogram from an array
of doubles and a given number of bins.  This section describes the function
prototypes and the histogram structure.  All function and type declarations
will reside in \fname{stats.h} with definitions in \fname{stats.c}.  Make sure
to use a proper header guard.

\subsection{Computing the mean}

\begin{verbatim}
// mean takes a pointer to an array of doubles with length n
// and returns the mean
double mean(const double *a, size_t n);
\end{verbatim}

\subsection{Computing the variance}

\begin{verbatim}
// var takes a pointer to an array of doubles with length n
// and returns the variance
// if samp_var == 0, then compute the population variance
// if samp_var != 0, then compute the sample variance
double var(const double *a, size_t n, int samp_var);
\end{verbatim}

\subsection{Computing the median}

For this part it is fine to use an algorithm from some reference
as long as that reference is cited in the readme document.

\begin{verbatim}
// median takes a pointer to an array of doubles with length n
// and stores the median value at med
// median returns 0 for any runtime exception or failure
// and 1 on successful completion
int median(const double *a, size_t n, double *med);
\end{verbatim}

\subsection{Constructing a histogram}

We will create a histogram structure to store the results of the computation.
Let's call this structure \code{hist\_t}, which is short for ``histogram type''.

\begin{verbatim}
// histogram type
typedef struct {
  size_t nbin; // number of bins
  size_t *count; // pointer to array of length nbin intended for counts
  double *bin;   // pointer to array of length nbin intended for bin centers
} hist_t;
\end{verbatim}

The histogram type contains two pointers intended for arrays that will store
the histogram data.  You will implement two simple memory management routines
for the histogram type.  The first, called \code{hist\_alloc}, will allocate
space for the histogram.  The second, called \code{hist\_free}, will free
memory associated with a histogram.

\begin{verbatim}
// hist_alloc allocates memory for the hist_t structure and corresponding
// arrays based on the input number of bins.  If successful, it returns a
// pointer to a hist_t structure.  If it is not, it returns a null pointer.
hist_t *hist_alloc(size_t nbins);

// hist_free frees memory that was allocated with hist_alloc
void hist_free(hist_t *h);
\end{verbatim}

Finally, the histogram will be computed with the \code{hist} function.

\begin{verbatim}
// hist constructs a histogram from an array of doubles of size n.  It stores
// the result in the preallocated hist_t structure pointed to by h.  The number
// of desired bins is specified by the nbin field in h (h->nbin).
// When finished, h->count contains the histogram counts and h->bin stores the
// bin centers.  The function returns 1 if successful or 0 if unsuccessful.
int hist(const double *a, size_t n, hist_t *h);
\end{verbatim}

\section{The driver file: \code{runstats.c}}

The main driver for the assignment will be called \code{runstats}.  It will be
invoked from the command line in the following manner:
\begin{verbatim}
$ ./runstats N file_name
\end{verbatim}

\subsection{Input}

\begin{itemize}
\item The first argument is an integer indicating the desired number of bins
  for the histogram.

\item The second argument is a file name of a binary file containing data of
  interest.  The first 8 bytes will store an unsigned integer indicating the
  length of the array.  Each subsequent 8 byte block will contain a 64-bit IEEE
  floating point number (a C \code{double}).  It will be a simple matter to
  read the data with \code{fread} function in \code{stdio.h}.
\end{itemize}

\subsection{Expected output}

The driver will write to standard out in the following manner:
\begin{verbatim}
$ ./runstats 5 data.dat
mean:    4.00e+00
var:     1.00e+01
median: -1.00e+00
histogram:
| bin       | count
| -1.00e+00 | 10
| -5.00e-01 | 4
|  0.00e+00 | 5
|  5.00e-01 | 2
|  1.00e+00 | 100
\end{verbatim}
The numbers shown in the above example are arbitrary.  Specifically, floating
point numbers must be aligned and printed in a 9 character field with 2
decimals of precision.  It is acceptable to use a 10 character field if the
number is large and negative (for example, \code{-4.00e+100}) even though this
will disrupt the formatting of the histogram table. In case of an unsuccessful
computation for any statistic print \code{n/a} in place of a numerical result.


\section{The \fname{makefile}}

The \fname{makefile} must satisfy the following requirements:
\begin{itemize}
\item Executing \code{make} in the directory must compile all code, produce the
  \fname{runstats} executable, and generate the \fname{readme.pdf}
  documentation.
\item Executing \code{make clean} must remove all object files, executables,
  pdfs, and \LaTeX\ temporary files.
\end{itemize}

\section{Version control}

You are expected to use \href{http://git-scm.com/}{git} for version control in
this and all subsequent programming assignments.  When you start working on the
assignment first initialize your local \code{git} repository with:
\begin{verbatim}
$ git init .
\end{verbatim}
After you create some files, add and commit them to the repository:
\begin{verbatim}
$ git add foo.c bar.c
$ git commit -m "added foo.c"
\end{verbatim}
After you make a change to a file, add and commit the changes:
\begin{verbatim}
$ emacs bar.c
$ git add bar.c
$ git commit -m "fixed nasty bug in bar.c"
\end{verbatim}
You can check the history of your work with:
\begin{verbatim}
$ git log
\end{verbatim}
We expect you to commit often as you work on the project.  It is not acceptable
to just have one commit at the completion of the assignment.

Some good references on the basic use of git are:
\begin{itemize}
\item \url{http://git-scm.com/documentation}
\item \url{http://gitref.org/}
\end{itemize}

\section{Documentation}

All of the documentation for the assignment will be presented in a
\fname{readme.tex}, which compiles to \fname{readme.pdf}.  The document shall
contain:
\begin{itemize}
\item Your name and SU-Net ID.
\item A description of each file in the directory.
\item Instructions how to build and run your code.
\item An indication of any external code you used.
\item A list of people you worked with or consulted besides course staff.
\item Answers to the questions in a following section.
\end{itemize}

\section{Questions}

\begin{description}
\item[Question 1] The function prototypes specified in this assignment use
  \code{size\_t} where it might seem natural to use \code{int}.  Why is this
  important?
\item[Question 2] Quickly describe the procedures you used to compute the
  median and construct the histogram.
\item[Question 3] The following piece of code is not safe.  Why?  Please
assume that the number of integers in array \code{a} is greater than or equal
to \code{n}.

\begin{verbatim}
void zero_array(int *a, int n) {
  for (size_t i = 0; i < n; ++i)
    a[i] = 0;
}
\end{verbatim}

\end{description}

\section{Challenge}

Challenge points will be awarded for completion of the following task:

\begin{itemize}
%\item The most common way of computing the median involves a sort, which
%  requires at least $O(n \log n)$ work.  Other algorithms are able to compute
%  the median in $O(n)$ time or to compute an estimate even faster.  Implement
%  such an algorithm and compare it's performance.  Cite and briefly describe
%  the algorithm you chose in the readme.
\item It is often desirable to produce figures from the result of a
  computation.  One way to do this is write a data file for plotting with
  another package like Matlab.  This challenge is to produce a nice figure
  directly from the C code.  There are several libraries one could use for this
  purpose. One example is \code{PLplot} (\url{http://plplot.sourceforge.net/});
  there are also many APIs available to the popular \code{gnuplot} package.
  The \server{corn.stanford.edu} system may not have the libraries you'd like
  to use and you are welcome to build and include a library archive and header
  file in your project.  
  %You are also welcome to submit code that worked on your local system, even though it will not compile and run on \server{corn}.
\item For this challenge you are to implement a function \code{plot} which
given a \code{hist\_t} type variable creates a plot of the histogram, i.e. a plot
of the bin value vs the count.  The look of the plot and any other additional 
functionality it up to you.  Your Makefile should generate an executable
called \code{plot} which runs your plot function on the input data outputting a plot image.  
You must also describe and document your method and include a plot figure 
in your readme.
\end{itemize}


\section{Submitting the assignment}

All of your files should be in your \code{CME212} work directory on
\code{corn.stanford.edu} under the subdirectory $\code{stats}$. Make sure
that
\begin{verbatim}
  $ ls -a ~/CME212/stats
\end{verbatim}
lists all of your files as well as the \code{.git} directory. Then,
\begin{enumerate}
\item Make sure your code compiles and produces an executable and a
readme document when the makefile utility is run as
\begin{verbatim}
$ make
\end{verbatim}
\item Submit your assignment via the command
\begin{verbatim}
$ /afs/ir/class/cme212/bin/submit stats
\end{verbatim}
\end{enumerate}


\end{document}

