For this homework please label your process flow nodes with the task numbers below.

1)      The American College of Surgeons keeps a tally of cancer cases at http://www.facs.org/cancer/ncdb/publicaccess.html.  Save the archived 2000 and 2006 data as HTML pages to your local computer (right click and save the files).

2)      Import the HTML pages into SAS, placing them in a library called HW2 that corresponds to c:\projects\hrp223\hw2.

·         Look at the imported data and tweak the length of the site variable if needed.

3)      Fix the project so that the files in the library will be accessible every time the project starts.

4)      Make new datasets, named year2000 and year2006, in the c:\projects\hrp223\hw2 folder.  They should exclude the TOTAL cancer row and the TOTAL column found in the imported datasets.

Make a new process flow called Analysis that does the tasks listed below.  Keep in mind that after you point and click for the year 2000 you can, but don’t need to, copy and paste and slightly tweak the code and run it on year 2006.

5)      For each year, calculate the average percentage of disease in each stage.  That is, for each stage, average down the column the percentages provided for each type of disease.
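The column-wise averaging in task 5 can be sketched outside SAS as follows.  This is only a minimal Python illustration of the arithmetic, not the Enterprise Guide task you are expected to use.  The rows, the site names and the pct0 through pct4 column names are made-up placeholders, not values or names from the NCDB files.

```python
from statistics import mean

# Hypothetical rows: one per disease, holding the percentage of cases in each stage.
rows = [
    {"site": "A", "pct0": 10.0, "pct1": 20.0, "pct2": 30.0, "pct3": 25.0, "pct4": 15.0},
    {"site": "B", "pct0": 20.0, "pct1": 40.0, "pct2": 10.0, "pct3": 15.0, "pct4": 15.0},
]

# Hypothetical column names for the five stage percentages.
STAGE_COLUMNS = ["pct0", "pct1", "pct2", "pct3", "pct4"]


def column_averages(rows):
    """Average each stage column over all diseases, skipping missing (None) values."""
    return {
        col: mean(r[col] for r in rows if r[col] is not None)
        for col in STAGE_COLUMNS
    }


averages = column_averages(rows)
```

Skipping missing values here is a choice made for the sketch; check how your SAS task treats missing percentages before relying on the same behavior.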

6)      Make datasets, named broken2000 and broken2006, that contain only the diseases whose percentages across the stages do not add up to 100%.  Include variables for the disease name and the total percentage (computed from the percentages provided in the source file).
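The check in task 6 amounts to summing across the stage percentages for each disease and keeping the rows whose total is not 100.  A minimal Python sketch of that logic follows.  The data, the column names and the small tolerance used to absorb floating-point rounding are all assumptions made for illustration.

```python
# Hypothetical column names for the five stage percentages.
STAGE_COLUMNS = ["pct0", "pct1", "pct2", "pct3", "pct4"]

# Hypothetical rows: disease A sums to 100, disease B does not.
rows = [
    {"site": "A", "pct0": 10.0, "pct1": 20.0, "pct2": 30.0, "pct3": 25.0, "pct4": 15.0},
    {"site": "B", "pct0": 20.0, "pct1": 40.0, "pct2": 10.0, "pct3": 15.0, "pct4": 13.5},
]


def find_broken(rows, tolerance=1e-9):
    """Return the disease name and total percentage for rows not summing to 100."""
    broken = []
    for row in rows:
        total = sum(row[col] for col in STAGE_COLUMNS)
        # The tolerance is an assumption: it keeps rounding noise from flagging a row.
        if abs(total - 100.0) > tolerance:
            broken.append({"site": row["site"], "total_pct": total})
    return broken


broken = find_broken(rows)
```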

7)      Add a node, for each year, with a report listing the contents of the broken dataset.

8)      Make new datasets, named typical2000 and typical2006, which contain the year2000 and year2006 variables plus a new column called typical. Typical should hold the most common stage at presentation (ignoring ties) among stages 0 through 4.  Do this for every cancer.

·         For each cancer, find the stage with the maximum number of people across the five stage count variables.  Save that number into a column called theMax.

·         Check to see if theMax is equal to the stage 4 count.  If so, typical should be equal to 4 for that cancer.  Perform similar checks for the other stages.  If there are ties (e.g., theMax = stage 1 count = stage 2 count), call the highest matching stage the typical stage at diagnosis.
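The two bullets above describe a small algorithm: find the maximum count, then find the highest stage whose count equals it.  One way to sketch that logic, in Python rather than SAS and with made-up counts, is shown here.  The function name and the list layout (position 0 holds the stage 0 count, and so on) are assumptions for illustration.

```python
def typical_stage(counts):
    """Return (theMax, typical) for a list of five stage counts, stage 0 first.

    Missing counts are represented as None and ignored when finding the maximum.
    """
    present = [c for c in counts if c is not None]
    if not present:
        return None, None

    the_max = max(present)
    typical = None
    # Walk from stage 0 upward; a later match overwrites an earlier one,
    # so a tie resolves to the highest matching stage.
    for stage, count in enumerate(counts):
        if count == the_max:
            typical = stage
    return the_max, typical


# Hypothetical counts: stages 1 and 2 tie for the maximum of 9.
result = typical_stage([5, 9, 9, 2, 1])
```

With these counts the tie between stage 1 and stage 2 resolves to stage 2, the higher of the two.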

9)      To document ties, make a “binary list” variable (in the typical datasets) to indicate all instances where theMax equals the count at a stage.  For example, if the maximum of the stage counts for hair cancer is 5, and there were 5 people with stage 0 hair cancer and 5 people with stage 3, the “binary list word” would be 0--3-.
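The binary list is built one character per stage: the stage digit where the count equals the maximum, and a dash otherwise.  A minimal Python sketch of that rule, using the hair cancer example above, follows.  The function name is hypothetical, and the counts given for stages 1, 2 and 4 are invented since the example does not state them.

```python
def binary_list(counts):
    """Build the binary list word for five stage counts, stage 0 first."""
    the_max = max(counts)
    # Emit the stage number where the count ties the maximum, otherwise a dash.
    return "".join(
        str(stage) if count == the_max else "-"
        for stage, count in enumerate(counts)
    )


# Stage 0 and stage 3 both have 5 cases; the other counts are made up.
word = binary_list([5, 1, 2, 5, 0])
```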

10)   Add a node (for each year) with a report showing the cancer name, the maximum count and the binary list.  The report should have labeled variables, a title and a footnote to explain the report.

11)   Make a new dataset called subjective.  The dataset should contain the stage 0, 1, 2, 3 and 4 count variables, but instead of displaying the count, each cell should display one of the following descriptors (which are in roughly alphabetical order).

If count >= 25000 display appalling

If count >= 10000 display ghastly

If count >= 5000 display horrible

If count >= 1000 display terrible

If count >= 1 display unacceptable

If count < 1 display broken

If count = missing display missing
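The rules above only work if they are tested in order, from the largest threshold down, so that each count lands in the first band it qualifies for.  The sketch below shows that ordering in Python, with None standing in for a missing value.  It illustrates the logic only and is not the SAS mechanism you are expected to use; the function name is hypothetical.

```python
def describe(count):
    """Map a stage count to its descriptor, testing the rules in order."""
    # Test for missing first so it is never caught by a numeric comparison.
    if count is None:
        return "missing"
    if count >= 25000:
        return "appalling"
    if count >= 10000:
        return "ghastly"
    if count >= 5000:
        return "horrible"
    if count >= 1000:
        return "terrible"
    if count >= 1:
        return "unacceptable"
    return "broken"


# Hypothetical counts for one cancer, stage 0 through stage 4.
labels = [describe(c) for c in [30000, 12000, 5000, 0, None]]
```

The position of the missing check matters in SAS too: a missing numeric value sorts below every number, so a test such as count < 1 would label a missing count as broken unless missing is handled first.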