For this homework, please label your process flow nodes with the task numbers below.
1) The American College of Surgeons keeps a tally of cancer cases at http://www.facs.org/cancer/ncdb/publicaccess.html. Save the archived 2000 and 2006 data as HTML pages to your local computer (right-click and save the files).
2) Import the HTML pages into SAS into a library called HW2, which corresponds to c:\projects\hrp223\hw2.
· Look at the imported data and tweak the length of the site variable if needed.
3) Fix the project so that the files in the library will be accessible every time the project starts.
4) Make new datasets, named year2000 and year2006, in the c:\projects\hrp223\hw2 folder, that remove the TOTAL cancer row and column from both datasets.
Make a new process flow called Analysis that does the tasks listed below. Keep in mind that after you point and click for the year 2000 you can, but don’t need to, copy, paste, and slightly tweak the code and run it on year 2006.
5) In each year, calculate the average percentage of disease in each stage by disease. That is, average down the columns the percentages provided for each type of disease.
6) Make datasets, named broken2000 and broken2006, with variables for the disease name and total percentage (using the percentages provided in the source file) where the percentages across the stages do not add up to 100%.
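The check in task 6 can be sketched as follows. This is only an illustration of the logic, written in Python rather than SAS so that the expected result is easy to verify by hand. The field names (site, pct, total_pct), the example rows, and the tolerance parameter are all assumptions and do not come from the source file.

```python
# Hypothetical rows: a disease name plus the percentages reported for stages 0-4.
rows = [
    {"site": "Example A", "pct": [10, 20, 30, 25, 15]},  # adds up to 100
    {"site": "Example B", "pct": [10, 20, 30, 25, 14]},  # adds up to 99
]

def find_broken(rows, tolerance=0):
    """Return the disease name and total percentage for rows that do not sum to 100.

    tolerance is an assumption of this sketch: the default of 0 applies the
    assignment literally, so any total other than exactly 100 is kept.
    """
    broken = []
    for row in rows:
        total = sum(row["pct"])
        if abs(total - 100) > tolerance:
            broken.append({"site": row["site"], "total_pct": total})
    return broken

print(find_broken(rows))
```

If the real percentages carry decimals, summing them may introduce tiny floating-point differences, so rounding the total before comparing it to 100 may be worth considering.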
7) Add a node with a report of the broken dataset for each year.
8) Make new datasets, named typical2000 and typical2006, which contain the year2000 and year2006 variables plus a new column called typical. Typical should hold the most common stage at presentation (ignoring ties) among stages 0 through 4. Do this for every cancer.
· For each cancer, find the maximum number of people across the five stage count variables. Save that number into a column called theMax.
· Check to see if theMax is equal to the stage 4 count. If so, typical should be equal to 4 for that cancer. Perform similar checks for the other stages. If there are ties (e.g., theMax = stage 1 count = stage 2 count), call the highest matching stage the typical stage at diagnosis.
9) To document ties, make a “binary list” variable (in the typical datasets) to indicate all instances where theMax equals the count at a stage. For example, if the maximum of the stage counts for hair cancer is 5 and there were 5 people with stage 0 hair cancer and 5 people with stage 3, the “binary list word” would be 0--3-.
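The logic of tasks 8 and 9 can be sketched as below. This is an illustration in Python rather than SAS, under the assumption that all five stage counts are present (missing counts are not handled). The function name and the counts for stages 1, 2, and 4 are hypothetical; only the stage 0 and stage 3 counts of 5 come from the hair cancer example.

```python
def typical_and_ties(counts):
    """counts holds the five stage counts; index 0 is stage 0, index 4 is stage 4."""
    the_max = max(counts)
    # Check stage 4 first and work downward, so a tie resolves to the highest stage.
    typical = next(stage for stage in range(4, -1, -1) if counts[stage] == the_max)
    # Show the stage digit where the count equals the maximum, and a dash elsewhere.
    binary_list = "".join(
        str(stage) if counts[stage] == the_max else "-" for stage in range(5)
    )
    return the_max, typical, binary_list

# Hair cancer example: stage 0 and stage 3 both have 5 people.
print(typical_and_ties([5, 2, 1, 5, 3]))  # (5, 3, '0--3-')
```

Checking from the highest stage downward is one way to get the tie rule for free; a chain of checks from stage 0 upward, where later matches overwrite earlier ones, would give the same result.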
10) Add a node (for each year) with a report showing the cancer name, the maximum count, and the binary list. The report should have labeled variables, a title, and a footnote to explain the report.
11) Make a new dataset called subjective. The dataset should contain the counts of stage 0, 1, 2, 3, and 4 cancers, but instead of displaying the counts, each cell should display the following descriptors (which are in roughly alphabetical order).
If count >= 25000 cases then display appalling
If count >= 10000 display ghastly
If count >= 5000 display horrible
If count >= 1000 display terrible
If count >= 1 display unacceptable
If count < 1 display broken
If count = missing display missing
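The descriptor rules above can be sketched as a single lookup. This is an illustration in Python rather than SAS. The function name describe is hypothetical, and None stands in for a missing count.

```python
def describe(count):
    """Map a stage count to its descriptor, using the cutoffs listed in task 11."""
    # Test for missing first; otherwise the comparisons below cannot be applied.
    if count is None:
        return "missing"
    # Work from the largest cutoff down so each count gets the first label it meets.
    if count >= 25000:
        return "appalling"
    if count >= 10000:
        return "ghastly"
    if count >= 5000:
        return "horrible"
    if count >= 1000:
        return "terrible"
    if count >= 1:
        return "unacceptable"
    return "broken"

for value in (30000, 12000, 5000, 999, 0, None):
    print(value, describe(value))
```

The order of the checks matters. In SAS a missing numeric value compares as smaller than any number, so a rule such as count < 1 would label missing counts as broken unless missing is tested before it.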