For
this homework please label your process flow nodes with the task numbers below.
1)
The
American College of Surgeons keep a tally of cancer cases http://www.facs.org/cancer/ncdb/publicaccess.html. Save the archived 2000
and 2006
data as html pages to your local computer (right click and save the files).
2)
Import
the HTML pages into SAS into a library called HW2 which corresponds to
c:\projects\hrp223\hw2.
·
Look
at the imported data and tweak the length of the site variable if needed.
3)
Fix
the project so that the files in library will be accessible every time the
project starts.
4)
Make
new datasets, named year2000 and year2006 in the c:\projects\hrp223\hw2 folder,
that remove the TOTAL cancer row and column from both datasets.
HOMEWORK 2:
Make a new process flow called Analysis that does the tasks
listed below. Keep in mind that after
you point and click for the year 2000 you can, but don’t need to, copy and
paste and slightly tweak the code and run it on year 2006.
5)
In
each year calculate the average percentage of disease in each stage by
disease. That is, average down the
columns, the percentages provided for each type of disease.
6)
Make
data sets, named broken2000 and broken2006, with variables for the disease name
and total percentage (using the percentages provided in the source file) where
the percentages across the stages do not add up to 100%.
7)
Add
a node with that report for each year.
8)
Make
new datasets, named typical2000 and typical2006, which contain the year2000 and
year2006 variables plus a new column called typical.
Typical should hold the most common
stage at presentation (ignoring ties) among stages 0 through 4. Do this for every cancer.
·
For
each cancer, find the stage with the maximum number of people across the five
stage count variables. Save that number
into a column called theMax.
·
Check
to see if theMax
is equal to the stage 4 count. If so, typical should be equal to 4 for that
cancer. Perform similar checks for the
other stages. If there are ties (i.e., themax = stage 1 count = stage 2 count) you want to call
the highest matching stage the typical stage at diagnosis.
9)
To
document ties, make a “binary list” variable (in the typical datasets) to
indicate all instances where the theMax equals the
count at a stage. For example, if the maximum of the stage count for hair
cancer is 5 and there were 5 people with stage 0 hair cancer and 5 people with
stage 3 the “binary list word” would be 0--3-.
10)
Add
a node (for each year) with a report showing the cancer name, the maximum count
and the binary list. The report should
have labeled variables a title and a footnote to explain the report.
11)
Make
a new dataset called subjective. The
dataset should contain the counts of stage 0, 1, 2, 3 and 4 cancers but instead
of displaying the counts each cell should display the following descriptors
(which are in roughly alphabetical order).
If count >= 25000 cases then display appalling
If count >= 10000 display ghastly
If count >= 5000 display horrible
If count >= 1000 display terrible
If count >= 1 unacceptable
If count < 1 display broken
If count = missing display missing
HOMEWORK 3:
Add a new process flow called “Analysis across years” and do the following tasks:
12)
Write
SQL to make a dataset called everything
that has one column that lists all disease sites listed once. That is, if hair cancer is listed in 2000 and
hare cancer shows up in 2006 then the everything dataset should have a record for each
spelling. If navel is in both years it
should show up only once in the “everything”
dataset.

13)
Write
SQL except statements to confirm that no sites are missing from 2000 or 2006.
(Hint: The idea is to subtract sites listed in each year from the everything
dataset. You should end up with datasets that looks like the following for each
year:

14)
Make
a dataset, called Stage4, which has the percentage stage 4 diagnoses (using the
values provided by the website) from both years. The table will begin
with columns for site, percentage in 2000 (named stage4per_2000), and
percentage in 2006 (named stage4per_2006).
Also calculate the change in stage 4 diagnoses between the two years as
another variable (named diffPercent).

15)
Make
a report that describes the distribution of change scores. Tell the
distribution analysis to include a table for basic descriptive statistics and
extreme values.

16)
Select
the diseases that have at least a +/- 10% change in the incidence and store
them in a dataset called change in
the work library.

17)
Make
a new dataset called recodePrimSite2006 that has a column called originalPrimSite which has the primSite
values from the 2006 file and a new column called primSite
which has the 2006 lung cancers sites recoded to just lung. This file should have 3 lung records.

18)
Write
SQL code to make a dataset called master that has a master list of all the
diagnoses for both years (using the original 2000 file and the tweaked 2006
file recodePrimSite2006). You
should end up with a list that includes both lung and the 3 subtypes of lung.

19)
Make
a new dataset that has all the unique diagnoses in the master list then two
yes/no indicator variables to show if the diagnoses categories are present in
2000 and the tweaked 2006 file. This file should have 4 lung records.
·
Step
1 - Make the master list (Already done, from step 18)
·
Step
2 - Make a dataset called yn2000 based on hw2.year2000 which has a new
column called in2000 set to “Yes”
·
Step
3 - Make a dataset called yn2006 based recodePrimSite2006 which has a
new column called in2006 set to “Yes”
·
Step
4 - Make a dataset called almost that
left joins the datasets (you can use the GUI or copy/paste/tweak at the code
from step 7)
·
Step5
- Replace the missing yes values with no

20)
Take
my code in Node 9-10 in the HW2 solution (or your version if you did it in
code) which makes the binary list and shows the “Maximum number of cases
by site showing ties” report and
turn it into a macro called yearReport which takes
one argument which is the year of the report you want run and outputs the
report. Make a new program node and successfully call the macro on year 2000
and 2006.