For this homework please label your process flow nodes with the task numbers below.

1)      The American College of Surgeons keep a tally of cancer cases http://www.facs.org/cancer/ncdb/publicaccess.html.  Save the archived 2000 and 2006 data as html pages to your local computer (right click and save the files).

2)      Import the HTML pages into SAS into a library called HW2 which corresponds to c:\projects\hrp223\hw2.

·         Look at the imported data and tweak the length of the site variable if needed.

3)      Fix the project so that the files in library will be accessible every time the project starts.

4)      Make new datasets, named year2000 and year2006 in the c:\projects\hrp223\hw2 folder, that remove the TOTAL cancer row and column from both datasets.

HOMEWORK 2:

Make a new process flow called Analysis that does the tasks listed below.  Keep in mind that after you point and click for the year 2000 you can, but don’t need to, copy and paste and slightly tweak the code and run it on year 2006.

5)      In each year calculate the average percentage of disease in each stage by disease.   That is, average down the columns, the percentages provided for each type of disease. 

6)      Make data sets, named broken2000 and broken2006, with variables for the disease name and total percentage (using the percentages provided in the source file) where the percentages across the stages do not add up to 100%. 

7)      Add a node with that report for each year.

8)      Make new datasets, named typical2000 and typical2006, which contain the year2000 and year2006 variables plus a new column called typical. Typical should hold the most common stage at presentation (ignoring ties) among stages 0 through 4.  Do this for every cancer.

·         For each cancer, find the stage with the maximum number of people across the five stage count variables.  Save that number into a column called theMax.

·         Check to see if theMax is equal to the stage 4 count.  If so, typical should be equal to 4 for that cancer.  Perform similar checks for the other stages.  If there are ties (i.e., themax = stage 1 count = stage 2 count) you want to call the highest matching stage the typical stage at diagnosis.

9)      To document ties, make a “binary list” variable (in the typical datasets) to indicate all instances where the theMax equals the count at a stage. For example, if the maximum of the stage count for hair cancer is 5 and there were 5 people with stage 0 hair cancer and 5 people with stage 3 the “binary list word” would be 0--3-.

10)   Add a node (for each year) with a report showing the cancer name, the maximum count and the binary list.  The report should have labeled variables a title and a footnote to explain the report.

11)   Make a new dataset called subjective.  The dataset should contain the counts of stage 0, 1, 2, 3 and 4 cancers but instead of displaying the counts each cell should display the following descriptors (which are in roughly alphabetical order).

If count >= 25000 cases then display appalling

If count >= 10000 display ghastly

If count >= 5000 display horrible

If count >= 1000 display terrible

If count >= 1 unacceptable

If count < 1 display broken

If count = missing display missing

 

HOMEWORK 3:

Add a new process flow called “Analysis across years” and do the following tasks:

 

12)   Write SQL to make a dataset called everything that has one column that lists all disease sites listed once.  That is, if hair cancer is listed in 2000 and hare cancer shows up in 2006 then the everything dataset should have a record for each spelling.  If navel is in both years it should show up only once in the “everything” dataset.

 

13)   Write SQL except statements to confirm that no sites are missing from 2000 or 2006. (Hint: The idea is to subtract sites listed in each year from the everything dataset. You should end up with datasets that looks like the following for each year:

 

 

14)   Make a dataset, called Stage4, which has the percentage stage 4 diagnoses (using the values provided by the website) from both years.  The table will begin with columns for site, percentage in 2000 (named stage4per_2000), and percentage in 2006 (named stage4per_2006).   Also calculate the change in stage 4 diagnoses between the two years as another variable (named diffPercent). 

 

 

15)   Make a report that describes the distribution of change scores.  Tell the distribution analysis to include a table for basic descriptive statistics and extreme values. 

 

16)   Select the diseases that have at least a +/- 10% change in the incidence and store them in a dataset called change in the work library.

 

17)   Make a new dataset called recodePrimSite2006 that has a column called originalPrimSite which has the primSite values from the 2006 file and a new column called primSite which has the 2006 lung cancers sites recoded to just lung.  This file should have 3 lung records.

 

18)   Write SQL code to make a dataset called master that has a master list of all the diagnoses for both years (using the original 2000 file and the tweaked 2006 file recodePrimSite2006).    You should end up with a list that includes both lung and the 3 subtypes of lung.

 

19)   Make a new dataset that has all the unique diagnoses in the master list then two yes/no indicator variables to show if the diagnoses categories are present in 2000 and the tweaked 2006 file. This file should have 4 lung records.

·         Step 1 - Make the master list (Already done, from step 18)

·         Step 2 - Make a dataset called yn2000 based on hw2.year2000 which has a new column called in2000 set to “Yes”

·         Step 3 - Make a dataset called yn2006 based recodePrimSite2006 which has a new column called in2006 set to “Yes”

·         Step 4 - Make a dataset called almost that left joins the datasets (you can use the GUI or copy/paste/tweak at the code from step 7)

·         Step5 - Replace the missing yes values with no

20)   Take my code in Node 9-10 in the HW2 solution (or your version if you did it in code) which makes the binary list and shows the “Maximum number of cases by site showing ties” report and turn it into a macro called yearReport which takes one argument which is the year of the report you want run and outputs the report. Make a new program node and successfully call the macro on year 2000 and 2006.