HRP 223 - Data Management and Statistical Programming - 2012/2013 Edition

 

The materials on this course website are only for the use of students enrolled in this course for purposes associated with this course and may not be retained or further disseminated. The materials on this course website may be protected by copyright; any further use of this material may be in violation of federal copyright law.

 

The goal of the course is to provide hands on instruction in data management and analysis techniques.
Topics discussed include:

  1. Working with large databases - what makes a good database turn bad
  2. Data cleaning techniques
  3. Generating numerical and graphical presentations
  4. Descriptive statistics

Contact information

Professor

Teaching Assistant

Raymond R. Balise 
Redwood Bldg. T213D, MC 5092 
Stanford, California  94305-5405 

Balise at Stanford 
Voice (650) 724-2602 
Fax (650) 725-6951

Mike Hurley

 

 

hrp223 decimal 2012 at gmail 

Prerequisites

Admission to Health Research and Policy and a comfortable knowledge of a Windows XP/Vista/7.

Lectures                                                                                                             

Monday and Wednesday 11:30 ֠1:00 in LK203/204 starting September 24th.

Office Hours

By appointment in Redwood Building T213D. Directions can be found here: www.stanford.edu/~balise/FindBalise.htm

Newsgroup

The class will use a Google group.

Readings

The Little SAS Book for Enterprise Guide 4.2.

Common Statistical Methods for Clinical Research with SAS Examples

 

Read about SQL vs. data step code here:

http://www2.sas.com/proceedings/sugi31/250-31.pdf

 

Read about numeric precision here:

http://support.sas.com/techsup/technote/ts654.pdf.

 

Grading

Grades will be based on four homework problem sets.   If you take the course for 2 units you must pass at least three of the four homework assignments and you must not violate the virus policy below.  If you take the course for 3 units you must pass all four assignments.  There will be many quick assignments that will not directly affect grades.

Turning in Homework and Viruses

All assignments and homework will be submitted via email to balise at Stanford and hrp223.2012 at gmail. Any student that sends me or the TA a virus (or any other malicious code) will fail the course.  There will be no exceptions made.  Therefore, you are strongly advised to download the latest version of the Sophos Anti-Virus software. If you need virus protection check here http://www.stanford.edu/services/ess/ and you can download the software for free. If you have any questions ask!

Late policy

Each of the assignments will be due at the beginning of class on the day specified.

That said, there are unforeseen emergencies (illness, bike accidents, disk crashes, network troubles, childbirth, etc.). Instead of having to ask for special allowances on an individual basis, I give each of you the privilege of granting yourself a small extension in case of crisis. You will have two late days which you may use to extend the due dates of any assignments without penalty. To avoid any ambiguity, there are seven days in a week and each day ends at 5:00 PM. Thus, if your assignment was due on Wednesday but turned in the following Monday before 5:00, that assignment would be five days late. After the grace period is up each assignment is down weighted 20% per day.  In all cases, assignments will not be accepted more than one calendar week after the original assignment due date.

Computer Platforms

The programs that you turn in must run on Windows SAS 9.3 and/or Enterprise Guide 4.305.  I can provide good support for Windows or a Mac running parallels (http://www.parallels.com/).

Setting up and configuring SAS and SAS Enterprise Guide:

Pictures showing my SAS 9.3 TS1M1 and EG 5.1 install can be found here.

Notes on configuring SAS Enterprise Guide 5.1 can be found here.

Notes on using an IRT Server for SAS can be found here (updated 2012-10-09). Be sure to set up you EG using the notes here.

 

HRP 223 the movie!

You can see the class in color by clicking here.

Core Lecture Material

Note: All dates are approximate and subject to change except HW 4.

Preliminaries of Computing at Stanford and an Introduction to SAS Enterprise Guide (Sept 24st)

Reading: The tutorials A, B, and C in TLSB for EG.

 

Software somebody should have told you about a long time ago

Essential Stanford Software (free stuff)

Tools of the trade

Excel

REDCap

SAS or SAS/Enterprise Guide

R and R Commander

Other software I use (not officially endorsed by anybody and ask your IT security person for help)

CrashPlan

Notepad++

UltraEdit

UltraCompare

FileLocator Pro

Google sites

Using SAS Enterprise Guide as a calculator

 

The PowerPoint slides are here or a pdf version is here.

Assignment 0 is here. It should be done before class on September 26th.

Data and Data Collection (Sept 26th ֠Oct 1st)

Using your UNIX space

How

What is a database?

Critical registry teaks for Excel

How to organize data in Excel

            Variable names

            Dummy records

Using Excel

            Making tables

            Validation

            Formulas

            Quick counts

Collecting data in general

            How to score and store answers

            The value/danger of redundancy

            Using REDCap

                        What is it?

                        How to set it up.

 

The PowerPoint slides are here or a .pdf version is here.

The sample .htaccess file is here. (To save the file you may need to right click on the hyperlink and choose something like save Target as or save Link as …)

An Enterprise Guide project to easily load data from REDCap is here.

 Assignment 1 is here . It should be done before class on Oct 1st.

Assignment 2 is here. It should be done before class on Oct 3rd.

Working with Data in Windows (Oct 3rd - Oct 8th)

  Reading: Chapters 1, 2 and 4 in TLSB for EG. Skim chapter 3.

                       

How SAS works in Windows

Types of Files / Suffixes

Libraries vs Folders

            Loading data from Excel

            Simple subsets

            Creating new variables

            Converting character variables to numeric with the input function

            Basic numeric summaries

Making better summary tables

 

The PowerPoint slides are here or a pdf version is here.

An Excel 2007 spreadsheet showing mixed type issues is here and the EG project is here.

An Excel 2007 spreadsheet with fake drowning data is here and the EG project is here.

 

Homework 1 is here and is due Oct 15th.

Windows and EG configuration (Oct 10th)

            Setting up Windows

                        File extensions

                        Folders

                        Office

                        DOS   

            Configuring SAS Enterprise Guide

                        Application options

                        Keyboard macros

            Using SAS

                        Libraries

                        Subsets

                        Toy data sets

                        Loading real data sets

                        Simple descriptive statistics and graphics

           

The PowerPoint slides are here.

Beginning Data Manipulation (Oct 10th-Oct 17th)

Reading: TLSB for EG 4.2 chapter 4.

 

Procedures vs Functions

Code for data step

            Labels

            Formats

Code for SQL

            Labels

            Formats

GUI for SQL

Process flow sheets and project organization

Autoexec flow sheets

Creating Formats with the GUI

Data step if-else

            Data step select-when

            SQL case-when-then

            Showing combinations

 

The slides are here and the completed project is here.  (Open the project after the class.) 

Inventing and tweaking data (Oct 17th)

Reading: TLSB for EG 4.2 chapters 3, 4 and 9.

 

Creating data

Recoding values to make data

Project organization

Fixing bad values

            Using formats

 

The PowerPoint slides here.

An Excel spreadsheet is here and an EG project is here.

Assignment 3 is here.

Homework 2 is here and a solution is here.

Merging with SQL (Oct 29th)

Reading: Tutorial D TLSB for EG 4.2 chapters 4 and 6.

 

SQL joins

            Inner joins

            Left joins

The coalesce() and coalescec() functions

 

The slides are here.

An Enterprise Guide project that uses SQL to merge files is here.

Assignment 4 is here.

Lab 2 (Oct 31st)

Reading: Tutorial D TLSB for EG 4.2 and chapters 4 and 6.

           

Joins

            Summary functions with SQL

            Fixing bad values

                        By using functions like left

                        By recoding

Finding discrepancies when matches fail

 

The tasks for today are here.

More SQL and querying voodoo is covered in a project here.

This assignment uses information in an Excel workbook that can be found here.

The partially complete project is here.

Combine with SQL (Nov 7th)

Reading: Tutorial D TLSB for EG 4.2 and chapter 4 and 6.

 

            SQL to select subsets of records from multiple tables

                        Outer union and/or outer union corresponding ֠concatenate

Union ֠unique rows from both tables

Except ֠rows that are part of the first query

Intersect ֠rows common to both queries

 

The slides are not here.

The Enterprise Guide file for today is here.

Macro Language (Nov 7th)

EG prompts

macro variables

            %macro %mend

            Examples of macros

                        Dot/bar plot

                        Quantile plot

                        Calculating sensitivity, specificity, PPV, NPV

                        Binomial probabilities

                        Checking survival data

             

The slides are here.

An Enterprise Guide project showing macros is here.

An Enterprise Guide project showing macros used to make survival analysis datasets is here.

Homework 3 is here and is due before class on Nov 28th

Idiosyncratic SAS stuff like formats and Grouped Data (Nov 26th)

            proc format library = blah

options fmtsearch = (blah work);

proc cport memtype = all;

proc cimport

options nofmterr;

By processing

            Retain statements

            Finding the labs closest to an event

           

The slides are here.

An Enterprise Guide project showing working with grouped data is here.  

Assignment 5 is here and a solution will be here.

Database issues (Nov 26th)

            Queries and corrections when data is not normalized

            Normalizing data with arrays and loops in a data step

 

The slides are here.

An Enterprise Guide project is here

A dataset is here.

 

Output Delivery System (Nov 28th)

            Limiting out output from SAS

How SAS creates different types of output

Writing output to different file types

SAS output table names

 

The slides are here.

An Enterprise Guide project is here.

My SAS header file is here.

Graphics Data (Nov 28th, Dec 3rd)

Reading: TLSB for EG chapter 10

 

What makes a bad graphic bad

            The gurus of data visualization

            A good graphic

                        Sgplot vbox inset

            Visualizing categorical data

                        Bar charts

                        Dot plot

                        Waffle plots

            Visualizing continuous data

                        Histograms

                        Box plos

                        Violin plot

            How to get image formats for publication

            Working with the SAS ODS

            ODS Statistical Graphics

            SAS Graphics Editor

                        ods listing sge = on;

            Setting up R

                        Tinn-R

                        PERL 

            High quality SAS graphics

                        Sgplot hbox / category =

                        Sgplot scatter

                        Sgplot reg / group =

                        Sgplot histogram

                        Sgpanel panelby histogram

                        Sgpanel panelby scatter

            Lattice plots in SAS

            Ods graphics on

            Selecting a subset of ODS graphics

            ODS statistical graphics with many elements

            Truly custom ODS statistical graphics

                        proc template;

                                    define style sexE;

                                    parent = styles.Statistical;

                                                style graphdata1 / contrastColor=pink markersymbol = "star";

                                                style graphdata2 / contrastColor=blue markersymbol = "plus";

                                    end;

run;

ods listing style = sexE;

            Making big pictures:

goptions hsize=11in vsize=8.5in;

ods graphics / width=11in height=8.5in;

 

                       

The slides are here. (Do not print the entire set)

Enterprise Guide project is here.

SAS code showing ods graphics can be found here.

SAS code showing setting colors in graphics with style templates is here.

R code for common R graphics is here along with data sets here and here.

R code for categorical data is here.

Final Summary (Dec 5th)

The slides are not here.

Solution to HW3 will be here.

Homework 4 is here and is due Dec 10 before 11:31 am.

Other stuff

SAS 2012 keyboard macros can be found here (updated 2012-10-22).

A set of useful links can be found here.

Other Books

(You can get many of these from the Stanford library but preview them on Amazon or books.google.com)

A few of my old favorite books are listed here

My favorite scientific visualization books are here.