HRP 223 - Data Management and Statistical Programming - 2009/2010 Edition

The goal of the course is to provide hands on instruction in data management and analysis techniques.
Topics discussed include:

  1. Working with large databases - what makes a good database turn bad
  2. Data cleaning techniques
  3. Generating numerical and graphical presentations
  4. Descriptive statistics

Contact information

Professor

Teaching Assistant(s)

Raymond R. Balise 
Redwood Bldg. T213D, MC 5092 
Stanford, California  94305-5405 

Balise at Stanford 
Voice (650) 724-2602 
Fax (650) 725-6951

Kameelah Abdullah

 

 

 

kameelah at Stanford

 

 

Prerequisites

Admission to Health Research and Policy and a comfortable knowledge of a Windows XP/Vista.

Lectures                                                                                                             

Monday and Wednesday 11:30-1:00 Redwood Building T138B.

Office Hours

By appointment in Redwood Building T213D.  Directions can be found here: www.stanford.edu/~balise/FindBalise.htm

Newsgroup

If you would like to ask a question or help others please visit the course newsgroup which is named:  su.class.hrp223. While not truly required for the class, you will suffer if you don’t have access to the news.  If you do not know how to subscribe to a newsgroup and you use Windows http://www.stanford.edu/services/usenet/thunderbird_pc/index.html or a Mac http://www.stanford.edu/services/usenet/thunderbird_mac/index.html. Important notes of my setup can be found here: www.stanford.edu/class/hrp223/2009/newsgroup.ppt

Readings

There are no books published (other than the manuals) covering Enterprise Guide 4.2 or SAS 9.2 TS2.

Read about SQL vs. data step code here: http://www2.sas.com/proceedings/sugi31/250-31.pdf

 

Optional Books

The Little SAS Book for Enterprise Guide 4.1: http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=61054

SAS Programming for Enterprise Guide Users: http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=61179

The little SAS Book 3rd Edition : http://www.sas.com/apps/pubscat/bookdetails.jsp?pc=59216

Common Statistical Methods for Clinical Research with SAS Examples: http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=58086

Learning SAS in the Computer Lab http://www.amazon.com/Learning-SAS-Computer-Advanced-Cengage/dp/0495559687/ref=sr_1_1?ie=UTF8&s=books&qid=1256016343&sr=8-1

 

Grading

Grades will be based on four homework problem sets.   If you take the course for 2 units you must pass at least three of the four homework assignments and you must not violate the virus policy below.  If you take the course for 3 units you must pass all four assignments.  There will be many quick assignments that will not directly affect grades.

Turning in Homework and Viruses

All assignments and homework will be submitted via email to balise at Stanford and kameelah at Stanford. Any student that sends me a virus (or any other malicious code) will fail the course.  There will be no exceptions made.  Therefore, you are strongly advised to download the latest version of the Sophos Anti-Virus software. If you need virus protection check here http://www.stanford.edu/services/ess/ and you can download the software for free. If you have any questions ask!

Late policy

Each of the assignments will be due at the beginning of class on the day specified.

That said, there are unforeseen emergencies (illness, bike accidents, disk crashes, network troubles, childbirth, etc.). Instead of having to ask for special allowances on an individual basis, I give each of you the privilege of granting yourself a small extension in case of crisis. You will have two late days which you may use to extend the due dates of any assignments without penalty. To avoid any ambiguity, there are seven days in a week and each day ends at 5:00 PM. Thus, if your assignment was due on Wednesday but turned in the following Monday before 5:00, that assignment would be five days late. After the grace period is up each assignment is down weighted 20% per day.  In all cases, assignments will not be accepted more than one calendar week after the original assignment due date.

Computer Platforms

The programs that you turn in must run on Windows SAS 9.2 TS2 and/or Enterprise Guide 4.2.  I can provide good support for Windows or a Mac running parallels (http://www.parallels.com/).

 

Core Lecture Material

Preliminaries of Computing at Stanford and an Introduction to SAS Enterprise Guide (Sept 21st)

Software somebody should have told you about a long time ago

            Essential Stanford Software (free stuff)

Tools of the trade

            Excel

REDCap

            SAS or SAS/Enterprise Guide instructions on installing SAS are here.

            R and Rcommander

Other software I use (not officially endorsed by anybody and ask your IT security person for help)

            UltraEdit

            UltraCompare

            FileLocator Pro

            MyInfo

Using SAS Enterprise Guide as a calculator

 

The PowerPoint slides are here for Excel 2007/2008 or here for older versions.

Data and Data Collection (Sept 23rd into Sept 30th)

            What is a database?

Critical registry teaks for Excel

How to organize data in Excel

            Variable names

            Dummy records

Using Excel

            Making tables

            Validation

            Formulas

            Quick counts

            Subsets

            Random subsets

            Duplicates Discrepancies and Differences

            PivotTables    

Collecting data in general

            How to score and store answers

            The value/danger of redundancy

            Using REDCap

                        What is it?

                        How to set it up.

 

The PowerPoint slides are here for PowerPoint 2007 or here for PowerPoint 2003.

The REDCap data dictionary is here in Excel 2007 format or here for Excel 2003.  This is based on a file made by Vanderbilt University (http://www.project-redcap.org/).

 

An Excel 2007 spreadsheet showing Excel is here.

 

Assignment 1 is here and also watch the three REDCap tutorial videos found here: http://www.project-redcap.org/videos.php.

Assignment 2 is here and should be done before Oct 7th.

Working with Data in Windows (Sept 30th)

            Types of Files

            File suffixes

            Loading data from Excel

            Using libraries

            Simple subsets

            Basic numeric summaries

            Converting character variables to numeric with the input function

 

The PowerPoint slides are here for PowerPoint 2007 or here for PowerPoint 2003.

 

An Excel 2007 spreadsheet showing mixed type issues is here and the EG project is here.

 

Lab 1 (Oct 5th and Wednesday Oct 7th)

            Setting up Windows

                        File extensions

                        Folders

                        Office

                        DOS   

            Configuring SAS Enterprise Guide

                        Application options

                        Keyboard macros

            Using SAS

                        Libraries

                        Subsets

                        Toy data sets

                        Loading real data sets

                        Simple descriptive statistics and graphics

           

The slides are here for PowerPoint 2007 and here for PowerPoint 2003. They were revised on Oct 6th.  New/edited slides are marked with a splat in the upper left corner.

Homework 1 is here and is due Wednesday Oct 14th before class.

Beginning Data Manipulation (Monday Oct 12th)

            Working with a library

            Copying data

                        Into libraries

                        Selecting a few variables

                        Selecting a few records

            Calculations

                        Numeric variables

                        Complex math

            Basic text manipulation

 

The slides are here for PowerPoint 2007 and here for PowerPoint 2003.

 

A zip file with SAS programs showing the use of %include is here.  My general SAS program header is here.

An Enterprise Guide project that includes exporting and making fake data is here.

An Enterprise Guide project that shows SQL is here.

 

Reading The Little SAS Book for Enterprise Guide 4.1 Chapter 4.

Lab 2  (Oct 14th)

            Creating data

Recoding values to make data

Project organization

Fixing bad values

            Using formats

 

The Enterprise Guide project is located here. 

 

Reading The Little SAS Book for Enterprise Guide 4.1 Chapter 4.

 

Notes on Lab 2 and Merging with SQL (Oct 19th)

Recoding values to fix problems

Collapsing data into categories with a new variable or formats

SQL joins

            Inner joins

            Left joins

            The coalesce() and coalescec() functions

 

The slides are here for PowerPoint 2007 and here for PowerPoint 2003.

An Enterprise Guide project that uses SQL to merge files is here.

 

Reading The Little SAS Book for Enterprise Guide 4.1 Chapter 4.

Lab 3 (Oct 21th)

            Joins

            Summary functions with SQL

            Fixing bad values

                        By using functions like left

                        By recoding

Finding discrepancies when matches fail

 

More SQL and querying voodoo is covered in a project here.

            This assignment uses information in an Excel workbook that can be found here.

 

            Homework 2 is here and is due Oct 28th.

Notes on Lab3 and Combining with SQL (Oct 26th)

            Simulations with EG

            Creating macro variables as prompts

            SQL to select subsets of records from multiple tables

                        Outer union and/or outer union corresponding – concatenate

Union – unique rows from both tables

Except – rows that are part of the first query

Intersect – rows common to both queries

 

The slides are here for PowerPoint 2007 and here for PowerPoint 2003.

The Enterprise Guide file for today is here.

 

Midterm Summary and Basic Data Step Programming (Oct 28th)

            Creating fake data

            Importing

            Creating tables from a single table

                        Selecting variables

Recoded variables

Using NULL numeric values

Filters to subsets of records

Working with character data

            From multiple tables

                        SQL Joins

                        SQL set operations

            How the data step really works

 

The slides are here for PowerPoint 2007 and here for PowerPoint 2003.

The Enterprise Guide file for today is here.

 

A bit about Macros and Database Theory and Normalization (Nov 2nd)

            Queries and corrections when data is not normalized

            Normalizing data with arrays and loops in a data step

 

The slides are here for PowerPoint 2007 and here for PowerPoint 2003.

An Enterprise Guide project showing macros is here.

An Enterprise Guide project showing fixing wide tables with data step is here.

 

Homework 3 is here and is due November 18th before class.

Solution to HW2 (Nov 4th)

            A solution to HW2 is here.  This is not what I expected you to be able to produce.

            Information on numeric precision in a computer is here: http://support.sas.com/techsup/technote/ts654.pdf.

 More on HW2 and Grouped Data (Nov 9th)

            By processing

            Retain statements

           

The slides are here for PowerPoint 2007 and here for PowerPoint 2003.

            An Enterprise Guide project showing working with grouped data is here.  

Lab 4 (Nov 11th)

            Basic to advanced data step manipulation for data cleaning

Notes on Lab 4 and Graphics (Nov 16th-Nov 18th)

            What makes a bad graphic bad

            The gurus of data visualization

            Visualizing categorical data

                        Bar charts

                        Dot plot

                        Waffle plots

            Visualizing continuous data

                        Histograms

                        Box plos

                        Violin plot

            How to get image formats for publication

            Working with the SAS ODS

            ODS Statistical Graphics

            SAS Graphics Editor

            Setting up R

                        Tinn-R

                        PERL 

            High quality SAS graphics

                        Sgplot hbox / category =

                        Sgplot scatter

                        Sgplot reg / group =

                        Sgplot histogram

                        Sgpanel panelby histogram

                        Sgpanel panelby scatter

            Lattice plots in SAS

            Ods graphics on

            Selecting a subset of ODS graphics

            ODS statistical graphics with many elements

            Truly custom ODS statistical graphics

                       

The slides will be here for PowerPoint 2007 and here for PowerPoint 2003.

Lab 5 (Nov 18th)

            Graphics

Statistical Programming (Nov 30th)

            something

Final Summary (Dec 2nd)

            Something

            Homework 4 assigned and before 11:30 on December 11th.  This cannot be turned in late.

 

Other stuff

A set of useful links can be found here.

A few of my old favorite books are listed here.

SAS 2009 keyboard macros can be found here.