### Announcements

Dec 3, 2013:

The Kaggle deadline shown on the website has changed to Friday, December 6. However, the winners of the competition will still be determined on Wednesday, December 4, at 4pm PST. Please make all your submissions by tomorrow.

Nov 4, 2013:

Your gradebook is now available on our Coursework site. This website will only be used for the gradebook functionality. Please email the graders if you are missing grades for work you submitted.

Oct 31, 2013:

You may download the solutions to the midterm.

Oct 13, 2013:

Regrade requests should include the following:

• Your full name and SUNet ID.
• The homework and problem number.
• The number of points that you lost.
• A brief justification of why you think the grading is incorrect or unfair.

Oct 7, 2013:

Both exams will be closed-book and closed-notes. We will provide a cheat sheet with the necessary equations. You will not need to recall R commands, but we may ask you to interpret the output of R functions without documentation.

Sep 30, 2013:

If you have questions about homework or any of the lectures, please use our Piazza forum. You may join it using the link: www.piazza.com/stanford/fall2013/stats202.

Any other questions can be emailed to the staff mailing list: stats202-aut1314-staff@lists.stanford.edu. Please do not use personal email addresses unless strictly necessary.

Stats 202 meets MWF 1:15-2:05 pm at Skilling Auditorium (note the location change!).

All lectures will be recorded on video by the Stanford Center for Professional Development and posted here.

### Course description

Stats 202 is an introduction to Data Mining. By the end of the quarter, students will:

• Understand the distinction between supervised and unsupervised learning, and be able to identify appropriate tools to answer different research questions.
• Become familiar with basic unsupervised procedures, including clustering and principal components analysis.
• Become familiar with the following regression and classification algorithms: linear regression, ridge regression, the lasso, logistic regression, linear discriminant analysis, K-nearest neighbors, splines, generalized additive models, tree-based methods, and support vector machines.
• Gain a practical appreciation of the bias-variance tradeoff, and apply model selection methods based on cross-validation and bootstrapping to a prediction challenge.
• Analyze a real dataset of moderate size using a combination of R and Python.
• Develop the computational skills for data wrangling, collaboration, and reproducible research.
• Be exposed to other topics in machine learning, such as missing data, prediction using time series and relational data, non-linear dimensionality reduction techniques, web-based data visualizations, anomaly detection, and representation learning.

### Staff and office hours

Consult this table for up-to-date office hour information.

| Role | Name | Office hours | Location |
|---|---|---|---|
| Instructor | Sergio Bacallado | Friday 2:15-3:45pm | Sequoia 202 |
| TA | Rakesh Achanta (Unix, R, Python) | Tuesday 3:45-5:45pm | Sequoia 105 |
| TA | Jackson Gorham (Unix, Git, Python) | Monday 3:45-5:45pm | Sequoia 207 |
| TA | Jiyao Kou (R, Windows) | Wednesday 3:45-5:45pm | Sequoia "Fishbowl" |
| TA | Minyong Lee (R) | Tuesday 10am-12pm | Sequoia 207 |
| TA | Jian Li | Friday 10am-12pm | Sequoia "Fishbowl" |
| TA | Scott Powers (R, Mac, Windows) | Thursday 2:10-4:10pm | Bldg 320 Room 107 |

### Textbook

The only textbook required is An Introduction to Statistical Learning with applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 1st ed., 2013).

The book is available at the Stanford Bookstore and free online through the Stanford Libraries. A hard copy of the book is in the reserves of the Mathematics and Statistics Library.

### Exams

• Midterm exam: Monday, October 28, 1:15 to 2:05 pm (in class).
• Final exam: Monday, December 9, 8:30 to 11:30 am, at Cubberley Auditorium.

If extenuating circumstances prevent you from taking the midterm on October 28, you must email us by October 15. Since the midterm is during class, we cannot guarantee an opportunity to make it up.

The final exam is mandatory. If you cannot take it at the time indicated above, please drop the class.

### Homework

There will be 7 graded homework assignments, due at the start of class on the day indicated. You may:

• submit handwritten or printed solutions at the beginning of lecture, or
• submit the homework through this website by the beginning of lecture.

Late homework will not be accepted, but the lowest homework score will be ignored.

### Kaggle competition

An important part of the class will be a quarter-long prediction challenge hosted by Kaggle. This competition will allow you to apply the concepts learned in class and develop the computational skills to analyze data in a collaborative setting.

Your goal will be to predict the employment status of middle-aged individuals during the 2008 financial crisis. We will use data from an ambitious longitudinal study made available by the National Bureau of Labor Statistics.

Invitations to the competition have been sent! If you haven't received one, please contact us. You may use our Piazza forum to form teams.

### Grading

• Homework: 40% (lowest score dropped).
• Midterm: 20%.
• Final: 35%.
• Kaggle competition: 5% (based on satisfactory participation).

The three teams with the highest scores in the Kaggle competition will be given the option of not taking the final exam (!). Their class grade would then be based on midterm and homework scores alone.
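
As a rough illustration of how the weights listed above combine, here is a minimal Python sketch of the standard weighting. It rests on assumptions the syllabus does not state: every score is expressed as a percentage, homework assignments count equally, and satisfactory Kaggle participation earns the full 5%. The function name `course_grade` and the sample scores are hypothetical. The sketch does not cover teams that opt out of the final, since the reweighting in that case is not specified here.

```python
def course_grade(homework, midterm, final, kaggle_ok):
    """Weighted course grade on a 0-100 scale.

    Assumptions (not stated in the syllabus): each score is a percentage
    in [0, 100], homework assignments are weighted equally, and
    satisfactory Kaggle participation earns the full 5%.
    """
    if len(homework) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(homework)[1:]  # drop the single lowest homework score
    hw_avg = sum(kept) / len(kept)
    kaggle = 100.0 if kaggle_ok else 0.0
    return 0.40 * hw_avg + 0.20 * midterm + 0.35 * final + 0.05 * kaggle


# Hypothetical scores, for illustration only.
hw = [90, 85, 100, 70, 95, 80, 88]
grade = course_grade(hw, midterm=82, final=91, kaggle_ok=True)
print(round(grade, 2))
```

With these made-up scores, the 70 is the homework score that gets dropped before averaging.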

### Outline

| Day | Topic | Chapters | Homework |
|---|---|---|---|
| Mon 9/23 | Class logistics, HW 0 | | HW 0 out |
| Wed 9/25 | Supervised and unsupervised learning | 2 | HW 1 out |
| Fri 9/27 | Principal components analysis | 10.1, 10.2, 10.4 | HW 0 due |
| Mon 9/30 | Clustering | 10.3, 10.5 | |
| Wed 10/02 | Linear regression | 3.1-3.3 | HW 1 due, HW 2 out |
| Fri 10/04 | Linear regression | 3.3-3.6 | |
| Mon 10/07 | Classification, logistic regression | 4.1-4.3 | |
| Wed 10/09 | Linear discriminant analysis | 4.4-4.5 | HW 2 due, HW 3 out |
| Fri 10/11 | Classification lab | 4.6 | |
| Mon 10/14 | Cross validation | 5.1 | |
| Wed 10/16 | The Bootstrap | 5.2-5.3 | HW 3 due, HW 4 out |
| Fri 10/18 | Regularization | 6.1, 6.5 | |
| Mon 10/21 | Shrinkage | 6.2 | |
| Wed 10/23 | Shrinkage lab | 6.6 | HW 4 due |
| Fri 10/25 | Dimension reduction | 6.3, 6.7 | |
| Mon 10/28 | Midterm exam | | |
| Wed 10/30 | Splines | 7.1-7.4 | HW 5 out |
| Fri 11/01 | Smoothing splines, GAMs, Local regression | 7.5-7.7 | |
| Mon 11/04 | Non-linear regression lab | 7.8 | |
| Wed 11/06 | Decision trees | 8.1, 8.3.1-2 | HW 5 due, HW 6 out |
| Fri 11/08 | Bagging, random forests, boosting | 8.2, 8.3.3-4 | |
| Mon 11/11 | Support vector machines | 9.1-9.2 | |
| Wed 11/13 | Support vector machines | 9.3-9.5 | HW 6 due, HW 7 out |
| Fri 11/15 | Support vector machines lab | 9.6 | |
| Mon 11/18 | Prediction with time series | | |
| Wed 11/20 | Prediction with relational data | | HW 7 due |
| Fri 11/22 | Data scraping, data wrangling | | |
| Mon 11/25 | Thanksgiving | | |
| Wed 11/27 | Thanksgiving | | |
| Fri 11/29 | Thanksgiving | | |
| Mon 12/02 | Web visualizations | | |
| Wed 12/04 | Final review | All chapters | Kaggle deadline |
| Fri 12/06 | Final review | All chapters | |
| Mon 12/09 | Final exam | | |

Some important dates:

• November 15: Deadline to withdraw from the class or change the grading basis.
• December 06: Last opportunity to arrange an Incomplete.