Announcements

Oct 5, 2015: Kaggle Part I

Part I of the Kaggle competition has begun! Please see the Kaggle menu link for more details.

Sep 23, 2015: HW 1 Correction

There was a small typo in HW 1 (which has now been corrected). Part b of problem 5 should read: "The bias component of this expected test MSE?".

Sep 23, 2015: Kaggle Data Terms and Conditions

A word from our ALS data partners: To participate in the course ALS prediction challenge, you will need to agree to these terms and conditions. If you cannot agree to them, please let us know by emailing the staff list.

Sep 23, 2015: Labs

Occasionally we will post links to "labs" which supplement the day's lecture. These labs feature code and output produced by the course staff to illustrate a concept. For instance, Lab 2 (under the Lectures tab) shows you how we generated the bias-variance decomposition example in today's lecture. Feel free to read through the lab to improve your understanding and to try your hand at recreating or modifying our examples.

Sep 21, 2015: Finding Kaggle Competition Teammates

We've created a pinned post on Piazza to help you find Kaggle competition teammates.

Meeting time and recorded lectures

Stats 202 meets MWF 1:30-2:20 pm in NVIDIA Auditorium.

All lectures will be recorded on video by the Stanford Center for Professional Development and posted here. If that link does not work for you, try logging into http://scpd.stanford.edu/ directly and navigating to the Stats 202 course. If you are unable to access the lecture videos, please contact SCPD to gain access.

Lecture slides will be posted on this site (see the Lectures link on the left).


Course description

Stats 202 is an introduction to Data Mining. By the end of the quarter, students will:

  • Understand the distinction between supervised and unsupervised learning and be able to identify appropriate tools to answer different research questions.
  • Become familiar with basic unsupervised procedures including clustering and principal components analysis.
  • Become familiar with the following regression and classification algorithms: linear regression, ridge regression, the lasso, logistic regression, linear discriminant analysis, K-nearest neighbors, splines, generalized additive models, tree-based methods, and support vector machines.
  • Gain a practical appreciation of the bias-variance tradeoff and apply model selection methods based on cross-validation and bootstrapping to a prediction challenge.
  • Analyze a real dataset of moderate size using a combination of R and Python.
  • Develop the computational skills for data wrangling, collaboration, and reproducible research.
  • Be exposed to other topics in machine learning, such as missing data, prediction using time series and relational data, non-linear dimensionality reduction techniques, web-based data visualizations, anomaly detection, and representation learning.

Prerequisites

Introductory courses in statistics or probability (e.g., Stats 60), linear algebra (e.g., Math 51), and computer programming (e.g., CS 105).


Communication

The vast majority of questions about homework, the lectures, or the course should be asked on our Piazza forum, as others will benefit from the responses. You can join the Piazza forum using the link www.piazza.com/stanford/fall2015/stats202. We strongly encourage students to respond to one another's questions!

Questions from which others cannot benefit can be emailed to the staff mailing list stats202-aut1516-staff@lists.stanford.edu.

Personal staff email addresses should only be used for sensitive matters (e.g., concerns about specific course staff).


Staff and office hours

Consult this table for up-to-date office hour information. For online office hours, we provide persistent meeting links which will be active at the advertised office hour times. Upon clicking the link, you will have the option of joining the meeting by phone, browser, or BlueJeans app.

Role | Name | Office hours | Location
Instructor | Lester Mackey | W 11-11:55am, 2:30-3:30pm | Sequoia 141
TA | Murat Erdogdu | Tu 9-10:55am | Sequoia 206
TA | Jackson Gorham | M 6-7pm, Tu 6-7pm | Online office hours: https://bluejeans.com/287714272
TA | Minyong Lee | M 10-12pm | Sequoia 237
TA | Paulo Orenstein | Th 4-6pm | Sequoia 105 (the library)
TA | Charles Zheng | Th 8-10pm (online), F 9-11am (Sequoia 207), F 6-8pm (online) | Online office hours: https://bluejeans.com/889567289

Textbook

The only textbook required is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 1st ed., 2013). The book is available at the Stanford Bookstore and free online through the Stanford Libraries.

We may occasionally assign supplementary readings from the optional text The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (Springer, 2nd ed.).

In our lecture notes, ISL stands for An Introduction to Statistical Learning and ESL stands for The Elements of Statistical Learning.


Exams

(If you are an online SCPD student, please see SCPD info for more information on remote exam instructions and timings.)

  • Midterm exam: Monday, October 26, 1:30-2:20 pm (in our normal classroom).
  • Final exam: Monday, December 7, 12:15-3:15 pm (in our normal classroom).

If extenuating circumstances prevent you from taking the midterm on Oct. 26, you must email us by Oct. 14. Since the midterm is during class, we cannot guarantee an opportunity to make it up.

The final exam is mandatory. If you cannot take it at the time indicated above, please drop the class.


Homework

There will be 7 graded homework assignments, due on Wednesdays at the start of class. An ungraded assignment (Homework 0) will help you install and become familiar with the tools used in this course. The homework assignments and staff solutions will be posted on this website and will be accessible by enrolled students (see the Homework link on the left).

After attempting homework problems on an individual basis, you may discuss a homework assignment with up to two classmates. However, you must write up your own solutions individually and explicitly indicate with whom (if anyone) you discussed the homework problems at the top of your homework solutions. In your solutions, please show your work and include all relevant code written. Please also keep in mind the university honor code.

This quarter, we will be using the Gradescope online submission and scoring system for all homework submissions. Gradescope will send a Stats 202 enrollment notification to your Stanford email address. If you have not received such a notification by Thursday, Sep. 24 (Pacific Time), please add your enrollment information to this spreadsheet and allow 24 hours for us to process your enrollment.

Your problem sets should be submitted as PDF or image files through Gradescope. Here are some tips for scanning and submitting through Gradescope.

Any regrade requests should be submitted through Gradescope within one week of receiving your grade. Before sending a request, please read the relevant solutions and review the relevant course material. In the request, specify (1) the part(s) of the homework you believe were wrongly graded and (2) why you deserve additional credit. We reserve the right to regrade the entirety of any homework for which any regrade is requested.

Late homework will not be accepted, but the lowest homework score will be ignored.


Kaggle competition

An important part of the class will be an in-class prediction challenge hosted by Kaggle. This competition will allow you to apply the concepts learned in class and develop the computational skills to analyze data in a collaborative setting.

To learn more about the competition, see the Kaggle link on the left.


Grading

  • Homework: 40% (lowest score dropped).
  • Midterm: 20%.
  • Final: 35%.
  • Kaggle competition: 5% (based on satisfactory participation).
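
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It is not the staff's official formula: the function name course_score is hypothetical, and the code assumes that every component is expressed as a percentage between 0 and 100, that the homework assignments count equally once the lowest score is dropped, and that satisfactory Kaggle participation is recorded as 100.

```python
# Hypothetical helper: the syllabus states the weights only, so the
# exact combination rule below is an assumption for illustration.
WEIGHTS = {"homework": 0.40, "midterm": 0.20, "final": 0.35, "kaggle": 0.05}


def course_score(homework, midterm, final, kaggle):
    """Return a weighted course score on a 0-100 scale.

    Assumes all inputs are percentages in [0, 100] and that homework
    assignments are weighted equally after dropping the lowest one.
    """
    if len(homework) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(homework)[1:]  # drop the single lowest homework score
    hw_avg = sum(kept) / len(kept)
    return (WEIGHTS["homework"] * hw_avg
            + WEIGHTS["midterm"] * midterm
            + WEIGHTS["final"] * final
            + WEIGHTS["kaggle"] * kaggle)


# Made-up scores for the 7 graded homework assignments.
example_homework = [90, 80, 100, 70, 85, 95, 60]
print(round(course_score(example_homework, midterm=80, final=90, kaggle=100), 2))
```

In this made-up example the homework score of 60 is dropped, so the homework component is the average of the remaining six scores.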

Tentative outline

Day | Topic | Chapters | Homework
Mon 9/21 | Class logistics, HW 0 | | HW 0 out
Wed 9/23 | Supervised and unsupervised learning | 2 | HW 1 out
Fri 9/25 | Principal components analysis | 10.1, 10.2, 10.4 | HW 0 due
Mon 9/28 | Clustering | 10.3, 10.5 |
Wed 9/30 | Linear regression | 3.1-3.3 | HW 1 due, HW 2 out
Fri 10/02 | Linear regression | 3.3-3.6 |
Mon 10/05 | Classification, logistic regression | 4.1-4.3 |
Wed 10/07 | Linear discriminant analysis | 4.4-4.5 | HW 2 due, HW 3 out
Fri 10/09 | Classification lab | 4.6 |
Mon 10/12 | Cross validation | 5.1 |
Wed 10/14 | The Bootstrap | 5.2-5.3 | HW 3 due, HW 4 out
Fri 10/16 | Regularization | 6.1, 6.5 |
Mon 10/19 | Shrinkage | 6.2 |
Wed 10/21 | Shrinkage lab | 6.6 | HW 4 due
Fri 10/23 | Dimension reduction | 6.3, 6.7 |
Mon 10/26 | Midterm exam | |
Wed 10/28 | Splines | 7.1-7.4 | HW 5 out
Fri 10/30 | Smoothing splines, GAMs, Local regression | 7.5-7.7 |
Mon 11/02 | Non-linear regression lab | 7.8 |
Wed 11/04 | Decision trees | 8.1, 8.3.1-2 | HW 5 due, HW 6 out
Fri 11/06 | Bagging, random forests, boosting | 8.2, 8.3.3-4 |
Mon 11/09 | Support vector machines | 9.1-9.2 |
Wed 11/11 | Support vector machines | 9.3-9.5 | HW 6 due, HW 7 out
Fri 11/13 | Support vector machines lab | 9.6 |
Mon 11/16 | Non-linear dimensionality reduction | |
Wed 11/18 | Wavelets | | HW 7 due
Fri 11/20 | Data scraping, data wrangling | |
Mon 11/23 | Thanksgiving | |
Wed 11/25 | Thanksgiving | |
Fri 11/27 | Thanksgiving | |
Mon 11/30 | Web visualizations | |
Wed 12/02 | Final review | All chapters | Kaggle deadline
Fri 12/04 | Final review | All chapters |
Mon 12/07 | Final exam | |