{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Course Introduction and Review\n",
"\n",
"## Outline\n",
"\n",
"* What is a regression model?\n",
"\n",
"* Descriptive statistics -- numerical\n",
"\n",
"* Descriptive statistics -- graphical\n",
"\n",
"* Inference about a population mean\n",
" \n",
"* Difference between two population means"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"options(repr.plot.width=5, repr.plot.height=3)\n",
"set.seed(0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# What is course about?\n",
"\n",
"* It is a course on applied statistics.\n",
"\n",
"* Hands-on: we use [R](http://cran.r-project.org), an open-source statistics software environment.\n",
"\n",
"* Course notes will be [jupyter](http://jupyter.org) notebooks.\n",
"\n",
"* We will start out with a review of introductory statistics to see `R` in action.\n",
" \n",
"* Main topic is *(linear) regression models*: these are the *bread and butter* of applied statistics.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## What is a regression model? \n",
"\n",
"\n",
"A regression model is a model of the relationships between some \n",
"*covariates (predictors)* and an *outcome*.\n",
"\n",
"\n",
"Specifically, regression is a model of the *average* outcome *given or having fixed* the covariates. \n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Heights of mothers and daughters\n",
"\n",
"\n",
" \n",
"* We will consider the [heights](http://www.stat.cmu.edu/~roeder/stat707/=data/=data/data/Rlibraries/alr3/html/heights.html) of mothers and daughters collected \n",
"by Karl Pearson in the late 19th century.\n",
"\n",
"* One of our goals is to understand height of the daughter, `D`, knowing the height of the\n",
"mother, `M`.\n",
"\n",
"\n",
"* A mathematical model might look like\n",
" $$\n",
" D = f(M) + \\varepsilon$$\n",
" where $f$ gives the average height of the daughter\n",
" of a mother of height `M` and\n",
" $\\varepsilon$ is *error*: not *every* daughter has the same height.\n",
"\n",
"* A statistical question: is there *any*\n",
"relationship between covariates and outcomes -- is $f$ just a constant?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's create a plot of the heights of the mother/daughter pairs. The data is in an `R` package that can be downloaded\n",
"from [CRAN](http://cran.r-project.org/) with the command:\n",
"\n",
" install.packages(\"alr3\")\n",
" \n",
"If the package is not installed, then you will get an error message when calling `library(alr3)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"library(alr3)\n",
"data(heights)\n",
"M = heights$Mheight\n",
"D = heights$Dheight\n",
"plot(M, D, pch = 23, bg = \"red\", cex = 2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In the first part of this course we'll talk about fitting a line to this data. Let's do that and remake the plot, including\n",
"this \"best fitting line\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"plot(M, D, pch = 23, bg = \"red\", cex = 2)\n",
"height.lm = lm(D ~ M)\n",
"abline(height.lm, lwd = 3, col = \"yellow\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Linear regression model\n",
"\n",
"* How do we find this line? With a model.\n",
"\n",
"* We might model the data as\n",
"$$\n",
"D = \\beta_0+ \\beta_1 M + \\varepsilon.\n",
"$$\n",
"\n",
"* This model is *linear* in $(\\beta_0, \\beta_1)$, the intercept and the coefficient of `M` (the mother's height), it is a \n",
"*simple linear regression model*.\n",
"\n",
"* Another model:\n",
"$$\n",
"D = \\beta_0 + \\beta_1 M + \\beta_2 M^2 + \\beta_3 F + \\varepsilon\n",
"$$\n",
"where $F$ is the height of the daughter's father.\n",
"\n",
"* Also linear (in $(\\beta_0, \\beta_1, \\beta_2, \\beta_3)$, the coefficients of $1,M,M^2,F$).\n",
"\n",
"* Which model is better? We will need a tool to compare models... more to come later.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A more complex model\n",
"\n",
"* Our example here was rather simple: we only had one independent variable.\n",
"\n",
"* Independent variables are sometimes called *features* or *covariates*.\n",
"\n",
"* In practice, we often have many more than one independent variable."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Right-to-work\n",
"\n",
"This example from the text considers the effect of right-to-work legislation (which varies by state) on various\n",
"factors. A [description](http://www.ilr.cornell.edu/~hadi/RABE4/Data4/P005.txt) of the data can be found here.\n",
"\n",
"The variables are:\n",
"\n",
"* Income: income for a four-person family\n",
"\n",
"* COL: cost of living for a four-person family\n",
"\n",
"* PD: Population density\n",
"\n",
"* URate: rate of unionization in 1978\n",
"\n",
"* Pop: Population\n",
"\n",
"* Taxes: Property taxes in 1972\n",
"\n",
"* RTWL: right-to-work indicator\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In a study like this, there are many possible questions of interest. Our focus will be on the\n",
"relationship between `RTWL` and `Income`. However, we recognize that other variables\n",
"have an effect on `Income`. Let's look at some of these relationships."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"url = \"http://www1.aucegypt.edu/faculty/hadi/RABE4/Data4/P005.txt\"\n",
"rtw.table <- read.table(url, header=TRUE, sep='\\t')\n",
"print(head(rtw.table))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A graphical way to \n",
"visualize the relationship between `Income` and `RTWL` is the *boxplot*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"attach(rtw.table) # makes variables accessible in top namespace\n",
"boxplot(Income ~ RTWL, col='orange', pch=23, bg='red')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"One variable that may have an important effect on the relationship between\n",
" is the cost of living `COL`. It also varies between right-to-work states."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"boxplot(COL ~ RTWL, col='orange', pch=23, bg='red')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We may want to include more than one plot in a given display. The first line of the\n",
"code below achieves this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"options(repr.plot.width=7, repr.plot.height=7)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"par(mfrow=c(2,2))\n",
"plot(URate, COL, pch=23, bg='red', main='COL vs URate')\n",
"plot(URate, Income, pch=23, bg='red')\n",
"plot(URate, Pop, pch=23, bg='red')\n",
"plot(COL, Income, pch=23, bg='red')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`R` has a builtin function that will try to display all pairwise relationships in a given dataset, the function `pairs`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"pairs(rtw.table, pch=23, bg='red')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In looking at all the pairwise relationships. There is a point that stands out from all the rest.\n",
"This data point is New York City, the 27th row of the table. (Note that `R` uses 1-based instead of 0-based indexing for rows and columns of arrays.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(rtw.table[27,])\n",
"pairs(rtw.table[-27,], pch=23, bg='red')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"options(repr.plot.width=5, repr.plot.height=3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Right-to-work example\n",
"\n",
"## Building a model\n",
"\n",
"Some of the main goals of this course:\n",
"\n",
"* Build a statistical model describing the *effect* of `RTWL` on `Income`.\n",
"\n",
"* This model should recognize that other variables also affect `Income`.\n",
"\n",
"* What sort of *statistical confidence* do we have in our \n",
"conclusion about `RTWL` and `Income`?\n",
"\n",
"* Is the model adequate do describe this dataset?\n",
"\n",
"* Are there other (simpler, more complicated) better models?\n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Numerical descriptive statistics\n",
"\n",
"## Mean of a sample\n",
"\n",
"Given a sample of numbers $X=(X_1, \\dots, X_n)$ the sample mean, \n",
"$\\overline{X}$ is\n",
"$$\n",
"\\overline{X} = \\frac1n \\sum_{i=1}^n X_i.$$\n",
" \n",
"There are many ways to compute this in `R`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"X = c(1,3,5,7,8,12,19)\n",
"print(X)\n",
"print(mean(X))\n",
"print((X[1]+X[2]+X[3]+X[4]+X[5]+X[6]+X[7])/7)\n",
"print(sum(X)/length(X))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We'll also illustrate thes calculations with part of an example we consider below, on differences\n",
"in blood pressure between two groups."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"url = 'http://www.stanford.edu/class/stats191/data/Calcium.html' # from DASL\n",
"calcium.table = read.table(url, header=TRUE, skip=26, nrow=21)\n",
"attach(calcium.table)\n",
"treated = Decrease[(Treatment == 'Calcium')]\n",
"placebo = Decrease[(Treatment == 'Placebo')]\n",
"treated\n",
"mean(treated)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Standard deviation of a sample\n",
"\n",
"Given a sample of numbers $X=(X_1, \\dots, X_n)$ the sample \n",
"standard deviation $S_X$ is\n",
"$$\n",
"S^2_X = \\frac{1}{n-1} \\sum_{i=1}^n (X_i-\\overline{X})^2.$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"S2 = sum((treated - mean(treated))^2) / (length(treated)-1)\n",
"print(sqrt(S2))\n",
"print(sd(treated))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Median of a sample\n",
"\n",
" Given a sample of numbers $X=(X_1, \\dots, X_n)$ the sample median is\n",
" the `middle` of the sample:\n",
" if $n$ is even, it is the average of the middle two points.\n",
" If $n$ is odd, it is the midpoint.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"X\n",
"print(c(X, 13))\n",
"median(c(X, 13))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Quantiles of a sample\n",
"\n",
"Given a sample of numbers $X=(X_1, \\dots, X_n)$ the $q$-th quantile is\n",
"a point $x_q$ in the data such that $q \\cdot 100\\%$ of the data lie to the \n",
"left of $x_q$.\n",
"\n",
"### Example\n",
"\n",
"The $0.5$-quantile is the median: half \n",
"of the data lie to the right of the median.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"quantile(X, c(0.25, 0.75))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Graphical statistical summaries\n",
"\n",
"We've already seen a boxplot. Another common statistical summary is a \n",
"histogram."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hist(treated, main='Treated group', xlab='Decrease', col='orange')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Inference about a population mean\n",
"\n",
"## A testing scenario\n",
"\n",
"* Suppose we want to determine the efficacy of a new drug on blood pressure.\n",
"\n",
"* Our study design is: we will treat\n",
"a large patient population (maybe not so large: budget constraints limit it $n=20$) with the drug and measure their\n",
"blood pressure before and after taking the drug.\n",
"\n",
"* We conclude that the drug is effective if the blood pressure has decreased on average. That is,\n",
"if the average difference between before and after is positive.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Setting up the test\n",
"\n",
"\n",
"\n",
"* The *null hypothesis*, $H_0$ is: *the average difference is less\n",
"than zero.*\n",
"\n",
"* The *alternative hypothesis*, $H_a$, is: *the average difference \n",
"is greater than zero.*\n",
"\n",
"* Sometimes (actually, often), people will test the alternative, $H_a$: *the\n",
"average difference is not zero* vs. $H_0$: *the average difference is zero.*\n",
"\n",
"* The test is performed by estimating\n",
"the average difference and converting to standardized units.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Drawing from a box\n",
"\n",
"* Formally, could set up the above test as drawing from a box of *differences\n",
"in blood pressure*.\n",
"\n",
"* A box model is a useful theoretical device that describes the experiment\n",
"under consideration. In our example, we can think of the sample of decreases\n",
"drawn 20 patients at random from a large population (box) containing all the possible\n",
"decreases in blood pressure.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## A simulated box model\n",
"\n",
"* In our box model, we will assume that the decrease is an integer drawn at random from \n",
"$-3$ to 6.\n",
"\n",
"* We will draw 20 random integers from -3 to 6 with replacement and test whether the mean\n",
"of our \"box\" is 0 or not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"mysample = sample(-3:6, 20, replace=TRUE)\n",
"mysample"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The test is usually a $T$ test that uses the statistic\n",
"$$\n",
" T = \\frac{\\overline{X}-0}{S_X/\\sqrt{n}} \n",
" $$\n",
" \n",
"The formula can be read in three parts:\n",
"\n",
"- estimating the mean: $\\overline{X}$;\n",
"\n",
"- comparing to 0: subtracting 0 in the numerator;\n",
"\n",
"- converting difference to standardized units: dividing by $S_X/\\sqrt{n}$ our estimate of the variability of $\\overline{X}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"T = (mean(mysample) - 0) / (sd(mysample) / sqrt(20))\n",
"T"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"This $T$ value is often compared to a table for the appropriate $T$ distribution (in this case there are 19 *degrees of freedom*) and the 5% cutoff is"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"cutoff = qt(0.975, 19)\n",
"cutoff"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Strictly speaking the $T$ distribution should be used when the values in the box\n",
"are spread similarly to a normal curve. This is not the case here, but if $n$ is large enough,\n",
"there is not a huge difference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"qnorm(0.975)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The result of the two-sided test is"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"reject = (abs(T) > cutoff)\n",
"reject"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If `reject` is `TRUE`, then we reject $H_0$ the mean is 0 at a level of 5%, while if it is `FALSE` we do not reject. Of course, in this example we know the mean in our \"box\" is not 0, it is 1.5.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"This rule can be visualized with the $T$ density. The total grey area is 0.05=5%, and the cutoff is chosen to be symmetric\n",
"around zero and such that this area is exactly 5%.\n",
"\n",
"For a test of size $\\alpha$ we write this cutoff $t_{n-1,1-\\alpha/2}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"library(ggplot2)\n",
"alpha = 0.05\n",
"df = 19\n",
"xval = seq(-4,4,length=101)\n",
"q = qt(1-alpha/2, df)\n",
"\n",
"rejection_region = function(dens, q_lower, q_upper, xval) {\n",
" fig = (ggplot(data.frame(x=xval), aes(x)) +\n",
" stat_function(fun=dens, geom='line') +\n",
" stat_function(fun=function(x) {ifelse(x > q_upper | x < q_lower, dens(x), NA)},\n",
" geom='area', fill='#CC7777') + \n",
" labs(y='Density', x='T') +\n",
" theme_bw())\n",
" return(fig)\n",
"}\n",
"\n",
"T19_fig = rejection_region(function(x) { dt(x, df)}, -q, q, xval) + \n",
" annotate('text', x=2.5, y=dt(2,df)+0.3, label='Two sided rejection region, df=19')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"T19_fig"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Reasoning behind the test\n",
"\n",
"Suppose $H_0$ was true -- say the mean of the box was zero.\n",
"\n",
"For example, we might assume the difference is drawn at random from integers -5 to 5 inclusive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Generate a sample from a box for which the null is true\n",
"null_sample = function(n) {\n",
" return(sample(-5:5, n, replace=TRUE))\n",
"}\n",
"\n",
"# Compute the T statistic\n",
"null_T = function(n) {\n",
" cur_sample = null_sample(n) \n",
" return((mean(cur_sample) - 0) / (sd(cur_sample) / sqrt(n)))\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Type I error\n",
"\n",
"When the null hypothesis is true, like in our simulation,\n",
"we expect that the $T$ statistic will exceed the cutoff only about 5% of the time.\n",
"\n",
"If we use the cutoff $t_{19,0.975}$ to decide in favor or against $H_0$, rejecting\n",
"$H_0$ when the absolute value is larger than this value, then we have a test whose\n",
"**Type I error** is about 5%.\n",
"\n",
"It is exactly 5% if the sample were drawn from a box whose values follow a normal curve..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"results = numeric(10000)\n",
"for (i in 1:10000) {\n",
" results[i] = null_T(20)\n",
"}\n",
"mean(abs(results) >= qt(0.975, 19))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We use the $T$ curve (close to the normal curve) because when $H_0$\n",
"is true, the distribution of the T statistic is close to the $T$ curve"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"plot(density(results), lwd=3)\n",
"xval = seq(-4,4,length=201)\n",
"lines(xval, dt(xval, 19), col='red', lwd=3) # T_19 density\n",
"lines(xval, dnorm(xval), col='blue', lwd=3) # Normal(0,1) density"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`R` will compute this $T$ statistic for you, and many other things. `R` will use the $T$ distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t.test(mysample)\n",
"T\n",
"2 * pt(abs(T), 19, lower=FALSE)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"As mentioned above, sometimes tests are one-sided. If the null hypothesis we tested was that the mean is less than 0, then we would reject this hypothesis if our observed mean was much larger than 0. This corresponds to a positive $T$ value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"cutoff = qt(0.95, 19)\n",
"T19_pos = rejection_region(function(x) { dt(x, df)}, -Inf, cutoff, xval) + \n",
" annotate('text', x=2.5, y=dt(2,df)+0.3, label='One sided rejection region, df=19')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"T19_pos"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The rejection rules are affected by the degrees of freedom. Here is the rejection region\n",
"when we only have 5 samples from our \"box\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"df = 4\n",
"cutoff = qt(0.975, df)\n",
"T4_fig = rejection_region(function(x) { dt(x, df)}, -cutoff, cutoff, xval) + \n",
" annotate('text', x=2.5, y=dt(2,19)+0.3, label='Two sided rejection region, df=4')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"T4_fig"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Confidence intervals\n",
"\n",
"* Instead of testing a particular hypothesis, we might be interested\n",
"in coming up with a reasonable range for the mean of our \"box\".\n",
"\n",
"* Statistically, this is done via a *confidence interval*.\n",
"\n",
"* If the 5% cutoff is $q$ for our test, then the 95% confidence interval is\n",
"$$\n",
"[\\bar{X} - q S_X / \\sqrt{n}, \\bar{X} + q S_X / \\sqrt{n}]\n",
"$$\n",
"where we recall $q=t_{n-1,0.975}$ with $n=20$. \n",
"\n",
"* If we wanted 90% confidence interval, we would use $q=t_{19,0.95}$. Why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"cutoff = qt(0.975, 19)\n",
"L = mean(mysample) - cutoff*sd(mysample)/sqrt(20)\n",
"U = mean(mysample) + cutoff*sd(mysample)/sqrt(20)\n",
"data.frame(L, U) \n",
"t.test(mysample)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Note that the endpoints above depend on the data. Not every interval will cover\n",
"the true mean of our \"box\" which is 1.5. Let's take a look at 100 intervals of size 90%. We would expect\n",
"that roughly 90 of them cover 1.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"cutoff = qt(0.975, 19)\n",
"L = c()\n",
"U = c()\n",
"covered = c()\n",
"box = -3:6\n",
"for (i in 1:100) {\n",
" mysample = sample(box, 20, replace=TRUE)\n",
" l = mean(mysample) - cutoff*sd(mysample)/sqrt(20)\n",
" u = mean(mysample) + cutoff*sd(mysample)/sqrt(20)\n",
" L = c(L, l)\n",
" U = c(U, u)\n",
" covered = c(covered, (l < mean(box)) * (u > mean(box)))\n",
"}\n",
"sum(covered)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A useful picture is to plot all these intervals so we can see the randomness\n",
"in the intervals, while the true mean of the box is unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"mu = 1.5\n",
"plot(c(1, 100), c(-2.5+mu,2.5+mu), type='n', ylab='Confidence Intervals', xlab='Sample')\n",
"for (i in 1:100) {\n",
" if (covered[i] == TRUE) {\n",
" lines(c(i,i), c(L[i],U[i]), col='green', lwd=2)\n",
" }\n",
" else {\n",
" lines(c(i,i), c(L[i],U[i]), col='red', lwd=2)\n",
" } \n",
"}\n",
"abline(h=mu, lty=2, lwd=4)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Blood pressure example\n",
"\n",
"* A study was conducted to study the effect of calcium supplements\n",
"on blood pressure.\n",
"\n",
"\n",
"\n",
"* We had loaded the data above, storing the two samples in the variables `treated` and `placebo`.\n",
"\n",
"* Some questions might be:\n",
" - What is the mean decrease in BP in the treated group? placebo group?\n",
" - What is the median decrease in BP in the treated group? placebo group?\n",
" - What is the standard deviation of decrease in BP in the treated group? placebo group?\n",
" - Is there a difference between the two groups? Did BP decrease more in the treated group?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"summary(treated)\n",
"summary(placebo)\n",
"boxplot(Decrease ~ Treatment, col='orange', pch=23, bg='red')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## A hypothesis test\n",
"\n",
"In our setting, we have two groups that we have reason to believe are \n",
"different.\n",
"\n",
"* We have two samples:\n",
" - $(X_1, \\dots, X_{10})$ (`treated`)\n",
" - $(Z_1, \\dots, Z_{11})$ (`placebo`)\n",
" \n",
"* We can answer this statistically by testing the null hypothesis \n",
"$$H_0:\\mu_X = \\mu_Z.$$\n",
"\n",
"* If variances are equal, the *pooled $t$-test* is appropriate."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pooled $t$ test\n",
"\n",
"* The test statistic is $$ T = \\frac{\\overline{X} - \\overline{Z} - 0}{S_P \\sqrt{\\frac{1}{10} + \\frac{1}{11}}}, \\qquad S^2_P = \\frac{9 \\cdot S^2_X + 10 \\cdot S^2_Z}{19}.$$\n",
" \n",
"* For two-sided test at level $\\alpha=0.05$, reject if $|T| > t_{19, 0.975}$.\n",
" \n",
"* Confidence interval: for example, a $90\\%$ confidence interval\n",
"for $\\mu_X-\\mu_Z$ is $$ \\overline{X}-\\overline{Z} \\pm S_P \\sqrt{\\frac{1}{10} + \\frac{1}{11}} \\cdot t_{19,0.95}.$$\n",
"\n",
"* T statistic has the same form as before!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"sdP = sqrt((9*sd(treated)^2 + 10*sd(placebo)^2)/19)\n",
"T = (mean(treated)-mean(placebo)-0) / (sdP * sqrt(1/10+1/11))\n",
"c(T, cutoff)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`R` has a builtin function to perform such $t$-tests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t.test(treated, placebo, var.equal=TRUE)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"If we don't make the assumption of equal variance, `R` will give a slightly different result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t.test(treated, placebo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pooled estimate of variance\n",
"\n",
"* The rule for the $SD$ of differences is\n",
" $$\n",
" SD(\\overline{X}-\\overline{Z}) = \\sqrt{SD(\\overline{X})^2+SD(\\overline{Z})^2}$$\n",
" \n",
"* By this rule, we might take our estimate to be\n",
" $$\n",
" \\widehat{SD(\\overline{X}-\\overline{Z})} = \\sqrt{\\frac{S^2_X}{10} + \\frac{S^2_Z}{11}}.\n",
" $$\n",
" \n",
"* The pooled estimate assumes $\\mathbb{E}(S^2_X)=\\mathbb{E}(S^2_Z)=\\sigma^2$ and replaces\n",
" the $S^2$'s above with $S^2_P$, a better estimate of\n",
" $\\sigma^2$ than either $S^2_X$ or $S^2_Z$.\n",
"\n",
"## Where do we get $df=19$?\n",
"\n",
"Well, the $X$ sample has $10-1=9$ degrees of freedom\n",
" to estimate $\\sigma^2$ while the $Z$ sample\n",
" has $11-1=10$ degrees of freedom.\n",
" \n",
"Therefore, the total degrees of freedom is $9+10=19$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Our first regression model\n",
"\n",
"* We can put the two samples together:\n",
" $$Y=(X_1,\\dots, X_{10}, Z_1, \\dots, Z_{11}).$$\n",
"\n",
"* Under the same assumptions as the pooled $t$-test:\n",
" $$\n",
" \\begin{aligned}\n",
" Y_i &\\sim N(\\mu_i, \\sigma^2)\\\\\n",
" \\mu_i &=\n",
" \\begin{cases}\n",
" \\mu_X & 1 \\leq i \\leq 10 \\\\ \\mu_Z & 11 \\leq i \\leq 21.\n",
" \\end{cases}\n",
" \\end{aligned}\n",
" $$\n",
" \n",
"* This is a (regression) model for the sample $Y$. The\n",
" (qualitative) variable `Treatment` is\n",
" called a *covariate* or *predictor*.\n",
" \n",
"* The decrease in BP is the *outcome*.\n",
"\n",
"* We assume that the relationship between treatment and average\n",
" decrease in BP is simple: it depends only on which group a subject is in.\n",
" \n",
"* This relationship is *modelled* through the mean\n",
" vector $\\mu=(\\mu_1, \\dots, \\mu_{21})$.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"print(summary(lm(Decrease ~ Treatment)))\n",
"print(sdP*sqrt(1/10+1/11))\n",
"print(sdP)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}