5_linear_regression

Linear Regression

Table of Contents

1. Linear Regression
  1.1. R-squared
  1.2. Test for significance
    1.2.1. When to use a t-test?
    1.2.2. Which t-test?
    1.2.3. How to do a t-test?
    1.2.4. p-value from t-test
    1.2.5. t-test critical values
    1.2.6. One-sample t-test
    1.2.7. Two-sample t-test
    1.2.8. Two-sample t-test if variances are equal
    1.2.9. Two-sample t-test if variances are unequal (Welch's t-test)
    1.2.10. Paired t-test
    1.2.11. t-test vs. z-test
  1.3. Multicollinearity
    1.3.1. How to detect multicollinearity?
  1.4. Feature interaction
  1.5. Simpson's paradox

1. Linear Regression
• Linear regression answers the question: how do we find the line of best fit for the data?

y = 𝛽_{0} + 𝛽_{1} x_{1} + 𝛽_{2} x_{2} + \dots + 𝛽_{n} x_{n} + 𝜀

• The challenge is to find the line that summarizes the data best (for ordinary least squares, the line that minimizes the sum of squared residuals).
• Note: As a reminder, if the confidence interval for a coefficient contains zero, then that coefficient is not statistically significant at the corresponding level.
  – A confidence interval that contains zero does not prove that there is no treatment effect; it means we are uncertain whether there is one.
  – Having zero in one's confidence interval implies that the treatment could have a positive or a negative effect on the outcome of interest.

1.1. R-squared
• R^{2} (the coefficient of determination) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

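\begin{array}{c}
SS_{res} = \sum_{i} (y_{i} - f_{i})^{2} = \sum_{i} e_{i}^{2} \\
SS_{tot} = \sum_{i} (y_{i} - \bar{y})^{2} \\
R^{2} = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{var(errors)}{var(y)} \\
adj \; R^{2} = 1 - \frac{n - 1}{n - p - 1} (1 - R^{2})
\end{array}

• Here f_{i} is the fitted value for observation i, n is the number of observations, and p is the number of predictors (excluding the intercept). With a single predictor the adjustment factor becomes \frac{n - 1}{n - 2}.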
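The R-squared formulas can be sketched in a few lines of plain Python. This is only an illustration: the observations `y`, the fitted values `f`, and the function names are hypothetical, and the adjusted version assumes `n` observations and `p` predictors (so the factor is (n - 1)/(n - 2) when there is one predictor).

```python
def r_squared(y, f):
    """R^2 = 1 - SS_res / SS_tot for observations y and fitted values f."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)


# Hypothetical observations and fitted values from some one-predictor model.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
f = [2.0, 4.0, 6.0, 8.0, 10.0]

r2 = r_squared(y, f)
print("R^2:", r2)
print("adjusted R^2:", adjusted_r_squared(r2, n=len(y), p=1))
```

Predicting the mean of `y` for every observation gives R^2 = 0, and perfect predictions give R^2 = 1, which is a quick sanity check for the function.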

1.2. Test for significance
• We test for significance by performing a t-test on the regression coefficients.
• In other words, having observed a correlation in the sample, we test a claim about the population regression line.
• We carry out a t-test for the slope by calculating the p-value and comparing it with the desired significance level.
• The hypotheses are:
  – H_{0} : 𝛽_{i} = 0 → the coefficient is equal to zero
  – H_{a} : 𝛽_{i} \neq 0 → the coefficient is NOT equal to zero
• The t-statistic is calculated as follows:

\begin{array}{c}
SE_{coef} = \sqrt{\frac{\frac{1}{n - 2} \sum_{i} (\hat{y}_{i} - y_{i})^{2}}{\sum_{i} (x_{i} - \bar{x})^{2}}} \\
t\text{-statistic} = \frac{coef}{SE_{coef}} \\
p\text{-value} = \text{sum of the two tail areas beyond } \pm | t\text{-statistic} | \text{ under the t-distribution with } n - 2 \text{ degrees of freedom} \\
95\% \; CI = [coef - t_{0.975, n - 2} \cdot SE_{coef}, \; coef + t_{0.975, n - 2} \cdot SE_{coef}] \approx [coef - 1.96 \cdot SE_{coef}, \; coef + 1.96 \cdot SE_{coef}] \text{ for large } n
\end{array}
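As a rough sketch of the slope test for a single predictor, the snippet below fits the line by least squares and then computes the standard error, the t-statistic, and the degrees of freedom (n - 2). The data and the function name `slope_t_statistic` are made up for illustration; the p-value would then come from the t-distribution with n - 2 degrees of freedom.

```python
import math


def slope_t_statistic(x, y):
    """Return (slope, standard error, t-statistic, degrees of freedom)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least-squares estimates of the slope and the intercept.
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    # Standard error of the slope: sqrt( (SS_res / (n - 2)) / S_xx ).
    se = math.sqrt(ss_res / (n - 2) / s_xx)
    return slope, se, slope / se, n - 2


# Hypothetical data that roughly follow y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

slope, se, t_stat, dof = slope_t_statistic(x, y)
print("slope:", slope, "SE:", se, "t:", t_stat, "df:", dof)
```

A large absolute t-statistic relative to the critical value for the chosen significance level is evidence against H_{0} : 𝛽_{i} = 0.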

1.2.1. When to use a t-test?
• A t-test is one of the most popular statistical tests for location, i.e., it deals with the mean value(s) of the population(s).
• There are different types of t-tests that you can perform:
  – One-sample t-test
  – Two-sample t-test
  – Paired t-test
• Note: Remember that a t-test can only be used for one or two groups. If you need to compare three (or more) means, use the analysis of variance (ANOVA) method.
• The t-test is a parametric test, meaning that your data has to fulfill some assumptions:
  – The data points are independent; AND
  – The data, at least approximately, follow a normal distribution.
• Note: If your sample doesn't fit these assumptions, you can resort to non-parametric alternatives, e.g., the Mann–Whitney U test (a.k.a. the Wilcoxon rank-sum test), the Wilcoxon signed-rank test, or the sign test.

1.2.2. Which t-test?
• Your choice of t-test depends on whether you are studying one group or two groups:
  – One-sample t-test
    * Choose the one-sample t-test to check if the mean of a population is equal to some pre-set hypothesized value.
    * Example: The average volume of a drink sold in 0.33 l cans - is it really equal to 330 ml?
    * Example: The average weight of people from a specific city - is it different from the national average?
  – Two-sample t-test
    * Choose the two-sample t-test to check if the difference between the means of two populations is equal to some pre-determined value, when the two samples have been chosen independently of each other.
    * In particular, you can use this test to check whether the two groups are different from one another.
    * Example: The average difference in weight gain in two groups of people: one group was on a high-carb diet and the other on a high-fat diet.
    * Example: The average difference in the results of a math test from students at two different universities.
    * Note: This test is sometimes referred to as an independent samples t-test, or an unpaired samples t-test.
  – Paired t-test
    * A paired t-test is used to investigate the change in the mean of a population before and after some experimental intervention, based on a paired sample, i.e., when each subject has been measured twice: before and after treatment.
    * In particular, you can use this test to check whether, on average, the treatment has had any effect on the population.
    * Example: The change in student test performance before and after taking a course.
    * Example: The change in blood pressure in patients before and after administering some drug.

1.2.3. How to do a t-test?
• Decide on the alternative hypothesis
  – Use a two-tailed t-test if you only care whether the population's mean (or, in the case of two populations, the difference between the populations' means) agrees or disagrees with the pre-set value.
  – Use a one-tailed t-test if you want to test whether this mean (or difference in means) is greater/less than the pre-set value.
• Compute your t-score value
  – Formulas for the test statistic in t-tests include the sample size, as well as its mean and standard deviation.
The exact formula depends on the t-test type - check the sections dedicated to each particular test for more details.
• Determine the degrees of freedom for the t-test
  – The degrees of freedom are the number of observations in a sample that are free to vary as we estimate statistical parameters. In the simplest case, the number of degrees of freedom equals your sample size minus the number of parameters you need to estimate. Again, the exact formula depends on the t-test you want to perform - check the sections below for details.
• The degrees of freedom are essential, as they determine the distribution followed by your t-score (under the null hypothesis).
• If there are d degrees of freedom, then the distribution of the test statistic is the t-Student distribution with d degrees of freedom.
• This distribution has a shape similar to N(0, 1) (bell-shaped and symmetric) but has heavier tails.
• Note: If the number of degrees of freedom is large (> 30), which generically happens for large samples, the t-Student distribution is practically indistinguishable from N(0, 1).

Figure 1: Density of the t-distribution with 𝜈 degrees of freedom
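To make the comparison with N(0, 1) concrete, the following sketch evaluates the t-Student density for a few degrees of freedom and measures how far it is from the standard normal density. The helper names and the evaluation grid are arbitrary choices for this illustration, not part of the original notes.

```python
import math


def t_pdf(u, d):
    """Density of the t-Student distribution with d degrees of freedom."""
    log_norm = (math.lgamma((d + 1) / 2) - math.lgamma(d / 2)
                - 0.5 * math.log(d * math.pi))
    return math.exp(log_norm - (d + 1) / 2 * math.log1p(u * u / d))


def normal_pdf(u):
    """Density of the standard normal distribution N(0, 1)."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)


def max_gap(d, lo=-6.0, hi=6.0, steps=1200):
    """Largest absolute difference between the two densities on a grid."""
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(abs(t_pdf(u, d) - normal_pdf(u)) for u in grid)


for d in (1, 5, 30, 100):
    # The gap shrinks as d grows; the tail ratio at u = 3 stays above 1.
    print(d, max_gap(d), t_pdf(3.0, d) / normal_pdf(3.0))
```

The shrinking gap illustrates why the two distributions are treated as interchangeable for large degrees of freedom, while the tail ratio above 1 reflects the heavier tails.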

• Fun Fact: The t-Student distribution owes its name to William Sealy Gosset, who, in 1908, published his paper on the t-test under the pseudonym "Student". Gosset worked at the famous Guinness Brewery in Dublin, Ireland, and devised the t-test as an economical way to monitor the quality of beer.

1.2.4. p-value from t-test
• Recall that the p-value is the probability (calculated under the assumption that the null hypothesis is true) that the test statistic will produce values at least as extreme as the t-score produced for your sample.
• As probabilities correspond to areas under the density function, the p-value from a t-test can be nicely illustrated with the help of the following pictures:

• The following formulae say how to calculate the p-value from a t-test, where CDF_{t, d} denotes the cumulative distribution function (CDF) of the t-Student distribution with d degrees of freedom:
  – p-value from left-tailed t-test → CDF_{t, d} (t_{score})
  – p-value from right-tailed t-test → 1 - CDF_{t, d} (t_{score})
  – p-value from two-tailed t-test → 2 \times CDF_{t, d} (- | t_{score} |) or, equivalently, 2 - 2 \times CDF_{t, d} (| t_{score} |)
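The three p-value formulas can be tried out with a small dependency-free sketch. The CDF here is approximated by numerically integrating the t density with Simpson's rule, which is an assumption made for this example only; a statistics library (for instance `scipy.stats.t.cdf`) would normally be used instead. The function names and the `tail` argument are hypothetical.

```python
import math


def t_pdf(u, d):
    """Density of the t-Student distribution with d degrees of freedom."""
    log_norm = (math.lgamma((d + 1) / 2) - math.lgamma(d / 2)
                - 0.5 * math.log(d * math.pi))
    return math.exp(log_norm - (d + 1) / 2 * math.log1p(u * u / d))


def t_cdf(t, d, steps=2000):
    """Approximate CDF: 0.5 plus/minus the Simpson integral over [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, d) + t_pdf(a, d)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, d)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def p_value(t_score, d, tail="two"):
    """p-value for a left-, right-, or two-tailed t-test."""
    if tail == "left":
        return t_cdf(t_score, d)
    if tail == "right":
        return 1 - t_cdf(t_score, d)
    if tail == "two":
        return 2 * t_cdf(-abs(t_score), d)
    raise ValueError("tail must be 'left', 'right' or 'two'")


# Hypothetical t-score with 10 degrees of freedom.
for tail in ("left", "right", "two"):
    print(tail, p_value(2.228, 10, tail))
```

For a positive t-score the two-tailed p-value is twice the right-tailed one, which follows from the symmetry of the distribution.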

• Note: The CDF of the t-distribution is given by a somewhat complicated formula.
  – In practice, you would look up approximate CDF values in statistical tables or use statistical software.

1.2.5. t-test critical values
• Recall that in the critical values approach to hypothesis testing, you need to set a significance level, 𝛼, before computing the critical values, which in turn give rise to critical regions (a.k.a. rejection regions).
• Formulas for critical values employ the quantile function of the t-distribution, i.e., the inverse of the CDF:
  – Critical value for left-tailed t-test → CDF_{t, d}^{- 1} (𝛼)
    * Critical region → (- \infty, CDF_{t, d}^{- 1} (𝛼)]
  – Critical value for right-tailed t-test → CDF_{t, d}^{- 1} (1 - 𝛼)
    * Critical region → [CDF_{t, d}^{- 1} (1 - 𝛼), \infty)
  – Critical value for two-tailed t-test → \pm CDF_{t, d}^{- 1} (1 - 𝛼 ⁄ 2)
    * Critical region → (- \infty, - CDF_{t, d}^{- 1} (1 - 𝛼 ⁄ 2)] \cup [CDF_{t, d}^{- 1} (1 - 𝛼 ⁄ 2), \infty)
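A minimal sketch of the critical-value approach, again without external libraries: the quantile function is obtained by bisection on a numerically integrated CDF. Both the integration scheme and the names `t_quantile` and `in_critical_region` are assumptions for this illustration; with SciPy available, `scipy.stats.t.ppf` would give the quantile directly.

```python
import math


def t_pdf(u, d):
    """Density of the t-Student distribution with d degrees of freedom."""
    log_norm = (math.lgamma((d + 1) / 2) - math.lgamma(d / 2)
                - 0.5 * math.log(d * math.pi))
    return math.exp(log_norm - (d + 1) / 2 * math.log1p(u * u / d))


def t_cdf(t, d, steps=1000):
    """Approximate CDF via Simpson's rule on [0, |t|] plus symmetry."""
    a = abs(t)
    h = a / steps  # steps must be even
    total = t_pdf(0.0, d) + t_pdf(a, d)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, d)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def t_quantile(p, d, tol=1e-9):
    """Inverse CDF by bisection; assumes 0 < p < 1."""
    lo, hi = -1.0, 1.0
    while t_cdf(lo, d) > p:  # widen the bracket to the left if needed
        lo *= 2
    while t_cdf(hi, d) < p:  # widen the bracket to the right if needed
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_cdf(mid, d) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def in_critical_region(t_score, alpha, d, tail="two"):
    """True if the t-score falls in the rejection region."""
    if tail == "left":
        return t_score <= t_quantile(alpha, d)
    if tail == "right":
        return t_score >= t_quantile(1 - alpha, d)
    if tail == "two":
        return abs(t_score) >= t_quantile(1 - alpha / 2, d)
    raise ValueError("tail must be 'left', 'right' or 'two'")


# Hypothetical test: alpha = 0.05, 10 degrees of freedom, t-score 2.5.
print("two-tailed critical value:", t_quantile(0.975, 10))
print("reject H0:", in_critical_region(2.5, 0.05, 10, "two"))
```

Bisection is slow compared with library routines but keeps the example short and makes the "inverse of the CDF" idea explicit.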

• Note: To decide the fate of the null hypothesis, just check if your t-score lies within the critical region:
  – If your t-score belongs to the critical region, reject the null hypothesis and accept the alternative hypothesis.
  – If your t-score is outside the critical region, then you don't have enough evidence to reject the null hypothesis.

1.2.6. One-sample t-test
• The null hypothesis is that the population mean is equal to some value 𝜇_{0}.
• The alternative hypothesis is that the population mean is:
  – different from 𝜇_{0};
  – smaller than 𝜇_{0}; or
  – greater than 𝜇_{0}.

\begin{array}{c}
t = \frac{\bar{x} - 𝜇_{0}}{s} \cdot \sqrt{n} \\
𝜇_{0} \to \text{mean postulated in } H_{0} \\
n \to \text{sample size} \\
\bar{x} \to \text{sample mean} \\
s \to \text{sample standard deviation}
\end{array}
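The one-sample t-score can be computed directly from this formula. The sketch below uses Python's `statistics` module; the can volumes are invented numbers that echo the 330 ml example, and the function name is hypothetical. It also returns n - 1, the degrees of freedom of the one-sample test.

```python
import math
import statistics


def one_sample_t(sample, mu_0):
    """Return (t-score, degrees of freedom) for H0: population mean = mu_0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (x_bar - mu_0) / s * math.sqrt(n)
    return t, n - 1


# Hypothetical measured volumes (in ml) of cans labelled 330 ml.
volumes_ml = [329.1, 330.4, 328.7, 331.0, 329.5, 330.2, 328.9, 329.8]

t_score, dof = one_sample_t(volumes_ml, mu_0=330.0)
print("t-score:", t_score, "degrees of freedom:", dof)
```

A negative t-score here means the sample mean lies below the postulated 330 ml; whether that is significant depends on the p-value or critical value for the chosen tail.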

• Note: The number of degrees of freedom in a one-sample t-test is n - 1.

1.2.7. Two-sample t-test
• The null hypothesis is that the actual difference between the two groups' means, 𝜇_{1} and 𝜇_{2}, is equal to some pre-set value, 𝛥.
• The alternative hypothesis is that the difference 𝜇_{1} - 𝜇_{2} is:
  – different from 𝛥;
  – smaller than 𝛥; or
  – greater than 𝛥.
• In particular, if this pre-determined difference is zero (𝛥 = 0), the null hypothesis is that the population means are equal.
• The alternative hypothesis is then that:
  – 𝜇_{1} and 𝜇_{2} are different from one another;
  – 𝜇_{1} is smaller than 𝜇_{2}; or
  – 𝜇_{1} is greater than 𝜇_{2}.
• Note: Formally, to perform a t-test, we should additionally assume that the variances of the two populations are equal (this assumption is called the homogeneity of variance).
• There is a version of the t-test that can be applied without the assumption of homogeneity of variance: it is called Welch's t-test. For your convenience, we describe both versions.

1.2.8. Two-sample t-test if variances are equal
• Use this test if you know that the two populations' variances are the same (or very similar).

\begin{array}{c}
t = \frac{\bar{x}_{1} - \bar{x}_{2} - 𝛥}{s_{p} \cdot \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} \\
s_{p} \to \text{pooled standard deviation} \\
s_{p} = \sqrt{\frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}}
\end{array}
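Here is a short sketch of the equal-variance (pooled) two-sample t-score. The pooled variance divides by n_{1} + n_{2} - 2, which is also the number of degrees of freedom of the test. The samples and the function name `pooled_t` are made up for the example.

```python
import math
import statistics


def pooled_t(sample_1, sample_2, delta=0.0):
    """Return (t-score, degrees of freedom) assuming equal population variances."""
    n1, n2 = len(sample_1), len(sample_2)
    var1 = statistics.variance(sample_1)  # s_1^2 with n_1 - 1 denominator
    var2 = statistics.variance(sample_2)  # s_2^2 with n_2 - 1 denominator

    # Pooled standard deviation.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    diff = statistics.mean(sample_1) - statistics.mean(sample_2) - delta
    t = diff / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical weight gains (kg) under two diets.
high_carb = [1.2, 0.8, 1.5, 1.1, 0.9, 1.4]
high_fat = [0.7, 1.0, 0.6, 0.9, 0.5, 0.8]

t_score, dof = pooled_t(high_carb, high_fat)
print("t-score:", t_score, "degrees of freedom:", dof)
```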

• Note: The number of degrees of freedom in a t-test (two samples, equal variances) is n_{1} + n_{2} - 2.

1.2.9. Two-sample t-test if variances are unequal (Welch's t-test)
• Two-sample Welch's t-test formula if variances are unequal:

t = \frac{{\bar{x}}_{1} - {\bar{x}}_{2} - 𝛥}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}

• Note: The number of degrees of freedom in Welch's t-test (two-sample t-test with unequal variances) is difficult to determine exactly. We can approximate it with the help of the following Satterthwaite formula:

\frac{(\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2} ⁄ n_{1})^{2}}{n_{1} - 1} + \frac{(s_{2}^{2} ⁄ n_{2})^{2}}{n_{2} - 1}}
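Welch's t-score and the Satterthwaite approximation translate almost line by line into code. The sketch below is illustrative only; the sample values and the name `welch_t` are hypothetical. Note that the approximate degrees of freedom are generally not an integer.

```python
import math
import statistics


def welch_t(sample_1, sample_2, delta=0.0):
    """Return (t-score, Satterthwaite degrees of freedom) for Welch's t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    a = statistics.variance(sample_1) / n1  # s_1^2 / n_1
    b = statistics.variance(sample_2) / n2  # s_2^2 / n_2

    diff = statistics.mean(sample_1) - statistics.mean(sample_2) - delta
    t = diff / math.sqrt(a + b)

    # Satterthwaite approximation of the degrees of freedom.
    dof = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, dof


# Hypothetical test scores from two universities with different spreads.
uni_a = [72, 75, 71, 74, 73, 76, 72]
uni_b = [60, 85, 70, 95, 55]

t_score, dof = welch_t(uni_a, uni_b)
print("t-score:", t_score, "approximate degrees of freedom:", dof)
```

When both samples have the same size and the same variance, the approximation returns n_{1} + n_{2} - 2, matching the equal-variance test.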

• Alternatively, you can take the smaller of n_{1} - 1 and n_{2} - 1 as a conservative estimate for the number of degrees of freedom.
• Fun Fact: The Satterthwaite formula for the degrees of freedom can be rewritten as a scaled weighted harmonic mean of the degrees of freedom of the respective samples, n_{1} - 1 and n_{2} - 1, with weights proportional to (s_{1}^{2} ⁄ n_{1})^{2} and (s_{2}^{2} ⁄ n_{2})^{2}.

1.2.10. Paired t-test
• As we commonly perform a paired t-test when we have data about the same subjects measured twice (before and after some treatment), let us adopt the convention of referring to the samples as the pre-group and post-group.
• The null hypothesis is that the true difference between the means of the pre and post populations is equal to some pre-set value, 𝛥.
• The alternative hypothesis is that the actual difference between these means is:
  – different from 𝛥;
  – smaller than 𝛥; or
  – greater than 𝛥.
• Typically, this pre-determined difference is zero. We can then reformulate the hypotheses as follows:
  – The null hypothesis is that the pre and post means are the same, i.e., the treatment has no impact on the population.
  – The alternative hypothesis is one of:
    * The pre and post means are different from one another (treatment has some effect);
    * The pre mean is smaller than the post mean (treatment increases the result); or
    * The pre mean is greater than the post mean (treatment decreases the result).
• In fact, a paired t-test is technically the same as a one-sample t-test! Let us see why. Let x_{1}, \dots, x_{n} be the pre observations and y_{1}, \dots, y_{n} the respective post observations. That is, x_{i} and y_{i} are the before and after measurements of the i-th subject.
• For each subject, compute the difference, d_{i} = x_{i} - y_{i}. All that happens next is just a one-sample t-test performed on the sample of differences d_{1}, \dots, d_{n}. Take a look at the formula for the t-score:

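\begin{array}{c}
t = \frac{\bar{d} - 𝛥}{s_{d}} \cdot \sqrt{n} \\
\bar{d} \to \text{mean of the differences } d_{i} = x_{i} - y_{i} \\
s_{d} \to \text{sample standard deviation of the differences} \\
n \to \text{number of pairs}
\end{array}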

• Note: The number of degrees of freedom in a paired t-test is n - 1.
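Because the paired test reduces to a one-sample test on the differences, a sketch needs only a few lines. The blood pressure readings below are invented, and `paired_t` is a hypothetical helper; the differences are taken as pre minus post, following the convention d_{i} = x_{i} - y_{i}.

```python
import math
import statistics


def paired_t(pre, post, delta=0.0):
    """Return (t-score, degrees of freedom) for a paired t-test."""
    if len(pre) != len(post):
        raise ValueError("pre and post must contain one value per subject")
    diffs = [x - y for x, y in zip(pre, post)]  # d_i = x_i - y_i
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    t = (d_bar - delta) / s_d * math.sqrt(n)
    return t, n - 1


# Hypothetical systolic blood pressure before and after a drug.
before = [142, 138, 150, 145, 139, 148]
after = [135, 136, 141, 140, 137, 139]

t_score, dof = paired_t(before, after)
print("t-score:", t_score, "degrees of freedom:", dof)
```

With this sign convention, a positive t-score means the pre values are larger on average, i.e., the treatment lowered the measurement.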

1.2.11. t-test vs. z-test
• We use a z-test when we want to test the population mean of a normally distributed dataset that has a known population variance. If the number of degrees of freedom is large, then the t-Student distribution is very close to N(0, 1).
• Hence, if there are many data points (at least 30), you may swap a t-test for a z-test, and the results will be almost identical. However, for small samples with unknown variance, remember to use the t-test because, in such a case, the t-Student distribution differs significantly from N(0, 1)!

1.3. Multicollinearity
• Multicollinearity happens when there is a correlation between some of the independent variables. In other words, some of the independent variables are not that independent.
• Collinearity won't affect the performance of the model → The R^{2} remains unchanged.
  – Also, the model can still make effective predictions.
  – However, the individual coefficient estimates become unstable (their standard errors are inflated), so the way we interpret the coefficients will have to change.

1.3.1. How to detect multicollinearity?
• We can look at the VIF (Variance Inflation Factor) of each feature.
  – The VIF of feature j measures how well that feature is explained by the other features: VIF_{j} = \frac{1}{1 - R_{j}^{2}}, where R_{j}^{2} is the R^{2} obtained by regressing feature j on all the other features.
    * VIF = 1 → no collinearity
    * 1 < VIF < 5 → moderate collinearity
    * VIF \geq 5 → severe collinearity → needs a mitigation strategy, such as centering the features (which helps when the collinearity comes from interaction or polynomial terms) or removing one of the correlated features.

1.4. Feature interaction
• An interaction term is simply the product of two features.
• After introducing the interaction term, if the R^{2} goes up (preferably the adjusted R^{2}, since the plain R^{2} never decreases when a term is added) and the p-value of the interaction term is significant → then you can be reasonably confident that the two features are in fact interacting.
• Can we multiply a feature by itself?
  – Yes! → But why would we want to do that?
  – Because now we can fit polynomial relationships.
• Note: When adding interaction terms, take care not to overfit the data.

1.5. Simpson's paradox
• Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
• A good way to avoid it is to include in your model the dimensions (features) that segment the data you're trying to predict.

Figure 2: Simpson's paradox
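Simpson's paradox is easy to reproduce with a tiny synthetic dataset. In the sketch below (all numbers are invented for illustration), the least-squares slope is negative inside each group but positive once the groups are pooled, because the groups sit at different levels of both x and y.

```python
def slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Synthetic groups: within each group y falls as x rises.
group_a_x, group_a_y = [1, 2, 3, 4], [6, 5, 4, 3]
group_b_x, group_b_y = [6, 7, 8, 9], [11, 10, 9, 8]

slope_a = slope(group_a_x, group_a_y)
slope_b = slope(group_b_x, group_b_y)

# Pooling the groups hides the segmenting variable and flips the trend.
slope_all = slope(group_a_x + group_b_x, group_a_y + group_b_y)

print("group A slope:", slope_a)
print("group B slope:", slope_b)
print("combined slope:", slope_all)
```

Adding the group label as a feature (or fitting each group separately) recovers the within-group trend, which is the point of segmenting the data.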