Given a set of points
\(\{(x_{1},y_{1}), (x_{2},y_{2}), \ldots, (x_{n},y_{n})\}\) and a line
\(\hat{y}=a+bx\), we calculate:
\(\bar{x}\) and \(s_{x}\) for \(\{x_{1}, x_{2}, \ldots, x_{n}\}\), and
\(\bar{y}\) and \(s_{y}\) for \(\{y_{1}, y_{2}, \ldots, y_{n}\}\)
the expected values \(\hat{y}_{i}=a+bx_{i}\), for each \(x_{i}\),
\(1 \le i \le n\)
the \(i^{th}\) residual (the \(i^{th}\) error):
\(\epsilon_{i}=y_{i}-\hat{y}_{i}\), \(1 \le i \le n\)
the Sum of the Squared Errors (\(SSE\)): \(\sum_{i=1}^{n} \epsilon_{i}^{2}\)
The values of \(a\) and \(b\) that minimize the \(SSE\) are
\(\displaystyle{b=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\)
and \(a=\bar{y}-b\bar{x}\)
The line \(\hat{y}=a+bx\) with these values of \(a\) and \(b\), which minimizes
the Sum of the Squared Errors, is called the line of best fit or the
least-squares regression line.
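A minimal computational sketch of these formulas, assuming Python with NumPy;
the data points are made up purely for illustration:
\begin{verbatim}
import numpy as np

# made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

y_hat = a + b * x              # expected values
residuals = y - y_hat          # epsilon_i
sse = np.sum(residuals ** 2)   # Sum of the Squared Errors

print(f"a = {a:.4f}, b = {b:.4f}, SSE = {sse:.4f}")
\end{verbatim}
For these made-up points the sketch gives \(b=1.99\) and \(a=0.05\).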
A measure of how well the least-squares regression line fits the set of points
is the correlation coefficient, \(r\)
values of \(r\) closer to \(-1\) or \(1\) indicate a stronger linear
relationship (called correlation)
\(r=0\) would indicate no linear relationship
\(r=1\) or \(r=-1\) would indicate that all data points lie on the same
line
\(r<0\) or \(r>0\) indicate that the least-squares regression line has
a negative or positive slope, respectively
The square of the correlation coefficient, \(r^{2}\), is called the
coefficient of determination
\(r^{2}\) (expressed as a percentage or a proportion) represents that
portion of variation in the dependent variable, \(y\), that is due to
variation in the independent variable, \(x\)
\(1-r^{2}\) (expressed as a percentage or a proportion) represents that
portion of variation in the dependent variable, \(y\), that is not due to
variation in the independent variable, \(x\)
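A short sketch of \(r\) and \(r^{2}\) on the same made-up data; np.corrcoef
returns the correlation matrix, and the off-diagonal entry is \(r\):
\begin{verbatim}
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
r_sq = r ** 2                 # coefficient of determination

print(f"r = {r:.4f}")
print(f"r^2 = {r_sq:.4f}    (proportion of variation in y due to x)")
print(f"1 - r^2 = {1 - r_sq:.4f}  (proportion not due to x)")
\end{verbatim}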
12.4 Testing the Significance of the Correlation Coefficient
recommended:
28-30
We can perform a hypothesis test of the significance of the correlation
coefficient, \(r\), to decide whether the linear relationship in the sample data
is strong enough to model the relationship in the population.
let \(\rho\) be the (unknown) correlation coefficient of the population and set
up a two-tailed test (\(t\) test) with level of significance \(\alpha=0.05\)
\(H_{0}: \rho=0\)
\(H_{1}: \rho \neq 0\)
test statistic (\(t\)-score)
\(\displaystyle{t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}}\)
\(df=n-2\)
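A sketch of this test on the same made-up data, assuming SciPy is available;
stats.t.sf gives the upper-tail probability of the \(t\) distribution, so
doubling it yields the two-tailed \(p\)-value:
\begin{verbatim}
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
alpha = 0.05

r = np.corrcoef(x, y)[0, 1]
t_score = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
df = n - 2
p_value = 2 * stats.t.sf(abs(t_score), df)  # two-tailed

if p_value < alpha:
    print(f"t = {t_score:.3f}, p = {p_value:.4g}: reject H0; r is significant")
else:
    print(f"t = {t_score:.3f}, p = {p_value:.4g}: do not reject H0")
\end{verbatim}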
If we "reject \(H_{0}\)", then we sah that "\(r\) is significant".
If \(r\) is significant and the scatterplot shows a linear trend, then the
least-squares regression line can be used to predict values of \(y\) for
values of \(x\) that lie within the observed values of \(x\); otherwise, the
least-squares regression line should not be used to predict values of \(y\)
the least-squares regression line may not be appropriate or reliable for
prediction outside the observed values of \(x\), even if \(r\) is significant
and the scatterplot shows a linear trend
Assumptions in testing the significance of the correlation coefficient:
there is a linear relationship in the population that models the average
value of \(y\) in terms of the value of \(x\)
the \(y\)-values for any given \(x\) value are normally distributed about
the value of the least-squares regression line at \(x\)
the standard deviations of these distributions are equal for each value of
\(x\)
the residual errors are mutually independent
the data come from a random sample or a well-designed randomized experiment
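A rough visual check of these assumptions on the made-up data from the earlier
sketches, assuming Matplotlib is available; a residual plot with no pattern and
roughly constant spread is consistent with the linearity and
equal-standard-deviation assumptions:
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residuals vs. x")
plt.show()
\end{verbatim}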
12.5 Prediction
recommended:
31-50, 67-71
If we decide that the correlation coefficient is significant, we can make
predictions with the least-squares regression line.
Making predictions within the range of the observed values of \(x\) is called
interpolation; making predictions outside the observed values of \(x\) is
called extrapolation
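A small sketch distinguishing the two, reusing the fitted line from the earlier
sketches; the predict helper is a hypothetical convenience that warns when a
requested \(x\) falls outside the observed range:
\begin{verbatim}
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

def predict(x_new):
    """Predict y at x_new, warning when this is extrapolation."""
    if not (x.min() <= x_new <= x.max()):
        print(f"warning: x = {x_new} is outside "
              f"[{x.min()}, {x.max()}]: extrapolation")
    return a + b * x_new

print(predict(3.5))    # interpolation
print(predict(10.0))   # extrapolation: may be unreliable
\end{verbatim}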
12.6 Outliers
recommended:
51-56, 72-77
In linear regression, outliers are observed data points that are far from the
least-squares regression line. ``Far'' in this context means more than two
standard deviations from the least-squares regression line.
Influential points are observed data points that are far from the other
observed data points in the horizontal direction. These data points may have a
large effect on the slope of the regression line.
You can test whether a point is an influential point by removing it from the
data set and checking whether the slope of the least-squares regression line
changes significantly.
To check for potential outliers, we use the standard deviation of the
residuals, \(s=\sqrt{SSE/(n-2)}\), with \(n-2\) degrees of freedom.
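A sketch of this check, computing \(s\) and flagging residuals larger than
\(2s\); the data are made up, with the point at \(x=6\) deliberately placed off
the trend:
\begin{verbatim}
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.0, 6.1, 8.0, 9.9, 18.0, 14.1, 16.0])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # std dev of residuals
flagged = np.abs(residuals) > 2 * s            # more than 2s from line

for xi, yi, is_far in zip(x, y, flagged):
    if is_far:
        print(f"potential outlier: ({xi}, {yi})")
\end{verbatim}
Running this flags the point \((6.0, 18.0)\), which was placed off the trend on
purpose.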
If the potential outlier reflects an error in the data, then we can either
correct the error or remove the potential outlier. If the potential outlier
is correct, then it remains in the data.
Either way, a researcher should document the inquiry, their findings, and the
outcome, as part of the record.