Residual sum of squares linear regression
Let's continue with the example of trying to use a dog's adult weight to predict its height. You have used random sampling and done your best to make sure your sample is representative of the overall adult dog population. The information you have gathered is in the table below, where the weight is in pounds and the height is in inches.
Table 1 - Dog Weights (in pounds) and Heights (in inches)
Weight | Height | Weight | Height | Weight | Height |
\(10\) | \(10\) | \(75\) | \(23\) | \(12\) | \(12\) |
\(63\) | \(25\) | \(80\) | \(25\) | \(45\) | \(22\) |
\(60\) | \(23\) | \(20\) | \(15\) | \(50\) | \(18\) |
\(100\) | \(26\) | \(46\) | \(24\) | \(36\) | \(17\) |
\(6\) | \(12\) | \(62\) | \(23\) | \(95\) | \(27\) |
\(48\) | \(20\) | \(45\) | \(18\) | \(34\) | \(24\) |
\(40\) | \(19\) | \(32\) | \(17\) | \(57\) | \(21\) |
\(50\) | \(21\) | \(19\) | \(10\) | \(37\) | \(23\) |
The first thing to do is make a scatter plot.
Fig. 1 - Scatter plot of the data in the table of dog weights and heights.
Next, you would check for any unusual points in the data.
Unusual Data Points
Let's take a look at the kinds of unusual points you might see that would affect your linear regression analysis.
Outliers
Remember that an outlier is a data point that is an abnormal distance from other points in the sample. In other words, the response variable (in this case the height of the dog) does not follow the general trend of the other data. Who gets to decide which points are outliers? The person looking at the data, of course! In the scatter plot of the data above you can see that there don't appear to be any real outliers in the data.
High Leverage Points
What makes a data point of your sample a high leverage point?
A high leverage point is one whose \(x\)-value (here, the weight of the dog) is unusually far from the mean of the \(x\)-values.
A high leverage point can be either above or below the mean. Points like this can have a large effect on linear regression.
Influential Points
Influence is a way to measure how much impact an outlier or a high leverage point has on your regression model.
A point is considered to be influential if it unduly influences any part of your regression analysis, like the line of best fit.
While outliers and high leverage points could be influential points, they are not always influential points. In order to say if an outlier or a high leverage point is actually influential, you would need to remove it from the data set, recalculate the linear regression, and then see how much it changed. One way to check is to see how much the \(R^2\) value has changed.
For a reminder about the \(R^2\) value, see the articles Linear Regression and Residuals.
Residual sum of squares geometric interpretation
Once you have made a scatter plot of the data, you can check to see if it looks linear. In this case, it might be, but the question is how to draw the line. As you can see in the picture below, each of the three lines drawn looks like it might fit the data pretty well.
Fig. 2 - Scatter plot showing three potential lines through the data.
So what makes a line the "best" line? You want a line that is as close to as many data points in the sample as possible. For that, you need to look at the deviation, also called the residual. The residual of a data point is the vertical distance from the data point to the potential line of best fit, that is, the observed value minus the value the line predicts.
Fig. 3 - Scatter plot showing the deviation of two of the data points.
A negative residual means the point is below the line, and a positive residual means the point is above the line. If a point lies exactly on the line the residual would be zero. Because the residual could be positive or negative, it is common to look at the square of the residual so things don't get accidentally cancelled out.
Residual sum of squares definition
Let's look at the actual definition of the residual sum of squares. You will notice that it can be defined for any line \(y=a+bx\), not just for the line of best fit.
For \(n\) data points,
\[(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n),\]
one way to measure the fit of a line \(y=a+bx\) to bivariate data is the sum of squared residuals using the formula
\[\sum\limits_{i=1}^n (y_i - (a+bx_i))^2.\]
The goal is to make the sum of squared residuals as small as possible.
For an explanation of why the residual sum of squares is the best way to go about things, see the article Minimising the Sum of Squares Residual.
You might see the residual at point \((x_i,y_i)\) written as \(\epsilon_i\).
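As a minimal sketch of the definition above, the sum of squared residuals can be computed for any candidate line, not just the line of best fit. The function name `residual_sum_of_squares` and the three-point sample are made up for illustration and are not part of the dog data.

```python
def residual_sum_of_squares(a, b, xs, ys):
    """Sum of squared residuals for the candidate line y = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)  # positive when the point is above the line
        total += residual ** 2
    return total


# Hypothetical three-point sample, chosen only to keep the arithmetic simple.
xs = [1, 2, 3]
ys = [2, 3, 5]

# Two candidate lines: y = 0 + 1.5x and y = 1 + 1x.
rss_first = residual_sum_of_squares(0.0, 1.5, xs, ys)
rss_second = residual_sum_of_squares(1.0, 1.0, xs, ys)
print(rss_first, rss_second)
```

For this sample the first line has the smaller sum (0.5 compared with 1.0), so by this measure it is the better fit of the two.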
Formula for residual sum of squares
Now you can define the line of best fit, also known as the least-squares regression line.
The least-squares regression line is the line that minimises the sum of squared deviations to the sample data.
You still need a way to find the least-squares regression line! Thankfully other people have done all the math to find the slope and intercept of the line. The notation in the formulas is:
- \(n\) is the number of sample points;
- \(\bar{x}\) is the average of the \(x_i\) values; and
- \(\bar{y}\) is the average of the \(y_i\) values.
The slope of the least-squares regression line is
\[ b = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{ \sum\limits_{i=1}^n(x_i - \bar{x})^2 } = \frac{S_{xy}}{S_{xx}} ,\]
the \(y\)-intercept is
\[ a = \bar{y} - b\bar{x},\]
and the equation of the least-squares regression line is
\[ \hat{y} = a+bx,\]
where \(\hat{y}\) is the predicted value that results from substituting a given \(x\) into the equation.
\(S_{xx}\) and \(S_{xy}\) are called summary statistics, and their formulas may show up depending on what learning tools you are using.
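The slope and intercept formulas translate almost line for line into code. The sketch below is one possible way to write them; the function name `least_squares_line` and the four-point sample are assumptions made for illustration only.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Summary statistics S_xy and S_xx, as in the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b


# Hypothetical sample: x_bar = 2.5, y_bar = 4.75, S_xy = 9.5, S_xx = 5.
a, b = least_squares_line([1, 2, 3, 4], [2, 4, 5, 8])
print(f"y_hat = {a:.2f} + {b:.2f}x")
```

Here the slope works out to \(9.5/5 = 1.9\) and the intercept to \(4.75 - 1.9(2.5) = 0\). Note that the function divides by \(S_{xx}\), so it assumes the \(x_i\) values are not all the same.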
Let's look at an example.
Going back to the table with the dog weights and heights, the dependent variable is the height (these would be the \(y_i\) values), and the independent variable is the weight (these would be the \(x_i\) values). There are \(24\) data points in the table, so \(n=24\). You can calculate
- \( \bar{x} = 46.75\) and
- \(\bar{y} = 19.79\),
rounded to two decimal places. Generally, you will use a spreadsheet or calculator to find the values of \(b\) and \(a\), especially when there are lots of data points! Here
- \( a =11.69\) and
- \(b = 0.17\),
where both have been rounded to two decimal places. So the equation of the least-squares regression line is
\[ \hat{y} = 11.69 + 0.17x.\]
Fig. 4 - Scatter plot with the line of best fit, also known as the least-squares regression line.
Now that you have a formula for the line, you can find the residual sum of squares for this line. Using the formula,
\[\sum\limits_{i=1}^{24} (y_i - (a+bx_i))^2 \approx 160.58.\]
In fact, the \(R^2\) value, also known as the coefficient of determination, is about \(R^2 = 0.73\), or \(73\%\).
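If you would rather check these numbers with a short script than a spreadsheet, the sketch below repeats the calculation on the data from Table 1. It assumes \(R^2\) is computed as \(1 - \text{RSS}/S_{yy}\), the usual definition for a least-squares line, which is not spelled out in this article; the variable names are illustrative.

```python
# Table 1 as (weight in pounds, height in inches) pairs.
dogs = [
    (10, 10), (75, 23), (12, 12),
    (63, 25), (80, 25), (45, 22),
    (60, 23), (20, 15), (50, 18),
    (100, 26), (46, 24), (36, 17),
    (6, 12), (62, 23), (95, 27),
    (48, 20), (45, 18), (34, 24),
    (40, 19), (32, 17), (57, 21),
    (50, 21), (19, 10), (37, 23),
]

n = len(dogs)
x_bar = sum(x for x, _ in dogs) / n
y_bar = sum(y for _, y in dogs) / n

s_xx = sum((x - x_bar) ** 2 for x, _ in dogs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in dogs)
s_yy = sum((y - y_bar) ** 2 for _, y in dogs)

b = s_xy / s_xx        # slope
a = y_bar - b * x_bar  # y-intercept

# Residual sum of squares for the fitted line.
rss = sum((y - (a + b * x)) ** 2 for x, y in dogs)

# Assumed definition: R^2 = 1 - RSS / S_yy.
r_squared = 1 - rss / s_yy

print(f"n = {n}, x_bar = {x_bar:.2f}, y_bar = {y_bar:.2f}")
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"RSS = {rss:.2f}, R^2 = {r_squared:.2f}")
```

Run as written, this should reproduce the rounded values quoted above: \(a = 11.69\), \(b = 0.17\), a residual sum of squares of about \(160.58\), and \(R^2 \approx 0.73\).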
Now let's look for influential points.
Going back to the table of data, if you look at the squared deviation for each point in the sample, a few of them contribute quite a bit more than the others to the residual sum of squares. One of these is \( (37, 23)\), with a squared deviation of almost \(24\), while most of the other sample points contribute less than \(12\). A large residual like this suggests that \( (37, 23)\) may be an outlier (it is not a high leverage point, since a weight of \(37\) pounds is close to the mean weight), but you still need to show whether or not it is an influential point.
It might be the case that \( (37, 23)\) is an influential point. If you remove that point from the sample and then calculate the new \(R^2\) value, you get about \(0.77\), or \(77\%\), with a least-squares regression line of
\[\hat{y} = 11.31 + 0.18x,\]
and a residual sum of squares of about \(135.36\).
Remember that the coefficient of determination, \(R^2\), is a measure of the variability in \(y\) that can be explained by a linear relationship between \(x\) and \(y\). The closer to \(1\) that \(R^2\) is, the closer to linear your sample data is. So by removing one point from the data set, you have changed the \(R^2\) value from \(73\%\) to \(77\%\), which is a big change! That means the data point \( (37, 23)\) is in fact an influential point.
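The remove-and-refit check described above can be sketched as follows. The helper name `fit_summary` is hypothetical, and \(R^2\) is again assumed to be \(1 - \text{RSS}/S_{yy}\), which this article does not define explicitly.

```python
def fit_summary(points):
    """Return (a, b, rss, r_squared) for the least-squares line through points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    rss = sum((y - (a + b * x)) ** 2 for x, y in points)
    r_squared = 1 - rss / s_yy  # assumed definition of R^2
    return a, b, rss, r_squared


# Table 1 as (weight in pounds, height in inches) pairs.
dogs = [
    (10, 10), (75, 23), (12, 12), (63, 25), (80, 25), (45, 22),
    (60, 23), (20, 15), (50, 18), (100, 26), (46, 24), (36, 17),
    (6, 12), (62, 23), (95, 27), (48, 20), (45, 18), (34, 24),
    (40, 19), (32, 17), (57, 21), (50, 21), (19, 10), (37, 23),
]

# Refit without the point under suspicion and compare the two summaries.
suspect = (37, 23)
without_suspect = [p for p in dogs if p != suspect]

full = fit_summary(dogs)
reduced = fit_summary(without_suspect)

for label, (a, b, rss, r2) in (("all points", full), ("without (37, 23)", reduced)):
    print(f"{label}: y_hat = {a:.2f} + {b:.2f}x, RSS = {rss:.2f}, R^2 = {r2:.2f}")
```

The same loop could be run with any other point as `suspect`, which is a simple way to compare how much each one moves the line and the \(R^2\) value.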
Remember that variability can be decreased by increasing the sample size. See Unbiased Point Estimates for more information.
Once you have the least-squares regression line, what can you do with it?
Examples of residual sum of squares
There are a couple of important things to consider when using the least-squares regression line to make a prediction.
- The least-squares regression line is a predictor of the population, not an individual.
- Using the least-squares regression line to make a prediction for a value outside the range of the collected data might not work very well.
Let's look at an example of the kinds of problems that can occur when these considerations are ignored.
Fig. 5 - Bulldogs are an example of why you can't necessarily make a prediction about an individual from a least-squares regression line.
Going back to the dog weight/height information, and using the least-squares regression line
\[\hat{y} = 11.31 + 0.18x,\]
what can you predict about the height of a bulldog that weighs \(65\) pounds?
Answer:
Simply plugging in the weight of the bulldog, you get
\[\hat{y} = 11.31 + 0.18(65) = 23.01,\]
so the least-squares regression line predicts that the bulldog would be \(23.01\) inches tall. However, a bulldog of this weight will actually be about \(15\) inches tall, which is quite a difference! This is an example of why you can use the least-squares regression line to make a prediction about dogs in general (i.e. the population of dogs) and not about specific dogs.
What about a dog that has a weight of more than \(100\) pounds?
Fig. 6 - Bull mastiff dogs are definitely one to a kid sized wading pool!
A male bull mastiff dog can easily weigh \(130\) pounds. This is outside the range of the data collected in the table. When you use the least-squares regression line to make a prediction, you find that a bull mastiff dog should be
\[\hat{y} = 11.31 + 0.18(130) = 34.71\, \text{in},\]
tall. However, in general this dog won't be more than \(27\) inches tall, which is considerably less than what the least-squares regression line predicts! That is because the weight of the dog is quite far outside the range of the data collected, so the least-squares regression line isn't a very good predictor.
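One way to guard against the second problem is to have the prediction code flag any weight outside the range of the sample. The sketch below uses the rounded coefficients from the line above and the smallest and largest weights in Table 1 (\(6\) and \(100\) pounds); the function name `predict_height` is made up for illustration.

```python
# Rounded coefficients of the least-squares regression line used above.
A, B = 11.31, 0.18

# Smallest and largest weights (in pounds) in Table 1.
MIN_WEIGHT, MAX_WEIGHT = 6, 100


def predict_height(weight):
    """Return (predicted height in inches, True if weight is outside the sample range)."""
    height = A + B * weight
    extrapolating = not (MIN_WEIGHT <= weight <= MAX_WEIGHT)
    return height, extrapolating


for weight in (65, 130):
    height, extrapolating = predict_height(weight)
    note = " (outside the sample range, treat with caution)" if extrapolating else ""
    print(f"{weight} lb -> {height:.2f} in{note}")
```

A check like this only catches extrapolation. It cannot tell you that a \(65\)-pound bulldog is shorter than the prediction, because that weight is well inside the sample range; the line still describes dogs in general rather than any one dog.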
Residual Sum of Squares - Key takeaways
- The residual of a data point is how far away the data point is from the potential line of best fit. The residual, also called the deviation, can be positive or negative.
- For \(n\) data points,
\[(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n),\]
one way to measure the fit of a line \(y=a+bx\) to bivariate data is the residual sum of squares using the formula
\[\sum\limits_{i=1}^n (y_i - (a+bx_i))^2.\]
- The least-squares regression line is the line that minimises the residual sum of squares.
- The slope of the least-squares regression line is
\[ \begin{aligned} b &=\frac{S_{xy}}{S_{xx}} \\ & = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{ \sum\limits_{i=1}^n(x_i - \bar{x})^2 }, \end{aligned}\]
the \(y\)-intercept is
\[ a = \bar{y} - b\bar{x},\]
and the equation of the least-squares regression line is
\[ \hat{y} = a+bx,\]
where \(\hat{y}\) is the predicted value that results from substituting a given \(x\) into the equation.