In a regression analysis, you examine how certain variables (explanatory variables) relate to a variable of interest (the dependent variable). The model rarely predicts every observation exactly, and the gap between what it predicts and what you actually observe is described by a concept called residuals. Let’s take a look at residuals in this lesson.
Residuals in Math
For instance, suppose you want to find out how climate change affects the yield from a farm. You may specify climate variables in the model such as rainfall and temperature. However, other factors such as the size of the land cultivated and fertilizer use, among others, also affect farm yield. Hence, the question becomes, “is the model accurately predicting the level of yield considering climate change as an explanatory variable?”. So how do you measure how far off the prediction is? Let's look at a short and informal definition of a residual.
For any observation, the residual of that observation is the difference between the observed value and the predicted value.
You can use the size of the residual to judge how good your prediction model is. In other words, the value of the residual tells you how far the prediction is from the actual value.
In mathematics, the term residual value is used both for assets and in statistics (in regression analysis, as discussed above). For an asset, the residual value is what the asset is worth after a specified period of use.
For instance, the residual value of a factory machine rented out for \(10\) years is how much the machine will be worth after those \(10\) years. This is also referred to as the salvage value or scrap value of the asset, that is, how much the asset is worth after its lease term or useful lifespan.
Returning to regression, you can formally define residuals as below.
Definition of Residual
The residual is the vertical distance between the observed point and the predicted point in a linear regression model. A residual is often called the error term in a regression model, though it is not a mistake; it is simply the difference between the two values. Here is the more formal definition of a residual in terms of a regression line.
The difference between the actual value of a dependent variable and its associated predicted value from a regression line (trendline) is called the residual. It measures how accurately the model, with its explanatory variables, predicts that observation.
Mathematically, you can find the residual by subtracting the estimated value of the dependent variable \((\hat{y})\) from the actual value given in a dataset \((y)\).
For a reminder about regression lines and how to use them, see the articles Linear Correlation, Linear Regression and Least-Squares Regression.
The residual is represented by \(\varepsilon \). That is,
\[\varepsilon =y-\hat{y}.\]
The predicted value \((\hat{y})\) is obtained by substituting the \(x\) values into the least-squares regression line.
Residuals for data points
In the above graph, the vertical gap between a data point and the trendline is referred to as the residual. The position of the data point relative to the trendline determines whether the residual is positive or negative. All points above the trendline have a positive residual and all points below the trendline have a negative residual.
Residual in Linear Regression
For simplicity's sake, let's look at residuals for bivariate data. In linear regression, you include the residual term to account for the gap between the regression line and the observed data points. In simple terms, the residual takes care of all other factors that may influence the dependent variable other than those the model states.
Residuals are one way to check the regression coefficients or other values in linear regression. If the residual plot shows unwanted patterns, then some of the estimated coefficients cannot be trusted.
You should make the following assumptions about the residuals for any regression model:
Assumptions of Residuals
- They have to be independent: the residual at one point does not influence the residual at the next point.
- Constant variance is assumed for all residuals.
- The mean value of all residuals for a model should equal \(0\).
- Residuals should follow a normal distribution: plotting them on a normal probability plot gives an approximately straight line if they are normally distributed.
Residual Equation in Math
Given the linear regression model which includes the residual for estimation, you can write:
\[y=a+bx+\varepsilon ,\]
where \(y\) is the response variable (dependent variable), \(a\) is the intercept, \(b\) is the slope of the line, \(x\) is the explanatory variable (independent variable) and \(\varepsilon\) is the residual.
Hence, the predicted value of \(y\) will be:
\[\hat{y} = a+bx .\]
Then using the definition, the residual equation for the linear regression model is
\[\varepsilon =y-\hat{y}\]
where \(\varepsilon\) represents the residual, \(y\) is the actual value and \(\hat{y}\) is the predicted value of \(y\).
For \(n\) observations of data, you can represent predicted values as,
\[ \begin{align}\hat{y}_1&=a+bx_1 \\ \hat{y}_2&=a+bx_2 \\ &\vdots \\ \hat{y}_n&=a+bx_n\\\end{align}\]
And with these \(n\) predicted quantities residuals can be written as,
\[ \begin{align}\varepsilon _1&=y_1-\hat{y}_1 \\ \varepsilon _2&=y_2-\hat{y}_2 \\ &\vdots \\ \varepsilon _n&=y_n-\hat{y}_n \\ \end{align}\]
This equation for residuals will be helpful in finding residuals from any given data. Note that the order of subtraction is important when finding residuals. It is always the predicted value subtracted from the actual value. That is,
residual = actual value – predicted value.
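The equations above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `predict` and `residuals`, the line \(\hat{y}=1+2x\) and the three observations are all made up for the example.

```python
def predict(a, b, x):
    """Predicted value y-hat = a + b*x for intercept a and slope b."""
    return a + b * x


def residuals(a, b, xs, ys):
    """Residuals e_i = y_i - y-hat_i (actual minus predicted), one per observation."""
    return [y - predict(a, b, x) for x, y in zip(xs, ys)]


# Hypothetical line y-hat = 1 + 2x and three made-up observations.
xs = [0, 1, 2]
ys = [2, 3, 4]
res = residuals(1, 2, xs, ys)
print(res)  # [1, 0, -1]
```

The first point lies above the line (positive residual), the second lies on it, and the third lies below it (negative residual).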
How to Find Residuals in Math
As you have seen, residuals measure the error of a prediction. Thus, you want to find out how far your predictions from the trendline are from the actual figures. To find the residual of a data point:
First, know the actual values of the variable under consideration. They may be presented in a table format.
Secondly, identify the regression model to be estimated. Find the trendline.
Next, using the trendline equation and the value of the explanatory variable, find the predicted value of the dependent variable.
Finally, subtract the predicted value from the actual value.
This means that if you have more than one data point, for example \(10\) observations for two variables, you will be finding the residual for each of the \(10\) observations. That is \(10\) residuals.
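When the trendline is a least-squares regression line with an intercept, the positive and negative residuals balance out, so all the residuals add up to \(0\). A sum of \(0\) on its own therefore does not show that the model is a good predictor; you also need the individual residuals to be small.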
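The sum of the residuals can be checked numerically. The sketch below fits a least-squares line by hand, using the standard formulas for the slope and intercept, and then adds up the residuals. The helper name `least_squares_line` and the data are assumptions made for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    # Intercept: the line passes through the point of means (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


# Made-up data, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
res = [y - (a + b * x) for x, y in zip(xs, ys)]

# The individual residuals are not zero, but their sum is (up to rounding).
print([round(e, 2) for e in res])
print(abs(sum(res)) < 1e-9)  # True
```

Because a least-squares line with an intercept has this property for any data, the sum says little about fit quality; the sizes of the individual residuals are what matter.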
You can understand it more clearly by taking a look at an example.
A production plant produces varying numbers of pencils per hour. Total output is given by
\[y=50+0.6x ,\]
where \(x\) is the input used to produce pencils and \(y\) is the total output level.
Find the residuals of the equation for the following number of pencils produced per hour:
\(x\) | \(500\) | \(550\) | \(455\) | \(520\) | \(535\) |
\(y\) | \(400\) | \(390\) | \(350\) | \(355\) | \(371\) |
Table 1. Input and output data for the example.
Solution:
Given the values in the table and the equation \(y=50+0.6x\), you can proceed to find the estimated values by substituting the \(x\) values into the equation to find the corresponding estimated value of \(y\).
\(x\) | \(y\) | \(\hat{y}=50+0.6x\) | \(\varepsilon =y-\hat{y}\) |
\(500\) | \(400\) | \(350\) | \(50\) |
\(550\) | \(390\) | \(380\) | \(10\) |
\(455\) | \(350\) | \(323\) | \(27\) |
\(520\) | \(355\) | \(362\) | \(-7\) |
\(535\) | \(371\) | \(371\) | \(0\) |
Table 2. Estimated values.
The results for \(\varepsilon =y-\hat{y}\) show you that the trendline under-predicted the \(y\) values for \(3\) observations (positive values) and over-predicted for one observation (negative value). One observation was predicted exactly (residual = \(0\)), so that point lies on the trendline.
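The same calculation can be reproduced in Python as a quick check. This is a sketch that uses the \(x\) and \(y\) values from Table 1 and the given trendline; the variable names are chosen for this illustration only.

```python
# Data from Table 1: input x and actual output y for the pencil plant.
xs = [500, 550, 455, 520, 535]
ys = [400, 390, 350, 355, 371]

# Predicted output from the given trendline y-hat = 50 + 0.6x.
predicted = [50 + 0.6 * x for x in xs]

# Residual = actual - predicted, rounded to hide floating-point noise.
res = [round(y - p, 2) for y, p in zip(ys, predicted)]

for x, y, p, e in zip(xs, ys, predicted, res):
    print(f"x={x}  y={y}  y_hat={p:.0f}  residual={e:.0f}")

under = sum(1 for e in res if e > 0)   # trendline under-predicted
over = sum(1 for e in res if e < 0)    # trendline over-predicted
exact = sum(1 for e in res if e == 0)  # point lies on the trendline
print(under, over, exact)  # 3 1 1
```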
You can see below how to plot the residuals in the graph.
Residual Plot
The residual plot shows, in the form of a scatter plot, how far the data points are from the trendline. It is obtained by plotting the computed residual values against the independent variable. The plot helps you visualize how well the trendline fits the given data set.
Fig. 1. Residuals without any pattern.
The desirable residual plot is the one which shows no pattern and the points are scattered at random. You can see from the above graph, that there is no specific pattern between points, and all the data points are scattered.
Small residuals indicate that the trendline fits the data points well, while large residuals suggest the line is not the best one for the data points. When the residual is \(0\) for an observed value, it means that the data point is precisely on the line of best fit.
A residual plot can be a good way to identify potential problems in the regression model. It can make the relationship between two variables much easier to see. Points far above or below the horizontal line in a residual plot show errors or unusual behavior in the data. Some of these points are called outliers with respect to the linear regression line.
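As a rough numerical counterpart to spotting points far from the horizontal line, the sketch below flags unusually large residuals. The cut-off of twice the standard deviation of the residuals is an assumed rule of thumb chosen for this illustration, not a rule from this lesson, and the residual values are made up.

```python
import statistics

# Made-up residuals for illustration; the value 6.0 is deliberately far from zero.
res = [0.5, -0.3, 0.2, -0.4, 6.0, 0.1, -0.2]

# Assumed rule of thumb: flag residuals larger in size than twice the
# standard deviation of all residuals. Other cut-offs are equally possible.
threshold = 2 * statistics.pstdev(res)

# Keep the position and value of each flagged residual.
flagged = [(i, e) for i, e in enumerate(res) if abs(e) > threshold]
print(flagged)  # [(4, 6.0)]
```

A flagged point is only a candidate outlier; whether it is an error in the data or a genuine observation still has to be judged in context.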
Note that the regression line might not be valid outside the range of \(x\) values used to find it, as it might give poor predictions there.
Considering the same example used above, you can plot the residual values below.
Using the results from the pencil production example for the residual plot, you can tell that the points lie fairly close to the line of best fit. Hence, you can see that the line \(\hat{y}=50+0.6x\) is a good fit for the data.
Fig. 2. Residual plot.
Below, you can see how to work out residual problems for different scenarios.
Residual Examples in Math
You can understand how to calculate residuals more clearly by following the residual examples here.
A shop attendant earns \(\$800.00\) per month. Assume the consumption function for this shop attendant is given by \(y=275+0.2x\), where \(y\) is consumption and \(x\) is income. Assuming further that the shop attendant spends \(\$650\) monthly, determine the residual.
Solution:
First, you have to find the estimated or predicted value of \(y\) using the model \(y=275+0.2x\).
Hence, \[\hat{y}=275+0.2(800) =\$435.\]
Given \(\varepsilon =y-\hat{y}\), you can compute the residual as:
\[\varepsilon =\$650-\$435 =\$215 .\]
Therefore, the residual equals \(\$215\). This means the model predicted that the shop attendant spends less (that is, \(\$435\)) than they actually spend (that is, \(\$650\)).
Consider another example to find the predicted value and residual for the given data.
A production function for a factory follows the function \(y=275+0.75x\), where \(y\) is the output level and \(x\) is the material used in kilograms. Assuming the firm uses \(1000\, kg\) of input, find the residual of the production function.
Solution:
The firm uses \(1000\, kg\) of input, and in this example the actual output \(y\) is also taken to be \(1000\, kg\). You want to find the estimated output level. So
\[ \begin{align}\hat{y}&=275+0.75x \\ &=275+0.75(1000) \\ &=1025 . \\ \end{align}\]
Then you can estimate the residual or error of prediction:
\[ \begin{align}\varepsilon &=y-\hat{y} \\ &=1000-1025 \\ &=-25\, kg .\\ \end{align}\]
Therefore, the predicted output level is larger than the actual level of \(1000\, kg\) by \(25\, kg\).
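Both worked examples follow the same two steps: predict, then subtract. A minimal Python sketch of those steps is shown here; the helper name `residual_at` is hypothetical, and the second call uses \(1000\) as the actual value, as the solution above does.

```python
def residual_at(a, b, x, actual):
    """Return (predicted, residual) for the line y-hat = a + b*x.

    The residual is the actual value minus the predicted value.
    """
    predicted = a + b * x
    return predicted, actual - predicted


# Shop attendant: y-hat = 275 + 0.2x, income x = 800, actual spending 650.
print(residual_at(275, 0.2, 800, 650))     # (435.0, 215.0)

# Factory: y-hat = 275 + 0.75x, input x = 1000, actual value taken as 1000.
print(residual_at(275, 0.75, 1000, 1000))  # (1025.0, -25.0)
```

A positive residual means the model under-predicted and a negative residual means it over-predicted.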
The following example will show the plotting of residuals in the graph.
Sam collected data from the class on the time spent studying and the scores obtained on a test. Find the residuals for the linear regression model \(y=58.6+8.7x\). Also, plot the residuals on a graph.
Study time \((x)\) | \(0.5\) | \(1\) | \(1.5\) | \(2\) | \(2.5\) | \(3\) | \(3.5\) |
Test scores \((y)\) | \(63\) | \(67\) | \(72\) | \(76\) | \(80\) | \(85\) | \(89\) |
Table 3. Study time example.
Solution:
You can create a table with the above data and calculate predicted values by using \(y=58.6+8.7x\).
Study time \((x)\) | Test scores \((y)\) | Predicted values (\(\hat{y}=58.6+8.7x\)) | Residuals (\(\varepsilon =y-\hat{y}\)) |
\(0.5\) | \(63\) | \(62.95\) | \(0.05\) |
\(1\) | \(67\) | \(67.3\) | \(-0.3\) |
\(1.5\) | \(72\) | \(71.65\) | \(0.35\) |
\(2\) | \(76\) | \(76\) | \(0\) |
\(2.5\) | \(80\) | \(80.35\) | \(-0.35\) |
\(3\) | \(85\) | \(84.7\) | \(0.3\) |
\(3.5\) | \(89\) | \(89.05\) | \(-0.05\) |
Table 4. Example with study time, test scores, predicted values and residuals data.
Using all the residuals and \(x\) values, you can make the following residual plot.
Fig. 3. Residual plot for the given data
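The residuals in Table 4 can be recomputed with a short Python sketch. It uses the data from Table 3 and the given model, and prints a crude text stand-in for the residual plot, saying whether each point lies above, below or on the line. The variable names are chosen for this illustration only.

```python
# Data from Table 3: study time x and test score y.
xs = [0.5, 1, 1.5, 2, 2.5, 3, 3.5]
ys = [63, 67, 72, 76, 80, 85, 89]

# Residual = actual score - predicted score from y-hat = 58.6 + 8.7x,
# rounded to two decimals to hide floating-point noise.
res = [round(y - (58.6 + 8.7 * x), 2) for x, y in zip(xs, ys)]

# Crude text version of a residual plot: one row per x value.
for x, e in zip(xs, res):
    if e > 0:
        side = "above the line"
    elif e < 0:
        side = "below the line"
    else:
        side = "on the line"
    print(f"x = {x:<4} residual = {e:.2f}  ({side})")
```

The positive and negative residuals here are small and alternate in sign with no obvious trend, which is the kind of pattern-free scatter a residual plot should show.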
Residuals - Key takeaways
- The difference between the actual value of a dependent variable and its associated predicted value from a regression line (trendline) is called residual.
- All points above the trendline show a positive residual and points below the trendline indicate a negative residual.
- Residuals are one way to check the regression coefficients or other values in linear regression.
- The residual equation is \(\varepsilon =y-\hat{y}\).
- The predicted value of \(y\) will be \(\hat{y} = a+bx\) for linear regression \(y=a+bx+\varepsilon \).
- A residual plot can at times be good to identify potential problems in the regression model.