You are now ready to apply this method to a possible exam question.
The number of hours students studied and their exam results are recorded in the table below.
Time studied in hours | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) |
Exam result | \(49\) | \(81\) | \(71\) | \(83\) | \(99\) |
a. Calculate \(S_{xy}\) and \(S_{xx}\).
b. Find the regression line of \(y\) on \(x\).
c. Plot the data points and the regression line on the same graph.
d. Interpret the meaning of \(a=10.2\) and \(b=46\) in the context of the question.
e. Predict the grade for a student who studies for
i) \(2.5\) hours
ii) \(8\) hours.
f. Comment on your answers for part e).
Solution
a. Using your calculator, you can easily find the following results,
\(\sum x=15\), \(\sum x^2=55\), \(\bar{x}=3\), \(\sum xy=1251\), \(\sum y=383\), \(\sum y^2=30693\), \(\bar{y}=76.6\).
Simply plug these results into the formulae detailed above to get the summary statistics.
\( \begin{align} S_{xx} &=\sum x^2 - \dfrac{(\sum x)^2}{n} \\&= 55 - \dfrac{15^2}{5} \\&= 10. \end{align}\)
\( \begin{align} S_{xy} &= \sum xy - \dfrac{\sum x \sum y}{n}\\&= 1251 - \dfrac{15 \times 383}{5} \\&= 102. \end{align}\)
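If you want to check these sums without a calculator, the working above can be reproduced in a few lines of Python. This is only a sketch: the variable names (`sum_x2`, `s_xx`, and so on) are chosen here for illustration, and the data are the five pairs from the table.

```python
# Data from the table: hours studied (x) and exam results (y).
x = [1, 2, 3, 4, 5]
y = [49, 81, 71, 83, 99]
n = len(x)

# The sums a calculator's statistics mode would report.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Summary statistics, using the same formulae as the working above.
s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

print(s_xx, s_xy)  # prints: 10.0 102.0
```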
b. Starting with \(a\), the gradient of the line,
\[a=\dfrac{S_{xy}}{S_{xx}}=\frac{102}{10}=10.2.\]
Then, the \(y\)-intercept is
\(b=\bar{y}-a\bar{x}=76.6-10.2 \times 3=46\).
Therefore, the regression line is \(y=10.2x+46\).
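The two formulae in part b can also be written as a small helper. The function name `regression_coefficients` is hypothetical, and it takes the summary statistics and means as inputs rather than the raw data. The result is rounded when printed because floating-point arithmetic may not give exactly \(46\).

```python
def regression_coefficients(s_xy, s_xx, x_bar, y_bar):
    """Return (a, b) for the regression line y = a*x + b."""
    a = s_xy / s_xx          # gradient
    b = y_bar - a * x_bar    # y-intercept
    return a, b


# Values from part a of the example.
a, b = regression_coefficients(s_xy=102, s_xx=10, x_bar=3, y_bar=76.6)
print(f"y = {a:.1f}x + {b:.1f}")  # prints: y = 10.2x + 46.0
```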
c. This is a great question for double-checking your working - it'll be pretty obvious if you've made any serious calculation errors!
Figure: Least squares regression line, example.
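Two quick checks can help when drawing the line by hand. Because \(b=\bar{y}-a\bar{x}\), the regression line always passes through the mean point \((\bar{x},\bar{y})\), and two points on the line are enough to draw it. The sketch below assumes the values \(a=10.2\) and \(b=46\) from part b; the variable names are illustrative.

```python
a, b = 10.2, 46  # gradient and intercept from part b
x = [1, 2, 3, 4, 5]
y = [49, 81, 71, 83, 99]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# The line should pass through the mean point (x_bar, y_bar).
mean_point_on_line = abs((a * x_bar + b) - y_bar) < 1e-9

# Two points on the line, at the smallest and largest x, to join with a ruler.
line_start = (x[0], round(a * x[0] + b, 1))
line_end = (x[-1], round(a * x[-1] + b, 1))

print(mean_point_on_line, line_start, line_end)
```

If the mean-point check fails, or the plotted line sits well away from the data points, there is likely an arithmetic slip in part a or b.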
d. Since \(a=10.2\), each extra hour of study increases the predicted exam result by \(10.2\) marks.
Since \(b=46\), a student who did not study at all would still, according to the regression line, be predicted to receive \(46\) marks.
e. Substitute each value of \(x\) into the regression line.
i) If \(x=2.5\), \(y=10.2\times 2.5+46=71.5\).
ii) If \(x=8\), \(y=10.2\times 8+46=127.6\).
f. The prediction in part i) is reasonable, since \(2.5\) hours lies within the range of the data (\(1\) to \(5\) hours). There is a fundamental problem with part ii), however: since the exams are graded in percentages, the grade \(127.6\) doesn't exist! The data contain no information about what happens to students' grades for any study time longer than 5 hours.
While you might guess that 100% would be a good prediction for any length of time above 5 hours, this is beyond the scope of the data and the linear regression model.
Keep in mind that a regression line should only be used to predict values that fall within the range of the data from which it was derived. This is called interpolation.
Making predictions outside this range is called extrapolation, and it is less reliable since the data may behave differently there.
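The distinction between interpolation and extrapolation can be made explicit with a simple range check. The sketch below is illustrative only: the function `predict` and its parameters are hypothetical names, with defaults taken from this example (the line \(y=10.2x+46\) and data recorded between \(1\) and \(5\) hours).

```python
def predict(hours, a=10.2, b=46, x_min=1, x_max=5):
    """Return (predicted result, True if hours is within the observed range)."""
    prediction = a * hours + b
    within_range = x_min <= hours <= x_max
    return prediction, within_range


for hours in (2.5, 8):
    result, within_range = predict(hours)
    label = "interpolation" if within_range else "extrapolation"
    print(f"{hours} hours: {result:.1f} ({label})")
# prints:
# 2.5 hours: 71.5 (interpolation)
# 8 hours: 127.6 (extrapolation)
```

The check does not make an extrapolated prediction any more accurate; it only flags that the value should be treated with caution.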
The most difficult thing in this topic is making sure you enter the correct numbers into your calculator! Make sure you double-check your calculations in the exam so you don't lose easy marks.