janmr blog

Simple Linear Regression

We now turn to the general case of fitting a line to a set of points in the plane. The two previous posts considered the special cases where the points' center of mass was at the origin and where the line had to pass through the origin.

Let the points be given as $(x_i, y_i)$ for $i = 1, \ldots, n$, where $n \geq 2$ is the number of points. We will furthermore require that the $x_i$ are not all equal.

Figure 1. A set of points in the plane.

Again, we will look for a least squares definition of best. We seek a line $y = a x + b$ that minimizes the following error function:

$$J = \sum_{i=1}^n (a x_i + b - y_i)^2.$$

Note this very important observation: Translating the points and the line by the same amount does not change the value of the error function. This means that we can translate the points so their center of mass is at the origin, compute the best fitting line for the translated points, and then translate the line and points back to the original position.
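The invariance can be checked numerically. The following is a minimal sketch, assuming made-up sample points, an arbitrary line, and an arbitrary translation; the helper name `squared_error` is chosen here for illustration only.

```python
def squared_error(a, b, points):
    """Error function J: sum of (a*x + b - y)^2 over all points."""
    return sum((a * x + b - y) ** 2 for x, y in points)


# Made-up data and an arbitrary line (not necessarily the best fit).
points = [(1.0, 2.0), (2.0, 3.0), (4.0, 3.5), (5.0, 6.0)]
a, b = 0.8, 1.0

# An arbitrary translation applied to both the points and the line.
dx, dy = -3.0, 7.5
moved_points = [(x + dx, y + dy) for x, y in points]

# Translating y = a*x + b by (dx, dy) gives y - dy = a*(x - dx) + b,
# so the slope is unchanged and the intercept becomes b + dy - a*dx.
moved_b = b + dy - a * dx

j_original = squared_error(a, b, points)
j_moved = squared_error(a, moved_b, moved_points)

# The two values agree up to floating-point rounding.
print(j_original, j_moved)
```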

So let us set

$$\tilde{x}_i = x_i - \bar{x} \quad \text{and} \quad \tilde{y}_i = y_i - \bar{y}$$

where $\bar{x} = \tfrac{1}{n} s_x$ and $\bar{y} = \tfrac{1}{n} s_y$ with $s_x = \sum_{i=1}^n x_i$ and $s_y = \sum_{i=1}^n y_i$.

We now have $\sum_{i=1}^n \tilde{x}_i = \sum_{i=1}^n \tilde{y}_i = 0$, so we can apply the results from the previous posts to find the best fitting line $\tilde{y} = \tilde{a} \tilde{x}$ for the translated points, and we get

$$\tilde{a} = \frac{\sum_{i=1}^n \tilde{x}_i \tilde{y}_i}{\sum_{i=1}^n \tilde{x}_i^2}.$$

Rewriting the equation for the line,

$$\tilde{y} = \tilde{a} \tilde{x} \quad \Leftrightarrow \quad y - \bar{y} = \tilde{a} (x - \bar{x}) \quad \Leftrightarrow \quad y = \tilde{a} x + \bar{y} - \tilde{a} \bar{x},$$

we see that the line we seek is given by $y = a x + b$ with

$$a = \frac{\displaystyle \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle \sum_{i=1}^n (x_i - \bar{x})^2} \; , \quad b = \bar{y} - a \bar{x} \; .$$
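As an illustration, the centered formulas for the slope and intercept translate fairly directly into code. This is only a sketch: the function name `fit_line_centered` and the sample data are assumptions made here, and the error handling simply mirrors the requirements that there are at least two points and that the x-values are not all equal.

```python
def fit_line_centered(points):
    """Least squares line y = a*x + b using the centered formulas."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")

    # Center of mass of the points.
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n

    # Numerator and denominator of the slope.
    num = sum((x - x_bar) * (y - y_bar) for x, y in points)
    den = sum((x - x_bar) ** 2 for x, _ in points)
    if den == 0:
        raise ValueError("the x-values must not all be equal")

    a = num / den
    b = y_bar - a * x_bar
    return a, b


# Made-up points lying exactly on y = 2x + 1.
a, b = fit_line_centered([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
print(a, b)  # prints 2.0 1.0
```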

We can simplify this expression for $a$. First, the numerator:

\begin{align*}
\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) &= \sum_{i=1}^n x_i y_i - \bar{y} \sum_{i=1}^n x_i - \bar{x} \sum_{i=1}^n y_i + n \bar{x} \bar{y} \\
&= \sum_{i=1}^n x_i y_i - \tfrac{1}{n} \sum_{i=1}^n x_i \sum_{i=1}^n y_i \\
&= s_{xy} - \tfrac{1}{n} s_x s_y \; ,
\end{align*}

where $s_{xy} = \sum_{i=1}^n x_i y_i$. Next, the denominator:

\begin{align*}
\sum_{i=1}^n (x_i - \bar{x})^2 &= \sum_{i=1}^n x_i^2 - 2 \bar{x} \sum_{i=1}^n x_i + n \bar{x}^2 \\
&= s_{xx} - \tfrac{2}{n} s_x^2 + \tfrac{1}{n} s_x^2 \\
&= s_{xx} - \tfrac{1}{n} s_x^2 \; ,
\end{align*}

where $s_{xx} = \sum_{i=1}^n x_i^2$. Putting it all together, we have the following expressions for $a$ and $b$ (equivalent to the ones above):

$$a = \frac{n s_{xy} - s_x s_y}{n s_{xx} - s_x^2} \; , \quad b = (s_y - a s_x) / n \; .$$
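One practical feature of the sum-based expressions is that all four sums can be accumulated in a single pass over the data. The sketch below assumes a hypothetical function name `fit_line_sums` and made-up points. Note that for data with large values, the subtractions in the numerator and denominator can lose precision in floating point, in which case the centered form may behave better.

```python
def fit_line_sums(points):
    """Least squares line y = a*x + b from the sums s_x, s_y, s_xx, s_xy."""
    n = 0
    s_x = s_y = s_xx = s_xy = 0.0

    # Accumulate all sums in one pass over the points.
    for x, y in points:
        n += 1
        s_x += x
        s_y += y
        s_xx += x * x
        s_xy += x * y

    if n < 2:
        raise ValueError("need at least two points")
    den = n * s_xx - s_x ** 2
    if den == 0:
        raise ValueError("the x-values must not all be equal")

    a = (n * s_xy - s_x * s_y) / den
    b = (s_y - a * s_x) / n
    return a, b


# Made-up points that do not lie on a single line.
a, b = fit_line_sums([(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0)])
print(a, b)
```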

Note how, by construction, the line always passes through the center of mass of the points $(\bar{x}, \bar{y})$.
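To spell this out, one can substitute $x = \bar{x}$ into the fitted line and use $b = \bar{y} - a \bar{x}$:

$$a \bar{x} + b = a \bar{x} + (\bar{y} - a \bar{x}) = \bar{y}.$$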

Figure 2. The line that best fits a set of points (the center of mass is shown in green).