
Linear Regression Basics

Linear regression is a common and powerful method for modelling the relationship between input vectors and output scalars.

To be more specific, assume we have a data set $(\mathbf{x}_i, y_i)$, $i=1,\ldots,n$, where the $\mathbf{x}_i$'s are $p$-element vectors, $\mathbf{x}_i \in \mathbb{R}^p$. We now seek a function $F: \mathbb{R}^p \to \mathbb{R}$ such that

$$F(\mathbf{x}_i) \approx y_i \quad \text{for } i=1,\ldots,n.$$

But what is $F$ and what does $\approx$ mean? First, $F$ must be linear (which is the reason for the name linear regression). This means, by definition (a short numerical check follows the list):

  1. $F(\mathbf{u} + \mathbf{v}) = F(\mathbf{u}) + F(\mathbf{v})$ for all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^p$,
  2. $F(t \mathbf{u}) = t F(\mathbf{u})$ for all $t \in \mathbb{R}$ and $\mathbf{u} \in \mathbb{R}^p$.
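
These two properties are easy to check numerically. Below is a minimal NumPy sketch; the concrete map $F$ and the test vectors are made-up examples, not anything prescribed by the method:

```python
import numpy as np

# An example linear map F: R^3 -> R (the coefficients are arbitrary).
def F(u):
    return 3.0 * u[0] - 2.0 * u[1] + 0.5 * u[2]

u = np.array([1.0, 2.0, -1.0])
v = np.array([0.5, -3.0, 4.0])
t = 2.5

print(np.isclose(F(u + v), F(u) + F(v)))  # rule 1: additivity -> True
print(np.isclose(F(t * u), t * F(u)))     # rule 2: homogeneity -> True
```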

Using these two rules we get

$$\begin{aligned}
&F((u_1,u_2,\ldots,u_p)) \\
&= F((u_1,0,\ldots,0)) + F((0,u_2,\ldots,0)) + \ldots + F((0,0,\ldots,u_p)) \\
&= u_1 F((1,0,\ldots,0)) + u_2 F((0,1,\ldots,0)) + \ldots + u_p F((0,0,\ldots,1)) \\
&= u_1 f_1 + u_2 f_2 + \ldots + u_p f_p
\end{aligned}$$

so $F$ is uniquely determined by the vector $\mathbf{f} = (f_1, f_2, \ldots, f_p)$. The output of $F$ is always a linear combination of the entries of the input vector,

$$F(\mathbf{u}) = \mathbf{u}^T \mathbf{f}$$

(all vectors are treated as column vectors).
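
In NumPy terms, this means $\mathbf{f}$ can be recovered by evaluating $F$ on the standard basis vectors, after which $F$ is just a dot product. A small sketch, reusing the made-up example map from above:

```python
import numpy as np

# The same made-up linear map F: R^3 -> R as in the check above.
def F(u):
    return 3.0 * u[0] - 2.0 * u[1] + 0.5 * u[2]

p = 3
# f_j = F(e_j), where e_j is the j-th standard basis vector.
f = np.array([F(e) for e in np.eye(p)])
print(f)  # [ 3.  -2.   0.5]

# F(u) equals u^T f for any u.
u = np.array([1.0, 2.0, -1.0])
print(np.isclose(F(u), u @ f))  # True
```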

That was $F$, but what about $F(\mathbf{x}_i) \approx y_i$? It means that we would like each $|y_i - F(\mathbf{x}_i)|$ to be as small as possible. To be more precise, we wish to solve the following optimization problem:

$$\operatorname*{arg\,min}_{\mathbf{f} \in \mathbb{R}^p} \| \mathbf{y} - \mathbf{X}^T \mathbf{f} \|$$

with $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, $\mathbf{X} = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_n] \in \mathbb{R}^{p \times n}$, and some norm $\| \cdot \|$ to measure the magnitude of the error.

Any norm will do, but the most common choice is the Euclidean norm, also called the 2-norm, $\| (u_1, \ldots, u_n) \|_2 = \sqrt{u_1^2 + \ldots + u_n^2}$. This norm has the advantage that the solution can be computed exactly and in a fairly efficient way (see, e.g., Section 5.5 in Matrix Computations). A solution to this optimization problem is also computed by the NumPy function numpy.linalg.lstsq.
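
As an illustration, here is a sketch of how the problem could be solved with numpy.linalg.lstsq. Since $\mathbf{X} \in \mathbb{R}^{p \times n}$ holds the data points as columns, the matrix passed to lstsq is $\mathbf{X}^T$; the dimensions, noise level, and "true" coefficient vector are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                          # number of data points and features
f_true = np.array([3.0, -2.0, 0.5])    # made-up "true" coefficients

X = rng.normal(size=(p, n))            # column i is the data point x_i
y = X.T @ f_true + 0.1 * rng.normal(size=n)  # noisy observations y_i

# Least-squares solution of argmin_f || y - X^T f ||_2.
f_hat, residual, rank, sv = np.linalg.lstsq(X.T, y, rcond=None)
print(f_hat)  # should be close to f_true
```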

See the following post for examples of applying linear regression to some concrete problems.