
Linear Regression Basics

Linear regression is a common and powerful method for modelling the relationship between some input vectors and some output scalars.

To be more specific, assume we have a data set $(\mathbf{x}_i, y_i)$, $i=1,\ldots,n$, where the $\mathbf{x}_i$'s are $p$-element vectors, $\mathbf{x}_i \in \mathbb{R}^p$. We now seek a function $F: \mathbb{R}^p \mapsto \mathbb{R}$ such that

$$F(\mathbf{x}_i) \approx y_i \quad \text{for $i=1,\ldots,n$.}$$

But what is $F$ and what does $\approx$ mean? First, $F$ must be linear (which is the reason for the name linear regression). This means, by definition:

  1. $F(\mathbf{u} + \mathbf{v}) = F(\mathbf{u}) + F(\mathbf{v})$ for all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^p$,
  2. $F(t \mathbf{u}) = t F(\mathbf{u})$ for all $t \in \mathbb{R}$ and $\mathbf{u} \in \mathbb{R}^p$.

Using these two rules we get

$$\begin{aligned} &F((u_1,u_2,\ldots,u_p)) \\ &= F((u_1,0,\ldots,0)) + F((0,u_2,\ldots,0)) + \ldots + F((0,0,\ldots,u_p)) \\ &= u_1 F((1,0,\ldots,0)) + u_2 F((0,1,\ldots,0)) + \ldots + u_p F((0,0,\ldots,1)) \\ &= u_1 f_1 + u_2 f_2 + \ldots + u_p f_p \end{aligned}$$

so $F$ is uniquely determined by the vector $\mathbf{f} = (f_1, f_2, \ldots, f_p)$, where $f_j$ is the value of $F$ at the $j$th standard basis vector. The output of $F$ will always be a linear combination of the coefficients of the input vector,

$$F(\mathbf{u}) = \mathbf{u}^T \mathbf{f}$$

(all vectors are treated as column vectors).
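
As a small illustration, here is a minimal NumPy sketch (the coefficient vector `f` below is just made up for the example) showing that evaluating a linear $F$ is nothing more than a dot product, and that the entries of $\mathbf{f}$ are recovered as the values of $F$ at the standard basis vectors:

```python
import numpy as np

# A linear F: R^p -> R is fully described by its coefficient vector f.
# Here p = 3 and f is an arbitrary example, f = (f_1, f_2, f_3).
f = np.array([2.0, -1.0, 0.5])

def F(u):
    """Evaluate F(u) = u^T f."""
    return u @ f

u = np.array([3.0, 4.0, 2.0])
v = np.array([1.0, 0.0, -2.0])

# The two defining properties of linearity hold (up to rounding):
assert np.isclose(F(u + v), F(u) + F(v))
assert np.isclose(F(2.5 * u), 2.5 * F(u))

# f_j is recovered as the value of F at the j-th standard basis vector:
assert np.allclose([F(e) for e in np.eye(3)], f)
```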

That was $F$, but what about $F(\mathbf{x}_i) \approx y_i$? It means that we would like each $|y_i - F(\mathbf{x}_i)|$ to be as small as possible. To be more precise, we wish to solve the following optimization problem:

$$\argmin_{\mathbf{f} \in \mathbb{R}^p} \| \mathbf{y} - \mathbf{X}^T \mathbf{f} \|$$

with $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, $\mathbf{X} = [ \mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_n ] \in \mathbb{R}^{p \times n}$ and some norm $\| \cdot \|$ to measure the magnitude of the error.

Any norm will do, but the most common choice is the Euclidean norm, also called the 2-norm, $\| (u_1, \ldots, u_n) \|_2 = \sqrt{u_1^2 + \ldots + u_n^2}$. This norm has the advantage that the solution can be computed exactly and in a fairly efficient way (see, e.g., Section 5.5 of Matrix Computations). A solution to this optimization problem is also computed by the NumPy function numpy.linalg.lstsq.
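
For the 2-norm, the minimizer can be characterized explicitly: writing the objective as $\| \mathbf{y} - \mathbf{X}^T \mathbf{f} \|_2^2$ and setting its gradient with respect to $\mathbf{f}$ to zero gives the normal equations

$$\mathbf{X} \mathbf{X}^T \mathbf{f} = \mathbf{X} \mathbf{y},$$

which determine $\mathbf{f}$ uniquely when $\mathbf{X}$ has full row rank. The following sketch, using made-up data, shows both routes; numpy.linalg.lstsq is generally preferable numerically to forming the normal equations explicitly:

```python
import numpy as np

# Made-up example data: n = 5 observations of p = 2 features.
# The columns of X are the input vectors x_i, as in the text (X is p-by-n).
X = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [1.0, 1.0, 2.0, 3.0, 5.0]])
y = np.array([2.1, 2.9, 4.2, 5.1, 6.8])

# Least-squares fit: minimize ||y - X^T f||_2.
# numpy.linalg.lstsq expects one row per observation, hence the transpose.
f, residual, rank, sv = np.linalg.lstsq(X.T, y, rcond=None)

# For comparison, solve the normal equations X X^T f = X y directly
# (fine for this tiny example, but less numerically robust in general).
f_normal = np.linalg.solve(X @ X.T, X @ y)

print(f)            # fitted coefficient vector
print(f_normal)     # agrees with f up to rounding
```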

See the following post for examples of how to apply linear regression to some concrete problems.

Commenting is not possible for this post, but feel free to leave a question, correction or any other comment via the contact page.