In [1]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';

Announcements¶

  • Remember our in-class test will be on Friday, March 31
  • A sample test has been posted online in the Piazza resources tab

Recap from last lecture¶

To describe the orbit of Ceres, Gauss had to construct the equation for its ellipse:

$$a_1x_1^2 + a_2x_2^2 + a_3x_1x_2 + a_4x_1 + a_5x_2 + a_6 = 0.$$

He had many measurements of $(x_1, x_2)$ pairs and had to find the $a_1, \dots, a_6$. This is actually a linear system:

$$\begin{bmatrix} x_{11}^2 & x_{21}^2 & x_{11}x_{21} & x_{11} & x_{21} & 1 \\ x_{12}^2 & x_{22}^2 & x_{12}x_{22} & x_{12} & x_{22} & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_{1n}^2 & x_{2n}^2 & x_{1n}x_{2n} & x_{1n} & x_{2n} & 1 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{bmatrix} = 0$$

According to Newton, this is a consistent linear system. But there are two unappealing options to solve it.

  • Take only six independent measurements and try to solve the system with a $6\times 6$ matrix. We will likely get a unique solution, but that solution will likely overfit the measurement errors in those six samples.

  • Take a larger number of measurements $n$ and construct an $n\times 6$ matrix from all of them. But due to measurement errors, it is unlikely that a single ellipse will pass through all of them. We now have an inconsistent system with no solutions.

The key idea is: the fact that our measurements include small, unbiased errors does not make them worthless! We simply need a principled approach to doing the best job we can given the errors in our measurements.

If a system of equations $Ax = b$ has no solutions, that is because $b$ is not in the column space of $A$.

So let's find the closest vector $\hat{b} = A\hat{x}$ that is in the column space, and solve for the $\hat{x}$ in that equation.

(Note that there is a unique closest vector $\hat{b}$, although there might be multiple solutions $\hat{x}$ such that $\hat{b} = A\hat{x}$.)

In [2]:
#
fig = ut.three_d_figure((24, 3), 'Least Squares looks for the Closest Point in Col A',
                        -15, 15, -15, 15, -15, 15, (8, 8), qr = qr_setting)
a2 = np.array([5.0,-13.0,-3.0])
a1 = np.array([1,-2.0,3])
v = -3*a1 + a2
y = np.array([12.0, 8.0, -5.0])
A = np.array([a1, a2]).T
#
# plotting the span of a1 and a2 (i.e., Col A)
fig.plotSpan(a1, a2, 'Green')
#
yhat = A @ np.linalg.inv(A.T @ A) @ A.T @ y
fig.plotPoint(yhat[0], yhat[1], yhat[2], 'b')
fig.text(yhat[0]+0.5, yhat[1]+0.5, yhat[2]+0.5, r'$\mathbf{A\hat{x}}$', 'Ax-hat', size=18)
#
fig.plotPoint(y[0], y[1], y[2],'b')
fig.text(y[0]+0.5, y[1]+0.5, y[2]+0.5, r'$\bf b$', 'b', size=18)
#
# origin seems unhelpful
#fig.plotPoint(0, 0, 0, 'b')
#fig.text(0-1.5, 0-1.5, 0-1.5, r'$\bf 0$', '0', size=18)
#
x1 = y + np.array([8, 8, 8])
x1hat = A @ np.linalg.inv(A.T @ A) @ A.T @ x1
fig.plotPoint(x1hat[0], x1hat[1], x1hat[2], 'r')
#
x2 = y + np.array([-4, -6, 6])
x2hat = A @ np.linalg.inv(A.T @ A) @ A.T @ x2
fig.plotPoint(x2hat[0], x2hat[1], x2hat[2], 'r')
#
wpos = -4.5*a1 
fig.text(wpos[0], wpos[1], wpos[2], r'Col $A$', 'Col A', size=22)
#
# lines
fig.plotLine([y, yhat], 'b', '-')
fig.plotLine([y, x1hat], 'r', '--')
fig.plotLine([y, x2hat], 'r', '--')

#ut.plotPoint3d(ax,0,0,0,'b')
fig.ax.view_init(azim=100,elev=-20.0)
fig.hideAxes() 
fig.set_title(r'Least Squares looks for the Closest Point in Col $A$', 'Least Squares looks for the Closest Point in Col A', size = 16)
fig.save();

If $A$ is $m\times n$ and $b$ is in $\mathbb{R}^m$, then the general least-squares problem is to find an $\hat{x}$ that makes $\|A\hat{x} - b\|$ as small as possible. Equivalently, our goal is to minimize the sum of squared error:

$$\|A\hat{x} - b\|^2 = \sum_i (\hat{b}_i - b_i)^2$$

where $\hat{b} = A\hat{x}$ contains the estimated values and $b$ contains the measured values.

The best estimated value $\hat{b}$ must be the orthogonal projection of $b$ onto the subspace $\operatorname{Col} A$. That is the closest vector in $\operatorname{Col} A$ to the measured value $b$.

Any other vector $v$ in $\operatorname{Col} A$ will be farther away, because its distance to $b$ is the hypotenuse of a right triangle.
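
Here is a minimal numerical check of this right-triangle picture (not part of the original lecture code). It reuses the vectors $a_1$, $a_2$, and $b$ from the figure cells in this section; the particular choice of $v$ is arbitrary, and the projection formula used here is derived later in this lecture.

In [ ]:
# check: the projection bhat of b onto Col A is closer to b than any
# other vector v in Col A (vectors taken from the figure code)
A = np.array([[1.0, 5.0], [-2.0, -13.0], [3.0, -3.0]])   # columns a1, a2
b = np.array([12.0, 8.0, -5.0])

bhat = A @ np.linalg.inv(A.T @ A) @ A.T @ b              # projection onto Col A
v = A @ np.array([2.0, -1.0])                            # some other vector in Col A

print(np.linalg.norm(b - bhat))                          # distance to the projection
print(np.linalg.norm(b - v))                             # strictly larger
# Pythagorean relation: ||b - v||^2 = ||b - bhat||^2 + ||bhat - v||^2
print(np.isclose(np.linalg.norm(b - v)**2,
                 np.linalg.norm(b - bhat)**2 + np.linalg.norm(bhat - v)**2))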

In [3]:
#
fig = ut.three_d_figure((24, 5), 'Comparison of v to the projection of b',
                        -15, 15, -15, 15, -15, 15, (9,9), qr = qr_setting)
a2 = np.array([5.0,-13.0,-3.0])
a1 = np.array([1,-2.0,3])
v = -3*a1 + a2
y = np.array([12.0, 8.0, -5.0])
A = np.array([a1, a2]).T
#
# plotting the span of a1 and a2 (i.e., Col A)
fig.plotSpan(a1, a2, 'Green')
#
yhat = A.dot(np.linalg.inv(A.T.dot(A))).dot(A.T).dot(y)
fig.plotPoint(yhat[0], yhat[1], yhat[2], 'b')
fig.text(yhat[0]+0.5, yhat[1]+0.5, yhat[2]+0.5, r'$\mathbf{\hat{b}}$', 'y-hat', size=18)
#
fig.plotPoint(y[0], y[1], y[2],'b')
fig.text(y[0]+0.5, y[1]+0.5, y[2]+0.5, r'$\bf b$', 'y', size=18)
#
fig.plotPoint(v[0], v[1], v[2], 'b')
fig.text(v[0]-1.5, v[1]-1.5, v[2]-1.5, r'$\bf v$', 'v', size=18)
#
m1 = (y + v) / 2
m1 = m1 + [2,-6,0]
fig.text(m1[0], m1[1], m1[2], r'$\Vert \bf b - \bf v\Vert$', '||b - v||', size=18)
m2 = (y + yhat) / 2
m2 = m2 + [0,1,0]
fig.text(m2[0], m2[1], m2[2], r'$\Vert \bf b - \hat{\bf b}\Vert$', '||b - b-hat||', size=18)
m3 = (yhat + v) / 2
m3 = m3 + [-1,0,-4]
fig.text(m3[0], m3[1], m3[2], r'$\Vert \hat{\bf b} - \bf v\Vert$', '||b-hat - v||', size=18)
#
wpos = -4.5*a1 
fig.text(wpos[0], wpos[1], wpos[2], r'$W$', 'W', size=22)
#
# lines
fig.plotLine([y, yhat], 'b')
fig.plotLine([v, yhat], 'b')
fig.plotLine([y, v], 'b')
#ut.plotPoint3d(ax,0,0,0,'b')
fig.ax.view_init(azim=57.0,elev=-69.0)
fig.hideAxes() 
fig.set_title(r'Relationship of $\mathbf{b}$ to $\hat{\mathbf{b}} = \operatorname{Proj}_W \mathbf{b}$', 
              'Relationship of b Proj_W b', size = 16)
qrcode = fig.save();

24.4 Projection Solves Least Squares¶

The Normal Equations¶

So: how are we going to find this projection $\hat{b}$?

Here is the key idea:

We know that the projection $\hat{b}$ has the property that $\hat{b} - b$ is orthogonal to $\operatorname{Col} A$.

Suppose $\hat{b}$ is $\operatorname{proj}_{\operatorname{Col} A}\, b$, and that $\hat{x}$ satisfies $A\hat{x} = \hat{b}$.

So $A\hat{x} - b$ is orthogonal to each column of $A$.

If $a_j$ is any column of $A$, then

$$a_j^\top(A\hat{x} - b) = 0.$$

Now, each $a_j^\top$ is a row of $A^\top$.

We can collect all of the equations for all the $a_j$ as:

$$A^\top(A\hat{x} - b) = 0.$$

So

$$A^\top A\hat{x} - A^\top b = 0$$

So

$$A^\top A\hat{x} = A^\top b$$

Looking at this, we see that $A^\top b$ is a vector, and $A^\top A$ is a matrix, so this is a standard linear system.

This linear system is called the normal equations for $Ax = b$.

Its solution is usually denoted $\hat{x}$.

Theorem. The set of least-squares solutions of $Ax = b$ is equal to the (nonempty) set of solutions of the normal equations $A^\top Ax = A^\top b$.

Proof.

(1) The set of solutions is nonempty. The matrix $A^\top A$ on the left has the same column space as $A^\top$, and the vector $A^\top b$ on the right lies in the column space of $A^\top$, so the system $A^\top Ax = A^\top b$ is consistent.

And, by the arguments above, any least-squares solution of $Ax = b$ must satisfy the normal equations $A^\top Ax = A^\top b$.

(2) Now let's show that any solution of $A^\top Ax = A^\top b$ is a least-squares solution of $Ax = b$.

If $\hat{x}$ satisfies $A^\top Ax = A^\top b$, then $A^\top(A\hat{x} - b) = 0$,

which shows that $A\hat{x} - b$ is orthogonal to the rows of $A^\top$, and so is orthogonal to the columns of $A$.

So the vector $A\hat{x} - b$ is orthogonal to $\operatorname{Col} A$.

So the equation

$$b = A\hat{x} + (b - A\hat{x})$$

is a decomposition of $b$ into the sum of a vector in $\operatorname{Col} A$ and a vector orthogonal to $\operatorname{Col} A$.

Since the orthogonal decomposition is unique, $A\hat{x}$ must be the orthogonal projection of $b$ onto the column space of $A$.

So $A\hat{x} = \hat{b}$, and $\hat{x}$ is a least-squares solution.

Example. Find the least-squares solution of the inconsistent system $Ax = b$ for

$$A = \begin{bmatrix} 4 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 2 \\ 0 \\ 11 \end{bmatrix}.$$

Solution.

We will use the normal equations $A^\top A\hat{x} = A^\top b$.

$$A^\top A = \begin{bmatrix} 4 & 0 & 1 \\ 0 & 2 & 1 \end{bmatrix}\begin{bmatrix} 4 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 17 & 1 \\ 1 & 5 \end{bmatrix}$$

$$A^\top b = \begin{bmatrix} 4 & 0 & 1 \\ 0 & 2 & 1 \end{bmatrix}\begin{bmatrix} 2 \\ 0 \\ 11 \end{bmatrix} = \begin{bmatrix} 19 \\ 11 \end{bmatrix}$$

So the normal equations are:

$$\begin{bmatrix} 17 & 1 \\ 1 & 5 \end{bmatrix}\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \begin{bmatrix} 19 \\ 11 \end{bmatrix}$$

We can solve this using row operations, or by inverting $A^\top A$.

$$(A^\top A)^{-1} = \frac{1}{84}\begin{bmatrix} 5 & -1 \\ -1 & 17 \end{bmatrix}$$

And we can then solve $A^\top A\hat{x} = A^\top b$ as

$$\hat{x} = (A^\top A)^{-1}A^\top b = \frac{1}{84}\begin{bmatrix} 5 & -1 \\ -1 & 17 \end{bmatrix}\begin{bmatrix} 19 \\ 11 \end{bmatrix} = \frac{1}{84}\begin{bmatrix} 84 \\ 168 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.$$

So we conclude that $\hat{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ is the vector that minimizes $\|Ax - b\|$.

More formally,

$$\hat{x} = \arg\min_x \|Ax - b\|.$$

That is, $\hat{x}$ is the least-squares solution of $Ax = b$.
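
Here is a quick numerical check of this example, as a sketch in NumPy; note that np.linalg.lstsq solves the same least-squares problem directly.

In [ ]:
# verify the worked example: solve the normal equations A^T A x = A^T b
A = np.array([[4.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([2.0, 0.0, 11.0])

xhat = np.linalg.solve(A.T @ A, A.T @ b)
print(xhat)                                    # [1. 2.]

# same answer from the library least-squares routine
print(np.linalg.lstsq(A, b, rcond=None)[0])    # [1. 2.]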

When the Normal Equations have Multiple Solutions¶

We have seen that the normal equations always have a solution.

Is there always a unique solution?

No, there can be multiple solutions that all minimize $\|Ax - b\|$.

Let's remind ourselves of what is going on when a linear system has multiple solutions.

We know that a linear system has multiple solutions when there are columns that are not pivot columns.

Equivalently, when $A\hat{x} = \hat{b}$ has multiple solutions, the columns of $A$ are linearly dependent.

Here is a picture of what is going on. In this case, $A$ is $3\times 3$.

But note that $\operatorname{Col} A$ is only two-dimensional, because the three columns are linearly dependent.

In [12]:
#
fig = ut.three_d_figure((24, 9), 'Multiple Solutions to the Normal Equations', 
                        -15, 15, -15, 15, -15, 15, (7,7), qr = qr_setting)
a2 = np.array([5.0,-13.0,-3.0])
a1 = np.array([1,-2.0,3])
a3 = -2*a1 + a2
b = np.array([6.0, 8.0, -5.0])
A = np.array([a1, a2]).T
bhat = A.dot(np.linalg.inv(A.T.dot(A))).dot(A.T).dot(b)
fig.text(a1[0], a1[1], a1[2], r'$\bf a_1$', 'a_1', size=20)
fig.text(a2[0], a2[1], a2[2], r'$\bf a_2$', 'a_2', size=20)
fig.text(a3[0], a3[1], a3[2], r'$\bf a_3$','a_3', size=20)
fig.text(b[0], b[1], b[2], r'$\bf b$', 'b', size=20)
fig.text(bhat[0], bhat[1], bhat[2], r'$A\mathbf{\hat{x}} = \mathbf{\hat{b}}$', 'b-hat', size=20)
#ax.text(1,-4,-10,r'Span{$\bf a,b$}',size=16)
#ax.text(0.2,0.2,-4,r'$\bf 0$',size=20)
# plotting the span of a1 and a2
fig.plotSpan(a1, a2, 'Green')
fig.plotPoint(a1[0], a1[1], a1[2], 'r')
fig.plotPoint(a2[0], a2[1], a2[2], 'r')
fig.plotPoint(a3[0], a3[1], a3[2], 'r')
fig.plotPoint(b[0], b[1], b[2], 'b')
fig.plotPoint(bhat[0], bhat[1], bhat[2], 'b')
fig.plotLine([b, bhat], 'b', '--')
#ut.plotPoint3d(ax,0,0,0,'b')
fig.set_title(r'When there are Multiple Solutions to the Normal Equations', 
              'When there are Multiple Solutions to the Normal Equations', size=20)
fig.ax.view_init(azim=26.0,elev=-77.0)
fig.save();

Example.

Find a least-squares solution for $Ax = b$ for

$$A = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}, \qquad b = \begin{bmatrix} -3 \\ -1 \\ 0 \\ 2 \\ 5 \\ 1 \end{bmatrix}.$$

Solution. Compute

$$A^\top A = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 2 & 2 & 2 \\ 2 & 2 & 0 & 0 \\ 2 & 0 & 2 & 0 \\ 2 & 0 & 0 & 2 \end{bmatrix}$$

$$A^\top b = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix}\begin{bmatrix} -3 \\ -1 \\ 0 \\ 2 \\ 5 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ -4 \\ 2 \\ 6 \end{bmatrix}$$

To solve $A^\top A\hat{x} = A^\top b$, we'll use row reduction. The augmented matrix $[\,A^\top A \mid A^\top b\,]$ is:

$$\left[\begin{array}{cccc|c} 6 & 2 & 2 & 2 & 4 \\ 2 & 2 & 0 & 0 & -4 \\ 2 & 0 & 2 & 0 & 2 \\ 2 & 0 & 0 & 2 & 6 \end{array}\right] \sim \left[\begin{array}{cccc|c} 1 & 0 & 0 & 1 & 3 \\ 0 & 1 & 0 & -1 & -5 \\ 0 & 0 & 1 & -1 & -2 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right]$$

In the reduced echelon form, only the first three of the four coefficient columns are pivot columns; the fourth can be written as a linear combination of the first three.

Hence, the columns of $A^\top A$ are linearly dependent. This happens because the columns of $A$ are linearly dependent.

You can see this as follows: if $A$ has a nontrivial null space, then $A^\top A$ has the same nontrivial null space (if $Ax = 0$, then certainly $A^\top Ax = 0$).

So there is a free variable, $x_4$.

The general solution is then $x_1 = 3 - x_4$, $x_2 = -5 + x_4$, $x_3 = -2 + x_4$, and $x_4$ is free.

So the general least-squares solution of $Ax = b$ has the form

$$\hat{x} = \begin{bmatrix} 3 \\ -5 \\ -2 \\ 0 \end{bmatrix} + x_4\begin{bmatrix} -1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$

Keep in mind that the orthogonal projection $\hat{b}$ is always unique.

The reason that there are multiple solutions to this least-squares problem is that there are multiple ways to construct $\hat{b}$.

The reason that there are multiple ways to construct $\hat{b}$ is that the columns of $A$ are linearly dependent, so any vector in the column space of $A$ can be constructed from them in multiple ways.
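
As a sketch, we can check this example numerically: $A^\top A$ is singular, and np.linalg.lstsq returns one particular member of the solution family (the minimum-norm one), while $A\hat{x}$ is the same for every member.

In [ ]:
# check the rank-deficient example
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1]], dtype=float)
b = np.array([-3.0, -1.0, 0.0, 2.0, 5.0, 1.0])

print(np.linalg.matrix_rank(A.T @ A))        # 3, so A^T A is not invertible
xhat, resid, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(xhat)          # the minimum-norm solution: [3,-5,-2,0] + 2.5*[-1,1,1,1]
print(A @ xhat)      # bhat, which is the same for every least-squares solution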

Here is a theorem that allows us to identify when there are multiple least-squares solutions.

Theorem. Let $A$ be an $m\times n$ matrix. The following statements are equivalent:

  1. The equation $Ax = b$ has a unique least-squares solution for each $b$ in $\mathbb{R}^m$.
  2. The columns of $A$ are linearly independent.
  3. The matrix $A^\top A$ is invertible.

When these statements are true, the least-squares solution $\hat{x}$ is given by:

$$\hat{x} = (A^\top A)^{-1}A^\top b$$

Finding $\hat{b}$ directly.

When $A^\top A$ is invertible, so that $\hat{x}$ is unique, we can put together the two equations

$$\hat{x} = (A^\top A)^{-1}A^\top b$$

and

$$A\hat{x} = \hat{b}$$

to get:

$$\hat{b} = A(A^\top A)^{-1}A^\top b$$

Let's stop and look at this another way. Up until now we have seen how to project a point onto a line, or onto a subspace with an orthogonal basis.

Now here is a formula for projection onto a subspace (namely $\operatorname{Col} A$) given an arbitrary basis. This is a general formula that can be very useful.
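
Here is a minimal sketch of this projection formula in NumPy, assuming the columns of $A$ are linearly independent so that $A^\top A$ is invertible. The matrix $P = A(A^\top A)^{-1}A^\top$ projects any vector onto $\operatorname{Col} A$.

In [ ]:
# the projection matrix onto Col A (assumes A has linearly independent columns)
A = np.array([[1.0, 5.0], [-2.0, -13.0], [3.0, -3.0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T          # projects R^3 onto Col A

b = np.array([12.0, 8.0, -5.0])
bhat = P @ b                                   # orthogonal projection of b

print(np.allclose(P @ P, P))                   # projecting twice changes nothing
print(np.allclose(A.T @ (b - bhat), 0))        # residual is orthogonal to Col A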

Lecture 25: Linear Regressions¶

[This lecture is based on Prof. Crovella's CS 132 lecture notes.]

25.1 Least Squares Example¶

Here is a graph showing the number of BU computer science majors over many years, from 2004 to 2017. We'd like to have some estimate of where the numbers might be going.

In [21]:
#
cohortMatrix = np.array([
[56, 42, 49, 39, 40, 31, 33, 46, 55, 91, 83, 129, 153, 178],
[43, 36, 27, 34, 24, 29, 39, 56, 74, 69, 111, 136, 168, 171],
[32, 24, 22, 21, 26, 25, 44, 64, 52, 77, 105, 130, 139, 175],
[25, 16, 19, 28, 24, 30, 37, 40, 49, 56, 79, 93, 90, 126]])

nyears = np.shape(cohortMatrix)[1]
# index rows by time, and columns by cohort    
Year = pd.DateOffset(years=1)
# need to fliplr because spreadsheet rows are in decreasing cohort order
datestrs=['09-2004','09-2005','09-2006','09-2007','09-2008','09-2009','09-2010','09-2011','09-2012','09-2013','09-2014','09-2015','09-2016','09-2017']
dates = [datetime.strptime(x, '%m-%Y') for x in datestrs[:nyears]]
cohorts = pd.DataFrame(np.fliplr(cohortMatrix.T),index=dates,columns=pd.Index(['U1','U2','U3','U4']))
# learning the model
b = np.array(cohorts[1:])
a = np.array(cohorts[:-1])
x, resid, rank, s = np.linalg.lstsq(a,b,rcond=None)
A = x.T
#
cohorts.sum(axis=1).plot()
plt.ylim(ymin=0)
# plt.legend(['Model','Actual'],loc='best')
plt.title('Number of BU CS Majors',size=20);   

Modeling the number of students in the CS major is challenging because students enter and leave the major at various points during their undergraduate degree.

Here is the raw data of the number of declared CS majors, broken down into 1st, 2nd, 3rd, and 4th year students.

In [7]:
cohortMatrix = np.array([
[56, 42, 49, 39, 40, 31, 33, 46, 55, 91, 83, 129, 153, 178],
[43, 36, 27, 34, 24, 29, 39, 56, 74, 69, 111, 136, 168, 171],
[32, 24, 22, 21, 26, 25, 44, 64, 52, 77, 105, 130, 139, 175],
[25, 16, 19, 28, 24, 30, 37, 40, 49, 56, 79, 93, 90, 126]])

We will use a simple linear model that we will train on historical data.

Our state variable is

$$x_t = \begin{bmatrix} x_{1,t} \\ x_{2,t} \\ x_{3,t} \\ x_{4,t} \end{bmatrix}$$

where $x_{i,t}$ is the number of students who are declared CS majors and in their $i$th year at BU, in the fall of year $t$. Our model is a linear dynamical system (or linear recurrence):

$$x_{t+1} = Ax_t.$$

Given historical data, which are measurements of our state variable from past years, we now know how to estimate $A$ using least squares!

In [22]:
# learning the model: each row of a is one year's state vector x_t, and the
# corresponding row of b is the next year's state x_{t+1}; solving the
# least-squares problem  a @ X = b  then gives X = A^T
cohort = pd.DataFrame(np.fliplr(cohortMatrix.T))
b = np.array(cohort[1:])
a = np.array(cohort[:-1])
In [23]:
print(a)
[[ 25  32  43  56]
 [ 16  24  36  42]
 [ 19  22  27  49]
 [ 28  21  34  39]
 [ 24  26  24  40]
 [ 30  25  29  31]
 [ 37  44  39  33]
 [ 40  64  56  46]
 [ 49  52  74  55]
 [ 56  77  69  91]
 [ 79 105 111  83]
 [ 93 130 136 129]
 [ 90 139 168 153]]
In [24]:
print(b)
[[ 16  24  36  42]
 [ 19  22  27  49]
 [ 28  21  34  39]
 [ 24  26  24  40]
 [ 30  25  29  31]
 [ 37  44  39  33]
 [ 40  64  56  46]
 [ 49  52  74  55]
 [ 56  77  69  91]
 [ 79 105 111  83]
 [ 93 130 136 129]
 [ 90 139 168 153]
 [126 175 171 178]]
In [26]:
x, resid, rank, s = np.linalg.lstsq(a,b,rcond=None)
A = x.T
print(A)
[[ 0.62110633  0.28792787  0.0204491   0.10003897]
 [ 0.65239203  0.55574243  0.30323349 -0.16349735]
 [ 0.33101614  1.24636712 -0.26153545  0.07684781]
 [ 0.49319575 -0.30684656  1.00419585  0.07532737]]

Now, using $A$, we can predict growth into the future via $x_{t+1} = Ax_t$.
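
For instance, here is a rough sketch (not part of the original analysis) that iterates the fitted model a few years past the last measured fall; actual enrollments will of course deviate from the model.

In [ ]:
# iterate x_{t+1} = A x_t starting from the most recent state vector
x = np.array(cohorts.iloc[-1])           # fall 2017 cohort sizes
for year in range(2018, 2021):
    x = A @ x                            # one-year-ahead prediction
    print(year, x.round(1), x.sum().round(1))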

In [4]:
OneYearAhead = cohorts.dot(A.T).shift(1,freq=Year)
OneYearAhead.columns=pd.Index(['U1','U2','U3','U4'])
OneYearAhead.sum(axis=1).plot()
cohorts.sum(axis=1).plot()
plt.ylim(ymin=0)
plt.legend(['Model','Actual'],loc='best');

Now, the matrix $A$ captures the complex way that one year's student population relates to the next year's population.

It can be hard to understand what is going on just by looking at the values of $A$.

But, when we look at the plot above, it appears that there is a relatively simple kind of exponential growth going on.

25.2 Modeling via Regression¶

One of the most fundamental kinds of machine learning is the construction of a model that can be used to summarize a set of data.

A model is a concise description of a dataset or a real-world phenomenon.

For example, an equation can be a model if we use the equation to describe something in the real world.

The most common form of modeling is regression, which means constructing an equation that describes the relationships among variables.

For example, we may look at these points and observe that they approximately lie on a line.

So we could decide to model this data using a line.

In [6]:
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0] + (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize=6);

We may look at these points and decide to model them using a quadratic function.

In [7]:
#
ax = ut.plotSetup(-10, 10, -10, 20, size = (7, 7))
ut.centerAxes(ax)
quad = np.array([1, 3, 0.5])
xquad = -10.0 + 20.0 * np.random.random(100)
yquad = quad[0] + (quad[1] * xquad) + (quad[2] * xquad * xquad) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xquad, yquad, 'ro', markersize=6); 

And we may look at these points and decide to model them using a logarithmic function.

In [8]:
#
ax = ut.plotSetup(-10, 10, -10, 15, size = (7, 7))
ut.centerAxes(ax)
log = np.array([1, 4])
xlog = 10.0 * np.random.random(100)
ylog = log[0] + log[1] * np.log(xlog) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlog, ylog, 'ro', markersize=6);

Clearly, none of these datasets agrees perfectly with the proposed model. So the question arises:

How do we find the best linear function (or quadratic function, or logarithmic function) given the data?

The Framework of Linear Models¶

The regression problem has been studied extensively in the field of statistics and machine learning.

Certain terminology is used:

  • Some values are referred to as "independent," and
  • Some values are referred to as "dependent."

The basic regression task is:

  • given a set of independent variables
  • and the associated dependent variables,
  • estimate the parameters of a model (such as a line, parabola, etc) that describes how the dependent variables are related to the independent variables.

The independent variables are collected into a matrix $X$, which is called the design matrix.

The dependent variables are collected into an observation vector $y$.

The parameters of the model (for any kind of model) are collected into a parameter vector $\beta$.

In [9]:
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0] + (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize = 6)
ax.plot(xlin, line[0] + line[1] *xlin, 'b-')
plt.text(-9, 3, r'$y = \beta_0 + \beta_1x$', size=20); 

25.3 Fitting a Line to Data¶

The first kind of model we'll study is a linear equation:

$$y = \beta_0 + \beta_1 x.$$

This is the most commonly used type of model, particularly in fields like economics, psychology, biology, etc.

The reason it is so commonly used is that experimental data often produce points $(x_1, y_1), \dots, (x_n, y_n)$ that seem to lie close to a line.

The question we must confront is: given a set of data, how should we "fit" the equation of the line to the data?

Our intuition is this: we want to determine the parameters $\beta_0, \beta_1$ that define a line that is as "close" to the points as possible.

Let's develop some terminology for evaluating a model.

Suppose we have a line $y = \beta_0 + \beta_1 x$. For each data point $(x_j, y_j)$, there is a point $(x_j, \beta_0 + \beta_1 x_j)$ that is the point on the line with the same $x$-coordinate.

In [10]:
# image credit: Lay, LAA, 4th edition
display(Image("images/Lay-fig-6-6-1.jpg", width=550))

We call $y_j$ the observed value of $y$

and we call $\beta_0 + \beta_1 x_j$ the predicted $y$-value.

The difference between an observed $y$-value and a predicted $y$-value is called a residual.

There are several ways to measure how "close" the line is to the data.

The usual choice is to sum the squares of the residuals.

(Note that the residuals themselves may be positive or negative -- by squaring them, we ensure that our error measures don't cancel out.)

The least-squares line is the line $y = \beta_0 + \beta_1 x$ that minimizes the sum of squares of the residuals.

The coefficients $\beta_0, \beta_1$ of the line are called regression coefficients.

Regression is a Least-Squares Problem¶

Let's imagine for a moment that the data fit a line perfectly.

Then, if each of the data points happened to fall exactly on the line, the parameters $\beta_0$ and $\beta_1$ would satisfy the equations

$$\begin{aligned} \beta_0 + \beta_1 x_1 &= y_1 \\ \beta_0 + \beta_1 x_2 &= y_2 \\ \beta_0 + \beta_1 x_3 &= y_3 \\ &\ \ \vdots \\ \beta_0 + \beta_1 x_n &= y_n \end{aligned}$$

We can write this system as

$$X\beta = y$$

where

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

If the data points don't actually lie exactly on a line,

... then there are no parameters $\beta_0, \beta_1$ for which the predicted $y$-values in $X\beta$ equal the observed $y$-values in $y$,

... and $X\beta = y$ has no solution.

Now, since the data doesn't fall exactly on a line, we have decided to seek the $\beta$ that minimizes the sum of squared residuals, i.e.,

$$\sum_i ((\beta_0 + \beta_1 x_i) - y_i)^2 = \|X\beta - y\|^2$$

This is key: the sum of squares of the residuals is exactly the square of the distance between the vectors $X\beta$ and $y$.

This is a least-squares problem, $Ax = b$, with different notation.

Computing the least-squares solution of $X\beta = y$ is equivalent to finding the $\beta$ that determines the least-squares line.
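
As a quick sketch of that equivalence, we can build the design matrix $X = [\,\mathbf{1}\;\; x\,]$ for the synthetic data generated above and solve the normal equations directly.

In [ ]:
# fit the least-squares line to the synthetic (xlin, ylin) data above
X = np.column_stack([np.ones_like(xlin), xlin])   # design matrix [1  x]
beta = np.linalg.solve(X.T @ X, X.T @ ylin)       # solve the normal equations
print(beta)   # should be close to the true parameters [1, 0.5] used to generate the data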

In [11]:
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0]+ (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize=6)
ax.plot(xlin, line[0] + line[1] * xlin, 'b-')
plt.text(-9,3,r'$y = \beta_0 + \beta_1x$',size=20);

Example. Find the equation $y = \beta_0 + \beta_1 x$ of the least-squares line that best fits the data points

x y
2 1
5 2
7 3
8 3
In [12]:
#
ax = ut.plotSetup(0, 10, 0, 4, size = (7, 7))
ut.centerAxes(ax)
pts = np.array([[2,1], [5,2], [7,3], [8,3]]).T
ax.plot(pts[0], pts[1], 'ro', markersize = 10);
plt.xlabel('x', size = 20)
plt.ylabel('y', size = 20);

Solution. Use the $x$-coordinates of the data to build the design matrix $X$, and the $y$-coordinates to build the observation vector $y$:

$$X = \begin{bmatrix} 1 & 2 \\ 1 & 5 \\ 1 & 7 \\ 1 & 8 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 3 \end{bmatrix}$$

Now, to obtain the least-squares line, find the least-squares solution to $X\beta = y$.

We do this by solving the normal equations that we learned last lecture (just with new notation):

$$X^\top X\beta = X^\top y$$

So, we compute:

$$X^\top X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 5 & 7 & 8 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 1 & 5 \\ 1 & 7 \\ 1 & 8 \end{bmatrix} = \begin{bmatrix} 4 & 22 \\ 22 & 142 \end{bmatrix}$$

$$X^\top y = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 5 & 7 & 8 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 9 \\ 57 \end{bmatrix}$$

So the normal equations are:

$$\begin{bmatrix} 4 & 22 \\ 22 & 142 \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 9 \\ 57 \end{bmatrix}$$

Solving, we get:

$$\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 4 & 22 \\ 22 & 142 \end{bmatrix}^{-1}\begin{bmatrix} 9 \\ 57 \end{bmatrix} = \frac{1}{84}\begin{bmatrix} 142 & -22 \\ -22 & 4 \end{bmatrix}\begin{bmatrix} 9 \\ 57 \end{bmatrix} = \begin{bmatrix} 2/7 \\ 5/14 \end{bmatrix}$$

So the least-squares line has the equation

$$y = \frac{2}{7} + \frac{5}{14}x.$$
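
A quick numerical check of this computation (a sketch using the normal equations):

In [ ]:
# verify the hand computation
X = np.array([[1., 2.], [1., 5.], [1., 7.], [1., 8.]])
y = np.array([1., 2., 3., 3.])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)     # [0.2857..., 0.3571...], i.e. [2/7, 5/14]
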
In [4]:
#
ax = ut.plotSetup(0, 10, 0, 4, size = (7, 7))
ut.centerAxes(ax)
pts = np.array([[2,1], [5,2], [7,3], [8,3]]).T
ax.plot(pts[0], pts[1], 'ro', markersize = 10)
ut.plotLinEqn(-5./14, 1, 2./7, color='b')
plt.text(6,1.5, r'$y = \frac{2}{7} + \frac{5}{14}x$', size=24);