In [1]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';

Announcements¶

  • In-class test will be on Friday, March 31
    • No calculators or reference material
  • A sample test has been posted online in the Piazza resources tab
  • Upcoming office hours:
    • Today: Prof McDonald from 4:30-6pm in CCDS 1341
    • Tomorrow: Peer tutor Rohan Anand from 1:30-3pm in CCDS 16th floor
    • Tomorrow: Peer tutor Daniel Cho from 2-4pm in CCDS

Recap from last lecture¶

We introduced regression, which is the problem of constructing an equation that describes the relationship among variables.

For example, we may look at these points and observe that they approximately lie on a line.

So we could decide to model this data using a line.

In [6]:
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0] + (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize=6);

We may look at these points and decide to model them using a quadratic function.

In [7]:
#
ax = ut.plotSetup(-10, 10, -10, 20, size = (7, 7))
ut.centerAxes(ax)
quad = np.array([1, 3, 0.5])
xquad = -10.0 + 20.0 * np.random.random(100)
yquad = quad[0] + (quad[1] * xquad) + (quad[2] * xquad * xquad) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xquad, yquad, 'ro', markersize=6); 

And we may look at these points and decide to model them using a logarithmic function.

In [8]:
#
ax = ut.plotSetup(-10, 10, -10, 15, size = (7, 7))
ut.centerAxes(ax)
log = np.array([1, 4])
xlog = 10.0 * np.random.random(100)
ylog = log[0] + log[1] * np.log(xlog) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlog, ylog, 'ro', markersize=6);

The Framework of Linear Models¶

The regression problem has been studied extensively in the field of statistics and machine learning. We are working with:

  • Values that are referred to as "independent," and
  • Values that are referred to as "dependent."

The basic regression task is:

  • given a set of independent variables
  • and the associated dependent variables,
  • estimate the parameters of a model (such as a line, parabola, etc) that describes how the dependent variables are related to the independent variables.

The independent variables are collected into a matrix $X$, which is called the design matrix.

The dependent variables are collected into an observation vector $y$.

The parameters of the model (for any kind of model) are collected into a parameter vector $\beta$.

In [9]:
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0] + (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize = 6)
ax.plot(xlin, line[0] + line[1] *xlin, 'b-')
plt.text(-9, 3, r'$y = \beta_0 + \beta_1x$', size=20); 

We call $y_j$ the observed value of $y$

and we call $\beta_0 + \beta_1 x_j$ the predicted $y$-value.

The difference between an observed $y$-value and a predicted $y$-value is called a residual.

In [10]:
# image credit: Lay, LAA, 4th edition
display(Image("images/Lay-fig-6-6-1.jpg", width=550))

Regression is a Least-Squares Problem¶

Let's imagine for a moment that the data fit a line perfectly.

Then, if each of the data points happened to fall exactly on the line, the parameters $\beta_0$ and $\beta_1$ would satisfy the equations

$$
\begin{aligned}
\beta_0 + \beta_1 x_1 &= y_1 \\
\beta_0 + \beta_1 x_2 &= y_2 \\
\beta_0 + \beta_1 x_3 &= y_3 \\
&\;\;\vdots \\
\beta_0 + \beta_1 x_n &= y_n
\end{aligned}
$$

We can write this system as

$$X\beta = y$$

where

$$
X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
$$

If the data points don't actually lie exactly on a line,

... then there are no parameters $\beta_0, \beta_1$ for which the predicted $y$-values in $X\beta$ equal the observed $y$-values in $y$,

... and $X\beta = y$ has no solution.

Now, to obtain the least-squares line, find the least-squares solution to $X\beta = y$.

We do this by solving the normal equation that we learned last lecture (just with new notation):

$$X^TX\beta = X^Ty$$
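
To make this concrete, here is a short sketch (not part of the original lecture code) that fits the least-squares line to the simulated (xlin, ylin) data plotted above by solving the normal equations directly; it assumes those arrays are still in scope.

# sketch: fit the least-squares line to the (xlin, ylin) data generated above
X_line = np.column_stack([np.ones(len(xlin)), xlin])   # design matrix: columns 1 and x
beta_line = np.linalg.solve(X_line.T @ X_line, X_line.T @ ylin)   # solve X^T X beta = X^T y
print(beta_line)                        # should be close to the true parameters [1, 0.5]
residuals = ylin - X_line @ beta_line   # observed minus predicted y-values
print(np.linalg.norm(residuals))        # the quantity that least squares minimizes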

Lecture 26: Multiple Regression¶

26.1 The General Linear Model¶

Another way that the inconsistent linear system is often written is to collect all the residuals into a residual vector $\epsilon$.

Then an exact equation is

$$y = X\beta + \epsilon$$

Any equation of this form is referred to as a linear model.

In this formulation, the goal is to minimize the length of $\epsilon$, i.e., $\|\epsilon\|$.

In some cases, one would like to fit data points with something other than a straight line.

For example, think of Gauss trying to find the equation for the orbit of Ceres.

In cases like this, the matrix equation is still $X\beta = y$, but the specific form of $X$ changes from one problem to the next.

The least-squares solution $\hat{\beta}$ is a solution of the normal equations

$$X^TX\beta = X^Ty.$$

Least-Squares Fitting of Other Models¶

Most models have parameters, and the object of model fitting is to determine those parameters.

In model fitting, the parameters are the unknowns. A central question for us is whether the model is linear in its parameters.

For example, the model $y = \beta_0 \cdot e^{-\beta_1 x}$ is not linear in its parameters, since $y$ does not depend linearly on $\beta_1$.

The model $y = \beta_0 \cdot e^{-2x}$ is linear in its parameters.

For a model that is linear in its parameters, an observation is a linear combination of (arbitrary) known functions.

In other words, a model that is linear in its parameters is

$$y = \beta_0 f_0(x) + \beta_1 f_1(x) + \dots + \beta_n f_n(x)$$

where $f_0, \dots, f_n$ are known functions and $\beta_0, \dots, \beta_n$ are parameters.
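
As a sketch of how such a model looks in code (the design_matrix helper here is hypothetical, not part of the lecture utilities): each column of the design matrix is one known function evaluated at every data point.

# hypothetical helper: column j of the result is f_j applied elementwise to x
def design_matrix(x, basis_funcs):
    return np.column_stack([f(x) for f in basis_funcs])

# e.g., for the model y = beta_0 + beta_1*x + beta_2*x^2 the known functions are 1, x, x^2
X_example = design_matrix(np.linspace(-1, 1, 20), [np.ones_like, lambda t: t, lambda t: t**2])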

Example. Suppose data points $(x_1, y_1), \dots, (x_n, y_n)$ appear to lie along some sort of parabola instead of a straight line. Suppose we wish to approximate the data by an equation of the form

$$y = \beta_0 + \beta_1 x + \beta_2 x^2.$$

Describe the linear model that produces a "least squares fit" of the data by the equation.

Solution. The ideal relationship is $y = \beta_0 + \beta_1 x + \beta_2 x^2$.

Suppose the actual values of the parameters are $\beta_0, \beta_1, \beta_2$. Then the coordinates of the first data point satisfy the equation

$$y_1 = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_1$$

where $\epsilon_1$ is the residual error between the observed value $y_1$ and the predicted $y$-value.

Each data point determines a similar equation:

$$
\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_2 + \beta_2 x_2^2 + \epsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_n + \beta_2 x_n^2 + \epsilon_n
\end{aligned}
$$

Clearly, this system can be written as $y = X\beta + \epsilon$:

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
$$
In [14]:
#
# build the design matrix with columns 1, x, x^2
m = np.shape(xquad)[0]
X = np.array([np.ones(m), xquad, xquad**2]).T
# solve the normal equations for the least-squares parameters
beta = np.linalg.inv(X.T @ X) @ X.T @ yquad
#
ax = ut.plotSetup(-10, 10, -10, 20, size = (7, 7))
ut.centerAxes(ax)
xplot = np.linspace(-10, 10, 50)
yestplot = beta[0] + beta[1]  * xplot + beta[2] * xplot**2
ax.plot(xplot, yestplot, 'b-', lw=2)
ax.plot(xquad, yquad, 'ro', markersize = 8);
In [15]:
#
# build the design matrix with columns 1, log(x)
m = np.shape(xlog)[0]
X = np.array([np.ones(m), np.log(xlog)]).T
# least-squares parameters via the normal equations
beta = np.linalg.inv(X.T @ X) @ X.T @ ylog
# 
ax = ut.plotSetup(-10, 10, -10, 15, size = (7, 7))
ut.centerAxes(ax)
xplot = np.linspace(0.1, 10, 50)
yestplot = beta[0] + beta[1] * np.log(xplot)
ax.plot(xplot, yestplot, 'b-', lw=2)
ax.plot(xlog, ylog, 'ro', markersize = 8);

26.2 Multiple Regression¶

Suppose an experiment involves two independent variables -- say, $u$ and $v$ -- and one dependent variable, $y$.

A linear equation for predicting $y$ from $u$ and $v$ has the form

$$y = \beta_0 + \beta_1 u + \beta_2 v$$

Since there is more than one independent variable, this is called multiple regression.

A more general prediction equation might have the form

$$y = \beta_0 + \beta_1 u + \beta_2 v + \beta_3 u^2 + \beta_4 uv + \beta_5 v^2$$

A least squares fit to equations like this is called a trend surface.

In general, a linear model will arise whenever $y$ is to be predicted by an equation of the form

$$y = \beta_0 f_0(u,v) + \beta_1 f_1(u,v) + \dots + \beta_k f_k(u,v)$$

with $f_0, \dots, f_k$ any sort of known functions and $\beta_0, \dots, \beta_k$ unknown weights.
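
As an illustration, here is a sketch of the design matrix for the six-term trend-surface model above; the u_obs, v_obs, y_obs arrays are made-up data for the sketch, not anything from the lecture.

# made-up observations of two independent variables and one dependent variable
rng = np.random.default_rng(0)
u_obs = rng.uniform(-5, 5, 50)
v_obs = rng.uniform(-5, 5, 50)
y_obs = 2 + 0.5 * u_obs - 0.3 * v_obs + 0.1 * u_obs * v_obs + rng.normal(scale=0.5, size=50)
# design matrix whose columns are the terms 1, u, v, u^2, uv, v^2
X_surface = np.column_stack([np.ones_like(u_obs), u_obs, v_obs,
                             u_obs**2, u_obs * v_obs, v_obs**2])
# least-squares coefficients for the trend surface
beta_surface = np.linalg.lstsq(X_surface, y_obs, rcond=None)[0]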

Example. In geography, local models of terrain are constructed from data $(u_1, v_1, y_1), \dots, (u_n, v_n, y_n)$, where $u_j$, $v_j$, and $y_j$ are latitude, longitude, and altitude, respectively.

Let's work through a specific example. Here is a set of points in $\mathbb{R}^3$:

In [26]:
#
fig = ut.three_d_figure((14, 1), 'multivariate regression example', 
                       -7, 7, -7, 7, -8, 8,
                        equalAxes = False, figsize = (7, 7), qr = qr_setting)
np.random.seed(6)
v = [4.0,  4.0, 2.0]
u = [-4.0, 3.0, 1.0]
npts = 70
# set locations of points that fall within x,y
xc = -7.0 + 14.0 * np.random.random(npts)
yc = -7.0 + 14.0 * np.random.random(npts)
A = np.array([u,v]).T
# project these points onto the plane
P = A @ np.linalg.inv(A.T @ A) @ A.T
coords = P @ np.array([xc,yc,np.zeros(npts)])
coords[2] += np.random.normal(scale = 1, size = npts)
for i in range(coords.shape[-1]):
    fig.plotPoint(coords[0,i],coords[1,i], coords[2,i], 'r')
fig.set_title('Terrain Data for Multiple Regression', 'Terrain Data for Multiple Regression', size = 20)
fig.ax.set_zlabel('y')
fig.ax.set_xlabel('u')
fig.ax.set_ylabel('v')
fig.desc['xlabel'] = 'u'
fig.desc['ylabel'] = 'v'
fig.desc['zlabel'] = 'y'
fig.ax.view_init(azim=30, elev = 15)
fig.save();

Let's describe the linear model that gives a least-squares fit to such data. The solution is called the least-squares plane.

Solution. We expect the data to satisfy these equations:

$$
\begin{aligned}
y_1 &= \beta_0 + \beta_1 u_1 + \beta_2 v_1 + \epsilon_1 \\
y_2 &= \beta_0 + \beta_1 u_2 + \beta_2 v_2 + \epsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 u_n + \beta_2 v_n + \epsilon_n
\end{aligned}
$$

This system has the matrix form $y = X\beta + \epsilon$, where

$$
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & u_1 & v_1 \\ 1 & u_2 & v_2 \\ \vdots & \vdots & \vdots \\ 1 & u_n & v_n \end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \qquad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
$$
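
As a sketch (this computation is not in the original notebook), we can solve for the fitted plane using the coords array generated in the plotting cell above, where coords[0], coords[1], and coords[2] hold the u, v, and y values respectively:

# sketch: uses the coords array from the cell above (rows 0, 1, 2 hold u, v, y)
u_data, v_data, y_data = coords[0], coords[1], coords[2]
# design matrix with columns 1, u, v
X_plane = np.column_stack([np.ones_like(u_data), u_data, v_data])
# least-squares plane: solve X^T X beta = X^T y for (beta_0, beta_1, beta_2)
beta_plane = np.linalg.solve(X_plane.T @ X_plane, X_plane.T @ y_data)
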
In [27]:
#
fig = ut.three_d_figure((14, 2), 'multivariate regression example with fitted plane', 
                       -7, 7, -7, 7, -8, 8,
                        equalAxes = False, figsize = (7, 7), qr = qr_setting)
np.random.seed(6)
v = [4.0,  4.0, 2.0]
u = [-4.0, 3.0, 1.0]
npts = 70
# set locations of points that fall within x,y
xc = -7.0 + 14.0 * np.random.random(npts)
yc = -7.0 + 14.0 * np.random.random(npts)
A = np.array([u,v]).T
# project these points onto the plane
P = A @ np.linalg.inv(A.T @ A) @ A.T
coords = P @ np.array([xc,yc,np.zeros(npts)])
coords[2] += np.random.normal(scale = 1, size = npts)
for i in range(coords.shape[-1]):
    fig.plotPoint(coords[0,i],coords[1,i], coords[2,i], 'r')
fig.set_title('Multiple Regression Fit to Data', 'Multiple Regression Fit to Data', size = 20)
fig.ax.set_zlabel('y')
fig.ax.set_xlabel('u')
fig.ax.set_ylabel('v')
fig.desc['xlabel'] = 'u'
fig.desc['ylabel'] = 'v'
fig.desc['zlabel'] = 'y'
fig.plotSpan(u, v, 'Green')
fig.ax.view_init(azim=30, elev = 30)
fig.save();

This example shows that the linear model for multiple regression has the same abstract form as the model for simple regression in the earlier examples.

We can see that the general principle is the same across all the different kinds of linear models.

Once $X$ is defined properly, the normal equations for $\beta$ have the same matrix form, no matter how many variables are involved.

Thus, for any linear model where $X^TX$ is invertible, the least-squares $\hat{\beta}$ is given by $(X^TX)^{-1}X^Ty$.

Theorem. (Reminder from last time) Let $X$ be an $m \times n$ matrix. The following statements are equivalent:

  1. The equation $X\beta = y$ has a unique least-squares solution for each $y$ in $\mathbb{R}^m$.
  2. The columns of $X$ are linearly independent.
  3. The matrix $X^TX$ is invertible.
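
In code, condition 2 (and hence the invertibility of $X^TX$) can be checked with a quick rank computation before solving; here is a sketch using the line data from earlier cells (the variable names are just for illustration):

# sketch: re-uses the (xlin, ylin) data from earlier cells
X_check = np.column_stack([np.ones(len(xlin)), xlin])
if np.linalg.matrix_rank(X_check) == X_check.shape[1]:
    # columns are linearly independent, so X^T X is invertible and the solution is unique
    beta_check = np.linalg.solve(X_check.T @ X_check, X_check.T @ ylin)
else:
    # columns are linearly dependent: fall back to a pseudoinverse-based least-squares solution
    beta_check = np.linalg.lstsq(X_check, ylin, rcond=None)[0]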

Multiple Regression in Practice¶

Let's see how powerful multiple regression can be on a real-world example.

A typical application of linear models is predicting house prices. Linear models have been used for this problem for decades, and when a municipality does a value assessment on your house, they typically use a linear model.

We can consider various measurable attributes of a house (its "features") as the independent variables, and the most recent sale price of the house as the dependent variable.

For our case study, we will use the features:

  • Lot Area (sq ft),
  • Gross Living Area (sq ft),
  • Number of Fireplaces,
  • Number of Full Baths,
  • Number of Half Baths,
  • Garage Area (sq ft),
  • Basement Area (sq ft)

So our design matrix will have 8 columns (including the constant for the intercept):

$$X\beta = y$$

and it will have one row for each house in the data set, with yy the sale price of the house.

We will use data from housing sales in Ames, Iowa from 2006 to 2009:

In [3]:
# read and parse the raw data
df = pd.read_csv('data/ames-housing-data/train.csv')

# split into the independent and dependent variables
X_no_intercept = df[['LotArea', 'GrLivArea', 'Fireplaces', 'FullBath', 'HalfBath', 'GarageArea', 'TotalBsmtSF']].values
y = df['SalePrice'].values
In [4]:
# run this line to visualize the data
df[['LotArea', 'GrLivArea', 'Fireplaces', 'FullBath', 'HalfBath', 'GarageArea', 'TotalBsmtSF', 'SalePrice']].head()
Out[4]:
LotArea GrLivArea Fireplaces FullBath HalfBath GarageArea TotalBsmtSF SalePrice
0 8450 1710 0 2 1 548 856 208500
1 9600 1262 1 2 0 460 1262 181500
2 11250 1786 1 2 1 608 920 223500
3 9550 1717 1 1 0 642 756 140000
4 14260 2198 1 2 1 836 1145 250000

Next we add a column of 1s to the design matrix, which adds a constant intercept to the model:

In [6]:
X = np.column_stack([np.ones(X_no_intercept.shape[0], dtype = 'int'), X_no_intercept])
In [7]:
# run this line to view the X matrix
X
Out[7]:
array([[    1,  8450,  1710, ...,     1,   548,   856],
       [    1,  9600,  1262, ...,     0,   460,  1262],
       [    1, 11250,  1786, ...,     1,   608,   920],
       ...,
       [    1,  9042,  2340, ...,     0,   252,  1152],
       [    1,  9717,  1078, ...,     0,   240,  1078],
       [    1,  9937,  1256, ...,     1,   276,  1256]])

Now let's perform the least-squares regression:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
In [8]:
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
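
(A side note, not from the original notebook: forming the explicit inverse works fine here, but an equivalent and generally more numerically stable way to compute the same $\hat{\beta}$ is to let a solver handle the normal equations or the least-squares problem directly.)

# sketch: equivalent alternatives that avoid forming (X^T X)^{-1} explicitly
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_hat_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]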

What does our model tell us?

In [9]:
 # run this line to view the model parameters
beta_hat
Out[9]:
array([-2.92338280e+04,  1.87444579e-01,  3.94185205e+01,  1.45698657e+04,
        2.29695596e+04,  1.62834807e+04,  9.14770980e+01,  5.11282216e+01])

We see that we have:

  • $\beta_0$: Intercept of -$29,233
  • $\beta_1$: Marginal value of one square foot of Lot Area: about $0.19
  • $\beta_2$: Marginal value of one square foot of Gross Living Area: $39
  • $\beta_3$: Marginal value of one additional fireplace: $14,570
  • $\beta_4$: Marginal value of one additional full bath: $22,970
  • $\beta_5$: Marginal value of one additional half bath: $16,283
  • $\beta_6$: Marginal value of one square foot of Garage Area: $91
  • $\beta_7$: Marginal value of one square foot of Basement Area: $51

Is our model doing a good job?

There are many statistics for testing this question, but we'll just look at the predictions versus the ground truth.

For each house we compute its predicted sale value according to our model:

$$\hat{y} = X\hat{\beta}$$
In [21]:
y_hat = X @ beta_hat

And for each house, we'll plot its predicted versus actual sale value:

In [26]:
# plot of (X, Y) samples and the linear regression based on y_hat
plt.figure()
plt.plot(y, y_hat, '.')
plt.xlabel('Actual Sale Price')
plt.ylabel('Predicted Sale Price')
plt.title('Linear Model Predictions of House Prices')
plt.plot([0, 500000], [0, 500000], '-', label='yhat = y')
plt.legend(loc = 'best');

We see that the model does a reasonable job for house values less than about $250,000.
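
To attach a number to this impression, here is a small sketch (not in the original notebook) computing two common summaries: the coefficient of determination $R^2$ and the root-mean-square error of the predictions.

# sketch: uses the y and y_hat arrays computed above
# R^2: fraction of the variance in sale prices explained by the model
ss_res = np.sum((y - y_hat)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r_squared = 1 - ss_res / ss_tot
# RMSE: typical size of the prediction error, in dollars
rmse = np.sqrt(np.mean((y - y_hat)**2))
print(r_squared, rmse)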

For a better model, we'd want to consider more features of each house, and perhaps some additional functions such as polynomials as components of our model.
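
For example, one could augment the design matrix with a quadratic term; here is a sketch (the choice of squaring GrLivArea is purely illustrative, not a claim about which feature helps most):

# sketch: add a squared living-area column to the existing design matrix
grlivarea = df['GrLivArea'].values
X_poly = np.column_stack([X, grlivarea**2])
beta_poly = np.linalg.lstsq(X_poly, y, rcond=None)[0]
y_hat_poly = X_poly @ beta_poly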
