#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
We introduced regression, which is constructing an equation that describes the relationships among variables.
For example, we may look at these points and observe that they approximately lie on a line.
So we could decide to model this data using a line.
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0] + (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize=6);
We may look at these points and decide to model them using a quadratic function.
#
ax = ut.plotSetup(-10, 10, -10, 20, size = (7, 7))
ut.centerAxes(ax)
quad = np.array([1, 3, 0.5])
xquad = -10.0 + 20.0 * np.random.random(100)
yquad = quad[0] + (quad[1] * xquad) + (quad[2] * xquad * xquad) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xquad, yquad, 'ro', markersize=6);
And we may look at these points and decide to model them using a logarithmic function.
#
ax = ut.plotSetup(-10, 10, -10, 15, size = (7, 7))
ut.centerAxes(ax)
log = np.array([1, 4])
xlog = 10.0 * np.random.random(100)
ylog = log[0] + log[1] * np.log(xlog) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlog, ylog, 'ro', markersize=6);
The regression problem has been studied extensively in the field of statistics and machine learning. We are working with:
The basic regression task is:
The independent variables are collected into a matrix which is called the design matrix.
The dependent variables are collected into an observation vector
The parameters of the model (for any kind of model) are collected into a parameter vector
#
ax = ut.plotSetup(-10, 10, -10, 10, size = (7, 7))
ut.centerAxes(ax)
line = np.array([1, 0.5])
xlin = -10.0 + 20.0 * np.random.random(100)
ylin = line[0] + (line[1] * xlin) + np.random.normal(scale = 1.5, size = 100)
ax.plot(xlin, ylin, 'ro', markersize = 6)
ax.plot(xlin, line[0] + line[1] *xlin, 'b-')
plt.text(-9, 3, r'$y = \beta_0 + \beta_1x$', size=20);
We call the observed value of
and we call the predicted -value.
The difference between an observed -value and a predicted -value is called a residual.
# image credit: Lay, LAA, 4th edition
display(Image("images/Lay-fig-6-6-1.jpg", width=550))
Let's imagine for a moment that the data fit a line perfectly.
Then, if each of the data points happened to fall exactly on the line, the parameters and would satisfy the equations
We can write this system as
where
If the data points don't actually lie exactly on a line,
... then there are no parameters for which the predicted -values in equal the observed -values in ,
... and has no solution.
Now, to obtain the least-squares line, find the least-squares solution to
We do this by solving the normal equation that we learned last lecture (just with new notation):
Another way that the inconsistent linear system is often written is to collect all the residuals into a residual vector.
Then an exact equation is
Any equation of this form is referred to as a linear model.
In this formulation, the goal is to minimize the length of , ie,
In some cases, one would like to fit data points with something other than a straight line.
For example, think of Gauss trying to find the equation for the orbit of Ceres.
In cases like this, the matrix equation is still , but the specific form of changes from one problem to the next.
The least-squares solution is a solution of the normal equations
Most models have parameters, and the object of model fitting is to to fix those parameters. Let's talk about model parameters.
In model fitting, the parameters are the unknown. A central question for us is whether the model is linear in its parameters.
For example, the model is not linear in its parameters.
The model is linear in its parameters.
For a model that is linear in its parameters, an observation is a linear combination of (arbitrary) known functions.
In other words, a model that is linear in its parameters is
where are known functions and are parameters.
Example. Suppose data points appear to lie along some sort of parabola instead of a straight line. Suppose we wish to approximate the data by an equation of the form
Describe the linear model that produces a "least squares fit" of the data by the equation.
Solution. The ideal relationship is
Suppose the actual values of the parameters are Then the coordinates of the first data point satisfy the equation
where is the residual error between the observed value and the predicted -value.
Each data point determines a similar equation:
Clearly, this system can be written as
#
m = np.shape(xquad)[0]
X = np.array([np.ones(m), xquad,xquad**2]).T
beta = np.linalg.inv(X.T @ X) @ X.T @ yquad
#
ax = ut.plotSetup(-10, 10, -10, 20, size = (7, 7))
ut.centerAxes(ax)
xplot = np.linspace(-10, 10, 50)
yestplot = beta[0] + beta[1] * xplot + beta[2] * xplot**2
ax.plot(xplot, yestplot, 'b-', lw=2)
ax.plot(xquad, yquad, 'ro', markersize = 8);
#
m = np.shape(xlog)[0]
X = np.array([np.ones(m), np.log(xlog)]).T
beta = np.linalg.inv(X.T @ X) @ X.T @ ylog
#
ax = ut.plotSetup(-10, 10, -10, 15, size = (7, 7))
ut.centerAxes(ax)
xplot = np.linspace(0.1, 10, 50)
yestplot = beta[0] + beta[1] * np.log(xplot)
ax.plot(xplot, yestplot, 'b-', lw=2)
ax.plot(xlog, ylog, 'ro', markersize = 8);
Suppose an experiment involves two independent variables -- say, and , -- and one dependent variable, .
A linear equation for predicting from and has the form
Since there is more than one independent variable, this is called multiple regression.
A more general prediction equation might have the form
A least squares fit to equations like this is called a trend surface.
In general, a linear model will arise whenever is to be predicted by an equation of the form
with any sort of known functions and unknown weights.
Example. In geography, local models of terrain are constructed from data where , and are latitude, longitude, and altitude, respectively.
Let's take an example. Here are a set of points in :
#
fig = ut.three_d_figure((14, 1), 'multivariate regression example',
-7, 7, -7, 7, -8, 8,
equalAxes = False, figsize = (7, 7), qr = qr_setting)
np.random.seed(6)
v = [4.0, 4.0, 2.0]
u = [-4.0, 3.0, 1.0]
npts = 70
# set locations of points that fall within x,y
xc = -7.0 + 14.0 * np.random.random(npts)
yc = -7.0 + 14.0 * np.random.random(npts)
A = np.array([u,v]).T
# project these points onto the plane
P = A @ np.linalg.inv(A.T @ A) @ A.T
coords = P @ np.array([xc,yc,np.zeros(npts)])
coords[2] += np.random.normal(scale = 1, size = npts)
for i in range(coords.shape[-1]):
fig.plotPoint(coords[0,i],coords[1,i], coords[2,i], 'r')
fig.set_title('Terrain Data for Multiple Regression', 'Terrain Data for Multiple Regression', size = 20)
fig.ax.set_zlabel('y')
fig.ax.set_xlabel('u')
fig.ax.set_ylabel('v')
fig.desc['xlabel'] = 'u'
fig.desc['ylabel'] = 'v'
fig.desc['zlabel'] = 'y'
fig.ax.view_init(azim=30, elev = 15)
fig.save();
Let's describe the linear models that gives a least-squares fit to such data. The solution is called the least-squares plane.
Solution. We expect the data to satisfy these equations:
This system has the matrix for where
#
fig = ut.three_d_figure((14, 2), 'multivariate regression example with fitted plane',
-7, 7, -7, 7, -8, 8,
equalAxes = False, figsize = (7, 7), qr = qr_setting)
np.random.seed(6)
v = [4.0, 4.0, 2.0]
u = [-4.0, 3.0, 1.0]
npts = 70
# set locations of points that fall within x,y
xc = -7.0 + 14.0 * np.random.random(npts)
yc = -7.0 + 14.0 * np.random.random(npts)
A = np.array([u,v]).T
# project these points onto the plane
P = A @ np.linalg.inv(A.T @ A) @ A.T
coords = P @ np.array([xc,yc,np.zeros(npts)])
coords[2] += np.random.normal(scale = 1, size = npts)
for i in range(coords.shape[-1]):
fig.plotPoint(coords[0,i],coords[1,i], coords[2,i], 'r')
fig.set_title('Multiple Regression Fit to Data', 'Multiple Regression Fit to Data', size = 20)
fig.ax.set_zlabel('y')
fig.ax.set_xlabel('u')
fig.ax.set_ylabel('v')
fig.desc['xlabel'] = 'u'
fig.desc['ylabel'] = 'v'
fig.desc['zlabel'] = 'y'
fig.plotSpan(u, v, 'Green')
fig.ax.view_init(azim=30, elev = 30)
fig.save();
This example shows that the linear model for multiple regression has the same abstract form as the model for the simple regression in the earlier examples.
We can see that there the general principle is the same across all the different kinds of linear models.
Once is defined properly, the normal equations for have the same matrix form, no matter how many variables are involved.
Thus, for any linear model where is invertible, the least squares is given by .
Theorem. (Reminder from last time) Let be an matrix. The following statements are equivalent:
Let's see how powerful multiple regression can be on a real-world example.
A typical application of linear models is predicting house prices. Linear models have been used for this problem for decades, and when a municipality does a value assessment on your house, they typically use a linear model.
We can consider various measurable attributes of a house (its "features") as the independent variables, and the most recent sale price of the house as the dependent variable.
For our case study, we will use the features:
So our design matrix will have 8 columns (including the constant for the intercept):
and it will have one row for each house in the data set, with the sale price of the house.
We will use data from housing sales in Ames, Iowa from 2006 to 2009:
# read and parse the raw data
df = pd.read_csv('data/ames-housing-data/train.csv')
# split into the independent and dependent variables
X_no_intercept = df[['LotArea', 'GrLivArea', 'Fireplaces', 'FullBath', 'HalfBath', 'GarageArea', 'TotalBsmtSF']].values
y = df['SalePrice'].values
# run this line to visualize the data
df[['LotArea', 'GrLivArea', 'Fireplaces', 'FullBath', 'HalfBath', 'GarageArea', 'TotalBsmtSF', 'SalePrice']].head()
LotArea | GrLivArea | Fireplaces | FullBath | HalfBath | GarageArea | TotalBsmtSF | SalePrice | |
---|---|---|---|---|---|---|---|---|
0 | 8450 | 1710 | 0 | 2 | 1 | 548 | 856 | 208500 |
1 | 9600 | 1262 | 1 | 2 | 0 | 460 | 1262 | 181500 |
2 | 11250 | 1786 | 1 | 2 | 1 | 608 | 920 | 223500 |
3 | 9550 | 1717 | 1 | 1 | 0 | 642 | 756 | 140000 |
4 | 14260 | 2198 | 1 | 2 | 1 | 836 | 1145 | 250000 |
Next we add a column of 1s to the design matrix, which adds a constant intercept to the model:
X = np.column_stack([np.ones(X_no_intercept.shape[0], dtype = 'int'), X_no_intercept])
# run this line to view the X matrix
X
array([[ 1, 8450, 1710, ..., 1, 548, 856], [ 1, 9600, 1262, ..., 0, 460, 1262], [ 1, 11250, 1786, ..., 1, 608, 920], ..., [ 1, 9042, 2340, ..., 0, 252, 1152], [ 1, 9717, 1078, ..., 0, 240, 1078], [ 1, 9937, 1256, ..., 1, 276, 1256]])
Now let's peform the least-squares regression:
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
What does our model tell us?
# run this line to view the model parameters
beta_hat
array([-2.92338280e+04, 1.87444579e-01, 3.94185205e+01, 1.45698657e+04, 2.29695596e+04, 1.62834807e+04, 9.14770980e+01, 5.11282216e+01])
We see that we have:
Is our model doing a good job?
There are many statistics for testing this question, but we'll just look at the predictions versus the ground truth.
For each house we compute its predicted sale value according to our model:
y_hat = X @ beta_hat
And for each house, we'll plot its predicted versus actual sale value:
# plot of (X, Y) samples and the linear regression based on y_hat
plt.figure()
plt.plot(y, y_hat, '.')
plt.xlabel('Actual Sale Price')
plt.ylabel('Predicted Sale Price')
plt.title('Linear Model Predictions of House Prices')
plt.plot([0, 500000], [0, 500000], '-', label='yhat = y')
plt.legend(loc = 'best');
We see that the model does a reasonable job for house values less than about $250,000.
For a better model, we'd want to consider more features of each house, and perhaps some additional functions such as polynomials as components of our model.