#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
HW 3 is due Friday (2/17) at 8pm
Upcoming office hours:
Read Tan, Steinbach, Karpatne, Kumar Chapter 3.3
The supervised learning problem in general is:
You are given some example data, which we'll think of abstractly as tuples $(\mathbf{x}, y)$.
Your goal is to learn a rule that allows you to predict $y$ for some $\mathbf{x}$ that is not in the example data you were given.
When the value $y$ we want to predict is continuous, we call this regression.
The models we looked at are polynomials. They are of the form:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k,$$
where $k$ is the order of the polynomial.
Choosing $k$ is called model selection, and $w_0, w_1, \dots, w_k$ are called the parameters of the model.
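The plotting cells below use the helpers `design_matrix` and `fit_poly`, along with the data `x`, `y`, a dense plotting grid `cx`, the sample count `N`, and fitted weights `w_hat_0`, `w_hat_1`, `w_hat_3`, all defined in earlier cells of the notebook. Here is a minimal sketch of how those pieces might be set up; the random seed and data sizes are illustrative assumptions, not the notebook's originals.
#
import numpy as np
from numpy.random import default_rng

def design_matrix(x, k):
    """Return the matrix whose columns are x**0, x**1, ..., x**k."""
    return np.vander(np.asarray(x), k + 1, increasing=True)

def fit_poly(x, y, k):
    """Least-squares fit of an order-k polynomial; returns the weight vector."""
    w, *_ = np.linalg.lstsq(design_matrix(x, k), y, rcond=None)
    return w

# Illustrative data: N noisy samples of sin(2*pi*x) on [0, 1]
N = 10
x = np.sort(default_rng(4).uniform(size=N))
y = np.sin(2 * np.pi * x) + default_rng(4).normal(size=N, scale=0.20)
cx = np.linspace(0, 1, 1000)                 # dense grid for plotting fitted curves
w_hat_0 = fit_poly(x, y, 0).item()           # constant model: a single scalar weight
w_hat_1 = fit_poly(x, y, 1)
w_hat_3 = fit_poly(x, y, 3)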
#
fig, axs = plt.subplots(1, 3, sharey = True, figsize = (12, 5))
#
cy = 1000 * [w_hat_0]
pred_y = N * [w_hat_0]
axs[0].plot(cx, cy, lw = 2, label = r'$k$ = 0')
axs[0].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[0].set_xlabel('x', size = 16)
axs[0].set_ylabel('y', size = 16)
axs[0].set_title(r'$k$ = 0, constant' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[0].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 1) @ w_hat_1
pred_y = design_matrix(x, 1) @ w_hat_1
axs[1].plot(cx, cy, lw = 2, label = r'$k$ = 1')
axs[1].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[1].set_xlabel('x', size = 16)
axs[1].set_title(r'$k$ = 1, linear' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[1].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 3) @ w_hat_3
pred_y = design_matrix(x, 3) @ w_hat_3
axs[2].plot(cx, cy, lw = 2, label = r'$k$ = 3')
axs[2].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[2].set_xlabel('x', size = 16)
axs[2].set_title('$k$ = 3, cubic' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[2].legend(loc = 'best', fontsize = 16)
#
fig.tight_layout();
If we overfit our model to our training data, this leads to poor generalizability. To avoid overfitting, we might:
Increase the amount of training data.
Limit the complexity of the model.
#
w_hat_9 = fit_poly(x, y, 9)
cy = design_matrix(cx, 9) @ w_hat_9
plt.plot(cx, cy, lw = 2, label = r'$k$ = 9')
plt.plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
plt.xlabel('x', size = 16)
plt.ylabel('y', size = 16)
plt.title(r'$k$ = 9' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(0));
We can hold out data to test our model selection. For example, we might do the following:
#
test_y = np.sin(2 * np.pi * x) + default_rng(8).normal(size = N, scale = 0.20)
max_k = N
train_err = [np.linalg.norm(y - N * [w_hat_0])]
test_err = [np.linalg.norm(test_y - N * [w_hat_0])]
for k in range(1, max_k):
w_hat = fit_poly(x, y, k)
pred_y = design_matrix(x, k) @ w_hat
train_err.append(np.linalg.norm(y - pred_y))
test_err.append(np.linalg.norm(test_y - pred_y))
plt.plot(range(max_k), test_err, 'ro-', label = 'Testing Error')
plt.plot(range(max_k), train_err, 'bo-', label = 'Training Error')
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'$E(\mathbf{w}^*)$')
plt.legend(loc = 'best');
In cross-validation, the data is partitioned once, and then each partition is used as the test data once.
This ensures that all the data gets equal weight in the training and in the testing.
We divide the data into "folds".
#
display(Image("images/22-k-fold.png", width=500))
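As a concrete illustration (not from the original notebook), $k$-fold cross-validation over the polynomial order could be run with scikit-learn's `KFold`, reusing the illustrative data `x`, `y` and the `fit_poly` / `design_matrix` helpers sketched earlier.
#
import numpy as np
from sklearn.model_selection import KFold

def cv_error(x, y, k, n_folds=5):
    """Average held-out error of an order-k polynomial over n_folds folds."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    errs = []
    for train_idx, test_idx in kf.split(x.reshape(-1, 1)):
        w = fit_poly(x[train_idx], y[train_idx], k)
        pred = design_matrix(x[test_idx], k) @ w
        errs.append(np.linalg.norm(y[test_idx] - pred))
    return np.mean(errs)

# Pick the order with the smallest cross-validated error (over a small range of k)
best_k = min(range(6), key=lambda k: cv_error(x, y, k))
print(best_k)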
Next, we start our study of classification methods.
Recall that in a classification problem, we have data tuples $(\mathbf{x}, y)$ in which the $\mathbf{x}$ are the features, and the $y$ values are categorical (i.e., discrete).
We typically call the values "labels" or "classes."
Some examples of classification tasks: deciding whether an email is spam, diagnosing whether a tumor is benign or malignant, or predicting whether a borrower will default on a loan.
The first classification method we will consider is called the decision tree. It is a very popular method, and has some nice properties like interpretability, as we will see.
Next time, we will discuss another classification method called $k$-nearest neighbors.
We will start by describing how a decision tree works.
We are assuming a decision tree has been built to solve the following classification problem:
Given an individual's Tax Refund Status, Marital Status, and Taxable Income, predict whether they will repay a loan.
#
display(Image("images/22-DT-Example-1.png", width=800))
To classify a new record, we step through the tree, making a decision at each node that takes us to another node in the tree.
Each decision examines a single feature in the item being classified.
#
display(Image("images/22-DT-Example-2.png", width=800))
#
display(Image("images/22-DT-Example-3.png", width=800))
#
display(Image("images/22-DT-Example-4.png", width=800))
#
display(Image("images/22-DT-Example-5.png", width=800))
#
display(Image("images/22-DT-Example-6.png", width=800))
We conclude that this record's value of "Not Repay" is "No" -- that is, the tree predicts that this individual will repay the loan.
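To make the walk-through concrete, here is a hypothetical rendering of such a tree as nested conditionals in code. The attribute order and the income threshold below are illustrative (the actual splits are the ones shown in the figures), but the structure shows how each decision examines a single feature of the record.
#
def predict_not_repay(refund, marital_status, taxable_income):
    """Hypothetical tree: return 'Yes' if predicted NOT to repay, else 'No'."""
    if refund == "Yes":
        return "No"                      # leaf: predicted to repay
    if marital_status == "Married":
        return "No"                      # leaf: predicted to repay
    # Single or Divorced: fall through to a test on taxable income
    return "No" if taxable_income < 80_000 else "Yes"

print(predict_not_repay(refund="No", marital_status="Married", taxable_income=120_000))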
Note also that decision trees can be used to predict numeric values, so they are used for regression as well.
The general term "Classification and Regression Tree" (CART) is sometimes used -- although this term also refers to a specific decision tree learning algorithm.
#
display(Image("images/22-DT-Overview.png", width=800))
We've discussed how to apply a decision tree to data (lower portion of this figure).
But how does one train a decision tree? What algorithm can we use?
A number of algorithms have been proposed for building decision trees, including Hunt's Algorithm, CART, ID3, and C4.5.
We will look at Hunt's Algorithm as our example.
Hunt's Algorithm builds the tree node by node, starting from the root.
As we build the tree, we divide the training data up.
Let $D_t$ be the set of training records that reach node $t$.
If all of the records in $D_t$ belong to the same class, then $t$ is a leaf node labeled with that class.
Otherwise, choose an attribute test condition to partition $D_t$ into smaller subsets, and create a child node for each outcome of the test.
Recursively apply the above procedure until a stopping criterion is met.
#
display(Image("images/22-DT-Data-Example.png", width=500))
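Here is a compact sketch of the idea behind Hunt's Algorithm. The function names and record representation are mine, and the split is chosen greedily to minimize the number of misclassified training records; the impurity measures introduced later in these notes could be substituted as the criterion.
#
# Sketch of Hunt's Algorithm: recursively split the set of records D_t at node t.
# Records are dicts of attribute values plus a 'label'. Splits are binary
# equality tests on one categorical attribute, chosen greedily.
from collections import Counter

def misclassified(records):
    """Number of records not in the majority class (0 for an empty or pure set)."""
    counts = Counter(r["label"] for r in records)
    return len(records) - max(counts.values(), default=0)

def best_split(records, attributes):
    best = None
    for a in attributes:
        for v in {r[a] for r in records}:
            left = [r for r in records if r[a] == v]
            right = [r for r in records if r[a] != v]
            if not left or not right:
                continue
            score = misclassified(left) + misclassified(right)
            if best is None or score < best[0]:
                best = (score, a, v, left, right)
    return best

def hunt(records, attributes):
    """Return a nested-dict tree; leaves are class labels."""
    labels = {r["label"] for r in records}
    if len(labels) == 1:                       # stopping criterion: node is pure
        return labels.pop()
    split = best_split(records, attributes)
    if split is None:                          # no split separates the records
        return Counter(r["label"] for r in records).most_common(1)[0][0]
    _, a, v, left, right = split
    return {f"{a} == {v!r}": hunt(left, attributes),
            f"{a} != {v!r}": hunt(right, attributes)}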
So as Hunt's algorithm progresses, records are divided up and assigned to leaf nodes. The most challenging step in Hunt's algorithm is deciding which leaf node to split next, and how to specify the split.
In general, we use a greedy strategy. We will split a set of records based on an attribute test that optimizes some criterion.
We choose the set to split that provides the largest improvement in our criterion.
So we need to determine: (1) how to specify the attribute test condition, and (2) how to determine the best split.
How we specify a test condition depends on the attribute type: nominal, ordinal, or continuous.
It also depends on the number of ways to split: multi-way or binary.
For a Nominal (or categorical) attribute, we can use a multi-way split (one partition per distinct value) or a binary split (grouping the values into two subsets):
#
display(Image("images/22-split-1.png", width=300))
#
display(Image("images/22-split-2.png", width=600))
An Ordinal attribute is similar to a Nominal attribute, except that since its values have an ordering, we can specify the test in terms of a threshold. This simplifies the search for a good partition.
#
display(Image("images/22-split-3.png", width=600))
A Continuous attribute can be discretized in two ways:
#
display(Image("images/22-split-4.png", width=400))
#
display(Image("images/22-split-5.png", width=200))
Note that finding good partitions for nominal attributes can be expensive, possibly involving combinatorial searching of groupings.
However for ordinal or continuous attributes, sweeping through a range of threshold values can be more efficient.
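For example, a threshold for a continuous attribute can be found by sorting the values and evaluating every candidate split point. The sketch below (function names and toy data are illustrative) scores each binary split $A < v$ versus $A \geq v$ with the Gini measure defined in the next section; in practice the class counts can be updated incrementally while sweeping, rather than recomputed for each candidate.
#
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_threshold(values, labels):
    """Return (threshold, weighted child Gini) for the best split A < v vs. A >= v."""
    values, labels = np.asarray(values), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    # candidate thresholds: midpoints between consecutive sorted values
    candidates = np.unique((values[:-1] + values[1:]) / 2)
    best = (None, np.inf)
    for v in candidates:
        left, right = labels[values < v], labels[values >= v]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (v, score)
    return best

# Toy data: an income-like attribute and a Yes/No class label
print(best_threshold([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                     ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]))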
Our algorithmic strategy is to split record sets so as to improve classification accuracy on the training data.
When is classification accuracy maximized?
When each leaf node contains records that all have the same label.
Consider the following example:
At some node in the tree, before splitting that node's records, we have a mix of records from two classes (shown in the figure below).
There are three attributes we can split on. Which split is best?
#
display(Image("images/22-split-6.png", width=700))
The "Car Type" attribute yields more homogeneous splits.
Using the "Student ID" attribute yields perfect accuracy on the training data.
... but what would it do on the test data? Using the "Student ID" attribute would lead to overfitting.
We need a measure of homogeneity:
#
display(Image("images/22-split-7.png", width=700))
A number of measures have been proposed, including Entropy, the Gini coefficient, and misclassification error.
Let's look at Entropy and the Gini coefficient. We can think of the set of records at each node as defining a distribution over the classes.
For a given node $t$, we will use $p_i$ to denote the relative frequency of class $i$ among the records at node $t$.
Definition. Entropy is defined as
$$\text{Entropy}(t) = -\sum_{i} p_i \log_2 p_i,$$
with the convention that $0 \log_2 0 = 0$.
For a two-class problem, Entropy spans $[0, 1]$.
The Gini coefficient is also a common measure:
Definition. The Gini coefficient is defined as
$$\text{Gini}(t) = 1 - \sum_{i} p_i^2.$$
For a two-class problem, the Gini coefficient spans $[0, 0.5]$.
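A small sketch of both measures, computed from the per-class record counts at a node (the function names are mine, not the notebook's):
#
import numpy as np

def entropy(counts):
    """Entropy of a node, given per-class record counts."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                       # by convention, 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini coefficient of a node, given per-class record counts."""
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

print(entropy([10, 0]), gini([10, 0]))   # pure node: minimum impurity (0 and 0)
print(entropy([5, 5]),  gini([5, 5]))    # even two-class split: maximum impurity (1.0 and 0.5)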
These measures give us a sense of "balance" in a distribution: they are minimized (zero) when all records at the node belong to a single class, and maximized when the records are spread evenly across the classes.
As nodes are split in building the tree, we will see decreasing Gini and Entropy scores.
#
display(Image("images/11-entropy-example.png", width=800))
With this measure of impurity, we can decide which split will yield the most information gain.
Definition. Information gain is defined as
$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j),$$
where $I(\cdot)$ is the impurity measure (Entropy or Gini), $I(\text{parent})$ is the impurity at the current node, $N$ is the number of records at that node, $k$ is the number of children created by the split, and $N(v_j)$ is the number of records assigned to child $v_j$.
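The summation above is just the average impurity of the children, weighted by how many records each receives. A small sketch using Gini as the impurity measure $I$; the counts used in the example call are purely illustrative.
#
import numpy as np

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

def information_gain(parent_counts, children_counts):
    """Gain = I(parent) - sum_j (N(v_j) / N) * I(v_j), using Gini as I."""
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * gini(c) for c in children_counts)
    return gini(parent_counts) - weighted

# Illustrative only: a parent with a 10/10 class mix, split three ways
print(information_gain([10, 10], [[4, 1], [2, 7], [4, 2]]))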
#
display(Image("images/11-gini-example.png", width=800))
Remember that $\text{Gini}(t) = 1 - \sum_i p_i^2$.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Car Type attribute using multiway split.
(a) Gini =
(b) Gini = 0 because each CustomerID is unique.
(c) The Gini index for each Car Type value:
Family car =
Sports car =
Luxury car =
Gini(Car Type) = (the weighted average of the three values above)
The decision tree can be built, node by node, by splitting nodes in a way that steadily improves the overall Gini score.
Clearly, we could keep on splitting nodes until every leaf node is perfectly pure. In the extreme, each leaf node could contain just a single training record - a fully-grown tree.
This would certainly be overfitting!
There are two strategies that can be used to control the complexity of a decision tree:
Early Stopping
Do not expand the current node, even though it still has multiple records.
Default stopping conditions include:
Stop if all the records at the node have the same class label.
Stop if all the records at the node have identical attribute values (no split can separate them).
More restrictive stopping conditions include:
Stop if the number of records at the node falls below a minimum threshold.
Stop if the tree has reached a maximum depth.
Stop if splitting the node would improve the impurity measure by less than some threshold.
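In practice these controls show up as hyperparameters of library implementations. For example, scikit-learn's `DecisionTreeClassifier` exposes the following knobs; the data below is random and purely illustrative.
#
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # illustrative random features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # illustrative labels

clf = DecisionTreeClassifier(
    criterion="gini",            # impurity measure (or "entropy")
    max_depth=3,                 # stop growing below this depth
    min_samples_leaf=5,          # each leaf must contain at least 5 records
    min_impurity_decrease=0.01,  # require at least this much impurity reduction
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())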
Post-pruning
Suppose we can assess generalization error, for example by using held-out data.
Then:
Grow the tree fully on the training data.
Working bottom-up, consider replacing each subtree with a leaf node labeled with the majority class of its training records; keep the replacement if it does not increase the generalization error.
Stop when no further trimming reduces the generalization error.
Note that this is computationally more demanding than early-stopping.
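Reduced-error pruning as sketched above requires held-out data. scikit-learn implements a related post-pruning scheme, cost-complexity pruning, controlled by `ccp_alpha`; a sketch (again with random, illustrative data):
#
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                  # illustrative data
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning levels from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.clip(path.ccp_alphas, 0, None)

# Keep the pruned tree that does best on the held-out validation data
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))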
Decision Trees are a very popular technique.
They have a number of advantages:
They are inexpensive to construct, and extremely fast at classifying new records.
They are easy to interpret, especially for small trees.
They can handle both categorical and continuous attributes.
They are relatively robust to irrelevant or redundant attributes.
We will explore interpretation of decision trees in more detail next time.