#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
HW 3 is due Friday (2/17) at 8pm
Upcoming office hours:
Read Tan, Steinbach, Karpatne, Kumar Chapter 3.3
The supervised learning problem in general is:
You are given some example data, which we'll think of abstractly as tuples $(\mathbf{x}, y)$.
Your goal is to learn a rule that allows you to predict $y$ for some $\mathbf{x}$ that is not in the example data you were given.
When the value $y$ we want to predict is continuous, we call this regression.
The models we looked at are polynomials. They are of the form:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k,$$
where $k$ is the order of the polynomial.
Choosing $k$ is called model selection, and $w_0, w_1, \dots, w_k$ are called the parameters of the model.
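The plotting cells below use the helpers `design_matrix` and `fit_poly`, along with the data `x`, `y`, a dense plotting grid `cx`, the sample count `N`, and fitted weights `w_hat_0`, `w_hat_1`, `w_hat_3`, all defined in earlier cells of the notebook. Here is a minimal sketch of how those pieces might be set up; the random seed and data sizes are illustrative assumptions, not the notebook's originals.
#
import numpy as np
from numpy.random import default_rng

def design_matrix(x, k):
    """Return the matrix whose columns are x**0, x**1, ..., x**k."""
    return np.vander(np.asarray(x), k + 1, increasing=True)

def fit_poly(x, y, k):
    """Least-squares fit of an order-k polynomial; returns the weight vector."""
    w, *_ = np.linalg.lstsq(design_matrix(x, k), y, rcond=None)
    return w

# Illustrative data: N noisy samples of sin(2*pi*x) on [0, 1]
N = 10
x = np.sort(default_rng(4).uniform(size=N))
y = np.sin(2 * np.pi * x) + default_rng(4).normal(size=N, scale=0.20)
cx = np.linspace(0, 1, 1000)                 # dense grid for plotting fitted curves
w_hat_0 = fit_poly(x, y, 0).item()           # constant model: a single scalar weight
w_hat_1 = fit_poly(x, y, 1)
w_hat_3 = fit_poly(x, y, 3)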
#
fig, axs = plt.subplots(1, 3, sharey = True, figsize = (12, 5))
#
cy = 1000 * [w_hat_0]
pred_y = N * [w_hat_0]
axs[0].plot(cx, cy, lw = 2, label = r'$k$ = 0')
axs[0].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[0].set_xlabel('x', size = 16)
axs[0].set_ylabel('y', size = 16)
axs[0].set_title(r'$k$ = 0, constant' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[0].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 1) @ w_hat_1
pred_y = design_matrix(x, 1) @ w_hat_1
axs[1].plot(cx, cy, lw = 2, label = r'$k$ = 1')
axs[1].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[1].set_xlabel('x', size = 16)
axs[1].set_title(r'$k$ = 1, linear' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[1].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 3) @ w_hat_3
pred_y = design_matrix(x, 3) @ w_hat_3
axs[2].plot(cx, cy, lw = 2, label = r'$k$ = 3')
axs[2].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[2].set_xlabel('x', size = 16)
axs[2].set_title('$k$ = 3, cubic' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[2].legend(loc = 'best', fontsize = 16)
#
fig.tight_layout();
If we overfit our model to our training data, this leads to poor generalizability. To avoid overfitting, we might:
Increase the amount of training data.
Limit the complexity of the model.
#
w_hat_9 = fit_poly(x, y, 9)
cy = design_matrix(cx, 9) @ w_hat_9
plt.plot(cx, cy, lw = 2, label = r'$k$ = 9')
plt.plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
plt.xlabel('x', size = 16)
plt.ylabel('y', size = 16)
plt.title(r'$k$ = 9' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(0));
We can hold out data to test our model selection. For example, we might do the following:
#
test_y = np.sin(2 * np.pi * x) + default_rng(8).normal(size = N, scale = 0.20)
max_k = N
train_err = [np.linalg.norm(y - N * [w_hat_0])]
test_err = [np.linalg.norm(test_y - N * [w_hat_0])]
for k in range(1, max_k):
w_hat = fit_poly(x, y, k)
pred_y = design_matrix(x, k) @ w_hat
train_err.append(np.linalg.norm(y - pred_y))
test_err.append(np.linalg.norm(test_y - pred_y))
plt.plot(range(max_k), test_err, 'ro-', label = 'Testing Error')
plt.plot(range(max_k), train_err, 'bo-', label = 'Training Error')
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'$E(\mathbf{w}^*)$')
plt.legend(loc = 'best');
In cross-validation, the data is partitioned once, and then each partition is used as the test data once.
This ensures that all the data gets equal weight in the training and in the testing.
We divide the data into "folds".
#
display(Image("images/22-k-fold.png", width=500))
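As a concrete illustration (not from the original notebook), $k$-fold cross-validation over the polynomial order could be run with scikit-learn's `KFold`, reusing the illustrative data `x`, `y` and the `fit_poly` / `design_matrix` helpers sketched earlier.
#
import numpy as np
from sklearn.model_selection import KFold

def cv_error(x, y, k, n_folds=5):
    """Average held-out error of an order-k polynomial over n_folds folds."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    errs = []
    for train_idx, test_idx in kf.split(x.reshape(-1, 1)):
        w = fit_poly(x[train_idx], y[train_idx], k)
        pred = design_matrix(x[test_idx], k) @ w
        errs.append(np.linalg.norm(y[test_idx] - pred))
    return np.mean(errs)

# Pick the order with the smallest cross-validated error (over a small range of k)
best_k = min(range(6), key=lambda k: cv_error(x, y, k))
print(best_k)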
Next, we start our study of classification methods.
Recall that in a classification problem, we have data tuples $(\mathbf{x}, y)$ in which the $\mathbf{x}$ are the features, and the $y$ values are categorical (i.e., discrete).
We typically call the values "labels" or "classes."
Some examples of classification tasks: deciding whether an email is spam, diagnosing whether a tumor is benign or malignant, or predicting whether a borrower will default on a loan.
The first classification method we will consider is called the decision tree. It is a very popular method, and has some nice properties like interpretability, as we will see.
Next time, we will discuss another classification method called $k$-nearest neighbors.
We will start by describing how a decision tree works.
We are assuming a decision tree has been built to solve the following classification problem:
Given an individual's Tax Refund Status, Marital Status, and Taxable Income, predict whether they will repay a loan.
#
display(Image("images/22-DT-Example-1.png", width=800))
To classify a new record, we step through the tree, making a decision at each node that takes us to another node in the tree.
Each decision examines a single feature in the item being classified.
#
display(Image("images/22-DT-Example-2.png", width=800))
#
display(Image("images/22-DT-Example-3.png", width=800))
#
display(Image("images/22-DT-Example-4.png", width=800))
#
display(Image("images/22-DT-Example-5.png", width=800))
#
display(Image("images/22-DT-Example-6.png", width=800))
We conclude that this record's value of "Not Repay" is "No" -- that is, the tree predicts that this individual will repay the loan.
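To make the walk-through concrete, here is a hypothetical rendering of such a tree as nested conditionals in code. The attribute order and the income threshold below are illustrative (the actual splits are the ones shown in the figures), but the structure shows how each decision examines a single feature of the record.
#
def predict_not_repay(refund, marital_status, taxable_income):
    """Hypothetical tree: return 'Yes' if predicted NOT to repay, else 'No'."""
    if refund == "Yes":
        return "No"                      # leaf: predicted to repay
    if marital_status == "Married":
        return "No"                      # leaf: predicted to repay
    # Single or Divorced: fall through to a test on taxable income
    return "No" if taxable_income < 80_000 else "Yes"

print(predict_not_repay(refund="No", marital_status="Married", taxable_income=120_000))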
Note also that decision trees can be used to predict numeric values, so they are used for regression as well.
The general term "Classification and Regression Tree" (CART) is sometimes used -- although this term also refers to a specific decision tree learning algorithm.
#
display(Image("images/22-DT-Overview.png", width=800))
We've discussed how to apply a decision tree to data (lower portion of this figure).
But how does one train a decision tree? What algorithm can we use?
A number of algorithms have been proposed for building decision trees, including Hunt's Algorithm, CART, ID3, and C4.5.
We will look at Hunt's Algorithm as our example.
Hunt's Algorithm builds the tree node by node, starting from the root.
As we build the tree, we divide the training data up.
Let $D_t$ be the set of training records that reach node $t$.
If all of the records in $D_t$ belong to the same class, then $t$ is a leaf node labeled with that class.
Otherwise, choose an attribute test condition to partition $D_t$ into smaller subsets, and create a child node for each outcome of the test.
Recursively apply the above procedure until a stopping criterion is met.
#
display(Image("images/22-DT-Data-Example.png", width=500))
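Here is a compact sketch of the idea behind Hunt's Algorithm. The function names and record representation are mine, and the split is chosen greedily to minimize the number of misclassified training records; the impurity measures introduced later in these notes could be substituted as the criterion.
#
# Sketch of Hunt's Algorithm: recursively split the set of records D_t at node t.
# Records are dicts of attribute values plus a 'label'. Splits are binary
# equality tests on one categorical attribute, chosen greedily.
from collections import Counter

def misclassified(records):
    """Number of records not in the majority class (0 for an empty or pure set)."""
    counts = Counter(r["label"] for r in records)
    return len(records) - max(counts.values(), default=0)

def best_split(records, attributes):
    best = None
    for a in attributes:
        for v in {r[a] for r in records}:
            left = [r for r in records if r[a] == v]
            right = [r for r in records if r[a] != v]
            if not left or not right:
                continue
            score = misclassified(left) + misclassified(right)
            if best is None or score < best[0]:
                best = (score, a, v, left, right)
    return best

def hunt(records, attributes):
    """Return a nested-dict tree; leaves are class labels."""
    labels = {r["label"] for r in records}
    if len(labels) == 1:                       # stopping criterion: node is pure
        return labels.pop()
    split = best_split(records, attributes)
    if split is None:                          # no split separates the records
        return Counter(r["label"] for r in records).most_common(1)[0][0]
    _, a, v, left, right = split
    return {f"{a} == {v!r}": hunt(left, attributes),
            f"{a} != {v!r}": hunt(right, attributes)}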
So as Hunt's algorithm progresses, records are divided up and assigned to leaf nodes. The most challenging step in Hunt's algorithm is deciding which leaf node to split next, and how to specify the split.
In general, we use a greedy strategy. We will split a set of records based on an attribute test that optimizes some criterion.
We choose the set to split that provides the largest improvement in our criterion.
So we need to determine: (1) how to specify the attribute test condition, and (2) how to determine the best split.
How we specify a test condition depends on the attribute type: nominal, ordinal, or continuous.
It also depends on the number of ways to split: multi-way or binary.
For a Nominal (or categorical) attribute, we can use a multi-way split (one partition per distinct value) or a binary split (grouping the values into two subsets):
#
display(Image("images/22-split-1.png", width=300))
#
display(Image("images/22-split-2.png", width=600))
An Ordinal attribute is similar to a Nominal attribute, except that since its values have an ordering, we can specify the test in terms of a threshold. This simplifies the search for a good partition.
#
display(Image("images/22-split-3.png", width=600))
A Continuous attribute can be discretized in two ways:
#
display(Image("images/22-split-4.png", width=400))
#
display(Image("images/22-split-5.png", width=200))
Note that finding good partitions for nominal attributes can be expensive, possibly involving combinatorial searching of groupings.
However for ordinal or continuous attributes, sweeping through a range of threshold values can be more efficient.
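For example, a threshold for a continuous attribute can be found by sorting the values and evaluating every candidate split point. The sketch below (function names and toy data are illustrative) scores each binary split $A < v$ versus $A \geq v$ with the Gini measure defined in the next section; in practice the class counts can be updated incrementally while sweeping, rather than recomputed for each candidate.
#
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_threshold(values, labels):
    """Return (threshold, weighted child Gini) for the best split A < v vs. A >= v."""
    values, labels = np.asarray(values), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    # candidate thresholds: midpoints between consecutive sorted values
    candidates = np.unique((values[:-1] + values[1:]) / 2)
    best = (None, np.inf)
    for v in candidates:
        left, right = labels[values < v], labels[values >= v]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (v, score)
    return best

# Toy data: an income-like attribute and a Yes/No class label
print(best_threshold([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                     ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]))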
Our algorithmic strategy is to split record sets so as to improve classification accuracy on the training data.
When is classification accuracy maximized?
When each leaf node contains records that all have the same label.
Consider the following example:
At some node in the tree, before splitting that node's records, we have a mix of records from two classes (shown in the figure below).
There are three attributes we can split on. Which split is best?
#
display(Image("images/22-split-6.png", width=700))
The "Car Type" attribute yields more homogeneous splits.
Using the "Student ID" attribute yields perfect accuracy on the training data.
... but what would it do on the test data? Using the "Student ID" attribute would lead to overfitting.
We need a measure of homogeneity:
#
display(Image("images/22-split-7.png", width=700))
A number of measures have been proposed, including Entropy, the Gini coefficient, and misclassification error.
Let's look at Entropy and the Gini coefficient. We can think of the set of records at each node as defining a distribution over the classes.
For a given node $t$, we will use $p_i$ to denote the relative frequency of class $i$ among the records at node $t$.
Definition. Entropy is defined as
$$\text{Entropy}(t) = -\sum_{i} p_i \log_2 p_i,$$
with the convention that $0 \log_2 0 = 0$.
For a two-class problem, Entropy spans $[0, 1]$.
The Gini coefficient is also a common measure:
Definition. The Gini coefficient is defined as
$$\text{Gini}(t) = 1 - \sum_{i} p_i^2.$$
For a two-class problem, the Gini coefficient spans $[0, 0.5]$.
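A small sketch of both measures, computed from the per-class record counts at a node (the function names are mine, not the notebook's):
#
import numpy as np

def entropy(counts):
    """Entropy of a node, given per-class record counts."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                       # by convention, 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini coefficient of a node, given per-class record counts."""
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

print(entropy([10, 0]), gini([10, 0]))   # pure node: minimum impurity (0 and 0)
print(entropy([5, 5]),  gini([5, 5]))    # even two-class split: maximum impurity (1.0 and 0.5)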
These measures give us a sense of "balance" in a distribution: they are minimized (zero) when all records at the node belong to a single class, and maximized when the records are spread evenly across the classes.
As nodes are split in building the tree, we will see decreasing Gini and Entropy scores.
#
display(Image("images/11-entropy-example.png", width=800))
With this measure of impurity, we can decide which split will yield the most information gain.
Definition. Information gain is defined as
$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j),$$
where $I(\cdot)$ is the impurity measure (Entropy or Gini), $I(\text{parent})$ is the impurity at the current node, $N$ is the number of records at that node, $k$ is the number of children created by the split, and $N(v_j)$ is the number of records assigned to child $v_j$.
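The summation above is just the average impurity of the children, weighted by how many records each receives. A small sketch using Gini as the impurity measure $I$; the counts used in the example call are purely illustrative.
#
import numpy as np

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

def information_gain(parent_counts, children_counts):
    """Gain = I(parent) - sum_j (N(v_j) / N) * I(v_j), using Gini as I."""
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * gini(c) for c in children_counts)
    return gini(parent_counts) - weighted

# Illustrative only: a parent with a 10/10 class mix, split three ways
print(information_gain([10, 10], [[4, 1], [2, 7], [4, 2]]))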
#
display(Image("images/11-gini-example.png", width=800))
Remember that $\text{Gini}(t) = 1 - \sum_i p_i^2$.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Car Type attribute using multiway split.
(a) Gini =
(b) Gini = 0 because each CustomerID is unique.
(c) The Gini index for each Car Type value:
Family car =
Sports car =
Luxury car =
Gini(Car Type) = (the weighted average of the three values above)
The decision tree can be built, node by node, by splitting nodes in a way that steadily improves the overall Gini score.
Clearly, we could keep on splitting nodes until every leaf node is perfectly pure. In the extreme, each leaf node could contain just a single training record - a fully-grown tree.
This would certainly be overfitting!
There are two strategies that can be used to control the complexity of a decision tree:
Early Stopping
Do not expand the current node, even though it still has multiple records.
Default stopping conditions include:
Stop if all the records at the node have the same class label.
Stop if all the records at the node have identical attribute values (no split can separate them).
More restrictive stopping conditions include:
Stop if the number of records at the node falls below a minimum threshold.
Stop if the tree has reached a maximum depth.
Stop if splitting the node would improve the impurity measure by less than some threshold.
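In practice these controls show up as hyperparameters of library implementations. For example, scikit-learn's `DecisionTreeClassifier` exposes the following knobs; the data below is random and purely illustrative.
#
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # illustrative random features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # illustrative labels

clf = DecisionTreeClassifier(
    criterion="gini",            # impurity measure (or "entropy")
    max_depth=3,                 # stop growing below this depth
    min_samples_leaf=5,          # each leaf must contain at least 5 records
    min_impurity_decrease=0.01,  # require at least this much impurity reduction
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())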
Post-pruning
Suppose we can assess generalization error, for example by using held-out data.
Then:
Grow the tree fully on the training data.
Working bottom-up, consider replacing each subtree with a leaf node labeled with the majority class of its training records; keep the replacement if it does not increase the generalization error.
Stop when no further trimming reduces the generalization error.
Note that this is computationally more demanding than early-stopping.
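Reduced-error pruning as sketched above requires held-out data. scikit-learn implements a related post-pruning scheme, cost-complexity pruning, controlled by `ccp_alpha`; a sketch (again with random, illustrative data):
#
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                  # illustrative data
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning levels from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.clip(path.ccp_alphas, 0, None)

# Keep the pruned tree that does best on the held-out validation data
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))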
Decision Trees are a very popular technique.
They have a number of advantages:
They are inexpensive to construct, and extremely fast at classifying new records.
They are easy to interpret, especially for small trees.
They can handle both categorical and continuous attributes.
They are relatively robust to irrelevant or redundant attributes.
We will explore interpretation of decision trees in more detail next time.