In [2]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';
In [12]:
#
def centerAxes(ax):
    ax.spines['left'].set_position('zero')
    ax.spines['right'].set_color('none')
    ax.spines['bottom'].set_position('zero')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    bounds = np.array([ax.axes.get_xlim(), ax.axes.get_ylim()])
    ax.plot(bounds[0][0],bounds[1][0],'')
    ax.plot(bounds[0][1],bounds[1][1],'')
In [1]:
#
%matplotlib inline
%config InlineBackend.figure_format='retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image, display_html, display, Math, HTML;

Announcements¶

  • HW 3 is due Friday (2/17) at 8pm

  • Upcoming office hours:

    • Today: Prof McDonald from 4:30-6pm in CCDS 1341
    • Tomorrow: Peer tutor Rohan Anand from 1:30-3pm in CCDS 16th floor
  • Read Tan, Steinbach, Karpatne, Kumar Chapter 3.3

Recap¶

Decision trees are a supervised learning approach to classification.

In [12]:
#
display(Image("images/22-DT-Overview.png", width=800))

Our task is determining:

  • How to measure a 'good' split
    • We have several "impurity" measures such as entropy and the Gini index
  • What attribute to split on
    • Which gives us the largest information gain?
  • How to find the best split for a given attribute
    • Is the attribute nominal, ordinal, or continuous?
  • When to stop splitting
    • Should we fully grow the tree?
In [6]:
#
display(Image("images/22-DT-Example-1.png", width=800))

Lecture 12: $k$-Nearest Neighbors¶

[This lecture is based on Prof. Crovella's CS 506 lecture notes and Data Science from Scratch by Joel Grus.]

12.1 Parametric vs. Nonparametric Models¶

Today we'll expand our repertoire of classification techniques.

In so doing we'll look at a first example of a new kind of model: nonparametric.

There are many ways to define models (whether supervised or unsupervised).

One key distinction is this: does the model have a fixed number of parameters, or does the number of parameters grow with the training data?

  • If the model has a fixed number of parameters, it is called parametric.

  • If the number of parameters grows with the data, the model is called nonparametric.

Parametric models have

  • the advantage of often being faster to use,
  • but the disadvantage of making strong assumptions about the nature of data distributions.

Nonparametric models are

  • more flexible,
  • but can be computationally intractable for large datasets.

The classic example of a nonparametric classifier is called $k$-Nearest Neighbors.

In [5]:
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

12.2 $k$-Nearest Neighbors¶

When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.

--James Whitcomb Riley (1849 - 1916)

In [4]:
#
display(Image("images/23-duck.jpg", width=100))

Like any classifier, $k$-Nearest Neighbors is trained by providing it a set of labeled data.

However, at training time, the classifier does very little. It just stores away the training data.

In [3]:
#
demo_y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
demo_X = np.array([[-3,1], [-2, 4], [-2, 2], [-1.5, 1], [-1, 3], [0, 0], [1, 1.5], [2, 0.5], [2, 3], [2, 0], [3, 1], [4, 4], [0, 1]])
test_X = [-0.3, 0.7]
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.axis('equal')
plt.axis('off')
plt.title('Training Points: 2 Classes');

The idea of the $k$-Nearest Neighbors classifier is that, at test time, it simply "looks at" the $k$ points in the training set that are nearest to the test input $x$, and makes a decision based on the labels on those points.

By "nearest" we usually mean in Euclidean distance.

In [4]:
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.plot(test_X[0], test_X[1], 'ok')
plt.annotate('Test Point', test_X, [75, 25], 
             textcoords = 'offset points', fontsize = 14, 
             arrowprops = {'arrowstyle': '->'})
plt.axis('equal')
plt.axis('off')
plt.title('Training Points: 2 Classes');
In [5]:
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.plot(test_X[0], test_X[1], 'ok')
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.5, facecolor = 'red', alpha = 0.2)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('1-Nearest-Neighbor: Classification: Red');
In [6]:
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok')
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.9, facecolor = 'gray', alpha = 0.3)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('2-Nearest-Neighbor');
In [7]:
#
plt.figure()
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 1.4, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok')
plt.axis('equal')
plt.axis('off')
plt.title('3-Nearest-Neighbor: Classification: Blue');

Note that $k$-Nearest Neighbors can do either hard or soft classification.

As a hard classifier, it returns the majority vote of the labels on the $k$ nearest neighbors.

That vote may be indeterminate (a tie), as in the 2-NN example above.

It is also reasonable to weight the votes of neighborhood points according to their distance from $x$.

As a soft classifier it returns:

$$p(y = c \mid x, k) = \frac{\text{number of points in the neighborhood with label } c}{k}$$
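
As a quick sketch of both options using scikit-learn (the weights='distance' argument and the predict_proba method are the library's standard API; demo_X, demo_y, and test_X are the demo variables above):

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# hard classification: unweighted majority vote among the 3 nearest neighbors
hard = KNeighborsClassifier(n_neighbors = 3).fit(demo_X, demo_y)
print(hard.predict([test_X]))        # predicted label
print(hard.predict_proba([test_X]))  # soft output: fraction of neighbors with each label

# distance-weighted variant: closer neighbors get larger votes
weighted = KNeighborsClassifier(n_neighbors = 3, weights = 'distance').fit(demo_X, demo_y)
print(weighted.predict_proba([test_X]))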

Model Selection for $k$-NN¶

Each value of $k$ results in a different model.

The complexity of the resulting model is therefore controlled by the hyperparameter $k$.

Hence we will want to select $k$ using held-out data to avoid overfitting.
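
For instance, here is a minimal sketch of choosing $k$ by cross-validation with scikit-learn's cross_val_score (the small make_blobs dataset and the candidate values of $k$ are placeholders for this sketch). Later in the lecture we will do the train/test splitting by hand.

In [ ]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# a small synthetic dataset just for this sketch
X_demo, y_demo = make_blobs(n_samples = 150, centers = 3, random_state = 0)

# estimate held-out accuracy for several candidate values of k
for k in [1, 3, 5, 15, 25]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors = k), X_demo, y_demo, cv = 5)
    print(f'k = {k:2d}: mean held-out accuracy = {scores.mean():.3f}')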

Consider this dataset where items fall into three classes:

In [8]:
import sklearn.datasets as sk_data
X, y = sk_data.make_blobs(n_samples=150, 
                          centers=[[-2, 0],[1, 5], [2.5, 1.5]],
                          cluster_std = [2, 2, 3],
                          n_features=2,
                          center_box=(-10.0, 10.0),random_state=0)
plt.figure(figsize = (5,5))
plt.axis('equal')
plt.axis('off')
plt.scatter(X[:,0], X[:,1], c = y, cmap = cmap_bold, s = 80);

Let's observe how the complexity of the resulting model changes as we vary $k$.

We'll do this by plotting the decision regions. These show how the method would classify each potential test point in the space.

In [9]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .1  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                      np.arange(y_min, y_max, h))
In [10]:
#
from sklearn.neighbors import KNeighborsClassifier

f, axs = plt.subplots(1, 3, figsize=(15, 5))
for i, k in enumerate([1, 5, 25]):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X, y)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axs[i].pcolormesh(xx, yy, Z, cmap = cmap_light, shading = 'auto')
    axs[i].axis('equal')
    axs[i].axis('off')
    axs[i].set_title(f'Decision Regions for $k$ = {k}');

Notice how increasing $k$ results in smoother decision boundaries.

These are more likely to show good generalization ability.

Challenges for $k$-NN¶

Working with a $k$-NN classifier can involve some challenges.

  1. First and foremost, the computational cost of classification grows with the size of the training data. (Why?) While certain data structures may help, essentially the classification time grows linearly with the data set size.

    Note the tradeoff here: the training step is trivial, but the classification step can be prohibitively expensive.

  2. Second, since Euclidean distance is the most common distance function used, data scaling is important.

    As previously discussed, features should be scaled so that distance measures are not dominated by a small subset of features. (A minimal scaling sketch follows this list.)

  3. The third challenge is the curse of dimensionality.

    If training data lives in a high dimensional space, Euclidean distance measures become less effective.

    This is subtle but important, so we will now look at the curse of dimensionality more closely.
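
Returning to the scaling point (challenge 2 above): here is a minimal sketch, on made-up data, of how standardizing features with scikit-learn's StandardScaler can matter for a distance-based classifier. The data construction is purely illustrative.

In [ ]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# made-up data: the class signal lives in a small-scale feature,
# while a large-scale feature is pure noise
rng = np.random.default_rng(0)
n = 400
y_toy = rng.integers(0, 2, n)
X_toy = np.column_stack([y_toy + 0.3 * rng.standard_normal(n),   # informative, small scale
                         1000 * rng.standard_normal(n)])         # uninformative, huge scale

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state = 0)

# k-NN on the raw features: Euclidean distance is dominated by the noisy feature
raw = KNeighborsClassifier(n_neighbors = 5).fit(X_tr, y_tr)
# k-NN after standardizing each feature
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 5)).fit(X_tr, y_tr)

print(f'accuracy, raw features:    {raw.score(X_te, y_te):.3f}')
print(f'accuracy, scaled features: {scaled.score(X_te, y_te):.3f}')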

In [9]:
#
display(Image("images/23-curse-frankenstein-edited.jpeg", width=300))

12.3 The Curse of Dimensionality¶

The Curse of Dimensionality is a somewhat tongue-in-cheek term for serious problems that arise when we use geometric algorithms in high dimensions.

There are various aspects of the Curse that affect $k$-NN.

1. Points are far apart in high dimension.

$k$-NN relies on there being one or more "close" points to the test point $x$.

In other words, we need the training data to be relatively dense, so there are "close" points everywhere.

Unfortunately, the amount of space we work in grows exponentially with the dimension $d$.

So the amount of data we need to maintain a given density also grows exponentially with dimension $d$.

Hence, points in high-dimensional spaces tend not to be close to one another at all.

One very intuitive way to think about it is this:

In order for two points to be close in $\mathbb{R}^d$, they must be close in each of the $d$ dimensions.

As the number of dimensions grows, it becomes harder and harder for a pair of points to be close in each dimension.

2. Points tend to all be at similar distances in high dimension.

This one is a little harder to visualize. We'll use formulas instead to guide us.

Let's say points are uniformly distributed in space, so that the number of points in a region is proportional to the region's volume.

How does volume relate to distance as the dimension $d$ grows?

Consider you are at some point in space (say, the test point $x$), and you want to know how many points are within a unit distance from you.

This is proportional to the volume of a hypersphere with radius 1.

Now, the volume of a hypersphere of radius $r$ is $k_d\, r^d$.

For each $d$ there is a different constant $k_d$.

  • For $d=2$, $k_d$ is $\pi$, and
  • for $d=3$, $k_d$ is $\frac{4}{3}\pi$,
  • etc.
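
For reference, the constant has a closed form, $k_d = \pi^{d/2} / \Gamma(d/2 + 1)$ (so $k_2 = \pi$ and $k_3 = \frac{4}{3}\pi$). It is not needed for the argument below, since $k_d$ will cancel, but here is a small sketch evaluating it with scipy.special.gamma:

In [ ]:
import numpy as np
from scipy.special import gamma

def k_d(d):
    # volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)
    return np.pi ** (d / 2) / gamma(d / 2 + 1)

for d in [1, 2, 3, 5, 10, 20, 50]:
    print(f'd = {d:2d}: k_d = {k_d(d):.6g}')

(As an aside, $k_d$ itself shrinks toward 0 as $d$ grows -- the unit ball occupies a vanishing fraction of the surrounding cube -- but for the argument below only the $(1-\epsilon)^d$ factor matters, since $k_d$ cancels.)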
In [11]:
#
ax = plt.figure(figsize = (7,7)).add_subplot(projection = '3d')
# coordinates of sphere surface
u, v = np.mgrid[0:2*np.pi:50j, 0:np.pi:50j]
x = np.cos(u)*np.sin(v)
y = np.sin(u)*np.sin(v)
z = np.cos(v)
#
ax.plot_surface(x, y, z, color='r', alpha = 0.3)
s3 = 1/np.sqrt(3)
ax.quiver(0, 0, 0, s3, s3, s3, color = 'b')
ax.text(s3/2, s3/2, s3/2-0.2, 'r', size = 14)
ax.set_axis_off()
plt.title('Hypersphere in $d$ dimensions\nVolume is $k_d \,r^d$');

Let's also ask how many points are within a slightly smaller distance, let's say 0.99.

The new distance can be thought of as $1-\epsilon$ for some small $\epsilon$.

The number of points is then, of course, proportional to $k_d (1-\epsilon)^d$.

Now, what is the fraction $f_d$ of all the points that are within a unit distance, but not within a distance of 0.99?

(That is, not within the hypersphere with radius $1-\epsilon$?)

This is

$$f_d = \frac{k_d \cdot 1^d - k_d (1-\epsilon)^d}{k_d \cdot 1^d} = 1 - (1-\epsilon)^d.$$
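
A quick numeric check of this fraction (a minimal sketch, with $\epsilon = 0.01$ as an example value):

In [ ]:
eps = 0.01   # thickness of the outer shell, as a fraction of the radius
for d in [1, 2, 3, 10, 100, 1000]:
    f_d = 1 - (1 - eps) ** d   # fraction of the ball's volume in the outer shell
    print(f'd = {d:4d}: fraction of volume in the outer shell = {f_d:.4f}')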

In [12]:
#
ax = plt.figure(figsize = (7,7)).add_subplot(projection = '3d')
# coordinates of sphere surface
u, v = np.mgrid[0:2*np.pi:50j, 0:np.pi:50j]
x = np.cos(u)*np.sin(v)
y = np.sin(u)*np.sin(v)
z = np.cos(v)
#
ax.plot_surface(x, y, z, color='r', alpha = 0.2)
s3 = 1/np.sqrt(3)
ax.quiver(0, 0, 0, s3, s3, s3, color = 'b')
ax.text(s3/2, s3/2, s3/2-0.2, '1', size = 14)
#
eps = 0.9
#
ax.plot_surface(eps * x, eps * y, eps * z, color='b', alpha = 0.2)
ax.quiver(0, 0, 0, eps, 0, 0, color = 'k')
ax.text(1/2-0.2, 0, -0.4, r'$1-\epsilon$', size = 14)
ax.set_axis_off()
plt.title('Inner and Outer Hyperspheres');

Now, $(1-\epsilon)^d$ goes to 0 as $d \to \infty$.

So, $f_d$ goes to 1 as $d \to \infty$.

Which means: in the limit of high $d$, all of the points that are within 1 unit of our location are almost exactly 1 unit from our location!

In [13]:
#
ax = plt.figure(figsize = (7,7)).add_subplot(projection = '3d')
# coordinates of sphere surface
u, v = np.mgrid[0:2*np.pi:50j, 0:np.pi:50j]
x = np.cos(u)*np.sin(v)
y = np.sin(u)*np.sin(v)
z = np.cos(v)
#
ax.plot_surface(x, y, z, color='r', alpha = 0.2)
s3 = 1/np.sqrt(3)
ax.quiver(0, 0, 0, s3, s3, s3, color = 'b')
ax.text(s3/2, s3/2, s3/2-0.2, '1', size = 14)
#
eps = 0.9
#
ax.plot_surface(eps * x, eps * y, eps * z, color='b', alpha = 0.2)
ax.quiver(0, 0, 0, eps, 0, 0, color = 'k')
ax.text(1/2-0.2, 0, -0.4, r'$1-\epsilon$', size = 14)
ax.set_axis_off()
plt.title('In high-$d$, All Points Lie in Outer Shell');
The following example is based on Data Science from Scratch, Joel Grus, Second Edition, Chapter 12.

Let's demonstrate this effect in practice.

What we will do is create 1,000 points, scattered at random within a $d$-dimensional space.

We will look at two quantities:

  • The minimum distance between any two points, and
  • The average distance between any two points.

as we vary $d$.

In [15]:
import sklearn.metrics as metrics

nsamples = 1000
unif_X = np.random.default_rng().uniform(0, 1, nsamples).reshape(-1, 1)
euclidean_dists = metrics.euclidean_distances(unif_X)
# extract the values above the diagonal
dists = euclidean_dists[np.triu_indices(nsamples, 1)]
mean_dists = [np.mean(dists)]
min_dists = [np.min(dists)]
for d in range(2, 101):
    unif_X = np.column_stack([unif_X, np.random.default_rng().uniform(0, 1, nsamples)])
    euclidean_dists = metrics.euclidean_distances(unif_X)
    dists = euclidean_dists[np.triu_indices(nsamples, 1)]
    mean_dists.append(np.mean(dists))
    min_dists.append(np.min(dists))

plt.plot(min_dists, label = "Minimum Distance")
plt.plot(mean_dists, label = "Average Distance")
plt.xlabel(r'Number of dimensions ($d$)')
plt.ylabel('Distance')
plt.legend(loc = 'best')
plt.title(f'Comparison of Minimum Versus Average Distance Between {nsamples} Points\nAs Dimension Grows');

The average distance between points grows, but the minimum distance between points appears to grow even faster.

So the ratio of the minimum distance to the average distance grows as well!

Let's look at that ratio:

In [16]:
#
plt.plot([a/b for a, b in zip(min_dists, mean_dists)])
plt.xlabel(r'Number of dimensions ($d$)')
plt.ylabel('Ratio')
plt.title(f'Ratio of Minimum to Average Distance Between {nsamples} Points\nAs Dimension Grows');

This shows that, for any test point $x$, the distance to the closest point to $x$, relatively speaking, gets closer and closer to the average distance between points.

Of course, if we used a point at the average distance for classifying $x$, we'd get a very poor classifier.

Implications of the Curse.

For $k$-NN, the Curse of Dimensionality means that in high dimension, most points are nearly the same distance from the test point.

This makes $k$-NN ineffective: it cannot reliably tell which are the $k$ nearest neighbors, and its performance degrades.

What Can be Done?

The problem is that you simply cannot have enough data to do a good job using $k$-NN in high dimensions.

If you must use $k$-NN for your task, the only option may be to reduce the dimension of your data.

Thankfully, this can often be done at little cost in accuracy using SVD! We can form a rank-$r$ approximation to the data and perform $k$-NN in a lower-dimensional space.
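
Here is a minimal sketch of that idea using scikit-learn's TruncatedSVD in a pipeline with $k$-NN; the synthetic dataset and the choice of rank $r = 10$ are placeholders, not recommendations.

In [ ]:
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# synthetic high-dimensional data just for this sketch
X_hd, y_hd = make_classification(n_samples = 500, n_features = 200,
                                 n_informative = 10, random_state = 0)
X_tr, X_te, y_tr, y_te = train_test_split(X_hd, y_hd, random_state = 0)

# project the data onto a rank-10 approximation, then run k-NN in the reduced space
pipe = make_pipeline(TruncatedSVD(n_components = 10, random_state = 0),
                     KNeighborsClassifier(n_neighbors = 5))
pipe.fit(X_tr, y_tr)
print(f'k-NN accuracy after rank-10 SVD projection: {pipe.score(X_te, y_te):.3f}')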

12.4 Comparing Classification Methods on Synthetic Data¶

Next we'll look at two classification methods in practice:

  • Decision Trees, and
  • $k$-Nearest Neighbors.

To compare these methods, the question arises:

How do we evaluate a classifier?

In the simple case of a binary classifier, we can call one class the 'Positive' class and one the 'Negative' class.

The most basic measure of success for a classifier is accuracy: what fraction of test points are correctly classified?

Of course, accuracy is important, but it can be too simplistic at times.

For example, suppose we have a dataset showing class imbalance: 90% of the data are in the Positive class and 10% are in the Negative class.

For this dataset, consider a classifier that always predicts 'Positive'. Its accuracy is 90%, but it is a very 'stupid' classifier! (I.e., it could be one line of code: print('Positive')!)

A better way to measure the classifier's performance is using a Confusion Matrix:

Diagonal elements represent successes, and off-diagonal elements represent errors.

In [11]:
#
display(Image("images/23-confusion-matrix.png", width=300))

Using the confusion matrix we can define some more useful measures:

  • Recall - defined as the fraction of actual positives correctly classified:
    • TP/(TP + FN)
  • Precision - defined as the fraction of predicted positives that are actually positive:
    • TP/(TP + FP)
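
As a quick sketch, both measures can be computed directly from the confusion matrix, or with scikit-learn's built-in helpers (the labels below are made up purely to illustrate the definitions):

In [ ]:
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# made-up labels just to illustrate the definitions (1 = Positive, 0 = Negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'recall    = TP/(TP + FN) = {tp / (tp + fn):.2f}')
print(f'precision = TP/(TP + FP) = {tp / (tp + fp):.2f}')

# the same numbers via the built-in helpers
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))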

Evaluating $k$-Nearest Neighbors¶

First we'll generate some synthetic data to work with.

In [17]:
import sklearn.datasets as datasets

X, y = datasets.make_circles(noise=.1, factor=.5, random_state=1)
print('Shape of data: {}'.format(X.shape))
print('Unique labels: {}'.format(np.unique(y)))
Shape of data: (100, 2)
Unique labels: [0 1]

Here is what the data looks like:

In [18]:
#
plt.figure(figsize = (6,6))
plt.prism()  # this sets a nice color map
plt.scatter(X[:, 0], X[:, 1], c=y, s = 80)
plt.axis('off')
plt.axis('equal');

Recall that we always want to test on data separate from our training data.

For now, we will do something very simple: take the first 50 examples for training and hold out the rest for testing. (Later we will do this a better way.)

In [19]:
X_train = X[:50]
y_train = y[:50]
X_test = X[50:]
y_test = y[50:]
In [21]:
#
fig_size = (12, 5)

plt.figure(figsize = fig_size)
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 80)
plt.axis('equal')
plt.axis('off')
plt.title('Training Data')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_test, s = 80)
plt.title('Test Data')
plt.axis('equal')
plt.axis('off');

For our first example, we will classify the points (in the two classes) using a $k$-NN classifier.

We will specify that $k=5$, i.e., we will classify based on the majority vote of the 5 nearest neighbors.

In [22]:
k = 5
knn5 = KNeighborsClassifier(n_neighbors = k)    

In the context of supervised learning, the scikit-learn fit() function corresponds to training and the predict() function corresponds to testing.

In [23]:
knn5.fit(X_train,y_train)
print(f'Accuracy on test data: {knn5.score(X_test, y_test)}')
Accuracy on test data: 0.72

Accuracy of 72% sounds good -- but let's dig deeper.

We'll call the red points the Positive class and the green points the Negative class.

Here is the confusion matrix:

In [24]:
y_pred_test = knn5.predict(X_test)
pd.DataFrame(metrics.confusion_matrix(y_test, y_pred_test), 
             columns = ['Predicted +', 'Predicted -'], 
             index = ['Actual +', 'Actual -'])
Out[24]:
          Predicted +  Predicted -
Actual +           14           14
Actual -            0           22

Looks like the classifier is getting all of the Negative class correct, but only achieving accuracy of 50% on the Positive class.

That is, its precision is 100%, but its recall is only 50%.

Let's visualize the results.

In [25]:
#
k = 5

plt.figure(figsize = fig_size)
plt.subplot(1, 2, 1)

plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 80)
plt.axis('equal')
plt.title('Training')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_test, s = 80)
plt.title(f'Testing $k$={k}\nAccuracy: {knn5.score(X_test, y_test)}')
plt.axis('off')
plt.axis('equal');

Indeed, the Positive (red) points in the upper half of the test data are all classified incorrectly.

Let's look at one of the points that the classifier got wrong:

In [26]:
#
k=5 
test_point = np.argmax(X_test[:,1])
neighbors = knn5.kneighbors([X_test[test_point]])[1]

plt.figure(figsize = fig_size)
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 80)
plt.scatter(X_train[neighbors,0], X_train[neighbors,1],
            c = y_train[neighbors], marker='o', 
            facecolors='none', edgecolors='b', s = 80)
radius = np.max(metrics.euclidean_distances(X_test[test_point].reshape(1, -1), X_train[neighbors][0]))
ax = plt.gcf().gca()
circle = mp.patches.Circle(X_test[test_point], radius, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.axis('equal')
plt.axis('off')
plt.title(r'Training')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_test, s = 80)
plt.scatter(X_test[test_point,0], X_test[test_point,1], marker='o', 
            facecolors='none', edgecolors='b', s = 80)
plt.title('Testing $k$={}\nAccuracy: {}'.format(k,knn5.score(X_test, y_test)))
plt.axis('equal')
plt.axis('off');

For comparison purposes, let's try $k = 3$.

In [27]:
#
k = 3
knn3 = KNeighborsClassifier(n_neighbors=k)    
knn3.fit(X_train,y_train)
y_pred_test = knn3.predict(X_test)

plt.figure(figsize = fig_size)
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s = 80)
plt.axis('equal')
plt.axis('off')
plt.title(r'Training')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_test, s = 80)
plt.title(f'Testing $k$={k}\nAccuracy: {knn3.score(X_test, y_test)}')
plt.axis('off')
plt.axis('equal');

And let's look at the same individual point as before:

In [28]:
#
k = 3
test_point = np.argmax(X_test[:,1])
neighbors = knn3.kneighbors([X_test[test_point]])[1]

plt.figure(figsize = fig_size)
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s = 80)
plt.scatter(X_train[neighbors, 0], X_train[neighbors, 1], marker = 'o', 
            facecolors = 'none', edgecolors = 'b', s = 80)
radius = np.max(metrics.euclidean_distances(X_test[test_point].reshape(1, -1), 
                                            X_train[neighbors][0]))
ax = plt.gcf().gca()
circle = mp.patches.Circle(X_test[test_point], radius, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.axis('equal')
plt.axis('off')
plt.title(r'Training')
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_test, s = 80)
plt.scatter(X_test[test_point,0], X_test[test_point,1], marker = 'o', 
            facecolors = 'none', edgecolors = 'b', s = 80)
plt.title(f'Testing $k$={k}\nAccuracy: {knn3.score(X_test, y_test)}')
plt.axis('off')
plt.axis('equal');

So how confident can we be that the test accuracy is 92% in general?

What we really need to do is consider many different train/test splits.

Thus, the proper way to evaluate generalization ability (accuracy on the test data) is:

  1. Form a random train/test split
  2. Train the classifier on the training split
  3. Test the classifier on the testing split
  4. Accumulate statistics
  5. Repeat from 1. until enough statistics have been collected.
In [29]:
import sklearn.model_selection as model_selection

nreps = 50
kvals = range(1, 10)
acc = []
np.random.seed(4)
for k in kvals:
    test_rep = []
    train_rep = []
    for i in range(nreps):
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X, 
                                                                            y, 
                                                                            test_size = 0.5)
        knn = KNeighborsClassifier(n_neighbors = k)    
        knn.fit(X_train, y_train)
        train_rep.append(knn.score(X_train, y_train))
        test_rep.append(knn.score(X_test, y_test))
    acc.append([np.mean(np.array(test_rep)), np.mean(np.array(train_rep))])
accy = np.array(acc)
In [30]:
plt.plot(kvals, accy[:, 0], '.-', label = 'Accuracy on Test Data')
plt.plot(kvals, accy[:, 1], '.-', label = 'Accuracy on Training Data')
plt.xlabel(r'$k$')
plt.ylabel('Accuracy')
plt.title('Train/Test Comparison of $k$-NN')
plt.legend(loc = 'best');

Based on the generalization error -- i.e., accuracy on test (held-out) data -- it looks like $k=2$ is the best choice.

Here is the decision boundary for $k$-NN with $k=2$.

In [31]:
#
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
plot_step = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))
In [32]:
#
np.random.seed(1)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.5)

k = 2
knn = KNeighborsClassifier(n_neighbors = k)  
knn.fit(X_train, y_train)
y_pred_train = knn.predict(X_train)
y_pred_test = knn.predict(X_test)

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize = fig_size)
plt.subplot(1, 2, 1)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=30)
plt.axis('equal')
plt.axis('off')
plt.xlim((x_min, x_max))
plt.ylim((y_min, y_max))
plt.title(f'{k}-NN - Training Data\nAccuracy: {knn.score(X_train, y_train)}');

plt.subplot(1, 2, 2)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=30)
plt.axis('equal')
plt.axis('off')
plt.xlim((x_min, x_max))
plt.ylim((y_min, y_max))
plt.title(f'{k}-NN - Test Data\nAccuracy: {knn.score(X_test, y_test)}');