#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
HW 2 is due Friday (2/10) at 8pm
Upcoming office hours:
Read Tan, Steinbach, Karpatne, Kumar Chapter 7.3
#
display(Image("images/08-gradescope_grading.png", width=600))
To measure how close together or far apart a collection of vectors is, let's start by defining the centroid of a dataset.
Given $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$, each of the same dimension $d$, the centroid is the componentwise average of the vectors:

$$\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i.$$

In other words: the centroid is the "center of mass" of the dataset.
Next, we define the sample variance of a dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ as

$$\operatorname{Var}(X) = \frac{1}{n}\sum_j \|\mathbf{x}_j - \bar{\mathbf{x}}\|^2.$$

In other words, the sample variance of the set of points is the average squared distance from each point to the centroid.
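As a quick check of these definitions, here is a minimal NumPy sketch (on a small made-up dataset) that computes both quantities:
# small made-up dataset: three points in 2 dimensions
pts = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 0.0]])

centroid = pts.mean(axis=0)                                 # componentwise average
sample_var = np.mean(np.sum((pts - centroid)**2, axis=1))   # mean squared distance to the centroid

print(centroid, sample_var)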
k-means Problem:
Find $k$ points $\mathbf{c}_1, \ldots, \mathbf{c}_k$ (called centers, centroids, or means), so that the cost

$$\sum_{i=1}^{n} \min_j \|\mathbf{x}_i - \mathbf{c}_j\|^2$$

is minimized.
The (approximate) k-means algorithm:
1. Pick $k$ starting points to serve as the initial centers $\mathbf{c}_1, \ldots, \mathbf{c}_k$ (for example, chosen at random from the data).
2. Assign each point to the cluster $X_j$ of its closest center $\mathbf{c}_j$.
(Nearer to $\mathbf{c}_j$ than to any other center.)
3. For each $j$, redefine $\mathbf{c}_j$ to be the center of mass of cluster $X_j$.
(In other words, $\mathbf{c}_j$ is the mean of the vectors in $X_j$.)
4. Repeat (i.e., go to Step 2) until convergence.
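Here is a minimal NumPy sketch of this loop (Lloyd's algorithm), for illustration only; real implementations such as scikit-learn's add smarter initialization, multiple restarts, and handling of empty clusters.
def lloyd_kmeans(X, k, n_iter=100, rng=None):
    # Bare-bones k-means for illustration (empty clusters are not handled).
    rng = np.random.default_rng(rng)
    # Step 1: pick k starting centers (here: k distinct data points at random)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels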
#
display(Image("images/20-kmeans-example.png", width=600))
As you can see, k-means can work very well.
However, we don't have any guarantees on the performance of k-means.
In particular, there are various settings in which k-means can fail to do a good job.
k-means tries to find spherical clusters.
Because each point is assigned to its closest center, the points in a cluster are implicitly assumed to be arranged in a sphere around the center.
#
display(Image("images/20-kmeans-nonspherical-clusters.png", width=600))
k-means tries to find equal-sized clusters.
For the same reason, the sizes of clusters are implicitly assumed to be approximately equal.
#
display(Image("images/20-kmeans-cluster-size.png", width=600))
k-means is sensitive to the starting cluster centers.
If the initial guess (Step 1) is a bad one, k-means may get "stuck" in a bad solution.
#
display(Image("images/20-kmeans-bad-initialization.png", width=600))
How can we avoid the kind of bad initialization we just saw?
A good strategy is to pick starting centers that are distant from one another. This strategy is called "k-means++".
It works very well in practice, and the scikit-learn implementation of k-means uses it by default.
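For intuition, here is a minimal sketch of the k-means++ seeding idea (not scikit-learn's actual implementation): the first center is chosen uniformly at random, and each subsequent center is chosen with probability proportional to its squared distance from the nearest center picked so far.
def kmeanspp_init(X, k, rng=None):
    # Illustrative k-means++ seeding: spread the initial centers out.
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]     # first center: uniformly at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        # next center: chosen with probability proportional to d2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)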
Given the tendency of k-means to look for spherical clusters, we should consider the scales of the various features.
In fact, in general when constructing or selecting a distance metric, one needs to think carefully about the scale of the features being used.
Example. Consider the case where we are clustering people based on their age, income, and whether they have children.
We might use age in years, income in dollars, and a 0/1 indicator for whether the person has children.
Thus, a record for a 27-year-old with an income of \$75,000 who has a child, and one for a 45-year-old with an income of \$42,000 and no children, would be encoded in feature space as:

$$\begin{bmatrix}27\\75000\\1\end{bmatrix},\quad\begin{bmatrix}45\\42000\\0\end{bmatrix}$$

What would happen if we used Euclidean distance as our dissimilarity metric in this feature space? (This is what k-means uses.)
Clearly, income would dominate the other two features: a difference in parenthood counts for about as much as a difference of one dollar in yearly income.
We are unlikely to expose parenting-based differences if we cluster using this representation.
The most common way to handle this is feature scaling.
The basic idea is to rescale each feature separately, so that its range of values is about the same as all other features.
For example, one may choose to rescale each feature to the range $[0, 1]$ (min-max scaling), or to standardize each feature to have zero mean and unit variance ($z$-scores).
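As a sketch of what this looks like in code (using scikit-learn's preprocessing utilities; the third record below is made up so the scalers have something nontrivial to work with):
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# records: [age, income, has_children]
people = np.array([[27, 75000, 1],
                   [45, 42000, 0],
                   [33, 58000, 1]], dtype=float)

# standardize: each column gets zero mean and unit variance
print(StandardScaler().fit_transform(people))

# min-max: each column is rescaled to the range [0, 1]
print(MinMaxScaler().fit_transform(people))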
Generally, we would say that, given some k, the k-means algorithm "learns" the cluster centers -- that is, the parameters of the model.
But we have not yet considered how to choose the right number of clusters.
That's typically not something one knows in advance.
Our basic strategy will be to cluster the data for a range of values of k, and then evaluate how good each resulting clustering is.
How do we know whether the clusters we get represent "real" structure in our data?
Consider this dataset, which we have clustered using k-means:
from sklearn.cluster import KMeans

# 100 points with uniformly random coordinates in the unit square
unif_X = np.random.default_rng().uniform(0, 1, 100)
unif_Y = np.random.default_rng().uniform(0, 1, 100)
df = pd.DataFrame(np.column_stack([unif_X, unif_Y]), columns = ['X', 'Y'])

# cluster them with k-means anyway
kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 100)
df['label'] = kmeans.fit_predict(df[['X', 'Y']])

df.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', colorbar = False)
plt.axis('equal')
plt.axis('off');
In fact, this dataset was generated using uniform random numbers for the coordinates.
So ... it truly has no clusters.
The point is: any clustering algorithm will always output some "clustering" of the data.
The question is, does the clustering reflect real structure?
# From xkcd.com
display(Image("images/08-kmeans-xkcd.png", width=400))
Generally we encounter two problems:
1. Do the clusters found reflect real structure in the data, or are they an artifact of the algorithm?
2. If there is real structure, how many clusters are present?
There is often no definitive answer to either of these questions. You will need to use your judgment in answering them.
That said, we can define some metrics that will help to evaluate the extent to which clustering appears useful.
One tool we may be able to use in some settings is external information about the data.
In particular, we may have knowledge from some other source about the nature of the clusters in the data.
In that case, what we need is a way to compare a proposed clustering with some externally-known, "ground truth" clustering.
The Rand Index is a similarity measure for clusterings. We can use it to compare two clusterings.
Or, if we are testing an algorithm on data for which we know ground truth, we can use it to assess the algorithm's accuracy.
Defining the Rand Index.
Each item in our dataset is assumed to have two labelings, one for each clustering.
For example, we could have a ground truth label assignment T and a comparison clustering C.
import sklearn.datasets as sk_data

# three synthetic blobs serve as the "ground truth" clustering T
X_rand, y_rand = sk_data.make_blobs(n_samples=[100, 250, 150], centers = [[1, 2],[1.5, 3], [2, 4]], n_features = 2,
center_box = (-10.0, 10.0), cluster_std = [.2, .3, .2], random_state = 0)
df_rand_gt = pd.DataFrame(np.column_stack([X_rand[:, 0], X_rand[:, 1], y_rand]), columns = ['X', 'Y', 'label'])
df_rand_clust = df_rand_gt.copy()
kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 100)
df_rand_clust['label'] = kmeans.fit_predict(df_rand_gt[['X', 'Y']])
figs, axs = plt.subplots(1, 2, figsize = (12, 5))
df_rand_gt.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[0],
colorbar = False)
axs[0].set_title('Ground Truth (T)')
axs[0].set_axis_off()
df_rand_clust.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[1],
colorbar = False)
axs[1].set_title('Clustering (C)')
axs[1].set_axis_off();
Intuitively, the idea behind Rand Index is to consider pairs of points, and ask whether pairs that fall into the same cluster in T also fall into the same cluster in C.
Specifically:
Let $a$ be the number of pairs that have the same label in T and the same label in C.
Let $b$ be the number of pairs that have different labels in T and different labels in C.
Then the Rand Index is:

$$\text{RI}(T,C) = \frac{a+b}{\binom{n}{2}}$$
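Here is a sketch of this computation on tiny made-up labelings, checked against scikit-learn's metrics.rand_score:
from itertools import combinations
from sklearn.metrics import rand_score

# tiny made-up labelings, for illustration only
T = [0, 0, 1, 1, 2]   # "ground truth" labels
C = [1, 1, 0, 0, 0]   # a comparison clustering

# a: pairs with the same label in both T and C; b: pairs with different labels in both
a = sum(1 for i, j in combinations(range(len(T)), 2) if T[i] == T[j] and C[i] == C[j])
b = sum(1 for i, j in combinations(range(len(T)), 2) if T[i] != T[j] and C[i] != C[j])

n_pairs = len(T) * (len(T) - 1) // 2
print((a + b) / n_pairs)     # Rand Index from the definition
print(rand_score(T, C))      # the library function agrees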
How do we know whether a particular Rand Index (RI) score is significant?
We might compare it to the RI for a random assignment of points to labels.
This leads to the Adjusted Rand Index.
Definition of Adjusted Rand Index.
To "calibrate" the Rand Index this way, we use the expected Rand Index of random labelings, denoted E[\text{RI}].
The Expected Rand Index considers C to be a clustering that has the same cluster sizes as T, but labels are assigned at random.
Using that, we define the adjusted Rand index as a simple rescaling of RI:
$$\text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}$$

The computation of $E[\text{RI}]$ and $\max(\text{RI})$ is simple combinatorics (we'll omit the derivation).
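As a quick illustration (tiny made-up labels, using scikit-learn's adjusted_rand_score): relabeling the clusters does not change the ARI, while an unrelated labeling scores near or below zero.
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_score(truth, [2, 2, 0, 0, 1, 1]))   # same grouping, different labels -> 1.0
print(adjusted_rand_score(truth, [0, 1, 2, 0, 1, 2]))   # unrelated grouping -> near or below 0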
Example.
Let's consider again our 3-cluster dataset with known labels $y$.
# rX is the raw data matrix of this dataset, defined earlier in the notebook
sns.heatmap(rX, xticklabels = False, yticklabels = False, linewidths = 0);
Here is the adjusted Rand Index, when using k-means to cluster this dataset for 1 to 10 clusters:
import sklearn.metrics as metrics

def ri_evaluate_clusters(X, max_clusters, ground_truth):
    # adjusted Rand index of k-means clusterings for k = 1, ..., max_clusters
    ri = np.zeros(max_clusters + 1)
    for k in range(1, max_clusters + 1):
        kmeans = KMeans(init='k-means++', n_clusters=k, n_init=10)
        kmeans.fit_predict(X)
        ri[k] = metrics.adjusted_rand_score(kmeans.labels_, ground_truth)
    return ri
ri = ri_evaluate_clusters(X, 10, y)
plt.plot(range(1,len(ri)), ri[1:], 'o-')
plt.xlabel('Number of clusters')
plt.title('$k$-means Clustering Compared to Known Labels')
plt.ylabel('Adjusted Rand Index');
An important question we face in evaluating a clustering is how many clusters are present.
In practice, to use k-means or most other clustering methods, one must choose k, the number of clusters, via some process.
The first thing you might do is to look at the k-means objective function and see if it levels off after a certain point.
Recall that the k-means objective can be considered the clustering "error".
If the error stops going down, that would suggest that the clustering is not improving as the number of clusters is increased.
error = np.zeros(11)
for k in range(1, 11):
    kmeans = KMeans(init='k-means++', n_clusters = k, n_init = 10)
    kmeans.fit_predict(X)
    error[k] = kmeans.inertia_   # inertia_ is the value of the k-means objective
Recall our synthetic data from earlier.
#
import sklearn.manifold
import sklearn.metrics as metrics
euclidean_dists = metrics.euclidean_distances(X)
mds = sklearn.manifold.MDS(n_components = 2, max_iter = 3000, eps = 1e-9, random_state = 0,
dissimilarity = "precomputed", n_jobs = 1)
fit = mds.fit(euclidean_dists)
pos = fit.embedding_
plt.axis('equal')
plt.scatter(pos[:, 0], pos[:, 1], s = 8);
For this synthetic dataset, here is the k-means objective, as a function of k:
plt.plot(range(1, len(error)), error[1:], 'o-')
plt.xlabel('Number of clusters')
plt.title(r'$k$-means clustering performance of synthetic data')
plt.ylabel('Error');
Warning: This synthetic data is not at all typical. You will almost never see such a sharp change in the error function as we see here.
Let's create a function for later use.
def evaluate_clusters(X, max_clusters):
    # k-means objective (inertia) for k = 1, ..., max_clusters
    error = np.zeros(max_clusters + 1)
    for k in range(1, max_clusters + 1):
        kmeans = KMeans(init='k-means++', n_clusters=k, n_init=10)
        kmeans.fit_predict(X)
        error[k] = kmeans.inertia_
    return error
Of course, normally, the ground truth labels are not known.
In that case, evaluation must be performed using the model itself.
Recall our definition of clustering:
a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.
This suggests a metric that could evaluate a clustering: comparing the distances between points within a cluster, to the distances between points in different clusters.
The Silhouette Coefficient is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with "better defined" clusters.
(sklearn.metrics.silhouette_score)
Let a be the mean distance between a data point and all other points in the same cluster.
Let b be the mean distance between a data point and all other points in the next nearest cluster.
Then the Silhouette Coefficient for that point is

$$s = \frac{b-a}{\max(a, b)},$$

and the Silhouette Coefficient of a clustering is the mean of $s$ over all points.
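As a sketch: scikit-learn's metrics.silhouette_samples returns this per-point value, and metrics.silhouette_score is its mean (tiny made-up data for illustration).
from sklearn.metrics import silhouette_samples, silhouette_score

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labs = np.array([0, 0, 1, 1])

per_point = silhouette_samples(pts, labs)    # s = (b - a) / max(a, b) for each point
print(per_point, per_point.mean(), silhouette_score(pts, labs))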
Let's apply the Silhouette Coefficient to our synthetic dataset.
# 'labels' is the cluster assignment for X computed earlier
sc = metrics.silhouette_score(X, labels, metric='euclidean')
print(sc)
0.8319348841402534
#
def sc_evaluate_clusters(X, max_clusters, n_init, seed):
    # silhouette score of k-means clusterings for k = 2, ..., max_clusters
    # (the silhouette score is not defined for a single cluster)
    s = np.zeros(max_clusters + 1)
    for k in range(2, max_clusters + 1):
        kmeans = KMeans(init='k-means++', n_clusters = k, n_init = n_init, random_state = seed)
        kmeans.fit_predict(X)
        s[k] = metrics.silhouette_score(X, kmeans.labels_, metric = 'euclidean')
    return s
s = sc_evaluate_clusters(X, 10, 10, 1)
plt.plot(range(2, len(s)), s[2:], 'o-')
plt.xlabel('Number of Clusters')
plt.title('$k$-means clustering performance on synthetic data')
plt.ylabel('Silhouette Score');
Again, these results are cleaner than what you will typically see.
But the general idea is to look for a local maximum of the Silhouette Coefficient as a guide to the number of clusters.