#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sklearn.datasets as sk_data
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
HW 2 is due Friday (2/10) at 8pm
Upcoming office hours:
Read Tan, Steinbach, Karpatne, Kumar Chapter 7.3
Clustering is a very important way of discovering structure in data. It represents an example of unsupervised learning.
Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.
Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.
We can often simplify or compress our data if we recognize the existence of clusters.
Informally, a clustering is:
a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.
$k$-means Problem: Find $k$ points $c_1, \dots, c_k$ (called centers, centroids, or means), so that the sum of squared distances from every point $x_i$ to its closest center,
$$\sum_{i=1}^{n} \min_j \Vert x_i - c_j \Vert^2,$$
is minimized.
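To make the objective concrete, here is a small sketch (not part of the original lecture code) that evaluates this sum of squared distances for a given set of centers; `X` and `centers` are illustrative array names.
import numpy as np

def kmeans_objective(X, centers):
    """Sum of squared distances from each point to its closest center.

    X: (n, d) array of data points; centers: (k, d) array of centers.
    """
    # squared Euclidean distance from every point to every center, shape (n, k)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # keep only the closest center for each point, then sum over points
    return sq_dists.min(axis=1).sum()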
The (approximate) $k$-means algorithm:
1. Pick $k$ initial cluster centers $c_1, \dots, c_k$ (for example, $k$ randomly chosen data points).
2. For each $j$, define the cluster $C_j$ as the set of points that are closest to center $c_j$.
(Nearer to $c_j$ than to any other center.)
3. For each $j$, redefine $c_j$ to be the center of mass of cluster $C_j$.
(In other words, $c_j$ is the mean of the vectors in $C_j$.)
4. Repeat (i.e., go to Step 2) until convergence.
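Here is a compact sketch of these steps in NumPy (an illustration only; in practice one would use `sklearn.cluster.KMeans`). It assumes `X` is an (n, d) data array and that no cluster becomes empty.
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centers (here, k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 3: redefine each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels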
#
display(Image("images/20-kmeans-example.png", width=600))
We discussed two metrics to test how well $k$-means clustering has done... and as a result, to find the best $k$.
If ground truth is known, we can use the Rand Index: a similarity measure that allows us to compare the ground truth $T$ with the result $C$ of $k$-means clustering.
If $a$ denotes the number of pairs of points that have the same label in both $T$ and $C$, and $b$ denotes the number of pairs that have different labels in both $T$ and $C$, then the Rand Index is:
$$\text{RI}(T, C) = \frac{a + b}{\binom{n}{2}}$$
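A brute-force computation directly from this definition (a sketch; recent versions of scikit-learn also provide `sklearn.metrics.rand_score`):
from itertools import combinations

def rand_index(T, C):
    """Rand Index between two labelings T and C of the same points."""
    a = b = 0
    for i, j in combinations(range(len(T)), 2):
        same_T = (T[i] == T[j])
        same_C = (C[i] == C[j])
        if same_T and same_C:
            a += 1      # pair has the same label in both T and C
        elif (not same_T) and (not same_C):
            b += 1      # pair has different labels in both T and C
    n_pairs = len(T) * (len(T) - 1) // 2
    return (a + b) / n_pairs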
#
figs, axs = plt.subplots(1, 2, figsize = (12, 5))
df_rand_gt.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[0],
colorbar = False)
axs[0].set_title('Ground Truth (T)')
axs[0].set_axis_off()
df_rand_clust.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[1],
colorbar = False)
axs[1].set_title('Clustering (C)')
axs[1].set_axis_off();
If ground truth is not known, we can use the Silhouette Coefficient: an attempt to measure when a model produces "better defined" clusters, in the sense that points in the same cluster are close to each other and far from the other clusters.
If $a$ equals the mean distance between a data point and all other points in the same cluster, and $b$ equals the mean distance between that data point and all other points in the next nearest cluster, then the Silhouette Coefficient for a clustering is the average over all data points of
$$s = \frac{b - a}{\max(a, b)}.$$
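In practice we rarely compute this by hand; `sklearn.metrics.silhouette_score` returns the mean Silhouette Coefficient over all points. A minimal, self-contained usage sketch (the synthetic data here is purely illustrative):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    print(f'k = {k}: silhouette = {silhouette_score(X_demo, labels):.3f}')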
Today, we will look at a fairly different approach to clustering.
So far, we have been thinking of clustering as finding a partition of our dataset.
That is, a set of nonoverlapping clusters, in which each data item is in one cluster.
However, in many cases, the notion of a strict partition is not as useful.
Question. How many clusters would you say there are here? That is: what is the best choice of $k$ for $k$-means clustering?
#
X_rand, y_rand = sk_data.make_blobs(n_samples=[100, 100, 250, 70, 75, 80], centers = [[1, 2], [1.5, 1], [3, 2], [1.75, 3.25], [2, 4], [2.25, 3.25]], n_features = 2,
center_box = (-10.0, 10.0), cluster_std = [.2, .2, .3, .1, .15, .15], random_state = 0)
df_rand = pd.DataFrame(np.column_stack([X_rand[:, 0], X_rand[:, 1], y_rand]), columns = ['X', 'Y', 'label'])
df_rand = df_rand.astype({'label': 'int'})
df_rand['label2'] = [{0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}[x] for x in df_rand['label']]
df_rand['label3'] = [{0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 2}[x] for x in df_rand['label']]
# kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 100)
# df_rand['label'] = kmeans.fit_predict(df_rand[['X', 'Y']])
#
df_rand.plot('X', 'Y', kind = 'scatter', colormap='viridis',
colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
Three clusters?
#
df_rand.plot('X', 'Y', kind = 'scatter', c = 'label3', colormap='viridis',
colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
Four clusters?
#
df_rand.plot('X', 'Y', kind = 'scatter', c = 'label2', colormap='viridis',
colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
Six clusters?
#
df_rand.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis',
colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
This dataset shows clustering on multiple scales.
To fully capture the structure in this dataset, two things are needed: the ability to identify clusters at different scales, and a way to represent smaller clusters contained within larger ones.
These observations motivate the notion of hierarchical clustering.
In hierarchical clustering, we move away from the partition notion of $k$-means,
and instead capture a more complex arrangement that includes containment of one cluster within another.
A hierarchical clustering produces a set of nested clusters organized into a tree.
A hierarchical clustering is visualized using a dendrogram.
#
display(Image("images/21-dendrogram.png", width=600))
Hierarchical clustering has a number of advantages:
First, a hierarchical clustering encodes many different clusterings. That is, it does not itself decide on the correct number of clusters.
A clustering is obtained by "cutting" the dendrogram at some level.
This means that you can make this crucial decision yourself, by inspecting the dendrogram.
Put another way, you can obtain any desired number of clusters.
#
display(Image("images/21-dendrogram-cut.png", width=600))
The second advantage is that the dendrogram may itself correspond to a meaningful structure, for example, a taxonomy.
#
display(Image("images/21-animal-taxonomy.jpg", width=600))
The third advantage is that many hierarchical clustering methods can be performed using either similarity (proximity) or dissimilarity (distance) metrics.
This can be very helpful!
(Note that techniques like $k$-means cannot be used with unmodified similarity metrics.)
Another aspect of hierarchical clustering is that it can handle certain cases better than $k$-means.
Because of the nature of the $k$-means algorithm, $k$-means tends to produce clusters that are globular (round), roughly equal in size, and non-overlapping.
In many real-world situations, clusters may not be round, they may be of unequal size, and they may overlap.
Hence we would like clustering algorithms that can work in those cases also.
There are two main approaches to hierarchical clustering: "bottom-up" and "top-down." These approaches start at the extreme options of choosing $n$ clusters (one per data item) or choosing 1 cluster (containing everything).
Agglomerative Clustering ("bottom-up"): start with each point in its own cluster, and repeatedly merge the two closest clusters.
Divisive Clustering ("top-down"): start with all points in a single cluster, and repeatedly split a cluster into smaller pieces.
Agglomerative techniques are by far the more common.
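To make the bottom-up idea concrete, here is a deliberately naive sketch of agglomerative clustering that repeatedly merges the closest pair of clusters (using the minimum point-to-point distance, one of the choices discussed next). It is written for readability; `scipy.cluster.hierarchy.linkage`, used below, is the practical tool.
import numpy as np
from scipy.spatial.distance import cdist

def naive_agglomerative(X, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(X))]      # start: every point is its own cluster
    D = cdist(X, X)                              # pairwise point distances
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest point-to-point distance
        best_a, best_b, best_d = None, None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        clusters[best_a] = clusters[best_a] + clusters[best_b]   # merge the pair
        del clusters[best_b]
    return clusters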
The key to both of these methods is defining the distance between two clusters.
Different definitions for the inter-cluster distance yield different clusterings.
To illustrate the impact of the choice of cluster distances, we'll focus on agglomerative clustering.
Given two clusters, how do we define the distance between them?
Here are three natural ways to do it:
# Single, Complete, and Average Linkage
display(Image("images/21-hierarchical-criteria.png", width=1000))
Notice that it is easy to express the definitions above in terms of similarity instead of distance.
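In code, the three definitions differ only in how the pairwise distances between points of the two clusters are aggregated. A sketch, where `A` and `B` are arrays holding the points of two clusters:
from scipy.spatial.distance import cdist

def single_link(A, B):      # distance between the closest pair of points
    return cdist(A, B).min()

def complete_link(A, B):    # distance between the farthest pair of points
    return cdist(A, B).max()

def average_link(A, B):     # mean distance over all cross-cluster pairs
    return cdist(A, B).mean()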
Here is a set of 6 points that we will cluster to show the differences between the linkage criteria.
import matplotlib.pyplot as plt
pt_x = [0.4, 0.22, 0.35, 0.26, 0.08, 0.45]
pt_y = [0.53, 0.38, 0.32, 0.19, 0.41, 0.30]
plt.plot(pt_x, pt_y, 'o', markersize = 10, color = 'k')
plt.ylim([.15, .60])
plt.xlim([0.05, 0.70])
for i in range(6):
    plt.annotate(f'{i}', (pt_x[i]+0.02, pt_y[i]-0.01), fontsize = 12)
plt.axis('off');
And here is the matrix showing the distance between each pair of points.
X = np.array([pt_x, pt_y]).T
from scipy.spatial import distance_matrix
labels = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']
D = pd.DataFrame(distance_matrix(X, X), index = labels, columns = labels)
D.style.format('{:.2f}') # round to 2 digits after the decimal
|    | p0   | p1   | p2   | p3   | p4   | p5   |
|----|------|------|------|------|------|------|
| p0 | 0.00 | 0.23 | 0.22 | 0.37 | 0.34 | 0.24 |
| p1 | 0.23 | 0.00 | 0.14 | 0.19 | 0.14 | 0.24 |
| p2 | 0.22 | 0.14 | 0.00 | 0.16 | 0.28 | 0.10 |
| p3 | 0.37 | 0.19 | 0.16 | 0.00 | 0.28 | 0.22 |
| p4 | 0.34 | 0.14 | 0.28 | 0.28 | 0.00 | 0.39 |
| p5 | 0.24 | 0.24 | 0.10 | 0.22 | 0.39 | 0.00 |
#
display(Image("images/21-singlelink-pointset.png", width=500))
import scipy.cluster
import scipy.cluster.hierarchy as hierarchy
Z = hierarchy.linkage(X, method='single')
hierarchy.dendrogram(Z);
Advantages:
Single-linkage clustering can handle non-elliptical shapes.
In fact it can produce long, elongated clusters:
#
X_moon_05, y_moon_05 = sk_data.make_moons(random_state = 0, noise = 0.05)
Z = hierarchy.linkage(X_moon_05, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_05[:,0], X_moon_05[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Can Find Irregularly Shaped Clusters')
plt.axis('off');
#
X_rand_lo, y_rand_lo = sk_data.make_blobs(n_samples=[20, 200], centers = [[1, 1], [3, 1]], n_features = 2,
center_box = (-10.0, 10.0), cluster_std = [.1, .5], random_state = 0)
Z = hierarchy.linkage(X_rand_lo, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_lo[:,0], X_rand_lo[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Can Find Different-Sized Clusters')
plt.axis('off');
Disadvantages: single-linkage clustering is sensitive to noise and outliers. A few noisy points can form a "bridge" that merges clusters which should stay separate, as the next two examples show.
#
X_moon_10, y_moon_10 = sk_data.make_moons(random_state = 0, noise = 0.1)
Z = hierarchy.linkage(X_moon_10, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Clustering Changes Drastically on Slightly More Noisy Data')
plt.axis('off');
#
X_rand_hi, y_rand_hi = sk_data.make_blobs(n_samples=[20, 200], centers = [[1, 1], [3, 1]], n_features = 2,
center_box = (-10.0, 10.0), cluster_std = [.15, .6], random_state = 0)
Z = hierarchy.linkage(X_rand_hi, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.title('Single-Linkage Clustering Changes Drastically on Slightly More Noisy Data')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.axis('off');
#
display(Image("images/21-completelink-pointset.png", width=500))
Z = hierarchy.linkage(X, method='complete')
hierarchy.dendrogram(Z);
Properties: complete-linkage clustering is less susceptible to noise and outliers than single-linkage, but it is biased toward globular (round) clusters and tends to break up large clusters.
#
X_moon_05, y_moon_05 = sk_data.make_moons(random_state = 0, noise = 0.05)
Z = hierarchy.linkage(X_moon_05, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_05[:,0], X_moon_05[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Seeks Globular Clusters of Similar Size')
plt.axis('off');
#
Z = hierarchy.linkage(X_rand_hi, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Seeks Globular Clusters of Similar Size')
plt.axis('off');
#
Z = hierarchy.linkage(X_moon_10, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Clustering of Noisy Data similar to Less Noisy')
plt.axis('off');
#
display(Image("images/21-averagelink-pointset.png", width=500))
Z = hierarchy.linkage(X, method='average')
hierarchy.dendrogram(Z);
Average-Linkage clustering is in some sense a compromise between Single-linkage and Complete-linkage clustering.
Strengths: like complete-linkage, it is less susceptible to noise and outliers than single-linkage.
Limitations: it is biased toward globular clusters.
#
Z = hierarchy.linkage(X_moon_10, method='average')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Average-Linkage Similar to Complete - Globular Clusters')
plt.axis('off');
#
Z = hierarchy.linkage(X_rand_hi, method='average')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Average-Linkage More resistant to noise than Single-Linkage')
plt.axis('off');
Finally, we consider one more cluster distance.
Ward's distance asks:
What if we combined these two clusters -- how would the clustering improve?
To define "how would the clustering improve?" we appeal to the $k$-means criterion: the within-cluster sum of squared distances.
So:
Ward's Distance between clusters $C_i$ and $C_j$ is the increase in the total within-cluster sum of squares that results from merging the two clusters into a new cluster $C_{ij}$, compared to keeping them separate:
$$\Delta(C_i, C_j) = \sum_{x \in C_{ij}} \Vert x - c_{ij} \Vert^2 - \sum_{x \in C_i} \Vert x - c_i \Vert^2 - \sum_{x \in C_j} \Vert x - c_j \Vert^2$$
where $c_i$, $c_j$, and $c_{ij}$ are the corresponding cluster centroids.
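A direct translation of this formula into NumPy (a sketch for intuition; in practice scipy's `method='ward'` implements this merge criterion with an efficient update rule):
import numpy as np

def within_ss(C):
    """Within-cluster sum of squares around the centroid of the (m, d) array C."""
    return ((C - C.mean(axis=0)) ** 2).sum()

def ward_distance(Ci, Cj):
    merged = np.vstack([Ci, Cj])
    # increase in within-cluster sum of squares caused by the merge
    return within_ss(merged) - (within_ss(Ci) + within_ss(Cj))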
In a sense, this cluster distance results in a hierarchical analog of $k$-means.
As a result, it has properties similar to $k$-means: it is less susceptible to noise and outliers, but it is biased toward globular clusters.
Hence it tends to behave more like group-average hierarchical clustering.
Now we'll look at doing hierarchical clustering in practice, using Python.
We'll use the same synthetic data as we did in the $k$-means case -- i.e., three "blobs" living in 30 dimensions.
X, y = sk_data.make_blobs(n_samples=100, centers=3, n_features=30,
center_box=(-10.0, 10.0),random_state=0)
As a reminder, here is a visualization of that data: first the raw data as a heatmap, then an embedding into 2-D (using MDS).
#
sns.heatmap(X, xticklabels=False, yticklabels=False, linewidths=0,cbar=False);
#
import sklearn.manifold
import sklearn.metrics as metrics
euclidean_dists = metrics.euclidean_distances(X)
mds = sklearn.manifold.MDS(n_components = 2, max_iter = 3000, eps = 1e-9, random_state = 0,
dissimilarity = "precomputed", n_jobs = 1)
fit = mds.fit(euclidean_dists)
pos = fit.embedding_
plt.axis('equal')
plt.scatter(pos[:, 0], pos[:, 1], s = 8);
Hierarchical clustering is available in sklearn, but there is a much more fully developed set of tools in the scipy package, and that is the one we will use.
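One convenience of the scipy tools is that `hierarchy.linkage` also accepts a precomputed condensed distance matrix (as produced by `scipy.spatial.distance.pdist`), so we can cluster under any distance we like. A brief illustrative sketch:
import numpy as np
import scipy.cluster.hierarchy as hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
pts_demo = rng.random((30, 5))                   # illustrative data
d = pdist(pts_demo, metric='cosine')             # condensed pairwise distances
Z_cos = hierarchy.linkage(d, method='average')   # cluster on the precomputed distances
# note: methods such as 'ward' assume Euclidean geometry, so they are
# normally run on the raw observations rather than arbitrary distances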
import scipy.cluster
import scipy.cluster.hierarchy as hierarchy
import scipy.spatial.distance
# linkages = ['single','complete','average','weighted','ward']
Z = hierarchy.linkage(X, method = 'single')
R = hierarchy.dendrogram(Z)
Let's use the "20 Newsgroups" data provided as example data in sklearn.
(http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
news_data = fetch_20newsgroups(subset = 'train', categories = categories)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', min_df = 4, max_df = 0.8)
data = vectorizer.fit_transform(news_data.data).todense()
data.shape
(1781, 9409)
# metrics can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’,
# ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’,
# ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’,
# ‘sqeuclidean’, ‘yule’.
Z_20ng = hierarchy.linkage(data, method = 'ward', metric = 'euclidean')
plt.figure(figsize=(14,4))
R_20ng = hierarchy.dendrogram(Z_20ng, p=4, truncate_mode = 'level', show_leaf_counts=True)
clusters = hierarchy.fcluster(Z_20ng, 3, criterion = 'maxclust')
print(clusters.shape)
clusters
(1781,)
array([3, 3, 3, ..., 1, 3, 1], dtype=int32)
max_clusters = 20
s = np.zeros(max_clusters+1)
for k in range(2, max_clusters+1):
    clusters = hierarchy.fcluster(Z_20ng, k, criterion = 'maxclust')
    s[k] = metrics.silhouette_score(data, clusters, metric = 'euclidean')
plt.plot(range(2, len(s)), s[2:], '.-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score');
print('Top Terms Per Cluster:')
k = 5
clusters = hierarchy.fcluster(Z_20ng, k, criterion = 'maxclust')
for i in range(1, k+1):
    # collect the rows of the tf-idf matrix belonging to cluster i
    items = np.array([item for item, clust in zip(data, clusters) if clust == i])
    # the cluster centroid: the mean tf-idf weight of each term
    centroids = np.squeeze(items).mean(axis = 0)
    # sort terms by descending centroid weight
    order_centroids = centroids.argsort()[::-1]
    # (on newer scikit-learn versions, use get_feature_names_out())
    terms = vectorizer.get_feature_names()
    print(f'Cluster {i}:')
    for ind in order_centroids[:10]:
        print(f' {terms[ind]}')
    print('')
Top Terms Per Cluster:
Cluster 1: space nasa edu henry gov alaska access com moon digex
Cluster 2: ax max b8f g9v a86 145 1d9 pl 2di 0t
Cluster 3: edu com year baseball article writes cs team game university
Cluster 4: risc instruction ghhwang csie set nctu cisc tw reduced mq
Cluster 5: windows edu file dos com files card drivers driver use