In [1]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sklearn.datasets as sk_data
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';

Announcements¶

  • HW 2 is due Friday (2/10) at 8pm

  • Upcoming office hours:

    • Today: Prof McDonald from 4:30-6pm in CCDS 1341
    • Tomorrow: Peer tutor Rohan Anand from 1:30-3pm in CCDS 16th floor
  • Read Tan, Steinbach, Karpatne, Kumar Chapter 7.3

Recap of Last Lecture¶

Clustering is a very important way of discovering structure in data. It represents an example of unsupervised learning.

  • Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.

  • Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.

We can often simplify or compress our data if we recognize the existence of clusters.

Informally, a clustering is:

a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.

$k$-means Problem: Find $k$ points $c_1, \dots, c_k$ (called centers, centroids, or means), so that the sum of squared distances from every point to its closest center,

$$\sum_{i=1}^{n} \min_{j} \Vert x_i - c_j \Vert^2,$$

is minimized.

The (approximate) $k$-means algorithm:

  1. Pick $k$ cluster centers $\{c_1, \dots, c_k\}$. These can be chosen randomly, or by some other method.
  2. For each $j$, define the cluster $X_j$ as the set of points in $X$ that are closest to center $c_j$ (nearer to $c_j$ than to any other center).
  3. For each $j$, redefine $c_j$ to be the center of mass of cluster $X_j$ (in other words, $c_j$ is the mean of the vectors in $X_j$).
  4. Repeat (i.e., go to Step 2) until convergence.
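
To make the steps concrete, here is a minimal numpy sketch of this loop (in practice you would use sklearn.cluster.KMeans; the function name kmeans_sketch and the random initialization below are illustrative, and the sketch assumes no cluster ever becomes empty):

In [ ]:
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centers (here: k distinct random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers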

In [8]:
#
display(Image("images/20-kmeans-example.png", width=600))

We discussed two metrics for testing how well $k$-means clustering has done... and, as a result, for finding the best $k$.

  • If ground truth is known, we can use the Rand Index: a similarity measure that allows us to compare the ground truth $T$ versus the result of $k$-means clustering, $C$.

    If $a$ denotes the number of pairs of points that have the same label in both $T$ and $C$, and $b$ denotes the number of pairs that have different labels in both $T$ and $C$, then the Rand Index is:

$$RI(T, C) = \frac{a + b}{\binom{n}{2}}$$
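
Here is a sketch that computes the Rand Index directly by counting pairs; T and C below are illustrative label arrays, and the function name rand_index is our own (sklearn.metrics.rand_score computes the same quantity):

In [ ]:
from itertools import combinations
import numpy as np

def rand_index(T, C):
    a = b = 0
    for i, j in combinations(range(len(T)), 2):
        same_T = T[i] == T[j]
        same_C = C[i] == C[j]
        a += int(same_T and same_C)              # pair grouped together in both
        b += int((not same_T) and (not same_C))  # pair separated in both
    n_pairs = len(T) * (len(T) - 1) // 2         # this is (n choose 2)
    return (a + b) / n_pairs

T = np.array([0, 0, 1, 1, 2])
C = np.array([0, 0, 1, 2, 2])
print(rand_index(T, C))  # matches sklearn.metrics.rand_score(T, C)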
In [14]:
#
figs, axs = plt.subplots(1, 2, figsize = (12, 5))
df_rand_gt.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[0],
                   colorbar = False)
axs[0].set_title('Ground Truth (T)')
axs[0].set_axis_off()
df_rand_clust.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[1],
                  colorbar = False)
axs[1].set_title('Clustering (C)')
axs[1].set_axis_off();
  • If ground truth is not known, we can use the Silhouette Coefficient: an attempt to measure when a model produces "better defined" clusters, in the sense that points in the same cluster are close to each other and far from the other clusters.

    If $a$ equals the mean distance between a data point and all other points in the same cluster, and $b$ equals the mean distance between that data point and all points in the next nearest cluster, then the Silhouette Coefficient for that point is:

$$s = \frac{b - a}{\max(a, b)}$$

    (The Silhouette Coefficient for a whole clustering is the mean of $s$ over all points.)
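
In practice, sklearn computes the mean Silhouette Coefficient over all points directly. A quick illustrative sketch (the blobs below are made-up data, not the lecture's dataset):

In [ ]:
import sklearn.datasets as sk_data
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo, _ = sk_data.make_blobs(n_samples=200, centers=3, random_state=0)
for k in [2, 3, 4]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # higher is better; we expect a peak at k = 3 for 3 well-separated blobs
    print(k, silhouette_score(X_demo, labels))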

9. Hierarchical Clustering¶

9.1 Limits of Partitional Clustering¶

Today, we will look at a fairly different approach to clustering.

So far, we have been thinking of clustering as finding a partition of our dataset.

That is, a set of nonoverlapping clusters, in which each data item is in one cluster.

However, in many cases, the notion of a strict partition is not as useful.

Question. How many clusters would you say there are here? That is: what is the best choice of $k$ for $k$-means clustering?

In [2]:
#
X_rand, y_rand = sk_data.make_blobs(n_samples=[100, 100, 250, 70, 75, 80], centers = [[1, 2], [1.5, 1], [3, 2], [1.75, 3.25], [2, 4], [2.25, 3.25]], n_features = 2,
                          center_box = (-10.0, 10.0), cluster_std = [.2, .2, .3, .1, .15, .15], random_state = 0)
df_rand = pd.DataFrame(np.column_stack([X_rand[:, 0], X_rand[:, 1], y_rand]), columns = ['X', 'Y', 'label'])
df_rand = df_rand.astype({'label': 'int'})
df_rand['label2'] = [{0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}[x] for x in df_rand['label']]
df_rand['label3'] = [{0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 2}[x] for x in df_rand['label']]
# kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 100)
# df_rand['label'] = kmeans.fit_predict(df_rand[['X', 'Y']])
In [3]:
#
df_rand.plot('X', 'Y', kind = 'scatter', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');

Three clusters?

In [4]:
#
df_rand.plot('X', 'Y', kind = 'scatter', c = 'label3', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');

Four clusters?

In [5]:
#
df_rand.plot('X', 'Y', kind = 'scatter', c = 'label2', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');

Six clusters?

In [6]:
#
df_rand.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');

This dataset shows clustering on multiple scales.

To fully capture the structure in this dataset, two things are needed:

  1. Capturing the differing clusters depending on the scale
  2. Capturing the containment relations -- which clusters lie within other clusters

These observations motivate the notion of hierarchical clustering.

In hierarchical clustering, we move away from the partition notion of $k$-means, and instead capture a more complex arrangement that includes containment of one cluster within another.

9.2 Hierarchical Clustering¶

A hierarchical clustering produces a set of nested clusters organized into a tree.

A hierarchical clustering is visualized using a dendrogram

  • A tree-like diagram that records the containment relations among clusters.
In [5]:
#
display(Image("images/21-dendrogram.png", width=600))

Strengths of Hierarchical Clustering¶

Hierarchical clustering has a number of advantages:

First, a hierarchical clustering encodes many different clusterings. That is, it does not itself decide on the correct number of clusters.

A clustering is obtained by "cutting" the dendrogram at some level.

This means that you can make this crucial decision yourself, by inspecting the dendrogram.

Put another way, you can obtain any desired number of clusters.

In [6]:
#
display(Image("images/21-dendrogram-cut.png", width=600))
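
As a sketch of what this cutting looks like in code (assuming Z is a linkage matrix of the kind we will compute with hierarchy.linkage later in this lecture; the height threshold 1.5 is arbitrary):

In [ ]:
import scipy.cluster.hierarchy as hierarchy

# cut so that exactly 4 clusters remain
labels_by_k = hierarchy.fcluster(Z, 4, criterion='maxclust')

# or cut the dendrogram at a fixed height (distance) t
labels_by_height = hierarchy.fcluster(Z, t=1.5, criterion='distance')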

The second advantage is that the dendrogram may itself correspond to a meaningful structure, for example, a taxonomy.

In [7]:
#
display(Image("images/21-animal-taxonomy.jpg", width=600))

The third advantage is that many hierarchical clustering methods can be performed using either similarity (proximity) or dissimilarity (distance) metrics.

This can be very helpful!

(Note that techniques like $k$-means cannot be used with unmodified similarity metrics.)

Comparison to kk-means¶

Another aspect of hierarchical clustering is that it can handle certain cases better than $k$-means.

Because of the nature of the $k$-means algorithm, $k$-means tends to produce:

  • Roughly spherical clusters
  • Clusters of approximately equal size
  • Non-overlapping clusters

In many real-world situations, clusters may not be round, they may be of unequal size, and they may overlap.

Hence we would like clustering algorithms that can work in those cases also.

9.3 Hierarchical Clustering Algorithms¶

There are two main approaches to hierarchical clustering: "bottom-up" and "top-down." The two approaches start from opposite extremes: $n$ clusters (one per data point) or a single all-inclusive cluster.

Agglomerative Clustering ("bottom-up"):

  • Start by defining each point as its own cluster
  • At each successive step, merge the two clusters that are closest to each other (according to some rule to be determined)
  • Repeat until only one cluster is left.

Divisive Clustering ("top-down"):

  • Start with one, all-inclusive cluster
  • At each step, find the cluster split that creates the largest distance between resulting clusters
  • Repeat until each point is in its own cluster.

Agglomerative techniques are by far the more common.
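
To make the agglomerative loop concrete, here is a minimal (and deliberately inefficient) sketch that merges the closest pair of clusters at each step, using the "closest pair of points" rule defined below as single-linkage; scipy's hierarchy.linkage does this properly and efficiently. The function name agglomerate_sketch is our own, and the sketch stops at a desired number of clusters rather than running all the way down to one:

In [ ]:
import numpy as np
from scipy.spatial.distance import cdist

def agglomerate_sketch(X, n_clusters):
    clusters = [[i] for i in range(len(X))]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # find the pair of clusters whose closest points are nearest each other
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cdist(X[clusters[a]], X[clusters[b]]).min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]                          # (b > a, so index a is unaffected)
    return clusters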

The key to both of these methods is defining the distance between two clusters.

Different definitions for the inter-cluster distance yield different clusterings.

To illustrate the impact of the choice of cluster distances, we'll focus on agglomerative clustering.

Defining Cluster Proximity¶

Given two clusters, how do we define the distance between them?

Here are three natural ways to do it:

In [9]:
# Single, Complete, and Average Linkage
display(Image("images/21-hierarchical-criteria.png", width=1000))
  • Single-Linkage: the distance between two clusters is the distance between the closest two points that are in different clusters.

$$D_{\mathrm{single}}(i, j) = \min_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \}$$

  • Complete-Linkage: the distance between two clusters is the distance between the farthest two points that are in different clusters.

$$D_{\mathrm{complete}}(i, j) = \max_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \}$$

  • Average-Linkage: the distance between two clusters is the average distance between all pairs of points from different clusters.

$$D_{\mathrm{average}}(i, j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i, y \in C_j} d(x, y)$$

Notice that it is easy to express the definitions above in terms of similarity instead of distance.
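
All three definitions can be computed directly from the matrix of pairwise distances. Here is a small sketch with two illustrative clusters (Ci and Cj are made-up points):

In [ ]:
import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[3.0, 0.0], [4.0, 1.0]])
pairwise = cdist(Ci, Cj)   # all pairwise distances d(x, y), x in Ci, y in Cj
print(pairwise.min())      # single linkage
print(pairwise.max())      # complete linkage
print(pairwise.mean())     # average linkage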

Here is a set of 6 points that we will cluster to show the differences between these linkage criteria.

In [44]:
import matplotlib.pyplot as plt

pt_x = [0.4, 0.22, 0.35, 0.26, 0.08, 0.45]
pt_y = [0.53, 0.38, 0.32, 0.19, 0.41, 0.30]
plt.plot(pt_x, pt_y, 'o', markersize = 10, color = 'k')
plt.ylim([.15, .60])
plt.xlim([0.05, 0.70])
for i in range(6):
    plt.annotate(f'{i}', (pt_x[i]+0.02, pt_y[i]-0.01), fontsize = 12)
plt.axis('off');

And here is the matrix showing the distance between each pair of points.

In [46]:
X = np.array([pt_x, pt_y]).T
from scipy.spatial import distance_matrix
labels = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']
D = pd.DataFrame(distance_matrix(X, X), index = labels, columns = labels)
D.style.format('{:.2f}') # round to 2 digits after the decimal
Out[46]:
  p0 p1 p2 p3 p4 p5
p0 0.00 0.23 0.22 0.37 0.34 0.24
p1 0.23 0.00 0.14 0.19 0.14 0.24
p2 0.22 0.14 0.00 0.16 0.28 0.10
p3 0.37 0.19 0.16 0.00 0.28 0.22
p4 0.34 0.14 0.28 0.28 0.00 0.39
p5 0.24 0.24 0.10 0.22 0.39 0.00

Single-Linkage Clustering¶

In [10]:
#
display(Image("images/21-singlelink-pointset.png", width=500))
In [9]:
import scipy.cluster
import scipy.cluster.hierarchy as hierarchy
Z = hierarchy.linkage(X, method='single')
hierarchy.dendrogram(Z);

Advantages:

  • Single-linkage clustering can handle non-elliptical shapes.

    In fact it can produce long, elongated clusters:

In [10]:
#
X_moon_05, y_moon_05 = sk_data.make_moons(random_state = 0, noise = 0.05)
Z = hierarchy.linkage(X_moon_05, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_05[:,0], X_moon_05[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Can Find Irregularly Shaped Clusters')
plt.axis('off');
  • Single-linkage clustering can find clusters of different sizes.
In [11]:
#
X_rand_lo, y_rand_lo = sk_data.make_blobs(n_samples=[20, 200], centers = [[1, 1], [3, 1]], n_features = 2,
                          center_box = (-10.0, 10.0), cluster_std = [.1, .5], random_state = 0)
Z = hierarchy.linkage(X_rand_lo, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_lo[:,0], X_rand_lo[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Can Find Different-Sized Clusters')
plt.axis('off');

Disadvantages:

  • Single-linkage clustering can be sensitive to noise and outliers.
In [12]:
#
X_moon_10, y_moon_10 = sk_data.make_moons(random_state = 0, noise = 0.1)
Z = hierarchy.linkage(X_moon_10, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Clustering Changes Drastically on Slightly More Noisy Data')
plt.axis('off');
In [13]:
#
X_rand_hi, y_rand_hi = sk_data.make_blobs(n_samples=[20, 200], centers = [[1, 1], [3, 1]], n_features = 2,
                          center_box = (-10.0, 10.0), cluster_std = [.15, .6], random_state = 0)
Z = hierarchy.linkage(X_rand_hi, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.title('Single-Linkage Clustering Changes Drastically on Slightly More Noisy Data')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.axis('off');

Complete-Linkage Clustering¶

In [11]:
#
display(Image("images/21-completelink-pointset.png", width=500))
In [14]:
Z = hierarchy.linkage(X, method='complete')
hierarchy.dendrogram(Z);

Properties:

  • Produces more-balanced clusters -- more-equal diameters
In [15]:
#
X_moon_05, y_moon_05 = sk_data.make_moons(random_state = 0, noise = 0.05)
Z = hierarchy.linkage(X_moon_05, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_05[:,0], X_moon_05[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Seeks Globular Clusters of Similar Size')
plt.axis('off');
  • Less susceptible to noise
In [16]:
#
Z = hierarchy.linkage(X_rand_hi, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Seeks Globular Clusters of Similar Size')
plt.axis('off');
In [17]:
#
Z = hierarchy.linkage(X_moon_10, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Clustering of Noisy Data similar to Less Noisy')
plt.axis('off');

Average-Linkage Clustering¶

In [12]:
#
display(Image("images/21-averagelink-pointset.png", width=500))
In [18]:
Z = hierarchy.linkage(X, method='average')
hierarchy.dendrogram(Z);

Average-linkage clustering is in some sense a compromise between single-linkage and complete-linkage clustering.

Strengths:

  • Less susceptible to noise and outliers

Limitations:

  • Biased toward elliptical clusters
In [19]:
#
Z = hierarchy.linkage(X_moon_10, method='average')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Average-Linkage Similar to Complete - Globular Clusters')
plt.axis('off');
In [20]:
#
Z = hierarchy.linkage(X_rand_hi, method='average')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Average-Linkage More resistant to noise than Single-Linkage')
plt.axis('off');

Ward's Distance¶

Finally, we consider one more cluster distance.

Ward's distance asks

What if we combined these two clusters -- how would clustering improve?

To define "how would clustering improve?" we appeal to the kk-means criterion.

So:

Ward's Distance between clusters $C_i$ and $C_j$ is the increase in the total within-cluster sum of squares when the two clusters are merged into a new cluster $C_{i+j}$, compared to keeping them separate:

$$D_{\mathrm{Ward}}(i, j) = \sum_{x \in C_{i+j}} \Vert x - r_{i+j} \Vert^2 \; - \; \sum_{x \in C_i} \Vert x - r_i \Vert^2 \; - \; \sum_{x \in C_j} \Vert x - r_j \Vert^2$$

where $r_i, r_j, r_{i+j}$ are the corresponding cluster centroids.
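
Here is a small sketch computing Ward's distance from this definition (Ci and Cj are made-up clusters; scipy applies the same idea internally when you pass method='ward'):

In [ ]:
import numpy as np

def sse(C):
    # within-cluster sum of squares around the centroid
    return ((C - C.mean(axis=0)) ** 2).sum()

Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[3.0, 0.0], [4.0, 1.0]])
merged = np.vstack([Ci, Cj])
print(sse(merged) - sse(Ci) - sse(Cj))  # the increase in SSE from merging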

In a sense, this cluster distance results in a hierarchical analog of $k$-means.

As a result, it has properties similar to $k$-means:

  • Less susceptible to noise and outliers
  • Biased toward elliptical clusters

Hence it tends to behave more like group-average (average-linkage) hierarchical clustering.

9.4 Hierarchical Clustering on Synthetic Data¶

Now we'll look at doing hierarchical clustering in practice, using Python.

We'll use the same synthetic data as we did in the k-means case -- i.e., three "blobs" living in 30 dimensions.

In [21]:
X, y = sk_data.make_blobs(n_samples=100, centers=3, n_features=30,
                          center_box=(-10.0, 10.0),random_state=0)

As a reminder, here is the visualization of the raw data: first as a heatmap, then as an embedding into 2-D (using MDS).

In [22]:
#
sns.heatmap(X, xticklabels=False, yticklabels=False, linewidths=0,cbar=False);
In [23]:
#
import sklearn.manifold
import sklearn.metrics as metrics
euclidean_dists = metrics.euclidean_distances(X)
mds = sklearn.manifold.MDS(n_components = 2, max_iter = 3000, eps = 1e-9, random_state = 0,
                   dissimilarity = "precomputed", n_jobs = 1)
fit = mds.fit(euclidean_dists)
pos = fit.embedding_
plt.axis('equal')
plt.scatter(pos[:, 0], pos[:, 1], s = 8);

Hierarchical clustering is available in sklearn, but the scipy package has a much more fully developed set of tools, so that is the one we will use.

In [24]:
import scipy.cluster
import scipy.cluster.hierarchy as hierarchy
import scipy.spatial.distance

# linkages = ['single','complete','average','weighted','ward']
Z = hierarchy.linkage(X, method = 'single')
In [25]:
R = hierarchy.dendrogram(Z)

9.5 Hierarchical Clustering on Real Data¶

Let's use the "20 Newsgroups" data provided as example data in sklearn.

(http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

In [26]:
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
news_data = fetch_20newsgroups(subset = 'train', categories = categories)
In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', min_df = 4, max_df = 0.8)
data = vectorizer.fit_transform(news_data.data).todense()
data.shape
Out[27]:
(1781, 9409)
In [28]:
# metric can be 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine',
# 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching',
# 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath',
# 'sqeuclidean', 'yule'. (Note: method='ward' requires metric='euclidean'.)
Z_20ng = hierarchy.linkage(data, method = 'ward', metric = 'euclidean')
plt.figure(figsize=(14,4))
R_20ng = hierarchy.dendrogram(Z_20ng, p=4, truncate_mode = 'level', show_leaf_counts=True)

Selecting the Number of Clusters¶

In [29]:
clusters = hierarchy.fcluster(Z_20ng, 3, criterion = 'maxclust')
print(clusters.shape)
clusters
(1781,)
Out[29]:
array([3, 3, 3, ..., 1, 3, 1], dtype=int32)
In [30]:
max_clusters = 20
s = np.zeros(max_clusters+1)
for k in range(2, max_clusters+1):
    clusters = hierarchy.fcluster(Z_20ng, k, criterion = 'maxclust')
    s[k] = metrics.silhouette_score(data, clusters, metric = 'euclidean')
plt.plot(range(2, len(s)), s[2:], '.-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score');
In [31]:
print('Top Terms Per Cluster:')
k = 5
clusters = hierarchy.fcluster(Z_20ng, k, criterion = 'maxclust')
for i in range(1,k+1):
    items = np.array([item for item,clust in zip(data, clusters) if clust == i])
    centroids = np.squeeze(items).mean(axis = 0)
    order_centroids = centroids.argsort()[::-1]  # term indices, largest mean tf-idf first
    terms = vectorizer.get_feature_names()
    print(f'Cluster {i}:')
    for ind in order_centroids[:10]:
        print(f' {terms[ind]}')
    print('')
Top Terms Per Cluster:
Cluster 1:
 space
 nasa
 edu
 henry
 gov
 alaska
 access
 com
 moon
 digex

Cluster 2:
 ax
 max
 b8f
 g9v
 a86
 145
 1d9
 pl
 2di
 0t

Cluster 3:
 edu
 com
 year
 baseball
 article
 writes
 cs
 team
 game
 university

Cluster 4:
 risc
 instruction
 ghhwang
 csie
 set
 nctu
 cisc
 tw
 reduced
 mq

Cluster 5:
 windows
 edu
 file
 dos
 com
 files
 card
 drivers
 driver
 use
