In [2]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';

Announcements¶

  • Homework:
    • Homework 9 out, due today
  • Final exam: Monday, May 8 from 12-2pm
    • Practice exam will be out today
  • Wednesday's class is review
  • Upcoming office hours:
    • Tomorrow: Peer tutor Rohan Anand from 1:30-3pm on the CCDS 16th floor
    • Tomorrow: Abhishek Tiwari from 3:30-4:30pm on the CCDS 13th floor

Lecture 38: Faster PageRank¶

[This lecture is based on Prof. Crovella's CS 132 lecture notes.]

The PageRank Insight¶

Links are endorsements. When a first page contains a link to a second page, that is an indication that the author of the first page thinks the second page is worth looking at. If the first and second pages both contain the same query terms, it is likely that the second page is an important page with respect to that query term.

In [12]:
#
display(Image("images/17-pagerank-quote.png", width=600))

What Page and Brin are implying is that a random surfer should visit important pages more often and unimportant pages less often.

The way to interpret this precisely is:

  1. Form the graph that encodes the connections between Web pages that are retrieved for a particular query.
  2. Construct a Markov chain that corresponds to a random walk on this graph.
  3. Rank-order the pages according to their probability in the Markov chain's steady state.

So let's try to make this work and see what happens.

Example. Assume a set of Web pages have been selected based on a text query, e.g., pages related to "personal 737 jets."

These pages have various links between them, as represented by this graph:

In [13]:
#
display(Image("images/17-deeper-pagerank-fig.jpg", width=250))

Construct the unique steady-state distribution for a random walk on this graph, if it exists. That is, construct the PageRank for this set of Web pages.

Solution.

The key question we must ask is whether a unique steady state exists.

Step 1.¶

Assume there are $n$ pages to be ranked. Construct an $n \times n$ transition matrix for the Markov chain.

Set the Markov chain transitions so that each outgoing link from a node has equal probability of being taken.

We have already seen the transition matrix for this graph:

$$P = \begin{bmatrix} 0 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1/2 & 1 \\ 0 & 0 & 1/3 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{bmatrix}$$

We have observed that this transition matrix is not regular, because for any power $P^k$ with $k > 0$, the second column will be zero.

To address this, let's ask why it happens.

The reason that column 2 of $P$ is zero is that the Web page corresponding to node 2 has no links embedded in it, so there is nowhere to go from this page. Of course this will happen a lot in an arbitrary collection of Web pages.

Note that Page and Brin say that the random surfer will occasionally "start on another random page." In other words, it seems reasonable that when reaching a page with no embedded links, the surfer chooses another page at random.

So this motivates the first adjustment to $P$:

Step 2.¶

Form the matrix $P'$ as follows: for each column in $P$ that is entirely zeros, replace it with a column in which each entry is $1/n$.

In our example:

$$P = \begin{bmatrix} 0 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1/2 & 1 \\ 0 & 0 & 1/3 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{bmatrix} \rightarrow P' = \begin{bmatrix} 0 & 1/n & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/n & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/n & 0 & 0 & 0 & 0 \\ 0 & 1/n & 0 & 0 & 1/2 & 1 \\ 0 & 1/n & 1/3 & 1/2 & 0 & 0 \\ 0 & 1/n & 0 & 1/2 & 1/2 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 0 & 0 & 0 & 0 \\ 0 & 1/6 & 0 & 0 & 1/2 & 1 \\ 0 & 1/6 & 1/3 & 1/2 & 0 & 0 \\ 0 & 1/6 & 0 & 1/2 & 1/2 & 0 \end{bmatrix}$$
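
As a concrete illustration (not from the original lecture), here is a minimal numpy sketch of this adjustment, using the $6 \times 6$ transition matrix $P$ shown above:

In [ ]:
# a minimal sketch: replace each all-zero column of P with a column of 1/n
P = np.array([
    [0,   0, 1/3, 0,   0,   0],
    [1/2, 0, 1/3, 0,   0,   0],
    [1/2, 0, 0,   0,   0,   0],
    [0,   0, 0,   0,   1/2, 1],
    [0,   0, 1/3, 1/2, 0,   0],
    [0,   0, 0,   1/2, 1/2, 0]
])
n = P.shape[0]
P_prime = P.copy()
zero_cols = np.isclose(P.sum(axis=0), 0)   # columns that sum to zero (dangling pages)
P_prime[:, zero_cols] = 1/n                # replace every entry in those columns with 1/n
print(P_prime)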

Nonetheless, even after this change, $P'$ can fail to be regular.

In other words, for an arbitrary set of web pages, there is no guarantee that their transition matrix will be regular.

Once again, let's read the words of Page and Brin closely: the surfer "eventually gets bored and starts on another random page."

Step 3.¶

In practice this means that there is a small probability that the surfer will jump from any page to any other page at random.

Let's call this small probability $\alpha$.

We can't just add $\alpha$ to every entry in $P'$, because then the columns of the new matrix would not sum to 1.

Instead we scale each entry in $P'$ by a factor of $(1-\alpha)$, and then add $\alpha/n$ to it.

So we compute the final transition matrix $P''$ as:

$$P''_{ij} = (1-\alpha)P'_{ij} + \frac{\alpha}{n}.$$

We can write this as a matrix equation:

$$P'' = (1-\alpha)P' + \frac{\alpha}{n}\mathbf{1}$$

where $\mathbf{1}$ is an $n \times n$ matrix of 1's.

In our example, let's say that $\alpha = 1/10$ (in reality it would be smaller). So $\alpha/n = 1/60$.

Then:

$$P' = \begin{bmatrix} 0 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 0 & 0 & 0 & 0 \\ 0 & 1/6 & 0 & 0 & 1/2 & 1 \\ 0 & 1/6 & 1/3 & 1/2 & 0 & 0 \\ 0 & 1/6 & 0 & 1/2 & 1/2 & 0 \end{bmatrix} \rightarrow P'' = \begin{bmatrix} 1/60 & 1/6 & 19/60 & 1/60 & 1/60 & 1/60 \\ 7/15 & 1/6 & 19/60 & 1/60 & 1/60 & 1/60 \\ 7/15 & 1/6 & 1/60 & 1/60 & 1/60 & 1/60 \\ 1/60 & 1/6 & 1/60 & 1/60 & 7/15 & 11/12 \\ 1/60 & 1/6 & 19/60 & 7/15 & 1/60 & 1/60 \\ 1/60 & 1/6 & 1/60 & 7/15 & 7/15 & 1/60 \end{bmatrix}$$
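
Continuing the numpy sketch from Step 2 (again, an illustration rather than the lecture's own code; it assumes the P_prime and n defined in that cell), the same formula gives $P''$ directly:

In [ ]:
# a minimal sketch: P'' = (1 - alpha) P' + (alpha / n) * (matrix of all ones)
alpha = 1/10
P_doubleprime = (1 - alpha) * P_prime + (alpha / n) * np.ones((n, n))
print(P_doubleprime)
print(P_doubleprime.sum(axis=0))   # each column should still sum to 1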

Obviously, $P''$ is regular, because all its entries are positive (they are at least $\alpha/n$).

$P''$ is the Markov chain that Brin and Page defined, and which is used by PageRank to rank pages in response to a Google search.

Step 4.¶

Compute the steady state of $P''$, and rank pages according to their magnitude in the resulting vector.

We can do this by solving $P''x = x$, or we can compute the eigenvectors of $P''$ and use the eigenvector that corresponds to $\lambda = 1$.

For the example $P''$, we find that the steady-state vector is:

$$x = \begin{bmatrix} 0.037 \\ 0.054 \\ 0.041 \\ 0.375 \\ 0.206 \\ 0.286 \end{bmatrix}$$

So the final ranking of pages is: 4, 6, 5, 2, 3, 1.

This is the order that PageRank would display its results, with page 4 at the top of the list.

Let's see how to do Step 4 in Python:

In [15]:
# Here is the P'' matrix as computed in steps 1 through 3.
P = np.array([
[1./60, 1./6, 19./60, 1./60, 1./60,  1./60],
[7./15, 1./6, 19./60, 1./60, 1./60,  1./60],
[7./15, 1./6,  1./60, 1./60, 1./60,  1./60],
[1./60, 1./6,  1./60, 1./60, 7./15, 11./12],
[1./60, 1./6, 19./60, 7./15, 1./60,  1./60],
[1./60, 1./6,  1./60, 7./15, 7./15,  1./60]
])
eigenvalues, eigenvectors = np.linalg.eig(P)
print(np.real(eigenvalues))
[ 1.          0.61008601 -0.08958752 -0.37049849 -0.45       -0.45      ]
In [16]:
# find the location of the largest eigenvalue (1), 
# by computing the indices that would sort the eigenvalues
# from smallest to largest
indices = np.argsort(eigenvalues)
# and take the index of the largest eigenvalue
principal = indices[-1]
print(principal)
0
In [17]:
# using the index of the largest eigenvalue, extract
# the corresponding eigenvector (the steady state vector)
steadyState = np.real(eigenvectors[:,principal])
steadyState = steadyState/np.sum(steadyState)
print(steadyState)
[0.03721197 0.05395735 0.04150565 0.37508082 0.20599833 0.28624589]
In [18]:
# find the order of the pages in the steady state vector
# this function sorts from smallest to largest (reverse of what we want)
reverseOrder = np.argsort(steadyState)
print(reverseOrder)
[0 2 1 4 5 3]
In [19]:
# reverse the order to get the most important page first
# and add one to convert from zero indexing to indexing of example
order = 1 + reverseOrder[::-1]
print('final order = {}'.format(order))
print('importance = {}'.format(steadyState[order-1]))
final order = [4 6 5 2 3 1]
importance = [0.37508082 0.28624589 0.20599833 0.05395735 0.04150565 0.03721197]
In [14]:
#
display(Image("images/17-deeper-pagerank-fig.jpg", width=250))

38.1 Computing PageRank: the Power Method¶

From a mathematical standpoint, we are done!

However, from a Computing & Data Sciences standpoint, there are still some issues.

The most significant issue is simply this: PageRank results must be provided very quickly. Search engines are in competition and speed is a competitive advantage.

Here is an example Google search:

In [6]:
#
display(Image("images/17-sample-google-search.jpg", width=800))

Notice that the search returned about 400,000 results!

Recall that using Gaussian elimination to solve $Ax = b$ takes about $\frac{2}{3}n^3$ operations.

In this case, apparently $n = 400{,}000$.

So computing the PageRank in the straightforward way we've described would take about 42,667,000,000,000,000 operations.

Assuming a 2GHz CPU, that's on the order of eight months.

In [22]:
# months required: total flops / (flops per second * seconds per month)
((2./3)*(400000**3))/((2*10**9)*(3600*24*30))
Out[22]:
8.23045267489712

We need a faster way to compute the PageRank!

Here is an important point: we only need the principal eigenvector (the one corresponding to $\lambda = 1$).

Let's review how a Markov chain gets to steady state. As we discussed at the beginning of the lecture, the state of the chain at any step $k$ is given by

$$x_k = c_1 v_1 \lambda_1^k + c_2 v_2 \lambda_2^k + \cdots + c_n v_n \lambda_n^k.$$

Let's assume that $\lambda_1$ is the eigenvalue 1. If the chain converges to steady state, then we know that all eigenvalues other than $\lambda_1$ are less than 1 in magnitude.

Of course, if $|\lambda_i| < 1$,

$$\lim_{k\to\infty} \lambda_i^k = 0.$$

So:

$$\lim_{k\to\infty} x_k = c_1 v_1.$$

Note that $c_1$ is just a constant that doesn't affect the relative sizes of the components of $x_k$ in the limit of large $k$.

This is another way of stating that the Markov chain goes to steady state no matter what the starting state is.

This observation suggests another way to compute the steady state of the chain:

  1. Start from a random state $x_0$.
  2. Compute $x_{k+1} = Ax_k$ for $k = 0, 1, 2, 3, \ldots$

How do we know when to stop in Step 2?

Since we are looking for a steady state, we can stop when the difference between $x_{k+1}$ and $x_k$ is small.

This is called the power method.

Why is this a better method?

Keep in mind that the number of flops in matrix-vector multiplication is about $2n^2$.

This is compared to $\frac{2}{3}n^3$ for solving a system (finding the eigenvector directly).

Let's say that after computing

$$x_1 = Ax_0,\quad x_2 = Ax_1,\quad x_3 = Ax_2,\quad \ldots,\quad x_{10} = Ax_9$$

we find that $x_{10}$ is sufficiently close to $x_9$.

How much work did we do?

We did 10 matrix-vector multiplications, or $20n^2$ flops.

So the power method is

$$\frac{\frac{2}{3}n^3}{20n^2} = \frac{n}{30}$$

times faster than the direct method.

For our example, $n/30 = 13{,}333$. So this trick reduces the running time from 8 months down to 27 minutes.

In [23]:
# minutes required for 10 matrix-vector multiplies (20 n^2 flops) at 2 GHz
20*400000.**2/((2*10**9)*(60))
Out[23]:
26.666666666666668
In [24]:
# Given time, talk about sparse matrix-vector multiply.
# This is linear in n, with a constant equal to average degree (say 10)
# and gets the computation down to 40 milliseconds
# but that requires explaining why P is not really dense
# (it can be expressed as a sparse matrix plus a constant matrix)
20*10*400000./((2*10**9))
Out[24]:
0.04

This is an example of an iterative method. Iterative methods are often the preferred approach for solving linear algebra problems in the real world.

One final thing: how exactly do we decide when to stop iterating in the power method?

One simple way is to add up the absolute differences of the components of $x_{k+1} - x_k$:

$$s = \sum_{i=1}^{n} |x_{k+1,i} - x_{k,i}|$$

and compare it to the sum of the absolute values of the components of $x_k$:

$$d = \sum_{i=1}^{n} |x_{k,i}|$$

If $s/d$ is small (say, less than 0.001), then we can conclude that $x_{k+1}$ is close enough to $x_k$ for us to stop iterating.
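
Here is a minimal sketch of the power method using this stopping rule, written as a hypothetical helper function and applied to the matrix P (that is, $P''$) defined in the code cell above; the tolerance of 0.001 and the uniform starting vector are illustrative choices:

In [ ]:
# a minimal power-method sketch using the s/d stopping rule described above
def power_method(A, tol=0.001, max_iter=1000):
    n = A.shape[0]
    x = np.ones(n) / n                      # start from the uniform distribution
    for _ in range(max_iter):
        x_new = A @ x
        s = np.sum(np.abs(x_new - x))       # total change between iterates
        d = np.sum(np.abs(x))               # size of the current iterate
        x = x_new
        if s / d < tol:
            break
    return x / np.sum(x)                    # normalize so the entries sum to 1

print(power_method(P))   # should be close to the steady state computed earlier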

So the power method is fast, making it the algorithm of choice for a company like Google. It is also easy to implement, and easy to parallelize across multiple machines.

38.2 Course Evaluation¶

BU requires that we devote some class time to course evaluations. Your input helps CDS grow!

https://bu.campuslabs.com/courseeval/

In [5]:
#
display(Image("images/ds121s23-course-evals.png", width=250))

Exam review¶

Linear independence¶

A set of vectors $\{v_1, \ldots, v_p\}$, all of which are in $\mathbb{R}^n$, is said to be linearly dependent if there exist weights $\{c_1, \ldots, c_p\}$, not all zero, such that

$$c_1 v_1 + \cdots + c_p v_p = 0.$$

A set $S = \{v_1, \ldots, v_p\}$ of two or more vectors is linearly dependent if and only if at least one of the vectors in $S$ is a linear combination of the others.
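
A quick way to check this numerically (an illustration with made-up vectors, not from the lecture) is to compare the rank of the matrix whose columns are the vectors to the number of vectors:

In [ ]:
# a quick check for linear dependence: rank < number of vectors
# (made-up example where v3 = v1 + v2)
V = np.column_stack(([1, 0, 1], [2, 1, 0], [3, 1, 1]))
print(np.linalg.matrix_rank(V))   # rank 2 with 3 vectors, so the set is linearly dependent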

Gaussian Elimination¶

Definition. A matrix is in echelon form (or row echelon form) if it has the following three properties:

  1. All nonzero rows are above any rows of all zeros.
  2. Each leading entry of a row is in a column to the right of the leading entry of the row above it.
  3. All entries in a column below a leading entry are zeros.

For example:

$$\begin{bmatrix} 0 & \blacksquare & * & * & * & * & * & * & * & * \\ 0 & 0 & 0 & \blacksquare & * & * & * & * & * & * \\ 0 & 0 & 0 & 0 & \blacksquare & * & * & * & * & * \\ 0 & 0 & 0 & 0 & 0 & \blacksquare & * & * & * & * \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \blacksquare & * \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

In this diagram, the leading entries ($\blacksquare$) are nonzero, and the $*$ symbols can be any value.

This definition is a special case of an upper triangular matrix.

The goal of the first step of Gaussian elimination is to convert the augmented matrix into echelon form.

Definition: A matrix is in reduced echelon form (or reduced row echelon form) if it is in echelon form, and additionally:

  1. The leading entry in each nonzero row is 1.
  2. Each leading 1 is the only nonzero entry in its column.

For example:

$$\begin{bmatrix} 0 & 1 & * & 0 & 0 & 0 & * & * & 0 & * \\ 0 & 0 & 0 & 1 & 0 & 0 & * & * & 0 & * \\ 0 & 0 & 0 & 0 & 1 & 0 & * & * & 0 & * \\ 0 & 0 & 0 & 0 & 0 & 1 & * & * & 0 & * \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & * \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

The goal of the second step of Gaussian elimination is to convert the matrix into reduced echelon form.

Example 1¶

In our first example, let the input matrix $A$ be

$$\begin{bmatrix} 0 & 3 & 4 & -5 \\ 3 & -7 & 8 & 9 \\ 3 & -9 & 6 & 15 \end{bmatrix}$$

Stage 1 (Elimination)

Start with the first row ($i=1$). The leftmost nonzero entry in row 1 or below is in column 1. But since it is not in row 1, we need to swap. We'll swap rows 1 and 3 (we could have swapped rows 1 and 2).

$$\begin{bmatrix} 3 & -9 & 6 & 15 \\ 3 & -7 & 8 & 9 \\ 0 & 3 & 4 & -5 \end{bmatrix}$$

The pivot is the 3 in row 1, column 1. Use row reduction operations to create zeros below the pivot. In this case, that means subtracting row 1 from row 2.

$$\begin{bmatrix} 3 & -9 & 6 & 15 \\ 0 & 2 & 2 & -6 \\ 0 & 3 & 4 & -5 \end{bmatrix}$$

Now $i=2$. The pivot is the 2 in row 2, column 2 (no swap is needed). Use row reduction to create zeros below the pivot. To do so, we subtract $3/2$ times row 2 from row 3.

$$\begin{bmatrix} 3 & -9 & 6 & 15 \\ 0 & 2 & 2 & -6 \\ 0 & 0 & 1 & 4 \end{bmatrix}$$

Now $i=3$. Since it is the last row, we are done with Stage 1. The pivots are the leading entries 3, 2, and 1:

$$\begin{bmatrix} 3 & -9 & 6 & 15 \\ 0 & 2 & 2 & -6 \\ 0 & 0 & 1 & 4 \end{bmatrix}$$

Stage 2 (Backsubstitution)

Start again with the first row ($i=1$) and divide row 1 by its pivot.

$$\begin{bmatrix} 1 & -3 & 2 & 5 \\ 0 & 2 & 2 & -6 \\ 0 & 0 & 1 & 4 \end{bmatrix}$$

Move to the next row ($i=2$) and divide row 2 by its pivot.

$$\begin{bmatrix} 1 & -3 & 2 & 5 \\ 0 & 1 & 1 & -3 \\ 0 & 0 & 1 & 4 \end{bmatrix}$$

And use row reduction operations to create zeros in all elements above the pivot. In this case, that means adding 3 times row 2 to row 1.

$$\begin{bmatrix} 1 & 0 & 5 & -4 \\ 0 & 1 & 1 & -3 \\ 0 & 0 & 1 & 4 \end{bmatrix}$$

Move to the next row ($i=3$). The pivot is already 1, so we subtract row 3 from row 2, and subtract 5 times row 3 from row 1.

$$\begin{bmatrix} 1 & 0 & 0 & -24 \\ 0 & 1 & 0 & -7 \\ 0 & 0 & 1 & 4 \end{bmatrix}$$

This matrix is in reduced row echelon form, so we are done with GE. The solution is:

$$x_1 = -24,\quad x_2 = -7,\quad x_3 = 4$$
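
As a quick check (not part of the original worked example), numpy's solver on the coefficient matrix and right-hand side taken from the augmented matrix above gives the same answer:

In [ ]:
# verify Example 1 numerically
A = np.array([[0, 3, 4],
              [3, -7, 8],
              [3, -9, 6]])
b = np.array([-5, 9, 15])
print(np.linalg.solve(A, b))   # expect approximately [-24, -7, 4]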

Example 2¶

Let's assume that the augmented matrix of a system has been transformed into the equivalent reduced echelon form:

$$\begin{bmatrix} 1 & 0 & -5 & 1 \\ 0 & 1 & 1 & 4 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

This system is consistent. Is the solution unique?

The associated system of equations is

$$\begin{aligned} x_1 - 5x_3 &= 1 \\ x_2 + x_3 &= 4 \\ 0 &= 0 \end{aligned}$$

Variables $x_1$ and $x_2$ correspond to pivot columns. They are called basic variables. The other variable, $x_3$, is a free variable.

Whenever a system is consistent, the solution set can be described explicitly by solving the reduced system of equations for the basic variables in terms of the free variables.

This operation is possible because the reduced echelon form places each basic variable in one and only one equation.

In the example, solve the first and second equations for $x_1$ and $x_2$. Ignore the third equation; it offers no restriction on the variables.

So the solution set is:

$$\begin{aligned} x_1 &= 1 + 5x_3 \\ x_2 &= 4 - x_3 \\ x_3 &\text{ is free} \end{aligned}$$

"x3x3 is free" means you can choose any value for x3x3.

In other words, there is an infinite set of solutions to this linear system. Each solution corresponds to one particular value of $x_3$.

For instance,

  • when $x_3 = 0$, the solution is $(1, 4, 0)$;
  • when $x_3 = 1$, the solution is $(6, 3, 1)$.

How many solutions will a system have?¶

$$\begin{bmatrix} 1 & 0 & -5 & 1 \\ 0 & 1 & 1 & 4 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
$$\begin{bmatrix} 1 & 0 & -2 & 3 & 0 & -24 \\ 0 & 1 & -2 & 2 & 0 & -7 \\ 0 & 0 & 0 & 0 & 1 & 4 \end{bmatrix}$$
$$\begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & -2 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
$$\begin{bmatrix} 2 & -3 & 1 \\ 3 & -2 & 4 \\ 1 & -1 & 1 \end{bmatrix}$$

Given a system of $m$ equations in $n$ unknowns, let $A$ be the $m \times (n+1)$ augmented matrix. Let $r$ be the number of pivot positions in the REF of $A$.

  • If $r = n$, there is a unique solution (no parameters in the solution).
  • If $r > n$ (so $r = n+1$), the system is inconsistent (no solution).
  • If $r < n$, either the system is inconsistent (no solution) or it has a solution with $n - r$ parameters.

Vector geometry¶

Length is such a fundamental concept that we introduce a special notation and name for it.

Definition. The norm of $v$ is the nonnegative scalar $\|v\|$ defined by

$$\|v\| = \sqrt{v^T v} = \sqrt{\sum_{i=1}^n v_i^2}.$$

Definition. For $u$ and $v$ in $\mathbb{R}^n$, the distance between $u$ and $v$, written as $\text{dist}(u,v)$, is the length of the vector $u - v$. That is,

$$\text{dist}(u,v) = \|u - v\|.$$

This definition agrees with the usual formulas for the Euclidean distance between two points. The usual formula is

$$\text{dist}(u,v) = \sqrt{(v_1-u_1)^2 + (v_2-u_2)^2 + \cdots + (v_n-u_n)^2}.$$

Which you can see is equal to

$$\|u-v\| = \sqrt{(u-v)^T(u-v)} = \sqrt{\begin{bmatrix} u_1-v_1 & u_2-v_2 & \ldots & u_n-v_n \end{bmatrix} \begin{bmatrix} u_1-v_1 \\ u_2-v_2 \\ \vdots \\ u_n-v_n \end{bmatrix}}$$

Definition. Two vectors $u$ and $v$ in $\mathbb{R}^n$ are orthogonal to each other if $u^T v = 0$.
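
Here is a small numpy illustration of these three definitions, with made-up vectors:

In [ ]:
# norm, distance, and orthogonality on made-up vectors
u = np.array([3.0, 4.0])
v = np.array([1.0, 1.0])
print(np.linalg.norm(u))        # ||u|| = 5.0
print(np.linalg.norm(u - v))    # dist(u, v) = ||u - v||
w = np.array([-4.0, 3.0])
print(u @ w)                    # u^T w = 0, so u and w are orthogonal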

Clustering¶

Clustering is a very important way of discovering structure in data. It represents an example of unsupervised learning.

  • Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.

  • Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.

The high-level goal of our clustering algorithm is to:

  • minimize intra-cluster distances
  • maximize inter-cluster distances

Centroid and Variance¶

To measure how close or different many vectors are in space, let's define the idea of a dataset centroid.

Given $n$ different vectors $x_1, \ldots, x_n$, each with the same dimension $d$, the centroid is the average of the vectors taken componentwise.

$$\bar{x} = \frac{1}{n}\sum_i x_i.$$

In other words: the centroid is the "center of mass" of the dataset.

Next, we define the sample variance of a dataset $\{x_1, \ldots, x_n\}$ as:

$$\operatorname{Var}(X) = \frac{1}{n}\sum_j \|x_j - \bar{x}\|^2.$$

In other words, the sample variance of the set of points is the average squared distance from each point to the centroid.
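
A quick numpy illustration of the centroid and sample variance, using a small made-up dataset whose rows are the points:

In [ ]:
# centroid and sample variance of a small made-up dataset (one point per row)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 0.0]])
centroid = X.mean(axis=0)                               # componentwise average
variance = np.mean(np.sum((X - centroid)**2, axis=1))   # mean squared distance to the centroid
print(centroid, variance)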

$k$-means Problem:¶

Find $k$ points $c_1, \ldots, c_k$ (called centers, centroids, or means), so that the cost

$$\sum_{i=1}^n \min_j \|x_i - c_j\|^2$$

is minimized.

Equivalently: we can think in terms of the partition itself.

Consider the set $X = \{x_1, \ldots, x_n\}$ where $x_i \in \mathbb{R}^d$.

Find $k$ points $c_1, \ldots, c_k$

and partition the set $X$ into $k$ different subsets $\{X_1, \ldots, X_k\}$ by assigning each point $x_i$ to its nearest cluster center,

so that the cost

$$\sum_{i=1}^n \min_j \|x_i - c_j\|^2 = \sum_{j=1}^k \sum_{x \in X_j} \|x - c_j\|^2$$

is minimized.

The (approximate) $k$-means algorithm:

  1. Pick $k$ cluster centers $\{c_1, \ldots, c_k\}$. These can be chosen randomly, or by some other method.
  2. For each $j$, define the cluster $X_j$ as the set of points in $X$ that are closest to center $c_j$ (nearer to $c_j$ than to any other center).
  3. For each $j$, redefine $c_j$ to be the center of mass of cluster $X_j$ (in other words, $c_j$ is the mean of the vectors in $X_j$).
  4. Repeat (i.e., go to Step 2) until convergence.
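
For reference, here is a minimal sketch of running $k$-means with scikit-learn; the synthetic blob data and the parameter choices are illustrative, not from the lecture:

In [ ]:
# a minimal k-means sketch with scikit-learn (synthetic data for illustration)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X_demo, y_demo = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
print(km.cluster_centers_)   # the k centers c_1, ..., c_k
print(km.inertia_)           # the k-means cost: sum of squared distances to nearest center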

In [8]:
#
display(Image("images/20-kmeans-example.png", width=600))

Limitations of $k$-means¶

As you can see, $k$-means can work very well.

However, we don't have any guarantees on the performance of $k$-means.

In particular, there are various settings in which $k$-means can fail to do a good job.

  1. $k$-means tries to find spherical clusters.

    Because each point is assigned to its closest center, the points in a cluster are implicitly assumed to be arranged in a sphere around the center.

In [15]:
#
display(Image("images/20-kmeans-nonspherical-clusters.png", width=600))
  2. $k$-means tries to find equal-sized clusters.

    For the same reason, the sizes of clusters are implicitly assumed to be approximately equal.

In [16]:
#
display(Image("images/20-kmeans-cluster-size.png", width=600))
  3. $k$-means is sensitive to the starting cluster centers.

    If the initial guess (Step 1) is a bad one, $k$-means may get "stuck" in a bad solution.

In [17]:
#
display(Image("images/20-kmeans-bad-initialization.png", width=600))

Evaluating $k$-means¶

We discussed two metrics to test how well $k$-means clustering has done... and as a result, to find the best $k$.

  • If ground truth is known, we can use the Rand Index: a similarity measure that allows us to compare the ground truth $T$ versus the result of $k$-means clustering $C$.

    If $a$ denotes the number of pairs of points that have the same label in both $T$ and $C$, and $b$ denotes the number of pairs that have different labels in both $T$ and $C$, then the Rand Index is:

$$RI(T,C) = \frac{a + b}{\binom{n}{2}}$$
In [14]:
#
figs, axs = plt.subplots(1, 2, figsize = (12, 5))
df_rand_gt.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[0],
                   colorbar = False)
axs[0].set_title('Ground Truth (T)')
axs[0].set_axis_off()
df_rand_clust.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[1],
                  colorbar = False)
axs[1].set_title('Clustering (C)')
axs[1].set_axis_off();
  • If ground truth is not known, we can use the Silhouette Coefficient: an attempt to measure when a model produces "better defined" clusters, in the sense that points in the same cluster are close to each other and far from the other clusters.

    If $a$ equals the mean distance between a data point and all other points in the same cluster, and $b$ equals the mean distance between a data point and all other points in the next nearest cluster, then the Silhouette Coefficient for a clustering is:

$$s = \frac{b - a}{\max(a, b)}$$
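
Both measures are available in scikit-learn. Here is a minimal sketch on synthetic data (the data and model fit are illustrative; rand_score requires scikit-learn 0.24 or later):

In [ ]:
# Rand Index and Silhouette Coefficient with scikit-learn (synthetic data)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, silhouette_score

X_demo, y_true = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

print(rand_score(y_true, labels))         # compares the clustering C to ground truth T
print(silhouette_score(X_demo, labels))   # needs no ground truth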

Hierarchical Clustering¶

A hierarchical clustering produces a set of nested clusters organized into a tree.

A hierarchical clustering is visualized using a dendrogram

  • A tree-like diagram that records the containment relations among clusters.
In [5]:
#
display(Image("images/21-dendrogram.png", width=600))

Strengths of Hierarchical Clustering¶

Hierarchical clustering has a number of advantages:

First, a hierarchical clustering encodes many different clusterings. That is, it does not itself decide on the correct number of clusters.

A clustering is obtained by "cutting" the dendrogram at some level.

This means that you can make this crucial decision yourself, by inspecting the dendrogram.

Put another way, you can obtain any desired number of clusters.

In [6]:
#
display(Image("images/21-dendrogram-cut.png", width=600))

The second advantage is that the dendrogram may itself correspond to a meaningful structure, for example, a taxonomy.

In [7]:
#
display(Image("images/21-animal-taxonomy.jpg", width=600))

Building a dendrogram¶

Agglomerative Clustering ("bottom-up"):

  • Start by defining each point as its own cluster
  • At each successive step, merge the two clusters that are closest to each other (according to some rule to be determined)
  • Repeat until only one cluster is left.

Divisive Clustering ("top-down"):

  • Start with one, all-inclusive cluster
  • At each step, find the cluster split that creates the largest distance between resulting clusters
  • Repeat until each point is in its own cluster.
In [9]:
# Single, Complete, and Average Linkage
display(Image("images/21-hierarchical-criteria.png", width=1000))
  • Single-Linkage: the distance between two clusters is the distance between the closest two points that are in different clusters.
    $$D_{\text{single}}(i,j) = \min_{x,y}\{d(x,y) \mid x \in C_i, y \in C_j\}$$
  • Complete-Linkage: the distance between two clusters is the distance between the farthest two points that are in different clusters.
    $$D_{\text{complete}}(i,j) = \max_{x,y}\{d(x,y) \mid x \in C_i, y \in C_j\}$$
  • Average-Linkage: the distance between two clusters is the average distance between all pairs of points from different clusters.
    $$D_{\text{average}}(i,j) = \frac{1}{|C_i| \cdot |C_j|}\sum_{x \in C_i,\, y \in C_j} d(x,y)$$
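
These linkage rules are implemented in scipy's hierarchical clustering module; here is a minimal sketch on a small made-up dataset (the points and the choice of average linkage are illustrative):

In [ ]:
# a minimal agglomerative clustering sketch with scipy (made-up 2-D points)
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
Z = linkage(pts, method='average')              # 'single' and 'complete' also work
dendrogram(Z)                                   # draw the dendrogram
print(fcluster(Z, t=2, criterion='maxclust'))   # cut the tree into 2 clusters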

Supervised Learning¶

The supervised learning problem in general is:

  • You are given some example data, which we'll think of abstractly as tuples $\{(x_i, y_i) \mid i = 1, \ldots, N\}$.

  • Your goal is to learn a rule that allows you to predict $y_j$ for some $x_j$ that is not in the example data you were given.

When the values we are predicting are continuous, we call this a regression problem.

The class of models we looked at is polynomials. They are of the form:

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k = \sum_{j=0}^k w_j x^j$$

where $k$ is the order of the polynomial.

Choosing $k$ is called model selection, and $w_0, w_1, \ldots, w_k$ are called the parameters of the model.
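
The helper functions design_matrix and fit_poly used in the cells below are not shown in this excerpt. Here is one possible implementation, a least-squares fit against the polynomial design matrix (an assumption about how they work, not the lecture's own code):

In [ ]:
# possible implementations of the helpers used below (assumed, not the original code)
def design_matrix(x, k):
    """One row per data point; columns are x**0, x**1, ..., x**k."""
    return np.array([[xi**j for j in range(k + 1)] for xi in x])

def fit_poly(x, y, k):
    """Least-squares fit of a degree-k polynomial to (x, y); returns the weight vector w."""
    X = design_matrix(x, k)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w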

In [96]:
#
fig, axs = plt.subplots(1, 3, sharey = True, figsize = (12, 5))
#
cy = 1000 * [w_hat_0]
pred_y = N * [w_hat_0]
axs[0].plot(cx, cy, lw = 2, label = r'$k$ = 0')
axs[0].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[0].set_xlabel('x', size = 16)
axs[0].set_ylabel('y', size = 16)
axs[0].set_title(r'$k$ = 0, constant' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[0].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 1) @ w_hat_1
pred_y = design_matrix(x, 1) @ w_hat_1
axs[1].plot(cx, cy, lw = 2, label = r'$k$ = 1')
axs[1].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[1].set_xlabel('x', size = 16)
axs[1].set_title(r'$k$ = 1, linear' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[1].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 3) @ w_hat_3
pred_y = design_matrix(x, 3) @ w_hat_3
axs[2].plot(cx, cy, lw = 2, label = r'$k$ = 3')
axs[2].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[2].set_xlabel('x', size = 16)
axs[2].set_title('$k$ = 3, cubic' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[2].legend(loc = 'best', fontsize = 16)
#
fig.tight_layout();

If we overfit our model to our training data, this leads to poor generalizability. To avoid overfitting, we might:

  1. Increase the amount of training data.

  2. Limit the complexity of the model.

In [97]:
#
w_hat_9 = fit_poly(x, y, 9)
cy = design_matrix(cx, 9) @ w_hat_9
plt.plot(cx, cy, lw = 2, label = r'$k$ = 9')
plt.plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
plt.xlabel('x', size = 16)
plt.ylabel('y', size = 16)
plt.title(r'$k$ = 9' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(0));

We can hold out data to test our model selection. For example, we might do the following:

  • For each possible value of the hyperparameter $k$:
    • randomly split the data 5 times
    • compute the mean testing and training error over the 5 random splits
In [98]:
#
test_y = np.sin(2 * np.pi * x) + default_rng(8).normal(size = N, scale = 0.20)
max_k = N
train_err = [np.linalg.norm(y - N * [w_hat_0])]
test_err = [np.linalg.norm(test_y - N * [w_hat_0])]
for k in range(1, max_k):
    w_hat = fit_poly(x, y, k)
    pred_y = design_matrix(x, k) @ w_hat
    train_err.append(np.linalg.norm(y - pred_y))
    test_err.append(np.linalg.norm(test_y - pred_y))
plt.plot(range(max_k), test_err, 'ro-', label = 'Testing Error')
plt.plot(range(max_k), train_err, 'bo-', label = 'Training Error')
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'$E(\mathbf{w}^*)$')
plt.legend(loc = 'best');

In cross-validation, the data is partitioned once, and then each partition is used as the test data once.

This ensures that all the data gets equal weight in the training and in the testing.

We divide the data into $k$ "folds".

In [4]:
#
display(Image("images/22-k-fold.png", width=500))

Decision trees¶

In [7]:
#
display(Image("images/22-DT-Example-2.png", width=800))

Hunt's Algorithm¶

Hunt's Algorithm builds the tree node by node, starting from the root.

As we build the tree, we divide the training data up.

Let $D_t$ be the set of training records that reach node $t$.

  • If $D_t$ contains records that all belong to a single class $y_t$, then $t$ is a leaf node labeled as $y_t$.
  • If $D_t$ is an empty set, then $t$ is a leaf node labeled by the default class $y_d$.
  • If $D_t$ contains records that belong to more than one class, use an attribute to split $D_t$ into smaller subsets, and assign that splitting rule to node $t$.

Recursively apply the above procedure until a stopping criterion is met.
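
Standard decision-tree learners such as scikit-learn's DecisionTreeClassifier are built on this same recursive-splitting idea. Here is a minimal sketch on the iris dataset (the dataset and parameters are illustrative choices, not from the lecture):

In [ ]:
# a minimal decision-tree sketch with scikit-learn (illustrative data and parameters)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
print(tree.predict(iris.data[:5]))   # predicted classes for the first 5 records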

$k$-Nearest Neighbors¶

In [4]:
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.plot(test_X[0], test_X[1], 'ok')
plt.annotate('Test Point', test_X, [75, 25], 
             textcoords = 'offset points', fontsize = 14, 
             arrowprops = {'arrowstyle': '->'})
plt.axis('equal')
plt.axis('off')
plt.title('Training Points: 2 Classes');
In [5]:
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.plot(test_X[0], test_X[1], 'ok')
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.5, facecolor = 'red', alpha = 0.2)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('1-Nearest-Neighbor: Classification: Red');
In [6]:
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok')
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.9, facecolor = 'gray', alpha = 0.3)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('2-Nearest-Neighbor');
In [7]:
#
plt.figure()
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 1.4, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok')
plt.axis('equal')
plt.axis('off')
plt.title('3-Nearest-Neighbor: Classification: Blue');

Challenges for $k$-NN¶

Working with a $k$-NN classifier can involve some challenges.

  1. First, the computational cost of classification grows with the size of the training data. Training is trivial but classification can be prohibitively expensive.
  2. Second, since Euclidean distance is the most common distance function used, data scaling is important.
  3. Third, we need to deal with the curse of dimensionality -- problems arise when we use geometric algorithms in high-dimensional space.

Evaluating Classification Methods¶

Next we'll look at two classification methods in practice:

  • Decision Trees, and
  • k-Nearest Neighbors.

To compare these methods, the question arises:

How do we evaluate a classifier?

Using the confusion matrix we can define some useful measures (a short scikit-learn sketch follows the figure below):

  • Recall - defined as the fraction of actual positives that are correctly classified:
    • TP/(TP + FN)
  • Precision - defined as the fraction of predicted positives that are actually positive:
    • TP/(TP + FP)
In [11]:
#
display(Image("images/23-confusion-matrix.png", width=300))
In [ ]: