#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
[This lecture is based on Prof. Crovella's CS 132 lecture notes.]
Links are endorsements. When a first page contains a link to a second page, that is an indication that the author of the first page thinks the second page is worth looking at. If the first and second pages both contain the same query terms, it is likely that the second page is an important page with respect to that query term.
#
display(Image("images/17-pagerank-quote.png", width=600))
What they are implying is that a random surfer should visit important pages more often and unimportant pages less often.
The way to interpret this precisely is: the importance of a page is the probability that a random surfer is visiting it in the steady state of the corresponding Markov chain.
So let's try to make this work and see what happens.
Example. Assume a set of Web pages have been selected based on a text query, e.g., pages related to "personal 737 jets."
These pages have various links between them, as represented by this graph:
#
display(Image("images/17-deeper-pagerank-fig.jpg", width=250))
Construct the unique steady-state distribution for a random walk on this graph, if it exists. That is, construct the PageRank for this set of Web pages.
Solution.
The key question we must ask is whether a unique steady state exists.
Step 1. Assume there are $n$ pages to be ranked, and construct an $n \times n$ transition matrix $P$ for the Markov chain.
Set the Markov chain transitions so that each outgoing link from a node has equal probability of being taken.
We have already seen the transition matrix for this graph:

$$P = \begin{bmatrix} 0 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1/2 & 1 \\ 0 & 0 & 1/3 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{bmatrix}$$

We have observed that this transition matrix is not regular, because for any power $k$, the second column of $P^k$ will be zero.
To address this, let's ask why it happens.
The reason that column 2 of $P$ is zero is that the Web page corresponding to node 2 has no links embedded in it, so there is nowhere to go from this page. Of course, this will happen a lot in an arbitrary collection of Web pages.
Note that Page and Brin say that the random surfer will occasionally "start on another random page." In other words, it seems reasonable that when reaching a page with no embedded links, the surfer chooses another page at random.
So this motivates the first adjustment to $P$:
Step 2. Form the matrix $P'$ as follows: for each column of $P$ that is entirely zeros, replace it with a column in which each entry is $1/n$.
In our example:

$$P' = \begin{bmatrix} 0 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 0 & 0 & 0 & 0 \\ 0 & 1/6 & 0 & 0 & 1/2 & 1 \\ 0 & 1/6 & 1/3 & 1/2 & 0 & 0 \\ 0 & 1/6 & 0 & 1/2 & 1/2 & 0 \end{bmatrix}$$

Nonetheless, even after this change, $P'$ can fail to be regular.
In other words, for an arbitrary set of web pages, there is no guarantee that their transition matrix will be regular.
Once again, let's read the words of Page and Brin closely: the surfer "eventually gets bored and starts on another random page."
In practice this means that there is a small probability that the surfer will jump from any page to any other page at random.
Let's call this small probability $\alpha$.
We can't just add $\alpha/n$ to every entry of $P'$, because then the columns of the new matrix would not sum to 1.
Instead, we scale each entry of $P'$ by a factor of $(1-\alpha)$, and then add $\alpha/n$ to it.
Step 3. So we compute the final transition matrix $P''$ entrywise as:

$$P''_{ij} = (1-\alpha)\,P'_{ij} + \frac{\alpha}{n}.$$

We can write this as a matrix equation:

$$P'' = (1-\alpha)\,P' + \frac{\alpha}{n}\,\mathbf{1},$$

where $\mathbf{1}$ is an $n \times n$ matrix of 1's.
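To make this construction concrete, here is a minimal NumPy sketch (the function name `google_matrix` and its default `alpha` are our own choices, not part of the lecture code). Applied to the link matrix $P$ above with $\alpha = 0.1$, it should reproduce the $P''$ used in the code below.
#
import numpy as np

def google_matrix(P, alpha=0.1):
    """Apply both adjustments to a column-stochastic link matrix P."""
    n = P.shape[0]
    # Adjustment 1: replace all-zero (dangling) columns with 1/n in every entry
    P1 = P.astype(float).copy()
    dangling = (P1.sum(axis=0) == 0)
    P1[:, dangling] = 1.0 / n
    # Adjustment 2: mix with the matrix of all 1/n entries (the random jump)
    return (1 - alpha) * P1 + alpha / n * np.ones((n, n))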
In our example, let's say that $\alpha = 0.1$ (in reality it would be smaller). So $\frac{\alpha}{n} = \frac{0.1}{6} = \frac{1}{60}$.
Then:

$$P'' = \begin{bmatrix} 1/60 & 1/6 & 19/60 & 1/60 & 1/60 & 1/60 \\ 7/15 & 1/6 & 19/60 & 1/60 & 1/60 & 1/60 \\ 7/15 & 1/6 & 1/60 & 1/60 & 1/60 & 1/60 \\ 1/60 & 1/6 & 1/60 & 1/60 & 7/15 & 11/12 \\ 1/60 & 1/6 & 19/60 & 7/15 & 1/60 & 1/60 \\ 1/60 & 1/6 & 1/60 & 7/15 & 7/15 & 1/60 \end{bmatrix}$$

Obviously, $P''$ is regular, because all its entries are positive (each is at least $\alpha/n = 1/60$).
$P''$ is the Markov chain that Brin and Page defined, and it is the one PageRank uses to rank pages in response to a Google search.
Step 4. Compute the steady state of $P''$, and rank pages according to their magnitudes in the steady-state vector.
We can do this by solving $P''\mathbf{x} = \mathbf{x}$, or we can compute the eigenvectors of $P''$ and use the eigenvector that corresponds to $\lambda = 1$.
For the example $P''$, we find that the steady-state vector is approximately

$$\mathbf{x} \approx (0.037,\; 0.054,\; 0.042,\; 0.375,\; 0.206,\; 0.286).$$
So the final ranking of pages is: 4, 6, 5, 2, 3, 1.
This is the order that PageRank would display its results, with page 4 at the top of the list.
Let's see how to do Step 4 in Python:
# Here is the P'' matrix as computed in steps 1 through 3.
P = np.array([
[1./60, 1./6, 19./60, 1./60, 1./60, 1./60],
[7./15, 1./6, 19./60, 1./60, 1./60, 1./60],
[7./15, 1./6, 1./60, 1./60, 1./60, 1./60],
[1./60, 1./6, 1./60, 1./60, 7./15, 11./12],
[1./60, 1./6, 19./60, 7./15, 1./60, 1./60],
[1./60, 1./6, 1./60, 7./15, 7./15, 1./60]
])
eigenvalues, eigenvectors = np.linalg.eig(P)
print(np.real(eigenvalues))
[ 1. 0.61008601 -0.08958752 -0.37049849 -0.45 -0.45 ]
# find the location of the largest eigenvalue (1),
# by computing the indices that would sort the eigenvalues
# from smallest to largest
indices = np.argsort(eigenvalues)
# and take the index of the largest eigenvalue
principal = indices[-1]
print(principal)
0
# using the index of the largest eigenvalue, extract
# the corresponding eigenvector (the steady state vector)
steadyState = np.real(eigenvectors[:,principal])
steadyState = steadyState/np.sum(steadyState)
print(steadyState)
[0.03721197 0.05395735 0.04150565 0.37508082 0.20599833 0.28624589]
# find the order of the pages in the steady state vector
# this function sorts from smallest to largest (reverse of what we want)
reverseOrder = np.argsort(steadyState)
print(reverseOrder)
[0 2 1 4 5 3]
# reverse the order to get the most important page first
# and add one to convert from zero indexing to indexing of example
order = 1 + reverseOrder[::-1]
print('final order = {}'.format(order))
print('importance = {}'.format(steadyState[order-1]))
final order = [4 6 5 2 3 1]
importance = [0.37508082 0.28624589 0.20599833 0.05395735 0.04150565 0.03721197]
#
display(Image("images/17-deeper-pagerank-fig.jpg", width=250))
From a mathematical standpoint, we are done!
However, from a Computing & Data Sciences standpoint, there are still some issues.
The most significant issue is simply this: PageRank results must be provided very quickly. Search engines are in competition and speed is a competitive advantage.
Here is an example Google search:
#
display(Image("images/17-sample-google-search.jpg", width=800))
Notice that the search returned about 400,000 results!
Recall that using Gaussian elimination to solve $A\mathbf{x} = \mathbf{b}$ takes about $\frac{2}{3}n^3$ operations.
In this case, apparently $n \approx 400{,}000$.
So computing the PageRank in the straightforward way we've described would take about 42,667,000,000,000,000 operations.
Assuming a 2GHz CPU, that's on the order of eight months.
((2./3)*(400000**3))/((2*10**9)*(3600*24*30))
8.23045267489712
We need a faster way to compute the PageRank!
Here is an important point: we only need the principal eigenvector (the one corresponding to $\lambda_1 = 1$).
Let's review how a Markov chain gets to steady state. As we discussed at the beginning of the lecture, if we expand the starting state in terms of the eigenvectors of the transition matrix, the state of the chain at any step $k$ is given by

$$\mathbf{x}_k = c_1\lambda_1^k\mathbf{v}_1 + c_2\lambda_2^k\mathbf{v}_2 + \cdots + c_n\lambda_n^k\mathbf{v}_n.$$

Let's assume that $\lambda_1$ is the eigenvalue 1. If the chain converges to steady state, then we know that all eigenvalues other than $\lambda_1$ are less than 1 in magnitude.

Of course, if $|\lambda_i| < 1$, then $\lambda_i^k \to 0$ as $k \to \infty$.

So:

$$\mathbf{x}_k \to c_1\mathbf{v}_1 \quad \text{as } k \to \infty.$$

Note that $c_1$ is just a constant that doesn't affect the relative sizes of the components of $\mathbf{x}_k$ in the limit of large $k$.
This is another way of stating that the Markov chain goes to steady state no matter what the starting state is.
This observation suggests another way to compute the steady state of the chain:
1. Start from an arbitrary probability vector $\mathbf{x}_0$.
2. Repeatedly multiply by the transition matrix, computing $\mathbf{x}_{k+1} = P''\mathbf{x}_k$, until the vectors stop changing.
How do we know when to stop in Step 2?
Since we are looking for a steady state, we can stop when the difference between $\mathbf{x}_{k+1}$ and $\mathbf{x}_k$ is small.
This is called the power method.
Why is this a better method?
Keep in mind that the number of flops needed for a matrix-vector multiplication is about $2n^2$.
This is compared to about $\frac{2}{3}n^3$ for solving a linear system (finding the eigenvector directly).
Let's say that after computing $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{10}$,
we find that $\mathbf{x}_{10}$ is sufficiently close to $\mathbf{x}_9$.
How much work did we do?
We did 10 matrix-vector multiplications, or about $20n^2$ flops.
So the power method is

$$\frac{(2/3)\,n^3}{20\,n^2} = \frac{n}{30}$$

times faster than the direct method.
For our example, $n/30 \approx 13{,}000$. So this trick reduces the running time from 8 months down to about 27 minutes.
20*400000.**2/((2*10**9)*(60))
26.666666666666668
# Given time, talk about sparse matrix-vector multiply.
# This is linear in n, with a constant equal to average degree (say 10)
# and gets the computation down to 40 milliseconds
# but that requires explaining why P is not really dense
# (it can be expressed as a sparse matrix plus a constant matrix)
20*10*400000./((2*10**9))
0.04
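To illustrate the note above: the product $P''\mathbf{x}$ can be computed without ever forming the dense matrix, by exploiting the structure $P'' = (1-\alpha)P' + \frac{\alpha}{n}\mathbf{1}$. This is only a sketch, assuming the raw link matrix is stored as a SciPy sparse matrix with zero columns for dangling pages; the names here are our own.
#
import numpy as np

def pagerank_matvec(P_links, x, alpha=0.1):
    # P_links: n x n sparse link matrix; nonzero columns are column-stochastic,
    #          dangling (no-outlink) columns are all zero
    # x: probability vector (its entries sum to 1)
    n = x.shape[0]
    dangling = np.asarray(P_links.sum(axis=0)).ravel() == 0
    # P' x = P x + (1/n) * (probability mass currently on dangling pages)
    y = P_links @ x + x[dangling].sum() / n
    # P'' x = (1 - alpha) P' x + alpha/n, since the entries of x sum to 1
    return (1 - alpha) * y + alpha / n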
This is an example of an iterative method. Iterative methods are often the preferred approach for solving linear algebra problems in the real world.
One final thing: how exactly do we decide when to stop iterating in the power method?
One simple way is to add up the absolute differences of the components of $\mathbf{x}_{k+1}$ and $\mathbf{x}_k$:

$$\delta = \sum_{i=1}^{n} \left| (\mathbf{x}_{k+1})_i - (\mathbf{x}_k)_i \right|$$

and compare it to the sum of the components of $\mathbf{x}_k$:

$$\sigma = \sum_{i=1}^{n} (\mathbf{x}_k)_i.$$

If $\delta/\sigma$ is small (say, less than 0.001), then we can conclude that $\mathbf{x}_{k+1}$ is close enough to $\mathbf{x}_k$ for us to stop iterating.
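Putting the pieces together, here is a minimal sketch of the power method with this stopping rule (the function name and tolerance are our own choices). Applied to the matrix `P` defined in the code cell above, it should reproduce the ranking we found via the eigenvector computation.
#
import numpy as np

def power_method(P, tol=0.001, max_iter=1000):
    n = P.shape[0]
    x = np.ones(n) / n                 # start from the uniform distribution
    for _ in range(max_iter):
        x_new = P @ x                  # one matrix-vector multiply per iteration
        # stop when the summed change is small relative to the size of x
        if np.sum(np.abs(x_new - x)) / np.sum(x) < tol:
            return x_new
        x = x_new
    return x

steadyState_pm = power_method(P)
print(1 + np.argsort(steadyState_pm)[::-1])   # most important page first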
So the power method is fast, making it the algorithm of choice for a company like Google. It is also easy to implement, and easy to parallelize across multiple machines.
BU requires that we devote some class time to course evaluations. Your input helps CDS grow!
#
display(Image("images/ds121s23-course-evals.png", width=250))
A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_p\}$, all of which are in $\mathbb{R}^n$, is said to be linearly dependent if there exist weights $c_1, \ldots, c_p$, not all zero, such that

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_p\mathbf{v}_p = \mathbf{0}.$$

A set $S$ of two or more vectors is linearly dependent if and only if at least one of the vectors in $S$ is a linear combination of the others.
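As a quick numerical check (a small sketch; the example vectors are our own), a set of vectors is linearly dependent exactly when the rank of the matrix having them as columns is less than the number of vectors:
#
import numpy as np

v1 = np.array([1., 0., 2.])
v2 = np.array([0., 1., 1.])
v3 = v1 + v2                        # v3 is a linear combination of v1 and v2
A = np.column_stack([v1, v2, v3])
print(np.linalg.matrix_rank(A) < A.shape[1])   # True: the set is linearly dependent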
Definition. A matrix is in echelon form (or row echelon form) if it has the following three properties:
1. All nonzero rows are above any rows of all zeros.
2. Each leading entry (i.e., the leftmost nonzero entry) of a row is in a column to the right of the leading entry of the row above it.
3. All entries in a column below a leading entry are zeros.
For example:

$$\begin{bmatrix} \blacksquare & * & * & * & * \\ 0 & \blacksquare & * & * & * \\ 0 & 0 & 0 & \blacksquare & * \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

In this diagram, the leading entries ($\blacksquare$) are nonzero, and the symbols $*$ can be any value.
A matrix in echelon form is a special case of an upper triangular matrix.
The goal of the first step of Gaussian elimination is to convert the augmented matrix into echelon form.
Definition: A matrix is in reduced echelon form (or reduced row echelon form) if it is in echelon form, and additionally:
4. The leading entry in each nonzero row is 1.
5. Each leading 1 is the only nonzero entry in its column.
For example:

$$\begin{bmatrix} 1 & 0 & * & 0 & * \\ 0 & 1 & * & 0 & * \\ 0 & 0 & 0 & 1 & * \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
The goal of the second step of Gaussian elimination is to convert the matrix into reduced echelon form.
In our first example, let the input matrix be
Stage 1 (Elimination)
Start with the first row ($i = 1$). The leftmost nonzero in row 1 and below is in column 1. But since it's not in row 1, we need to swap. We'll swap rows 1 and 3 (we could have swapped 1 and 2).
The pivot is shown in a box. Use row reduction operations to create zeros below the pivot. In this case, that means subtracting row 1 from row 2.
Now $i = 2$. The pivot is boxed (no need to do any swaps). Use row reduction to create zeros below the pivot; to do so, we subtract the appropriate multiple of row 2 from row 3.
Now $i = 3$. Since it is the last row, we are done with Stage 1. The pivots are marked:
Stage 2 (Backsubstitution)
Starting again with the first row ($i = 1$): divide row 1 by its pivot.
Moving to the next row ($i = 2$): divide row 2 by its pivot.
And use row reduction operations to create zeros in all elements above the pivot. In this case, that means adding 3 times row 2 to row 1.
Moving to the next row ($i = 3$): the pivot is already 1. So we subtract row 3 from row 2, and subtract 5 times row 3 from row 1.
This matrix is in reduced row echelon form, so we are done with Gaussian elimination; the solution can now be read off directly.
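If you want to check a row reduction by machine, SymPy (not used elsewhere in these notes) can compute the reduced row echelon form directly. The matrix below is only a stand-in, since the example above was worked by hand.
#
import sympy as sy

A = sy.Matrix([[0, 3, -6, 6],
               [3, -7, 8, -5],
               [3, -9, 12, -9]])
rref_A, pivot_cols = A.rref()
print(rref_A)         # the reduced row echelon form
print(pivot_cols)     # indices of the pivot columns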
Let's assume that the augmented matrix of a system has been transformed into the equivalent reduced echelon form:
This system is consistent. Is the solution unique?
The associated system of equations is
Variables $x_1$ and $x_2$ correspond to pivot columns; they are called basic variables. The other variable, $x_3$, is a free variable.
Whenever a system is consistent, the solution set can be described explicitly by solving the reduced system of equations for the basic variables in terms of the free variables.
This operation is possible because the reduced echelon form places each basic variable in one and only one equation.
In the example, solve the first and second equations for $x_1$ and $x_2$. Ignore the third equation; it offers no restriction on the variables.
So the solution set is obtained by expressing $x_1$ and $x_2$ in terms of the free variable $x_3$.
"$x_3$ is free" means you can choose any value for $x_3$.
In other words, there is an infinite set of solutions to this linear system. Each solution corresponds to one particular value of $x_3$.
For instance, each particular choice of $x_3$ yields one specific solution.
Given a system of $m$ equations in $n$ unknowns, let $A$ be the augmented matrix. Let $r$ be the number of pivot positions in the REF of $A$.
Length is such a fundamental concept that we introduce a special notation and name for it.
Definition. The norm (or length) of $\mathbf{v}$ is the nonnegative scalar $\Vert\mathbf{v}\Vert$ defined by

$$\Vert\mathbf{v}\Vert = \sqrt{\mathbf{v}^T\mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}.$$

Definition. For $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$, the distance between $\mathbf{u}$ and $\mathbf{v}$, written as $\operatorname{dist}(\mathbf{u},\mathbf{v})$, is the length of the vector $\mathbf{u} - \mathbf{v}$. That is,

$$\operatorname{dist}(\mathbf{u},\mathbf{v}) = \Vert\mathbf{u} - \mathbf{v}\Vert.$$

This definition agrees with the usual formula for the Euclidean distance between two points. In $\mathbb{R}^2$ the usual formula is

$$\sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2},$$

which you can see is equal to $\Vert\mathbf{u} - \mathbf{v}\Vert$.

Definition. Two vectors $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$ are orthogonal to each other if $\mathbf{u}^T\mathbf{v} = 0$.
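A quick NumPy illustration of these three definitions, with arbitrary example vectors:
#
import numpy as np

u = np.array([3., 4.])
v = np.array([0., 4.])
print(np.linalg.norm(u))         # 5.0 -- the length (norm) of u
print(np.linalg.norm(u - v))     # 3.0 -- the distance between u and v
print(np.array([1., 2.]) @ np.array([-2., 1.]) == 0)   # True -- orthogonal vectors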
Clustering is a very important way of discovering structure in data. It represents an example of unsupervised learning.
Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.
Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.
The high-level goal of our clustering algorithm is to partition the data points into groups, so that points within a group are close to one another and far from points in other groups.
To measure how close together or spread out a set of vectors is, let's define the idea of a dataset centroid.
Given $m$ different vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m$, each with the same dimension $n$, the centroid is the average of the vectors taken componentwise:

$$\mathbf{c} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{x}_i.$$
In other words: the centroid is the "center of mass" of the dataset.
Next, we define the sample variance of a dataset as:

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \Vert \mathbf{x}_i - \mathbf{c} \Vert^2.$$
In other words, the sample variance of the set of points is the average squared distance from each point to the centroid.
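In NumPy terms (a small sketch with made-up data), with the points stored as the rows of a matrix:
#
import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 0.]])    # m = 3 points in R^2
centroid = X.mean(axis=0)                        # componentwise average
sample_variance = np.mean(np.sum((X - centroid)**2, axis=1))
print(centroid, sample_variance)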
Equivalently: we can think in terms of the partition itself.
Consider the set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$, where each $\mathbf{x}_i \in \mathbb{R}^n$.
Find cluster centers $\mathbf{c}_1, \ldots, \mathbf{c}_k$
and partition the set into $k$ different subsets $C_1, \ldots, C_k$ by assigning each point to its nearest cluster center,
so that the cost

$$\sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \Vert \mathbf{x}_i - \mathbf{c}_j \Vert^2$$

is minimized.
The (approximate) $k$-means algorithm:
1. Pick $k$ initial cluster centers $\mathbf{c}_1, \ldots, \mathbf{c}_k$ (for example, at random).
2. Assign each point to the cluster whose center $\mathbf{c}_j$ is closest to it.
(Nearer to $\mathbf{c}_j$ than to any other center.)
3. For each $j$, redefine $\mathbf{c}_j$ to be the center of mass of cluster $C_j$.
(In other words, $\mathbf{c}_j$ is the mean of the vectors in $C_j$.)
4. Repeat (i.e., go to Step 2) until convergence.
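Here is a minimal NumPy sketch of these four steps (this is not the scikit-learn implementation used in practice, and it does not handle empty clusters; the data passed to it is up to you):
#
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial cluster centers at random from the data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: move each center to the center of mass of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers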
#
display(Image("images/20-kmeans-example.png", width=600))
As you can see, $k$-means can work very well.
However, we don't have any guarantees on the performance of $k$-means.
In particular, there are various settings in which $k$-means can fail to do a good job.
$k$-means tries to find spherical clusters.
Because each point is assigned to its closest center, the points in a cluster are implicitly assumed to be arranged in a sphere around the center.
#
display(Image("images/20-kmeans-nonspherical-clusters.png", width=600))
$k$-means tries to find equal-sized clusters.
For the same reason, the sizes of clusters are implicitly assumed to be approximately equal.
#
display(Image("images/20-kmeans-cluster-size.png", width=600))
$k$-means is sensitive to the starting cluster centers.
If the initial guess (Step 1) is a bad one, $k$-means may get "stuck" in a bad solution.
#
display(Image("images/20-kmeans-bad-initialization.png", width=600))
We discussed two metrics to test how well $k$-means clustering has done... and, as a result, to find the best $k$.
If ground truth is known, we can use the Rand Index: a similarity measure that allows us to compare the ground truth $T$ against the result of $k$-means clustering $C$.
If $a$ denotes the number of pairs of points that have the same label in both $T$ and $C$, and $b$ denotes the number of pairs that have different labels in both $T$ and $C$, then the Rand Index is:

$$\text{RI} = \frac{a + b}{\binom{n}{2}},$$

where $\binom{n}{2}$ is the total number of pairs of points.
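scikit-learn provides this measure directly (assuming a reasonably recent version, which includes `rand_score`); the label arrays below are placeholders:
#
from sklearn.metrics import rand_score, adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]    # ground truth T
found_labels = [1, 1, 0, 0, 2, 2]    # clustering C (label names don't matter)
print(rand_score(true_labels, found_labels))             # 1.0: identical partitions
print(adjusted_rand_score(true_labels, found_labels))    # chance-corrected variant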
#
figs, axs = plt.subplots(1, 2, figsize = (12, 5))
df_rand_gt.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[0],
colorbar = False)
axs[0].set_title('Ground Truth (T)')
axs[0].set_axis_off()
df_rand_clust.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', ax = axs[1],
colorbar = False)
axs[1].set_title('Clustering (C)')
axs[1].set_axis_off();
If ground truth is not known, we can use the Silhouette Coefficient: an attempt to measure when a model produces "better defined" clusters, in the sense that points in the same cluster are close to each other and far from the other clusters.
If $a$ equals the mean distance between a data point and all other points in the same cluster, and $b$ equals the mean distance between a data point and all other points in the next nearest cluster, then the Silhouette Coefficient of that point is

$$s = \frac{b - a}{\max(a, b)},$$

and the Silhouette Coefficient for a clustering is the mean of $s$ over all data points.
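In scikit-learn this is `silhouette_score`; here is a small sketch on synthetic data:
#
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))    # close to 1 for well-separated clusters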
A hierarchical clustering produces a set of nested clusters organized into a tree.
A hierarchical clustering is visualized using a dendrogram
#
display(Image("images/21-dendrogram.png", width=600))
Hierarchical clustering has a number of advantages:
First, a hierarchical clustering encodes many different clusterings. That is, it does not itself decide on the correct number of clusters.
A clustering is obtained by "cutting" the dendrogram at some level.
This means that you can make this crucial decision yourself, by inspecting the dendrogram.
Put another way, you can obtain any desired number of clusters.
#
display(Image("images/21-dendrogram-cut.png", width=600))
The second advantage is that the dendrogram may itself correspond to a meaningful structure, for example, a taxonomy.
#
display(Image("images/21-animal-taxonomy.jpg", width=600))
Agglomerative Clustering ("bottom-up"): start with each point in its own cluster, and repeatedly merge the two closest clusters until only one cluster (or a desired number of clusters) remains.
Divisive Clustering ("top-down"): start with all points in a single cluster, and repeatedly split clusters until each point is in its own cluster (or a desired number of clusters is reached).
# Single, Complete, and Average Linkage
display(Image("images/21-hierarchical-criteria.png", width=1000))
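A minimal sketch of agglomerative clustering in SciPy, using the linkage criteria illustrated above (the data here is synthetic):
#
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
Z = linkage(X, method='average')                   # also: 'single', 'complete', 'ward'
dendrogram(Z)                                      # visualize the nested clusterings
labels = fcluster(Z, t=2, criterion='maxclust')    # "cut" the tree into 2 clusters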
The supervised learning problem in general is:
You are given some example data, which we'll think of abstractly as tuples $(x, y)$.
Your goal is to learn a rule that allows you to predict $y$ for some $x$ that is not in the example data you were given.
When the value $y$ to be predicted is continuous, we call this regression.
The class of models we looked at are polynomials. They are of the form:

$$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k,$$

where $k$ is the order of the polynomial.
Choosing $k$ is called model selection, and $w_0, w_1, \ldots, w_k$ are called the parameters of the model.
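The plotting cells below use helper functions `design_matrix` and `fit_poly` that were defined earlier in the notebook and are not shown in this section. A least-squares sketch of what they presumably look like:
#
import numpy as np

def design_matrix(x, k):
    # columns 1, x, x^2, ..., x^k (a Vandermonde matrix with increasing powers)
    return np.vander(x, k + 1, increasing=True)

def fit_poly(x, y, k):
    # least-squares fit of a k-th order polynomial: minimize ||y - X w||
    w, *_ = np.linalg.lstsq(design_matrix(x, k), y, rcond=None)
    return w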
#
fig, axs = plt.subplots(1, 3, sharey = True, figsize = (12, 5))
#
cy = 1000 * [w_hat_0]
pred_y = N * [w_hat_0]
axs[0].plot(cx, cy, lw = 2, label = r'$k$ = 0')
axs[0].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[0].set_xlabel('x', size = 16)
axs[0].set_ylabel('y', size = 16)
axs[0].set_title(r'$k$ = 0, constant' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[0].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 1) @ w_hat_1
pred_y = design_matrix(x, 1) @ w_hat_1
axs[1].plot(cx, cy, lw = 2, label = r'$k$ = 1')
axs[1].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[1].set_xlabel('x', size = 16)
axs[1].set_title(r'$k$ = 1, linear' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[1].legend(loc = 'best', fontsize = 16)
#
cy = design_matrix(cx, 3) @ w_hat_3
pred_y = design_matrix(x, 3) @ w_hat_3
axs[2].plot(cx, cy, lw = 2, label = r'$k$ = 3')
axs[2].plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
axs[2].set_xlabel('x', size = 16)
axs[2].set_title('$k$ = 3, cubic' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(np.linalg.norm(y - pred_y)))
#axs[2].legend(loc = 'best', fontsize = 16)
#
fig.tight_layout();
If we overfit our model to our training data, this leads to poor generalizability. To avoid overfitting, we might:
Increase the amount of training data.
Limit the complexity of the model.
#
w_hat_9 = fit_poly(x, y, 9)
cy = design_matrix(cx, 9) @ w_hat_9
plt.plot(cx, cy, lw = 2, label = r'$k$ = 9')
plt.plot(x, y, 'ro', markersize = 8, fillstyle = 'none')
plt.xlabel('x', size = 16)
plt.ylabel('y', size = 16)
plt.title(r'$k$ = 9' + '\n' + r'$E(\mathbf{w})$ =' + ' {:0.2f}'.format(0));
We can hold out data to test our model selection. For example, we might do the following:
#
test_y = np.sin(2 * np.pi * x) + default_rng(8).normal(size = N, scale = 0.20)
max_k = N
train_err = [np.linalg.norm(y - N * [w_hat_0])]
test_err = [np.linalg.norm(test_y - N * [w_hat_0])]
for k in range(1, max_k):
w_hat = fit_poly(x, y, k)
pred_y = design_matrix(x, k) @ w_hat
train_err.append(np.linalg.norm(y - pred_y))
test_err.append(np.linalg.norm(test_y - pred_y))
plt.plot(range(max_k), test_err, 'ro-', label = 'Testing Error')
plt.plot(range(max_k), train_err, 'bo-', label = 'Training Error')
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'$E(\mathbf{w}^*)$')
plt.legend(loc = 'best');
In cross-validation, the data is partitioned once, and then each partition is used as the test data once.
This ensures that all the data gets equal weight in the training and in the testing.
We divide the data into $k$ "folds".
#
display(Image("images/22-k-fold.png", width=500))
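In scikit-learn, $k$-fold cross-validation is available through `cross_val_score` (a sketch; the model and data below are placeholders):
#
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# cv=5: the data is split into 5 folds, and each fold is the test set exactly once
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores, scores.mean())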
#
display(Image("images/22-DT-Example-2.png", width=800))
Hunt's Algorithm builds the tree node by node, starting from the root.
As we build the tree, we divide the training data up.
Let $D_t$ be the set of training records that reach node $t$.
If all the records in $D_t$ belong to the same class, then $t$ is a leaf node labeled with that class. Otherwise, choose an attribute test that splits $D_t$ into smaller subsets, and create a child node for each subset.
Recursively apply the above procedure to each child until a stopping criterion is met.
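In practice we use a library implementation built on these ideas, for example scikit-learn's `DecisionTreeClassifier` (a sketch with placeholder data; the depth limit is one possible stopping criterion):
#
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)                # a simple nonlinear labeling rule
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # stop splitting at depth 3
print(tree.predict([[0.5, 0.5], [-0.5, 0.5]]))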
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.plot(test_X[0], test_X[1], 'ok')
plt.annotate('Test Point', test_X, [75, 25],
textcoords = 'offset points', fontsize = 14,
arrowprops = {'arrowstyle': '->'})
plt.axis('equal')
plt.axis('off')
plt.title('Training Points: 2 Classes');
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
plt.plot(test_X[0], test_X[1], 'ok')
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.5, facecolor = 'red', alpha = 0.2)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('1-Nearest-Neighbor: Classification: Red');
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok')
ax=plt.gcf().gca()
#ellipse = mp.patches.Ellipse(gmm.means_[clus], 3 * e[0], 3 * e[1], angle, color = 'r')
circle = mp.patches.Circle(test_X, 0.9, facecolor = 'gray', alpha = 0.3)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('2-Nearest-Neighbor');
#
plt.figure()
ax=plt.gcf().gca()
#ellipse = mp.patches.Ellipse(gmm.means_[clus], 3 * e[0], 3 * e[1], angle, color = 'r')
circle = mp.patches.Circle(test_X, 1.4, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok')
plt.axis('equal')
plt.axis('off')
plt.title('3-Nearest-Neighbor: Classification: Blue');
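The decisions illustrated in these plots can also be computed with scikit-learn's `KNeighborsClassifier`; this sketch assumes the demo arrays `demo_X`, `demo_y`, and the test point `test_X` from the cells above:
#
from sklearn.neighbors import KNeighborsClassifier

# fit a 3-nearest-neighbor classifier on the demo points and classify the test point
knn = KNeighborsClassifier(n_neighbors=3).fit(demo_X, demo_y)
print(knn.predict([test_X]))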
Working with a $k$-NN classifier can involve some challenges.
Next we'll look at two classification methods in practice:
To compare these methods, the question arises:
How do we evaluate a classifier?
Using the confusion matrix, we can define some useful measures, such as:
- accuracy: the fraction of all predictions that are correct, $(TP + TN)/(TP + TN + FP + FN)$;
- precision: the fraction of predicted positives that are truly positive, $TP/(TP + FP)$;
- recall: the fraction of true positives that are correctly predicted, $TP/(TP + FN)$.
#
display(Image("images/23-confusion-matrix.png", width=300))
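A quick sketch of computing these measures with scikit-learn (the label vectors are placeholders):
#
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # fraction of predictions that are correct
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)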