#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
[This lecture is based on Prof. Crovella's CS 132 and CS 506 lecture notes.]
Last time, we began our investigation into singular value decomposition, the last and most powerful matrix factorization we will consider in this course.
SVD applies to any matrix $A$, no matter its size, (a)symmetry, invertibility, or diagonalizability.
Using SVD, we can "compress" a matrix of real-world messy data (exactly or approximately) into matrices of smaller dimension.
Geometrically, SVD says that the linear transformation corresponding to $A$ can be decomposed into a sequence of three steps.
Rotation within $\mathbb{R}^n$, without scaling (i.e., rotating from the standard basis to another orthonormal basis).
Scaling each dimension by a constant $\sigma_i$, called a singular value.
Rotation within $\mathbb{R}^m$.
Note that this is not the inverse of step 1: the rotation might be by a different angle, across a different axis, and in a space of different dimension.
# Source: Wikipedia
display(Image("images/18-svd-geometric.png", width=450))
Algebraically, SVD is based on a very simple question:
Among all unit vectors $\mathbf{x}$, what is the vector that maximizes $\Vert A\mathbf{x}\Vert$?
In other words, in which direction does $A$ create the largest output vector from a unit input?
#
display(Image("images/18-Lay-fig-7-4-1.jpg", width=650))
Unfortunately, the eigenvalues and eigenvectors of $A$ don't directly answer this question.
Eigenvectors have the property that they get mapped by $A$ to a scalar multiple of themselves. They don't get mapped to the farthest points on the ellipse.
#
V = np.array([[2,1],[.1,1]])
L = np.array([[1.2,0],
[0,0.7]])
A = V @ L @ np.linalg.inv(V)
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
for th in theta_step:
    x = np.array([np.sin(th), np.cos(th)])
    ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [0.3* V[0][0], 0.3*V[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [0.3* V[0][1], 0.3*V[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ax.set_title(r'Eigenvectors of $A$ and the image of the unit circle under $A$');
The largest value of $\Vert A\mathbf{x}\Vert$ is attained along the long axis of the ellipse. Clearly there is some $\mathbf{x}$ that is mapped to that point by $A$. That is the $\mathbf{x}$ we want to find.
Equivalently, we can maximize $\Vert A\mathbf{x}\Vert^2 = \mathbf{x}^TA^TA\mathbf{x}$, which is a quadratic form in the symmetric matrix $A^TA$. Note that:
The eigenvector corresponding to the largest eigenvalue of $A^TA$ indeed shows us where $\Vert A\mathbf{x}\Vert$ is maximized -- where the ellipse is longest.
Also, the other eigenvector of $A^TA$ shows us where the ellipse is narrowest.
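Before looking at the picture, here is a quick numerical check (an added illustration, using the demo matrix $A$ defined above): the top eigenvector of $A^TA$ attains the largest value of $\Vert A\mathbf{x}\Vert$ over the unit circle, and that value is the square root of the largest eigenvalue.

evals, evecs = np.linalg.eigh(A.T @ A)        # A^T A is symmetric, so use eigh
v_top = evecs[:, np.argmax(evals)]            # unit eigenvector for the largest eigenvalue
print(np.linalg.norm(A @ v_top))              # length of its image under A
print(np.sqrt(evals.max()))                   # sqrt of the largest eigenvalue -- same value
angles = np.linspace(0, 2*np.pi, 1000)
circle = np.stack([np.cos(angles), np.sin(angles)])
print(np.linalg.norm(A @ circle, axis=0).max())   # brute-force max of ||Ax|| over the unit circle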
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
#for th in theta_step:
# x = np.array([np.sin(th), np.cos(th)])
# ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, vt = np.linalg.svd(A)
# rows of vt are the right singular vectors (the eigenvectors of A^T A);
# columns of u are the left singular vectors
ut.plotArrowVec(ax, [vt[0][0], vt[0][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [vt[1][0], vt[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [s[0]*u[0][0], s[0]*u[1][0]], [vt[0][0], vt[0][1]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ut.plotArrowVec(ax, [s[1]*u[0][1], s[1]*u[1][1]], [vt[1][0], vt[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ax.set_title(r'Eigenvectors of $A^TA$ and their images under $A$');
The matrix $A^TA$ has many special properties, which help us construct the singular value decomposition of $A$.
Definition. The singular values of $A$ are the square roots of the eigenvalues of $A^TA$. They are denoted by $\sigma_1, \sigma_2, \dots, \sigma_n$, with $\sigma_i = \sqrt{\lambda_i}$, and they are arranged in decreasing order: $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$.
And the normalized orthogonal set of eigenvectors of $A^TA$, $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$, are the right singular vectors of $A$.
Now: we know that the vectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ can be chosen to form an orthogonal set because they are eigenvectors of the symmetric matrix $A^TA$.
However, it's also the case that $A\mathbf{v}_1, \dots, A\mathbf{v}_n$ are an orthogonal set.
This fact is key to the SVD.
Another way to look at this is to consider that the set $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is orthonormal by definition, and the set $\{A\mathbf{v}_1, \dots, A\mathbf{v}_n\}$ is also orthogonal. In other words, we've found a set of vectors that, when transformed with $A$, preserves orthogonality.
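We can check this numerically for the demo matrix $A$ (an added sanity check, not part of the original notes): the images $A\mathbf{v}_i$ of the orthonormal eigenvectors of $A^TA$ are mutually orthogonal.

evals, evecs = np.linalg.eigh(A.T @ A)   # columns of evecs are orthonormal eigenvectors v_i
Av = A @ evecs                           # columns are the images A v_i
print(np.round(Av.T @ Av, 6))            # off-diagonal entries ~ 0, so the A v_i are orthogonal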
Theorem. Suppose $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is an orthonormal basis of $\mathbb{R}^n$ consisting of eigenvectors of $A^TA$, arranged so that the corresponding eigenvalues satisfy $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and suppose $A$ has $r$ nonzero singular values.
Then $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col}A$, and $\operatorname{rank}A = r$.
Note how surprising this is: while $\mathbf{v}_1, \dots, \mathbf{v}_n$ are a basis for $\mathbb{R}^n$, $\operatorname{Col}A$ is a subspace of $\mathbb{R}^m$.
Nonetheless, the vectors $A\mathbf{v}_1, \dots, A\mathbf{v}_r$ form an orthogonal basis for $\operatorname{Col}A$.
Proof. What we need to do is establish that $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal, linearly independent set whose span is $\operatorname{Col}A$.
Because $\mathbf{v}_i$ and $\mathbf{v}_j$ are orthogonal for $i \neq j$,
$$(A\mathbf{v}_i)^T(A\mathbf{v}_j) = \mathbf{v}_i^TA^TA\mathbf{v}_j = \lambda_j\mathbf{v}_i^T\mathbf{v}_j = 0$$
So $\{A\mathbf{v}_1, \dots, A\mathbf{v}_n\}$ is an orthogonal set.
Furthermore, the lengths of the vectors $A\mathbf{v}_1, \dots, A\mathbf{v}_n$ are the singular values of $A$, since $\Vert A\mathbf{v}_i\Vert^2 = \mathbf{v}_i^TA^TA\mathbf{v}_i = \lambda_i = \sigma_i^2$. And since there are $r$ nonzero singular values, $A\mathbf{v}_i \neq \mathbf{0}$ if and only if $1 \leq i \leq r$.
So $A\mathbf{v}_1, \dots, A\mathbf{v}_r$ are a linearly independent set (because they are orthogonal and all nonzero), and clearly they are each in $\operatorname{Col}A$.
Finally, we just need to show that $\operatorname{Span}\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\} = \operatorname{Col}A$.
To do this we'll show that for any $\mathbf{y}$ in $\operatorname{Col}A$, we can write $\mathbf{y}$ in terms of $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$:
Say $\mathbf{y} = A\mathbf{x}$ for some $\mathbf{x}$ in $\mathbb{R}^n$.
Because $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is a basis for $\mathbb{R}^n$, we can write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$, so
$$\mathbf{y} = A\mathbf{x} = c_1A\mathbf{v}_1 + \cdots + c_rA\mathbf{v}_r + \cdots + c_nA\mathbf{v}_n = c_1A\mathbf{v}_1 + \cdots + c_rA\mathbf{v}_r$$
(because $A\mathbf{v}_i = \mathbf{0}$ for $i > r$).
In summary: $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an (orthogonal) linearly independent set whose span is $\operatorname{Col}A$, so it is an (orthogonal) basis for $\operatorname{Col}A$.
Notice that we have also proved that $\operatorname{rank}A = \dim\operatorname{Col}A = r$.
In other words, if $A$ has $r$ nonzero singular values, $A$ has rank $r$.
What we have just proved is that the eigenvectors of $A^TA$ are rather special.
Note that, thinking of $A$ as a linear operator: its domain is $\mathbb{R}^n$ and its range is $\operatorname{Col}A$.
So we have just proved that the images under $A$ of the eigenvectors of $A^TA$ form an orthogonal basis for the range of $A$.
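As a quick illustration of this result (added here, with an arbitrarily chosen random matrix), the number of nonzero singular values of a matrix built to have rank 3 matches its rank:

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))   # an 8 x 6 matrix with rank 3
sv = np.linalg.svd(M, compute_uv=False)       # singular values only
print(np.sum(sv > 1e-10))                     # number of (numerically) nonzero singular values
print(np.linalg.matrix_rank(M))               # agrees with the rank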
Now we can return to the definition of the SVD.
Theorem. Let $A$ be an $m \times n$ matrix with rank $r$. Then there exists an $m \times n$ matrix $\Sigma$ whose diagonal entries are the first $r$ singular values of $A$ ($\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$), and there exist an $m \times m$ orthogonal matrix $U$ and an $n \times n$ orthogonal matrix $V$ such that $$A = U\Sigma V^T$$
The right singular vectors $\mathbf{v}_i$ of $A$ are the corresponding eigenvectors of $A^TA$, normalized to be of unit length.
The left singular vectors of $A$ are the vectors $\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$: that is, take $A$ times each of the right singular vectors and normalize the result. (If $i > r$, then pick any unit vector that is orthogonal to all of the previous left singular vectors.)
Any factorization $A = U\Sigma V^T$, with $U$ and $V$ orthogonal and $\Sigma$ an $m \times n$ diagonal matrix, is called a singular value decomposition (SVD) of $A$.
The columns of $U$ are called the left singular vectors and the columns of $V$ are called the right singular vectors of $A$.
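To connect the theorem to computation, here is a sketch (added for illustration) that builds the SVD of the demo matrix $A$ by hand, following the recipe above: right singular vectors from $A^TA$, $\sigma_i = \sqrt{\lambda_i}$, and $\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$. This works directly here because the demo $A$ has full rank; for $i > r$ one would need to extend $U$ as described in the theorem.

evals, evecs = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]              # sort eigenvalues in decreasing order
lam, Vmat = evals[order], evecs[:, order]    # columns of Vmat are the right singular vectors
sigma = np.sqrt(lam)                         # singular values
Umat = (A @ Vmat) / sigma                    # u_i = A v_i / sigma_i (columns of Umat)
print(np.allclose(Umat @ np.diag(sigma) @ Vmat.T, A))   # reconstructs A = U Sigma V^T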
Aside: regarding the "Rolls Royce" property, consider how elegant this structure is.
In particular, $U$ and $V$ are orthogonal and $\Sigma$ is diagonal: every factor in the decomposition has an especially simple structure.
# image source https://bringatrailer.com/listing/1964-rolls-royce-james-young-phanton-v-limosine/
display(Image("images/18-rolls-royce.jpg", width=350))
We have built up enough tools now that the proof is quite straightforward.
Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ and $\mathbf{v}_1, \dots, \mathbf{v}_n$ be the eigenvalues and corresponding orthonormal eigenvectors of $A^TA$, and let $\sigma_i = \sqrt{\lambda_i}$.
The starting point is to use the fact that we just proved:
$\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col}A$.
Next, let us normalize each $A\mathbf{v}_i$ to obtain an orthonormal basis $\{\mathbf{u}_1, \dots, \mathbf{u}_r\}$, where
$$\mathbf{u}_i = \frac{1}{\Vert A\mathbf{v}_i\Vert}A\mathbf{v}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$$
Then $$A\mathbf{v}_i = \sigma_i\mathbf{u}_i \quad (1 \leq i \leq r)$$
Now the rank of $A$ (which is $r$) may be less than $m$.
In that case, add additional orthonormal vectors $\mathbf{u}_{r+1}, \dots, \mathbf{u}_m$ to the set so that they span $\mathbb{R}^m$.
Now collect the vectors into matrices.
$$U = \begin{bmatrix}\mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m\end{bmatrix} \quad\text{and}\quad V = \begin{bmatrix}\mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n\end{bmatrix}$$
Recall that these matrices are orthogonal because the $\mathbf{u}_i$ are orthonormal and the $\mathbf{v}_i$ are orthonormal, as we previously proved.
So
$$AV = \begin{bmatrix}A\mathbf{v}_1 & \cdots & A\mathbf{v}_r & \mathbf{0} & \cdots & \mathbf{0}\end{bmatrix} = \begin{bmatrix}\sigma_1\mathbf{u}_1 & \cdots & \sigma_r\mathbf{u}_r & \mathbf{0} & \cdots & \mathbf{0}\end{bmatrix}$$
So, letting $\Sigma$ be the $m \times n$ matrix whose diagonal entries are $\sigma_1, \dots, \sigma_r$ (and whose other entries are zero),
$$AV = U\Sigma$$
Now, $V$ is an orthogonal matrix, so multiplying both sides on the right by $V^T$:
$$AVV^T = U\Sigma V^T \quad\Longrightarrow\quad A = U\Sigma V^T$$
Let's generalize this idea to get a better sense of how the SVD decomposes a matrix.
For the moment, let's consider an $m \times n$ matrix $A$ with $m > n$. (The situation when $m < n$ follows similarly.)
The SVD looks like this, with the singular values on the diagonal of $\Sigma$:
Now, let's assume that the number of nonzero singular values $r$ is less than $n$. (The picture below illustrates this case.)
Again, other cases would be similar.
In many cases we are only concerned with representing $A$. In this case, we don't need the entirety of $U$ or $V$.
To compute $A$, we only need the leftmost $r$ columns of $U$ and the uppermost $r$ rows of $V^T$.
That's because all the other values on the diagonal of $\Sigma$ are zero, so they don't contribute anything to $A$. So we might as well drop the corresponding columns of $U$ and rows of $V^T$.
So we often work with the reduced SVD of $A$, in which $U_r$ is $m \times r$, $\Sigma_r$ is $r \times r$, and $V_r$ is $n \times r$:
Note that in the reduced SVD, $\Sigma_r$ has all nonzero entries on its diagonal, so it can be inverted.
However, we still have that $A = U_r\Sigma_rV_r^T$ exactly.
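In numpy, the `full_matrices` flag of `np.linalg.svd` controls whether you get the full or the reduced factors. Here is a small illustration (with an arbitrary tall example matrix; the reduced form keeps $n$ columns, which equals $r$ when the matrix has full column rank):

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 3))                        # a tall 6 x 3 example
Uf, sf, Vtf = np.linalg.svd(M, full_matrices=True)     # full SVD
Ur, sr, Vtr = np.linalg.svd(M, full_matrices=False)    # reduced ("economy") SVD
print(Uf.shape, Vtf.shape)                             # (6, 6) (3, 3)
print(Ur.shape, Vtr.shape)                             # (6, 3) (3, 3)
print(np.allclose(Ur @ np.diag(sr) @ Vtr, M))          # the reduced SVD still reconstructs M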
(Reduced) SVD is useful in a variety of data science applications.
#
display(Image("images/18-knife.jpg", width=350))
One application that we will discuss today is the low-rank approximation of a matrix.
Recall that the rank of a matrix $A$ is the largest number of linearly independent columns of $A$.
Or, equivalently, the dimension of $\operatorname{Col}A$.
Let's define the rank-$k$ approximation to $A$:
When $k < \operatorname{rank}A$, the rank-$k$ approximation to $A$, denoted $A^{(k)}$, is the closest rank-$k$ matrix to $A$, i.e.,
$$A^{(k)} = \arg\min_{\{B\;:\;\operatorname{rank}B = k\}}\Vert A - B\Vert_F$$
where the Frobenius norm $\Vert\cdot\Vert_F$ of a matrix is the matrix equivalent of the Euclidean norm of a vector:
$$\Vert A\Vert_F = \sqrt{\sum_{i,j}a_{ij}^2}$$
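As a quick added check of this definition: numpy's default matrix norm is the Frobenius norm, and it equals the Euclidean norm of the matrix unrolled into a vector.

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.norm(M))             # Frobenius norm: sqrt(1 + 4 + 9 + 16)
print(np.linalg.norm(M.flatten()))   # Euclidean norm of the flattened matrix -- same value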
Why is a rank-$k$ approximation valuable?
The reason is that a rank-$k$ matrix may take up much less space than the original $A$.
The rank-$k$ approximation takes up space $k(m+n)$ while $A$ itself takes space $mn$.
For example, when $k$ is small compared to $m$ and $n$, the ratio $\frac{k(m+n)}{mn}$ is small, so the approximation takes up only a small fraction of the space of $A$.
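For a concrete (purely hypothetical) example of the savings, with made-up sizes $m = n = 1000$ and $k = 10$:

m, n, k = 1000, 1000, 10
print(m * n)                        # storage for A itself: 1,000,000 values
print(k * (m + n))                  # storage for the rank-k factors: 20,000 values
print(k * (m + n) / (m * n))        # 0.02 -- about 2% of the original space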
The key to using the SVD for matrix approximation is as follows:
The best rank-$k$ approximation to any matrix can be found via the SVD.
How do we use the SVD to find the best rank-$k$ approximation to $A$?
In terms of the singular value decomposition $A = U\Sigma V^T$,
the best rank-$k$ approximation to $A$ is formed by taking
$$\sigma'_i = \begin{cases}\sigma_i & \text{if } i \leq k\\ 0 & \text{if } i > k\end{cases}$$
and constructing
$$A^{(k)} = U\Sigma'V^T$$
Furthermore, the distance (in Frobenius norm) of the best rank-$k$ approximation $A^{(k)}$ from $A$ is equal to $\sqrt{\sum_{i=k+1}^r\sigma_i^2}$.
That is, if you construct $A^{(k)}$ as shown above, then:
$$\Vert A - A^{(k)}\Vert_F = \sqrt{\sum_{i=k+1}^r\sigma_i^2}$$
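Here is a small sketch (added for illustration) of a rank-$k$ approximation built this way, together with a numerical check of the error formula; the helper `rank_k_approx` and the random test matrix are ours, not part of the original notes.

def rank_k_approx(M, k):
    # zero out all but the first k singular values and rebuild the matrix
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
M = rng.standard_normal((20, 10))
k = 4
Mk = rank_k_approx(M, k)
s = np.linalg.svd(M, compute_uv=False)
print(np.linalg.norm(M - Mk))            # Frobenius-norm distance from M to its rank-k approximation
print(np.sqrt(np.sum(s[k:]**2)))         # sqrt of the sum of the discarded sigma_i^2 -- matches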
Let's see how low-rank approximations via SVD can be used in practice, and investigate some real data.
Image data often shows low effective rank, meaning that it can be simplified to a low-rank approximation with only a small loss.
For example, here is an original photo:
boat = np.loadtxt('data/boat.dat')
import matplotlib.cm as cm
plt.figure()
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off');
We can think of this as a matrix whose entries are 1-byte grayscale values (numbers between 0 and 255).
print(boat)
[[127. 123. 125. ... 165. 169. 166.]
 [128. 126. 128. ... 169. 163. 167.]
 [128. 124. 128. ... 178. 160. 175.]
 ...
 [112. 112. 115. ... 101.  97. 104.]
 [110. 112. 117. ... 104.  93. 105.]
 [113. 115. 121. ... 102.  95.  97.]]
This image is 512 × 512 pixels in size. As a matrix, it has rank 512.
But its effective rank is low. To see why, let's look at the spectrum of this image matrix:
u, s, vt = np.linalg.svd(boat, full_matrices = False)
plt.plot(s)
plt.xlabel('$k$', size = 16)
plt.ylabel(r'$\sigma_k$', size = 16)
plt.title('Singular Values of Boat Image', size = 16);
Based on the plot above, its effective rank is perhaps 40. In fact, the error when we approximate $A$ by a rank-40 matrix is only around 10%.
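We can verify that figure directly from the singular values we just computed (a quick added check): the relative Frobenius error of the best rank-40 approximation is $\sqrt{\sum_{i>40}\sigma_i^2}\,/\,\Vert A\Vert_F$.

relative_error = np.sqrt(np.sum(s[40:]**2)) / np.sqrt(np.sum(s**2))
print(relative_error)                # roughly 0.1, i.e., about 10% error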
Let's find the closest rank-40 matrix to $A$ and view it.
We can do this quite easily using the SVD.
We simply construct our approximation of $A$ using only the first 40 columns of $U$ and the top 40 rows of $V^T$.
#
u, s, vt = np.linalg.svd(boat, full_matrices = False)
s[40:] = 0
boatApprox = u @ np.diag(s) @ vt
plt.figure(figsize=(9,6))
plt.subplot(1,2,1)
plt.imshow(boatApprox,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 40 Boat')
plt.subplot(1,2,2)
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 512 Boat');
# plt.subplots_adjust(wspace=0.5)
Note that the rank-40 boat can be stored in about $40 \times (512 + 512) = 40{,}960$ values instead of $512 \times 512 = 262{,}144$ -- only about 16% of the space of the original image!
This general principle is what makes image, video, and sound compression effective.
When you stream video, look at images on the web, or listen to compressed audio,
these signals have been compressed using the fact that they are effectively low-rank matrices.
One interesting side effect of dimensionality reduction is that it often reduces the amount of noise in the data. Noise tends to show up in the components associated with the smallest singular values, so a low-rank approximation can remove much of it.
We'll look at data traffic on the Abilene network:
#
display(Image("images/19-Abilene-map.png", width=500))
Source: Internet2, circa 2005
with open('data/net-traffic/odnames','r') as f:
    odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003', freq = '10min', periods = 1008)
Atraf = pd.read_table('data/net-traffic/X', sep=' ', header=None, names=odnames, engine='python')
Atraf.index = dates
Atraf
ATLA-ATLA | ATLA-CHIN | ATLA-DNVR | ATLA-HSTN | ATLA-IPLS | ATLA-KSCY | ATLA-LOSA | ATLA-NYCM | ATLA-SNVA | ATLA-STTL | ... | WASH-CHIN | WASH-DNVR | WASH-HSTN | WASH-IPLS | WASH-KSCY | WASH-LOSA | WASH-NYCM | WASH-SNVA | WASH-STTL | WASH-WASH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2003-09-01 00:00:00 | 8466132.0 | 29346537.0 | 15792104.0 | 3646187.0 | 21756443.0 | 10792818.0 | 14220940.0 | 25014340.0 | 13677284.0 | 10591345.0 | ... | 53296727.0 | 18724766.0 | 12238893.0 | 52782009.0 | 12836459.0 | 31460190.0 | 105796930.0 | 13756184.0 | 13582945.0 | 120384980.0 |
2003-09-01 00:10:00 | 20524567.0 | 28726106.0 | 8030109.0 | 4175817.0 | 24497174.0 | 8623734.0 | 15695839.0 | 36788680.0 | 5607086.0 | 10714795.0 | ... | 68413060.0 | 28522606.0 | 11377094.0 | 60006620.0 | 12556471.0 | 32450393.0 | 70665497.0 | 13968786.0 | 16144471.0 | 135679630.0 |
2003-09-01 00:20:00 | 12864863.0 | 27630217.0 | 7417228.0 | 5337471.0 | 23254392.0 | 7882377.0 | 16176022.0 | 31682355.0 | 6354657.0 | 12205515.0 | ... | 67969461.0 | 37073856.0 | 15680615.0 | 61484233.0 | 16318506.0 | 33768245.0 | 71577084.0 | 13938533.0 | 14959708.0 | 126175780.0 |
2003-09-01 00:30:00 | 10856263.0 | 32243146.0 | 7136130.0 | 3695059.0 | 28747761.0 | 9102603.0 | 16200072.0 | 27472465.0 | 9402609.0 | 10934084.0 | ... | 66616097.0 | 43019246.0 | 12726958.0 | 64027333.0 | 16394673.0 | 33440318.0 | 79682647.0 | 16212806.0 | 16425845.0 | 112891500.0 |
2003-09-01 00:40:00 | 10068533.0 | 30164311.0 | 8061482.0 | 2922271.0 | 35642229.0 | 9104036.0 | 12279530.0 | 29171205.0 | 7624924.0 | 11327807.0 | ... | 66797282.0 | 40408580.0 | 11733121.0 | 54541962.0 | 16769259.0 | 33927515.0 | 81480788.0 | 16757707.0 | 15158825.0 | 123140310.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2003-09-07 23:10:00 | 8849096.0 | 33461807.0 | 5866138.0 | 3786793.0 | 19097140.0 | 10561532.0 | 26092040.0 | 28640962.0 | 8343867.0 | 8820650.0 | ... | 65925313.0 | 21751316.0 | 11058944.0 | 58591021.0 | 17137907.0 | 24297674.0 | 83293655.0 | 17329425.0 | 20865535.0 | 123125390.0 |
2003-09-07 23:20:00 | 9776675.0 | 31474607.0 | 5874654.0 | 11277465.0 | 14314837.0 | 9106198.0 | 26412752.0 | 26168288.0 | 8638782.0 | 9193717.0 | ... | 70075490.0 | 29126443.0 | 12667321.0 | 54571764.0 | 15383038.0 | 25238842.0 | 70015955.0 | 16526455.0 | 16881206.0 | 142106800.0 |
2003-09-07 23:30:00 | 9144621.0 | 32117262.0 | 5762691.0 | 7154577.0 | 17771350.0 | 10149256.0 | 29501669.0 | 25998158.0 | 11343171.0 | 9423042.0 | ... | 68544458.0 | 27817836.0 | 15892668.0 | 50326213.0 | 12098328.0 | 27689197.0 | 73553203.0 | 18022288.0 | 18471915.0 | 127918530.0 |
2003-09-07 23:40:00 | 8802106.0 | 29932510.0 | 5279285.0 | 5950898.0 | 20222187.0 | 10636832.0 | 19613671.0 | 26124024.0 | 8732768.0 | 8217873.0 | ... | 65087776.0 | 28836922.0 | 11075541.0 | 52574692.0 | 11933512.0 | 31632344.0 | 81693475.0 | 16677568.0 | 16766967.0 | 138180630.0 |
2003-09-07 23:50:00 | 8716795.6 | 22660870.0 | 6240626.4 | 5657380.6 | 17406086.0 | 8808588.5 | 15962917.0 | 18367639.0 | 7767967.3 | 7470650.1 | ... | 65599891.0 | 25862152.0 | 11673804.0 | 60086953.0 | 11851656.0 | 30979811.0 | 73577193.0 | 19167646.0 | 19402758.0 | 137288810.0 |
1008 rows × 121 columns
Atraf.shape
(1008, 121)
As we would expect, our traffic matrix has rank 121:
np.linalg.matrix_rank(Atraf)
121
However -- perhaps it has low effective rank.
The `numpy` routine for computing the SVD is `np.linalg.svd`:
u, s, vt = np.linalg.svd(Atraf)
Now let's look at the singular values of `Atraf` to see if it can be usefully approximated as a low-rank matrix:
#
fig = plt.figure(figsize=(6,4))
plt.plot(range(1,1+len(s)),s)
plt.xlabel(r'$k$',size=20)
plt.ylabel(r'$\sigma_k$',size=20)
plt.ylim(ymin = 0)
plt.xlim(xmin = -1)
plt.title(r'Singular Values of $A$',size=20);
This classic, sharp-elbowed curve tells us that a few singular values are very large, and most singular values are quite small.
Zooming in on just the first 20 singular values, we can see that the elbow is around 4 to 6 singular values:
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
plt.plot(range(1, 21), s[0:20]/Anorm, '.-')
plt.xlim([0.5, 20])
plt.ylim([0, 1])
plt.xlabel(r'$k$', size=20)
plt.xticks(range(1, 21))
plt.ylabel(r'$\sigma_k$', size=20);
plt.title(r'Singular Values of $A$',size=20);
This pattern of singular values suggests low effective rank.
Let's compute the relative error of a rank-$k$ approximation to $A$, measured in the Frobenius norm (which essentially treats the matrix as a vector):
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
err = np.cumsum(s[::-1]**2)
err = np.sqrt(err[::-1])
plt.plot(range(0, 20), err[:20]/Anorm, '.-')
plt.xlim([0, 20])
plt.ylim([0, 1])
plt.xticks(range(1, 21))
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'relative F-norm error', size=16)
plt.title(r'Relative Error of rank-$k$ approximation to $A$', size=16);
Remarkably, we are down to 9% relative error using only a rank-20 approximation to $A$.
So instead of storing $mn = 1008 \times 121 = 121{,}968$ values,
we only need to store $k(m+n) = 20 \times (1008 + 121) = 22{,}580$ values,
which is an 81% reduction in size.
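The arithmetic behind those storage figures (an added check, using the shape of `Atraf` and $k = 20$):

m, n = Atraf.shape                  # (1008, 121)
k = 20
print(m * n)                        # 121,968 values to store Atraf directly
print(k * (m + n))                  # 22,580 values for the rank-20 factors
print(1 - k * (m + n) / (m * n))    # about 0.81 -- an 81% reduction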
In practice many datasets have low effective rank.
Here are some more examples.
User preferences over items.
Example: the Netflix prize worked with partially-observed matrices in which the rows correspond to users, the columns to movies, and the entries are ratings.
Although the problem matrix was of size 500,000 × 18,000, the winning approach modeled the matrix as having rank 20 to 40.
Source: [Koren et al, IEEE Computer, 2009]
Likes on Facebook.
Here, the matrices capture user Like activity on Facebook:
Source: [Viswanath et al., Usenix Security, 2014]
#
display(Image("images/18-facebook.png", width=700))
Social Media Activity.
Here, the matrices capture user activity on Yelp and Twitter:
Source: [Viswanath et al., Usenix Security, 2014]
#
display(Image("images/18-yelp-twitter.png", width=700))
The reduced singular value decomposition of a rank-$r$ matrix $A$ has the form:
$$A = U_r\Sigma_rV_r^T$$
where $U_r$ is an $m \times r$ matrix with orthonormal columns, $\Sigma_r$ is an $r \times r$ diagonal matrix whose diagonal entries are the nonzero singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$, and $V_r$ is an $n \times r$ matrix with orthonormal columns.