#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
#
def centerAxes(ax):
ax.spines['left'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position('zero')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
bounds = np.array([ax.axes.get_xlim(), ax.axes.get_ylim()])
ax.plot(bounds[0][0],bounds[1][0],'')
ax.plot(bounds[0][1],bounds[1][1],'')
[This lecture is based on Prof. Crovella's CS 132 and CS 506 lecture notes, and Lior Pachter's blog.]
Last week, we began our investigation into singular value decomposition, the last and most powerful matrix factorization we will consider in this course.
Geometrically, SVD says that the linear transformation corresponding to can be decomposed into a sequence of three steps.
Rotation within , without scaling (i.e., rotating from the standard basis to another orthonormal basis).
Scaling each dimension by a constant , called a singular value.
Rotation within .
Note that this is not the inverse of step 1: the rotation might be by a different angle, across a different axis, and in a space of different dimension.
# Source: Wikipedia
display(Image("images/18-svd-geometric.png", width=350))
We talked about constructing the singular value decomposition using the properties of .
Definition. The singular values of are the square roots of the eigenvalues of . They are denoted by , , , and they are arranged in decreasing order .
The right singular vectors of are the corresponding eigenvectors of , normalized to be of unit length.
The left singular vectors of are the vectors : that is, take times each of the right singular values and normalize them. (If , then pick any vector that is orthogonal to all of the previous left singular vectors.)
Theorem. Let be an matrix with rank . Then there exists an matrix whose diagonal entries are the first singular values of , and there exists an orthogonal matrix and an orthogonal matrix such that
This creates a full decomposition:
But we often work with the reduced SVD of :
Let's define the rank- approximation to :
When , the rank- approximation to is the closest rank- matrix to , i.e.,
where the Frobenius norm of a matrix is the matrix equivalent to the Euclidean norm of a vector.
A rank- matrix may take up much less space than the original .
The rank- approximation takes up space while itself takes space .
For example, if and , then the rank- approximation takes space of .
In terms of the singular value decomposition,
the best rank- approximation to is formed by taking
and constructing
Furthermore, the distance (in Frobenius norm) of the best rank- approximation from is equal to .
That is, if you construct as shown above, then:
Let's see how low-rank approximations via SVD can be used in practice, and investigate some real data.
Image data often shows low effective rank, meaning that it can be simplified to a low-rank approximation with only a small loss.
For example, here is an original photo:
boat = np.loadtxt('data/boat.dat')
import matplotlib.cm as cm
plt.figure()
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off');
We can think of this as a matrix whose entries are 1-byte grayscale values (numbers between 0 and 255).
print(boat)
[[127. 123. 125. ... 165. 169. 166.] [128. 126. 128. ... 169. 163. 167.] [128. 124. 128. ... 178. 160. 175.] ... [112. 112. 115. ... 101. 97. 104.] [110. 112. 117. ... 104. 93. 105.] [113. 115. 121. ... 102. 95. 97.]]
This image is 512 512 pixels in size. As a matrix, it has rank of 512.
But its effective rank is low. To see why, let's look at the spectrum of this image matrix:
u, s, vt = np.linalg.svd(boat, full_matrices = False)
plt.plot(s)
plt.xlabel('$k$', size = 16)
plt.ylabel(r'$\sigma_k$', size = 16)
plt.title('Singular Values of Boat Image', size = 16);
Based on the plot above, its effective rank is perhaps 40. In fact, the error when we approximate by a rank 40 matrix is only around 10%.
Let's find the closest rank-40 matrix and view it.
Let's find the closest rank-40 matrix to and view it.
We can do this quite easily using the SVD.
We simply construct our approximation of using only the first 40 columns of and top 40 rows of .
#
u, s, vt = np.linalg.svd(boat, full_matrices = False)
s[40:] = 0
boatApprox = u @ np.diag(s) @ vt
plt.figure(figsize=(9,6))
plt.subplot(1,2,1)
plt.imshow(boatApprox,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 40 Boat')
plt.subplot(1,2,2)
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 512 Boat');
# plt.subplots_adjust(wspace=0.5)
Note that the rank-40 boat takes up only 40(512+512)/512*512 = 80/512 = 16% of the space of the original image!
This general principle is what makes image, video, and sound compression effective.
When you
these signals have been compressed using the fact that they are effectively low-rank matrices.
One interesting side effect of dimensionality reduction is that it often reduces the amount of noise in the data. Noise often shows up in the lower-order components of SVD, and dimensionality reduction sometimes removes it.
We'll look at data traffic on the Abilene network:
#
display(Image("images/19-Abilene-map.png", width=500))
Source: Internet2, circa 2005
with open('data/net-traffic/odnames','r') as f:
odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003', freq = '10min', periods = 1008)
Atraf = pd.read_table('data/net-traffic/X', sep=' ', header=None, names=odnames, engine='python')
Atraf.index = dates
Atraf
ATLA-ATLA | ATLA-CHIN | ATLA-DNVR | ATLA-HSTN | ATLA-IPLS | ATLA-KSCY | ATLA-LOSA | ATLA-NYCM | ATLA-SNVA | ATLA-STTL | ... | WASH-CHIN | WASH-DNVR | WASH-HSTN | WASH-IPLS | WASH-KSCY | WASH-LOSA | WASH-NYCM | WASH-SNVA | WASH-STTL | WASH-WASH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2003-09-01 00:00:00 | 8466132.0 | 29346537.0 | 15792104.0 | 3646187.0 | 21756443.0 | 10792818.0 | 14220940.0 | 25014340.0 | 13677284.0 | 10591345.0 | ... | 53296727.0 | 18724766.0 | 12238893.0 | 52782009.0 | 12836459.0 | 31460190.0 | 105796930.0 | 13756184.0 | 13582945.0 | 120384980.0 |
2003-09-01 00:10:00 | 20524567.0 | 28726106.0 | 8030109.0 | 4175817.0 | 24497174.0 | 8623734.0 | 15695839.0 | 36788680.0 | 5607086.0 | 10714795.0 | ... | 68413060.0 | 28522606.0 | 11377094.0 | 60006620.0 | 12556471.0 | 32450393.0 | 70665497.0 | 13968786.0 | 16144471.0 | 135679630.0 |
2003-09-01 00:20:00 | 12864863.0 | 27630217.0 | 7417228.0 | 5337471.0 | 23254392.0 | 7882377.0 | 16176022.0 | 31682355.0 | 6354657.0 | 12205515.0 | ... | 67969461.0 | 37073856.0 | 15680615.0 | 61484233.0 | 16318506.0 | 33768245.0 | 71577084.0 | 13938533.0 | 14959708.0 | 126175780.0 |
2003-09-01 00:30:00 | 10856263.0 | 32243146.0 | 7136130.0 | 3695059.0 | 28747761.0 | 9102603.0 | 16200072.0 | 27472465.0 | 9402609.0 | 10934084.0 | ... | 66616097.0 | 43019246.0 | 12726958.0 | 64027333.0 | 16394673.0 | 33440318.0 | 79682647.0 | 16212806.0 | 16425845.0 | 112891500.0 |
2003-09-01 00:40:00 | 10068533.0 | 30164311.0 | 8061482.0 | 2922271.0 | 35642229.0 | 9104036.0 | 12279530.0 | 29171205.0 | 7624924.0 | 11327807.0 | ... | 66797282.0 | 40408580.0 | 11733121.0 | 54541962.0 | 16769259.0 | 33927515.0 | 81480788.0 | 16757707.0 | 15158825.0 | 123140310.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2003-09-07 23:10:00 | 8849096.0 | 33461807.0 | 5866138.0 | 3786793.0 | 19097140.0 | 10561532.0 | 26092040.0 | 28640962.0 | 8343867.0 | 8820650.0 | ... | 65925313.0 | 21751316.0 | 11058944.0 | 58591021.0 | 17137907.0 | 24297674.0 | 83293655.0 | 17329425.0 | 20865535.0 | 123125390.0 |
2003-09-07 23:20:00 | 9776675.0 | 31474607.0 | 5874654.0 | 11277465.0 | 14314837.0 | 9106198.0 | 26412752.0 | 26168288.0 | 8638782.0 | 9193717.0 | ... | 70075490.0 | 29126443.0 | 12667321.0 | 54571764.0 | 15383038.0 | 25238842.0 | 70015955.0 | 16526455.0 | 16881206.0 | 142106800.0 |
2003-09-07 23:30:00 | 9144621.0 | 32117262.0 | 5762691.0 | 7154577.0 | 17771350.0 | 10149256.0 | 29501669.0 | 25998158.0 | 11343171.0 | 9423042.0 | ... | 68544458.0 | 27817836.0 | 15892668.0 | 50326213.0 | 12098328.0 | 27689197.0 | 73553203.0 | 18022288.0 | 18471915.0 | 127918530.0 |
2003-09-07 23:40:00 | 8802106.0 | 29932510.0 | 5279285.0 | 5950898.0 | 20222187.0 | 10636832.0 | 19613671.0 | 26124024.0 | 8732768.0 | 8217873.0 | ... | 65087776.0 | 28836922.0 | 11075541.0 | 52574692.0 | 11933512.0 | 31632344.0 | 81693475.0 | 16677568.0 | 16766967.0 | 138180630.0 |
2003-09-07 23:50:00 | 8716795.6 | 22660870.0 | 6240626.4 | 5657380.6 | 17406086.0 | 8808588.5 | 15962917.0 | 18367639.0 | 7767967.3 | 7470650.1 | ... | 65599891.0 | 25862152.0 | 11673804.0 | 60086953.0 | 11851656.0 | 30979811.0 | 73577193.0 | 19167646.0 | 19402758.0 | 137288810.0 |
1008 rows × 121 columns
Atraf.shape
(1008, 121)
As we would expect, our traffic matrix has rank 121:
np.linalg.matrix_rank(Atraf)
121
However -- perhaps it has low effective rank.
The numpy
routine for computing SVD is np.linalg.svd
:
u, s, vt = np.linalg.svd(Atraf)
Now let's look at the singular values of Atraf
to see if it can be usefully approximated as a low-rank matrix:
#
fig = plt.figure(figsize=(6,4))
plt.plot(range(1,1+len(s)),s)
plt.xlabel(r'$k$',size=20)
plt.ylabel(r'$\sigma_k$',size=20)
plt.ylim(ymin = 0)
plt.xlim(xmin = -1)
plt.title(r'Singular Values of $A$',size=20);
This classic, sharp-elbow tells us that a few singular values are very large, and most singular values are quite small.
Zooming in for just small values, we can see that the elbow is around 4 - 6 singular values:
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
plt.plot(range(1, 21), s[0:20]/Anorm, '.-')
plt.xlim([0.5, 20])
plt.ylim([0, 1])
plt.xlabel(r'$k$', size=20)
plt.xticks(range(1, 21))
plt.ylabel(r'$\sigma_k$', size=20);
plt.title(r'Singular Values of $A$',size=20);
This pattern of singular values suggests low effective rank.
Let's compute the relative error of a rank- approximation to by measuring the Frobenius norm (which basically treats the matrix as a vector):
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
err = np.cumsum(s[::-1]**2)
err = np.sqrt(err[::-1])
plt.plot(range(0, 20), err[:20]/Anorm, '.-')
plt.xlim([0, 20])
plt.ylim([0, 1])
plt.xticks(range(1, 21))
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'relative F-norm error', size=16)
plt.title(r'Relative Error of rank-$k$ approximation to $A$', size=16);
Remarkably, we are down to 9% relative error using only a rank 20 approximation to .
So instead of storing
we only need to store
which is a 81% reduction in size.
In practice many datasets have low effective rank.
Here are some more examples.
User preferences over items.
Example: the Netflix prize worked with partially-observed matrices like this:
Where the rows correspond to users, the columns to movies, and the entries are ratings.
Although the problem matrix was of size 500,000 18,000, the winning approach modeled the matrix as having rank 20 to 40.
Source: [Koren et al, IEEE Computer, 2009]
Likes on Facebook.
Here, the matrices are
Source: [Viswanath et al., Usenix Security, 2014]
#
display(Image("images/18-facebook.png", width=700))
Social Media Activity.
Here, the matrices are
Source: [Viswanath et al., Usenix Security, 2014]
#
display(Image("images/18-yelp-twitter.png", width=700))
The reduced singular value decomposition of a rank- matrix has the form:
where
How can we understand the low-effective-rank phenomenon in general?
There are two helpful interpretations:
The first interpretation of low-rank behavior is in answering the question:
"What is the strongest pattern in the data?"
Using the SVD we form the low-rank approximation as
with
In this interpretation, we think of each column of as a combination of the columns of .
How can this be helpful?
Consider the set of network traffic traces. There are clearly some common patterns. How can we find them?
#
with open('data/net-traffic/AbileneFlows/odnames','r') as f:
odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003',freq='10min',periods=1008)
Atraf = pd.read_table('data/net-traffic/AbileneFlows/X',sep=' ',header=None,names=odnames,engine='python')
Atraf.index = dates
plt.figure(figsize=(10,8))
for i in range(1,13):
ax = plt.subplot(4,3,i)
Atraf.iloc[:,i-1].plot()
plt.title(odnames[i])
plt.subplots_adjust(hspace=1)
plt.suptitle('Twelve Example Traffic Traces', size=20);
Let's use as our example the first column of .
This happens to be the ATLA-CHIN flow.
The equation above tells us that
In other words, (the first column of ) is the "strongest" pattern occurring in , and its strength is measured by .
Here is an view of the first 2 columns of for the traffic matrix data.
These are the strongest patterns occurring across all of the 121 traces.
#
u, s, vt = np.linalg.svd(Atraf, full_matrices = False)
uframe = pd.DataFrame(u @ np.diag(s), index=pd.date_range('9/1/2003', freq = '10min', periods = 1008))
uframe[0].plot()
uframe[1].plot()
plt.title('First Two Columns of $U$');
The next interpretation of low-rank behavior is that it exposes "latent factors" that describe the data.
Returning to the low-rank decomposition:
In this interpretation, we think of each element of as the inner product of a row of and a row of .
Let's say we are working with a matrix of users and items.
In particular, let the items be movies and matrix entries be ratings, as in the Netflix prize.
Recall the structure from a previous slide:
Then the rating that a user gives a movie is the inner product of a element vector that corresponds to the user, and a element vector that corresponds to the movie.
In other words:
We can therefore think of user 's preferences as being captured by , ie., a point in .
We have described everything we need to know to predict user 's ratings via a -element vector.
The -element vector is called a latent factor.
Likewise, we can think of as a "description" of movie (another latent factor).
The value in using latent factors comes from the summarization of user preferences, and the predictive power one obtains.
For example, the winning entry in the Netflix prize competition modeled user preferences with a 20-element latent factor.
The remarkable thing is that a person's preferences for all 18,000 movies can be reasonably well captured in a 20-element vector!
Here is a figure from the paper that described the winning strategy in the Netflix prize.
It shows a hypothetical latent space in which each user, and each movie, is represented by a latent vector.
#
display(Image("images/19-movie.png", width=500))
Source: Koren et al, IEEE Computer, 2009
In practice, this is perhaps a 20- or 40-dimensional space.
Here are some representations of movies in that space (reduced to 2-D).
Notice how the space seems to capture similarity among movies!
#
display(Image("images/19-netflix.png", width=500))
Source: Koren et al, IEEE Computer, 2009
In summary: