In [1]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';

Announcements¶

  • Homework:
    • Homework 8 due today, 4/14 at 8pm
    • Optional bonus homework out tonight, due next Friday
  • Final exam: Monday, May 8 from 12-2pm
  • Upcoming office hours:
    • Today: Peer tutor Daniel Cho from 12:30-3:30pm in CCDS 16th floor
    • Today: Abhishek Tiwari from 4-5pm on CCDS 13th floor
  • Reading
    • Aggarwal Sections 7.1-7.4
    • Further reading: Deisenroth, Faisal, Ong, Section 4.5

Lecture 32: Reduced SVD¶

[This lecture is based on Prof. Crovella's CS 132 and CS 506 lecture notes.]

Recap from last lecture¶

Last time, we began our investigation into singular value decomposition, the last and most powerful matrix factorization we will consider in this course.

$$A = U\Sigma V^T$$

SVD applies to any matrix $M$, no matter its size, (a)symmetry, invertibility, or diagonalizability.

Using SVD, we can "compress" a matrix of real-world messy data (exactly or approximately) into matrices of smaller dimension.

$$\overbrace{\begin{bmatrix} \vdots & \vdots & & \vdots \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ \vdots & \vdots & & \vdots \end{bmatrix}}^{m\ \text{objects}\ \times\ n\ \text{features}} = \overbrace{\begin{bmatrix} \vdots & & \vdots \\ \sigma_1\mathbf{u}_1 & \cdots & \sigma_k\mathbf{u}_k \\ \vdots & & \vdots \end{bmatrix}}^{m \times k} \times \overbrace{\begin{bmatrix} \cdots & \mathbf{v}_1^T & \cdots \\ & \vdots & \\ \cdots & \mathbf{v}_k^T & \cdots \end{bmatrix}}^{k \times n}$$

Geometrically, SVD says that the linear transformation corresponding to $M$ can be decomposed into a sequence of three steps.

  1. Rotation within $\mathbb{R}^n$, without scaling (i.e., rotating from the standard basis to another orthonormal basis).

  2. Scaling each dimension $i$ by a constant $\sigma_i$, called a singular value.

    • If $m \neq n$, then the space is also transformed from $\mathbb{R}^n$ to $\mathbb{R}^m$ at the same time. Some dimensions might be projected away.
    • Some dimensions might also have such a small value of $\sigma_i$ that they might as well be projected away. That is, we might choose to remove them with little impact on the accuracy of the result.
  3. Rotation within $\mathbb{R}^m$.

    Note that this is not the inverse of step 1: the rotation might be by a different angle, across a different axis, and in a space of different dimension.

In [5]:
# Source: Wikipedia
display(Image("images/18-svd-geometric.png", width=450))

Algebraically, SVD is based on a very simple question:

Among all unit vectors, what is the vector $\mathbf{x}$ that maximizes $\Vert A\mathbf{x}\Vert$?

In other words, in which direction does $A$ create the largest output vector from a unit input?

In [3]:
#
display(Image("images/18-Lay-fig-7-4-1.jpg", width=650))

Unfortunately, the eigenvalues and eigenvectors of $A$ don't directly answer this question.

Eigenvectors have the property that they get mapped by $A$ to a scalar multiple of themselves. They don't get mapped to the farthest points on the ellipse.

In [6]:
#
V = np.array([[2,1],[.1,1]])
L = np.array([[1.2,0],
              [0,0.7]])
A = V @ L @ np.linalg.inv(V)
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
for th in theta_step:
    x = np.array([np.sin(th), np.cos(th)])
    ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [0.3* V[0][0], 0.3*V[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [0.3* V[0][1], 0.3*V[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ax.set_title(r'Eigenvectors of $A$ and the image of the unit circle under $A$');

The largest value of $\Vert A\mathbf{x}\Vert$ is achieved along the long axis of the ellipse. Clearly there is some $\mathbf{x}$ that is mapped to that point by $A$. That $\mathbf{x}$ is what we want to find.

Equivalently, we can maximize $\Vert A\mathbf{x}\Vert^2 = \mathbf{x}^T(A^TA)\mathbf{x}$. Note that:

  • $A^TA$ is a symmetric matrix.
  • This is a quadratic form!
  • Its maximum value, subject to the constraint that $\Vert\mathbf{x}\Vert = 1$, is the largest eigenvalue of $A^TA$.

The eigenvector corresponding to the largest eigenvalue of $A^TA$ indeed shows us where $\Vert A\mathbf{x}\Vert$ is maximized -- where the ellipse is longest.

Also, the other eigenvector of $A^TA$ shows us where the ellipse is narrowest.

In [7]:
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
#for th in theta_step:
#    x = np.array([np.sin(th), np.cos(th)])
#    ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [v[0][0], v[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [v[0][1], v[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [s[0]*u[0][0], s[0]*u[1][0]], [v[0][0], v[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ut.plotArrowVec(ax, [s[1]*u[0][1], s[1]*u[1][1]], [v[0][1], v[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ax.set_title(r'Eigenvectors of $A^TA$ and their images under $A$');

This matrix $A^TA$ has many special properties, which help us in the singular value decomposition of $A$.

Definition. The singular values of $A$ are the square roots of the eigenvalues of $A^TA$. They are denoted by $\sigma_1 = \sqrt{\lambda_1}, \ldots, \sigma_n = \sqrt{\lambda_n}$, and they are arranged in decreasing order $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$.

And the normalized, orthogonal set of eigenvectors of $A^TA$, $\mathbf{v}_1, \ldots, \mathbf{v}_n$, are the right singular vectors of $A$.

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×m⎡⎢ ⎢ ⎢⎣⋮⋮⋮u1u2⋯um⋮⋮⋮⎤⎥ ⎥ ⎥⎦m×n⎡⎢⎣σ100000σ2⋯00000σm00⎤⎥⎦n×n⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣⋯⋯v1⋯⋯⋯⋯v2⋯⋯⋮⋮⋯⋯vn⋯⋯⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦[⋱⋱A⋱⋱]⏞m×n=[⋮⋮⋮u1u2⋯um⋮⋮⋮]⏞m×m[σ100000σ2⋯00000σm00]⏞m×n[⋯⋯v1⋯⋯⋯⋯v2⋯⋯⋮⋮⋯⋯vn⋯⋯]⏞n×n

The Eigenvectors of $A^TA$ Lead To an Orthogonal Basis for $\operatorname{Col}A$¶

Now: we know that the vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ can be an orthogonal set because they are eigenvectors of the symmetric matrix $A^TA$.

However, it's also the case that $A\mathbf{v}_1, \ldots, A\mathbf{v}_n$ are an orthogonal set.

This fact is key to the SVD.

Another way to look at this is to consider that $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$. By definition, $U$ is also orthonormal. In other words, we've found a set of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ that, when transformed by $A$, preserve orthogonality.

Theorem. Suppose $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is an orthonormal basis of $\mathbb{R}^n$ consisting of eigenvectors of $A^TA$, arranged so that the corresponding eigenvalues of $A^TA$ satisfy $\lambda_1 \geq \cdots \geq \lambda_n$, and suppose $A$ has $r$ nonzero singular values.

Then $\{A\mathbf{v}_1, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col}A$, and $\operatorname{rank}A = r$.

Note how surprising this is: while $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ are a basis for $\mathbb{R}^n$, $\operatorname{Col}A$ is a subspace of $\mathbb{R}^m$.

Nonetheless,

  • two eigenvectors $\mathbf{v}_i$ and $\mathbf{v}_j \in \mathbb{R}^n$ are orthogonal, and
  • their images $A\mathbf{v}_i$ and $A\mathbf{v}_j \in \mathbb{R}^m$ are also orthogonal.

Proof. What we need to do is establish that $\{A\mathbf{v}_1, \ldots, A\mathbf{v}_r\}$ is an orthogonal, linearly independent set whose span is $\operatorname{Col}A$.

Because $\mathbf{v}_i$ and $\mathbf{v}_j$ are orthogonal for $i \neq j$,

$$(A\mathbf{v}_i)^T(A\mathbf{v}_j) = \mathbf{v}_i^TA^TA\mathbf{v}_j = \mathbf{v}_i^T(\lambda_j\mathbf{v}_j) = 0.$$

So $\{A\mathbf{v}_1, \ldots, A\mathbf{v}_n\}$ is an orthogonal set.

Furthermore, since the lengths of the vectors $A\mathbf{v}_1, \ldots, A\mathbf{v}_n$ are the singular values of $A$, and since there are $r$ nonzero singular values, $A\mathbf{v}_i \neq \mathbf{0}$ if and only if $1 \leq i \leq r$.

So $A\mathbf{v}_1, \ldots, A\mathbf{v}_r$ are a linearly independent set (because they are orthogonal and all nonzero), and clearly they are each in $\operatorname{Col}A$.

Finally, we just need to show that $\operatorname{Span}\{A\mathbf{v}_1, \ldots, A\mathbf{v}_r\} = \operatorname{Col}A$.

To do this we'll show that for any $\mathbf{y}$ in $\operatorname{Col}A$, we can write $\mathbf{y}$ in terms of $\{A\mathbf{v}_1, \ldots, A\mathbf{v}_r\}$:

Say $\mathbf{y} = A\mathbf{x}$.

Because $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is a basis for $\mathbb{R}^n$, we can write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$, so

$$\mathbf{y} = A\mathbf{x} = c_1A\mathbf{v}_1 + \cdots + c_rA\mathbf{v}_r + \cdots + c_nA\mathbf{v}_n$$
$$= c_1A\mathbf{v}_1 + \cdots + c_rA\mathbf{v}_r.$$

(because $A\mathbf{v}_i = \mathbf{0}$ for $i > r$).

In summary: $\{A\mathbf{v}_1, \ldots, A\mathbf{v}_r\}$ is an (orthogonal) linearly independent set whose span is $\operatorname{Col}A$, so it is an (orthogonal) basis for $\operatorname{Col}A$.

Notice that we have also proved that $\operatorname{rank}A = \dim\operatorname{Col}A = r$.

In other words, if $A$ has $r$ nonzero singular values, $A$ has rank $r$.
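
Here is a minimal numerical check of this theorem, using a deliberately rank-deficient example matrix (made up for illustration): the images $A\mathbf{v}_i$ of the eigenvectors of $A^TA$ are mutually orthogonal, and the number of nonzero singular values equals the rank.

In [ ]:
# Sketch: the images A v_i of the eigenvectors of A^T A are mutually orthogonal,
# and the number of nonzero singular values equals rank(A).
# (A_ex is constructed to have rank 2 for illustration.)
import numpy as np

rng = np.random.default_rng(1)
A_ex = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # 5x4 matrix of rank 2

evals, V = np.linalg.eigh(A_ex.T @ A_ex)   # columns of V: orthonormal eigenvectors of A^T A
AV = A_ex @ V                              # their images under A
gram = AV.T @ AV                           # off-diagonal entries should be ~0
print(np.allclose(gram, np.diag(np.diag(gram))))           # True: the images are orthogonal

s_ex = np.linalg.svd(A_ex, compute_uv=False)
print(np.sum(s_ex > 1e-10), np.linalg.matrix_rank(A_ex))   # both 2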

32.1 The Singular Value Decomposition¶

What we have just proved is that the eigenvectors of $A^TA$ are rather special.

Note that, thinking of $A$ as a linear operator:

  • its domain is $\mathbb{R}^n$, and
  • its range is $\operatorname{Col}A$.

So we have just proved that

  • the set $\{\mathbf{v}_i\}$ is an orthogonal basis for the domain of $A$, and
  • the set $\{A\mathbf{v}_i\}$ is an orthogonal basis for the range of $A$.

Now we can return to the definition of the SVD.

Theorem. Let $A$ be an $m \times n$ matrix with rank $r$. Then there exists an $m \times n$ matrix $\Sigma$ whose diagonal entries are the first $r$ singular values of $A$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, and there exists an $m \times m$ orthogonal matrix $U$ and an $n \times n$ orthogonal matrix $V$ such that

$$A = U\Sigma V^T \qquad\qquad AV = U\Sigma$$

The right singular vectors of $A$ are the corresponding eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ of $A^TA$, normalized to be of unit length.

The left singular vectors of $A$ are the vectors $\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$: that is, take $A$ times each of the right singular vectors and normalize them. (If $\sigma_i = 0$, then pick any vector $\mathbf{u}_i$ that is orthogonal to all of the previous left singular vectors.)

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×m⎡⎢ ⎢ ⎢⎣⋮⋮⋮u1u2⋯um⋮⋮⋮⎤⎥ ⎥ ⎥⎦m×n⎡⎢⎣σ100000σ2⋯00000σm00⎤⎥⎦n×n⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣⋯⋯v1⋯⋯⋯⋯v2⋯⋯⋮⋮⋯⋯vn⋯⋯⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦[⋱⋱A⋱⋱]⏞m×n=[⋮⋮⋮u1u2⋯um⋮⋮⋮]⏞m×m[σ100000σ2⋯00000σm00]⏞m×n[⋯⋯v1⋯⋯⋯⋯v2⋯⋯⋮⋮⋯⋯vn⋯⋯]⏞n×n

Any factorization $A = U\Sigma V^T$, with $U$ and $V$ orthogonal and $\Sigma$ a diagonal matrix, is called a singular value decomposition (SVD) of $A$.

The columns of $U$ are called the left singular vectors and the columns of $V$ are called the right singular vectors of $A$.
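
To make this construction concrete, here is a sketch that builds the singular values, the $\mathbf{v}_i$, and the $\mathbf{u}_i$ directly from the eigendecomposition of $A^TA$ for a small full-rank example (an arbitrary matrix, not one from the lecture), and checks that $A = U\Sigma V^T$.

In [ ]:
# Sketch: build an SVD "by hand" from the eigendecomposition of A^T A.
# (Assumes A_ex has full column rank so every sigma_i > 0; A_ex is a made-up example.)
import numpy as np

rng = np.random.default_rng(2)
A_ex = rng.standard_normal((4, 3))

evals, V = np.linalg.eigh(A_ex.T @ A_ex)
order = np.argsort(evals)[::-1]              # sort eigenvalues in decreasing order
sigma, V = np.sqrt(evals[order]), V[:, order]

U_r = (A_ex @ V) / sigma                     # u_i = (1/sigma_i) A v_i, column by column
print(np.allclose(U_r.T @ U_r, np.eye(3)))             # the u_i are orthonormal
print(np.allclose(U_r @ np.diag(sigma) @ V.T, A_ex))   # A = U Sigma V^T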

Aside: regarding the "Rolls Royce" property, consider how elegant this structure is.

In particular:

  • $A$ is an arbitrary matrix
  • $U$ and $V$ are both orthogonal matrices
  • $\Sigma$ is a diagonal matrix
  • all singular values are positive or zero
  • there are as many positive singular values as the rank of $A$
    • (not part of the theorem but we'll see it is true)
In [8]:
# image source https://bringatrailer.com/listing/1964-rolls-royce-james-young-phanton-v-limosine/
display(Image("images/18-rolls-royce.jpg", width=350))

Proof of SVD¶

We have built up enough tools now that the proof is quite straightforward.

Let $\{\lambda_i\}$ and $\{\mathbf{v}_i\}$ be the eigenvalues and eigenvectors of $A^TA$, and let $\sigma_i = \sqrt{\lambda_i}$.

The starting point is to use the fact that we just proved:

$\{A\mathbf{v}_1, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col}A$.

Next, let us normalize each $A\mathbf{v}_i$ to obtain an orthonormal basis $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$, where

$$\mathbf{u}_i = \frac{1}{\Vert A\mathbf{v}_i\Vert}A\mathbf{v}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$$

Then

$$A\mathbf{v}_i = \sigma_i\mathbf{u}_i \quad (1 \leq i \leq r)$$

Now the rank of $A$ (which is $r$) may be less than $m$.

In that case, add additional orthonormal vectors $\{\mathbf{u}_{r+1}, \ldots, \mathbf{u}_m\}$ to the set so that they span $\mathbb{R}^m$.

Now collect the vectors into matrices.

$$U = \begin{bmatrix}\mathbf{u}_1 & \cdots & \mathbf{u}_m\end{bmatrix}$$

and

$$V = \begin{bmatrix}\mathbf{v}_1 & \cdots & \mathbf{v}_n\end{bmatrix}$$

Recall that these matrices are orthogonal because the $\{\mathbf{v}_i\}$ are orthonormal and the $\{\mathbf{u}_i\}$ are orthonormal, as we previously proved.

So

$$AV = \begin{bmatrix}A\mathbf{v}_1 & \cdots & A\mathbf{v}_r & \overbrace{\mathbf{0}\;\cdots\;\mathbf{0}}^{n-r}\end{bmatrix}$$
$$= \begin{bmatrix}\sigma_1\mathbf{u}_1 & \cdots & \sigma_r\mathbf{u}_r & \mathbf{0} & \cdots & \mathbf{0}\end{bmatrix} = U\Sigma.$$

So

$$AV = U\Sigma$$

Now, $V$ is an orthogonal matrix, so multiplying both sides on the right by $V^T$:

$$U\Sigma V^T = AVV^T = A.$$
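
A one-cell sanity check of the identities $AV = U\Sigma$ and $A = U\Sigma V^T$, using the factors that np.linalg.svd returns for an arbitrary example matrix:

In [ ]:
# Sketch: verify AV = U Sigma and A = U Sigma V^T for the factors from np.linalg.svd.
# (A_ex is an arbitrary example matrix.)
import numpy as np

rng = np.random.default_rng(3)
A_ex = rng.standard_normal((5, 3))

U, s_ex, Vt = np.linalg.svd(A_ex)          # U: 5x5, s_ex: length 3, Vt: 3x3
Sigma = np.zeros(A_ex.shape)
Sigma[:3, :3] = np.diag(s_ex)

print(np.allclose(A_ex @ Vt.T, U @ Sigma))   # True: AV = U Sigma
print(np.allclose(U @ Sigma @ Vt, A_ex))     # True: A = U Sigma V^T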

32.2 Reduced SVD¶

Let's generalize this idea to get a better sense of how the SVD decomposes a matrix.

For the moment, let's consider an $m \times n$ matrix $A$ with $m < n$. (The situation when $m > n$ follows similarly.)

The SVD looks like this, with singular values on the diagonal of $\Sigma$:

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×m⎡⎢ ⎢ ⎢ ⎢⎣⋱U⋱⎤⎥ ⎥ ⎥ ⎥⎦m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱Σ⋱⎤⎥ ⎥ ⎥ ⎥⎦n×n⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣⋱⋱V⊤⋱⋱⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦[⋱⋱A⋱⋱]⏞m×n=[⋱U⋱]⏞m×m[⋱Σ⋱]⏞m×n[⋱⋱V⊤⋱⋱]⏞n×n

Now, let's assume that the number of nonzero singular values $r$ is less than $m$. (In the picture below, $r = 2$.)

Again, other cases would be similar.

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×m⎡⎢ ⎢ ⎢ ⎢⎣⋱U⋱⎤⎥ ⎥ ⎥ ⎥⎦m×n⎡⎢⎣σ100000σr00000000⎤⎥⎦n×n⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣⋱⋱V⊤⋱⋱⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦[⋱⋱A⋱⋱]⏞m×n=[⋱U⋱]⏞m×m[σ100000σr00000000]⏞m×n[⋱⋱V⊤⋱⋱]⏞n×n

In many cases we are only concerned with representing $A$. In this case, we don't need the entirety of $U$ or $V$.

To compute $A$, we only need the $r$ leftmost columns of $U$, and the $r$ upper rows of $V^T$.

That's because all the other values on the diagonal of $\Sigma$ are zero, so they don't contribute anything to $A$. So we might as well zero out the corresponding columns of $U$ and rows of $V^T$.

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×m⎡⎢ ⎢ ⎢⎣⋮⋮0u1ur0⋮⋮0⎤⎥ ⎥ ⎥⎦m×n⎡⎢⎣σ100000σr00000000⎤⎥⎦n×n⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣⋯⋯v1⋯⋯⋯⋯vr⋯⋯000000000000000⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦[⋱⋱A⋱⋱]⏞m×n=[⋮⋮0u1ur0⋮⋮0]⏞m×m[σ100000σr00000000]⏞m×n[⋯⋯v1⋯⋯⋯⋯vr⋯⋯000000000000000]⏞n×n

So we often work with the reduced SVD of $A$:

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×r⎡⎢ ⎢ ⎢⎣⋮⋮u1ur⋮⋮⎤⎥ ⎥ ⎥⎦r×r[σ100σr]r×n[⋯⋯v1⋯⋯⋯⋯vr⋯⋯][⋱⋱A⋱⋱]⏞m×n=[⋮⋮u1ur⋮⋮]⏞m×r[σ100σr]⏞r×r[⋯⋯v1⋯⋯⋯⋯vr⋯⋯]⏞r×n

Note that in the reduced SVD, $\Sigma$ has all nonzero entries on its diagonal, so it can be inverted.

However, we still have that $A = U\Sigma V^T$.
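
In numpy, the reduced factors come from passing full_matrices=False to np.linalg.svd; truncating further to the $r$ nonzero singular values still reconstructs $A$ exactly. A minimal sketch with a made-up rank-deficient matrix:

In [ ]:
# Sketch: the reduced SVD reconstructs A exactly using only the r nonzero singular values.
# (A_ex is built to have rank 2 for illustration.)
import numpy as np

rng = np.random.default_rng(4)
A_ex = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))   # 6x5 matrix of rank 2

U, s_ex, Vt = np.linalg.svd(A_ex, full_matrices=False)   # U: 6x5, s_ex: 5, Vt: 5x5
r = np.sum(s_ex > 1e-10)                                 # number of nonzero singular values

A_reduced = U[:, :r] @ np.diag(s_ex[:r]) @ Vt[:r, :]     # (m x r)(r x r)(r x n)
print(r, np.allclose(A_reduced, A_ex))                   # 2 True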

32.3 Dimensionality Reduction¶

(Reduced) SVD is useful in a variety of data science applications.

In [19]:
#
display(Image("images/18-knife.jpg", width=350))

One application that we will discuss today is the low-rank approximation of a matrix.

Recall that the rank of a matrix $A$ is the largest number of linearly independent columns of $A$.

Or, equivalently, the dimension of $\operatorname{Col}A$.

Let's define the rank-$k$ approximation to $A$:

When $k < \operatorname{Rank}A$, the rank-$k$ approximation to $A$ is the closest rank-$k$ matrix to $A$, i.e.,

$$A^{(k)} = \arg\min_{\operatorname{Rank}B = k} \Vert A - B\Vert_F,$$

where the Frobenius norm of a matrix, $\Vert M\Vert_F = \sqrt{\sum_{i=1}^m\sum_{j=1}^n |m_{i,j}|^2}$, is the matrix equivalent of the Euclidean norm of a vector.
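
For reference, np.linalg.norm computes the Frobenius norm by default for a matrix argument; here is a tiny check of the definition on a made-up matrix:

In [ ]:
# Sketch: the Frobenius norm is the square root of the sum of squared entries.
import numpy as np

M = np.array([[1., 2.],
              [3., 4.]])
print(np.linalg.norm(M))            # Frobenius norm (the default for a matrix argument)
print(np.sqrt(np.sum(M**2)))        # same value: sqrt(30)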

Why is a rank-$k$ approximation valuable?

The reason is that a rank-$k$ matrix may take up much less space than the original $A$.

$$\overbrace{\begin{bmatrix} \vdots & \vdots & & \vdots \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ \vdots & \vdots & & \vdots \end{bmatrix}}^{m \times n} = \overbrace{\begin{bmatrix} \vdots & & \vdots \\ \sigma_1\mathbf{u}_1 & \cdots & \sigma_k\mathbf{u}_k \\ \vdots & & \vdots \end{bmatrix}}^{m \times k} \times \overbrace{\begin{bmatrix} \cdots & \mathbf{v}_1^T & \cdots \\ & \vdots & \\ \cdots & \mathbf{v}_k^T & \cdots \end{bmatrix}}^{k \times n}$$

The rank-$k$ approximation takes up space $(m+n)k$ while $A$ itself takes space $mn$.

For example, if $k = 10$ and $m = n = 1000$, then the rank-$k$ approximation takes space $20000/1000000 = 2\%$ of $A$.

The key to using the SVD for matrix approximation is as follows:

The best rank-kk approximation to any matrix can be found via the SVD.

How do we use SVD to find the best rank-$k$ approximation to $A$?

In terms of the singular value decomposition,

the best rank-$k$ approximation to $A$ is formed by taking

  • $U' =$ the $k$ leftmost columns of $U$,
  • $\Sigma' =$ the $k \times k$ upper left submatrix of $\Sigma$, and
  • $(V')^T =$ the $k$ upper rows of $V^T$,

and constructing

$$A^{(k)} = U'\Sigma'(V')^T.$$

Furthermore, the distance (in Frobenius norm) of the best rank-$k$ approximation $A^{(k)}$ from $A$ is equal to $\sqrt{\sum_{i=k+1}^r \sigma_i^2}$.

That is, if you construct $A^{(k)}$ as shown above, then:

$$\Vert A - A^{(k)}\Vert_F^2 = \sum_{i=k+1}^r \sigma_i^2$$
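
Here is a numerical check of this error formula (a sketch using an arbitrary matrix and an arbitrary choice of $k$): truncate the SVD at rank $k$ and compare the squared Frobenius error with the tail sum of squared singular values.

In [ ]:
# Sketch: ||A - A^(k)||_F^2 equals the sum of the squared discarded singular values.
# (A_ex and k are arbitrary choices for illustration.)
import numpy as np

rng = np.random.default_rng(5)
A_ex = rng.standard_normal((8, 6))
k = 3

U, s_ex, Vt = np.linalg.svd(A_ex, full_matrices=False)
A_k = U[:, :k] @ np.diag(s_ex[:k]) @ Vt[:k, :]        # best rank-k approximation

err_sq = np.linalg.norm(A_ex - A_k, 'fro')**2
print(np.allclose(err_sq, np.sum(s_ex[k:]**2)))       # True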

32.4 Empirical Examples of SVD¶

Let's see how low-rank approximations via SVD can be used in practice, and investigate some real data.

Image compression¶

Image data often shows low effective rank, meaning that it can be simplified to a low-rank approximation with only a small loss.

For example, here is an original photo:

In [10]:
boat = np.loadtxt('data/boat.dat')
import matplotlib.cm as cm
plt.figure()
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off');

We can think of this as a $512 \times 512$ matrix $A$ whose entries are 1-byte grayscale values (numbers between 0 and 255).

In [22]:
print(boat)
[[127. 123. 125. ... 165. 169. 166.]
 [128. 126. 128. ... 169. 163. 167.]
 [128. 124. 128. ... 178. 160. 175.]
 ...
 [112. 112. 115. ... 101.  97. 104.]
 [110. 112. 117. ... 104.  93. 105.]
 [113. 115. 121. ... 102.  95.  97.]]

This image is 512 × 512 pixels in size. As a matrix, it has rank 512.

But its effective rank is low. To see why, let's look at the spectrum of this image matrix:

In [11]:
u, s, vt = np.linalg.svd(boat, full_matrices = False)
plt.plot(s)
plt.xlabel('$k$', size = 16)
plt.ylabel(r'$\sigma_k$', size = 16)
plt.title('Singular Values of Boat Image', size = 16);

Based on the plot above, its effective rank is perhaps 40. In fact, the error when we approximate $A$ by a rank-40 matrix is only around 10%.
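
We can confirm that figure from the singular values alone, using the error formula from earlier. This is just a sketch; it assumes data/boat.dat as loaded above.

In [ ]:
# Sketch: relative Frobenius error of a rank-40 approximation to the boat image,
# computed directly from its singular values (assumes data/boat.dat as loaded above).
import numpy as np

boat_img = np.loadtxt('data/boat.dat')
s_boat = np.linalg.svd(boat_img, compute_uv=False)
k = 40
rel_err = np.sqrt(np.sum(s_boat[k:]**2)) / np.sqrt(np.sum(s_boat**2))
print(f'relative error at rank {k}: {rel_err:.3f}')   # about 10%, per the discussion above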

Let's find the closest rank-40 matrix to $A$ and view it.

We can do this quite easily using the SVD.

We simply construct our approximation of $A$ using only the first 40 columns of $U$ and the top 40 rows of $V^T$.

In [12]:
#
u, s, vt = np.linalg.svd(boat, full_matrices = False)
s[40:] = 0
boatApprox = u @ np.diag(s) @ vt
plt.figure(figsize=(9,6))
plt.subplot(1,2,1)
plt.imshow(boatApprox,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 40 Boat')
plt.subplot(1,2,2)
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 512 Boat');
# plt.subplots_adjust(wspace=0.5)

Note that the rank-40 boat can be stored in $40 \cdot (512 + 512) = 40{,}960$ values, only about 16% of the $512 \cdot 512 = 262{,}144$ values in the original image!

This general principle is what makes image, video, and sound compression effective.

When you

  • watch HDTV, or
  • listen to an MP3, or
  • look at a JPEG image,

these signals have been compressed using the fact that they are effectively low-rank matrices.

One interesting side effect of dimensionality reduction is that it often reduces the amount of noise in the data. Noise often shows up in the lower-order components of SVD, and dimensionality reduction sometimes removes it.

Networks¶

We'll look at data traffic on the Abilene network:

In [13]:
#
display(Image("images/19-Abilene-map.png", width=500))

Source: Internet2, circa 2005

In [17]:
with open('data/net-traffic/odnames','r') as f:
    odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003', freq = '10min', periods = 1008)
Atraf = pd.read_table('data/net-traffic/X', sep='  ', header=None, names=odnames, engine='python')
Atraf.index = dates
Atraf
Out[17]:
ATLA-ATLA ATLA-CHIN ATLA-DNVR ATLA-HSTN ATLA-IPLS ATLA-KSCY ATLA-LOSA ATLA-NYCM ATLA-SNVA ATLA-STTL ... WASH-CHIN WASH-DNVR WASH-HSTN WASH-IPLS WASH-KSCY WASH-LOSA WASH-NYCM WASH-SNVA WASH-STTL WASH-WASH
2003-09-01 00:00:00 8466132.0 29346537.0 15792104.0 3646187.0 21756443.0 10792818.0 14220940.0 25014340.0 13677284.0 10591345.0 ... 53296727.0 18724766.0 12238893.0 52782009.0 12836459.0 31460190.0 105796930.0 13756184.0 13582945.0 120384980.0
2003-09-01 00:10:00 20524567.0 28726106.0 8030109.0 4175817.0 24497174.0 8623734.0 15695839.0 36788680.0 5607086.0 10714795.0 ... 68413060.0 28522606.0 11377094.0 60006620.0 12556471.0 32450393.0 70665497.0 13968786.0 16144471.0 135679630.0
2003-09-01 00:20:00 12864863.0 27630217.0 7417228.0 5337471.0 23254392.0 7882377.0 16176022.0 31682355.0 6354657.0 12205515.0 ... 67969461.0 37073856.0 15680615.0 61484233.0 16318506.0 33768245.0 71577084.0 13938533.0 14959708.0 126175780.0
2003-09-01 00:30:00 10856263.0 32243146.0 7136130.0 3695059.0 28747761.0 9102603.0 16200072.0 27472465.0 9402609.0 10934084.0 ... 66616097.0 43019246.0 12726958.0 64027333.0 16394673.0 33440318.0 79682647.0 16212806.0 16425845.0 112891500.0
2003-09-01 00:40:00 10068533.0 30164311.0 8061482.0 2922271.0 35642229.0 9104036.0 12279530.0 29171205.0 7624924.0 11327807.0 ... 66797282.0 40408580.0 11733121.0 54541962.0 16769259.0 33927515.0 81480788.0 16757707.0 15158825.0 123140310.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2003-09-07 23:10:00 8849096.0 33461807.0 5866138.0 3786793.0 19097140.0 10561532.0 26092040.0 28640962.0 8343867.0 8820650.0 ... 65925313.0 21751316.0 11058944.0 58591021.0 17137907.0 24297674.0 83293655.0 17329425.0 20865535.0 123125390.0
2003-09-07 23:20:00 9776675.0 31474607.0 5874654.0 11277465.0 14314837.0 9106198.0 26412752.0 26168288.0 8638782.0 9193717.0 ... 70075490.0 29126443.0 12667321.0 54571764.0 15383038.0 25238842.0 70015955.0 16526455.0 16881206.0 142106800.0
2003-09-07 23:30:00 9144621.0 32117262.0 5762691.0 7154577.0 17771350.0 10149256.0 29501669.0 25998158.0 11343171.0 9423042.0 ... 68544458.0 27817836.0 15892668.0 50326213.0 12098328.0 27689197.0 73553203.0 18022288.0 18471915.0 127918530.0
2003-09-07 23:40:00 8802106.0 29932510.0 5279285.0 5950898.0 20222187.0 10636832.0 19613671.0 26124024.0 8732768.0 8217873.0 ... 65087776.0 28836922.0 11075541.0 52574692.0 11933512.0 31632344.0 81693475.0 16677568.0 16766967.0 138180630.0
2003-09-07 23:50:00 8716795.6 22660870.0 6240626.4 5657380.6 17406086.0 8808588.5 15962917.0 18367639.0 7767967.3 7470650.1 ... 65599891.0 25862152.0 11673804.0 60086953.0 11851656.0 30979811.0 73577193.0 19167646.0 19402758.0 137288810.0

1008 rows × 121 columns

In [7]:
Atraf.shape
Out[7]:
(1008, 121)

As we would expect, our traffic matrix has rank 121:

In [8]:
np.linalg.matrix_rank(Atraf)
Out[8]:
121

However -- perhaps it has low effective rank.

The numpy routine for computing SVD is np.linalg.svd:

In [9]:
u, s, vt = np.linalg.svd(Atraf)

Now let's look at the singular values of Atraf to see if it can be usefully approximated as a low-rank matrix:

In [10]:
#
fig = plt.figure(figsize=(6,4))
plt.plot(range(1,1+len(s)),s)
plt.xlabel(r'$k$',size=20)
plt.ylabel(r'$\sigma_k$',size=20)
plt.ylim(ymin = 0)
plt.xlim(xmin = -1)
plt.title(r'Singular Values of $A$',size=20);

This classic "sharp elbow" pattern tells us that a few singular values are very large, and most singular values are quite small.

Zooming in on just the small $k$ values, we can see that the elbow is around 4-6 singular values:

In [11]:
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
plt.plot(range(1, 21), s[0:20]/Anorm, '.-')
plt.xlim([0.5, 20])
plt.ylim([0, 1])
plt.xlabel(r'$k$', size=20)
plt.xticks(range(1, 21))
plt.ylabel(r'$\sigma_k$', size=20);
plt.title(r'Singular Values of $A$',size=20);

This pattern of singular values suggests low effective rank.

Let's compute the relative error of a rank-$k$ approximation to $A$ by measuring the Frobenius norm $\Vert A - A^{(k)}\Vert_F$ (which basically treats the matrix as a vector):

In [8]:
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
err = np.cumsum(s[::-1]**2)
err = np.sqrt(err[::-1])
plt.plot(range(0, 20), err[:20]/Anorm, '.-')
plt.xlim([0, 20])
plt.ylim([0, 1])
plt.xticks(range(1, 21))
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'relative F-norm error', size=16)
plt.title(r'Relative Error of rank-$k$ approximation to $A$', size=16);

Remarkably, we are down to 9% relative error using only a rank-20 approximation to $A$.

So instead of storing

  • $mn$ = (1008 · 121) = 121,968 values,

we only need to store

  • $k(m+n)$ = 20 · (1008 + 121) = 22,580 values,

which is an 81% reduction in size.
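
The same arithmetic as a quick sketch in code (the shapes are those of Atraf above; k = 20 is the rank used in the text):

In [ ]:
# Sketch: storage needed for the rank-k factors versus the full traffic matrix.
m, n = 1008, 121                       # shape of Atraf above
k = 20
full_size = m * n                      # 121,968 values
reduced_size = k * (m + n)             # 22,580 values
print(full_size, reduced_size)
print(f'reduction: {1 - reduced_size/full_size:.0%}')   # about 81%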

Low Effective Rank is Common¶

In practice many datasets have low effective rank.

Here are some more examples.

User preferences over items.

Example: the Netflix prize worked with partially-observed matrices like this:

$$\begin{bmatrix} & & \vdots & & \\ & 3 & 2 & & 1 \\ 1 & & 1 & & \\ & \cdots & & 2 & 4 \\ \cdots & 5 & 5 & 4 & 1 \\ & & 1 & 5 & \\ & & \vdots & & \end{bmatrix}$$

Where the rows correspond to users, the columns to movies, and the entries are ratings.

Although the problem matrix was of size 500,000 × 18,000, the winning approach modeled the matrix as having rank 20 to 40.

Source: [Koren et al, IEEE Computer, 2009]

Likes on Facebook.

Here, the matrices are

  1. Number of likes: Timebins × Users
  2. Number of likes: Users × Page Categories
  3. Entropy of likes across categories: Timebins × Users

Source: [Viswanath et al., Usenix Security, 2014]

In [16]:
#
display(Image("images/18-facebook.png", width=700))

Social Media Activity.

Here, the matrices are

  1. Number of Yelp reviews: Timebins × Users
  2. Number of Yelp reviews: Users × Yelp Categories
  3. Number of Tweets: Users × Topic Categories

Source: [Viswanath et al., Usenix Security, 2014]

In [17]:
#
display(Image("images/18-yelp-twitter.png", width=700))

Summary¶

The reduced singular value decomposition of a rank-$r$ matrix $A$ has the form:

$$A = U\Sigma V^T$$

where

  1. $U$ is $m \times r$.
  2. The columns of $U$ are mutually orthogonal and unit length, i.e., $U^TU = I$.
  3. $V$ is $n \times r$.
  4. The columns of $V$ are mutually orthogonal and unit length, i.e., $V^TV = I$.
  5. The matrix $\Sigma$ is an $r \times r$ diagonal matrix, whose diagonal values are $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$.
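
These properties are easy to confirm with numpy's reduced SVD (full_matrices=False); here is a sketch using an arbitrary full-column-rank example, so that $r = n$:

In [ ]:
# Sketch: check the reduced-SVD properties for a full-column-rank example (so r = n).
import numpy as np

rng = np.random.default_rng(6)
A_ex = rng.standard_normal((7, 4))

U, s_ex, Vt = np.linalg.svd(A_ex, full_matrices=False)   # U: 7x4, s_ex: 4, Vt: 4x4
print(np.allclose(U.T @ U, np.eye(4)))                   # 2. U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(4)))                 # 4. V^T V = I
print(np.all(np.diff(s_ex) <= 0), np.all(s_ex > 0))      # 5. sigma_1 >= ... >= sigma_r > 0
print(np.allclose(U @ np.diag(s_ex) @ Vt, A_ex))         # A = U Sigma V^T
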
In [ ]: