In [1]:
#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None

mp.rcParams['animation.html'] = 'jshtml';
In [2]:
#
def centerAxes(ax):
    ax.spines['left'].set_position('zero')
    ax.spines['right'].set_color('none')
    ax.spines['bottom'].set_position('zero')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    bounds = np.array([ax.axes.get_xlim(), ax.axes.get_ylim()])
    ax.plot(bounds[0][0],bounds[1][0],'')
    ax.plot(bounds[0][1],bounds[1][1],'')

Announcements¶

  • Homework:
    • Optional bonus homework out, due Friday
      • Will replace your lowest homework score
  • Final exam: Monday, May 8 from 12-2pm
  • Upcoming office hours:
    • Today: Prof McDonald from 4:30-6pm in CCDS 1341
    • Tomorrow: Peer tutor Rohan Anand from 1:30-3pm in CCDS 16th floor
  • Reading
    • Aggarwal Sections 8.1-8.2

Lecture 33: SVD and Principal Component Analysis¶

[This lecture is based on Prof. Crovella's CS 132 and CS 506 lecture notes, and Lior Pachter's blog.]

Recap from last lecture¶

Last week, we began our investigation into singular value decomposition, the last and most powerful matrix factorization we will consider in this course.

$$A = U\Sigma V^T$$

Geometrically, SVD says that the linear transformation corresponding to $A$ can be decomposed into a sequence of three steps.

  1. Rotation within $\mathbb{R}^n$, without scaling (i.e., rotating from the standard basis to another orthonormal basis).

  2. Scaling each dimension $i$ by a constant $\sigma_i$, called a singular value.

    • If $m \neq n$, then the space is also transformed from $\mathbb{R}^n$ to $\mathbb{R}^m$ at the same time. Some dimensions might be projected away.
    • Some dimensions might also have such a small value of $\sigma_i$ that they might as well be projected away. That is, we might choose to remove them with little impact on the accuracy of the result.
  3. Rotation within $\mathbb{R}^m$.

    Note that this is not the inverse of step 1: the rotation might be by a different angle, across a different axis, and in a space of different dimension.
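
To make these three steps concrete, here is a minimal sketch (using a small hypothetical 2 × 2 matrix) that applies $V^T$ (rotate), then $\Sigma$ (scale), then $U$ (rotate) to a vector, and checks that the result agrees with multiplying by $A$ directly.

In [ ]:
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])           # hypothetical example matrix
x = np.array([1., 1.])             # an arbitrary input vector

U, s, Vt = np.linalg.svd(A)

step1 = Vt @ x                     # 1. rotate into the right singular basis
step2 = np.diag(s) @ step1         # 2. scale coordinate i by sigma_i
step3 = U @ step2                  # 3. rotate into the left singular basis

print(step3, A @ x)                # the two results agree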

In [27]:
# Source: Wikipedia
display(Image("images/18-svd-geometric.png", width=350))

We talked about constructing the singular value decomposition using the properties of $A^TA$.

Definition. The singular values of $A$ are the square roots of the eigenvalues of $A^TA$. They are denoted by $\sigma_1 = \sqrt{\lambda_1}, \ldots, \sigma_n = \sqrt{\lambda_n}$, and they are arranged in decreasing order $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$.

The right singular vectors of $A$ are the corresponding eigenvectors $v_1, \ldots, v_n$ of $A^TA$, normalized to be of unit length.

The left singular vectors of $A$ are the vectors $u_i = \frac{1}{\sigma_i}Av_i$: that is, take $A$ times each of the right singular vectors and normalize them. (If $\sigma_i = 0$, then pick any vector $u_i$ that is orthogonal to all of the previous left singular vectors.)
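
As a sanity check on this construction, here is a small sketch (with a hypothetical 2 × 3 matrix) that builds the singular values and vectors from the eigendecomposition of $A^TA$ and compares the singular values with those returned by np.linalg.svd. Individual singular vectors may differ by a sign, since each is only determined up to sign.

In [ ]:
import numpy as np

A = np.array([[4., 11., 14.],
              [8.,  7., -2.]])               # hypothetical example

# eigendecomposition of the symmetric matrix A^T A
lam, V = np.linalg.eigh(A.T @ A)
idx = np.argsort(lam)[::-1]                  # sort eigenvalues in decreasing order
lam, V = lam[idx], V[:, idx]

sigma = np.sqrt(np.maximum(lam, 0))          # singular values
r = np.sum(sigma > 1e-10)                    # numerical rank

# left singular vectors u_i = (1/sigma_i) A v_i, for sigma_i > 0
U = (A @ V[:, :r]) / sigma[:r]

print(np.round(sigma, 4))
print(np.round(np.linalg.svd(A)[1], 4))      # same singular values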

Theorem. Let $A$ be an $m \times n$ matrix with rank $r$. Then there exists an $m \times n$ matrix $\Sigma$ whose diagonal entries are the first $r$ singular values of $A$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, and there exists an $m \times m$ orthogonal matrix $U$ and an $n \times n$ orthogonal matrix $V$ such that

$$A = U\Sigma V^T$$

This creates a full decomposition:

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×m⎡⎢ ⎢ ⎢⎣⋮⋮0u1ur0⋮⋮0⎤⎥ ⎥ ⎥⎦m×n⎡⎢⎣σ100000σr00000000⎤⎥⎦n×n⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣⋯⋯v1⋯⋯⋯⋯vr⋯⋯000000000000000⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦[⋱⋱A⋱⋱]⏞m×n=[⋮⋮0u1ur0⋮⋮0]⏞m×m[σ100000σr00000000]⏞m×n[⋯⋯v1⋯⋯⋯⋯vr⋯⋯000000000000000]⏞n×n

But we often work with the reduced SVD of AA:

m×n⎡⎢ ⎢ ⎢ ⎢⎣⋱⋱A⋱⋱⎤⎥ ⎥ ⎥ ⎥⎦=m×r⎡⎢ ⎢ ⎢⎣⋮⋮u1ur⋮⋮⎤⎥ ⎥ ⎥⎦r×r[σ100σr]r×n[⋯⋯v1⋯⋯⋯⋯vr⋯⋯][⋱⋱A⋱⋱]⏞m×n=[⋮⋮u1ur⋮⋮]⏞m×r[σ100σr]⏞r×r[⋯⋯v1⋯⋯⋯⋯vr⋯⋯]⏞r×n

Dimensionality Reduction¶

Let's define the rank-$k$ approximation to $A$:

When $k < \operatorname{Rank} A$, the rank-$k$ approximation to $A$ is the closest rank-$k$ matrix to $A$, i.e.,

$$A^{(k)} = \arg\min_{\operatorname{Rank} B = k} \|A - B\|_F,$$

where the Frobenius norm of a matrix, $\|M\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |m_{i,j}|^2}$, is the matrix equivalent to the Euclidean norm of a vector.
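
As a quick check of this definition on a hypothetical matrix, numpy computes the same quantity with np.linalg.norm:

In [ ]:
import numpy as np

M = np.array([[1., 2.],
              [3., 4.]])

print(np.sqrt(np.sum(M**2)))        # entrywise definition
print(np.linalg.norm(M, 'fro'))     # numpy's Frobenius norm -- same value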

A rank-$k$ matrix may take up much less space than the original $A$.

$$
m\left\{\;\overbrace{\begin{bmatrix} \vdots & \vdots & & \vdots \\ a_1 & a_2 & \cdots & a_n \\ \vdots & \vdots & & \vdots \end{bmatrix}}^{n}\right.
=
\overbrace{\begin{bmatrix} \vdots & & \vdots \\ \sigma_1 u_1 & \cdots & \sigma_k u_k \\ \vdots & & \vdots \end{bmatrix}}^{k}
\times
\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_k & \cdots \end{bmatrix}
$$

The rank-$k$ approximation takes up space $(m+n)k$ while $A$ itself takes space $mn$.

For example, if $k = 10$ and $m = n = 1000$, then the rank-$k$ approximation takes space $20000/1000000 = 2\%$ of $A$.

In terms of the singular value decomposition,

the best rank-$k$ approximation to $A$ is formed by taking

  • $U' =$ the $k$ leftmost columns of $U$,
  • $\Sigma' =$ the $k \times k$ upper left submatrix of $\Sigma$, and
  • $(V')^T =$ the $k$ upper rows of $V^T$,

and constructing

$$A^{(k)} = U'\Sigma'(V')^T.$$

Furthermore, the distance (in Frobenius norm) of the best rank-$k$ approximation $A^{(k)}$ from $A$ is equal to $\sqrt{\sum_{i=k+1}^r \sigma_i^2}$.

That is, if you construct $A^{(k)}$ as shown above, then:

$$\|A - A^{(k)}\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$$
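
Here is a minimal sketch (random hypothetical matrix) that builds $A^{(k)}$ from the truncated SVD and confirms that the Frobenius error matches $\sqrt{\sum_{i=k+1}^r \sigma_i^2}$:

In [ ]:
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 6))                     # hypothetical full-rank matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation

print(np.linalg.norm(A - Ak, 'fro'))            # direct computation
print(np.sqrt(np.sum(s[k:]**2)))                # from the singular value tail -- same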

33.1 Empirical Examples of SVD¶

Let's see how low-rank approximations via SVD can be used in practice, and investigate some real data.

Image compression¶

Image data often shows low effective rank, meaning that it can be simplified to a low-rank approximation with only a small loss.

For example, here is an original photo:

In [3]:
boat = np.loadtxt('data/boat.dat')
import matplotlib.cm as cm
plt.figure()
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off');

We can think of this as a $512 \times 512$ matrix $A$ whose entries are 1-byte grayscale values (numbers between 0 and 255).

In [22]:
print(boat)
[[127. 123. 125. ... 165. 169. 166.]
 [128. 126. 128. ... 169. 163. 167.]
 [128. 124. 128. ... 178. 160. 175.]
 ...
 [112. 112. 115. ... 101.  97. 104.]
 [110. 112. 117. ... 104.  93. 105.]
 [113. 115. 121. ... 102.  95.  97.]]

This image is 512 × 512 pixels in size. As a matrix, it has rank 512.

But its effective rank is low. To see why, let's look at the spectrum of this image matrix:

In [4]:
u, s, vt = np.linalg.svd(boat, full_matrices = False)
plt.plot(s)
plt.xlabel('$k$', size = 16)
plt.ylabel(r'$\sigma_k$', size = 16)
plt.title('Singular Values of Boat Image', size = 16);

Based on the plot above, its effective rank is perhaps 40. In fact, the error when we approximate $A$ by a rank-40 matrix is only around 10%.
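
To see where the "around 10%" figure comes from, the relative error can be read directly off the tail of the singular values (a quick check, reusing the boat matrix loaded above):

In [ ]:
# relative Frobenius error of the rank-40 approximation of the boat image,
# computed from the singular value tail; should come out around 0.1
u, s, vt = np.linalg.svd(boat, full_matrices=False)
k = 40
print(np.sqrt(np.sum(s[k:]**2)) / np.sqrt(np.sum(s**2)))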

Let's find the closest rank-40 matrix to $A$ and view it.

We can do this quite easily using the SVD.

We simply construct our approximation of $A$ using only the first 40 columns of $U$ and the top 40 rows of $V^T$.

In [5]:
#
u, s, vt = np.linalg.svd(boat, full_matrices = False)
s[40:] = 0
boatApprox = u @ np.diag(s) @ vt
plt.figure(figsize=(9,6))
plt.subplot(1,2,1)
plt.imshow(boatApprox,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 40 Boat')
plt.subplot(1,2,2)
plt.imshow(boat,cmap = cm.Greys_r)
plt.axis('off')
plt.title('Rank 512 Boat');
# plt.subplots_adjust(wspace=0.5)

Note that the rank-40 boat takes up only $40 \cdot (512 + 512)/(512 \cdot 512) = 80/512 \approx 16\%$ of the space of the original image!
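
That figure is just the $(m+n)k$ versus $mn$ storage count from earlier:

In [ ]:
k, m, n = 40, 512, 512
print(k * (m + n) / (m * n))    # 0.15625, i.e. about 16% of the original storage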

This general principle is what makes image, video, and sound compression effective.

When you

  • watch HDTV, or
  • listen to an MP3, or
  • look at a JPEG image,

these signals have been compressed using the fact that they are effectively low-rank matrices.

One interesting side effect of dimensionality reduction is that it often reduces the amount of noise in the data. Noise often shows up in the lower-order components of SVD, and dimensionality reduction sometimes removes it.
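
Here is a small hedged illustration of that effect: start from an exactly rank-5 hypothetical matrix, add noise, and observe that truncating the SVD back to rank 5 gives something closer to the clean matrix than the noisy one is.

In [ ]:
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 100, 80, 5

clean = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # exactly rank 5
noisy = clean + 0.1 * rng.normal(size=(m, n))               # add a little noise

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]            # keep only the top 5 components

print(np.linalg.norm(noisy - clean, 'fro'))       # error of the noisy matrix
print(np.linalg.norm(denoised - clean, 'fro'))    # smaller after truncation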

Networks¶

We'll look at data traffic on the Abilene network:

In [6]:
#
display(Image("images/19-Abilene-map.png", width=500))

Source: Internet2, circa 2005

In [8]:
with open('data/net-traffic/odnames','r') as f:
    odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003', freq = '10min', periods = 1008)
Atraf = pd.read_table('data/net-traffic/X', sep='  ', header=None, names=odnames, engine='python')
Atraf.index = dates
Atraf
Out[8]:
ATLA-ATLA ATLA-CHIN ATLA-DNVR ATLA-HSTN ATLA-IPLS ATLA-KSCY ATLA-LOSA ATLA-NYCM ATLA-SNVA ATLA-STTL ... WASH-CHIN WASH-DNVR WASH-HSTN WASH-IPLS WASH-KSCY WASH-LOSA WASH-NYCM WASH-SNVA WASH-STTL WASH-WASH
2003-09-01 00:00:00 8466132.0 29346537.0 15792104.0 3646187.0 21756443.0 10792818.0 14220940.0 25014340.0 13677284.0 10591345.0 ... 53296727.0 18724766.0 12238893.0 52782009.0 12836459.0 31460190.0 105796930.0 13756184.0 13582945.0 120384980.0
2003-09-01 00:10:00 20524567.0 28726106.0 8030109.0 4175817.0 24497174.0 8623734.0 15695839.0 36788680.0 5607086.0 10714795.0 ... 68413060.0 28522606.0 11377094.0 60006620.0 12556471.0 32450393.0 70665497.0 13968786.0 16144471.0 135679630.0
2003-09-01 00:20:00 12864863.0 27630217.0 7417228.0 5337471.0 23254392.0 7882377.0 16176022.0 31682355.0 6354657.0 12205515.0 ... 67969461.0 37073856.0 15680615.0 61484233.0 16318506.0 33768245.0 71577084.0 13938533.0 14959708.0 126175780.0
2003-09-01 00:30:00 10856263.0 32243146.0 7136130.0 3695059.0 28747761.0 9102603.0 16200072.0 27472465.0 9402609.0 10934084.0 ... 66616097.0 43019246.0 12726958.0 64027333.0 16394673.0 33440318.0 79682647.0 16212806.0 16425845.0 112891500.0
2003-09-01 00:40:00 10068533.0 30164311.0 8061482.0 2922271.0 35642229.0 9104036.0 12279530.0 29171205.0 7624924.0 11327807.0 ... 66797282.0 40408580.0 11733121.0 54541962.0 16769259.0 33927515.0 81480788.0 16757707.0 15158825.0 123140310.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2003-09-07 23:10:00 8849096.0 33461807.0 5866138.0 3786793.0 19097140.0 10561532.0 26092040.0 28640962.0 8343867.0 8820650.0 ... 65925313.0 21751316.0 11058944.0 58591021.0 17137907.0 24297674.0 83293655.0 17329425.0 20865535.0 123125390.0
2003-09-07 23:20:00 9776675.0 31474607.0 5874654.0 11277465.0 14314837.0 9106198.0 26412752.0 26168288.0 8638782.0 9193717.0 ... 70075490.0 29126443.0 12667321.0 54571764.0 15383038.0 25238842.0 70015955.0 16526455.0 16881206.0 142106800.0
2003-09-07 23:30:00 9144621.0 32117262.0 5762691.0 7154577.0 17771350.0 10149256.0 29501669.0 25998158.0 11343171.0 9423042.0 ... 68544458.0 27817836.0 15892668.0 50326213.0 12098328.0 27689197.0 73553203.0 18022288.0 18471915.0 127918530.0
2003-09-07 23:40:00 8802106.0 29932510.0 5279285.0 5950898.0 20222187.0 10636832.0 19613671.0 26124024.0 8732768.0 8217873.0 ... 65087776.0 28836922.0 11075541.0 52574692.0 11933512.0 31632344.0 81693475.0 16677568.0 16766967.0 138180630.0
2003-09-07 23:50:00 8716795.6 22660870.0 6240626.4 5657380.6 17406086.0 8808588.5 15962917.0 18367639.0 7767967.3 7470650.1 ... 65599891.0 25862152.0 11673804.0 60086953.0 11851656.0 30979811.0 73577193.0 19167646.0 19402758.0 137288810.0

1008 rows × 121 columns

In [10]:
Atraf.shape
Out[10]:
(1008, 121)

As we would expect, our traffic matrix has rank 121:

In [11]:
np.linalg.matrix_rank(Atraf)
Out[11]:
121

However -- perhaps it has low effective rank.

The numpy routine for computing SVD is np.linalg.svd:

In [12]:
u, s, vt = np.linalg.svd(Atraf)

Now let's look at the singular values of Atraf to see if it can be usefully approximated as a low-rank matrix:

In [13]:
#
fig = plt.figure(figsize=(6,4))
plt.plot(range(1,1+len(s)),s)
plt.xlabel(r'$k$',size=20)
plt.ylabel(r'$\sigma_k$',size=20)
plt.ylim(ymin = 0)
plt.xlim(xmin = -1)
plt.title(r'Singular Values of $A$',size=20);

This classic sharp-elbow pattern tells us that a few singular values are very large, and most singular values are quite small.

Zooming in on just the small $k$ values, we can see that the elbow is around 4 to 6 singular values:

In [14]:
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
plt.plot(range(1, 21), s[0:20]/Anorm, '.-')
plt.xlim([0.5, 20])
plt.ylim([0, 1])
plt.xlabel(r'$k$', size=20)
plt.xticks(range(1, 21))
plt.ylabel(r'$\sigma_k$', size=20);
plt.title(r'Singular Values of $A$',size=20);

This pattern of singular values suggests low effective rank.

Let's compute the relative error of a rank-$k$ approximation to $A$ by measuring the Frobenius norm $\|A - A^{(k)}\|_F$ (which basically treats the matrix as a vector):

In [15]:
#
fig = plt.figure(figsize = (6, 4))
Anorm = np.linalg.norm(Atraf)
err = np.cumsum(s[::-1]**2)
err = np.sqrt(err[::-1])
plt.plot(range(0, 20), err[:20]/Anorm, '.-')
plt.xlim([0, 20])
plt.ylim([0, 1])
plt.xticks(range(1, 21))
plt.xlabel(r'$k$', size = 16)
plt.ylabel(r'relative F-norm error', size=16)
plt.title(r'Relative Error of rank-$k$ approximation to $A$', size=16);

Remarkably, we are down to 9% relative error using only a rank-20 approximation to $A$.

So instead of storing

  • $mn =$ (1008 $\cdot$ 121) = 121,968 values,

we only need to store

  • $k(m+n) =$ 20 $\cdot$ (1008 + 121) = 22,580 values,

which is an 81% reduction in size.
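
As a sanity check, we can build the rank-20 approximation of Atraf explicitly and measure its relative error directly (reusing the data loaded above):

In [ ]:
k = 20
u, s, vt = np.linalg.svd(Atraf, full_matrices=False)
Atraf_k = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# relative Frobenius error -- should be close to the ~9% quoted above
print(np.linalg.norm(Atraf.values - Atraf_k) / np.linalg.norm(Atraf.values))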

Low Effective Rank is Common¶

In practice many datasets have low effective rank.

Here are some more examples.

User preferences over items.

Example: the Netflix prize worked with partially-observed matrices like this:

$$
\begin{bmatrix}
 & & & \vdots & & \\
 & 3 & & & 2 & \\
1 & & 1 & & & 1 \\
\cdots & & 2 & & 4 & \cdots \\
 & 5 & & 5 & & 4 \\
 & 1 & & & 1 & 5 \\
 & & & \vdots & &
\end{bmatrix}
$$

Here, the rows correspond to users, the columns to movies, and the entries are ratings.

Although the problem matrix was of size 500,000 × 18,000, the winning approach modeled the matrix as having rank 20 to 40.

Source: [Koren et al, IEEE Computer, 2009]

Likes on Facebook.

Here, the matrices are

  1. Number of likes: Timebins × Users
  2. Number of likes: Users × Page Categories
  3. Entropy of likes across categories: Timebins × Users

Source: [Viswanath et al., Usenix Security, 2014]

In [16]:
#
display(Image("images/18-facebook.png", width=700))

Social Media Activity.

Here, the matrices are

  1. Number of Yelp reviews: Timebins × Users
  2. Number of Yelp reviews: Users × Yelp Categories
  3. Number of Tweets: Users × Topic Categories

Source: [Viswanath et al., Usenix Security, 2014]

In [17]:
#
display(Image("images/18-yelp-twitter.png", width=700))

Summary¶

The reduced singular value decomposition of a rank-$r$ matrix $A$ has the form:

$$A = U\Sigma V^T$$

where

  1. $U$ is $m \times r$.
  2. The columns of $U$ are mutually orthogonal and unit length, i.e., $U^TU = I$.
  3. $V$ is $n \times r$.
  4. The columns of $V$ are mutually orthogonal and unit length, i.e., $V^TV = I$.
  5. The matrix $\Sigma$ is an $r \times r$ diagonal matrix, whose diagonal values are $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$.

33.2 Interpretations of Low Effective Rank¶

How can we understand the low-effective-rank phenomenon in general?

There are two helpful interpretations:

  1. Common Patterns
  2. Latent Factors

Low Rank Implies Common Patterns¶

The first interpretation of low-rank behavior is in answering the question:

"What is the strongest pattern in the data?"

Using the SVD, we form the low-rank approximation by taking

  • $U' =$ the $k$ leftmost columns of $U$,
  • $\Sigma' =$ the $k \times k$ upper left submatrix of $\Sigma$, and
  • $V' =$ the $k$ leftmost columns of $V$,

and constructing

$$A \approx U'\Sigma'(V')^T$$

In this interpretation, we think of each column of $A$ as a combination of the columns of $U'$.

How can this be helpful?

Consider the set of network traffic traces. There are clearly some common patterns. How can we find them?

In [13]:
#
with open('data/net-traffic/AbileneFlows/odnames','r') as f:
    odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003',freq='10min',periods=1008)
Atraf = pd.read_table('data/net-traffic/AbileneFlows/X',sep='  ',header=None,names=odnames,engine='python')
Atraf.index = dates
plt.figure(figsize=(10,8))
for i in range(1,13):
    ax = plt.subplot(4,3,i)
    Atraf.iloc[:,i-1].plot()
    plt.title(odnames[i-1])   # label matches the column actually plotted
plt.subplots_adjust(hspace=1)
plt.suptitle('Twelve Example Traffic Traces', size=20);

Let's use as our example $a_1$, the first column of $A$.

This happens to be the ATLA-CHIN flow.

The equation above tells us that

$$a_1 \approx v_{11}\sigma_1 u_1 + v_{12}\sigma_2 u_2 + \cdots + v_{1k}\sigma_k u_k.$$

In other words, $u_1$ (the first column of $U$) is the "strongest" pattern occurring in $A$, and its strength is measured by $\sigma_1$.
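
We can check this directly: rebuild $a_1$ from just the first few $\sigma_i u_i$ patterns, weighted by the first row of $V$, and compare with the actual column (a sketch reusing Atraf; the choice $k = 5$ is just an illustrative value near the elbow).

In [ ]:
u, s, vt = np.linalg.svd(Atraf, full_matrices=False)

k = 5                                        # a few of the strongest patterns
a1 = Atraf.values[:, 0]                      # the actual first column of A
a1_approx = u[:, :k] @ (s[:k] * vt[:k, 0])   # v_11 sigma_1 u_1 + ... + v_1k sigma_k u_k

print(np.linalg.norm(a1 - a1_approx) / np.linalg.norm(a1))   # small relative error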

Here is a view of the first 2 columns of $U\Sigma$ for the traffic matrix data.

These are the strongest patterns occurring across all of the 121 traces.

In [12]:
#
u, s, vt = np.linalg.svd(Atraf, full_matrices = False)
uframe = pd.DataFrame(u @ np.diag(s), index=pd.date_range('9/1/2003', freq = '10min', periods = 1008))
uframe[0].plot()
uframe[1].plot()
plt.title(r'First Two Columns of $U\Sigma$');

Low Rank Defines Latent Factors¶

The next interpretation of low-rank behavior is that it exposes "latent factors" that describe the data.

Returning to the low-rank decomposition:

$$A \approx U'\Sigma'(V')^T$$

In this interpretation, we think of each element of $A$ as the inner product of a row of $U'\Sigma'$ and a row of $V'$.

Let's say we are working with a matrix of users and items.

In particular, let the items be movies and matrix entries be ratings, as in the Netflix prize.

Recall the structure from a previous slide:

$$
\text{users}\left\{\;\overbrace{\begin{bmatrix} \vdots & \vdots & & \vdots \\ a_1 & a_2 & \cdots & a_n \\ \vdots & \vdots & & \vdots \end{bmatrix}}^{\text{movies}}\right.
=
\overbrace{\begin{bmatrix} \vdots & & \vdots \\ \sigma_1 u_1 & \cdots & \sigma_k u_k \\ \vdots & & \vdots \end{bmatrix}}^{k}
\times
\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_k & \cdots \end{bmatrix}
$$

Then the rating that a user gives a movie is the inner product of a $k$-element vector that corresponds to the user, and a $k$-element vector that corresponds to the movie.

In other words:

$$a_{ij} = u_i^T v_j$$

We can therefore think of user $i$'s preferences as being captured by $u_i$, i.e., a point in $\mathbb{R}^k$.

We have described everything we need to know to predict user $i$'s ratings via a $k$-element vector.

The $k$-element vector is called a latent factor.

Likewise, we can think of $v_j$ as a "description" of movie $j$ (another latent factor).
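
A small sketch of this inner-product view, using a tiny hypothetical ratings matrix rather than the real Netflix data: each rating is approximated by the dot product of the user's latent vector (a row of $U'\Sigma'$) and the movie's latent vector (a row of $V'$).

In [ ]:
import numpy as np

# tiny hypothetical ratings matrix: 4 users x 5 movies
A = np.array([[5., 4., 1., 1., 2.],
              [4., 5., 1., 2., 1.],
              [1., 1., 5., 4., 5.],
              [2., 1., 4., 5., 4.]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
user_factors  = U[:, :k] * s[:k]      # rows of U' Sigma': one k-vector per user
movie_factors = Vt[:k, :].T           # rows of V': one k-vector per movie

i, j = 0, 2                           # user 0, movie 2
print(user_factors[i] @ movie_factors[j])    # predicted rating a_ij
print(A[i, j])                               # actual rating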

The value in using latent factors comes from the summarization of user preferences, and the predictive power one obtains.

For example, the winning entry in the Netflix prize competition modeled user preferences with a 20-element latent factor.

The remarkable thing is that a person's preferences for all 18,000 movies can be reasonably well captured in a 20-element vector!

Here is a figure from the paper that described the winning strategy in the Netflix prize.

It shows a hypothetical latent space in which each user, and each movie, is represented by a latent vector.

In [19]:
#
display(Image("images/19-movie.png", width=500))

Source: Koren et al, IEEE Computer, 2009

In practice, this is perhaps a 20- or 40-dimensional space.

Here are some representations of movies in that space (reduced to 2-D).

Notice how the space seems to capture similarity among movies!

In [20]:
#
display(Image("images/19-netflix.png", width=500))

Source: Koren et al, IEEE Computer, 2009

In summary:

  • When we are working with data matrices, it is valuable to consider the effective rank.
  • Many (many) datasets in real life show low effective rank.
  • This property can be explored precisely using the Singular Value Decomposition of the matrix.
  • When low effective rank is present,
    • the matrix can be compressed with only small loss of accuracy
    • we can extract the "strongest" patterns in the data
    • we can describe each data item in terms of the inner product of latent factors.