#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
Definition. A quadratic form is a function of $n$ variables, e.g., $Q(x_1, x_2) = 3x_1^2 + 7x_2^2$, in which every term has degree two.
Every quadratic form can be expressed as $Q(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$, where $A$ is a symmetric matrix. Note that the result of this expression is a scalar.
A quadratic form $\mathbf{x}^T A \mathbf{x}$ is called: | if: | which happens when the eigenvalues of $A$ are: |
---|---|---|
positive definite | $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$ | all positive |
positive semidefinite | $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$ | all positive or 0 |
indefinite | $\mathbf{x}^T A \mathbf{x}$ can be positive or negative | both positive and negative |
negative definite | $\mathbf{x}^T A \mathbf{x} < 0$ for all $\mathbf{x} \neq \mathbf{0}$ | all negative |
negative semidefinite | $\mathbf{x}^T A \mathbf{x} \leq 0$ for all $\mathbf{x}$ | all negative or 0 |
If $Q(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ is positive (semi)definite or negative (semi)definite, then we will refer to $A$ as a positive (semi)definite or negative (semi)definite matrix.
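Here is a minimal sketch of how one might check this classification numerically with numpy (the helper name `classify_symmetric` and the test matrices are just for illustration):
#
import numpy as np

def classify_symmetric(A, tol=1e-12):
    # for a symmetric matrix, eigvalsh returns its (real) eigenvalues in ascending order
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return 'positive definite'
    if np.all(lam >= -tol):
        return 'positive semidefinite'
    if np.all(lam < -tol):
        return 'negative definite'
    if np.all(lam <= tol):
        return 'negative semidefinite'
    return 'indefinite'

print(classify_symmetric(np.array([[3., 0.], [0., 7.]])))   # positive definite
print(classify_symmetric(np.array([[1., 0.], [0., -2.]])))  # indefinite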
A common kind of optimization is to find the maximum or the minimum value of a quadratic form $\mathbf{x}^T A \mathbf{x}$ for $\mathbf{x}$ in some specified set.
This is called constrained optimization.
For example, a common constraint is that $\mathbf{x}$ varies over the set of unit vectors.
#
fig = ut.three_d_figure((15, 1),
'Intersection of the positive definite quadratic form z = 3 x1^2 + 7 x2 ^2 with the constraint ||x|| = 1',
-2, 2, -2, 2, 0, 8,
equalAxes = False, figsize = (7, 7), qr = qr_setting)
qf = np.array([[3., 0.],[0., 7.]])
for angle in np.linspace(0, 2*np.pi, 200):
    x = np.array([np.cos(angle), np.sin(angle)])
    z = x.T @ qf @ x
    fig.plotPoint(x[0], x[1], z, 'b')
    fig.plotPoint(x[0], x[1], 0, 'g')
fig.plotQF(qf, alpha=0.5)
fig.ax.set_zlabel('z')
fig.desc['zlabel'] = 'z'
# do not call fig.save here
Theorem. Let $A$ be a symmetric matrix, and let
$$m = \max \{\mathbf{x}^T A \mathbf{x} : \|\mathbf{x}\| = 1\}.$$
Then $m$ is the greatest eigenvalue $\lambda_1$ of $A$, and the maximum is attained at a corresponding unit eigenvector.
Similarly, the minimum value of $\mathbf{x}^T A \mathbf{x}$ over all unit vectors is equal to the smallest eigenvalue of $A$.
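As a sanity check of this theorem (a small sketch, reusing the $3x_1^2 + 7x_2^2$ form plotted above), we can sample unit vectors densely and compare the extreme values of the quadratic form with the eigenvalues:
#
import numpy as np

A = np.array([[3., 0.], [0., 7.]])
angles = np.linspace(0, 2 * np.pi, 1000)
# evaluate the quadratic form on a dense sample of unit vectors
vals = [np.array([np.cos(t), np.sin(t)]) @ A @ np.array([np.cos(t), np.sin(t)]) for t in angles]
print(max(vals), min(vals))     # approximately 7 and 3
print(np.linalg.eigvalsh(A))    # [3. 7.] -- the smallest and largest eigenvalues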
[This lecture is based on Prof. Crovella's CS 132 and CS 506 lecture notes.]
Today we will study the most useful decomposition in applied Linear Algebra.
Pretty exciting, eh?
The Singular Value Decomposition is the “Swiss Army Knife” and the “Rolls Royce” of matrix decompositions.
-- Diane O'Leary
#
display(Image("images/18-knife.jpg", width=350))
# image source https://bringatrailer.com/listing/1964-rolls-royce-james-young-phanton-v-limosine/
display(Image("images/18-rolls-royce.jpg", width=350))
Before I show you how this new matrix factorization works, I want to explain what it does and why it is useful to data science.
This new technique, called singular value decomposition or SVD, can approximate and simplify any dataset that is provided in matrix form.
Data Type | Rows | Columns | Elements |
---|---|---|---|
Network Traffic | Sources | Destinations | Number of bytes |
Social Media | Users | Time bins | Number of posts/tweets/likes |
Web Browsing | Users | Content categories | Visit counts/bytes downloaded |
Web Browsing | Users | Time bins | Visit counts/bytes downloaded |
Using SVD, we can "compress" a matrix of real-world messy data (exactly or approximately) into matrices of smaller dimension.
Notice that the first factor, $U$, contains a row for each object.
In a sense we have transformed objects from an $n$-dimensional space to a $k$-dimensional space, where $k$ is (probably much) smaller than $n$.
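Here is a minimal sketch of this idea on synthetic data (the matrix below is made up purely for illustration); we use numpy's `np.linalg.svd` and keep only the top $k$ components:
#
import numpy as np

rng = np.random.default_rng(0)
# synthetic "objects x features" data that is close to rank 3, plus a little noise
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(100, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation of A
coords = U[:, :k] * s[:k]                            # each object described by just k numbers
print(coords.shape)                                  # (100, 3): 100 objects in a 3-dimensional space
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # small relative reconstruction error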
Data science models, at their core, are meant to be a simplification of the data.
In particular, instead of thinking of the data as thousands or millions of individual data points, we think of it as being "close" to a best-fit linear function, a small number of clusters, a parametric distribution, etc, etc.
From this simpler description, we hope to gain insight.
There is an interesting question here: why does this process often lead to insight?
That is, why does it happen so often that a large dataset can be described in terms of a much simpler model?
This is because often "Occam's razor" can be applied:
Among competing hypotheses, the one with the fewest assumptions should be selected.
In other words, the world is full of simple (but often hidden) patterns, and it is more common for a set of observations to be determined by a simple process than a complex process.
From this, one can justify the observation that "modeling works surprisingly often."
SVD applies to any matrix $A$, no matter its size, (a)symmetry, invertibility, or diagonalizability.
Recall the geometric view of an $m \times n$ matrix $A$: it corresponds to a linear transformation that maps vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$.
SVD says that the linear transformation corresponding to $A$ can be decomposed into a sequence of three steps:
# Source: Wikipedia
display(Image("images/18-svd-geometric.png", width=350))
1. A rotation (or reflection) within $\mathbb{R}^n$.
2. A scaling along the coordinate axes.
3. A rotation within $\mathbb{R}^m$.

Note that step 3 is not the inverse of step 1: the rotation might be by a different angle, across a different axis, and in a space of different dimension.
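We can see these three steps numerically. Below is a sketch with an arbitrary small matrix, using the reduced form returned by `np.linalg.svd(..., full_matrices=False)`; applying $V^T$, then the scaling, then $U$ reproduces $A\mathbf{x}$:
#
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))            # maps R^2 to R^3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = np.array([0.6, -0.8])              # a unit vector in R^2
step1 = Vt @ x                         # 1. rotate/reflect within R^2
step2 = s * step1                      # 2. scale along the coordinate axes
step3 = U @ step2                      # 3. rotate into R^3 (U has orthonormal columns in this reduced form)
print(np.allclose(step3, A @ x))       # True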
Theorem. Let $A$ be an $m \times n$ matrix with rank $r$. Then there exists an $m \times n$ matrix $\Sigma$ whose diagonal entries are the first $r$ singular values of $A$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, and there exists an $m \times m$ orthogonal matrix $U$ and an $n \times n$ orthogonal matrix $V$ such that
$$A = U \Sigma V^T.$$
Any factorization $A = U \Sigma V^T$, with $U$ and $V$ orthogonal and $\Sigma$ a diagonal matrix, is called a singular value decomposition (SVD) of $A$.
The columns of $U$ are called the left singular vectors and the columns of $V$ are called the right singular vectors of $A$.
Consider the diagonalization $A = PDP^T$ of a symmetric positive-definite matrix $A$ (here $P$ is orthogonal). If we set $U = V = P$ and $\Sigma = D$,
then the singular value decomposition of $A$ is just its eigendecomposition.
SVD is a generalization of diagonalization.
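A quick numerical check of this special case (a sketch with an arbitrarily chosen symmetric positive-definite matrix): the singular values coincide with the eigenvalues.
#
import numpy as np

S = np.array([[2., 1.], [1., 3.]])              # symmetric positive definite
print(np.linalg.svd(S, compute_uv=False))       # singular values, largest first
print(np.linalg.eigvalsh(S)[::-1])              # eigenvalues, largest first -- the same numbers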
Remember our geometric interpretation of the diagonalization of a square matrix $A = PDP^{-1}$:
1. Compute the coordinates of $\mathbf{x}$ in the eigenvector basis (the columns of $P$), i.e., multiply by $P^{-1}$.
2. Scale those coordinates according to the diagonal matrix $D$.
3. Find the point that has those scaled coordinates in the eigenvector basis, i.e., multiply by $P$.
SVD lets us do a similar decomposition for any arbitrary matrix $A$!
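First, here is a small numerical check of the three diagonalization steps above (a sketch; the matrices P and D are chosen to match the $2 \times 2$ example used later in this lecture):
#
import numpy as np

# a diagonalizable (but not symmetric) matrix A = P D P^{-1}
P = np.array([[2., 1.], [0.1, 1.]])   # columns are eigenvectors
D = np.diag([1.2, 0.7])               # eigenvalues on the diagonal
A = P @ D @ np.linalg.inv(P)

x = np.array([1., 0.5])
coords = np.linalg.solve(P, x)        # 1. coordinates of x in the eigenvector basis
scaled = D @ coords                   # 2. scale those coordinates by the eigenvalues
result = P @ scaled                   # 3. back to standard coordinates
print(np.allclose(result, A @ x))     # True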
Today we'll work through what each of the components of this decomposition are.
In our diagonalization $A = PDP^{-1}$, the diagonal entries of $D$ are the eigenvalues of $A$.
We will see that this does not work for an arbitrary $A$.
Instead, SVD can be understood through a very simple question:
Among all unit vectors $\mathbf{x}$, what is the vector that maximizes $\|A\mathbf{x}\|$?
In other words, in which direction does create the largest output vector from a unit input?
To set the stage to answer this question, let's review a few facts.
You recall that the eigenvalues of a square matrix $A$ measure the amount that $A$ "stretches or shrinks" certain special vectors (the eigenvectors).
For example, for a square $A$, if $A\mathbf{x} = \lambda\mathbf{x}$ and $\|\mathbf{x}\| = 1$, then
$$\|A\mathbf{x}\| = \|\lambda\mathbf{x}\| = |\lambda|\,\|\mathbf{x}\| = |\lambda|.$$
Here is one such example, for a $2 \times 2$ matrix $A$:
#
V = np.array([[2,1],[.1,1]])
L = np.array([[1.2, 0], [0, 0.7]])
A = V @ L @ np.linalg.inv(V)
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
for th in theta_step:
    x = np.array([np.sin(th), np.cos(th)])
    ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [0.3* V[0][0], 0.3*V[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [0.3* V[0][1], 0.3*V[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ax.set_title(r'Eigenvectors of $A$ and the image of the unit circle under $A$');
The largest value of $\|A\mathbf{x}\|$ is the length of the long axis of the ellipse. Clearly there is some $\mathbf{x}$ that is mapped to that point by $A$. That $\mathbf{x}$ is what we want to find.
And let's make clear that we can apply this idea to arbitrary (non-square) matrices.
Here is an example that shows that we can still ask the question of what unit $\mathbf{x}$ maximizes $\|A\mathbf{x}\|$ even when $A$ is not square.
Consider the matrix
$$A = \begin{bmatrix} 4 & 11 & 14 \\ 8 & 7 & -2 \end{bmatrix}.$$
The linear transformation $\mathbf{x} \mapsto A\mathbf{x}$ maps the unit sphere $\{\mathbf{x} : \|\mathbf{x}\| = 1\}$ in $\mathbb{R}^3$ onto an ellipse in $\mathbb{R}^2$, as shown here:
#
display(Image("images/18-Lay-fig-7-4-1.jpg", width=650))
Now, here is a way to answer our question:
Problem. Find the unit vector $\mathbf{x}$ at which the length $\|A\mathbf{x}\|$ is maximized, and compute this maximum length.
Solution. The quantity $\|A\mathbf{x}\|^2$ is maximized at the same $\mathbf{x}$ that maximizes $\|A\mathbf{x}\|$, and $\|A\mathbf{x}\|^2$ is easier to study.
So let's look for the unit vector $\mathbf{x}$ at which $\|A\mathbf{x}\|^2$ is maximized.
Observe that
$$\|A\mathbf{x}\|^2 = (A\mathbf{x})^T(A\mathbf{x}) = \mathbf{x}^T A^T A \mathbf{x} = \mathbf{x}^T (A^TA) \mathbf{x}.$$
Now, $A^TA$ is a symmetric matrix, since $(A^TA)^T = A^T(A^T)^T = A^TA$.
So we see that $\|A\mathbf{x}\|^2 = \mathbf{x}^T(A^TA)\mathbf{x}$ is a quadratic form!
... and we are seeking to maximize it subject to the constraint $\|\mathbf{x}\| = 1$.
Recall from last time that we solved precisely this kind of constrained optimization problem:
the maximum value of a quadratic form, subject to the constraint that $\|\mathbf{x}\| = 1$, is the largest eigenvalue of the corresponding symmetric matrix.
So the maximum value of $\mathbf{x}^T(A^TA)\mathbf{x}$ subject to $\|\mathbf{x}\| = 1$ is $\lambda_1$, the largest eigenvalue of $A^TA$.
Also, the maximum is attained at a unit eigenvector of $A^TA$ corresponding to $\lambda_1$.
For the matrix $A$ in the $2 \times 3$ example above,
$$A^TA = \begin{bmatrix} 80 & 100 & 40 \\ 100 & 170 & 140 \\ 40 & 140 & 200 \end{bmatrix}.$$
The eigenvalues of $A^TA$ are $\lambda_1 = 360$, $\lambda_2 = 90$, and $\lambda_3 = 0$.
The corresponding unit eigenvectors are, respectively,
$$\mathbf{v}_1 = \begin{bmatrix} 1/3 \\ 2/3 \\ 2/3 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} -2/3 \\ -1/3 \\ 2/3 \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix} 2/3 \\ -2/3 \\ 1/3 \end{bmatrix}.$$
For $\|\mathbf{x}\| = 1$, the maximum value of $\|A\mathbf{x}\|^2$ is $\lambda_1 = 360$, attained at $A^TA$'s eigenvector $\mathbf{v}_1$; the maximum length $\|A\mathbf{x}\|$ is therefore $\sqrt{360}$.
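We can confirm these values numerically (a quick check, assuming the $2 \times 3$ matrix shown above):
#
import numpy as np

A = np.array([[4., 11., 14.],
              [8.,  7., -2.]])
lam, V = np.linalg.eigh(A.T @ A)       # ascending eigenvalues and unit eigenvectors of A^T A
print(np.round(lam[::-1], 6))          # [360.  90.   0.]
print(np.round(V[:, ::-1], 4))         # v1, v2, v3 as columns (possibly with signs flipped)
print(np.linalg.norm(A @ V[:, -1]))    # sqrt(360) ~ 18.97, the maximum of ||Ax|| over unit x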
This example shows that the key to understanding the effect of $A$ on the unit sphere in $\mathbb{R}^n$ is to examine the quadratic form $\mathbf{x}^T(A^TA)\mathbf{x}$.
We can also go back to our $2 \times 2$ example.
Let's plot the eigenvectors of $A^TA$ for that matrix.
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
#for th in theta_step:
# x = np.array([np.sin(th), np.cos(th)])
# ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [v[0][0], v[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [v[0][1], v[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [s[0]*u[0][0], s[0]*u[1][0]], [v[0][0], v[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ut.plotArrowVec(ax, [s[1]*u[0][1], s[1]*u[1][1]], [v[0][1], v[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ax.set_title(r'Eigenvectors of $A^TA$ and their images under $A$');
We see that the eigenvector corresponding to the largest eigenvalue of $A^TA$ indeed shows us where $\|A\mathbf{x}\|$ is maximized -- where the ellipse is longest.
Also, the other eigenvector of $A^TA$ shows us where the ellipse is narrowest.
In fact, the entire geometric behavior of the transformation $\mathbf{x} \mapsto A\mathbf{x}$ is captured by the quadratic form $\mathbf{x}^T(A^TA)\mathbf{x}$.
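We can verify this numerically for the $2 \times 2$ example (a sketch that rebuilds the same A used in the plots above):
#
import numpy as np

P = np.array([[2., 1.], [0.1, 1.]])
A = P @ np.diag([1.2, 0.7]) @ np.linalg.inv(P)   # the same 2x2 A as above

lam, W = np.linalg.eigh(A.T @ A)                 # ascending eigenvalues of A^T A
angles = np.linspace(0, 2 * np.pi, 2000)
lengths = [np.linalg.norm(A @ np.array([np.cos(t), np.sin(t)])) for t in angles]
print(max(lengths), np.sqrt(lam[-1]))            # both ~ the length of the ellipse's long semi-axis
print(min(lengths), np.sqrt(lam[0]))             # both ~ the short semi-axis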
Let's continue to consider $A$ to be an arbitrary $m \times n$ matrix.
Notice that even though $A$ is not square in general, $A^TA$ is square and symmetric.
So, there is a lot we can say about $A^TA$.
In particular, since $A^TA$ is symmetric, it can be orthogonally diagonalized.
So let $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ be an orthonormal basis for $\mathbb{R}^n$ consisting of eigenvectors of $A^TA$, and let $\lambda_1, \dots, \lambda_n$ be the corresponding eigenvalues of $A^TA$.
Then, for any eigenvector $\mathbf{v}_i$,
$$\|A\mathbf{v}_i\|^2 = (A\mathbf{v}_i)^T A\mathbf{v}_i = \mathbf{v}_i^T A^TA \mathbf{v}_i$$
$$= \mathbf{v}_i^T (\lambda_i \mathbf{v}_i) \quad \text{(since } \mathbf{v}_i \text{ is an eigenvector of } A^TA\text{)}$$
$$= \lambda_i \quad \text{(since } \mathbf{v}_i \text{ is a unit vector).}$$
Now any expression of the form $\|A\mathbf{v}_i\|^2$ is nonnegative.
So the eigenvalues of $A^TA$ are all nonnegative.
That is: $A^TA$ is positive semidefinite.
We can therefore renumber the eigenvalues so that
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0.$$
Definition. The singular values of $A$ are the square roots of the eigenvalues of $A^TA$. They are denoted by $\sigma_1, \dots, \sigma_n$, and they are arranged in decreasing order.
That is, $\sigma_i = \sqrt{\lambda_i}$ for $1 \leq i \leq n$.
By the above argument, the singular values of $A$ are the lengths of the vectors $A\mathbf{v}_1, \dots, A\mathbf{v}_n$, where $\mathbf{v}_1, \dots, \mathbf{v}_n$ are the eigenvectors of $A^TA$, normalized to unit length.
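A small numerical check of this fact (a sketch, using a random matrix):
#
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))

lam, V = np.linalg.eigh(A.T @ A)               # ascending eigenvalues, unit eigenvectors as columns
lam, V = lam[::-1], V[:, ::-1]                 # reorder so lambda_1 >= lambda_2 >= ...
print(np.sqrt(np.maximum(lam, 0)))             # sqrt of the eigenvalues of A^T A ...
print(np.linalg.norm(A @ V, axis=0))           # ... equal the lengths ||A v_i|| ...
print(np.linalg.svd(A, compute_uv=False))      # ... and equal numpy's singular values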
Now: we know that the vectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ are an orthogonal set because they are eigenvectors of the symmetric matrix $A^TA$.
However, it's also the case that $A\mathbf{v}_1, \dots, A\mathbf{v}_n$ are an orthogonal set.
This fact is key to the SVD.
Another way to look at this is to consider the set $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$, which by definition is orthonormal; the observation above says that its image $\{A\mathbf{v}_1, \dots, A\mathbf{v}_n\}$ is still orthogonal. In other words, we've found a set of vectors that, when transformed with $A$, preserves orthogonality.
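Here is a quick numerical illustration of this key fact (a sketch, random matrix): the Gram matrix of the vectors $A\mathbf{v}_i$ is diagonal, so they are mutually orthogonal.
#
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))

_, V = np.linalg.eigh(A.T @ A)        # columns of V: orthonormal eigenvectors of A^T A
AV = A @ V                            # the transformed vectors A v_i (as columns)
print(np.round(AV.T @ AV, 8))         # off-diagonal entries are 0: the A v_i are mutually orthogonal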
Theorem. Suppose $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is an orthonormal basis of $\mathbb{R}^n$ consisting of eigenvectors of $A^TA$, arranged so that the corresponding eigenvalues of $A^TA$ satisfy $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and suppose $A$ has $r$ nonzero singular values.
Then $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col} A$, and rank $A = r$.
Note how surprising this is: while $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is a basis for $\mathbb{R}^n$, $\operatorname{Col} A$ is a subspace of $\mathbb{R}^m$.
Nonetheless, $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col} A$.
Proof. What we need to do is establish that $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal, linearly independent set whose span is $\operatorname{Col} A$.
Because $\mathbf{v}_i$ and $\mathbf{v}_j$ are orthogonal for $i \neq j$,
$$(A\mathbf{v}_i)^T(A\mathbf{v}_j) = \mathbf{v}_i^T A^TA \mathbf{v}_j = \mathbf{v}_i^T (\lambda_j \mathbf{v}_j) = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0.$$
So $\{A\mathbf{v}_1, \dots, A\mathbf{v}_n\}$ is an orthogonal set.
Furthermore, since the lengths of the vectors $A\mathbf{v}_1, \dots, A\mathbf{v}_n$ are the singular values of $A$, and since there are $r$ nonzero singular values, $A\mathbf{v}_i \neq \mathbf{0}$ if and only if $1 \leq i \leq r$.
So $A\mathbf{v}_1, \dots, A\mathbf{v}_r$ are a linearly independent set (because they are orthogonal and all nonzero), and clearly they are each in $\operatorname{Col} A$.
Finally, we just need to show that $\operatorname{Span}\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\} = \operatorname{Col} A$.
To do this we'll show that for any $\mathbf{y}$ in $\operatorname{Col} A$, we can write $\mathbf{y}$ in terms of $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$:
Say $\mathbf{y} = A\mathbf{x}$.
Because $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is a basis for $\mathbb{R}^n$, we can write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$, so
$$\mathbf{y} = A\mathbf{x} = c_1 A\mathbf{v}_1 + \cdots + c_r A\mathbf{v}_r + \cdots + c_n A\mathbf{v}_n = c_1 A\mathbf{v}_1 + \cdots + c_r A\mathbf{v}_r$$
(because $A\mathbf{v}_i = \mathbf{0}$ for $i > r$).
In summary: $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an (orthogonal) linearly independent set whose span is $\operatorname{Col} A$, so it is an (orthogonal) basis for $\operatorname{Col} A$.
Notice that we have also proved that rank $A = \dim \operatorname{Col} A = r$.
In other words, if $A$ has $r$ nonzero singular values, $A$ has rank $r$.
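As a last numerical check (a sketch with a synthetic rank-2 matrix), the number of nonzero singular values matches the rank that numpy reports:
#
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))   # a 6x4 matrix of rank 2 by construction
print(np.round(np.linalg.svd(A, compute_uv=False), 6))  # two nonzero singular values, the rest ~ 0
print(np.linalg.matrix_rank(A))                         # 2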