#
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from matplotlib import animation
from importlib import reload
from datetime import datetime
from IPython.display import Image, display_html, display, Math, HTML;
qr_setting = None
mp.rcParams['animation.html'] = 'jshtml';
Definition. A quadratic form is a function of $n$ variables, e.g., $Q(x_1, x_2) = 3x_1^2 + 7x_2^2$, in which every term has degree two.
Every quadratic form can be expressed as $Q(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$, where $A$ is a symmetric matrix. Note that the result of this expression is a scalar.
A quadratic form $\mathbf{x}^T A \mathbf{x}$ is called: | if: | which happens when the eigenvalues of $A$ are: |
---|---|---|
positive definite | $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$ | all positive |
positive semidefinite | $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$ | all positive or 0 |
indefinite | $\mathbf{x}^T A \mathbf{x}$ can be positive or negative | both positive and negative |
negative definite | $\mathbf{x}^T A \mathbf{x} < 0$ for all $\mathbf{x} \neq \mathbf{0}$ | all negative |
negative semidefinite | $\mathbf{x}^T A \mathbf{x} \leq 0$ for all $\mathbf{x}$ | all negative or 0 |
If $Q(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ is positive (semi)definite or negative (semi)definite, then we will refer to $A$ as a positive (semi)definite or negative (semi)definite matrix.
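Here is a minimal sketch of how one might check this classification numerically with numpy (the helper name `classify_symmetric` and the test matrices are just for illustration):
#
import numpy as np

def classify_symmetric(A, tol=1e-12):
    # for a symmetric matrix, eigvalsh returns its (real) eigenvalues in ascending order
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return 'positive definite'
    if np.all(lam >= -tol):
        return 'positive semidefinite'
    if np.all(lam < -tol):
        return 'negative definite'
    if np.all(lam <= tol):
        return 'negative semidefinite'
    return 'indefinite'

print(classify_symmetric(np.array([[3., 0.], [0., 7.]])))   # positive definite
print(classify_symmetric(np.array([[1., 0.], [0., -2.]])))  # indefinite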
A common kind of optimization is to find the maximum or the minimum value of a quadratic form $\mathbf{x}^T A \mathbf{x}$ for $\mathbf{x}$ in some specified set.
This is called constrained optimization.
For example, a common constraint is that $\mathbf{x}$ varies over the set of unit vectors.
#
fig = ut.three_d_figure((15, 1),
'Intersection of the positive definite quadratic form z = 3 x1^2 + 7 x2 ^2 with the constraint ||x|| = 1',
-2, 2, -2, 2, 0, 8,
equalAxes = False, figsize = (7, 7), qr = qr_setting)
qf = np.array([[3., 0.],[0., 7.]])
for angle in np.linspace(0, 2*np.pi, 200):
    x = np.array([np.cos(angle), np.sin(angle)])
    z = x.T @ qf @ x
    fig.plotPoint(x[0], x[1], z, 'b')
    fig.plotPoint(x[0], x[1], 0, 'g')
fig.plotQF(qf, alpha=0.5)
fig.ax.set_zlabel('z')
fig.desc['zlabel'] = 'z'
# do not call fig.save here
Theorem. Let $A$ be a symmetric matrix, and let
$$m = \max \{\mathbf{x}^T A \mathbf{x} : \|\mathbf{x}\| = 1\}.$$
Then $m$ is the greatest eigenvalue $\lambda_1$ of $A$, and the maximum is attained at a corresponding unit eigenvector.
Similarly, the minimum value of $\mathbf{x}^T A \mathbf{x}$ over all unit vectors is equal to the smallest eigenvalue of $A$.
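As a sanity check of this theorem (a small sketch, reusing the $3x_1^2 + 7x_2^2$ form plotted above), we can sample unit vectors densely and compare the extreme values of the quadratic form with the eigenvalues:
#
import numpy as np

A = np.array([[3., 0.], [0., 7.]])
angles = np.linspace(0, 2 * np.pi, 1000)
# evaluate the quadratic form on a dense sample of unit vectors
vals = [np.array([np.cos(t), np.sin(t)]) @ A @ np.array([np.cos(t), np.sin(t)]) for t in angles]
print(max(vals), min(vals))     # approximately 7 and 3
print(np.linalg.eigvalsh(A))    # [3. 7.] -- the smallest and largest eigenvalues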
[This lecture is based on Prof. Crovella's CS 132 and CS 506 lecture notes.]
Today we will study the most useful decomposition in applied Linear Algebra.
Pretty exciting, eh?
The Singular Value Decomposition is the “Swiss Army Knife” and the “Rolls Royce” of matrix decompositions.
-- Diane O'Leary
#
display(Image("images/18-knife.jpg", width=350))
# image source https://bringatrailer.com/listing/1964-rolls-royce-james-young-phanton-v-limosine/
display(Image("images/18-rolls-royce.jpg", width=350))
Before I show you how this new matrix factorization works, I want to explain what it does and why it is useful to data science.
This new technique, called singular value decomposition or SVD, can approximate and simplify any dataset that is provided in matrix form.
Data Type | Rows | Columns | Elements |
---|---|---|---|
Network Traffic | Sources | Destinations | Number of bytes |
Social Media | Users | Time bins | Number of posts/tweets/likes |
Web Browsing | Users | Content categories | Visit counts/bytes downloaded |
Web Browsing | Users | Time bins | Visit counts/bytes downloaded |
Using SVD, we can "compress" a matrix of real-world messy data (exactly or approximately) into matrices of smaller dimension.
Notice that the first factor, $U$, contains a row for each object.
In a sense we have transformed objects from an $n$-dimensional space to a $k$-dimensional space, where $k$ is (probably much) smaller than $n$.
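Here is a minimal sketch of this idea on synthetic data (the matrix below is made up purely for illustration); we use numpy's `np.linalg.svd` and keep only the top $k$ components:
#
import numpy as np

rng = np.random.default_rng(0)
# synthetic "objects x features" data that is close to rank 3, plus a little noise
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(100, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation of A
coords = U[:, :k] * s[:k]                            # each object described by just k numbers
print(coords.shape)                                  # (100, 3): 100 objects in a 3-dimensional space
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # small relative reconstruction error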
Data science models, at their core, are meant to be a simplification of the data.
In particular, instead of thinking of the data as thousands or millions of individual data points, we think of it as being "close" to a best-fit linear function, a small number of clusters, a parametric distribution, etc, etc.
From this simpler description, we hope to gain insight.
There is an interesting question here: why does this process often lead to insight?
That is, why does it happen so often that a large dataset can be described in terms of a much simpler model?
This is because often "Occam's razor" can be applied:
Among competing hypotheses, the one with the fewest assumptions should be selected.
In other words, the world is full of simple (but often hidden) patterns, and it is more common for a set of observations to be determined by a simple process than a complex process.
From this, one can justify the observation that "modeling works surprisingly often."
SVD applies to any matrix $A$, no matter its size, (a)symmetry, invertibility, or diagonalizability.
Recall the geometric view of an $m \times n$ matrix $A$: it corresponds to a linear transformation that maps vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$.
SVD says that the linear transformation corresponding to $A$ can be decomposed into a sequence of three steps:
# Source: Wikipedia
display(Image("images/18-svd-geometric.png", width=350))
1. A rotation (or reflection) within $\mathbb{R}^n$.
2. A scaling along the coordinate axes.
3. A rotation within $\mathbb{R}^m$.

Note that step 3 is not the inverse of step 1: the rotation might be by a different angle, across a different axis, and in a space of different dimension.
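We can see these three steps numerically. Below is a sketch with an arbitrary small matrix, using the reduced form returned by `np.linalg.svd(..., full_matrices=False)`; applying $V^T$, then the scaling, then $U$ reproduces $A\mathbf{x}$:
#
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))            # maps R^2 to R^3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = np.array([0.6, -0.8])              # a unit vector in R^2
step1 = Vt @ x                         # 1. rotate/reflect within R^2
step2 = s * step1                      # 2. scale along the coordinate axes
step3 = U @ step2                      # 3. rotate into R^3 (U has orthonormal columns in this reduced form)
print(np.allclose(step3, A @ x))       # True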
Theorem. Let $A$ be an $m \times n$ matrix with rank $r$. Then there exists an $m \times n$ matrix $\Sigma$ whose diagonal entries are the first $r$ singular values of $A$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, and there exists an $m \times m$ orthogonal matrix $U$ and an $n \times n$ orthogonal matrix $V$ such that
$$A = U \Sigma V^T.$$
Any factorization $A = U \Sigma V^T$, with $U$ and $V$ orthogonal and $\Sigma$ a diagonal matrix, is called a singular value decomposition (SVD) of $A$.
The columns of $U$ are called the left singular vectors and the columns of $V$ are called the right singular vectors of $A$.
Consider the diagonalization $A = PDP^T$ of a symmetric positive-definite matrix $A$ (here $P$ is orthogonal). If we set $U = V = P$ and $\Sigma = D$,
then the singular value decomposition of $A$ is just its eigendecomposition.
SVD is a generalization of diagonalization.
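A quick numerical check of this special case (a sketch with an arbitrarily chosen symmetric positive-definite matrix): the singular values coincide with the eigenvalues.
#
import numpy as np

S = np.array([[2., 1.], [1., 3.]])              # symmetric positive definite
print(np.linalg.svd(S, compute_uv=False))       # singular values, largest first
print(np.linalg.eigvalsh(S)[::-1])              # eigenvalues, largest first -- the same numbers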
Remember our geometric interpretation of the diagonalization of a square matrix $A = PDP^{-1}$:
1. Compute the coordinates of $\mathbf{x}$ in the eigenvector basis (the columns of $P$), i.e., multiply by $P^{-1}$.
2. Scale those coordinates according to the diagonal matrix $D$.
3. Find the point that has those scaled coordinates in the eigenvector basis, i.e., multiply by $P$.
SVD lets us do a similar decomposition for any arbitrary matrix $A$!
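First, here is a small numerical check of the three diagonalization steps above (a sketch; the matrices P and D are chosen to match the $2 \times 2$ example used later in this lecture):
#
import numpy as np

# a diagonalizable (but not symmetric) matrix A = P D P^{-1}
P = np.array([[2., 1.], [0.1, 1.]])   # columns are eigenvectors
D = np.diag([1.2, 0.7])               # eigenvalues on the diagonal
A = P @ D @ np.linalg.inv(P)

x = np.array([1., 0.5])
coords = np.linalg.solve(P, x)        # 1. coordinates of x in the eigenvector basis
scaled = D @ coords                   # 2. scale those coordinates by the eigenvalues
result = P @ scaled                   # 3. back to standard coordinates
print(np.allclose(result, A @ x))     # True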
Today we'll work through what each of the components of this decomposition are.
In our diagonalization $A = PDP^{-1}$, the diagonal entries of $D$ are the eigenvalues of $A$.
We will see that this does not work for an arbitrary $A$.
Instead, SVD can be understood through a very simple question:
Among all unit vectors $\mathbf{x}$, what is the vector that maximizes $\|A\mathbf{x}\|$?
In other words, in which direction does create the largest output vector from a unit input?
To set the stage to answer this question, let's review a few facts.
You recall that the eigenvalues of a square matrix $A$ measure the amount that $A$ "stretches or shrinks" certain special vectors (the eigenvectors).
For example, for a square $A$, if $A\mathbf{x} = \lambda\mathbf{x}$ and $\|\mathbf{x}\| = 1$, then
$$\|A\mathbf{x}\| = \|\lambda\mathbf{x}\| = |\lambda|\,\|\mathbf{x}\| = |\lambda|.$$
Here is one such example, for a $2 \times 2$ matrix $A$:
#
V = np.array([[2,1],[.1,1]])
L = np.array([[1.2, 0], [0, 0.7]])
A = V @ L @ np.linalg.inv(V)
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
for th in theta_step:
    x = np.array([np.sin(th), np.cos(th)])
    ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [0.3* V[0][0], 0.3*V[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [0.3* V[0][1], 0.3*V[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ax.set_title(r'Eigenvectors of $A$ and the image of the unit circle under $A$');
The largest value of $\|A\mathbf{x}\|$ is the length of the long axis of the ellipse. Clearly there is some $\mathbf{x}$ that is mapped to that point by $A$. That $\mathbf{x}$ is what we want to find.
And let's make clear that we can apply this idea to arbitrary (non-square) matrices.
Here is an example that shows that we can still ask the question of what unit $\mathbf{x}$ maximizes $\|A\mathbf{x}\|$ even when $A$ is not square.
Consider the matrix
$$A = \begin{bmatrix} 4 & 11 & 14 \\ 8 & 7 & -2 \end{bmatrix}.$$
The linear transformation $\mathbf{x} \mapsto A\mathbf{x}$ maps the unit sphere $\{\mathbf{x} : \|\mathbf{x}\| = 1\}$ in $\mathbb{R}^3$ onto an ellipse in $\mathbb{R}^2$, as shown here:
#
display(Image("images/18-Lay-fig-7-4-1.jpg", width=650))
Now, here is a way to answer our question:
Problem. Find the unit vector $\mathbf{x}$ at which the length $\|A\mathbf{x}\|$ is maximized, and compute this maximum length.
Solution. The quantity $\|A\mathbf{x}\|^2$ is maximized at the same $\mathbf{x}$ that maximizes $\|A\mathbf{x}\|$, and $\|A\mathbf{x}\|^2$ is easier to study.
So let's look for the unit vector $\mathbf{x}$ at which $\|A\mathbf{x}\|^2$ is maximized.
Observe that
$$\|A\mathbf{x}\|^2 = (A\mathbf{x})^T(A\mathbf{x}) = \mathbf{x}^T A^T A \mathbf{x} = \mathbf{x}^T (A^TA) \mathbf{x}.$$
Now, $A^TA$ is a symmetric matrix, since $(A^TA)^T = A^T(A^T)^T = A^TA$.
So we see that $\|A\mathbf{x}\|^2 = \mathbf{x}^T(A^TA)\mathbf{x}$ is a quadratic form!
... and we are seeking to maximize it subject to the constraint $\|\mathbf{x}\| = 1$.
Recall from last time that we solved precisely this kind of constrained optimization problem:
the maximum value of a quadratic form, subject to the constraint that $\|\mathbf{x}\| = 1$, is the largest eigenvalue of the corresponding symmetric matrix.
So the maximum value of $\mathbf{x}^T(A^TA)\mathbf{x}$ subject to $\|\mathbf{x}\| = 1$ is $\lambda_1$, the largest eigenvalue of $A^TA$.
Also, the maximum is attained at a unit eigenvector of $A^TA$ corresponding to $\lambda_1$.
For the matrix $A$ in the $2 \times 3$ example above,
$$A^TA = \begin{bmatrix} 80 & 100 & 40 \\ 100 & 170 & 140 \\ 40 & 140 & 200 \end{bmatrix}.$$
The eigenvalues of $A^TA$ are $\lambda_1 = 360$, $\lambda_2 = 90$, and $\lambda_3 = 0$.
The corresponding unit eigenvectors are, respectively,
$$\mathbf{v}_1 = \begin{bmatrix} 1/3 \\ 2/3 \\ 2/3 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} -2/3 \\ -1/3 \\ 2/3 \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix} 2/3 \\ -2/3 \\ 1/3 \end{bmatrix}.$$
For $\|\mathbf{x}\| = 1$, the maximum value of $\|A\mathbf{x}\|^2$ is $\lambda_1 = 360$, attained at $A^TA$'s eigenvector $\mathbf{v}_1$; the maximum length $\|A\mathbf{x}\|$ is therefore $\sqrt{360}$.
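We can confirm these values numerically (a quick check, assuming the $2 \times 3$ matrix shown above):
#
import numpy as np

A = np.array([[4., 11., 14.],
              [8.,  7., -2.]])
lam, V = np.linalg.eigh(A.T @ A)       # ascending eigenvalues and unit eigenvectors of A^T A
print(np.round(lam[::-1], 6))          # [360.  90.   0.]
print(np.round(V[:, ::-1], 4))         # v1, v2, v3 as columns (possibly with signs flipped)
print(np.linalg.norm(A @ V[:, -1]))    # sqrt(360) ~ 18.97, the maximum of ||Ax|| over unit x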
This example shows that the key to understanding the effect of $A$ on the unit sphere in $\mathbb{R}^n$ is to examine the quadratic form $\mathbf{x}^T(A^TA)\mathbf{x}$.
We can also go back to our $2 \times 2$ example.
Let's plot the eigenvectors of $A^TA$ for that matrix.
#
ax = dm.plotSetup(-1.5,1.5,-1.5, 1.5, size=(9,6))
ut.centerAxes(ax)
theta = [2 * np.pi * f for f in np.array(range(360))/360.0]
x = [np.array([np.sin(t), np.cos(t)]) for t in theta]
Ax = [A.dot(xv) for xv in x]
ax.plot([xv[0] for xv in x],[xv[1] for xv in x],'-b')
ax.plot([Axv[0] for Axv in Ax],[Axv[1] for Axv in Ax],'--r')
theta_step = np.linspace(0, 2*np.pi, 24)
#for th in theta_step:
# x = np.array([np.sin(th), np.cos(th)])
# ut.plotArrowVec(ax, A @ x, x, head_width=.04, head_length=.04, length_includes_head = True, color='g')
u, s, v = np.linalg.svd(A)
ut.plotArrowVec(ax, [v[0][0], v[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [v[0][1], v[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='Black')
ut.plotArrowVec(ax, [s[0]*u[0][0], s[0]*u[1][0]], [v[0][0], v[1][0]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ut.plotArrowVec(ax, [s[1]*u[0][1], s[1]*u[1][1]], [v[0][1], v[1][1]], head_width=.04, head_length=.04, length_includes_head = True, color='g')
ax.set_title(r'Eigenvectors of $A^TA$ and their images under $A$');
We see that the eigenvector corresponding to the largest eigenvalue of $A^TA$ indeed shows us where $\|A\mathbf{x}\|$ is maximized -- where the ellipse is longest.
Also, the other eigenvector of $A^TA$ shows us where the ellipse is narrowest.
In fact, the entire geometric behavior of the transformation $\mathbf{x} \mapsto A\mathbf{x}$ is captured by the quadratic form $\mathbf{x}^T(A^TA)\mathbf{x}$.
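We can verify this numerically for the $2 \times 2$ example (a sketch that rebuilds the same A used in the plots above):
#
import numpy as np

P = np.array([[2., 1.], [0.1, 1.]])
A = P @ np.diag([1.2, 0.7]) @ np.linalg.inv(P)   # the same 2x2 A as above

lam, W = np.linalg.eigh(A.T @ A)                 # ascending eigenvalues of A^T A
angles = np.linspace(0, 2 * np.pi, 2000)
lengths = [np.linalg.norm(A @ np.array([np.cos(t), np.sin(t)])) for t in angles]
print(max(lengths), np.sqrt(lam[-1]))            # both ~ the length of the ellipse's long semi-axis
print(min(lengths), np.sqrt(lam[0]))             # both ~ the short semi-axis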
Let's continue to consider $A$ to be an arbitrary $m \times n$ matrix.
Notice that even though $A$ is not square in general, $A^TA$ is square and symmetric.
So, there is a lot we can say about $A^TA$.
In particular, since $A^TA$ is symmetric, it can be orthogonally diagonalized.
So let $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ be an orthonormal basis for $\mathbb{R}^n$ consisting of eigenvectors of $A^TA$, and let $\lambda_1, \dots, \lambda_n$ be the corresponding eigenvalues of $A^TA$.
Then, for any eigenvector $\mathbf{v}_i$,
$$\|A\mathbf{v}_i\|^2 = (A\mathbf{v}_i)^T A\mathbf{v}_i = \mathbf{v}_i^T A^TA \mathbf{v}_i$$
$$= \mathbf{v}_i^T (\lambda_i \mathbf{v}_i) \quad \text{(since } \mathbf{v}_i \text{ is an eigenvector of } A^TA\text{)}$$
$$= \lambda_i \quad \text{(since } \mathbf{v}_i \text{ is a unit vector).}$$
Now any expression of the form $\|A\mathbf{v}_i\|^2$ is nonnegative.
So the eigenvalues of $A^TA$ are all nonnegative.
That is: $A^TA$ is positive semidefinite.
We can therefore renumber the eigenvalues so that
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0.$$
Definition. The singular values of $A$ are the square roots of the eigenvalues of $A^TA$. They are denoted by $\sigma_1, \dots, \sigma_n$, and they are arranged in decreasing order.
That is, $\sigma_i = \sqrt{\lambda_i}$ for $1 \leq i \leq n$.
By the above argument, the singular values of $A$ are the lengths of the vectors $A\mathbf{v}_1, \dots, A\mathbf{v}_n$, where $\mathbf{v}_1, \dots, \mathbf{v}_n$ are the eigenvectors of $A^TA$, normalized to unit length.
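A small numerical check of this fact (a sketch, using a random matrix):
#
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))

lam, V = np.linalg.eigh(A.T @ A)               # ascending eigenvalues, unit eigenvectors as columns
lam, V = lam[::-1], V[:, ::-1]                 # reorder so lambda_1 >= lambda_2 >= ...
print(np.sqrt(np.maximum(lam, 0)))             # sqrt of the eigenvalues of A^T A ...
print(np.linalg.norm(A @ V, axis=0))           # ... equal the lengths ||A v_i|| ...
print(np.linalg.svd(A, compute_uv=False))      # ... and equal numpy's singular values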
Now: we know that the vectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ are an orthogonal set because they are eigenvectors of the symmetric matrix $A^TA$.
However, it's also the case that $A\mathbf{v}_1, \dots, A\mathbf{v}_n$ are an orthogonal set.
This fact is key to the SVD.
Another way to look at this is to consider the set $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$, which by definition is orthonormal; the observation above says that its image $\{A\mathbf{v}_1, \dots, A\mathbf{v}_n\}$ is still orthogonal. In other words, we've found a set of vectors that, when transformed with $A$, preserves orthogonality.
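Here is a quick numerical illustration of this key fact (a sketch, random matrix): the Gram matrix of the vectors $A\mathbf{v}_i$ is diagonal, so they are mutually orthogonal.
#
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))

_, V = np.linalg.eigh(A.T @ A)        # columns of V: orthonormal eigenvectors of A^T A
AV = A @ V                            # the transformed vectors A v_i (as columns)
print(np.round(AV.T @ AV, 8))         # off-diagonal entries are 0: the A v_i are mutually orthogonal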
Theorem. Suppose $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is an orthonormal basis of $\mathbb{R}^n$ consisting of eigenvectors of $A^TA$, arranged so that the corresponding eigenvalues of $A^TA$ satisfy $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and suppose $A$ has $r$ nonzero singular values.
Then $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col} A$, and rank $A = r$.
Note how surprising this is: while $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is a basis for $\mathbb{R}^n$, $\operatorname{Col} A$ is a subspace of $\mathbb{R}^m$.
Nonetheless, $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal basis for $\operatorname{Col} A$.
Proof. What we need to do is establish that $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal, linearly independent set whose span is $\operatorname{Col} A$.
Because $\mathbf{v}_i$ and $\mathbf{v}_j$ are orthogonal for $i \neq j$,
$$(A\mathbf{v}_i)^T(A\mathbf{v}_j) = \mathbf{v}_i^T A^TA \mathbf{v}_j = \mathbf{v}_i^T (\lambda_j \mathbf{v}_j) = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0.$$
So $\{A\mathbf{v}_1, \dots, A\mathbf{v}_n\}$ is an orthogonal set.
Furthermore, since the lengths of the vectors $A\mathbf{v}_1, \dots, A\mathbf{v}_n$ are the singular values of $A$, and since there are $r$ nonzero singular values, $A\mathbf{v}_i \neq \mathbf{0}$ if and only if $1 \leq i \leq r$.
So $A\mathbf{v}_1, \dots, A\mathbf{v}_r$ are a linearly independent set (because they are orthogonal and all nonzero), and clearly they are each in $\operatorname{Col} A$.
Finally, we just need to show that $\operatorname{Span}\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\} = \operatorname{Col} A$.
To do this we'll show that for any $\mathbf{y}$ in $\operatorname{Col} A$, we can write $\mathbf{y}$ in terms of $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$:
Say $\mathbf{y} = A\mathbf{x}$.
Because $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ is a basis for $\mathbb{R}^n$, we can write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$, so
$$\mathbf{y} = A\mathbf{x} = c_1 A\mathbf{v}_1 + \cdots + c_r A\mathbf{v}_r + \cdots + c_n A\mathbf{v}_n = c_1 A\mathbf{v}_1 + \cdots + c_r A\mathbf{v}_r$$
(because $A\mathbf{v}_i = \mathbf{0}$ for $i > r$).
In summary: $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an (orthogonal) linearly independent set whose span is $\operatorname{Col} A$, so it is an (orthogonal) basis for $\operatorname{Col} A$.
Notice that we have also proved that rank $A = \dim \operatorname{Col} A = r$.
In other words, if $A$ has $r$ nonzero singular values, $A$ has rank $r$.
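As a last numerical check (a sketch with a synthetic rank-2 matrix), the number of nonzero singular values matches the rank that numpy reports:
#
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))   # a 6x4 matrix of rank 2 by construction
print(np.round(np.linalg.svd(A, compute_uv=False), 6))  # two nonzero singular values, the rest ~ 0
print(np.linalg.matrix_rank(A))                         # 2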