In [1]:
# for QR codes use inline
%matplotlib inline
qr_setting = 'url'
#
# for lecture use notebook
# %matplotlib notebook
# qr_setting = None
#
%config InlineBackend.figure_format='retina'
# import libraries
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
import laUtilities as ut
import slideUtilities as sl
import demoUtilities as dm
from importlib import reload
from datetime import datetime
from matplotlib import animation
from IPython.display import Image
from IPython.display import display_html
from IPython.display import display
from IPython.display import Math
from IPython.display import Latex
from IPython.display import HTML

Announcements

  • Homework:
    • Homework 9 out, due Sunday
  • Final exam: Monday, May 8 from 12-2pm
  • Upcoming office hours:
    • Tomorrow: Peer tutor Rohan Anand from 1:30-3pm on the CCDS 16th floor
    • Tomorrow: Abhishek Tiwari from 3:30-4:30pm on the CCDS 13th floor

Lecture 35: Markov Chains

In [2]:
# Image credit: http://en.wikipedia.org/wiki/Andrey_Markov#mediaviewer/File:AAMarkov.jpg
display(Image("images/AAMarkov.jpg", width=250))
display(HTML("<b>Andrei Markov, 1856 - 1922, St Petersburg.</b>"))
Andrei Markov, 1856 - 1922, St Petersburg.

From the official web site of the Russian Academy of Sciences. Licensed under Public Domain via Wikimedia Commons.

Markov was part of the great tradition of mathematics in Russia.

In the early 1900s, Markov pioneered the study of systems in which the future state of the system depends only on the present state in a random fashion. This has turned out to be a terrifically useful idea. For example, it is the starting point for analysis of the movement of stock prices, and the dynamics of animal populations.

For example, Markov studied the sequence of letters found in the text of Pushkin's Eugene Onegin, in particular the sequence of consonants and vowels. He observed that the probability of the next letter being a consonant or a vowel depended on what the preceding letter was.

This eventually led to the idea of a system in which one transitions between states, and the probability of going to another state depends only on the current state.

These have since been termed "Markov Chains."

Markov chains are essential tools in understanding, explaining, and predicting phenomena in computer science, physics, biology, economics, and finance.

35.1 Dynamical Systems

Today we will study Markov Chains as an application of linear algebra. You will see how the concepts we use, such as vectors and matrices, get applied to a particular problem. In DS122, you will study Markov Chains as an application of probability.

Many applications in computing are concerned with how a system behaves over time.

Think of a Web server that is processing requests for Web pages, or a network that is moving packets from place to place.

We would like to describe how systems like these operate, and analyze them to understand their performance limits.

The way we model this is:

  • we define some vector that describes the state of the system, and
  • we formulate a rule that tells us how to compute the next state of the system based on the current state of the system.

So we would say that the state of the system at time $k$ is a vector $\mathbf{x}_k \in \mathbb{R}^n,$ and

$$\mathbf{x}_{k+1} = T(\mathbf{x}_k), \quad \text{for time } k = 0, 1, 2, \dots$$

where $T: \mathbb{R}^n \to \mathbb{R}^n.$

The vector $\mathbf{x}_k$ is called the state vector.

This situation is so common that it goes by many names:

  • In physics, this is called a dynamical system.

    • Here, $\mathbf{x}_k$ might represent the position and velocity of a particle.
  • When studying algorithms, this is called a recurrence relation.

    • Here, $\mathbf{x}_k$ might represent the number of steps needed to solve a problem of size $k$.
  • Most commonly, this is called a difference equation.

    • The reason for this terminology is that it is a discrete analog of a differential equation in $k$.

Of course, we are going to be particularly interested in the case where $T$ is a linear transformation.

Then we know that we can write the difference equation as:

$$\mathbf{x}_{k+1} = A\mathbf{x}_k,$$

where $A \in \mathbb{R}^{n \times n}.$

This is a linear difference equation.

Example.

Here is our warm-up problem.

We are interested in the population of two regions, say the city and the suburbs.

Fix an initial year (say 2000) and let

$$\mathbf{x}_0 = \begin{bmatrix} \text{population of the city in 2000} \\ \text{population of the suburbs in 2000} \end{bmatrix}.$$

Then

$$\mathbf{x}_1 = \begin{bmatrix} \text{population of the city in 2001} \\ \text{population of the suburbs in 2001} \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} \text{population of the city in 2002} \\ \text{population of the suburbs in 2002} \end{bmatrix}, \quad \dots \text{etc.}$$

We only concern ourselves with movements of people between the two regions.

  • no immigration, emigration, birth, death, etc.

We assume that measurements have shown the following pattern:

in any given year,

  • 5% of the people in the city move to the suburbs, and
  • 3% of the people in the suburbs move to the city.

You can think of this as:

$$\begin{array}{c|cc} & \text{From City} & \text{From Suburbs} \\ \hline \text{To City} & .95 & .03 \\ \text{To Suburbs} & .05 & .97 \end{array}$$

Then we can capture this update rule as a matrix:

$$A = \begin{bmatrix} .95 & .03 \\ .05 & .97 \end{bmatrix}.$$

We can see that this is correct by verifying that:

$$\begin{bmatrix} \text{city pop. in 2001} \\ \text{suburb pop. in 2001} \end{bmatrix} = \begin{bmatrix} .95 & .03 \\ .05 & .97 \end{bmatrix} \begin{bmatrix} \text{city pop. in 2000} \\ \text{suburb pop. in 2000} \end{bmatrix}.$$

35.2 Markov Chains

Let's look at $A$ again:

$$A = \begin{bmatrix} .95 & .03 \\ .05 & .97 \end{bmatrix}.$$

We note that $A$ has a special property: each of its columns adds up to 1.

Also, it would not make sense to have negative entries in $A$.

The reason that columns sum to 1 is that the total number of people in the system is not changing over time.

This leads to three definitions:

Definition. A probability vector is a vector of nonnegative entries that sums to 1.

Definition. A stochastic matrix is a square matrix of nonnegative values whose columns each sum to 1.

Definition. A Markov chain is a dynamical system whose state is a probability vector and which evolves according to a stochastic matrix.

That is, it is a probability vector $\mathbf{x}_0$ and a stochastic matrix $A \in \mathbb{R}^{n \times n}$ such that

$$\mathbf{x}_{k+1} = A\mathbf{x}_k \quad \text{for } k = 0, 1, 2, \dots$$

So we think of a probability vector $\mathbf{x}_0$ as describing how things are "distributed" across various categories -- the fraction of items that are in each category.

And we think of the stochastic matrix $A$ as describing how things "redistribute" themselves at each time step.
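
These definitions are easy to check numerically. Here is a minimal sketch (the helper names is_probability_vector and is_stochastic are our own, not a library API; np is NumPy, imported at the top of the notebook):

In [ ]:
# helper checks for the two definitions above
def is_probability_vector(x):
    # nonnegative entries that sum to 1
    return bool(np.all(x >= 0) and np.isclose(x.sum(), 1))

def is_stochastic(A):
    # square, nonnegative, and every column sums to 1
    return bool(A.shape[0] == A.shape[1]
                and np.all(A >= 0)
                and np.allclose(A.sum(axis=0), 1))

A = np.array([[.95, .03],
              [.05, .97]])
print(is_stochastic(A))                              # True
print(is_probability_vector(np.array([0.60, 0.40]))) # True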

Example. Suppose that in 2000 the population of the city is 600,000 and the population of the suburbs is 400,000. What is the distribution of the population in 2001? In 2002? In 2020?

Solution. First, we convert the population distribution to a probability vector. This is done by simply normalizing by the sum of the vector elements.

$$600{,}000 + 400{,}000 = 1{,}000{,}000.$$

$$\frac{1}{1{,}000{,}000}\begin{bmatrix} 600{,}000 \\ 400{,}000 \end{bmatrix} = \begin{bmatrix} 0.60 \\ 0.40 \end{bmatrix}.$$

Then the distribution of population in 2001 is:

$$\mathbf{x}_1 = A\mathbf{x}_0 = \begin{bmatrix} .95 & .03 \\ .05 & .97 \end{bmatrix} \begin{bmatrix} 0.60 \\ 0.40 \end{bmatrix} = \begin{bmatrix} 0.582 \\ 0.418 \end{bmatrix}.$$

And the distribution of the population in 2002 is:

$$\mathbf{x}_2 = A\mathbf{x}_1 = \begin{bmatrix} .95 & .03 \\ .05 & .97 \end{bmatrix} \begin{bmatrix} 0.582 \\ 0.418 \end{bmatrix} = \begin{bmatrix} 0.565 \\ 0.435 \end{bmatrix}.$$

Note that another way we could have written this is:

$$\mathbf{x}_2 = A\mathbf{x}_1 = A(A\mathbf{x}_0) = A^2\mathbf{x}_0.$$
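
We can check these hand calculations, and the identity $\mathbf{x}_2 = A^2\mathbf{x}_0$, with a few lines of code (a quick sketch using the matrices above):

In [ ]:
# verify x_1 and x_2, and that repeated multiplication equals a matrix power
A = np.array([[.95, .03],
              [.05, .97]])
x_0 = np.array([0.60, 0.40])
x_1 = A @ x_0
x_2 = A @ x_1
print(x_1)   # [0.582 0.418]
print(x_2)   # approximately [0.565 0.435]
print(np.allclose(x_2, np.linalg.matrix_power(A, 2) @ x_0))   # True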

To answer the question for 2020, i.e., $k = 20,$ we note that

$$\mathbf{x}_{20} = \overbrace{A \cdots A}^{20} \mathbf{x}_0 = A^{20}\mathbf{x}_0.$$
In [5]:
# stochastic matrix A
A = np.array(
    [[0.95,0.03],
     [0.05,0.97]])
#
# initial state vector x_0
x_0 = np.array([0.60,0.40])
#
# compute A^20
A_20 = np.linalg.matrix_power(A, 20)
#
# compute x_20
x_20 = A_20 @ x_0
print(x_20)
[0.417456 0.582544]

So we find that after 20 years, only 42% of the population will remain in the city.

35.3 Predicting the Distant Future

We noticed that the population of the city is going down. Will everyone eventually live in the suburbs?

An important question about a Markov Chain is: what will happen in the distant future?

For example, what happens to the population distribution in our example "in the long run?"

Rather than answering that question right now, we'll take a more interesting example.

Suppose we have a system whose state transition is described by the stochastic matrix

$$P = \begin{bmatrix} .5 & .2 & .3 \\ .3 & .8 & .3 \\ .2 & 0 & .4 \end{bmatrix}$$

and which starts in the state

$$\mathbf{x}_0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.$$

Consider the Markov Chain defined by $P$ and $\mathbf{x}_0$, that is, the chain defined as

$$\mathbf{x}_{k+1} = P\mathbf{x}_k \quad \text{for } k = 0, 1, 2, \dots$$

What happens to the system as time passes?

Let's compute the state vectors $\mathbf{x}_1, \dots, \mathbf{x}_{15}$ to find out.

In [2]:
# stochastic matrix A
A = np.array(
    [[.5,.2,.3],
     [.3,.8,.3],
     [.2, 0,.4]])
#
# initial state vector
x = np.array([1,0,0])
#
# array to hold each future state vector
xs = np.zeros((15,3))
# 
# compute future state vectors
for i in range(15):
    xs[i] = x
    print(f'x({i}) = {x}')
    x = A @ x
x(0) = [1 0 0]
x(1) = [0.5 0.3 0.2]
x(2) = [0.37 0.45 0.18]
x(3) = [0.329 0.525 0.146]
x(4) = [0.3133 0.5625 0.1242]
x(5) = [0.30641 0.58125 0.11234]
x(6) = [0.303157 0.590625 0.106218]
x(7) = [0.3015689 0.5953125 0.1031186]
x(8) = [0.30078253 0.59765625 0.10156122]
x(9) = [0.30039088 0.59882813 0.10078099]
x(10) = [0.30019536 0.59941406 0.10039057]
x(11) = [0.30009767 0.59970703 0.1001953 ]
x(12) = [0.30004883 0.59985352 0.10009765]
x(13) = [0.30002441 0.59992676 0.10004883]
x(14) = [0.30001221 0.59996338 0.10002441]

What is going on here? Let's look at these values graphically.

In [3]:
ax = plt.subplot(131)
plt.plot(range(15),xs.T[0],'o-')
ax.set_ylim([0,1])
plt.title(r'$x_1$',size=24)
ax = plt.subplot(132)
plt.plot(range(15),xs.T[1],'o-')
ax.set_ylim([0,1])
plt.title(r'$x_2$',size=24)
ax = plt.subplot(133)
plt.plot(range(15),xs.T[2],'o-')
ax.set_ylim([0,1])
plt.title(r'$x_3$',size=24)
plt.tight_layout()

Based on visual inspection, these vectors seem to be approaching

$$\mathbf{q} = \begin{bmatrix} .3 \\ .6 \\ .1 \end{bmatrix}.$$

The components of $\mathbf{x}_k$ don't seem to be changing much past about $k = 10.$

In fact, we can confirm that this system would be stable at $\begin{bmatrix} .3 \\ .6 \\ .1 \end{bmatrix}$ by noting that:

$$\begin{bmatrix} .5 & .2 & .3 \\ .3 & .8 & .3 \\ .2 & 0 & .4 \end{bmatrix} \begin{bmatrix} .3 \\ .6 \\ .1 \end{bmatrix} = \begin{bmatrix} .15 + .12 + .03 \\ .09 + .48 + .03 \\ .06 + 0 + .04 \end{bmatrix} = \begin{bmatrix} .3 \\ .6 \\ .1 \end{bmatrix}.$$

This calculation is exact. So it seems that:

  • the sequence of vectors is approaching $\begin{bmatrix} .3 \\ .6 \\ .1 \end{bmatrix}$ as a limit, and
  • when and if they get to that point, they will stabilize there.
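
We can also confirm this fixed point numerically (a minimal check):

In [ ]:
# confirm that q = [.3, .6, .1] is a fixed point of P
P = np.array([[.5, .2, .3],
              [.3, .8, .3],
              [.2, 0., .4]])
q = np.array([.3, .6, .1])
print(P @ q)                   # [0.3 0.6 0.1]
print(np.allclose(P @ q, q))   # True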

35.4 Steady-State Vectors

This convergence to a "steady state" is quite remarkable. Is this a general phenomenon?

Definition. If $P$ is a stochastic matrix, then a steady-state vector (or equilibrium vector) for $P$ is a probability vector $\mathbf{q}$ such that:

$$P\mathbf{q} = \mathbf{q}.$$

It can be shown that every stochastic matrix has at least one steady-state vector.

(We'll study this more closely in a later lecture.)

Example.

$\begin{bmatrix} .3 \\ .6 \\ .1 \end{bmatrix}$ is the steady-state vector for $\begin{bmatrix} .5 & .2 & .3 \\ .3 & .8 & .3 \\ .2 & 0 & .4 \end{bmatrix}.$

Example.

Recall the population-movement example above.

The probability vector $\mathbf{q} = \begin{bmatrix} .375 \\ .625 \end{bmatrix}$ is a steady-state vector for the population migration matrix $A$, because

$$A\mathbf{q} = \begin{bmatrix} .95 & .03 \\ .05 & .97 \end{bmatrix} \begin{bmatrix} .375 \\ .625 \end{bmatrix} = \begin{bmatrix} .35625 + .01875 \\ .01875 + .60625 \end{bmatrix} = \begin{bmatrix} .375 \\ .625 \end{bmatrix} = \mathbf{q}.$$

To interpret this:

  • if the total population of the region is 1 million,
  • then if there are 375,000 persons in the city and 625,000 persons in the suburbs,
  • the populations of both the city and the suburbs would stabilize -- they would stay the same in all future years.
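
Again this is easy to confirm in code, along with the observation that the year-2000 population vector from before is headed toward this steady state (a quick sketch; the power 200 is just an arbitrarily large number of years):

In [ ]:
# confirm that q = [.375, .625] is a steady state of A,
# and that the year-2000 distribution converges toward it
A = np.array([[.95, .03],
              [.05, .97]])
q = np.array([.375, .625])
print(np.allclose(A @ q, q))                  # True
x_0 = np.array([0.60, 0.40])
print(np.linalg.matrix_power(A, 200) @ x_0)   # approximately [0.375 0.625]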

35.5 Finding the Steady State

OK, so it seems that the two Markov Chains we have studied so far each have a steady state. This leads to two questions:

  • Can we compute the steady state?
    • So far we have guessed what the steady state is, and then checked. Can we compute the steady state directly?
  • How do we know if:
    • a Markov Chain has a unique steady state, and
    • whether it will always converge to that steady state?

Let's start by thinking about how to compute the steady state directly.

Example. Let $P = \begin{bmatrix} .6 & .3 \\ .4 & .7 \end{bmatrix}.$ Find a steady-state vector for $P.$

Solution. Let's simply solve the equation $P\mathbf{x} = \mathbf{x}.$

$$P\mathbf{x} = \mathbf{x}$$
$$P\mathbf{x} - \mathbf{x} = \mathbf{0}$$
$$P\mathbf{x} - I\mathbf{x} = \mathbf{0}$$
$$(P - I)\mathbf{x} = \mathbf{0}$$

Now, $P - I$ is a matrix, so this is a linear system that we can solve.

$$P - I = \begin{bmatrix} .6 & .3 \\ .4 & .7 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -.4 & .3 \\ .4 & -.3 \end{bmatrix}.$$

To find all solutions of $(P - I)\mathbf{x} = \mathbf{0},$ we row reduce the augmented matrix:

$$\left[\begin{array}{cc|c} -.4 & .3 & 0 \\ .4 & -.3 & 0 \end{array}\right] \sim \left[\begin{array}{cc|c} -.4 & .3 & 0 \\ 0 & 0 & 0 \end{array}\right] \sim \left[\begin{array}{cc|c} 1 & -3/4 & 0 \\ 0 & 0 & 0 \end{array}\right].$$

So $x_1 = \frac{3}{4}x_2$ and $x_2$ is free. The general solution is $\begin{bmatrix} \frac{3}{4}x_2 \\ x_2 \end{bmatrix}.$

This means that there is an infinite set of solutions. Which one are we interested in?

Remember that our vectors $\mathbf{x}$ are probability vectors. So we are interested in the solution whose entries are nonnegative and sum to 1.

The simple way to find this is to take any solution and divide it by the sum of its entries (so that the entries sum to 1).

Let's choose $x_2 = 1$, so the specific solution is:

$$\begin{bmatrix} \frac{3}{4} \\ 1 \end{bmatrix}.$$

Normalizing this by the sum of the entries ($\frac{7}{4}$) we get:

$$\mathbf{q} = \begin{bmatrix} \frac{3}{7} \\ \frac{4}{7} \end{bmatrix}.$$

So, we have found how to solve a Markov Chain for its steady state (see the code sketch after this recipe):

  • Solve the linear system $(P - I)\mathbf{x} = \mathbf{0}.$
  • The system will have an infinite number of solutions, with one free variable. Obtain a general solution.
  • Pick any specific solution (choose any value for the free variable), and normalize it so the entries add up to 1.
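
Here is one way to carry out this recipe in code. Instead of row reducing by hand, this sketch finds a nonzero vector in the null space of $P - I$ as the eigenvector of $P$ with eigenvalue 1 (a different route to the same solution set), then normalizes it so its entries sum to 1:

In [ ]:
# steady state of a stochastic matrix: a nonzero solution of (P - I)x = 0,
# obtained here as the eigenvector of P for eigenvalue 1, then normalized
def steady_state(P):
    vals, vecs = np.linalg.eig(P)
    v = vecs[:, np.argmin(np.abs(vals - 1))].real   # eigenvector for eigenvalue closest to 1
    return v / v.sum()                              # scale so the entries sum to 1

P = np.array([[.6, .3],
              [.4, .7]])
print(steady_state(P))   # [3/7, 4/7], approximately [0.4286 0.5714]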

35.6 Existence of, and Convergence to, Steady State

Finally: when does the system $(P - I)\mathbf{x} = \mathbf{0}$ have a unique solution that is a probability vector, and how do we know the chain will converge to that vector?

Of course, a linear system in general might have no solutions, or it might have a unique solution that is not a probability vector.

So what we are asking is: when does the system $(P - I)\mathbf{x} = \mathbf{0}$ defined by a Markov Chain have an infinite set of solutions, so that we can find one of them that is a probability vector?

Definition. We say that a stochastic matrix $P$ is regular if some matrix power $P^k$ contains only strictly positive entries.

For

$$P = \begin{bmatrix} .5 & .2 & .3 \\ .3 & .8 & .3 \\ .2 & 0 & .4 \end{bmatrix},$$

we note that $P$ does not have every entry strictly positive.

However:

$$P^2 = \begin{bmatrix} .37 & .26 & .33 \\ .45 & .70 & .45 \\ .18 & .04 & .22 \end{bmatrix}.$$

Since every entry in $P^2$ is positive, $P$ is a regular stochastic matrix.
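
This check is easy to do computationally (a small sketch):

In [ ]:
# P itself has a zero entry, but P^2 is strictly positive, so P is regular
P = np.array([[.5, .2, .3],
              [.3, .8, .3],
              [.2, 0., .4]])
print(np.all(P > 0))                  # False: P has a zero entry
P_2 = np.linalg.matrix_power(P, 2)
print(P_2)                            # matches the matrix shown above
print(np.all(P_2 > 0))                # True: P is regular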

Theorem. If $P$ is an $n \times n$ regular stochastic matrix, then $P$ has a unique steady-state vector $\mathbf{q}.$ Further, if $\mathbf{x}_0$ is any initial state and $\mathbf{x}_{k+1} = P\mathbf{x}_k$ for $k = 0, 1, 2, \dots,$ then the Markov Chain $\{\mathbf{x}_k\}$ converges to $\mathbf{q}$ as $k \to \infty.$

Note the phrase "any initial state."

This is a remarkable property of a Markov Chain: it converges to its steady-state vector no matter what state the chain starts in.

We say that the long-term behavior of the chain has "no memory of the starting state."

Example. Consider a computer system that consists of a disk, a CPU, and a network interface.

A set of jobs are loaded into the system. Each job makes requests for service from each of the components of the system.

After receiving service, the job then next requests service from the same or a different component, and so on.

Jobs move between system components according to the following diagram.

For example, after receiving service from the Disk, 70% of jobs return to the disk for another unit of service; 20% request service from the network interface; and 10% request service from the CPU.

[Figure: diagram of job movement among the Disk, Net, and CPU]

Assume the system runs for a long time. Determine whether the system will stabilize, and if so, find the fraction of jobs that are using each device once stabilized. Which device is busiest? Least busy?

Solution. From the diagram, the movement of jobs among components is given by:

$$\begin{array}{c|ccc} & \text{From Disk} & \text{From Net} & \text{From CPU} \\ \hline \text{To Disk} & 0.70 & 0.10 & 0.30 \\ \text{To Net} & 0.20 & 0.80 & 0.30 \\ \text{To CPU} & 0.10 & 0.10 & 0.40 \end{array}$$

This corresponds to the stochastic matrix:

$$P = \begin{bmatrix} 0.70 & 0.10 & 0.30 \\ 0.20 & 0.80 & 0.30 \\ 0.10 & 0.10 & 0.40 \end{bmatrix}.$$

First of all, this is a regular matrix because $P$ has all strictly positive entries. So it has a unique steady-state vector.

Next, we find the steady state of the system by solving $(P - I)\mathbf{x} = \mathbf{0}$:

$$\left[\begin{array}{c|c} P - I & \mathbf{0} \end{array}\right] = \left[\begin{array}{ccc|c} -0.30 & 0.10 & 0.30 & 0 \\ 0.20 & -0.20 & 0.30 & 0 \\ 0.10 & 0.10 & -0.60 & 0 \end{array}\right] \sim \left[\begin{array}{ccc|c} 1 & 0 & -9/4 & 0 \\ 0 & 1 & -15/4 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right].$$

Thus the general solution of $(P - I)\mathbf{x} = \mathbf{0}$ is

$$x_1 = \frac{9}{4}x_3, \quad x_2 = \frac{15}{4}x_3, \quad x_3 \text{ free}.$$

Since $x_3$ is free (there are infinitely many solutions), we can find a solution whose entries sum to 1. This is:

$$\mathbf{q} \approx \begin{bmatrix} .32 \\ .54 \\ .14 \end{bmatrix}.$$
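
To see where these decimals come from: choosing $x_3 = 1$ gives the particular solution $\begin{bmatrix} 9/4 \\ 15/4 \\ 1 \end{bmatrix}$, whose entries sum to $7$, so $\mathbf{q} = \begin{bmatrix} 9/28 \\ 15/28 \\ 4/28 \end{bmatrix}.$ A quick check in code:

In [ ]:
# normalize the particular solution [9/4, 15/4, 1] and verify it is a steady state
v = np.array([9/4, 15/4, 1])
q = v / v.sum()
print(q)                       # approximately [.32 .54 .14]
P = np.array([[.7, .1, .3],
              [.2, .8, .3],
              [.1, .1, .4]])
print(np.allclose(P @ q, q))   # True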

So we see that in the long run,

  • about 32% of the jobs will be using the disk,
  • about 54% will be using the network interface, and
  • about 14% will be using the CPU.

Again, the important fact here is that we did not have to concern ourselves with the state in which the system started.

The influence of the starting state is eventually lost!

Let's demonstrate that computationally. Say that at the start, all the jobs happened to be using the CPU. Then:

$$\mathbf{x}_0 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$

Now let's look at how the three elements of the $\mathbf{x}$ vector evolve with time, by computing $P\mathbf{x}_0$, $P^2\mathbf{x}_0$, $P^3\mathbf{x}_0$, etc.

In [4]:
#
# system definition
A = np.array(
    [[.7,.1,.3],
     [.2,.8,.3],
     [.1,.1,.4]])
x = np.array([0,0,1])
xs = np.zeros((15,3))
#
# compute 15 steps in the evolution of the system state
for i in range(15):
    xs[i] = x
    x = A @ x
#
# plot the results
ax = plt.subplot(131)
plt.plot(range(15),xs.T[0],'o-')
ax.set_ylim([0,1])
plt.title(r'$x_1$',size=24)
ax = plt.subplot(132)
plt.plot(range(15),xs.T[1],'o-')
ax.set_ylim([0,1])
plt.title(r'$x_2$',size=24)
ax = plt.subplot(133)
plt.plot(range(15),xs.T[2],'o-')
ax.set_ylim([0,1])
plt.title(r'$x_3$',size=24)
plt.tight_layout()

Notice how the system starts in the state $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},$

but quickly (within about 10 steps) reaches the equilibrium state that we predicted: $\begin{bmatrix} .32 \\ .54 \\ .14 \end{bmatrix}.$

Now let's compare what happens if the system starts in a different state, say $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$:

In [5]:
A = np.array([[.7,.1,.3],[.2,.8,.3],[.1,.1,.4]])
x = np.array([1,0,0])
ys = np.zeros((15,3))
for i in range(15):
    ys[i] = x
    x = A @ x
ax = plt.subplot(131)
plt.plot(range(15),xs.T[0],'o-',range(15),ys.T[0],'o-r')
ax.set_ylim([0,1])
plt.title(r'$x_1$',size=24)
ax = plt.subplot(132)
plt.plot(range(15),xs.T[1],'o-',range(15),ys.T[1],'o-r')
ax.set_ylim([0,1])
plt.title(r'$x_2$',size=24)
ax = plt.subplot(133)
plt.plot(range(15),xs.T[2],'o-',range(15),ys.T[2],'o-r')
ax.set_ylim([0,1])
plt.title(r'$x_3$',size=24)
plt.tight_layout()

This shows graphically that even though the system started in a very different state, it quickly converges to the same steady state.

In [3]:
A = np.array([[.7,.1,.3],[.2,.8,.3],[.1,.1,.4]])
x = np.array([[0., 0., 1.], [1., 0., 0.], [0., 1., 0.], [0.5, 0., 0.5], [0., 0.5, 0.5], [0.5, 0.5, 0.]]).T
colors = ['r', 'b', 'g', 'y', 'c', 'm']
npts = x.shape[1]
fig = ut.three_d_figure((11, 1), fig_desc = 'Convergence to Steady State',
                        xmin = 0, xmax = 1, ymin = 0, ymax = 1, zmin = 0, zmax = 1, qr = None)
#plt.close()
for pt in range(npts):
    fig.plotPoint(x[0][pt], x[1][pt], x[2][pt], colors[pt])
for i in range(10):
    xnext = A @ x
    for pt in range(npts):
        # fig.plotPoint(xnext[0][pt], xnext[1][pt], xnext[2][pt], colors[pt])
        fig.plotLine(np.array([[x[0][pt], xnext[0][pt]], [x[1][pt], xnext[1][pt]], [x[2][pt], xnext[2][pt]]]).T, 
                     colors[pt])
    x = xnext
fig.text(x[0][0]-0.05, x[1][0], x[2][0]+0.05, 'Steady State', 'Steady State', 12)
fig.plotPoint(x[0][0], x[1][0], x[2][0], 'k')
fig.ax.view_init(azim=40.0,elev=20.0)
fig.save()
#
def anim(frame):
    fig.ax.view_init(azim = frame, elev = 20)
    # fig.canvas.draw()
#
# create and display the animation 
#HTML(animation.FuncAnimation(fig.fig, anim,
#                       frames = 5 * np.arange(72),
#                       fargs = None,
#                       interval = 100).to_jshtml(default_mode = 'loop'))

Although we have been showing each component converging separately, in fact the entire state vector of the system can be thought of as evolving in space.

This figure shows the movement of the state vector starting from six different initial conditions, and shows how regardless of the starting state, the eventual state is the same.

Summary

  • Many phenomena can be described using Markov's idea:
    • There are "states", and
    • Transitions between states depend only on the current state.
  • Examples: population movement, jobs in a computer, consonants/vowels in a text...
  • Such a system can be expressed in terms of a stochastic matrix and a probability vector.
  • Evolution of the system in time is described by matrix multiplication.
  • Using linear algebraic tools we can predict the steady state of such a system!
In [ ]: