05 - PCCA and TPT analysis

Creative Commons Licence

In this notebook, we will cover how to use PCCA++ to extract a coarse representation of the MSM. We will further investigate how to use transition path theory (TPT) to follow the pathways of the processes. When we want to analyze pathways, models with fewer states are more often desirable, since these are easier to understand. PCCA++ allows us to assign the microstates directly to metastable macrostates and TPT uses this group assignment to compute fluxes and pathways.

Another method to get a model with fewer states are hidden Markov state models (HMM), introduced in Notebook 07 ➜ 📓. In contrast to computing memberships of microstates to meta stable sets as in PCCA++, in HMMs we directly obtain a model with fewer states.

While we will mostly rely on previously estimated/validated models, it will be helpful to understand the topics - data loading/visualization (Notebook 01 ➜ 📓) - dimension reduction (Notebook 02 ➜ 📓) - the estimation and validation process (Notebook 03 ➜ 📓)

Here you can find literature on the used methods: - roeblitz-weber-14 - weinan-06 - metzner-09

Maintainers: [@cwehmeyer](https://github.com/cwehmeyer), [@marscher](https://github.com/marscher), [@thempel](https://github.com/thempel), [@psolsson](https://github.com/psolsson)

Remember: - to run the currently highlighted cell, hold ⇧ Shift and press ⏎ Enter; - to get help for a specific function, place the cursor within the function’s brackets, hold ⇧ Shift, and press ⇥ Tab; - you can find the full documentation at PyEMMA.org.


⚠️ We have assigned the integer numbers $1 \dots `$ ``nstates` to PCCA++ metastable states. As PyEMMA is written in Python, it internally indexes states starting from \(0\). In consequence, numbers in the code cells differ by \(-1\) from the plot labels and markdown text.

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import mdshare
import pyemma
/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/mdshare/repository.py:53: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  data = load(fh)

Case 1: preprocessed, two-dimensional data (toy model)

We start by loading the data and the previously analyzed MSM (estimated in Notebook 04 ➜ 📓) from disk:

[2]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']

msm = pyemma.load('nb4.pyemma', model_name='doublewell_msm')
bayesian_msm = pyemma.load('nb4.pyemma', model_name='doublewell_bayesian_msm')
cluster = pyemma.load('nb4.pyemma', model_name='doublewell_cluster')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-dbad461422f6> in <module>
      3     data = fh['trajectory']
      4
----> 5 msm = pyemma.load('nb4.pyemma', model_name='doublewell_msm')
      6 bayesian_msm = pyemma.load('nb4.pyemma', model_name='doublewell_bayesian_msm')
      7 cluster = pyemma.load('nb4.pyemma', model_name='doublewell_cluster')

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/__init__.py in load(filename, model_name)
     14 def load(filename, model_name='default'):
     15     from .serialization import SerializableMixIn
---> 16     return SerializableMixIn.load(file_name=filename, model_name=model_name)
     17
     18

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/serialization.py in load(cls, file_name, model_name)
    265         obj : the de-serialized object
    266         """
--> 267         from .h5file import H5File
    268         with H5File(file_name, model_name=model_name, mode='r') as f:
    269             return f.model

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/h5file.py in <module>
     20 import numpy as np
     21 from io import BytesIO
---> 22 from .pickle_extensions import HDF5PersistentPickler
     23
     24 __author__ = 'marscher'

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/pickle_extensions.py in <module>
     66
     67 # we cache this during runtime
---> 68 _DEFAULT_BLOSC_OPTIONS = _check_blosc_avail()
     69
     70

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/pickle_extensions.py in _check_blosc_avail()
     44     fid, name = tempfile.mkstemp()
     45     try:
---> 46         with h5py.File(name) as h5f:
     47             try:
     48                 h5f.create_dataset('test', shape=(1,1), **blosc_opts)

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    422             with phil:
    423                 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
--> 424                 fid = make_fid(name, mode, userblock_size,
    425                                fapl, fcpl=make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    426                                fs_persist=fs_persist, fs_threshold=fs_threshold),

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    188         if swmr and swmr_support:
    189             flags |= h5f.ACC_SWMR_READ
--> 190         fid = h5f.open(name, flags, fapl=fapl)
    191     elif mode == 'r+':
    192         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)

We currently have an MSM with \(50\) discrete states which was validated for two metastable states in the previous notebook. Internally, the metastable states have been computed using the Perron Cluster Cluster Analysis (PCCA++) method roeblitz-14. Let’s analyze this in more detail here. We can explicitly compute it by calling msm.pcca().

[3]:
nstates = 2
msm.pcca(nstates)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-fde0212f0242> in <module>
      1 nstates = 2
----> 2 msm.pcca(nstates)

NameError: name 'msm' is not defined

PCCA++ computes membership distributions, i.e., probabilities of micro-states to belong to the same metastable state. It does so by using the properties of slow processes in eigenvector space. Let us visualize the membership distributions in the same fashion as before:

[4]:
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    pyemma.plots.plot_contour(
        *data.T, msm.metastable_distributions[i][cluster.dtrajs[0]], ax=ax, cmap='afmhot_r',
        mask=True, method='nearest', cbar_label='metastable distribution {}'.format(i + 1))
    ax.scatter(*cluster.clustercenters.T, s=15, c='k')
    ax.set_xlabel('$x$')
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)
    ax.set_aspect('equal')
axes[0].set_ylabel('$y$')
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-91a1891291e0> in <module>
      2 for i, ax in enumerate(axes.flat):
      3     pyemma.plots.plot_contour(
----> 4         *data.T, msm.metastable_distributions[i][cluster.dtrajs[0]], ax=ax, cmap='afmhot_r',
      5         mask=True, method='nearest', cbar_label='metastable distribution {}'.format(i + 1))
      6     ax.scatter(*cluster.clustercenters.T, s=15, c='k')

NameError: name 'msm' is not defined
../../_images/tutorials_notebooks_05-pcca-tpt_7_1.png

As expected, PCCA++ has assigned metastable states to the basins of the double well. Since PCCA++, in simplified words, does a clustering in eigenvector space and the first eigenvector separated these states already, the nice separation comes to no surprise.

It is important to note, though, that PCCA++ in general does not yield a coarse transition matrix. How to obtain this will be covered in Notebook 07 ➜ 📓. However, we can compute mean first passage times (MFPTs) and equilibrium probabilities on the metastable sets and extract representative structures.

The stationary probability of metastable states can simply be computed by summing over all of its micro-states (please note that the PCCA++ object returned by msm.pcca() also has a convenience function to do that.)

[5]:
for i, s in enumerate(msm.metastable_sets):
    print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-0d15b75e9d68> in <module>
----> 1 for i, s in enumerate(msm.metastable_sets):
      2     print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))

NameError: name 'msm' is not defined

We use the mfpt() method of the original MSM object to compute MFPTs between pairs of metastable sets (accessible via the metastable_sets attribute of the MSM object).

[6]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            msm.metastable_sets[i],
            msm.metastable_sets[j])

from pandas import DataFrame
print('MFPT / steps:')
DataFrame(np.round(mfpt, decimals=2), index=range(1, nstates + 1), columns=range(1, nstates + 1))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-d2b96f76bd87> in <module>
      2 for i in range(nstates):
      3     for j in range(nstates):
----> 4         mfpt[i, j] = msm.mfpt(
      5             msm.metastable_sets[i],
      6             msm.metastable_sets[j])

NameError: name 'msm' is not defined

As described above, the errors can be estimated from the Bayesian MSM. Instead of just printing means and confidence intervals, let’s compute the samples explicitly and histogram them.

[7]:
mfpt_sample = np.zeros((nstates, nstates, bayesian_msm.nsamples))
for i in range(nstates):
    for j in range(nstates):
        mfpt_sample[i, j] = bayesian_msm.sample_f(
            'mfpt',
            msm.metastable_sets[i],
            msm.metastable_sets[j])

fig, ax = plt.subplots()
ax.hist(mfpt_sample[0, 1], histtype='step', label='MS 1 -> MS 2', density=True)
ax.hist(mfpt_sample[1, 0], histtype='step', label='MS 2 -> MS 1', density=True)
ax.set_xlabel('MFPT (steps)')
ax.set_title('Bayesian MFPT sample histograms')
fig.legend(loc=10);
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-97557ef8c89a> in <module>
----> 1 mfpt_sample = np.zeros((nstates, nstates, bayesian_msm.nsamples))
      2 for i in range(nstates):
      3     for j in range(nstates):
      4         mfpt_sample[i, j] = bayesian_msm.sample_f(
      5             'mfpt',

NameError: name 'bayesian_msm' is not defined

We clearly see that there is no overlap of the distributions approximated by the Bayesian MSM.

To do a more detailed analysis of the transition paths, we make use of transition path theory (TPT) in its MSM formulation. We first analyze the flux between the two metastable sets:

[8]:
A = msm.metastable_sets[0]
B = msm.metastable_sets[1]
flux = pyemma.msm.tpt(msm, A, B)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-006be119690a> in <module>
----> 1 A = msm.metastable_sets[0]
      2 B = msm.metastable_sets[1]
      3 flux = pyemma.msm.tpt(msm, A, B)

NameError: name 'msm' is not defined

In TPT, many properties are derived from the committor functions. They describe the probability of reaching a set \(A\) before set \(B\) as a function of some state \(x\). In order to understand this, we plot the committor of the previously defined sets as a function of the cluster centers:

[9]:
fig, ax = plt.subplots(figsize=(5, 4))
pyemma.plots.plot_contour(
    *data.T,
    flux.committor[cluster.dtrajs[0]],
    ax=ax,
    cmap='brg',
    mask=True,
    cbar_label=r'committor A $\to$ B')
ax.set_xlabel('$x$')
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
ax.set_ylabel('$y$')
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-f390b30a2261> in <module>
      2 pyemma.plots.plot_contour(
      3     *data.T,
----> 4     flux.committor[cluster.dtrajs[0]],
      5     ax=ax,
      6     cmap='brg',

NameError: name 'flux' is not defined
../../_images/tutorials_notebooks_05-pcca-tpt_17_1.png

We see that the committor for the double well data approximates a step function between the two basins. In other words, the probability of transitioning from metastable state \(A\) to \(B\) is only \(1\) if we already are in state \(B\). If we are in \(A\), it is \(0\) by definition. The clustering did not resolve the transition region, so this particular example does not provide more information. In the next example we will see more.

Case 2: low-dimensional molecular dynamics data (alanine dipeptide)

Again, we load the model that we have estimated previously.

[10]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)
data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

msm = pyemma.load('nb4.pyemma', model_name='ala2_msm')
bayesian_msm = pyemma.load('nb4.pyemma', model_name='ala2_bayesian_msm')
cluster = pyemma.load('nb4.pyemma', model_name='ala2_cluster')

# not to be used in MSM estimation (artificical transitions between individual trajectories)!
dtrajs_concatenated = np.concatenate(cluster.dtrajs)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-93c7142c4806> in <module>
      7 data_concatenated = np.concatenate(data)
      8
----> 9 msm = pyemma.load('nb4.pyemma', model_name='ala2_msm')
     10 bayesian_msm = pyemma.load('nb4.pyemma', model_name='ala2_bayesian_msm')
     11 cluster = pyemma.load('nb4.pyemma', model_name='ala2_cluster')

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/__init__.py in load(filename, model_name)
     14 def load(filename, model_name='default'):
     15     from .serialization import SerializableMixIn
---> 16     return SerializableMixIn.load(file_name=filename, model_name=model_name)
     17
     18

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/serialization.py in load(cls, file_name, model_name)
    265         obj : the de-serialized object
    266         """
--> 267         from .h5file import H5File
    268         with H5File(file_name, model_name=model_name, mode='r') as f:
    269             return f.model

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/h5file.py in <module>
     20 import numpy as np
     21 from io import BytesIO
---> 22 from .pickle_extensions import HDF5PersistentPickler
     23
     24 __author__ = 'marscher'

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/pickle_extensions.py in <module>
     66
     67 # we cache this during runtime
---> 68 _DEFAULT_BLOSC_OPTIONS = _check_blosc_avail()
     69
     70

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/pickle_extensions.py in _check_blosc_avail()
     44     fid, name = tempfile.mkstemp()
     45     try:
---> 46         with h5py.File(name) as h5f:
     47             try:
     48                 h5f.create_dataset('test', shape=(1,1), **blosc_opts)

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    422             with phil:
    423                 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
--> 424                 fid = make_fid(name, mode, userblock_size,
    425                                fapl, fcpl=make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    426                                fs_persist=fs_persist, fs_threshold=fs_threshold),

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    188         if swmr and swmr_support:
    189             flags |= h5f.ACC_SWMR_READ
--> 190         fid = h5f.open(name, flags, fapl=fapl)
    191     elif mode == 'r+':
    192         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)

In the previous Notebook 04 ➜ 📓, we saw that four metastable states are a reasonable choice for our MSM. We, thus, perform PCCA++ with this number of states for further analysis and print out the stationary probabilities of the metastable sets:

[11]:
nstates = 4
msm.pcca(nstates)
for i, s in enumerate(msm.metastable_sets):
    print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-bab2c07a86d4> in <module>
      1 nstates = 4
----> 2 msm.pcca(nstates)
      3 for i, s in enumerate(msm.metastable_sets):
      4     print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))

NameError: name 'msm' is not defined

We visualize the metastable memberships:

[12]:
fig, axes = plt.subplots(1, 4, figsize=(15, 3))
for i, ax in enumerate(axes.flat):
    pyemma.plots.plot_contour(
        *data_concatenated.T,
        msm.metastable_distributions[i][dtrajs_concatenated],
        ax=ax,
        cmap='afmhot_r',
        mask=True,
        cbar_label='metastable distribution {}'.format(i + 1))
    ax.set_xlabel('$\Phi$')
axes[0].set_ylabel('$\Psi$')
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-e1de71a1e17e> in <module>
      3     pyemma.plots.plot_contour(
      4         *data_concatenated.T,
----> 5         msm.metastable_distributions[i][dtrajs_concatenated],
      6         ax=ax,
      7         cmap='afmhot_r',

NameError: name 'msm' is not defined
../../_images/tutorials_notebooks_05-pcca-tpt_23_1.png

PCCA++ nicely separates the high-density regions and we find that each of the basins was assigned a metastable set. This indicates that our projection indeed describes the slow dynamics.

We concatenate all three discrete trajectories and obtain a single trajectory of metastable states which we use to visualize the metastable state memberships of all datapoints. We further compute the state with the highest membership to a PCCA metastable state to plot a state label there.

⚠️ Please remember that the concatenated discrete trajectories (dtrajs) are not meant to be used for MSM estimation (artificial transitions), but only for visualization and indexing purposes!

[13]:
metastable_traj = msm.metastable_assignments[dtrajs_concatenated]
highest_membership = msm.metastable_distributions.argmax(1)
coarse_state_centers = cluster.clustercenters[msm.active_set[highest_membership]]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-d867181188cc> in <module>
----> 1 metastable_traj = msm.metastable_assignments[dtrajs_concatenated]
      2 highest_membership = msm.metastable_distributions.argmax(1)
      3 coarse_state_centers = cluster.clustercenters[msm.active_set[highest_membership]]

NameError: name 'msm' is not defined

Now, we use the mfpt() method of the MSM object to compute MFPTs between pairs of metastable sets and compute the inverse MFPTs for visualization purposes:

[14]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            msm.metastable_sets[i],
            msm.metastable_sets[j])

inverse_mfpt = np.zeros_like(mfpt)
nz = mfpt.nonzero()
inverse_mfpt[nz] = 1.0 / mfpt[nz]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-c2dd79f39af1> in <module>
      2 for i in range(nstates):
      3     for j in range(nstates):
----> 4         mfpt[i, j] = msm.mfpt(
      5             msm.metastable_sets[i],
      6             msm.metastable_sets[j])

NameError: name 'msm' is not defined

We visualize our model in backbone torsion space:

[15]:
fig, ax = plt.subplots(figsize=(10, 7))
_, _, misc = pyemma.plots.plot_state_map(
    *data_concatenated.T, metastable_traj, ax=ax, zorder=-1)
misc['cbar'].set_ticklabels(range(1, nstates + 1))  # set state numbers 1 ... nstates

pyemma.plots.plot_network(
    inverse_mfpt,
    pos=coarse_state_centers,
    figpadding=0,
    arrow_label_format='%.1f ps',
    arrow_labels=mfpt,
    size=12,
    show_frame=True,
    ax=ax)

ax.set_xlabel('$\Phi$')
ax.set_ylabel('$\Psi$')
ax.set_xlim(-np.pi, np.pi)
ax.set_ylim(-np.pi, np.pi)
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-ef63743cdda3> in <module>
      1 fig, ax = plt.subplots(figsize=(10, 7))
      2 _, _, misc = pyemma.plots.plot_state_map(
----> 3     *data_concatenated.T, metastable_traj, ax=ax, zorder=-1)
      4 misc['cbar'].set_ticklabels(range(1, nstates + 1))  # set state numbers 1 ... nstates
      5

NameError: name 'metastable_traj' is not defined
../../_images/tutorials_notebooks_05-pcca-tpt_29_1.png

Have you noticed how well the metastable state coloring agrees with the eigenvector visualization of the three slowest processes?

If we could afford a shorter lag time, we might even be able to resolve more processes and, thus, subdivide the metastable states three and four. We show how to do this with HMMs in Notebook 07 ➜ 📓.

Now we define a small function to visualize samples of metastable states with NGLView.

[16]:
def visualize_metastable(samples, cmap, selection='backbone'):
    """ visualize metastable states
    Parameters
    ----------
    samples: list of mdtraj.Trajectory objects
        each element contains all samples for one metastable state.
    cmap: matplotlib.colors.ListedColormap
        color map used to visualize metastable states before.
    selection: str
        which part of the molecule to selection for visualization. For details have a look here:
        http://mdtraj.org/latest/examples/atom-selection.html#Atom-Selection-Language
    """
    import nglview
    from matplotlib.colors import to_hex

    widget = nglview.NGLWidget()
    widget.clear_representations()
    ref = samples[0]
    for i, s in enumerate(samples):
        s = s.superpose(ref)
        s = s.atom_slice(s.top.select(selection))
        comp = widget.add_trajectory(s)
        comp.add_ball_and_stick()

    # this has to be done in a separate loop for whatever reason...
    x = np.linspace(0, 1, num=len(samples))
    for i, x_ in enumerate(x):
        c = to_hex(cmap(x_))
        widget.update_ball_and_stick(color=c, component=i, repr_index=i)
        widget.remove_cartoon(component=i)
    return widget

We now sample some representative structures and visualize these with the aid of NGLView. For the sake of clarity, we draw only the backbone atoms. Since we have obtained several samples for each metastable state, you can click the play button to iterate over all samples. For each iteration, the samples of all four states will be drawn. You can double click the molecule to show it at full screen. Press escape to go back.

[17]:
cmap = mpl.cm.get_cmap('viridis', nstates)

my_samples = [pyemma.coordinates.save_traj(files, idist, outfile=None, top=pdb)
              for idist in msm.sample_by_distributions(msm.metastable_distributions, 50)]

visualize_metastable(my_samples, cmap, selection='backbone')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-30c329870e1a> in <module>
      2
      3 my_samples = [pyemma.coordinates.save_traj(files, idist, outfile=None, top=pdb)
----> 4               for idist in msm.sample_by_distributions(msm.metastable_distributions, 50)]
      5
      6 visualize_metastable(my_samples, cmap, selection='backbone')

NameError: name 'msm' is not defined

Coming back to TPT, we now have more than two metastable states and can expect more insights from analyzing the transition paths. As an example, we will focus on the committor between metastable sets \(0\) and \(3\) as defined above.

[18]:
A = msm.metastable_sets[0]
B = msm.metastable_sets[3]
flux = pyemma.msm.tpt(msm, A, B)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-b57e598ab757> in <module>
----> 1 A = msm.metastable_sets[0]
      2 B = msm.metastable_sets[3]
      3 flux = pyemma.msm.tpt(msm, A, B)

NameError: name 'msm' is not defined

Before we go on with the visualization, let’s coarse grain the flux with the metastable sets estimated with PCCA++:

[19]:
cg, cgflux = flux.coarse_grain(msm.metastable_sets)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-22fb228be896> in <module>
----> 1 cg, cgflux = flux.coarse_grain(msm.metastable_sets)

NameError: name 'flux' is not defined

We now show an overlay of the committor probabilities and the most likely transition path from the coarse graining TPT:

[20]:
fig, ax = plt.subplots(figsize=(10, 7))

pyemma.plots.plot_contour(
    *data_concatenated.T,
    flux.committor[dtrajs_concatenated],
    cmap='brg',
    ax=ax,
    mask=True,
    cbar_label=r'committor 1 $\to$ 4',
    alpha=0.8,
    zorder=-1);

pyemma.plots.plot_flux(
    cgflux,
    coarse_state_centers,
    cgflux.stationary_distribution,
    state_labels=['A','' ,'', 'B'],
    ax=ax,
    show_committor=False,
    figpadding=0,
    show_frame=True,
    arrow_label_format='%2.e / ps');

ax.set_xlabel('$\Phi$')
ax.set_ylabel('$\Psi$')
ax.set_xlim(-np.pi, np.pi)
ax.set_ylim(-np.pi, np.pi)
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-466f40b58c25> in <module>
      3 pyemma.plots.plot_contour(
      4     *data_concatenated.T,
----> 5     flux.committor[dtrajs_concatenated],
      6     cmap='brg',
      7     ax=ax,

NameError: name 'flux' is not defined
../../_images/tutorials_notebooks_05-pcca-tpt_39_1.png

First, the color map shows us a region with committor probability \(\approx 0.5\). This indicates that this particular metastable state is a transition state in the pathway from \(A\) to \(B\). Second, the plot_flux() function displays the most likely transition pathway along this path. There are other, less likely pathways included in the plot as well. The arrow thickness indicates the flux between the states.

We can decompose the flux into these individual pathways by:

[21]:
paths, path_fluxes = cgflux.pathways(fraction=0.99)
print('percentage       \tpath')
print('-------------------------------------')
for i in range(len(paths)):
    print(np.round(path_fluxes[i] / np.sum(path_fluxes), 3),' \t', paths[i] + 1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-3eb7fb10d318> in <module>
----> 1 paths, path_fluxes = cgflux.pathways(fraction=0.99)
      2 print('percentage       \tpath')
      3 print('-------------------------------------')
      4 for i in range(len(paths)):
      5     print(np.round(path_fluxes[i] / np.sum(path_fluxes), 3),' \t', paths[i] + 1)

NameError: name 'cgflux' is not defined

As expected, about \(85\%\) of the flux goes through only one pathway. To get a cleaner picture, the plot_flux() function supports a minflux keyword argument that can be increased to exclude very low fluxes from the plot.

Exercise 1

Define a featurizer that loads the heavy atom coordinates and load the data into memory. Also load the TICA object from Notebook 04 ➜ 📓 to transform the featurized data. Further, the estimated MSM, Bayesian MSM, and Cluster objects should be loaded from disk.

Solution

[22]:
feat = pyemma.coordinates.featurizer(pdb)
pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(pairs, periodic=False)
data = pyemma.coordinates.load(files, features=feat)

tica = pyemma.load('nb4.pyemma', model_name='ala2tica_tica')
tica_output = tica.transform(data)
tica_concatenated = np.concatenate(tica_output)

msm = pyemma.load('nb4.pyemma', model_name='ala2tica_msm')
bayesian_msm = pyemma.load('nb4.pyemma', model_name='ala2tica_bayesian_msm')

cluster = pyemma.load('nb4.pyemma', model_name='ala2tica_cluster')
dtrajs_concatenated = np.concatenate(cluster.dtrajs)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-81fdd4a9efe8> in <module>
      4 data = pyemma.coordinates.load(files, features=feat)
      5
----> 6 tica = pyemma.load('nb4.pyemma', model_name='ala2tica_tica')
      7 tica_output = tica.transform(data)
      8 tica_concatenated = np.concatenate(tica_output)

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/__init__.py in load(filename, model_name)
     14 def load(filename, model_name='default'):
     15     from .serialization import SerializableMixIn
---> 16     return SerializableMixIn.load(file_name=filename, model_name=model_name)
     17
     18

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/serialization.py in load(cls, file_name, model_name)
    265         obj : the de-serialized object
    266         """
--> 267         from .h5file import H5File
    268         with H5File(file_name, model_name=model_name, mode='r') as f:
    269             return f.model

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/h5file.py in <module>
     20 import numpy as np
     21 from io import BytesIO
---> 22 from .pickle_extensions import HDF5PersistentPickler
     23
     24 __author__ = 'marscher'

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/pickle_extensions.py in <module>
     66
     67 # we cache this during runtime
---> 68 _DEFAULT_BLOSC_OPTIONS = _check_blosc_avail()
     69
     70

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/pyemma/_base/serialization/pickle_extensions.py in _check_blosc_avail()
     44     fid, name = tempfile.mkstemp()
     45     try:
---> 46         with h5py.File(name) as h5f:
     47             try:
     48                 h5f.create_dataset('test', shape=(1,1), **blosc_opts)

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    422             with phil:
    423                 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
--> 424                 fid = make_fid(name, mode, userblock_size,
    425                                fapl, fcpl=make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    426                                fs_persist=fs_persist, fs_threshold=fs_threshold),

/storage/mi/marscher/software/miniconda3/conda-bld/pyemma-doc_1607519340854/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    188         if swmr and swmr_support:
    189             flags |= h5f.ACC_SWMR_READ
--> 190         fid = h5f.open(name, flags, fapl=fapl)
    191     elif mode == 'r+':
    192         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)

Exercise 2

Do a PCCA++ analysis of the MSM with four metastable states, compute the probability of the metastable sets, and visualize the metastable state memberships.

Solution

[23]:
nstates = 4
msm.pcca(nstates)

for i, s in enumerate(msm.metastable_sets):
    print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))

fig, axes = plt.subplots(1, 4, figsize=(15, 3))
for i, ax in enumerate(axes.flat):
    pyemma.plots.plot_contour(
        *tica_concatenated.T,
        msm.metastable_distributions[i][dtrajs_concatenated],
        ax=ax,
        cmap='afmhot_r',
        mask=True,
        cbar_label='metastable distribution {}'.format(i + 1))
    ax.set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-72023ccff5a9> in <module>
      1 nstates = 4
----> 2 msm.pcca(nstates)
      3
      4 for i, s in enumerate(msm.metastable_sets):
      5     print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))

NameError: name 'msm' is not defined

Did you guess the metastable states correctly?

Note the similarities between the MSM built from the backbone torsions and the MSM built from the TICA projection of heavy atom distances. Even though we started from different features, both models found the same kinetic information in the data.

Exercise 3

Compute the pairwise MFPTs and transition rates, and visualize the resulting kinetic network.

Solution

[24]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            msm.metastable_sets[i],
            msm.metastable_sets[j])

inverse_mfpt = np.zeros_like(mfpt)
nz = mfpt.nonzero()
inverse_mfpt[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    inverse_mfpt,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    arrow_label_format='%.1f ps',
    arrow_labels=mfpt,
    arrow_scale=3.0,
    state_labels=range(1, nstates + 1),
    size=12);
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-f3aa2cc0ac6f> in <module>
      2 for i in range(nstates):
      3     for j in range(nstates):
----> 4         mfpt[i, j] = msm.mfpt(
      5             msm.metastable_sets[i],
      6             msm.metastable_sets[j])

NameError: name 'msm' is not defined

Exercise 4

Compute the TPT object, coarse grain it onto the PCCA++ metastable sets and visualize the flux along with the committor probabilities.

Solution

[25]:
A = msm.metastable_sets[0]
B = msm.metastable_sets[3]
flux = pyemma.msm.tpt(msm, A, B)

cg, cgflux = flux.coarse_grain(msm.metastable_sets)

highest_membership = msm.metastable_distributions.argmax(1)
coarse_state_centers = cluster.clustercenters[msm.active_set[highest_membership]]

fig, ax = plt.subplots(figsize=(10, 7))

pyemma.plots.plot_contour(
    *tica_concatenated.T,
    flux.committor[dtrajs_concatenated],
    cmap='brg',
    ax=ax,
    mask=True,
    cbar_label=r'committor 1 $\to$ 4',
    zorder=-1)

pyemma.plots.plot_flux(
    cgflux,
    coarse_state_centers,
    cgflux.stationary_distribution,
    ax=ax,
    show_committor=False,
    figpadding=0.2,
    state_labels=['A', '', '', 'B'],
    arrow_label_format='%2.e / ps')

ax.set_xlabel('IC 1')
ax.set_ylabel('IC 2')
ax.set_aspect('equal')
ax.set_xlim(tica_concatenated[:, 0].min(), tica_concatenated[:, 0].max())
ax.set_ylim(tica_concatenated[:, 1].min(), tica_concatenated[:, 1].max())
fig.tight_layout()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-15e6c46cb690> in <module>
----> 1 A = msm.metastable_sets[0]
      2 B = msm.metastable_sets[3]
      3 flux = pyemma.msm.tpt(msm, A, B)
      4
      5 cg, cgflux = flux.coarse_grain(msm.metastable_sets)

NameError: name 'msm' is not defined

Wrapping up

In this notebook, we have learned how to use PCCA++ using an existing MSM and how to extract kinetic information from the model. In detail, we have used - the pcca() method of an MSM object to find metastable states, - the mfpt() method of an MSM object to compute mean first passage times between metastable states which, in turn, are accessible via - the metastable_sets and metastable_assignments attributes of an MSM object.

For visualizing MSMs or kinetic networks we used - pyemma.plots.plot_density(), pyemma.plots.plot_contour() and - pyemma.plots.plot_network().

References

[][]Susanna Röblitz and Marcus Weber. 2013. Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification. URL

[^]Weinan E. and Eric Vanden-Eijnden. 2006. Towards a Theory of Transition Paths. URL

[^]Philipp Metzner and Christof Schütte and Eric Vanden-Eijnden. 2009. Transition Path Theory for Markov Jump Processes. URL