Protein-ligand binding: Hidden Markov model analysis of Benzamidine-Trypsin binding
===================================================================================

In this notebook we will use PyEMMA to analyze the binding kinetics of
Benzamidine to Trypsin using 10 microseconds of trajectory data
generated with ACEMD. The simulation setup can be found in [1]. The
simulation setup and Benzamidine parameters were provided by Gianni De
Fabritiis - thanks a lot, Gianni!

Our analysis will be based on Hidden Markov models as a coarse-grained
Markov model variant - see [2] for details.

.. code:: python

    import pyemma
    pyemma.__version__


.. parsed-literal::

    '1.2.2-296-g37f5877-dirty'


This notebook has been tested for version 2.0. If you are using a
different version some adaptations may be required. Now we import a few
general packages: the pylab macro includes numpy, matplotlib and makes
sure all plots are shown inside the notebook.

.. code:: python

    import os
    %pylab inline
    matplotlib.rcParams.update({'font.size': 14})


.. parsed-literal::

    Populating the interactive namespace from numpy and matplotlib


Now we import the PyEMMA modules required later.

.. code:: python

    import pyemma.coordinates as coor
    import pyemma.msm as msm
    import pyemma.plots as mplt
    from pyemma import config

This little helper function is used for saving figures. We can set the
target directy and switch off saving by setting do\_save=False here

.. code:: python

    def save_figure(name):
        # change these if wanted
        do_save = True
        fig_dir = './figs/'
        if do_save:
            savefig(fig_dir + name, bbox_inches='tight')

Trypsin/Benzamidine - load data
-------------------------------

We first define the pdb file and the xtc trajectories used as input
files

.. code:: python

    indir = './data'
    topfile =  indir+'/tryp_ben_protein.pdb'
    traj_list = []
    for filename in os.listdir(indir):
        if filename.endswith('-200ns.xtc'):
            traj_list.append(os.path.join(indir,filename))

Now we decide which coordinates to be used in the further analysis. In
this example we will use nearest-neighbor heavy-atom contacts between
the Benzamidine residue and each other residue. If the nearest-neighbor
distance between Benzamidine and a given other residue is smaller than
0.5 nm, the contact is set 1, otherwise 0. This results in a contact
vector with 233 elements, i.e. each trajectory frame is now represented
by a 233-dimensional binary feature vector.

.. code:: python

    feat = coor.featurizer(topfile)
    n_res = 223 # number of residues
    ben_ind = 225 # benzamidine index (with residues starting at 1)
    ind_arr = np.zeros((n_res,2))
    for i in range(n_res):
        ind_arr[i][0] = ben_ind-1
        ind_arr[i][1] = i
    feat.add_residue_mindist(residue_pairs=ind_arr, scheme='closest-heavy', threshold=0.5)

Let's look at the first ten entries of the feature list. The feature is
formally a distance feature, but you can see that we have selected
contacts because of the 'closest-heavy' tag

.. code:: python

    feat.describe()[:10]


.. parsed-literal::

    ['RES_DIST (closest-heavy) BEN225 - ILE1',
     'RES_DIST (closest-heavy) BEN225 - VAL2',
     'RES_DIST (closest-heavy) BEN225 - GLY3',
     'RES_DIST (closest-heavy) BEN225 - GLY4',
     'RES_DIST (closest-heavy) BEN225 - TYR5',
     'RES_DIST (closest-heavy) BEN225 - THR6',
     'RES_DIST (closest-heavy) BEN225 - CYS7',
     'RES_DIST (closest-heavy) BEN225 - GLY8',
     'RES_DIST (closest-heavy) BEN225 - ALA9',
     'RES_DIST (closest-heavy) BEN225 - ASN10']


Since out feature set is reasonably low-dimensional we can just load all
data into memory which makes some of the analysis quicker. Note that
instead of defining a coordinate source and then getting its output, we
could just directly say coor.load. But we will need the coordinate
source later, so we define it here.

Note that we can also work on disk (streaming mode) in order to handle
large trajectory data - see BPTI notebook for examples. Let's also have
a look on some general statistics.

.. code:: python

    # short version: X = coor.load(traj_list, feat)
    inp = coor.source(traj_list, feat)
    X = inp.get_output()
    print 'number of trajectories = ', len(X)
    print 'trajectory length = ', X[0].shape[0]
    print 'total data = ', len(X)*X[0].shape[0],'ns'
    print 'number of dimension = ', X[0].shape[1]


.. parsed-literal::

    number of trajectories =  50
    trajectory length =  200
    total data =  10000 ns
    number of dimension =  223


TICA and clustering
-------------------

Now we reduce the dimension of the input coordinates to the subspace of
slow coordinates using TICA [3]. By default, TICA will choose a number
of output dimensions to cover 95% of the kinetic variance and scale the
output to produce a kinetic map [4].

.. code:: python

    tica_lag = 10  # tica lagtime
    tica_obj = coor.tica(X, lag=tica_lag)
    Y = tica_obj.get_output()  # get tica coordinates

In this case we retain 59 dimensions, which is a lot but note that they
are scaled by eigenvalue, so it's mostly the first dimensions that
contribute.

.. code:: python

    print 'Retained dimension: ', tica_obj.dimension()
    plot(tica_obj.cumvar, linewidth=2)
    plot([tica_obj.dimension(), tica_obj.dimension()], [0, 1], color='black', linewidth=2)
    plot([0, 200], [0.95, 0.95], color='black', linewidth=2)
    xlabel('Number of dimensions'); ylabel('Cum. kinetic variance fraction')


.. parsed-literal::

    Retained dimension:  59


.. parsed-literal::

    <matplotlib.text.Text at 0x10accccd0>


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_20_2.png


Let's have a look on a free energy plot onto the first two coordinates.
The projection onto coordinates is a very superficial view of the
high-dimensional space, but at least we can see that there are distinct
free energy basins, indicating metastability.

.. code:: python

    mplt.plot_free_energy(np.vstack(Y)[:, 0], np.vstack(Y)[:, 1])
    xlabel('independent component 1'); ylabel('independent component 2')
    save_figure('free_energy.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_22_0.png


Clustering
~~~~~~~~~~

We use sqrt(N) clusters for k-means, where N are the total number of
samples available. We fix the seed in order to make sure that the
clustering results are reproducible with this notebook. Usually there is
no reason to do that.

.. code:: python

    n_clusters = 100
    clustering = coor.cluster_kmeans(Y, k=n_clusters, max_iter=100, tolerance=1e-10, fixed_seed=True)
    dtrajs = clustering.dtrajs  # get discrete trajectories


.. parsed-literal::

    2015-08-27 20:54:31,382 coordinates.clustering.KmeansClustering[3] INFO     Cluster centers converged after 7 steps.


Now we would like to characterize the clusters. The following miniscript
calculates how many residues contacts Benzamidine has one average, by
cluster.

.. code:: python

    Dall = np.concatenate(dtrajs)
    Xall = np.vstack(X)
    distogram = np.zeros(n_clusters)
    for i in range(n_clusters):
        Xsub = Xall[np.where(Dall==i)[0]]
        distogram[i] = Xsub.sum() / Xsub.shape[0]

Plotting a scatterplot of the number of contacts over the number of
frames in each cluster, we see the following picture: we have one state
with 0 contacts and many counts. This is good, because it means we have
resolved the unbound state in a single cluster. There are also bound
states with up to 15 contacts, where Benzamidin is completely embedded
in the protein. We still have to see whether the crystallographic
complex is amongst these states, but at least we can say something about
the association / dissocation process based on these data.

.. code:: python

    histogram = np.bincount(np.concatenate(dtrajs), minlength=len(clustering.clustercenters));
    scatter(histogram, distogram)
    semilogx(); xlabel('number of frames'); ylabel('number of contacts')


.. parsed-literal::

    <matplotlib.text.Text at 0x109adddd0>


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_28_1.png


We get the cluster index of the unbound state, for later reference

.. code:: python

    i_micro_unbound = np.argmin(distogram)
    print 'Microstate of unbound = ', i_micro_unbound


.. parsed-literal::

    Microstate of unbound =  1


Coarse-grained kinetic Model using a Hidden Markov Model
--------------------------------------------------------

Now we estimate Bayesian Hidden Markov models (BHMMs) at a series of lag
times. BHMMs are initialized from Markov state models and
maximum-likelihood estimated as described in [2]. The statistical
uncertainties are obtained by sampling the HMMs as described in [6].

We here choose five states because that resolves interesting different
conformations. However, we get a number of warnings below because there
is no pronounced timescale gap between five and six states, so the
application of HMMs with five states is not completely justified. Since
we get no problems in terms of convergence of timescales etc., we'll
ignore this problem for now, but take care in general. There is a
timescale gap in this data with two or three states, but then the
analysis becomes pretty boring because we won't resolve interesting
conformational substates.

This will take a few minutes, because some HMM maximum likelihood
estimations take quite long to converge. The Bayesian sampling step is
fast, so we can easily afford 250 samples to get somewhat smoother error
bar estimates (which is above the default).

.. code:: python

    nstates = 5
    lags=[1,2,3,4,5,7,10]
    its = msm.timescales_hmsm(dtrajs, nstates, lags=lags, errors='bayes', nsamples=250)


.. parsed-literal::

    2015-08-27 20:54:32,362 MaximumLikelihoodHMSM[0x105e1f7d0] WARNING  Requested coarse-grained model with 5 metastable states at lag=1.The ratio of relaxation timescales between 5 and 6 states is only 1.04257443495 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 20:54:40,061 MaximumLikelihoodHMSM[0x1088cd6d0] WARNING  Requested coarse-grained model with 5 metastable states at lag=2.The ratio of relaxation timescales between 5 and 6 states is only 1.09491033362 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    /Users/noe/data/software_projects/msmtools/msmtools/analysis/dense/pcca.py:294: ComplexWarning: Casting complex values to real discards the imaginary part
      evecs[:, i] /= math.sqrt(np.dot(evecs[:, i] * pi, evecs[:, i]))
    2015-08-27 20:58:21,421 MaximumLikelihoodHMSM[0x10889e850] WARNING  Requested coarse-grained model with 5 metastable states at lag=3.The ratio of relaxation timescales between 5 and 6 states is only 1.16445416799 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    /Users/noe/data/software_projects/msmtools/msmtools/estimation/dense/transition_matrix.py:199: NotConvergedWarning: Reversible transition matrix estimation didn't converge.
      warnings.warn("Reversible transition matrix estimation didn't converge.", msmtools.util.exceptions.NotConvergedWarning)
    2015-08-27 20:59:07,231 MaximumLikelihoodHMSM[0x10aace710] WARNING  Requested coarse-grained model with 5 metastable states at lag=4.The ratio of relaxation timescales between 5 and 6 states is only 1.21391934977 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 21:00:06,450 MaximumLikelihoodHMSM[0x109a05490] WARNING  Requested coarse-grained model with 5 metastable states at lag=5.The ratio of relaxation timescales between 5 and 6 states is only 1.13144623594 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 21:00:16,225 MaximumLikelihoodHMSM[0x10909cc90] WARNING  Requested coarse-grained model with 5 metastable states at lag=7.The ratio of relaxation timescales between 5 and 6 states is only 1.0976875936 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 21:00:40,678 MaximumLikelihoodHMSM[0x1087ad2d0] WARNING  Requested coarse-grained model with 5 metastable states at lag=10.The ratio of relaxation timescales between 5 and 6 states is only 1.27766830627 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.


.. code:: python

    mplt.plot_implied_timescales(its, show_mean=False, units='ns', linewidth=2)
    ylim(10, 1000000)
    save_figure('its.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_33_0.png


The timescales are largely converged for small lag times (much smaller
than those an MSM would converge for), but diverge for lag times times
that are on the order of the fastest process. That is a typical
observation for HMMs. Here, we choose a lag time of 2 ns because the
first three lag times are stable for lag times 2 ns to 5 ns.

Trypsin-Benzamidin thermodynamics and kinetics
----------------------------------------------

Now we will compute the binding affinity and the binding/dissociation
rates from the model and compare it to experimental values. We define a
few helper functions to convert the simulation quantities into
experimental (Molar) quantities.

.. code:: python

    from pyemma.util.statistics import confidence_interval

.. code:: python

    def index_unbound_bound(bhmm):
        i_bound = np.argmax(bhmm.stationary_distribution)
        i_unbound = np.argmax(bhmm.observation_probabilities[:, i_micro_unbound])
        return i_unbound, i_bound
    
    def pi2dG(pi, i_unbound, i_bound):
        delta_g = -0.6 * np.log(pi[i_bound]/pi[i_unbound])  # dG in kcal/mol
        delta_g -= 3.1  # volume correction to standard binding free energy
        return delta_g
    
    def binding_free_energy(bhmm, conf=0.95):
        i_unbound, i_bound = index_unbound_bound(bhmm)
        # MLE
        pi_mle = bhmm.stationary_distribution
        dG_mle = pi2dG(pi_mle, i_unbound, i_bound)
        # samples
        pi_samples = bhmm.sample_f('stationary_distribution')
        dG_samples = [pi2dG(pi_sample, i_unbound, i_bound) for pi_sample in pi_samples]
        l, r = confidence_interval(dG_samples, conf=conf)
        return dG_mle, l, r
    
    def mfpt2kon(mfpt):
        mfpt *= 1e-9  # in seconds
        # volume fraction
        Nsim = 10604.0  # number of water molecules in our simulation
        Nstd = 55.55  # number of water molecules in standard volume
        concentration = Nstd / Nsim
        return 1./(mfpt*concentration)
    
    def binding_rate(bhmm, conf=0.95):
        i_unbound, i_bound = index_unbound_bound(bhmm)
        # MLE
        mfpt_mle = bhmm.mfpt(i_unbound, i_bound)
        kon = mfpt2kon(mfpt_mle)
        # samples
        mfpt_samples = bhmm.sample_f('mfpt', i_unbound, i_bound)
        kon_samples = [mfpt2kon(mfpt_sample) for mfpt_sample in mfpt_samples]
        l, r = confidence_interval(kon_samples, conf=conf)
        return kon, l, r
    
    def mfpt2koff(mfpt):
        mfpt *= 1e-9  # in seconds
        k_off = 1./mfpt
        return k_off
    
    def unbinding_rate(bhmm, conf=0.95):
        i_unbound, i_bound = index_unbound_bound(bhmm)
        # MLE
        mfpt_mle = bhmm.mfpt(i_bound, i_unbound)
        koff = mfpt2koff(mfpt_mle)
        # samples
        mfpt_samples = bhmm.sample_f('mfpt', i_bound, i_unbound)
        koff_samples = [mfpt2koff(mfpt_sample) for mfpt_sample in mfpt_samples]
        l, r = confidence_interval(koff_samples, conf=conf)
        return koff, l, r

.. code:: python

    dG_stats = np.array([binding_free_energy(M) for M in its.models])
    kon_stats = np.array([binding_rate(M) for M in its.models])
    koff_stats = np.array([unbinding_rate(M) for M in its.models])

We start by computing the binding free energy in kcal/mol. For this, we
compute the probability ratio of bound and unbound states in the
simulation and convert to experimental standard conditions by taking the
simulation box size into account. The results at lag=2 agree with two
experimental references within statistical error

.. code:: python

    matplotlib.rcParams.update({'font.size': 14})
    figure(figsize=(5,4))
    # plot binding affinities
    plot(its.lags, dG_stats[:, 0], linewidth=2, marker='o', color='black', label='HMM')
    # compute and plot conf
    fill_between(lags, dG_stats[:, 1], dG_stats[:, 2], alpha=0.2, color='black')
    # experimental reference: 6.4 kcal/mold from Talhout R, Engberts JBFN. Eur J Biochem. 2001;268(6):1554–1560
    plot(its.lags, -6.4*np.ones(len(its.lags)), linewidth=2, linestyle='dotted', color='black', label='Talhout et al.')
    # experimental reference: 7.3 kcal/mold from 66. Katz BA, Elrod K, Luong C, Rice MJ, Mackman RL, Sprengeler PA, Spencer J, Hataye J, Janc J, Link J, Litvak J, Rai R, Rice K, Sideris S, Verner E, Young W. Journal of Molecular Biology. 2001;307(5):1451–1486.
    plot(its.lags, -7.3*np.ones(len(its.lags)), linewidth=2, linestyle='dashed', color='black', label='Katz et al.')
    xlim(1, 10); xlabel('lag time / ns')
    ylim(-15, 0); ylabel('binding free energy / kcal/mol')
    legend(loc=2, fontsize=14)
    save_figure('dG.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_40_0.png


Similar procedure for the association rate, just that we use the mean
first passage time from unbound to bound as a simulation quantity. The
results at lag=2 agree with an experimental reference within statistical
error

.. code:: python

    matplotlib.rcParams.update({'font.size': 14})
    figure(figsize(5,4))
    # plot binding rates
    plot(lags, kon_stats[:, 0], linewidth=2, marker='o', color='black', label='HMM')
    # compute and plot conf
    fill_between(lags, kon_stats[:, 1], kon_stats[:, 2], alpha=0.2, color='black')
    # experimental values from Guillain, F. & Thusius, D. Use of proflavine as an indicator in temperature- jump studies of the binding of a competitive inhibitor to trypsin. J. Am. Chem. Soc. 92, 5534–5536 (1970).
    # experimental k_on:  2.9x10**7 mol-1 sec-1'
    plot(its.lags, 2.9*1e7*np.ones(len(its.lags)), linewidth=2, linestyle='dashed', color='black', label='Guillain et al.')
    legend(loc=3, fontsize=14)
    xlim(1, 10); xlabel('lag time / ns')
    ylim(1e5, 1e9); ylabel('binding rate / s$^{-1}$'); semilogy()
    save_figure('k_on.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_42_0.png


Similar procedure for the dissociation rate, here we can directly use
the mean first passage time from bound to unbound. The results at lag=2
agree with an experimental reference within statistical error

.. code:: python

    matplotlib.rcParams.update({'font.size': 14})
    figure(figsize(5,4))
    # plot binding rates
    plot(lags, koff_stats[:, 0], linewidth=2, marker='o', color='black', label='HMM')
    # compute and plot conf
    fill_between(lags, koff_stats[:, 1], koff_stats[:, 2], alpha=0.2, color='black')
    # experimental
    #print 'k_off experimental = 6*10**2 sec-1'
    plot(its.lags, 600.0*np.ones(len(its.lags)), linewidth=2, linestyle='dashed', color='black', label='Guillain et al.')
    xlim(1, 10); xlabel('lag time / ns')
    ylim(1e-2, 1e7); ylabel('dissociation rate / s$^{-1}$')
    legend(loc=2, fontsize=14)
    semilogy()
    save_figure('k_off.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_44_0.png


Structures and binding mechanism
--------------------------------

Now we want to have a closer look on the five-state model and inspect
the structures and binding pathways. We could estimate the HMM at lag=2
using msm.estimate\_hidden\_markov\_model(dtrajs, 5, 2), but that's not
even needed because we have estimated that model already as the second
model of our implied timescales scan:

.. code:: python

    bhmm = its.models[1]
    print bhmm


.. parsed-literal::

    BayesianHMSM(conf=0.95, connectivity='largest', dt_traj='1 step',
           init_hmsm=None, lag=2, nsamples=100, nstates=5, observe_active=True,
           prior='mixed', reversible=True, stride='effective')


We want to test this model a bit more in detail by running a generalized
Chapman-Kolmogorov Test. We compute the transition probability from any
metastable state to any other metastable state for several multiples of
the estimation lag time, and check if the model predictions agree with
actual estimations conducted at these lag times.

.. code:: python

    ck = bhmm.cktest(mlags=5, err_est=True)


.. parsed-literal::

    2015-08-27 21:00:56,981 MaximumLikelihoodHMSM[0x10aab4490] WARNING  Requested coarse-grained model with 5 metastable states at lag=2.The ratio of relaxation timescales between 5 and 6 states is only 1.09491033362 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 21:04:35,539 MaximumLikelihoodHMSM[0x10b0408d0] WARNING  Requested coarse-grained model with 5 metastable states at lag=4.The ratio of relaxation timescales between 5 and 6 states is only 1.21391934977 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 21:05:34,325 MaximumLikelihoodHMSM[0x10a7c8050] WARNING  Requested coarse-grained model with 5 metastable states at lag=6.The ratio of relaxation timescales between 5 and 6 states is only 1.06273469673 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.
    2015-08-27 21:05:47,455 MaximumLikelihoodHMSM[0x10a725a50] WARNING  Requested coarse-grained model with 5 metastable states at lag=8.The ratio of relaxation timescales between 5 and 6 states is only 1.0922661522 while we recommend at least 2.  It is possible that the resulting HMM is inaccurate. Handle with caution.


Now we plot the results. We have an agreement for most states for lag
times below 6 ns, but a few state transitions disagree at lag times of 6
ns or longer (e.g. transitions 2->3 and 3->2). Remember that in the
implied timescales plot above we have seen that the estimates diverge
for lag times of 6 ns or longer, so the long-lag estimates are probably
not very reliable because one or several of the HMM assumptions break
down. Overall, all state transitions agree for two or three multiples of
the estimation lag time and for more than that for a subset of the state
transitions. We will accept this model and analyze it in more detail.

.. code:: python

    fig, axes = mplt.plot_cktest(ck, figsize=(15, 15), y01=False, padding_top=0.05, padding_between=0.3)
    for i in range(5):
        for j in range(5):
            if i==j:
                axes[i, i].set_ylim(0.9, 1.0)
            else:
                axes[i, j].set_ylim(0.0, 0.1)


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_50_0.png


In order to make a nice illustration of the transition network, let's
find out which HMM state is the unbound one (we define it by the
metastable state containing the dissociated cluster, i.e. the cluster
without Trypsin-Benzamidine contacts), and bound one (defined by the
state with highest probability):

.. code:: python

    i_meta_unbound = np.argmax(bhmm.metastable_memberships[i_micro_unbound])
    i_meta_bound = np.argmax(bhmm.stationary_distribution)
    print 'Unbound state:', i_meta_unbound
    print 'Bound state:', i_meta_bound


.. parsed-literal::

    Unbound state: 0
    Bound state: 4


.. parsed-literal::

    /Users/noe/data/software_projects/pyemma/pyemma/msm/models/hmsm.py:342: RuntimeWarning: divide by zero encountered in divide
      M = _np.dot(A, _np.diag(1.0/self.stationary_distribution_obs)).T


Next, we fix positions and colors for a transition network plot. We
don't have to do that and could let the plotting algorithm do the
placement automatically, but we want to order the plot in a certain way,
unbound on top, associated states on the bottom and the
highest-probability (bound) state on the right.

We also tweak the display sizes of states a little bit, because
otherwise the non-bound states are almost invisibly small.

.. code:: python

    pos=np.array([[3,3],[4.25,0],[0,1],[1.75,0],[6,1.0]])
    state_colors = ['green', 'blue', 'yellow', 'cyan', 'purple']
    state_sizes = bhmm.stationary_distribution**(0.25)
    print 'probabilities', bhmm.stationary_distribution
    print 'plotting sizes', state_sizes


.. parsed-literal::

    probabilities [  1.82454721e-03   4.34772866e-04   3.98493974e-04   8.20638457e-04
       9.96521547e-01]
    plotting sizes [ 0.2066754   0.14439947  0.14128805  0.16925362  0.99912925]


.. code:: python

    fig, pos = mplt.plot_markov_model(bhmm, pos=pos, state_sizes=state_sizes, state_colors=state_colors)
    save_figure('network.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_55_0.png


Finally, we conduct transition path theory in the MSM formulation. Since
the HMM is at the same time an MSM amongst its metastable states, we can
just throw the HMM object into the TPT function. We have to define two
end-states of the transition, and here we just the unbound and bound
states.

.. code:: python

    tpt = msm.tpt(bhmm, [i_meta_unbound], [i_meta_bound])

And we will visualize the flux. For visual clarity, we reuse the same
state positioning as defined above.

.. code:: python

    mplt.plot_flux(tpt, pos=pos, state_sizes=state_sizes, state_colors=state_colors, show_committor=False)
    save_figure('network_flux.png')


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_59_0.png


Now we write out sample structures of the five metastable states into MD
trajectory files. We will use these to generate a nice visualization
with VMD or whatever viewer you prefer.

.. code:: python

    meta_samples = bhmm.sample_by_observation_probabilities(100)

.. code:: python

    outfiles = ['./figs/hmm1_100samples.xtc', 
                './figs/hmm2_100samples.xtc', 
                './figs/hmm3_100samples.xtc', 
                './figs/hmm4_100samples.xtc', 
                './figs/hmm5_100samples.xtc']

.. code:: python

    coor.save_trajs(inp, meta_samples, outfiles=outfiles);


.. parsed-literal::

    2015-08-27 21:06:23,183 coordinates.api INFO     Created file ./figs/hmm1_100samples.xtc
    2015-08-27 21:06:30,248 coordinates.api INFO     Created file ./figs/hmm2_100samples.xtc
    2015-08-27 21:06:34,717 coordinates.api INFO     Created file ./figs/hmm3_100samples.xtc
    2015-08-27 21:06:40,632 coordinates.api INFO     Created file ./figs/hmm4_100samples.xtc
    2015-08-27 21:06:44,900 coordinates.api INFO     Created file ./figs/hmm5_100samples.xtc


This figure was created with VMD. Since the protein is not moving much,
we just plot the crystal structure pdb once (grey), and then plot the
Benzamidin positions by color of state. A equally strided subset of all
structures is shown in order to avoid overloading the picture, but the
structures below are representative and were not manually selected in
any way.

You can see that the green state is indeed mostly dissociated, although
some structures are touching the protein without interacting with it in
any particularly strong way. We find three metastable states (yellow,
cyan and blue) where the Benzamidine associates to protein residues
(e.g. by forming hydrogen bonds to ASP residues), but these are not the
(crystallographic) bound complex. The purple state which is the state
with maximum probability indeed corresponds to the crystallographic
bound state, and has a clear and salt bridge formed between Benzamidine
and ASP170.

Looking at the network and flux plots above suggest a clear mechanism:
The dissociated (green) state can bind to any of the associated or bound
states directly. However, there is also an indirect mechanism. The
dissociated (green) state can bind into a misbound state (yellow), from
which it most likely dissociated again or crawls into the cyan state.
From there it can progress into the blue and then into the bound state.

.. code:: python

    from IPython.display import Image
    Image(filename='./figs/hmm_allstates.png') 


.. image:: trypsin_benzamidine_hmm_files/trypsin_benzamidine_hmm_65_0.png


**References**

[1] Buch, I., T. Giorgino and G. De Fabritiis: Complete reconstruction
of an enzyme-inhibitor binding process by molecular dynamics
simulations. PNAS 108, 10184–10189 (2011)

[2] Noe, F., H. Wu, J.-H. Prinz and N. Plattner: Projected and Hidden
Markov Models for calculating kinetics and metastable states of complex
molecules. J. Chem. Phys. 139, 184114 (2013)

[3] Pérez-Hernández G., F. Paul, T. Giorgino, G. De Fabritiis and F.
Noé. J. Chem. Phys. 139, 015102 (2013); Schwantes, C. R. and V. S.
Pande, J. Chem. Theory Comput. 9, 2000 (2013); Molgedey, L. and H. G.
Schuster, Phys. Rev. Lett. 72, 3634 (1994).

[4] Noe, F., C. Clementi: Kinetic distance and kinetic maps from
molecular dynamics simulation. http://arxiv.org/abs/1506.06259

[5] E, W. and E. Vanden-Eijnden. J. Stat. Phys. 123, 503-523 (2006);
Metzner, P. and C. Schütte and E. Vanden-Eijnden. Multiscale Model.
Simul. 7 (1192-1219) 2009; Noé, F., Ch. Schütte, E. Vanden-Eijnden, L.
Reich and T. Weikl. Proc. Natl. Acad. Sci. USA 106, 19011-19016 (2009)

[6] Chodera, J. D., P. Elms, F. Noé, B. Keller, C. M. Kaiser, A.
Ewall-Wice, S. Marqusee, C. Bustamante, N. Singhal Hinrichs: Bayesian
hidden Markov model analysis of single-molecule force spectroscopy:
Characterizing kinetics under measurement uncertainty.
http://arxiv.org/abs/1108.1430.