Markov state model for pentapeptide
===================================

First we import pyemma and check what version we are using.

.. code:: python

    import pyemma
    pyemma.__version__


.. parsed-literal::

    '1.2.2'


This notebook has been tested for version 1.2.2. If you are using a
different version some adaptations may be required.

Now we import a few general packages, including basic numerics and
algebra routines (numpy) and plotting routines (matplotlib), and makes
sure that all plots are shown inside the notebook rather than in a
separate window (nicer that way).

.. code:: python

    import os
    %pylab inline
    matplotlib.rcParams.update({'font.size': 12})


.. parsed-literal::

    Populating the interactive namespace from numpy and matplotlib


Now we import the PyEMMA modules required for the following steps.

.. code:: python

    import pyemma.coordinates as coor
    import pyemma.msm as msm
    import pyemma.plots as mpl

Load pentapeptide coordinates and select features
-------------------------------------------------

We first have to load the PDB file and the trajectory data, in this case
for WW-pentapeptide.

.. code:: python

    indir = './data'
    topfile =  indir+'/init-ww-penta.pdb'
    traj_list = []
    for filename in os.listdir(indir):
        if filename.endswith('-protein.xtc'):
            traj_list.append(os.path.join(indir,filename))

We can decide here which features we would like to use in the further
analysis. In this case backbone torsions. As we want to do TICA on those
coordinates, which requires subtracting the mean from each feature, we
cannot use angles directly but have to transform them into a space where
an arithmetic mean can be computed. We are using the cos/sin transform
to do this, specified by the *cossin* option.

.. code:: python

    feat = coor.featurizer(topfile)
    feat.add_backbone_torsions(cossin=True)
    # describe the features
    feat.describe()


.. parsed-literal::

    ['COS(PHI 0 LEU 2 )',
     'SIN(PHI 0 LEU 2 )',
     'COS(PHI 0 ALA 3 )',
     'SIN(PHI 0 ALA 3 )',
     'COS(PHI 0 LEU 4 )',
     'SIN(PHI 0 LEU 4 )',
     'COS(PHI 0 LEU 5 )',
     'SIN(PHI 0 LEU 5 )',
     'COS(PHI 0 TRP 1 )',
     'SIN(PHI 0 TRP 1 )',
     'COS(PHI 0 LEU 2 )',
     'SIN(PHI 0 LEU 2 )',
     'COS(PHI 0 ALA 3 )',
     'SIN(PHI 0 ALA 3 )',
     'COS(PHI 0 LEU 4 )',
     'SIN(PHI 0 LEU 4 )']


Now we define the source of input coordinates (we don't load them into
memory at this stage - they will be loaded as needed). Compute a few
basic data statistics gives:

.. code:: python

    inp = coor.source(traj_list, feat)
    print 'number of trajectories = ',inp.number_of_trajectories()
    print 'trajectory length = ',inp.trajectory_length(0)
    print 'number of dimension = ',inp.dimension()


.. parsed-literal::

    number of trajectories =  25
    trajectory length =  5001
    number of dimension =  16


TICA and clustering
===================

For TICA we have to choose a *lag* time and we have to define the output
dimension. This can be either set by the *dim* keyword, or by specify a
percentage the kinetic variance we want to keep. Here we choose 90%,
which gives us three dimensions. From the original 16-dimensional space,
most of the relevant kinetic information is in a three-dimensional
subspace.

.. code:: python

    tica_obj = coor.tica(inp, lag=20, var_cutoff=0.9)
    print 'TICA dimension ', tica_obj.dimension()


.. parsed-literal::

    2015-07-22 15:57:43,005 coordinates.transform.TICA[1] INFO     reset (previous) parametrization state, since data producer has been changed.
    2015-07-22 15:57:43,006 coordinates.transform.TICA[1] INFO     Running TICA with tau=20; Estimating two covariance matrices with dimension (16, 16)


.. parsed-literal::

    TICA dimension  3


We can have a look at the cumulative kinetic variance, which is similar
to the cumulative variance in PCA. The first dimension explaines 38% of
the variance in the data, the first two explain 75% and the first three
explain nearly 95%.

.. code:: python

    tica_obj.cumvar


.. parsed-literal::

    array([ 0.38527381,  0.75235544,  0.94326229,  0.9926726 ,  0.99931903,
            0.99950935,  0.9996539 ,  0.99974701,  0.9998262 ,  0.9998814 ,
            0.99992397,  0.99996192,  0.99997767,  0.99998953,  0.99999636,  1.        ])


.. code:: python

    # here we do a little trick to ensure that eigenvectors always have the same sign structure. 
    # That's irrelevant to the analysis and just nicer plots - you can ignore it.
    for i in range(2):
        if tica_obj.eigenvectors[0, i] > 0: 
            tica_obj.eigenvectors[:, i] *= -1

Now we get the TICA output, i.e. the coordinates after being transformed
to the three slowest components. You can think of this as a
low-dimensional space of good reaction coordinates. Having a look at the
shape of the output reveals that we still have 25 trajectories, each of
length 5001, but now only three dimensions.

.. code:: python

    Y = tica_obj.get_output() # get tica coordinates
    print 'number of trajectories = ', np.shape(Y)[0]
    print 'number of frames = ', np.shape(Y)[1]
    print 'number of dimensions = ',np.shape(Y)[2]


.. parsed-literal::

    number of trajectories =  25
    number of frames =  5001
    number of dimensions =  3


Note that at this point we loaded the compressed coordinates into
memory. We don't have to do this, but it will significantly speed up any
further analysis. It is also easy because it's low-dimensional. In
general, after the TICA-transformation we can often keep the data in
memory even if we are working with massive data of a large protein.

Now we look at the distribution on the two dominant TICA coordinates
(three are hard to visualize). For that, we build a histogram of the
first two TICA dimensions and then compute a free energy by taking
:math:`F_i = -\ln z_i`, where :math:`z_i` is the number of bin counts.

.. code:: python

    def plot_labels():
        text(-0.3, -8.4, '1', fontsize=20, color='black')
        text(0.1, -6.8, '2', fontsize=20, color='black')
        text(6.7, 0.5, '3', fontsize=20, color='black')
        text(-0.3, 0, '4', fontsize=20, color='white')    

.. code:: python

    # stack all trajectories
    all_data = np.vstack(Y)
    # histogram data
    z,x,y = np.histogram2d(all_data[:,0], all_data[:,1], bins=100)
    # compute free energies
    F = -np.log(z)
    # do a contour plot
    extent = [x[0], x[-1], y[0], y[-1]]
    ax1 = contourf(F.T, 100, extent=extent)
    cbar = plt.colorbar()
    cbar.ax.set_ylabel('Free energy')
    plot_labels()


.. parsed-literal::

    /storage/mi/nplattner/miniconda/envs/PyE_1.2.2/lib/python2.7/site-packages/IPython/kernel/__main__.py:6: RuntimeWarning: divide by zero encountered in log


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_24_1.png


Let's have a look how one of the trajectories looks like in the space of
the first three TICA components. We can see that the TICA components
nicely resolve the slow transitions as discrete jumps.

.. code:: python

    figure(figsize(10,6))
    ax1=plt.subplot(311)
    plot(Y[0][:,0]); ylabel('IC 1'); xticks([]); #yticks([0,2,4,6])
    ax1=plt.subplot(312)
    plot(Y[0][:,1]); ylabel('IC 2'); xticks([]); #yticks([-6, -4, -2, 0])
    ax1=plt.subplot(313)
    plot(Y[0][:,2]); xlabel('time / 50 ps'); ylabel('IC 3'); #yticks([-2,0,2,4])


.. parsed-literal::

    <matplotlib.text.Text at 0x7f066a00e350>


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_26_1.png


The TICA coordinates are now clustered into a number of discrete states
using the k-means algorithm. The k-means algorithm requires as input the
number of clusters *n\_clusters*. For the metric there is only one
choice possible here which is *euclidean*.

.. code:: python

    n_clusters = 100      # number of k-means clusters

.. code:: python

    clustering = coor.cluster_kmeans(Y,k=n_clusters)


.. parsed-literal::

    2015-07-22 15:57:52,818 coordinates.clustering.KmeansClustering[2] INFO     reset (previous) parametrization state, since data producer has been changed.
    2015-07-22 15:57:53,853 coordinates.clustering.KmeansClustering[2] INFO     Accumulated all data, running kmeans on (125025, 3)
    2015-07-22 15:57:54,748 coordinates.clustering.KmeansClustering[2] INFO     Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter.


The trajectories are now assigned to the cluster centers.

.. code:: python

    dtrajs = clustering.dtrajs

In order to analyze the distribution of the data to the cluster centers
we make a histogram of the discrete trajectories and analyze the
distribution of the clustercenters in 2-dimensional TICA space.

.. code:: python

    figure(figsize(8, 5))
    histogram = np.bincount(np.concatenate(dtrajs), minlength=len(clustering.clustercenters));
    ind = np.arange(len(histogram))
    plt.bar(ind, histogram, log=True)
    plt.xlabel('microstate number')
    plt.ylabel('number of counts')


.. parsed-literal::

    <matplotlib.text.Text at 0x7f0669f14090>


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_33_1.png


.. code:: python

    cc_x = clustering.clustercenters[:,0]
    cc_y = clustering.clustercenters[:,1]
    contourf(F.T, 100, extent=extent)
    plot(cc_x,cc_y, linewidth=0, marker='o', markersize=5, color='black')


.. parsed-literal::

    [<matplotlib.lines.Line2D at 0x7f066a62bbd0>]


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_34_1.png


The states are well distributed in phase space.

Implied timescales
==================

Here we calculate the implied timescales at a series of lagtimes defined
in the *lags[ ]* array. Instead of an array you can just give a single
number such as lags=100 in order to generate a range of lagtimes <= 100.

.. code:: python

    lags = [1,2,5,10,15,20,30,50]
    its = msm.ImpliedTimescales(dtrajs, lags=lags, nits=5)

.. code:: python

    mpl.plot_implied_timescales(its, ylog=False, units='ns')
    ylim(0, 120);


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_39_0.png


Error bars for the implied timescales can be obtained by bootstrapping.

.. code:: python

    its.bootstrap(nsample=25)

.. code:: python

    mpl.plot_implied_timescales(its, ylog=True, units='ns')
    ylim(1,1000);


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_42_0.png


It can be seen that the timescales are approximately constant within the
error. Below we will select a lag time of 20 ns to build a Markov model.

Estimate MSM
============

The lagtime to estimate the Markov model is specified as *msm\_lag*
here.

.. code:: python

    msm_lag = 20
    M = msm.estimate_markov_model(dtrajs, msm_lag)
    print 'fraction of states used = ', M.active_state_fraction
    print 'fraction of counts used = ', M.active_count_fraction


.. parsed-literal::

    fraction of states used =  1.0
    fraction of counts used =  1.0


From the MSM which is now stored in the object we called *M* various
properties can be obtained. We start by analyzing the free energy
computed over the first two TICA coordinates

.. code:: python

    # This will compute the free energies by MSM state. Very often we will use this quantity, 
    # however this is not so nice to produce a contour plot...
    F_pi = -np.log(M.stationary_distribution)  

.. code:: python

    # ... therefore we take the statistical weight of each simulation timestep (also available from the MSM object)
    # and use that to create a contour plot
    xall = np.vstack(Y)[:,0]
    yall = np.vstack(Y)[:,1]
    W = np.concatenate(M.trajectory_weights())
    # compute free energies
    Fall = -np.log(W)
    Fall -= np.min(Fall)

.. code:: python

    fig = plt.figure(figsize=(6,5))
    mpl.contour(xall, yall, Fall)
    plot_labels()


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_49_0.png


Now we analyze the slowest processes by looking at the distribution of
states along the first 3 eigenvectors.

.. code:: python

    # project eigenvectors
    proj_ev1_all = np.hstack([M.eigenvectors_right()[:,1][dtraj] for dtraj in M.discrete_trajectories_full])
    proj_ev2_all = np.hstack([M.eigenvectors_right()[:,2][dtraj] for dtraj in M.discrete_trajectories_full])
    proj_ev3_all = np.hstack([M.eigenvectors_right()[:,3][dtraj] for dtraj in M.discrete_trajectories_full])

.. code:: python

    fig = plt.figure(figsize=(16,4))
    
    ax1=plt.subplot(131)
    r2 = M.eigenvectors_right()[:,1]
    ax1 = mpl.contour(xall, yall, proj_ev1_all, method='linear', colorbar=False)
    plot_labels()
    
    ax2=plt.subplot(132)
    r3 = M.eigenvectors_right()[:,2]
    ax2 = mpl.contour(xall, yall, proj_ev2_all, method='linear', colorbar=False)
    plot_labels()
    
    ax3=plt.subplot(133)
    r4 = M.eigenvectors_right()[:,3]
    ax3 = mpl.contour(xall, yall, -proj_ev3_all, method='linear', colorbar=False)
    plot_labels()


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_52_0.png


PCCA
====

Next the MSM is coarse grained into a user-defined number of macrostates
(*n\_sets*).

.. code:: python

    n_sets = 4
    M.pcca(n_sets)
    pcca_dist = M.metastable_distributions
    membership=M.metastable_memberships  # get PCCA memberships
    # memberships over trajectory
    mem1_all = np.hstack([membership[:,0][dtraj] for dtraj in M.discrete_trajectories_full])
    mem2_all = np.hstack([membership[:,1][dtraj] for dtraj in M.discrete_trajectories_full])
    mem3_all = np.hstack([membership[:,2][dtraj] for dtraj in M.discrete_trajectories_full])
    mem4_all = np.hstack([membership[:,3][dtraj] for dtraj in M.discrete_trajectories_full])

We have now determined the probability for each microstate to belong to
a given macrostate. These probabilities are called *memberships* to a
given macrostate.

.. code:: python

    fig = plt.figure(figsize=(8,8))
    
    ax1 = plt.subplot(221)
    ax1 = mpl.contour(xall, yall, mem1_all, zlim=[0,1], method='linear', colorbar=False)
    plot_labels()
    
    ax2 = plt.subplot(222)
    ax2 = mpl.contour(xall, yall, mem2_all, zlim=[0,1], method='linear', colorbar=False)
    plot_labels()
    
    ax3 = plt.subplot(223)
    ax3 = mpl.contour(xall, yall, mem3_all, zlim=[0,1], method='linear', colorbar=False)
    plot_labels()
    
    ax4 = plt.subplot(224)
    ax4 = mpl.contour(xall, yall, mem4_all, zlim=[0,1], method='linear', colorbar=False)
    plot_labels()


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_56_0.png


For each macrostate we can generate a number of representative sample
structures and store them into a trajectory file.

.. code:: python

    pcca_samples = M.sample_by_distributions(pcca_dist, 10)

.. code:: python

    coor.save_trajs(inp, pcca_samples, outfiles=['./data/pcca1_10samples.xtc','./data/pcca2_10samples.xtc',
                                                 './data/pcca3_10samples.xtc','./data/pcca4_10samples.xtc'])


.. parsed-literal::

    2015-07-22 15:59:08,174 coordinates.api INFO     Created file ./data/pcca1_10samples.xtc
    2015-07-22 15:59:08,563 coordinates.api INFO     Created file ./data/pcca2_10samples.xtc
    2015-07-22 15:59:08,978 coordinates.api INFO     Created file ./data/pcca3_10samples.xtc
    2015-07-22 15:59:09,506 coordinates.api INFO     Created file ./data/pcca4_10samples.xtc


.. parsed-literal::

    ['./data/pcca1_10samples.xtc',
     './data/pcca2_10samples.xtc',
     './data/pcca3_10samples.xtc',
     './data/pcca4_10samples.xtc']


Structure figures are generated with VMD, pyMol or another visualization
program of your choice. Here we used VMD to generate the following
structures, corresponding to the four metastable states:

.. code:: python

    from IPython.display import Image
    Image(filename='./data/pcca4_sets.png', width=800)


.. image:: md2msm_penta_peptide_files/md2msm_penta_peptide_61_0.png