Markov state model for BPTI =========================== .. raw:: html
 
   Antonia Mey                antonia.mey@fu-berlin.de  
   Guillermo Perez-Hernandez  guille.perez@fu-berlin.de 
   Frank Noe                  frank.noe@fu-berlin.de    
   
First import the pyemma package and check if we have the right version number: .. code:: ipython2 import pyemma pyemma.__version__ .. parsed-literal:: u'2.3' This notebook has been tested for PyEMMA 1.2.2. If you have a later version, changes may be required. Now we import a few general packages that we need to start with. The following imports basic numerics and algebra routines (numpy) and plotting routines (matplotlib), and makes sure that all plots are shown inside the notebook rather than in a separate window (nicer that way). .. code:: ipython2 import numpy as np %pylab inline .. parsed-literal:: Populating the interactive namespace from numpy and matplotlib Now we import the pyEMMA package that we will be using in the beginning: the coordinates package. This package contains functions and classes for reading and writing trajectory files, extracting order parameters from them (such as distances or angles), as well as various methods for dimensionality reduction and clustering. The shortcuts module is a bunch of functions specific to this workshop - they help us to visualize some of our results. Some of them might become part of the pyemma package once they are more mature. .. code:: ipython2 import pyemma.coordinates as coor # some helper funcs def average_by_state(dtraj, x, nstates): assert(len(dtraj) == len(x)) N = len(dtraj) res = np.zeros((nstates)) for i in range(nstates): I = np.argwhere(dtraj == i)[:,0] res[i] = np.mean(x[I]) return res def avg_by_set(x, sets): # compute mean positions of sets. This is important because of some technical points the set order # in the coarse-grained TPT object can be different from the input order. avg = np.zeros(len(sets)) for i in range(len(sets)): I = list(sets[i]) avg[i] = np.mean(x[I]) return avg BPTI 1 ms trajectory - load data -------------------------------- In this workshop we will be using the 1 millisecond trajectory of Bovine Pancreatic Trypsin Inhibitor (BPTI) generated by DE Shaw Research on the Anton Supercomputer [1]. In order to make the data size manageable we have saved only the Ca-coordinates and only every 10 ns, resulting in about 100000 frames. First we define coordinate and topology input file by pointing to a local data directory. Note that instead of using a single trajectory file you could specify a list of many trajectory files here - the rest of the analysis stays the same. .. code:: ipython2 trajfile = '../../unpublished/bpti_data/bpti_ca_1ms_dt10ns.xtc' topfile = '../../unpublished/bpti_data/bpti_ca.pdb' Now we decide which coordinates we would like to use in the further analysis. Since this trajectory is already RMSD-aligned we can simply use the Cartesian coordinates, here Ca-coordinates: .. code:: ipython2 feat = coor.featurizer(topfile) # just use all xyz-coordinates feat.add_all() We can ask the featurizer to tell describe the coordinates used - so we can later remember what we did. The return is a list of strings and we will just print the first few labels here: .. code:: ipython2 feat.describe()[:10] .. parsed-literal:: ['ATOM:ARG 1 CA 0 x', 'ATOM:ARG 1 CA 0 y', 'ATOM:ARG 1 CA 0 z', 'ATOM:PRO 2 CA 1 x', 'ATOM:PRO 2 CA 1 y', 'ATOM:PRO 2 CA 1 z', 'ATOM:ASP 3 CA 2 x', 'ATOM:ASP 3 CA 2 y', 'ATOM:ASP 3 CA 2 z', 'ATOM:PHE 4 CA 3 x'] Next we want to load the coordinates from disc. Often, coordinates will not fit into memory, so we'll just create a loader by specifying the source files as follows: .. code:: ipython2 inp = coor.source(trajfile, feat) print 'trajectory length = ',inp.trajectory_length(0) print 'number of dimension = ',inp.dimension() .. parsed-literal:: trajectory length = 103125 number of dimension = 174 That's 174 coordinates for 58 Ca-atoms. In practice you will often be using more coordinates in the beginning, but we want to keep our example simple and fast. time-lagged independent component analysis (TICA) ------------------------------------------------- In principle we could now directly go on to do data clustering. However, doing anything in 174 dimensions is quite inefficient. Not so much because of the CPU time needed, but rather because in such a high-dimensional space it is difficult to efficiently discretize data where it matters. So we would like to first reduce our dimension by throwing out the 'uninteresting' ones and only keeping the 'relevant' ones. But how do we do that? It turns out that a really good way to do that if you are interesting in the slow kinetics of the molecule - e.g. for constructing a Markov model, is to use the time-lagged independent component analysis (TICA) [2]. Amongst linear methods, TICA is optimal in its ability to approximate the relevant slow coordinates / reaction coordinates from MD simulation [3], and therefore it's ideal to construct Markov models. Probably you are more familiar with the principal component analysis (PCA), but we will come back to that later... .. code:: ipython2 lag=100 tica_obj = coor.tica(inp, lag=lag, dim=2, kinetic_map=False) # here we get the data that has been projected onto the first 2 IC's. It's a list, because we could generally # have a list of trajectories, so we just get the first element. Y = tica_obj.get_output()[0] print 'Projected data shape = ',Y.shape .. parsed-literal:: Projected data shape = (103125, 2) The TICA object has a number of properties that we can extract and work with. We have already obtained the projected trajectory and wrote it in a variable Y that is a matrix of size (103125 x 2). The rows are the MD steps, the 2 columns are the independent component coordinates projected onto. So each columns is a trajectory. Let us plot them: .. code:: ipython2 subplot2grid((2,1),(0,0)) plot(Y[:,0]) ylabel('ind. comp. 1') subplot2grid((2,1),(1,0)) plot(Y[:,1]) ylabel('ind. comp. 2') xlabel('time (10 ns)') .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_17_1.png Even in a 1 ms simulation we have a rare event in the first coordinate. Damn sampling problem. It looks like no trajectory can ever be long enough A particular thing about the IC's is that they have zero mean and variance one. We can easily check that: .. code:: ipython2 print 'Mean values: ',np.mean(Y, axis=0) print 'Variances: ',np.var(Y, axis=0) .. parsed-literal:: Mean values: [ 0.00013578 0.00067149] Variances: [ 0.99914652 1.00026047] The small deviations from 0 and 1 come from statistical and numerical issues. That's not a problem. Note that if we had set kinetic\_map=True when doing TICA, then the variances would not be 1 but rather the square of the corresponding TICA eigenvalue. TICA is a special transformation because it will project the data such that the autocorrelation along the independent components is as slow as possible. The eigenvalues of the TICA transform are the values of these autocorrelations at the chosen lag time (here 100). We can even interpret them in terms of relaxation timescales: .. code:: ipython2 print -lag/np.log(tica_obj.eigenvalues[:5]) .. parsed-literal:: [ 2121.98368872 946.55524452 669.04609256 373.33673652 305.70383507] We will see more timescales later when we estimate a Markov model, and there will be some differences. For now you should treat these numbers as a rough guess of your molecule's timescales, and we will see later that this guess is actually a bit too fast. The timescales are relative to the 10 ns saving interval, so we have 21 microseconds, 9 microseconds, .. Now we histogram this data and compute the apparent free energy landscape .. code:: ipython2 # histogram data z,x,y = np.histogram2d(Y[:,0],Y[:,1], bins=50) # compute free energies F = -np.log(z) # contour plot extent = [x[0], x[-1], y[0], y[-1]] contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) .. parsed-literal:: /tmp/pyemma-doc-release/miniconda/envs/ci/lib/python2.7/site-packages/ipykernel/__main__.py:4: RuntimeWarning: divide by zero encountered in log .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_24_2.png So we nicely see that there are a couple of different energy minima are present and clearly separated in the TICA projection. Clustering the data ------------------- we use k-means clustering and get the discrete trajectories .. code:: ipython2 cl = coor.cluster_kmeans(data=Y, k=100, stride=1) # for later use we save the discrete trajectories and cluster center coordinates: dtrajs = cl.dtrajs cc_x = cl.clustercenters[:,0] cc_y = cl.clustercenters[:,1] .. parsed-literal:: 08-01-17 14:23:11 pyemma.coordinates.clustering.kmeans.KmeansClustering[2] INFO Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter. where are the clusters? .. code:: ipython2 contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) plot(cc_x,cc_y, linewidth=0, marker='o') .. parsed-literal:: [] .. image:: MSM_BPTI_files/MSM_BPTI_30_1.png For our purpose, the main result from the clustering algorithm are discrete trajectories: .. code:: ipython2 cl.dtrajs .. parsed-literal:: [array([82, 50, 42, ..., 18, 18, 5], dtype=int32)] This is a list of integer-arrays, one for each trajectory used (here only one). The integer-array contains one number between 0 and 99 (because we have used 100 cluster centers) for each MD trajectory frame. Let's play ---------- Now let's play around a little bit. Please refer to the pyEMMA coordinates documentation at: http://www.emma-project.org/latest/api/index\_coor.html and try to vary the code above in the following ways: 1. Try different clustering methods, such as regspace and uniform time clustering instead of using k-means. Also try to change the parameters. What are the differences you observe? What do you think is the best strategy for clustering your data? 2. Instead of using TICA to project your high-dimensional data, try to use PCA. Discuss the differences in the projected trajectories and in the free energy plot. Which one do you think will be better to detect the slow conformational changes 3. In the approach above you have streamed everything from the file because we wanted to keep the memory requirements small. For example, when you did the k-means clustering, the data was read from the file, transformed through your TICA transformation, and then clustered. Our data is small enough that we could have loaded everything into memory. Try to do that using the load function, and pass the result directly into TICA instead of passing the source object (inp). Likewise, take the resulting data from TICA (Y), and pass it directly to k-means. What is different now? 4. For large data sets you can further speed things up by using strides. Play around with the stride parameter using in k-means and see what changes. You can use a stride at every step of your data processing pipeline 5. Play around with different coordinates. Instead of just looking at Ca-positions, try angles, distances etc. Look at the result when you load data. Try to push it - can you compute all pairwise distances and still load the data into memory? 1) Different clustering methods ------------------------------- .. code:: ipython2 clu = coor.cluster_uniform_time(data=Y, k=100) contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) plot(clu.clustercenters[:,0], clu.clustercenters[:,1], linewidth=0, marker='o') .. parsed-literal:: [] .. image:: MSM_BPTI_files/MSM_BPTI_36_1.png .. code:: ipython2 clr = coor.cluster_regspace(data=Y, dmin=0.5) contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) plot(clr.clustercenters[:,0], clr.clustercenters[:,1], linewidth=0, marker='o') .. parsed-literal:: [] .. image:: MSM_BPTI_files/MSM_BPTI_37_1.png 2) PCA vs TICA -------------- .. code:: ipython2 pca_obj = coor.pca(inp, dim=2) Ypca = pca_obj.get_output()[0] subplot2grid((2,1),(0,0)) plot(Ypca[:,0]) ylabel('principal comp. 1') subplot2grid((2,1),(1,0)) plot(Ypca[:,1]) ylabel('principal comp. 2') xlabel('time (10 ns)') .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_39_1.png .. code:: ipython2 # histogram data z_,x_,y_ = np.histogram2d(Ypca[:,0],Ypca[:,1], bins=50) # compute free energies F_ = -np.log(z_) # contour plot extent_ = [x_[0], x_[-1], y_[0], y_[-1]] contourf(F_.T, 50, cmap=plt.cm.hot, extent=extent_) .. parsed-literal:: /tmp/pyemma-doc-release/miniconda/envs/ci/lib/python2.7/site-packages/ipykernel/__main__.py:4: RuntimeWarning: divide by zero encountered in log .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_40_2.png 3) Do everything in memory -------------------------- .. code:: ipython2 X = coor.load(trajfile, feat) Y1 = coor.tica(X, lag=lag, dim=2).get_output() km = coor.cluster_kmeans(Y1, k=100) contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) # sign of -km.clustercenters[:,0] is changed here to be correct in the plot. # the direction may change due to a random initialization of the eigenvectors. plot(km.clustercenters[:,0], km.clustercenters[:,1], linewidth=0, marker='o') .. parsed-literal:: 08-01-17 14:23:30 pyemma.coordinates.clustering.kmeans.KmeansClustering[7] INFO Cluster centers converged after 7 steps. .. parsed-literal:: [] .. image:: MSM_BPTI_files/MSM_BPTI_42_2.png 4) Do everything as a pipeline ------------------------------ .. code:: ipython2 stage1 = coor.source(trajfile, feat) stage2 = coor.tica(lag=lag, dim=2) stage3 = coor.cluster_kmeans(k=100) coor.pipeline([stage1,stage2,stage3]) contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) plot(stage3.clustercenters[:,0],stage3.clustercenters[:,1], linewidth=0, marker='o') .. parsed-literal:: 08-01-17 14:23:37 pyemma.coordinates.clustering.kmeans.KmeansClustering[10] INFO Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter. .. parsed-literal:: [] .. image:: MSM_BPTI_files/MSM_BPTI_44_2.png 5) Striding ----------- .. code:: ipython2 %time coor.cluster_kmeans(data=Y, k=100, stride=1) .. parsed-literal:: 08-01-17 14:23:42 pyemma.coordinates.clustering.kmeans.KmeansClustering[11] INFO Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter. CPU times: user 4.35 s, sys: 0 ns, total: 4.35 s Wall time: 4.34 s .. parsed-literal:: KmeansClustering(fixed_seed=False, init_strategy='kmeans++', max_iter=10, metric='euclidean', n_clusters=100, n_jobs=24, oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05) .. code:: ipython2 %time coor.cluster_kmeans(data=Y, k=100, stride=10) .. parsed-literal:: 08-01-17 14:23:43 pyemma.coordinates.clustering.kmeans.KmeansClustering[12] INFO Cluster centers converged after 8 steps. CPU times: user 408 ms, sys: 4 ms, total: 412 ms Wall time: 407 ms .. parsed-literal:: KmeansClustering(fixed_seed=False, init_strategy='kmeans++', max_iter=10, metric='euclidean', n_clusters=100, n_jobs=24, oom_strategy='memmap', skip=0, stride=10, tolerance=1e-05) 6) Different coordinates ------------------------ .. code:: ipython2 feat2 = coor.featurizer(topfile) feat2.add_distances_ca(periodic=False) inp_ = coor.source(trajfile, feat2) print 'number of dimension = ',inp_.dimension() tica_obj_ = coor.tica(inp_, lag=lag, dim=2) # plot Y_ = tica_obj_.get_output()[0] subplot2grid((2,1),(0,0)) plot(Y_[:,0]) ylabel('ind. comp. 1') subplot2grid((2,1),(1,0)) plot(Y_[:,1]) ylabel('ind. comp. 2') xlabel('time (10 ns)') .. parsed-literal:: number of dimension = 1540 .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_49_2.png .. code:: ipython2 # cluster km = coor.cluster_kmeans(Y_, k=100) # histogram data z_,x_,y_ = np.histogram2d(Y_[:,0],Y_[:,1], bins=50) # compute free energies F_ = -np.log(z_) # contour plot extent_ = [x_[0], x_[-1], y_[0], y_[-1]] # plot contourf(F_.T, 50, cmap=plt.cm.hot, extent=extent_) plot(km.clustercenters[:,0],km.clustercenters[:,1], linewidth=0, marker='o') .. parsed-literal:: 08-01-17 14:26:31 pyemma.coordinates.clustering.kmeans.KmeansClustering[16] INFO Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter. .. parsed-literal:: /tmp/pyemma-doc-release/miniconda/envs/ci/lib/python2.7/site-packages/ipykernel/__main__.py:6: RuntimeWarning: divide by zero encountered in log .. parsed-literal:: [] .. image:: MSM_BPTI_files/MSM_BPTI_50_3.png MSM estimation -------------- In this chapter we want to estimate a Markov model, and then analyze it. First we import the msm package from pyemma, and then we also import a plotting package which offers a few standard ways for visualizing data. .. code:: ipython2 import pyemma.msm as msm import pyemma.plots as mplt The quality and the practical usefulness of a Markov model depend on two main parameters: 1. The state-space discretization, i.e. which steps we have conducted before (choice of coordinates, projection method, clustering method) 2. The lag time, i.e. at which time do we count transitions. The first choice is quite complex and there are some ways to deal with this complexity and the reduce the number of choices, although we won't discuss them in detail here. The second parameter is extremely important, and we should scan it in order to make a good selection. So Let us compute the so-called implied timescales, or relaxation timescales of the Markov model at different lag times: .. code:: ipython2 lags = [1,2,5,10,20,50,100,200] its = msm.its(dtrajs, lags=lags) .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. What this function does is to estimate a Markov model at each of the given lag times :math:`\tau` (that are multiples of our saving step, i.e. multiples of 10 ns), compute the eigenvalues of each transition matrix, :math:`\lambda_i(\tau)`, and then compute the relaxation timescales by: .. math:: t_i = \frac{-\tau}{\ln | \lambda_i(\tau) |} The its object will mainly give us these estimated timescales. We can simply push the result to a standard plotting function: .. code:: ipython2 mplt.plot_implied_timescales(its) .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_56_1.png It has been shown [4] that these timescales should be independent of the lag time. You can see that for short lag times they are not, but after about 100 steps (1 microsecond), they are pretty constant. The faster timescales still increase. Part of this is normal and due to numerical issues. Part of this is because our state space discretization is rather naive - we cannot hope to capture a lot of relaxation processes when using only two dimensions. In a nutshell: longer timescale is better, at least as long you stay away from the grey area. The grey area is defined by lag > timescale, and in this area we cannot make a reliable estimate because the process under investigation has already decayed. Everything within or close to the grey area is distorted. Of course looking at a plot and judging the flatness by eye is not really sophisticated. Especially because the error bars on these timescales can be pretty big. Let's compute some errors using Bayesian MSMs and plot again... .. code:: ipython2 its = msm.its(dtrajs, lags=lags, errors='bayes') .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. parsed-literal:: Widget Javascript not detected. It may not be installed or enabled properly. .. code:: ipython2 mplt.plot_implied_timescales(its) ylim(10,10000) xlim(0,200) .. parsed-literal:: (0, 200) .. image:: MSM_BPTI_files/MSM_BPTI_59_1.png In order to be approximately constant with the lag time, approximately 100 steps are reasonable. MSM --- So finally we can estimate a Markov model. In fact we have already estimated a several Markov models (for different lag times above), but now we construct an msm object which will give us access to a wide variety of interesting quantities. All we need to put in are the discrete trajectories obtained from the clustering and the lag time: .. code:: ipython2 M = msm.estimate_markov_model(dtrajs, 100) The Markov model will be constructed on the largest connected set of states. That could mean that we exclude some states from the analysis. Let us verify that this is not the case here: .. code:: ipython2 print 'fraction of states used = ', M.active_state_fraction print 'fraction of counts used = ', M.active_count_fraction .. parsed-literal:: fraction of states used = 1.0 fraction of counts used = 1.0 Spectral analysis ----------------- Let us have a closer look at the timescales that were already seen in the its plot: .. code:: ipython2 plot(M.timescales(),linewidth=0,marker='o') xlabel('index'); ylabel('timescale (10 ns)'); xlim(-0.5,10.5) .. parsed-literal:: (-0.5, 10.5) .. image:: MSM_BPTI_files/MSM_BPTI_66_1.png We can also look at that data by taking the ratios of subsequent timescales. This shows us how big the gap of timescales (or rates) are. .. code:: ipython2 plot(M.timescales()[:-1]/M.timescales()[1:], linewidth=0,marker='o') xlabel('index'); ylabel('timescale separation'); xlim(-0.5,10.5) .. parsed-literal:: (-0.5, 10.5) .. image:: MSM_BPTI_files/MSM_BPTI_68_1.png It can be seen that there is a large timescale separation between the second and third relaxation timescale. That means that if we are interested in coarse-graining our dynamics, retaining two relaxation timescales, or three metastable states, is a good choice. Now let us look at the second (slowest) right eigenvector .. code:: ipython2 r2 = M.eigenvectors_right()[:,1] ax = mplt.scatter_contour(cc_x, cc_y, r2) .. image:: MSM_BPTI_files/MSM_BPTI_70_0.png Clearly, the slowest process (about 40 microseconds) involves a change along the first TICA component. This is great news for TICA, because it means that even before computing a Markov model we have done a very good job in finding a slow order parameter. However, remember that this process has occurred only once in the trajectory, so unfortunatly our data is quite poor with respect to quantifying it. Let us then look at the third (next-slowest) right eigenvector .. code:: ipython2 r3 = M.eigenvectors_right()[:,2] mplt.scatter_contour(cc_x, cc_y, r3) .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_72_1.png This process is between 15 and 20 microseconds and clearly transitions between the two prominent minima on the left. Now we want to coarse-grain our system to get a simpler description. The PCCA method [5] uses the eigenvectors in order to perform a spectral clustering. This clustering is fuzzy, i.e. each of our k-means clusters is linked to one of the metastable states with a certain probability or membership. Here we plot the Bayesian inverse, i.e. the probability to be in a small state given that we are in a metastable state. We choose three metastable states, so we have three distributions: .. code:: ipython2 M.pcca(3) pcca_dist = M.metastable_distributions f, (ax1,ax2,ax3) = subplots(ncols=3) f.set_size_inches(15,5) cmap=plt.cm.Blues mplt.scatter_contour(cc_x, cc_y, pcca_dist[0], fig=f, ax=ax1, colorbar=False, cmap=cmap) mplt.scatter_contour(cc_x, cc_y, pcca_dist[1], fig=f, ax=ax2, colorbar=False, cmap=cmap) mplt.scatter_contour(cc_x, cc_y, pcca_dist[2], fig=f, ax=ax3, colorbar=False, cmap=cmap) .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_74_1.png MSM trajectory -------------- Let's generate a synthetic MSM trajectory (100 milliseconds). This is just for playing around There's nothing you can compute from this trajectory that you couldn't do better with a direct matrix analysis, but for some reason people seem to love it... .. code:: ipython2 index_traj = M.generate_traj(100000) let's just read out the x/y coordinates of our cluster centers using the synthetic trajectory .. code:: ipython2 I = index_traj[:,1] dtraj_synthetic = M.discrete_trajectories_full[0][I] x_synthetic = cl.clustercenters[dtraj_synthetic,0] y_synthetic = cl.clustercenters[dtraj_synthetic,1] Plot them. You can see the rare excursions to the low-populated state happening about 6 times in the first 10 ms. The next process is much more frequent. .. code:: ipython2 subplot2grid((2,1),(0,0)) plot(x_synthetic) xlim(0,10000) subplot2grid((2,1),(1,0)) plot(y_synthetic) xlim(0,10000) .. parsed-literal:: (0, 10000) .. image:: MSM_BPTI_files/MSM_BPTI_81_1.png Representative Structures ------------------------- Now we want to see some structures. Let us use the PCCA distributions in order to select structures. We will get 100 frame indexes for each of the three distributions .. code:: ipython2 pcca_samples = M.sample_by_distributions(pcca_dist, 100) and now we save them. .. code:: ipython2 coor.save_traj(inp, pcca_samples[0], './data/pcca1_100samples.xtc') coor.save_traj(inp, pcca_samples[1], './data/pcca2_100samples.xtc') coor.save_traj(inp, pcca_samples[2], './data/pcca3_100samples.xtc') .. parsed-literal:: 08-01-17 14:27:09 pyemma.coordinates.api INFO Created file ./data/pcca1_100samples.xtc 08-01-17 14:27:09 pyemma.coordinates.api INFO Created file ./data/pcca2_100samples.xtc 08-01-17 14:27:10 pyemma.coordinates.api INFO Created file ./data/pcca3_100samples.xtc .. parsed-literal:: /tmp/pyemma-doc-release/miniconda/envs/ci/lib/python2.7/site-packages/mdtraj/utils/unitcell.py:97: RuntimeWarning: invalid value encountered in greater a[np.logical_and(a>-tol, a-tol, a-tol, b-tol, b-tol, c-tol, c .. image:: MSM_BPTI_files/MSM_BPTI_90_1.png Experimental observables ------------------------ Now let us play some hypothetical games with experimental observables. We would like to design an experiment that allows us to resolve the slowest process in BPTI. By looking at the structures we find that the distance between residue 10 - 34 (indexes 9 - 33) exhibits a significant change. This is a difference we could measure by a fluorescent probe Here we use the feature reader to just read this one distance and get it as a vector: .. code:: ipython2 feat2 = coor.featurizer(topfile) feat2.add_distances(np.array([[9,33]])) D = coor.load(trajfile, feat2) Image(filename='./data/observable.png', width=500) .. image:: MSM_BPTI_files/MSM_BPTI_92_0.png Let us compare this experimental observable with the slowest TIC... Not too bad. Much more noisy, but at least there is some correlation. .. code:: ipython2 subplot2grid((2,1),(0,0)) plot(D) ylabel('distance res 10-34') subplot2grid((2,1),(1,0)) plot(Y[:,0]) ylabel('ind. comp. 1') xlabel('time (10 ns)') .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_94_1.png You can see that the variations are roughly between 0.8 and 1.5 nanometers. This is actually the perfect distance for FRET (although I'm sure that the bulky FRET labels would mess up everything here, but again this is just a hypothetical game). On the other hand they might be close enough to use fluorescence quenching. So one option is to check if there are any trytophanes nearby that can be used as a quencher - that way we could get away with one extrinsic fluorophore. let us average the observable by state: .. code:: ipython2 dmean = average_by_state(dtrajs[0], D, M.nstates) plotting the observable on the TICA-map. It's decently correlated with the first TIC, and a good indicator of whether we are in the low-populated state. .. code:: ipython2 mplt.scatter_contour(cl.clustercenters[:,0], cl.clustercenters[:,1], dmean, colorbar=False) .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_98_1.png this is the ensemble average of our probe distance: .. code:: ipython2 M.expectation(dmean) .. parsed-literal:: 0.90941107417117117 Now let us conduct a correlation experiment. This could be done by using FRET as an observable and then running a fluorescence time correlation experiment - very easy to do. .. code:: ipython2 times, corr = M.correlation(dmean) plot(times, corr, linewidth=2) xlabel('time / microseconds (10 ns)') ylabel('autocorrelation') .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_102_1.png We do see a nice decay that contains our slowest timescale. Unfortunately, the change in the signal (y-axis) is really small, so that we cannot expect to see anything in the presence of experimental errors. The problem is that the extra state is too little populated - in an equilibrium measurement such as FCS it will have too little contribution, and this is why the amplitude is so small. We need to find a way to increase the amplitude of the signal. The observable is already pretty good, we can't get much better there. But we can do a different experiment that hopefully gives a larger signal: Let us do a perturbation-relaxation experiment. Suppose we would have some way of preparing our ensemble in the low-populated state of interest (different temperature, pH, ...?), and then let the system relax towards equilibrium while we measure the expectation of the fluorescence. .. code:: ipython2 p0 = pcca_dist[0] times, rel = M.relaxation(p0, dmean) plot(times, rel, linewidth=2) xlabel('time / microseconds (10 ns)') ylabel('mean distance') .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_104_1.png Now we can see a much stronger signal (variation between 0.9 and 1.15) - this will be measurable even with large experimental errors. This is a very nice and clear observable. If we only could measure whatever we wanted... now let's plot the relaxation curve in a log-plot. And let's add single exponentials with the MSM timescales (showing up as lines), to see if we can match timescales .. code:: ipython2 plot(times, rel-M.expectation(dmean), linewidth=5) t2_tau = M.timescales()[0] plot(times, 0.24*np.exp(-times / t2_tau), color='yellow', linestyle='dashed', linewidth=2) semilogy() xlabel('time / microseconds (tau)') ylabel('distance perturbation') .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_107_1.png We can nicely see that the main part of the relaxation is exactly due to the slow process. So we have designed an experiment that can measure the slow process! There are more fast timescales contributing, but their amplitudes are so small that we can hardly see the deviation from a single-exponential. These other timescales could be measured by choosing other coordinates A more systematic analysis could be done using dynamical fingerprints concept described in [6]. Please check the msm functions fingerprint\_correlation and fingerprint\_relaxation. Transition pathways and Committors ---------------------------------- Finally, let us compute transition pathways between two selected endstates. This is a very nice mechanistic analysis that is often revealing when we have a two-state process between such as folding or binding. In the present case we would like to know which pathways does the system take to partially unfold into the rarely populated 'right' state. At first we have to select what the two end-states of our interest are. We have done PCCA which gives us an idea of metastable sets, but if we do a transition path analysis with three states that's a bit boring, because there are only two options: the single intermediate state is on-pathway or off-pathway. We want to get a more detailed analysis. Therefore we reconsider PCCA an this time ask for six metastable sets: .. code:: ipython2 # do pcca with 6 states now M.pcca(6) pcca_sets_6 = M.metastable_sets Which are assigned to the TICA plot as follows: .. code:: ipython2 figure(figsize=(8,5)) pcca_sets = M.metastable_sets contourf(F.T, 50, cmap=plt.cm.hot, extent=extent) size = 50 cols = ['orange', 'magenta', 'red', 'black', 'blue', 'green',] for i in range(6): scatter(cc_x[pcca_sets_6[i]], cc_y[pcca_sets_6[i]], color=cols[i], s=size) .. image:: MSM_BPTI_files/MSM_BPTI_112_0.png Now we select as two end-states the leftmost and the rightmost of these 6 sets. Since the ordering of metastable sets could differ in different runs, we actually compute their average positions along the first TICA coordinate: .. code:: ipython2 xavg = avg_by_set(cc_x, pcca_sets_6) A = pcca_sets_6[xavg.argmax()] B = pcca_sets_6[xavg.argmin()] Now we compute the transition pathways. This is done by discrete transition path theory [7] in its MSM formulation [8]. .. code:: ipython2 fluxAB = msm.tpt(M, A, B) .. code:: ipython2 # mean first passage times in microseconds print 0.01*M.mfpt(A, B) print 0.01*M.mfpt(B, A) .. parsed-literal:: 1578.91254715 110.368997723 so, it takes about 1.5 milliseconds on average to go into the low-populated "open" state, and about 100 microseconds to switch back. .. code:: ipython2 figure(figsize=(8,5)) mplt.scatter_contour(cl.clustercenters[:,0], cl.clustercenters[:,1], fluxAB.committor, colorbar=True, ncontours=15) scatter(cl.clustercenters[A,0], cl.clustercenters[A,1], color='black', s=200) scatter(cl.clustercenters[B,0], cl.clustercenters[B,1], color='white', s=200) .. parsed-literal:: .. image:: MSM_BPTI_files/MSM_BPTI_119_1.png .. code:: ipython2 cg, cgflux = fluxAB.coarse_grain(pcca_sets) .. code:: ipython2 # compute mean positions of sets. This is important because of some technical points the set order # in the coarse-grained TPT object can be different from the input order. avgpos = np.zeros((6,2)) avgpos[:,0] = avg_by_set(cc_x, cg) avgpos[:,1] = avg_by_set(cc_y, cg) .. code:: ipython2 fig, _ = mplt.plot_flux(cgflux, avgpos, cgflux.stationary_distribution, max_width=10, max_height=7) cf=contourf(F.T, 50, cmap=plt.cm.hot, extent=extent, fig=fig, zorder=0) fig.set_size_inches(12, 4) .. image:: MSM_BPTI_files/MSM_BPTI_122_0.png This shows us the probability fluxes of pathways from the rightmost state to the leftmost (rare-event) state. You can see that the main pathway leads us to the most stable state, that then splits into two pathways that merge in an intermediate, and then split again before uniting in the target state. We can decompose the flux into individual pathways, along with their fluxes by: .. code:: ipython2 paths, path_fluxes = cgflux.pathways(fraction=0.99) print 'percentage \tpath' print '-------------------------------------' for i in range(len(paths)): print (path_fluxes[i] / np.sum(path_fluxes)),' \t', paths[i] .. parsed-literal:: percentage path ------------------------------------- 0.417000119681 [0 4 2 5] 0.260757390002 [0 3 2 1 5] 0.159434554378 [0 4 3 2 5] 0.0625507760049 [0 2 1 5] 0.049359611785 [0 4 1 5] 0.0359700763782 [0 4 3 2 1 5] 0.0149274717718 [0 4 3 1 5] We can get a cleaner picture by just plotting the dominant flux, i.e. the flux that only contains the few most important pathways by setting *minflux* to a higher value .. code:: ipython2 fig, _ = mplt.plot_flux(cgflux, avgpos, cgflux.stationary_distribution, minflux=1e-6) cf = contourf(F.T, 50, cmap=plt.cm.hot, extent=extent, fig=fig, zorder=0) fig.set_size_inches(12, 4) .. image:: MSM_BPTI_files/MSM_BPTI_126_0.png References ---------- 1. Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W: Atomic-level characterization of the structural dynamics of proteins. *Science* **330**:341-346 (2010). doi: 10.1126/science.1187409. 2. Molgedey, L. and H. G. Schuster, Phys. Rev. Lett. 72, 3634 (1994). 3. Pérez-Hernández, G. and Paul, F. and Giogino, T. and de Fabritiis, G. and Noé, F. Identification of slow molecular order parameters for Markov model construction. *J. Chem. Phys.* **139**:015102 (2013) 4. Swope WC, Pitera JW and Suits F. Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. *J. Phys. Chem. B* **108**:6571-6581 (2004) 5. Röblitz S. and M. Weber: Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification. Adv. Data. Anal. Classif. DOI 10.1007/s11634-013-0134-6 (2013) 6. Noé F, Doose S, Daidone I, Löllmann M, Chodera JD, Sauer M, Smith JC. Dynamical fingerprints for probing individual relaxation processes in biomolecular dynamics with simulations and kinetic experiments. *Proc. Natl. Acad. Sci. USA*, **108**: 4822-4827 (2011) 7. Metzner P, Schütte C, Vanden-Eijnden, E. Transition Path Theory for Markov Jump Processes. *Multiscale Model. Simul.* **7**. 1192--1219 (2009) 8. Noé F, Schütte C, Vanden-Eijnden E, Reich L and Weikl T. Constructing the Full Ensemble of Folding Pathways from Short Off-Equilibrium Simulations. *Proc. Natl. Acad. Sci. USA*, **106**:19011-19016 (2009)