MSM Estimation ============== In this notebook we gonna show, how to estimate a transfer operator, namely a transition matrix from an already discretized trajectory. .. code:: python ########################################## # IMPORT ALL REQUIRED PACKAGES ########################################## # numerics import numpy as np # matplotlib import matplotlib.pyplot as plt %pylab inline #emma imports import pyemma.msm.io as msmio import pyemma.msm.estimation as msmest import pyemma.msm.analysis as msmana import pyemma.msm.generation as msmgen .. parsed-literal:: Populating the interactive namespace from numpy and matplotlib IO -- We first read a discrete trajectory that has been generated by clustering a TICA-projection of the 1 ms Anton BPTI trajectory: .. code:: python dtraj_bpti = msmio.read_dtraj("./resources/bpti_tica2_regspace0.3.disctraj") print len(dtraj_bpti) print dtraj_bpti[0:20] .. parsed-literal:: 4124963 [0 2 2 3 4 2 0 2 7 1 0 7 6 0 1 3 3 1 2 2] Although this is just a series of cluster indexes, let's have a look at the trajectory in a plot. It's pretty clear that there is something interested happening in the last third of the trajectory: .. code:: python plot(range(len(dtraj_bpti)), dtraj_bpti) xticks(range(0,len(dtraj_bpti),1000000)) xlabel('time (250 ps steps) ') ylabel('cluster index') show() .. image:: estimation_files/estimation_5_0.png Trajectory generation --------------------- Now we want to generate our own trajectory data. For starters, consider the following simple 4x4 matrix which exhibits dynamics on three different timescales: .. code:: python T4 = np.array([[0.9, 0.1, 0.0, 0.0], [0.1, 0.89, 0.01, 0.0], [0.0, 0.01, 0.94, 0.05], [0.0, 0.0, 0.05, 0.95]]) This matrix is stochastic (it's a transition matrix) - see: .. code:: python # test function print "T is a transition matrix: ", msmana.is_transition_matrix(T4) # proof that this is true: print "because all its row sums are 1: ", np.sum(T4, axis=1) print "and all its elements are within: [", np.min(T4), ",", np.max(T4), "]" .. parsed-literal:: T is a transition matrix: True because all its row sums are 1: [ 1. 1. 1. 1.] and all its elements are within: [ 0.0 , 0.95 ] Its stationary distribution is uniform: .. code:: python mu = msmana.statdist(T4) print "Stationary distribution:", mu .. parsed-literal:: Stationary distribution: [ 0.25 0.25 0.25 0.25] It's also reverisble matrix: .. code:: python # test function print "T is a reversible matrix: ", msmana.is_reversible(T4) # proof that this is true: C = np.dot(np.diag(mu),T4) print "because its correlation matrix \n", C print "is symmetric to within numerical error of ", np.linalg.norm(C - C.transpose()) .. parsed-literal:: T is a reversible matrix: True because its correlation matrix [[ 0.225 0.025 0. 0. ] [ 0.025 0.2225 0.0025 0. ] [ 0. 0.0025 0.235 0.0125] [ 0. 0. 0.0125 0.2375]] is symmetric to within numerical error of 2.26512642969e-17 Buhuuuu! This is wrong. OK... Now we generate a trajectory from it: .. code:: python figsize(15,5) dtraj_4_1K = msmgen.generate_traj(T4, 0, 1000) subplot2grid((1,3),(0,0)) plot(range(len(dtraj_4_1K)), dtraj_4_1K) dtraj_4_10K = msmgen.generate_traj(T4, 0, 10000) subplot2grid((1,3),(0,1)) plot(range(len(dtraj_4_10K)), dtraj_4_10K) dtraj_4_100K = msmgen.generate_traj(T4, 0, 100000) subplot2grid((1,3),(0,2)) plot(range(len(dtraj_4_100K)), dtraj_4_100K) show() .. image:: estimation_files/estimation_16_0.png Estimation ---------- First of all we have to count .. code:: python C = msmest.cmatrix(dtraj_4_10K, 1) print C .. parsed-literal:: (0, 0) 2655.0 (0, 1) 281.0 (1, 0) 280.0 (1, 1) 2730.0 (1, 2) 26.0 (2, 1) 25.0 (2, 2) 2067.0 (2, 3) 93.0 (3, 2) 92.0 (3, 3) 1750.0 Mind that this count matrix is now a sparse matrix! At the moment count matrices are sparse whereas subsequent operations (estimation, analysis) will work on either sparse or dense matrices, depending on the input matrix (although not all combinations are implemented yet). This may be changed in the future because for small matrices, dense is the most efficient option. Next, we estimate the transition matrix - either reversible or nonreversibly: .. code:: python Test_nonrev = msmest.transition_matrix(C) print Test_nonrev.toarray() print Test_rev = msmest.transition_matrix(C,reversible=True) print Test_rev.toarray() print .. parsed-literal:: [[ 0.90429155 0.09570845 0. 0. ] [ 0.09222661 0.89920949 0.0085639 0. ] [ 0. 0.01144165 0.94599542 0.04256293] [ 0. 0. 0.04994571 0.95005429]] [[ 0.90429155 0.09570845 0. 0. ] [ 0.09222661 0.89920949 0.0085639 0. ] [ 0. 0.01144165 0.94599542 0.04256293] [ 0. 0. 0.04994571 0.95005429]] As you can see the results are almost identical (the differences are due to numerical reasons). This is because any tridiagonal transition matrix is reversible with respect to its stationary distribution - but we will see differences later. We now show that the estimate converges to the correct matrix for long sampling: .. code:: python lengths = range(1000, 100000, 1000) errors_nonrev = [] for l in lengths: dtraj_4_var = msmgen.generate_traj(T4, 0, l) Cest_4_var = msmest.count_matrix(dtraj_4_var, 1) Test_4_var = msmest.transition_matrix(Cest_4_var) error_nonrev = np.linalg.norm(Test_4_var - T4) errors_nonrev.append(error_nonrev) # do a log-log plot of the error figsize(7,7) loglog(lengths, errors_nonrev) xlabel('simulation length (steps)') ylabel('nonreversible estimation error') show() .. image:: estimation_files/estimation_22_0.png Implied timescales ------------------ Now we conduct the implied timescales test using the BPTI trajectory. There'll be a few high-level commands (certainly for the shell, maybe for the python API level as well) later, but right now you build things like implied timescales yourselves: .. code:: python lags = [1,2,5,10,20,50,100,200,500,1000,2000,5000,10000] nits = 10 evs = np.zeros((len(lags),nits)) its = np.zeros((len(lags),nits)) for i in range(len(lags)): lag = lags[i] # count matrix at lag tau Clag_sparse = msmest.cmatrix(dtraj_bpti,lag) # let's stay dense Clag = Clag_sparse.toarray() # estimate transition matrix Tlag = msmest.transition_matrix(Clag, lag) # eigenvalues evs[i,:] = msmana.eigenvalues(Tlag)[1:nits+1] # timescales its[i,:] = msmana.timescales(Tlag, lag)[1:nits+1] .. parsed-literal:: -c:14: ComplexWarning: Casting complex values to real discards the imaginary part Let's have a look at the first eigenvalue and timescale .. code:: python # plot eigenvalue figsize(10,5) subplot2grid((1,2),(0,0)) plot(lags, evs[:,0]) xlabel('lag time') ylabel('eigenvalue') # plot its figsize(10,5) subplot2grid((1,2),(0,1)) plot(lags, its[:,0]) xlabel('lag time') ylabel('timescale') show() .. image:: estimation_files/estimation_26_0.png You can nicely see the problem with this kind of MSM estimation: the eigenvalue does not decay exponentially for all tau, but rather drops (due to fast processes that are not resolved by the discretization) to some value smaller than 1, and then decays exponentially with the timescale of interest. Hence we have slow convergence in tau. Let's look at all timescales .. code:: python # plot its figsize(10,5) subplot2grid((1,2),(0,0)) for i in range(nits): plot(lags, its[:,i]) # plot forbidden region plot(lags,lags,color='black', linewidth=3) xlabel('lag time') ylabel('timescale') # logarithmically figsize(10,5) subplot2grid((1,2),(0,1)) for i in range(nits): plot(lags, its[:,i]) # plot forbidden region plot(lags,lags,color='black', linewidth=3) semilogy() ylim(100,1000000) xlabel('lag time') ylabel('timescale') show() .. image:: estimation_files/estimation_28_0.png Example: Double-well potential ------------------------------ Next we probe a more complex quasi-continuous example: a metastable double-well potential with 100 microstates, out of which about 60 are populated with significant probability. We coarse-grain the double-well to a few (2 or more) sets. In the coarse-grained state space is usually connected. .. code:: python # load T matrix T_doublewell = msmio.read_matrix('./resources/2well.T',mode='sparse') pi_doublewell = msmana.statdist(T_doublewell.toarray()) U_doublewell = -np.log(pi_doublewell) .. code:: python trajs = [] for i in range(500): trajs.append(msmgen.generate_traj(T_doublewell.toarray(), random.randint(0,50), 500, dt=10)) for i in range(100): trajs.append(msmgen.generate_traj(T_doublewell.toarray(), random.randint(50,100), 500, dt=10)) .. code:: python figure(figsize=(8,4)) subplot2grid((1,4),(0,0),colspan=3) for traj in trajs: plot(range(len(traj)), traj) plt.xlabel('t') plt.ylabel('x') subplot2grid((1,4),(0,3)) plot(U_doublewell,range(len(U_doublewell)), linewidth=2, color='black') xlim([2,10]) plt.xlabel('Energy / kT') show() .. image:: estimation_files/estimation_32_0.png .. code:: python # define discretization def discretize(_trajs, _bounds): """discretizes the trajectories into two states split at point p""" _dtrajs = [] for _traj in _trajs: _dtraj = np.digitize(_traj,_bounds) _dtrajs.append(_dtraj) return _dtrajs .. code:: python # DECENT DISCRETIZATION def its_doublewell(lags,discretization): dtrajs_1 = discretize(trajs,discretization) nits = len(discretization) its = np.zeros((len(lags),nits)) for i in range(len(lags)): lag = lags[i] # count matrix Z = msmest.cmatrix(dtrajs_1, lag) # densify! Z = Z.toarray() # estimate T = msmest.transition_matrix(Z, reversible=True) # timescales its[i,:] = msmana.timescales(T, lag)[1:nits+1] return its .. code:: python # DECENT discretization lags = range(1,40) its2 = its_doublewell(lags,[50]) # GREAT discretization lags = range(1,40) its8 = its_doublewell(lags,[30,40,45,50,55,60,70]) .. code:: python subplot2grid((1,2),(0,0)) plot(lags,its2) subplot2grid((1,2),(0,1)) plot(lags,its8) show() .. image:: estimation_files/estimation_36_0.png .. code:: python # ANALYZE ERRORS!!!! Next - analyze errors, uncertainties