{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03 - MSM estimation and validation\n", "\n", "\"Creative\n", "\n", "In this notebook, we will cover how to estimate a Markov state model (MSM) and do model validation;\n", "we also show how to save and restore model and estimator objects.\n", "For this notebook, you need to know how to do data loading/visualization\n", "([Notebook 01 ➜ πŸ““](01-data-io-and-featurization.ipynb))\n", "as well as dimension reduction ([Notebook 02 ➜ πŸ““](02-dimension-reduction-and-discretization.ipynb)).\n", "\n", "We further recommend to have a look at the literature, if you are new to the concept of Markov state models:\n", "- prinz-11\n", "- bowman-14\n", "- husic-18\n", "\n", "Maintainers: [@cwehmeyer](https://github.com/cwehmeyer), [@marscher](https://github.com/marscher), [@thempel](https://github.com/thempel), [@psolsson](https://github.com/psolsson)\n", "\n", "**Remember**:\n", "- to run the currently highlighted cell, hold ⇧ Shift and press ⏎ Enter;\n", "- to get help for a specific function, place the cursor within the function's brackets, hold ⇧ Shift, and press ⇥ Tab;\n", "- you can find the full documentation at [PyEMMA.org](http://www.pyemma.org)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import mdshare\n", "import pyemma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case 1: preprocessed, two-dimensional data (toy model)\n", "We load the two-dimensional trajectory from an archive using numpy and directly discretize the full space using $k$-means clustering:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')\n", "with np.load(file) as fh:\n", " data = fh['trajectory']\n", "\n", "cluster = pyemma.coordinates.cluster_kmeans(data, k=50, max_iter=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start with, we visualize the marginal and joint distributions of both components as well as the cluster centers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n", "pyemma.plots.plot_feature_histograms(data, feature_labels=['$x$', '$y$'], ax=axes[0])\n", "pyemma.plots.plot_density(*data.T, ax=axes[1], cbar=False, alpha=0.1)\n", "axes[1].scatter(*cluster.clustercenters.T, s=15, c='C1')\n", "axes[1].set_xlabel('$x$')\n", "axes[1].set_ylabel('$y$')\n", "axes[1].set_xlim(-4, 4)\n", "axes[1].set_ylim(-4, 4)\n", "axes[1].set_aspect('equal')\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step after obtaining the discretized dynamics is finding a suitable lag time.\n", "The systematic approach is to estimate MSMs at various lag times and observe how the implied timescales (ITSs) of these models behave.\n", "In particular, we are looking for lag time ranges in which the implied timescales are constant\n", "(i.e., lag time independent as described in the manuscript in Section 2.1).\n", "To this aim, PyEMMA provides the `its()` function which we use to track the first three (`nits=3`) implied timescales:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 3, 5, 7, 10], nits=3, errors='bayes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can pass the returned `its` object to the `pyemma.plots.plot_implied_timescales()` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pyemma.plots.plot_implied_timescales(its, ylog=False);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above plot tells us that there is one resolved process with an ITS of approximately $8.5$ steps (blue) which is largely invariant to the MSM lag time.\n", "The other two ITSs (green, red) are smaller than the lag time (black line, grey-shaded area);\n", "they correspond to processes which are faster than the lag time and, thus, are not resolved.\n", "Since the implied timescales are, like the corresponding eigenvalues, sorted in decreasing order,\n", "we know that all other remaining processes must be even faster.\n", "\n", "As MSMs tend to underestimate the true ITSs, we are looking for a converged maximum in the ITS plot.\n", "In our case, any lag time before the slow process (blue line) crosses the lag time threshold (black line) would work.\n", "To maximize the kinetic resolution, we choose the lag time $1$ step.\n", "\n", "To see whether our model satisfies Markovianity, we perform (and visualize) a Chapman-Kolmogorow (CK) test.\n", "Since we aim at modeling the dynamics between metastable states rather than between microstates, this will be conducted in the space of metastable states.\n", "The latter are identified automatically using PCCA++ (which is explained in [Notebook 05 πŸ““](05-pcca-tpt.ipynb)).\n", "We usually choose the number of metastable states according to the implied timescales plot by identifying a gap between the ITS.\n", "For a single process, we can assume that there are two metastable states between which the process occurs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=1)\n", "pyemma.plots.plot_cktest(msm.cktest(2));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see a perfect agreement between models estimated at higher lag times and predictions of the model at lag time $1$ step.\n", "Thus, we have estimated a valid MSM according to basic model validation.\n", "\n", "Should a CK test fail, it means that the dynamics in the space of metastable states is not Markovian.\n", "This can have multiple causes since it is the result of the combination of all steps in the pipeline.\n", "In practice, one would attempt to find a better model by tuning hyper-parameters such as the number of metastable states, the MSM lag time or the number of cluster centers.\n", "Back-tracking the error by following the pipeline in an upstream direction,\n", "i.e., by starting with the number of metastable states, is usually advised. \n", "\n", "A failing CK test might further hint at poor sampling.\n", "This case is explained in more detail in [Notebook 08 πŸ““](08-common-problems.ipynb#poorly_sampled_dw).\n", "\n", "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n", "We fetch the alanine dipeptide data set, load the backbone torsions into memory and directly discretize the full space using $k$-means clustering.\n", "In order to demonstrate how to adjust the MSM lag time,\n", "we will first set the number of cluster centers to $200$ and justify this choice later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')\n", "files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')\n", "\n", "feat = pyemma.coordinates.featurizer(pdb)\n", "feat.add_backbone_torsions(periodic=False)\n", "data = pyemma.coordinates.load(files, features=feat)\n", "data_concatenated = np.concatenate(data)\n", "\n", "cluster = pyemma.coordinates.cluster_kmeans(data, k=200, max_iter=50, stride=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the discrete trajectories, implied timescales can be estimated:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We visualize the marginal and joint distributions of both components as well as the cluster centers,\n", "and show the ITS convergence to help selecting a suitable lag time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 3, figsize=(12, 3))\n", "pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=['$\\Phi$', '$\\Psi$'], ax=axes[0])\n", "pyemma.plots.plot_density(*data_concatenated.T, ax=axes[1], cbar=False, alpha=0.1)\n", "axes[1].scatter(*cluster.clustercenters.T, s=15, c='C1')\n", "axes[1].set_xlabel('$\\Phi$')\n", "axes[1].set_ylabel('$\\Psi$')\n", "pyemma.plots.plot_implied_timescales(its, ax=axes[2], units='ps')\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe three resolved processes with flat ITS for a lag time of approximately $10$ ps.\n", "\n", "Please note though that this ITS convergence analysis is based on the assumption that $200$ $k$-means centers are sufficient to discretize the dynamics.\n", "In order to study the influence of the clustering on the ITS convergence,\n", "we repeat the clustering and ITS convergence analysis for various number of cluster centers.\n", "For the sake of simplicity, we will restrict ourselves to the $k$-means algorithm; alternative clustering methods are presented in [Notebook 02 ➜ πŸ““](02-dimension-reduction-and-discretization.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 3, figsize=(12, 6))\n", "for i, k in enumerate([20, 50, 100]):\n", " cluster = pyemma.coordinates.cluster_kmeans(data, k=k, max_iter=50, stride=10)\n", " pyemma.plots.plot_density(*data_concatenated.T, ax=axes[0, i], cbar=False, alpha=0.1)\n", " axes[0, i].scatter(*cluster.clustercenters.T, s=15, c='C1')\n", " axes[0, i].set_xlabel('$\\Phi$')\n", " axes[0, i].set_ylabel('$\\Psi$')\n", " axes[0, i].set_title('k = {} centers'.format(k))\n", " pyemma.plots.plot_implied_timescales(\n", " pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes'),\n", " ax=axes[1, i], units='ps')\n", " axes[1, i].set_ylim(1, 2000)\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see from this analysis that the ITS curves indeed converge towards the $200$ centers case and we can continue with estimating/validating an MSM.\n", "\n", "Before we continue with MSM estimation, let us discuss implied timescales convergence for large systems.\n", "Given sufficient sampling, the task is often to find a discretization that captures the process of interest well enough to obtain implied timescales that converge within the trajectory length. \n", "\n", "As we see in the above example with $k=20$ cluster centers,\n", "increasing the MSM lag time compensates for poor discretization to a certain extent.\n", "In a more realistic system, however, trajectories have a finite length that limits the choice of our MSM lag time.\n", "Furthermore, our clustering might be worse than the one presented above,\n", "so convergence might not be reached at all.\n", "Thus, we aim to converge the implied timescales at a low lag time by fine-tuning not only the number of cluster centers,\n", "but also feature selection and dimension reduction measures.\n", "This additionally ensures that our model has the maximum achievable temporal resolution.\n", "\n", "Please note that choosing an appropriate MSM lag time variationally\n", "(e.g., using VAMP scoring) is as far as we know not possible.\n", "\n", "Further details on how to account for poor discretization can be found in our notebook about hidden Markov models [Notebook 07 πŸ““](07-hidden-markov-state-models.ipynb).\n", "An example on how implied timescales behave in the limit of poor sampling is shown in [Notebook 08 πŸ““](08-common-problems.ipynb).\n", "\n", "Now, let's continue with the alanine dipeptide system.\n", "We estimate an MSM at lag time $10$ ps and, given that we have three slow processes, perform a CK test for four metastable states.\n", "\n", "⚠️ In general, the number of metastable states is a modeler's choice and will be explained in more detail in [Notebook 04 ➜ πŸ““](04-msm-analysis.ipynb) and [Notebook 07 ➜ πŸ““](07-hidden-markov-state-models.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=10, dt_traj='1 ps')\n", "pyemma.plots.plot_cktest(msm.cktest(4), units='ps');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model prediction and re-estimation are in quite good agreement but we do see some small deviations in the first row.\n", "\n", "To obtain error bars for the model prediction,\n", "we estimate a Bayesian MSM under the same conditions as the regular MSM and repeat the CK test for the Bayesian model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=10, dt_traj='1 ps', conf=0.95)\n", "pyemma.plots.plot_cktest(bayesian_msm.cktest(4), units='ps');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bayesian MSMs are an extension of regular maximum likelihood (ML) MSMs that represent a sample of (reversible) transition matrices.\n", "As presented here, they are usually used to compute confidence intervals.\n", "\n", "A regular MSM estimates a single transition matrix which maximizes the likelihood of the data given the model.\n", "Thus, all derived quantities are based on this ML estimation.\n", "A Bayesian MSM, in comparison, starts with a ML-MSM and samples transition matrices using a Monte Carlo scheme.\n", "Hence, target property posterior distributions can be estimated by computing these properties from each individual transition matrix in the sample. \n", "\n", "The initial ML-MSM used for the transition matrix sampling is contained in the `BayesianMSM` object with its properties accessible to the user.\n", "Please note that different default estimation parameters might yield results that numerically differ from a directly estimated ML-MSM.\n", "\n", "In the case of the low dimensional molecular dynamics data, we thus observe that the deviations are within a $95\\%$ confidence interval.\n", "\n", "### Persisting and restoring estimators\n", "\n", "Because some estimations we have performed so far require more computational effort (e.g., TICA or kmeans with lots of centers),\n", "it could be desirable to persist the resulting models in a file.\n", "Luckily, PyEMMA provides a convenience method for this.\n", "Just try it out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cluster.save('nb3.pyemma', model_name='kmeans_k200')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have stored the current state of the clustering estimator to disk.\n", "A file can contain multiple models, this is why we have used the `model_name` argument to specify the name.\n", "If omitted, the estimator will be saved under the name `default_model`.\n", "\n", "Assume that we have restarted our Python session and do not want to re-compute everything.\n", "We can now restore the previously saved estimator via" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cluster_restored = pyemma.load('nb3.pyemma', model_name='kmeans_k200')\n", "\n", "# check that nothing has changed\n", "np.testing.assert_allclose(cluster_restored.clustercenters, cluster.clustercenters, atol=1e-15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the contents of a file, you can utilize the list_models function of PyEMMA:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pyemma.list_models('nb3.pyemma')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we now remove this files again\n", "import os\n", "os.unlink('nb3.pyemma')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, all important attributes of an estimator will be stored.\n", "PyEMMA provides future compatibility of stored estimators,\n", "which means that you can always load your files in a new version, but are then restricted to not using older ones.\n", "\n", "#### Exercise 1\n", "\n", "Load the heavy atom distances into memory, perform PCA and TICA (`lag=3`) with `dim=2`,\n", "then discretize with $100$ $k$-means centers and a stride of $10$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "solution2": "hidden", "solution2_first": true }, "outputs": [], "source": [ "feat = #FIXME\n", "feat. #FIXME\n", "data = #FIXME\n", "\n", "pca = pyemma.coordinates.pca(data, dim=2)\n", "tica = #FIXME\n", "\n", "pca_concatenated = np.concatenate(pca.get_output())\n", "tica_concatenated = #FIXME\n", "\n", "cls_pca = pyemma.coordinates.cluster_kmeans(pca, k=100, max_iter=50, stride=10)\n", "cls_tica = #FIXME\n", "\n", "its_pca = pyemma.msm.its(\n", " cls_pca.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')\n", "its_tica = #FIXME" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden" }, "source": [ "###### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "feat = pyemma.coordinates.featurizer(pdb)\n", "pairs = feat.pairs(feat.select_Heavy())\n", "feat.add_distances(pairs, periodic=False)\n", "data = pyemma.coordinates.load(files, features=feat)\n", "\n", "pca = pyemma.coordinates.pca(data, dim=2)\n", "tica = pyemma.coordinates.tica(data, lag=3, dim=2)\n", "\n", "pca_concatenated = np.concatenate(pca.get_output())\n", "tica_concatenated = np.concatenate(tica.get_output())\n", "\n", "cls_pca = pyemma.coordinates.cluster_kmeans(pca, k=100, max_iter=50, stride=10)\n", "cls_tica = pyemma.coordinates.cluster_kmeans(tica, k=100, max_iter=50, stride=10)\n", "\n", "its_pca = pyemma.msm.its(\n", " cls_pca.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')\n", "its_tica = pyemma.msm.its(\n", " cls_tica.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the ITS convergence for both projections:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 3, figsize=(12, 6))\n", "pyemma.plots.plot_feature_histograms(pca_concatenated, ax=axes[0, 0])\n", "pyemma.plots.plot_feature_histograms(tica_concatenated, ax=axes[1, 0])\n", "axes[0, 0].set_title('PCA')\n", "axes[1, 0].set_title('TICA')\n", "pyemma.plots.plot_density(*pca_concatenated.T, ax=axes[0, 1], cbar=False, alpha=0.1)\n", "axes[0, 1].scatter(*cls_pca.clustercenters.T, s=15, c='C1')\n", "axes[0, 1].set_xlabel('PC 1')\n", "axes[0, 1].set_ylabel('PC 2')\n", "pyemma.plots.plot_density(*tica_concatenated.T, ax=axes[1, 1], cbar=False, alpha=0.1)\n", "axes[1, 1].scatter(*cls_tica.clustercenters.T, s=15, c='C1')\n", "axes[1, 1].set_xlabel('IC 1')\n", "axes[1, 1].set_ylabel('IC 2')\n", "pyemma.plots.plot_implied_timescales(its_pca, ax=axes[0, 2], units='ps')\n", "pyemma.plots.plot_implied_timescales(its_tica, ax=axes[1, 2], units='ps')\n", "axes[0, 2].set_ylim(1, 2000)\n", "axes[1, 2].set_ylim(1, 2000)\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Despite the fact that PCA yields a projection with some defined basins,\n", "the ITS plot shows that only one \"slow\" process is resolved which is more than one order of magnitude too fast.\n", "\n", "TICA does find three slow processes which agree (in terms of the implied timescales) with the backbone torsions example above.\n", "\n", "We conclude that this PCA projection is not suitable to resolve the slow dynamics of alanine dipeptide and we will continue to estimate/validate the TICA-based projection.\n", "\n", "#### Exercise 2\n", "\n", "Estimate a Bayesian MSM at lag time $10$ ps and perform/show a CK test for four metastable states." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "solution2": "hidden", "solution2_first": true }, "outputs": [], "source": [ "bayesian_msm = pyemma.msm.bayesian_markov_model(cls_tica.dtrajs, lag=10, dt_traj='1 ps')\n", "pyemma.plots. #FIXME" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden" }, "source": [ "###### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "bayesian_msm = pyemma.msm.bayesian_markov_model(cls_tica.dtrajs, lag=10, dt_traj='1 ps')\n", "pyemma.plots.plot_cktest(bayesian_msm.cktest(4), units='ps');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We again see a good agreement between model prediction and re-estimation.\n", "\n", "## Wrapping up\n", "In this notebook, we have learned how to estimate a regular or Bayesian MSM from discretized molecular simulation data with `pyemma` and how to perform basic model validation.\n", "In detail, we have selected a suitable lag time by using\n", "- `pyemma.msm.its()` to obtain an implied timescale object and\n", "- `pyemma.plots.plot_implied_timescales()` to visualize the convergence of the implied timescales.\n", "\n", "We then have used\n", "- `pyemma.msm.estimate_markov_model()` to estimate a regular MSM,\n", "- `pyemma.msm.bayesian_markov_model()` to estimate a Bayesian MSM,\n", "- the `timescales()` method of an estimated MSM object to access its implied timescales,\n", "- the `cktest()` method of an estimated MSM object to perform a Chapman-Kolmogorow test, and\n", "- `pyemma.plots.plot_cktest()` to visualize the latter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "[^]Prinz, Jan-Hendrik and Wu, Hao and Sarich, Marco and Keller, Bettina and Senne, Martin and Held, Martin and Chodera, John D. and SchΓΌtte, Christof and NoΓ©, Frank. 2011. _Markov models of molecular kinetics: Generation and validation_. [URL](http://scitation.aip.org/content/aip/journal/jcp/134/17/10.1063/1.3565032)\n", "\n", "[^]Gregory R. Bowman and Vijay S. Pande and Frank NoΓ©. 2014. _An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation_. [URL](https://doi.org/10.1007%2F978-94-007-7606-7)\n", "\n", "[^]Brooke E. Husic and Vijay S. Pande. 2018. _Markov State Models: From an Art to a Science_.\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }