{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Save your models in PyEMMA\n", "\n", "\n", "Most of the Estimators and Models in PyEMMA are serializable. If a given Estimator or Model can be saved to disk,\n", "it provides a **save** method. In this notebook we will explain the basic concepts of file handling.\n", "\n", "We try our best to provide **future** compatibility of already saved data. This means it should always be possible to load\n", "data with a newer version of the software, but not the reverse, e.g. you cannot load a model saved by a newer version with an older version of PyEMMA.\n", "\n", "If you are interested in the technical background, go ahead and read the source code (there is not much of it, actually)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pyemma\n", "import numpy as np\n", "import os\n", "import pprint\n", "pyemma.config.mute = True" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# delete all saved data\n", "def rm_models():\n", "    import glob\n", "    for f in glob.glob('*.h5'):\n", "        os.unlink(f)\n", "rm_models()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# generate some artificial data with 10 states\n", "dtrajs = [np.random.randint(0, 10, size=10000) for _ in range(5)]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n", " dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,\n", " nsteps=3, reversible=True, show_progress=False, sparse=False,\n", " statdist_constraint=None)\n" ] } ], "source": [ "# estimate a Bayesian Markov state model\n", "bmsm = pyemma.msm.bayesian_markov_model(dtrajs, lag=10)\n", "print(bmsm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now save the estimator 
(which contains the model) to disk." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# now save our model\n", "bmsm.save('my_models.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now restore the model by simply calling the pyemma.load function with our file name." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n", " dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,\n", " nsteps=3, reversible=True, show_progress=False, sparse=False,\n", " statdist_constraint=None)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyemma.load('my_models.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we can save multiple models in one file. Because HDF5 acts like a file system, each model lives in a separate \"folder\", which is completely independent of the other models. We now change a parameter during estimation and save the estimator again in the same file, but in a different \"folder\"." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n", " dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,\n", " nsteps=3, reversible=True, show_progress=False, sparse=False,\n", " statdist_constraint=None)\n" ] } ], "source": [ "bmsm.estimate(dtrajs, lag=100)\n", "print(bmsm)\n", "bmsm.save('my_models.h5', model_name='lag100')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, when we want to restore the model saved under the new name, we have to pass that name to the load function." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n", " dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,\n", " nsteps=3, reversible=True, show_progress=False, sparse=False,\n", " statdist_constraint=None)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyemma.load('my_models.h5', model_name='lag100')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you may have noticed, passing a model name is optional. For convenience, we save under the model_name \"default\" if the argument is not provided. To check which models are contained in a file, we provide a command line tool named \"pyemma_list_models\"." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: pyemma_list_models [-h] [--json] [--recursive] [-v] files [files ...]\r\n", "pyemma_list_models: error: the following arguments are required: files\r\n" ] } ], "source": [ "! pyemma_list_models" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PyEMMA models\r\n", "=============\r\n", "\r\n", "file: my_models.h5\r\n", "--------------------------------------------------------------------------------\r\n", "1. name: default\r\n", "created: Thu Jan 11 17:17:27 2018\r\n", "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\r\n", " dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,\r\n", " nsteps=3, reversible=True, show_progress=False, sparse=False,\r\n", " statdist_constraint=None)\r\n", "2. 
name: lag100\r\n", "created: Thu Jan 11 17:17:28 2018\r\n", "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\r\n", " dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,\r\n", " nsteps=3, reversible=True, show_progress=False, sparse=False,\r\n", " statdist_constraint=None)\r\n", "--------------------------------------------------------------------------------\r\n", "\r\n" ] } ], "source": [ "! pyemma_list_models my_models.h5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also check the list of already stored models directly in PyEMMA." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "available models: dict_keys(['default', 'lag100'])\n", "--------------------------------------------------------------------------------\n", "detailed:\n", "{'default': {'class_repr': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n", " \"count_mode='effective',\\n\"\n", " \" dt_traj='1 step', lag=10, \"\n", " \"mincount_connectivity='1/n', nsamples=100,\\n\"\n", " ' nsteps=3, reversible=True, '\n", " 'show_progress=False, sparse=False,\\n'\n", " ' statdist_constraint=None)',\n", " 'class_str': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n", " \"count_mode='effective',\\n\"\n", " \" dt_traj='1 step', lag=10, \"\n", " \"mincount_connectivity='1/n', nsamples=100,\\n\"\n", " ' nsteps=3, reversible=True, '\n", " 'show_progress=False, sparse=False,\\n'\n", " ' statdist_constraint=None)',\n", " 'created': 1515687447.2736464,\n", " 'created_readable': 'Thu Jan 11 17:17:27 2018',\n", " 'digest': 'a6e8e71a1a070a2efa7229b648246c950cb7c7a31812b7337c11bc579e5027a5',\n", " 'pyemma_version': '2.4+903.gd2fd40c3.dirty',\n", " 'saved_streaming_chain': False},\n", " 'lag100': {'class_repr': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n", " \"count_mode='effective',\\n\"\n", " \" dt_traj='1 step', lag=100, \"\n", " 
\"mincount_connectivity='1/n', nsamples=100,\\n\"\n", " ' nsteps=3, reversible=True, '\n", " 'show_progress=False, sparse=False,\\n'\n", " ' statdist_constraint=None)',\n", " 'class_str': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n", " \"count_mode='effective',\\n\"\n", " \" dt_traj='1 step', lag=100, \"\n", " \"mincount_connectivity='1/n', nsamples=100,\\n\"\n", " ' nsteps=3, reversible=True, '\n", " 'show_progress=False, sparse=False,\\n'\n", " ' statdist_constraint=None)',\n", " 'created': 1515687448.0572259,\n", " 'created_readable': 'Thu Jan 11 17:17:28 2018',\n", " 'digest': 'd8552e43f7a0081ef0a6126701ebb4bc820548624eeb34f012c06448b7b15203',\n", " 'pyemma_version': '2.4+903.gd2fd40c3.dirty',\n", " 'saved_streaming_chain': False}}\n" ] } ], "source": [ "content = pyemma.list_models('my_models.h5')\n", "print(\"available models:\", content.keys())\n", "print(\"-\" * 80)\n", "print(\"detailed:\")\n", "pprint.pprint(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overwriting existing models is also possible, but we have to tell the save method, that we want to overwrite." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "can not save: model \"default\" already exists. Either use overwrite=True, or use a different name/file.\n" ] } ], "source": [ "# we now expect that we get a failure, because the model already exists in the file.\n", "try:\n", " bmsm.save('my_models.h5')\n", "except RuntimeError as e:\n", " print(\"can not save:\", e)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "bmsm.save('my_models.h5', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save Pipelines\n", "\n", "\n", "In PyEMMA coordinates one often has chains of Estimators, eg. 
a reader followed by some transformations and finally a clustering.\n", "If you want to preserve this definition of the data flow, you can request it when saving." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clustering: KmeansClustering(clustercenters=array([[-0.01797],\n", " [ 0.02758],\n", " [-0.04992],\n", " [ 0.00428],\n", " [ 0.04988],\n", " [-0.03211],\n", " [-0.00866],\n", " [ 0.01354],\n", " [ 0.05942],\n", " [ 0.03767],\n", " [-0.05957],\n", " [-0.0401 ],\n", " [-0.02445],\n", " [-0.00027],\n", " [ 0.03257],\n", "....04252],\n", " [-0.03544],\n", " [-0.02078],\n", " [-0.02834],\n", " [-0.05467]], dtype=float32),\n", " fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,\n", " max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,\n", " oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)\n", "tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,\n", " ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,\n", " weights=None)\n", "source: DataInMemory(data=[array([[ 0.51702306, 0.70977755],\n", " [ 0.3201417 , 0.97838087],\n", " [ 0.27219631, 0.04501168],\n", " ..., \n", " [ 0.22109836, 0.02939619],\n", " [ 0.00912085, 0.66311234],\n", " [ 0.69041553, 0.44581483]])], chunksize=262144)\n" ] } ], "source": [ "# create some data; in principle this could also be a FeatureReader used to process MD data.\n", "data = np.random.random((1000, 2))\n", "from pyemma.coordinates import source, tica, cluster_kmeans\n", "\n", "reader = source(data)\n", "# use a separate name so the imported tica function is not shadowed\n", "tica_obj = tica(reader, lag=10)\n", "clust = cluster_kmeans(tica_obj)\n", "\n", "print('clustering:', clust)\n", "print('tica:', tica_obj)\n", "print('source:', reader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The setting \"save_streaming_chain\" controls whether the input chain of the object is saved along with it." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "clust.save('pipeline.h5', save_streaming_chain=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pyemma_list_models tool also shows the saved chain in a human-readable form." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11-01-18 17:17:33 pyemma.coordinates.clustering.kmeans.KmeansClustering[0] DEBUG seed = 2868876893\r\n", "PyEMMA models\r\n", "=============\r\n", "\r\n", "file: pipeline.h5\r\n", "--------------------------------------------------------------------------------\r\n", "1. name: default\r\n", "created: Thu Jan 11 17:17:31 2018\r\n", "KmeansClustering(clustercenters=array([[-0.01797],\r\n", " [ 0.02758],\r\n", " [-0.04992],\r\n", " [ 0.00428],\r\n", " [ 0.04988],\r\n", " [-0.03211],\r\n", " [-0.00866],\r\n", " [ 0.01354],\r\n", " [ 0.05942],\r\n", " [ 0.03767],\r\n", " [-0.05957],\r\n", " [-0.0401 ],\r\n", " [-0.02445],\r\n", " [-0.00027],\r\n", " [ 0.03257],\r\n", "....04252],\r\n", " [-0.03544],\r\n", " [-0.02078],\r\n", " [-0.02834],\r\n", " [-0.05467]], dtype=float32),\r\n", " fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,\r\n", " max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,\r\n", " oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)\r\n", "\r\n", "---------Input chain---------\r\n", "1. DataInMemory(data=[array([[ 0.51702306, 0.70977755],\r\n", " [ 0.3201417 , 0.97838087],\r\n", " [ 0.27219631, 0.04501168],\r\n", " ..., \r\n", " [ 0.22109836, 0.02939619],\r\n", " [ 0.00912085, 0.66311234],\r\n", " [ 0.69041553, 0.44581483]])], chunksize=262144)\r\n", "2. 
TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,\r\n", " ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,\r\n", " weights=None)\r\n", "--------------------------------------------------------------------------------\r\n", "\r\n" ] } ], "source": [ "! pyemma_list_models pipeline.h5" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clustering: KmeansClustering(clustercenters=array([[-0.01797],\n", " [ 0.02758],\n", " [-0.04992],\n", " [ 0.00428],\n", " [ 0.04988],\n", " [-0.03211],\n", " [-0.00866],\n", " [ 0.01354],\n", " [ 0.05942],\n", " [ 0.03767],\n", " [-0.05957],\n", " [-0.0401 ],\n", " [-0.02445],\n", " [-0.00027],\n", " [ 0.03257],\n", "....04252],\n", " [-0.03544],\n", " [-0.02078],\n", " [-0.02834],\n", " [-0.05467]], dtype=float32),\n", " fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,\n", " max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,\n", " oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)\n", "tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,\n", " ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,\n", " weights=None)\n", "source: DataInMemory(data=[array([[ 0.51702306, 0.70977755],\n", " [ 0.3201417 , 0.97838087],\n", " [ 0.27219631, 0.04501168],\n", " ..., \n", " [ 0.22109836, 0.02939619],\n", " [ 0.00912085, 0.66311234],\n", " [ 0.69041553, 0.44581483]])], chunksize=262144)\n" ] } ], "source": [ "restored = pyemma.load('pipeline.h5')\n", "\n", "print('clustering:', restored)\n", "print('tica:', restored.data_producer)\n", "print('source:', restored.data_producer.data_producer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, we can access all elements of the pipeline by the data_producer attribute. 
In principle we can just assign these to variables again and change estimation parameters and re-estimate parts of the pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This concludes the storage tutorial of PyEMMA. Happy saving!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }