{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save your models in PyEMMA\n",
    "\n",
    "\n",
    "Most of the Estimators and Models in PyEMMA are serializable. If a given Estimator or Model can be saved to disk,\n",
    "it provides a **save** method. In this notebook we explain the basic concepts of file handling.\n",
    "\n",
    "We try our best to keep already saved data compatible with **future** versions. This means it should always be possible to load\n",
    "data with a newer version of the software, but not the reverse, e.g. you cannot load a model saved by a new version with an old version of PyEMMA.\n",
    "\n",
    "If you are interested in the technical background, go ahead and read the source code (there is actually not that much of it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyemma\n",
    "import numpy as np\n",
    "import os\n",
    "import pprint\n",
    "pyemma.config.mute = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete all saved data\n",
    "def rm_models():\n",
    "    import glob\n",
    "    for f in glob.glob('*.h5'):\n",
    "        os.unlink(f)\n",
    "rm_models()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate some artificial data with 10 states\n",
    "dtrajs = [np.random.randint(0, 10, size=10000) for _ in range(5)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n",
      "      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,\n",
      "      nsteps=3, reversible=True, show_progress=False, sparse=False,\n",
      "      statdist_constraint=None)\n"
     ]
    }
   ],
   "source": [
    "# estimate a Bayesian Markov state model\n",
    "bmsm = pyemma.msm.bayesian_markov_model(dtrajs, lag=10)\n",
    "print(bmsm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now save the estimator (which contains the model) to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now save our model\n",
    "bmsm.save('my_models.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now restore the model by simply calling the `pyemma.load` function with our file name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n",
       "      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,\n",
       "      nsteps=3, reversible=True, show_progress=False, sparse=False,\n",
       "      statdist_constraint=None)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pyemma.load('my_models.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we can save multiple models in one file. Because HDF5 acts like a file system, we have each model in a separate \"folder\", which is completely independent of the other models. We now change a parameter during estimation and save the estimator again in the same file, but in a different \"folder\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n",
      "      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,\n",
      "      nsteps=3, reversible=True, show_progress=False, sparse=False,\n",
      "      statdist_constraint=None)\n"
     ]
    }
   ],
   "source": [
    "bmsm.estimate(dtrajs, lag=100)\n",
    "print(bmsm)\n",
    "bmsm.save('my_models.h5', model_name='lag100')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Likewise, to restore the model stored under the new name, we have to pass that name to the load function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\n",
       "      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,\n",
       "      nsteps=3, reversible=True, show_progress=False, sparse=False,\n",
       "      statdist_constraint=None)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pyemma.load('my_models.h5', model_name='lag100')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you may have noticed, passing a model name is optional. For convenience, a model is saved under the model_name \"default\" if the argument is not provided. To check which models are contained in a file, we provide a command line tool named \"pyemma_list_models\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: pyemma_list_models [-h] [--json] [--recursive] [-v] files [files ...]\r\n",
      "pyemma_list_models: error: the following arguments are required: files\r\n"
     ]
    }
   ],
   "source": [
    "! pyemma_list_models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PyEMMA models\r\n",
      "=============\r\n",
      "\r\n",
      "file: my_models.h5\r\n",
      "--------------------------------------------------------------------------------\r\n",
      "1. name: default\r\n",
      "created: Thu Jan 11 17:17:27 2018\r\n",
      "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\r\n",
      "      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,\r\n",
      "      nsteps=3, reversible=True, show_progress=False, sparse=False,\r\n",
      "      statdist_constraint=None)\r\n",
      "2. name: lag100\r\n",
      "created: Thu Jan 11 17:17:28 2018\r\n",
      "BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',\r\n",
      "      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,\r\n",
      "      nsteps=3, reversible=True, show_progress=False, sparse=False,\r\n",
      "      statdist_constraint=None)\r\n",
      "--------------------------------------------------------------------------------\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "! pyemma_list_models my_models.h5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also check the list of already stored models directly in PyEMMA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "available models: dict_keys(['default', 'lag100'])\n",
      "--------------------------------------------------------------------------------\n",
      "detailed:\n",
      "{'default': {'class_repr': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n",
      "                           \"count_mode='effective',\\n\"\n",
      "                           \"      dt_traj='1 step', lag=10, \"\n",
      "                           \"mincount_connectivity='1/n', nsamples=100,\\n\"\n",
      "                           '      nsteps=3, reversible=True, '\n",
      "                           'show_progress=False, sparse=False,\\n'\n",
      "                           '      statdist_constraint=None)',\n",
      "             'class_str': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n",
      "                          \"count_mode='effective',\\n\"\n",
      "                          \"      dt_traj='1 step', lag=10, \"\n",
      "                          \"mincount_connectivity='1/n', nsamples=100,\\n\"\n",
      "                          '      nsteps=3, reversible=True, '\n",
      "                          'show_progress=False, sparse=False,\\n'\n",
      "                          '      statdist_constraint=None)',\n",
      "             'created': 1515687447.2736464,\n",
      "             'created_readable': 'Thu Jan 11 17:17:27 2018',\n",
      "             'digest': 'a6e8e71a1a070a2efa7229b648246c950cb7c7a31812b7337c11bc579e5027a5',\n",
      "             'pyemma_version': '2.4+903.gd2fd40c3.dirty',\n",
      "             'saved_streaming_chain': False},\n",
      " 'lag100': {'class_repr': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n",
      "                          \"count_mode='effective',\\n\"\n",
      "                          \"      dt_traj='1 step', lag=100, \"\n",
      "                          \"mincount_connectivity='1/n', nsamples=100,\\n\"\n",
      "                          '      nsteps=3, reversible=True, '\n",
      "                          'show_progress=False, sparse=False,\\n'\n",
      "                          '      statdist_constraint=None)',\n",
      "            'class_str': \"BayesianMSM(conf=0.95, connectivity='largest', \"\n",
      "                         \"count_mode='effective',\\n\"\n",
      "                         \"      dt_traj='1 step', lag=100, \"\n",
      "                         \"mincount_connectivity='1/n', nsamples=100,\\n\"\n",
      "                         '      nsteps=3, reversible=True, '\n",
      "                         'show_progress=False, sparse=False,\\n'\n",
      "                         '      statdist_constraint=None)',\n",
      "            'created': 1515687448.0572259,\n",
      "            'created_readable': 'Thu Jan 11 17:17:28 2018',\n",
      "            'digest': 'd8552e43f7a0081ef0a6126701ebb4bc820548624eeb34f012c06448b7b15203',\n",
      "            'pyemma_version': '2.4+903.gd2fd40c3.dirty',\n",
      "            'saved_streaming_chain': False}}\n"
     ]
    }
   ],
   "source": [
    "content = pyemma.list_models('my_models.h5')\n",
    "print(\"available models:\", content.keys())\n",
    "print(\"-\" * 80)\n",
    "print(\"detailed:\")\n",
    "pprint.pprint(content)"
   ]
  },
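  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionary returned by `pyemma.list_models` can also be condensed into a short overview. The following is a minimal sketch and not part of PyEMMA: `summarize_models` is a hypothetical helper, and it assumes that every entry carries the keys `created_readable`, `pyemma_version` and `saved_streaming_chain` seen in the output above. The `example` dictionary is a hand-written stand-in with that layout, so the cell runs on its own; in this notebook you would pass `content` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def summarize_models(models):\n",
    "    # Hypothetical helper (not part of PyEMMA): one summary line per stored model.\n",
    "    # Assumes the per-model keys seen in the pyemma.list_models output above.\n",
    "    lines = []\n",
    "    for name in sorted(models):\n",
    "        info = models[name]\n",
    "        chain = 'with input chain' if info['saved_streaming_chain'] else 'without input chain'\n",
    "        lines.append('{}: saved {} by PyEMMA {} ({})'.format(\n",
    "            name, info['created_readable'], info['pyemma_version'], chain))\n",
    "    return lines\n",
    "\n",
    "\n",
    "# Stand-in with the same layout as the output above; pass `content` in practice.\n",
    "example = {\n",
    "    'default': {'created_readable': 'Thu Jan 11 17:17:27 2018',\n",
    "                'pyemma_version': '2.4+903.gd2fd40c3.dirty',\n",
    "                'saved_streaming_chain': False},\n",
    "}\n",
    "for line in summarize_models(example):\n",
    "    print(line)"
   ]
  },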
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overwriting existing models is also possible, but we have to tell the save method explicitly that we want to overwrite."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "can not save: model \"default\" already exists. Either use overwrite=True, or use a different name/file.\n"
     ]
    }
   ],
   "source": [
    "# we now expect that we get a failure, because the model already exists in the file.\n",
    "try:\n",
    "    bmsm.save('my_models.h5')\n",
    "except RuntimeError as e:\n",
    "    print(\"can not save:\", e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "bmsm.save('my_models.h5', overwrite=True)"
   ]
  },
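  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If overwriting is not wanted, the `RuntimeError` shown above can instead be used to pick a fresh model name. This is a minimal sketch under stated assumptions: `save_under_free_name` is a hypothetical helper, not a PyEMMA function, and it relies only on `save` raising `RuntimeError` for an existing name, as observed above. `FakeEstimator` is a stand-in that mimics this name check so the cell runs without writing any file; with a real estimator one would call, for example, `save_under_free_name(bmsm, 'my_models.h5', 'lag100')`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def save_under_free_name(estimator, file_name, base_name, max_tries=10):\n",
    "    # Hypothetical helper (not part of PyEMMA): try base_name, base_name_1, ...\n",
    "    # and return the first name that save() accepts.\n",
    "    for i in range(max_tries):\n",
    "        name = base_name if i == 0 else '{}_{}'.format(base_name, i)\n",
    "        try:\n",
    "            estimator.save(file_name, model_name=name)\n",
    "            return name\n",
    "        except RuntimeError:\n",
    "            # Assumed to mean that the name is taken, as in the cell above.\n",
    "            continue\n",
    "    raise RuntimeError('no free model name found after {} tries'.format(max_tries))\n",
    "\n",
    "\n",
    "class FakeEstimator(object):\n",
    "    # Stand-in that only mimics the name check of save(); nothing is written.\n",
    "    def __init__(self):\n",
    "        self.names = set()\n",
    "\n",
    "    def save(self, file_name, model_name='default', overwrite=False):\n",
    "        if model_name in self.names and not overwrite:\n",
    "            raise RuntimeError('model already exists')\n",
    "        self.names.add(model_name)\n",
    "\n",
    "\n",
    "fake = FakeEstimator()\n",
    "first = save_under_free_name(fake, 'unused.h5', 'lag100')\n",
    "second = save_under_free_name(fake, 'unused.h5', 'lag100')\n",
    "print(first, second)"
   ]
  },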
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save Pipelines\n",
    "\n",
    "\n",
    "In PyEMMA coordinates one often has chains of Estimators, e.g. a reader followed by some transformations and finally a clustering.\n",
    "If you want to preserve this definition of the data flow, you can request this when saving."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "clustering: KmeansClustering(clustercenters=array([[-0.01797],\n",
      "       [ 0.02758],\n",
      "       [-0.04992],\n",
      "       [ 0.00428],\n",
      "       [ 0.04988],\n",
      "       [-0.03211],\n",
      "       [-0.00866],\n",
      "       [ 0.01354],\n",
      "       [ 0.05942],\n",
      "       [ 0.03767],\n",
      "       [-0.05957],\n",
      "       [-0.0401 ],\n",
      "       [-0.02445],\n",
      "       [-0.00027],\n",
      "       [ 0.03257],\n",
      "....04252],\n",
      "       [-0.03544],\n",
      "       [-0.02078],\n",
      "       [-0.02834],\n",
      "       [-0.05467]], dtype=float32),\n",
      "         fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,\n",
      "         max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,\n",
      "         oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)\n",
      "tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,\n",
      "   ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,\n",
      "   weights=None)\n",
      "source: DataInMemory(data=[array([[ 0.51702306,  0.70977755],\n",
      "       [ 0.3201417 ,  0.97838087],\n",
      "       [ 0.27219631,  0.04501168],\n",
      "       ..., \n",
      "       [ 0.22109836,  0.02939619],\n",
      "       [ 0.00912085,  0.66311234],\n",
      "       [ 0.69041553,  0.44581483]])], chunksize=262144)\n"
     ]
    }
   ],
   "source": [
    "# create some data, note that this in principle could also be a FeatureReader used to process MD data.\n",
    "data = np.random.random((1000, 2))\n",
    "from pyemma.coordinates import source, tica, cluster_kmeans\n",
    "\n",
    "reader = source(data)\n",
    "# use a separate name so that the imported tica function is not shadowed\n",
    "tica_obj = tica(reader, lag=10)\n",
    "clust = cluster_kmeans(tica_obj)\n",
    "\n",
    "print('clustering:', clust)\n",
    "print('tica:', tica_obj)\n",
    "print('source:', reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The argument \"save_streaming_chain\" controls whether the input chain of the object being saved is stored as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "clust.save('pipeline.h5', save_streaming_chain=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pyemma_list_models tool also shows the saved chain in a human-readable fashion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "11-01-18 17:17:33 pyemma.coordinates.clustering.kmeans.KmeansClustering[0] DEBUG    seed = 2868876893\r\n",
      "PyEMMA models\r\n",
      "=============\r\n",
      "\r\n",
      "file: pipeline.h5\r\n",
      "--------------------------------------------------------------------------------\r\n",
      "1. name: default\r\n",
      "created: Thu Jan 11 17:17:31 2018\r\n",
      "KmeansClustering(clustercenters=array([[-0.01797],\r\n",
      "       [ 0.02758],\r\n",
      "       [-0.04992],\r\n",
      "       [ 0.00428],\r\n",
      "       [ 0.04988],\r\n",
      "       [-0.03211],\r\n",
      "       [-0.00866],\r\n",
      "       [ 0.01354],\r\n",
      "       [ 0.05942],\r\n",
      "       [ 0.03767],\r\n",
      "       [-0.05957],\r\n",
      "       [-0.0401 ],\r\n",
      "       [-0.02445],\r\n",
      "       [-0.00027],\r\n",
      "       [ 0.03257],\r\n",
      "....04252],\r\n",
      "       [-0.03544],\r\n",
      "       [-0.02078],\r\n",
      "       [-0.02834],\r\n",
      "       [-0.05467]], dtype=float32),\r\n",
      "         fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,\r\n",
      "         max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,\r\n",
      "         oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)\r\n",
      "\r\n",
      "---------Input chain---------\r\n",
      "1. DataInMemory(data=[array([[ 0.51702306,  0.70977755],\r\n",
      "       [ 0.3201417 ,  0.97838087],\r\n",
      "       [ 0.27219631,  0.04501168],\r\n",
      "       ..., \r\n",
      "       [ 0.22109836,  0.02939619],\r\n",
      "       [ 0.00912085,  0.66311234],\r\n",
      "       [ 0.69041553,  0.44581483]])], chunksize=262144)\r\n",
      "2. TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,\r\n",
      "   ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,\r\n",
      "   weights=None)\r\n",
      "--------------------------------------------------------------------------------\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "! pyemma_list_models pipeline.h5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "clustering: KmeansClustering(clustercenters=array([[-0.01797],\n",
      "       [ 0.02758],\n",
      "       [-0.04992],\n",
      "       [ 0.00428],\n",
      "       [ 0.04988],\n",
      "       [-0.03211],\n",
      "       [-0.00866],\n",
      "       [ 0.01354],\n",
      "       [ 0.05942],\n",
      "       [ 0.03767],\n",
      "       [-0.05957],\n",
      "       [-0.0401 ],\n",
      "       [-0.02445],\n",
      "       [-0.00027],\n",
      "       [ 0.03257],\n",
      "....04252],\n",
      "       [-0.03544],\n",
      "       [-0.02078],\n",
      "       [-0.02834],\n",
      "       [-0.05467]], dtype=float32),\n",
      "         fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,\n",
      "         max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,\n",
      "         oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)\n",
      "tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,\n",
      "   ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,\n",
      "   weights=None)\n",
      "source: DataInMemory(data=[array([[ 0.51702306,  0.70977755],\n",
      "       [ 0.3201417 ,  0.97838087],\n",
      "       [ 0.27219631,  0.04501168],\n",
      "       ..., \n",
      "       [ 0.22109836,  0.02939619],\n",
      "       [ 0.00912085,  0.66311234],\n",
      "       [ 0.69041553,  0.44581483]])], chunksize=262144)\n"
     ]
    }
   ],
   "source": [
    "restored = pyemma.load('pipeline.h5')\n",
    "\n",
    "print('clustering:', restored)\n",
    "print('tica:', restored.data_producer)\n",
    "print('source:', restored.data_producer.data_producer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, we can access all elements of the pipeline via the data_producer attribute. In principle, we can assign these to variables again, change estimation parameters, and re-estimate parts of the pipeline."
   ]
  },
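  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following `data_producer` by hand becomes tedious for longer pipelines, so the walk can be wrapped in a small function. This is a minimal sketch: `input_chain` is a hypothetical helper, not part of PyEMMA, and it assumes that a chain ends either with `None` or with an object that names itself as its own producer. The `Stage` class is a stand-in for pipeline elements so that the cell runs on its own; in this notebook one would call `input_chain(restored)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def input_chain(estimator):\n",
    "    # Hypothetical helper (not part of PyEMMA): collect the objects reachable\n",
    "    # through data_producer, starting with the estimator itself.\n",
    "    chain = [estimator]\n",
    "    producer = getattr(estimator, 'data_producer', None)\n",
    "    # Stop at None or at an object that is its own producer\n",
    "    # (both are assumptions about how a chain may end).\n",
    "    while producer is not None and producer is not chain[-1]:\n",
    "        chain.append(producer)\n",
    "        producer = getattr(producer, 'data_producer', None)\n",
    "    return chain\n",
    "\n",
    "\n",
    "class Stage(object):\n",
    "    # Stand-in for a pipeline element.\n",
    "    def __init__(self, name, data_producer=None):\n",
    "        self.name = name\n",
    "        self.data_producer = data_producer\n",
    "\n",
    "\n",
    "reader_stub = Stage('source')\n",
    "tica_stub = Stage('tica', reader_stub)\n",
    "clust_stub = Stage('clustering', tica_stub)\n",
    "print([stage.name for stage in input_chain(clust_stub)])"
   ]
  },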
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This concludes the storage tutorial of PyEMMA. Happy saving!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}