Save your models in PyEMMA

Most of the Estimators and Models in PyEMMA are serializable. If a given Estimator or Model can be saved to disk, it provides a save method. In this notebook we will explain the basic concepts of file handling.

We try our best to provide future compatibility for already saved data. This means it should always be possible to load data with a newer version of the software, but not the reverse, e.g. loading a model saved by a new version with an old version of PyEMMA.

If you are interested in the technical background, go ahead and read the source code (it is actually not that much).

[1]:
import pyemma
import numpy as np
import os
import pprint
pyemma.config.mute = True
[2]:
# delete all saved data
def rm_models():
    import glob
    for f in glob.glob('*.h5'):
        os.unlink(f)
rm_models()
[3]:
# generate some artificial data with 10 states
dtrajs = [np.random.randint(0, 10, size=10000) for _ in range(5)]
[4]:
# estimate a Bayesian Markov state model
bmsm = pyemma.msm.bayesian_markov_model(dtrajs, lag=10)
print(bmsm)
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)

We can now save the estimator (which contains the model) to disk.

[5]:
# now save our model
bmsm.save('my_models.h5')

We can now restore the model by simply invoking the pyemma.load function with our file name.

[6]:
pyemma.load('my_models.h5')
[6]:
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)
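
To convince ourselves that the round trip is lossless, we can compare a derived quantity of the original and the restored estimator, e.g. the transition matrix. A small sketch (any estimated attribute would do equally well):

[ ]:
restored_bmsm = pyemma.load('my_models.h5')
# the restored model should reproduce the original transition matrix exactly
np.testing.assert_array_equal(bmsm.transition_matrix, restored_bmsm.transition_matrix)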

Note that we can save multiple models in one file. Because HDF5 acts like a file system, we have each model in a separate “folder”, which is completely independent of the other models. We now change a parameter during estimation and save the estimator again in the same file, but in a different “folder”.

[7]:
bmsm.estimate(dtrajs, lag=100)
print(bmsm)
bmsm.save('my_models.h5', model_name='lag100')
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)

Likewise, when we want to restore the model under the new name, we have to pass that name to the load function.

[8]:
pyemma.load('my_models.h5', model_name='lag100')
[8]:
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)

As you may have noted, there is no need to pass a model name. For convenience, models are saved under the model_name “default” if the argument is not provided. To check which models are contained in a file, we provide a command line tool named “pyemma_list_models”.

[9]:
! pyemma_list_models
usage: pyemma_list_models [-h] [--json] [--recursive] [-v] files [files ...]
pyemma_list_models: error: the following arguments are required: files
[10]:
! pyemma_list_models my_models.h5
PyEMMA models
=============

file: my_models.h5
--------------------------------------------------------------------------------
1. name: default
created: Thu Jan 11 17:17:27 2018
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)
2. name: lag100
created: Thu Jan 11 17:17:28 2018
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)
--------------------------------------------------------------------------------

You can also inspect the list of stored models directly from within PyEMMA.

[11]:
content = pyemma.list_models('my_models.h5')
print("available models:", content.keys())
print("-" * 80)
print("detailed:")
pprint.pprint(content)
available models: dict_keys(['default', 'lag100'])
--------------------------------------------------------------------------------
detailed:
{'default': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                           "count_mode='effective',\n"
                           "      dt_traj='1 step', lag=10, "
                           "mincount_connectivity='1/n', nsamples=100,\n"
                           '      nsteps=3, reversible=True, '
                           'show_progress=False, sparse=False,\n'
                           '      statdist_constraint=None)',
             'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                          "count_mode='effective',\n"
                          "      dt_traj='1 step', lag=10, "
                          "mincount_connectivity='1/n', nsamples=100,\n"
                          '      nsteps=3, reversible=True, '
                          'show_progress=False, sparse=False,\n'
                          '      statdist_constraint=None)',
             'created': 1515687447.2736464,
             'created_readable': 'Thu Jan 11 17:17:27 2018',
             'digest': 'a6e8e71a1a070a2efa7229b648246c950cb7c7a31812b7337c11bc579e5027a5',
             'pyemma_version': '2.4+903.gd2fd40c3.dirty',
             'saved_streaming_chain': False},
 'lag100': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                          "count_mode='effective',\n"
                          "      dt_traj='1 step', lag=100, "
                          "mincount_connectivity='1/n', nsamples=100,\n"
                          '      nsteps=3, reversible=True, '
                          'show_progress=False, sparse=False,\n'
                          '      statdist_constraint=None)',
            'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                         "count_mode='effective',\n"
                         "      dt_traj='1 step', lag=100, "
                         "mincount_connectivity='1/n', nsamples=100,\n"
                         '      nsteps=3, reversible=True, '
                         'show_progress=False, sparse=False,\n'
                         '      statdist_constraint=None)',
            'created': 1515687448.0572259,
            'created_readable': 'Thu Jan 11 17:17:28 2018',
            'digest': 'd8552e43f7a0081ef0a6126701ebb4bc820548624eeb34f012c06448b7b15203',
            'pyemma_version': '2.4+903.gd2fd40c3.dirty',
            'saved_streaming_chain': False}}
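
Since list_models returns a plain dictionary, this metadata can also be used programmatically, e.g. to check which PyEMMA version wrote each model before loading it. A small sketch using only the keys shown above:

[ ]:
# print name, creation date and saving version for every stored model
for name, meta in content.items():
    print(name, '| saved with PyEMMA', meta['pyemma_version'],
          '| created:', meta['created_readable'])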

Overwriting existing models is also possible, but we have to tell the save method explicitly that we want to overwrite.

[12]:
# we expect a failure here, because the model already exists in the file
try:
    bmsm.save('my_models.h5')
except RuntimeError as e:
    print("can not save:", e)
can not save: model "default" already exists. Either use overwrite=True, or use a different name/file.
[13]:
bmsm.save('my_models.h5', overwrite=True)
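
Alternatively, as the error message suggests, we could have left the existing model untouched by saving under a different name or into a different file (the names below are hypothetical):

[ ]:
# both calls leave the existing 'default' model untouched
bmsm.save('my_models.h5', model_name='lag100_run2')  # different model name, same file
bmsm.save('my_models_backup.h5')                     # different file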

Save Pipelines

In pyemma.coordinates one often has chains of Estimators, e.g. a reader followed by some transformations and finally a clustering. If you want to preserve this definition of the data flow, you can request this when saving.

[14]:
# create some data; note that in principle this could also be a FeatureReader processing MD data
data = np.random.random((1000, 2))
from pyemma.coordinates import source, tica, cluster_kmeans

reader = source(data)
tica_estimator = tica(reader, lag=10)  # bind to a new name to avoid shadowing the tica function
clust = cluster_kmeans(tica_estimator)

print('clustering:', clust)
print('tica:', tica_estimator)
print('source:', reader)
clustering: KmeansClustering(clustercenters=array([[-0.01797],
       [ 0.02758],
       [-0.04992],
       [ 0.00428],
       [ 0.04988],
       [-0.03211],
       [-0.00866],
       [ 0.01354],
       [ 0.05942],
       [ 0.03767],
       [-0.05957],
       [-0.0401 ],
       [-0.02445],
       [-0.00027],
       [ 0.03257],
       ...,
       [-0.03544],
       [-0.02078],
       [-0.02834],
       [-0.05467]], dtype=float32),
         fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,
         max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
         oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
   ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
   weights=None)
source: DataInMemory(data=[array([[ 0.51702306,  0.70977755],
       [ 0.3201417 ,  0.97838087],
       [ 0.27219631,  0.04501168],
       ...,
       [ 0.22109836,  0.02939619],
       [ 0.00912085,  0.66311234],
       [ 0.69041553,  0.44581483]])], chunksize=262144)

The setting “save_streaming_chain” controls whether the input chain of the object being saved is stored as well.

[15]:
clust.save('pipeline.h5', save_streaming_chain=True)

The list models tool will also show the saved chain in a human-readable fashion.

[16]:
! pyemma_list_models pipeline.h5
11-01-18 17:17:33 pyemma.coordinates.clustering.kmeans.KmeansClustering[0] DEBUG    seed = 2868876893
PyEMMA models
=============

file: pipeline.h5
--------------------------------------------------------------------------------
1. name: default
created: Thu Jan 11 17:17:31 2018
KmeansClustering(clustercenters=array([[-0.01797],
       [ 0.02758],
       [-0.04992],
       [ 0.00428],
       [ 0.04988],
       [-0.03211],
       [-0.00866],
       [ 0.01354],
       [ 0.05942],
       [ 0.03767],
       [-0.05957],
       [-0.0401 ],
       [-0.02445],
       [-0.00027],
       [ 0.03257],
       ...,
       [-0.03544],
       [-0.02078],
       [-0.02834],
       [-0.05467]], dtype=float32),
         fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,
         max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
         oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)

---------Input chain---------
1. DataInMemory(data=[array([[ 0.51702306,  0.70977755],
       [ 0.3201417 ,  0.97838087],
       [ 0.27219631,  0.04501168],
       ...,
       [ 0.22109836,  0.02939619],
       [ 0.00912085,  0.66311234],
       [ 0.69041553,  0.44581483]])], chunksize=262144)
2. TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
   ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
   weights=None)
--------------------------------------------------------------------------------

[17]:
restored = pyemma.load('pipeline.h5')

print('clustering:', restored)
print('tica:', restored.data_producer)
print('source:', restored.data_producer.data_producer)
clustering: KmeansClustering(clustercenters=array([[-0.01797],
       [ 0.02758],
       [-0.04992],
       [ 0.00428],
       [ 0.04988],
       [-0.03211],
       [-0.00866],
       [ 0.01354],
       [ 0.05942],
       [ 0.03767],
       [-0.05957],
       [-0.0401 ],
       [-0.02445],
       [-0.00027],
       [ 0.03257],
       ...,
       [-0.03544],
       [-0.02078],
       [-0.02834],
       [-0.05467]], dtype=float32),
         fixed_seed=2868876893, init_strategy='kmeans++', keep_data=False,
         max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
         oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
   ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
   weights=None)
source: DataInMemory(data=[array([[ 0.51702306,  0.70977755],
       [ 0.3201417 ,  0.97838087],
       [ 0.27219631,  0.04501168],
       ...,
       [ 0.22109836,  0.02939619],
       [ 0.00912085,  0.66311234],
       [ 0.69041553,  0.44581483]])], chunksize=262144)

As you can see, we can access all elements of the pipeline via the data_producer attribute. In principle, we can assign these to variables again, change estimation parameters, and re-estimate parts of the pipeline.
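
A minimal sketch of that idea, assuming the restored TICA stage accepts a new lag time and can be re-estimated with the estimate method used elsewhere in PyEMMA:

[ ]:
# grab the restored pipeline stages via the data_producer attribute
tica_restored = restored.data_producer
source_restored = tica_restored.data_producer

# change an estimation parameter (assuming lag is settable, as in the estimator API)
# and re-fit on the original data source
tica_restored.lag = 20
tica_restored.estimate(source_restored)
print(tica_restored)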

This concludes the storage tutorial of PyEMMA. Happy saving!