pyemma.coordinates.clustering.MiniBatchKmeansClustering¶
class pyemma.coordinates.clustering.MiniBatchKmeansClustering(*args, **kwargs)¶
Mini-batch k-means clustering.
__init__(n_clusters, max_iter=5, metric='euclidean', tolerance=1e-05, init_strategy='kmeans++', batch_size=0.2, oom_strategy='memmap', fixed_seed=False, stride=None, n_jobs=None, skip=0, clustercenters=None, keep_data=False)¶ K-means clustering
- Parameters
n_clusters (int) – number of cluster centers. When not specified (None), min(sqrt(N), 5000) is chosen as the default value, where N denotes the number of data points
max_iter (int) – maximum number of iterations before stopping.
tolerance (float) – stop iteration when the relative change in the cost function
\[C(S) = \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2\]
is smaller than tolerance.
metric (str) – metric to use during clustering (‘euclidean’, ‘minRMSD’)
init_strategy (string) – can be either ‘kmeans++’ or ‘uniform’, determining how the initial cluster centers are being chosen
fixed_seed (bool or int) – if True, the seed is fixed to 42; otherwise time-based seeding is used. If an integer is given, it is used to initialize the random generator.
oom_strategy (string, default='memmap') – how to deal with an out-of-memory situation during accumulation of all data.
- 'memmap': if no memory is available to store all data, a memory-mapped file is created and written to.
- 'raise': raise an OutOfMemory exception.
stride (int) – only process every stride-th frame of the input data.
n_jobs (int or None, default None) – Number of threads to use during assignment of the data. If None, all available CPUs will be used.
clustercenters (None or array(k, dim)) – This is used to resume the k-means iteration. Note that if this is set, the init_strategy is ignored and the centers are directly passed to the k-means iteration algorithm.
keep_data (boolean, default False) – If you intend to resume the kmeans iteration later on, in case it did not converge, this parameter controls whether the input data is kept in memory or not.
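The update loop that these parameters control can be sketched in plain NumPy. This is an illustrative re-implementation, not PyEMMA's actual code: the uniform initialization, the function name, and treating batch_size as a fraction of the data are simplifying assumptions.

```python
import numpy as np

def minibatch_kmeans(X, n_clusters, max_iter=5, batch_size=0.2, tolerance=1e-5, seed=42):
    """Illustrative mini-batch k-means: draw a random batch per iteration,
    move each center to the mean of its assigned batch points, and stop when
    the relative change of the cost C(S) falls below tolerance."""
    rng = np.random.default_rng(seed)
    n = len(X)
    b = max(1, int(batch_size * n))                    # batch_size as a fraction of the data
    centers = X[rng.choice(n, n_clusters, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iter):
        batch = X[rng.choice(n, b, replace=False)]
        # distances batch -> centers, then nearest-center assignment
        d = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            members = batch[labels == k]
            if len(members):                           # leave empty clusters untouched
                centers[k] = members.mean(axis=0)
        cost = (d[np.arange(b), labels] ** 2).sum()
        if cost == 0 or abs(prev_cost - cost) / cost < tolerance:
            break
        prev_cost = cost
    return centers
```

The small default max_iter=5 reflects that each pass only sees a batch; convergence is instead governed by the tolerance test on the cost function above.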
Methods
_Loggable__create_logger()
_ProgressReporterMixin__check_stage_registered(stage)
_SerializableMixIn__interpolate(state, klass)
__delattr__(name, /) – Implement delattr(self, name).
__dir__() – Default dir() implementation.
__eq__(value, /) – Return self==value.
__format__(format_spec, /) – Default object formatter.
__ge__(value, /) – Return self>=value.
__getattribute__(name, /) – Return getattr(self, name).
__getstate__()
__gt__(value, /) – Return self>value.
__hash__() – Return hash(self).
__init__(n_clusters[, max_iter, metric, …]) – K-means clustering
__init_subclass__(*args, **kwargs) – This method is called when a class is subclassed.
__iter__()
__le__(value, /) – Return self<=value.
__lt__(value, /) – Return self<value.
__my_getstate__()
__my_setstate__(state)
__ne__(value, /) – Return self!=value.
__new__(cls, *args, **kwargs) – Create and return a new object.
__reduce__() – Helper for pickle.
__reduce_ex__(protocol, /) – Helper for pickle.
__repr__() – Return repr(self).
__setattr__(name, value, /) – Implement setattr(self, name, value).
__setstate__(state)
__sizeof__() – Size of object in memory, in bytes.
__str__() – Return str(self).
__subclasshook__ – Abstract classes can override this to customize issubclass().
_check_estimated()
_check_resume_iteration()
_chunk_finite(data)
_cleanup_logger(logger_id, logger_name)
_clear_in_memory()
_collect_data(X, first_chunk, last_chunk)
_compute_default_cs(dim, itemsize[, logger])
_create_iterator([skip, chunk, stride, …]) – Should be implemented by non-abstract subclasses.
_data_flow_chain() – Get a list of all elements in the data flow graph.
_draw_mini_batch_sample()
_estimate(iterable, **kw)
_finish_estimate()
_get_classes_to_inspect() – gets the classes self derives from.
_get_interpolation_map(cls)
_get_model_param_names() – Get parameter names for the model.
_get_param_names() – Get parameter names for the estimator.
_get_private_field(cls, name[, default])
_get_serialize_fields(cls)
_get_state_of_serializeable_fields(klass, state) – return a dictionary {k: v} for k in self.serialize_fields and v=getattr(self, k)
_get_traj_info(filename)
_get_version(cls[, require])
_get_version_for_class_from_state(state, klass) – retrieves the version of the given klass from the state, mapping old locations to new ones.
_init_estimate()
_init_in_memory_chunks(size)
_initialize_centers(X, itraj, t, last_chunk)
_logger_is_active(level) – @param level: int log level (debug=10, info=20, warn=30, error=40, critical=50)
_map_to_memory([stride]) – Maps results to memory.
_progress_context([stage]) – param stage
_progress_force_finish([stage, description]) – forcefully finish the progress for the given stage.
_progress_register(amount_of_work[, …]) – Registers a progress which can be reported/displayed via a progress bar.
_progress_set_description(stage, description) – set description of an already existing progress.
_progress_update(numerator_increment[, …]) – Updates the progress.
_set_random_access_strategies()
_set_state_from_serializeable_fields_and_state(…) – set only fields from state which are present in klass.__serialize_fields
_source_from_memory([data_producer])
_transform_array(X) – get closest index of point in clustercenters to x.
assign([X, stride]) – Assigns the given trajectory or list of trajectories to cluster centers by using the discretization defined by this clustering method (usually a Voronoi tessellation).
describe() – Get a descriptive string representation of this class.
dimension() – output dimension of clustering algorithm (always 1).
estimate(X, **kwargs) – Estimates the model given the data X.
fit(X[, y]) – Estimates parameters, for compatibility with sklearn.
fit_predict(X[, y]) – Performs clustering on X and returns cluster labels.
fit_transform(X[, y]) – Fit to data, then transform it.
get_model_params([deep]) – Get parameters for this model.
get_output([dimensions, stride, skip, chunk]) – Maps all input data of this transformer and returns it as an array or list of arrays.
get_params([deep]) – Get parameters for this estimator.
iterator([stride, lag, chunk, …]) – creates an iterator to stream over the (transformed) data.
load(file_name[, model_name]) – Loads a previously saved PyEMMA object from disk.
n_chunks(chunksize[, stride, skip]) – how many chunks an iterator of this source will output.
n_frames_total([stride, skip]) – Returns total number of frames.
number_of_trajectories([stride]) – Returns the number of trajectories.
output_type() – By default transformers return single precision floats.
sample_indexes_by_cluster(clusters, nsample) – Samples trajectory/time indexes according to the given sequence of states.
save(file_name[, model_name, overwrite, …]) – saves the current state of this object to the given file and name.
save_dtrajs([trajfiles, prefix, output_dir, …]) – saves calculated discrete trajectories.
set_model_params(clustercenters)
set_params(**params) – Set the parameters of this estimator.
trajectory_length(itraj[, stride, skip]) – Returns the length of the trajectory of the requested index.
trajectory_lengths([stride, skip]) – Returns the length of each trajectory.
transform(X) – Maps the input data through the transformer to a correspondingly shaped output data array/list.
update_model_params(**params) – Update given model parameters if they are set to specific values.
write_to_csv([filename, extension, …]) – write all data to csv with numpy.savetxt.
write_to_hdf5(filename[, group, …]) – writes all data of this Iterable to a given HDF5 file.
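The central discretization step behind assign()/transform() and the dtrajs it produces is, for the 'euclidean' metric, a nearest-center (Voronoi) lookup. A minimal NumPy sketch of that mapping (the function name is hypothetical, not part of the PyEMMA API):

```python
import numpy as np

def assign_to_centers(X, centers):
    """Map each frame of X to the index of its nearest cluster center,
    i.e. the Voronoi discretization used to build discrete trajectories."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1).astype(np.int32)
```

A discrete trajectory is exactly such an integer array, one per input trajectory, with one cluster index per frame.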
Attributes
_AbstractClustering__serialize_fields
_AbstractClustering__serialize_version
_DataSource__serialize_fields
_Estimator__serialize_fields
_FALLBACK_CHUNKSIZE
_InMemoryMixin__serialize_fields
_InMemoryMixin__serialize_version
_KmeansClustering__serialize_fields
_KmeansClustering__serialize_version
_Loggable__ids
_Loggable__refs
_MiniBatchKmeansClustering__serialize_version
_SerializableMixIn__serialize_fields
_SerializableMixIn__serialize_modifications_map
_SerializableMixIn__serialize_version
__abstractmethods__
__dict__
__doc__
__module__
__weakref__ – list of weak references to the object (if defined)
_abc_impl
_estimated
_estimator_type
_loglevel_CRITICAL
_loglevel_DEBUG
_loglevel_ERROR
_loglevel_INFO
_loglevel_WARN
_pg_threshold
_prog_rep_callbacks
_prog_rep_descriptions
_prog_rep_progressbars
_progress_num_registered
_progress_registered_stages
_save_data_producer
_serialize_version
chunksize – chunksize defines how much data is being processed at once.
cluster_centers_ – Array containing the coordinates of the calculated cluster centers.
clustercenters – Array containing the coordinates of the calculated cluster centers.
converged
data_producer – The data producer for this data source object (can be another data source object).
default_chunksize – How much data will be processed at once, in case no chunksize has been provided.
dtrajs – Discrete trajectories (assigned data to cluster centers).
filenames – list of file names the data is originally being read from.
fixed_seed – seed for random choice of initial cluster centers.
in_memory – are results stored in memory?
index_clusters – Returns trajectory/time indexes for all the clusters.
init_strategy – Strategy to get an initial guess for the centers.
is_random_accessible – Check if self._is_random_accessible is set to true and if all the random access strategies are implemented.
is_reader – Property telling if this data source is a reader or not.
logger – The logger for this class instance.
model – The model estimated by this Estimator.
n_jobs – Returns number of jobs/threads to use during assignment of data.
name – The name of this instance.
ndim
ntraj
overwrite_dtrajs – Should existing dtraj files be overwritten.
ra_itraj_cuboid – Implementation of random access with slicing that can be up to 3-dimensional, where the first dimension corresponds to the trajectory index, the second dimension corresponds to the frames and the third dimension corresponds to the dimensions of the frames.
ra_itraj_jagged – Behaves like ra_itraj_cuboid, just that the trajectories are not truncated and are returned as a list.
ra_itraj_linear – Implementation of random access that takes arguments as the default random access (i.e., up to three dimensions with trajs, frames and dims, respectively), but which considers the frame indexing to be contiguous.
ra_linear – Implementation of random access that takes a (maximal) two-dimensional slice where the first component corresponds to the frames and the second component corresponds to the dimensions.
show_progress – whether to show the progress of heavy calculations on this object.
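The init_strategy attribute selects between 'uniform' and 'kmeans++' seeding of the centers. The latter can be sketched as follows; this is an illustrative re-implementation with a hypothetical function name, not PyEMMA's actual code:

```python
import numpy as np

def kmeanspp_init(X, n_clusters, seed=42):
    """'kmeans++'-style seeding sketch: the first center is drawn uniformly
    from the data; each further center is drawn with probability proportional
    to its squared distance to the closest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        # squared distance of every point to its nearest already-chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)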
-