remedian package

Submodules

remedian.remedian module

Contains an implementation of Remedian.

class remedian.remedian.Remedian(obs_size, n_obs, t)

Bases: object

Remedian object for a robust averaging method for large data sets.

Implementation of the Remedian algorithm, see [1] [2] [3] for references.

This algorithm is used to approximate the median of several data chunks if these data chunks cannot (or should not) be loaded into memory at once.

Given data chunks of shape obs_size, and t data chunks overall, the Remedian class sets up k_arrs arrays, each of which holds n_obs data chunks.
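The text does not say how k_arrs is derived from n_obs and t. As a rough sketch, assuming k_arrs is the smallest number of arrays whose combined capacity n_obs ** k_arrs covers all t observations, it could be computed as below. The helper name n_arrays_needed is hypothetical and is not part of the remedian package.

```python
def n_arrays_needed(n_obs: int, t: int) -> int:
    """Return the smallest k such that n_obs ** k >= t.

    Hypothetical helper for illustration only; the package may
    determine k_arrs differently.
    """
    if n_obs < 2 or t < 1:
        raise ValueError("n_obs must be >= 2 and t must be >= 1")
    k, capacity = 1, n_obs
    # Each additional array multiplies the number of observations
    # that can be summarized by n_obs.
    while capacity < t:
        capacity *= n_obs
        k += 1
    return k
```

Integer arithmetic is used instead of a logarithm to avoid floating-point rounding when t is an exact power of n_obs.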

The median of the t data chunks of size obs_size is then approximated as follows: One data chunk after another is fed into the n_obs positions of the first array. When the first array is full, its median is calculated and stored in the first position of the second array. After this, the first array is re-used to fill the second position of the second array, etc. When the second array is full, the median of its values is stored in the first position of the third array, and so on.

The final “Remedian” is the median of the last array, after all t data chunks have been fed into the object.
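The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the package's implementation: the function name remedian_sketch is hypothetical, and the sketch assumes that exactly n_obs ** k_arrs chunks are supplied, so that the last array is full when the final median is taken.

```python
import numpy as np


def remedian_sketch(chunks, obs_size, n_obs, k_arrs):
    """Approximate the element-wise median of equally shaped chunks.

    Assumption: exactly n_obs ** k_arrs chunks are supplied.
    """
    # One buffer per level; each buffer holds n_obs chunks of shape obs_size.
    arrays = np.empty((k_arrs, n_obs) + tuple(obs_size))
    fill = [0] * k_arrs  # number of occupied positions per level

    for chunk in chunks:
        arrays[0, fill[0]] = chunk
        fill[0] += 1
        level = 0
        # When a buffer is full, store its median in the next buffer
        # and mark the full buffer as reusable.
        while fill[level] == n_obs and level + 1 < k_arrs:
            arrays[level + 1, fill[level + 1]] = np.median(arrays[level], axis=0)
            fill[level + 1] += 1
            fill[level] = 0
            level += 1

    if fill[-1] != n_obs:
        raise ValueError("expected exactly n_obs ** k_arrs chunks")
    # The remedian is the median of the last (now full) buffer.
    return np.median(arrays[-1], axis=0)
```

With k_arrs = 1 the single buffer holds every chunk, so the result is the exact element-wise median, which matches the note on n_obs >= t in the parameter description.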

Parameters:
  • obs_size (ndarray) – The shape of each data chunk (=observation) to be fed into the Remedian object.
  • n_obs (int) – The number of observations to be stored within each array. If n_obs >= t, Remedian will equal the median.
  • t (int) – The total number of observations from which a median should be approximated.
obs_count

int – Number of observations that have already been fed into the Remedian object.

remedian

None | ndarray, shape(obs_size) – The calculated remedian, of the same shape as the input data. Will be None until all t observations have been fed into the object using the add_obs method.

References

[1] P.J. Rousseeuw, G.W. Bassett Jr., “The remedian: A robust averaging method for large data sets”, Journal of the American Statistical Association, vol. 85 (1990), pp. 97-104
[2] M. Chao, G. Lin, “The asymptotic distributions of the remedians”, Journal of Statistical Planning and Inference, vol. 37 (1993), pp. 1-11
[3] D. Cantone, M. Hofri, “Further analysis of the remedian algorithm”, Theoretical Computer Science, vol. 495 (2013), pp. 1-16

Examples

>>> import numpy as np
>>> from remedian.remedian import Remedian
>>> # We can have data of any shape ... e.g., 3D:
>>> data_shape = (2, 3, 4)
>>> # Now we have to decide how many data observations we want to load into
>>> # memory at a time before computing a first intermediate median from it
>>> n_obs = 100
>>> # Pick some example number, assume we have `t` arrays of `data_shape`
>>> # that we want to summarize with Remedian
>>> t = 500
>>> # Initialize the object
>>> r = Remedian(data_shape, n_obs, t)
>>> # Feed it the data. For now, we just generate the data randomly.
>>> for obs_i in range(t):
...     obs = np.random.random(data_shape)
...     r.add_obs(obs)
>>> # This is the remedian
>>> r.remedian
>>> # data_shape is a tuple, so compare against its length
>>> assert r.remedian.ndim == len(data_shape)
add_obs(obs)

Add an observation to the Remedian.

Parameters: obs (ndarray, shape(obs_size)) – The observation (data chunk) to add to the Remedian.

Module contents