13.2.11. Correlations utilities — MDAnalysis.lib.correlations

Authors:Paul Smith & Mateusz Bieniek
Year:2020
Copyright:GNU Public License v2

New in version 1.0.0.

This module is primarily for internal use by other analysis modules. It provides functionality for calculating the time autocorrelation function of a binary variable (i.e one that is either true or false at each frame for a given atom/molecule/set of molecules). This module includes functions for calculating both the time continuous autocorrelation and the intermittent autocorrelation. The function autocorrelation() calculates the continuous autocorrelation only. The data may be pre-processed using the function intermittency() in order to acount for intermittency before passing the results to autocorrelation().

This module is inspired by seemingly disparate analyses that rely on the same underlying calculation, including the survival probability of water around proteins [Araya-Secchi2014], hydrogen bond lifetimes [Gowers2015], [Araya-Secchi2014], and the rate of cholesterol flip-flop in lipid bilayers [Gu2019].

See also

Analysis tools that make use of modules:

References

[Gu2019]Gu, R.-X.; Baoukina, S.; Tieleman, D. P. (2019) Cholesterol Flip-Flop in Heterogeneous Membranes. J. Chem. Theory Comput. 15 (3), 2064–2070. https://doi.org/10.1021/acs.jctc.8b00933.

13.2.11.1. Autocorrelation Function

MDAnalysis.lib.correlations.autocorrelation(list_of_sets, tau_max, window_step=1)[source]

Implementation of a discrete autocorrelation function.

The autocorrelation of a property \(x\) from a time \(t=t_0\) to \(t=t_0 + \tau\) is given by:

\[C(\tau) = \langle \frac{ x(t_0)x(t_0 +\tau) }{ x(t_0)x(t_0) } \rangle\]

where \(x\) may represent any property of a particle, such as velocity or potential energy.

This function is an implementation of a special case of the time autocorrelation function in which the property under consideration can be encoded with indicator variables, \(0\) and \(1\), to represent the binary state of said property. This special case is often referred to as the survival probability (\(S(\tau)\)). As an example, in calculating the survival probability of water molecules within 5 Å of a protein, each water molecule will either be within this cutoff range (\(1\)) or not (\(0\)). The total number of water molecules within the cutoff at time \(t_0\) will be given by \(N(t_0)\). Other cases include the Hydrogen Bond Lifetime as well as the translocation rate of cholesterol across a bilayer.

The survival probability of a property of a set of particles is given by:

\[S(\tau) = \langle \frac{ N(t_0, t_0 + \tau )} { N(t_0) }\rangle\]

where \(N(t0)\) is the number of particles at time \(t_0\) for which the feature is observed, and \(N(t0, t_0 + \tau)\) is the number of particles for which this feature is present at every frame from \(t_0\) to \(N(t0, t_0 + \tau)\). The angular brackets represent an average over all time origins, \(t_0\).

See [Araya-Secchi2014] for a description survival probability.

Parameters:
  • list_of_sets (list) – List of sets. Each set corresponds to data from a single frame. Each element in a set may be, for example, an atom id or a tuple of atoms ids. In the case of calculating the survival probability of water around a protein, these atom ids in a given set will be those of the atoms which are within a cutoff distance of the protein at a given frame.
  • tau_max (int) – The last tau (lag time, inclusive) for which to calculate the autocorrelation. e.g if tau_max = 20, the survival probability will be calculated over 20 frames.
  • window_step (int, optional) –
    The step size for t0 to perform autocorrelation. Ideally, window_step will be larger than
    tau_max to ensure independence of each window for which the calculation is performed. Default is 1.
Returns:

  • tau_timeseries (list of int) – the values of tau for which the autocorrelation was calculated
  • timeseries (list of int) – the autocorelation values for each of the tau values
  • timeseries_data (list of list of int) – the raw data from which the autocorrelation is computed, i.e \(S(\tau)\) at each window. This allows the time dependant evolution of \(S(\tau)\) to be investigated.
  • .. versionadded:: 0.19.2

13.2.11.2. Intermittency Function

MDAnalysis.lib.correlations.correct_intermittency(list_of_sets, intermittency)[source]

Preprocess data to allow intermittent behaviour prior to calling autocorrelation().

Survival probabilty may be calculated either with a strict continuous requirement or a less strict intermittency. If calculating the survival probability water around a protein for example, in the former case the water must be within a cutoff distance of the protein at every frame from \(t_0\) to \(t_0 + \tau\) in order for it to be considered present at \(t_0 + \tau\). In the intermittent case, the water molecule is allowed to leave the region of interest for up to a specified consecutive number of frames whilst still being considered present at \(t_0 + \tau\).

This function pre-processes data, such as the atom ids of water molecules within a cutoff distance of a protein at each frame, in order to allow for intermittent behaviour, with a single pass over the data.

For example, if an atom is absent for a number of frames equal or smaller than the parameter intermittency, then this absence will be removed and thus the atom is considered to have not left. e.g 7,A,A,7 with intermittency=2 will be replaced by 7,7,7,7, where A=absence.

The returned data can be used as input to the function autocorrelation() in order to calculate the survival probability with a given intermittency.

See [Gowers2015] for a description of intermittency in the calculation of hydrogen bond lifetimes.

# TODO - is intermittency consitent with list of sets of sets? (hydrogen bonds)

Parameters:
  • list_of_sets (list) – In the simple case of e.g survival probability, a list of sets of atom ids present at each frame, where a single set contains atom ids at a given frame, e.g [{0, 1}, {0}, {0}, {0, 1}]
  • intermittency (int) – The maximum gap allowed. The default intermittency=0 means that if the datapoint is missing at any frame, no changes are made to the data. With the value of intermittency=2, all datapoints missing for up to two consecutive frames will be instead be considered present.
Returns:

list_of_sets – returns a new list with the IDs with added IDs which disappeared for <= intermittency. e.g If [{0, 1}, {0}, {0}, {0, 1}] is a list of sets of atom ids present at each frame and intermittency=2, both atoms will be considered present throughout and thus the returned list of sets will be [{0, 1}, {0, 1}, {0, 1}, {0, 1}].

Return type:

list