# Source code for MDAnalysis.lib.correlations

# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
#
# MDAnalysis --- https://www.mdanalysis.org
# Copyright (c) 2006-2017 The MDAnalysis Development Team and contributors
# (see the file AUTHORS for the full list of names)
#
# Released under the GNU Public Licence, v2 or any higher version
#
#
# R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler,
# D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein.
# MDAnalysis: A Python package for the rapid analysis of molecular dynamics
# simulations. In S. Benthall and S. Rostrup editors, Proceedings of the 15th
# Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy.
# doi: 10.25080/majora-629e541a-00e
#
# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein.
# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
#

"""Correlations utilities --- :mod:MDAnalysis.lib.correlations
=================================================================================

:Authors: Paul Smith & Mateusz Bieniek
:Year: 2020

This module is primarily for internal use by other analysis modules. It
provides functionality for calculating the time autocorrelation function
of a binary variable (i.e one that is either true or false at each
frame for a given atom/molecule/set of molecules). This module includes
functions for calculating both the time continuous autocorrelation and
the intermittent autocorrelation. The function :func:autocorrelation
calculates the continuous autocorrelation only. The data may be
pre-processed using the function :func:intermittency in order to
acount for intermittency before passing the results to
:func:autocorrelation.

This module is inspired by seemingly disparate analyses that rely on the same
underlying calculation, including the survival probability of water around
proteins [Araya-Secchi2014]_, hydrogen bond lifetimes [Gowers2015]_, [Araya-Secchi2014]_,
and the rate of cholesterol flip-flop in lipid bilayers [Gu2019]_.

.. seeAlso::

Analysis tools that make use of modules:

* :class:MDAnalysis.analysis.waterdynamics.SurvivalProbability
Calculates the continuous or intermittent survival probability
of an atom group in a region of interest.

* :class:MDAnalysis.analysis.hbonds.hbond_analysis
Calculates the continuous or intermittent hydrogen bond

.. rubric:: References

.. [Gu2019] Gu, R.-X.; Baoukina, S.; Tieleman, D. P. (2019)
Cholesterol Flip-Flop in Heterogeneous Membranes.
J. Chem. Theory Comput. 15 (3), 2064–2070.
https://doi.org/10.1021/acs.jctc.8b00933.

"""

import numpy as np
from copy import deepcopy

[docs]def autocorrelation(list_of_sets, tau_max, window_step=1):
r"""Implementation of a discrete autocorrelation function.

The autocorrelation of a property :math:x from a time :math:t=t_0 to :math:t=t_0 + \tau
is given by:

.. math::
C(\tau) = \langle \frac{ x(t_0)x(t_0 +\tau) }{ x(t_0)x(t_0) } \rangle

where :math:x may represent any property of a particle, such as velocity or
potential energy.

This function is an implementation of a special case of the time
autocorrelation function in which the property under consideration can
be encoded with indicator variables, :math:0 and :math:1, to represent the binary
state of said property. This special case is often referred to as the
survival probability (:math:S(\tau)). As an example, in calculating the survival
probability of water molecules within 5 Å of a protein, each water
molecule will either be within this cutoff range (:math:1) or not (:math:0). The
total number of water molecules within the cutoff at time :math:t_0 will be
given by :math:N(t_0). Other cases include the Hydrogen Bond Lifetime as
well as the translocation rate of cholesterol across a bilayer.

The survival probability of a property of a set of particles is
given by:

.. math::
S(\tau) =  \langle \frac{ N(t_0, t_0 + \tau )} { N(t_0) }\rangle

where :math:N(t0) is the number of particles at time :math:t_0 for which the feature
is observed, and :math:N(t0, t_0 + \tau) is the number of particles for which
this feature is present at every frame from :math:t_0 to :math:N(t0, t_0 + \tau).
The angular brackets represent an average over all time origins, :math:t_0.

See [Araya-Secchi2014]_ for a description survival probability.

Parameters
----------
list_of_sets : list
List of sets. Each set corresponds to data from a single frame. Each element in a set
may be, for example, an atom id or a tuple of atoms ids. In the case of calculating the
survival probability of water around a protein, these atom ids in a given set will be
those of the atoms which are within a cutoff distance of the protein at a given frame.
tau_max : int
The last tau (lag time, inclusive) for which to calculate the autocorrelation. e.g if tau_max = 20,
the survival probability will be calculated over 20 frames.
window_step : int, optional
The step size for t0 to perform autocorrelation. Ideally, window_step will be larger than
tau_max to ensure independence of each window for which the calculation is performed.
Default is 1.

Returns
--------
tau_timeseries : list of int
the values of tau for which the autocorrelation was calculated
timeseries : list of int
the autocorelation values for each of the tau values
timeseries_data : list of list of int
the raw data from which the autocorrelation is computed, i.e :math:S(\tau) at each window.
This allows the time dependant evolution of :math:S(\tau) to be investigated.

"""

# check types
if (type(list_of_sets) != list and len(list_of_sets) != 0) or type(list_of_sets[0]) != set:
raise TypeError("list_of_sets must be a one-dimensional list of sets")  # pragma: no cover

# Check dimensions of parameters
if len(list_of_sets) < tau_max:
raise ValueError("tau_max cannot be greater than the length of list_of_sets") # pragma: no cover

tau_timeseries = list(range(1, tau_max + 1))
timeseries_data = [[] for _ in range(tau_max)]

# calculate autocorrelation
for t in range(0, len(list_of_sets), window_step):
Nt = len(list_of_sets[t])

if Nt == 0:
continue

for tau in tau_timeseries:
if t + tau >= len(list_of_sets):
break

# continuous: IDs that survive from t to t + tau and at every frame in between
Ntau = len(set.intersection(*list_of_sets[t:t + tau + 1]))
timeseries_data[tau - 1].append(Ntau / float(Nt))

timeseries = [np.mean(x) for x in timeseries_data]

# at time 0 the value has to be one
tau_timeseries.insert(0, 0)
timeseries.insert(0, 1)

return tau_timeseries, timeseries, timeseries_data

[docs]def correct_intermittency(list_of_sets, intermittency):
r"""Preprocess data to allow intermittent behaviour prior to calling :func:autocorrelation.

Survival probabilty may be calculated either with a strict continuous requirement or
a less strict intermittency. If calculating the survival probability water around a
protein for example, in the former case the water must be within a cutoff distance
of the protein at every frame from :math:t_0 to :math:t_0 + \tau in order for it to be considered
present at :math:t_0 + \tau. In the intermittent case, the water molecule is allowed to
leave the region of interest for up to a specified consecutive number of frames whilst still
being considered present at :math:t_0 + \tau.

This function pre-processes data, such as the atom ids of water molecules within a cutoff
distance of a protein at each frame, in order to allow for intermittent behaviour, with a
single pass over the data.

For example, if an atom is absent for a number of frames equal or smaller than the parameter
:attr:intermittency, then this absence will be removed and thus the atom is considered to have
not left.
e.g 7,A,A,7 with intermittency=2 will be replaced by 7,7,7,7, where A=absence.

The returned data can be used as input to the function :func:autocorrelation in order
to calculate the survival probability with a given intermittency.

See [Gowers2015]_ for a description of
intermittency in the calculation of hydrogen bond lifetimes.

# TODO - is intermittency consitent with list of sets of sets? (hydrogen bonds)

Parameters
----------
list_of_sets: list
In the simple case of e.g survival probability, a list of sets of atom ids present at each frame, where a
single set contains atom ids at a given frame, e.g [{0, 1}, {0}, {0}, {0, 1}]
intermittency : int
The maximum gap allowed. The default intermittency=0 means that if the datapoint is missing at any frame, no
changes are made to the data. With the value of intermittency=2, all datapoints missing for up to two
consecutive frames will be instead be considered present.

Returns
-------
list_of_sets: list
returns a new list with the IDs with added IDs which disappeared for <= :attr:intermittency.
e.g If [{0, 1}, {0}, {0}, {0, 1}] is a list of sets of atom ids present at each frame and intermittency=2,
both atoms will be considered present throughout and thus the returned list of sets will be
[{0, 1}, {0, 1}, {0, 1}, {0, 1}].

"""
if intermittency == 0:
return list_of_sets

list_of_sets = deepcopy(list_of_sets)

# an element (a superset) takes the form of:
# - an atom pair when computing hydrogen bonds lifetime,
# - atom ID in the case of water survival probability,
for i, elements in enumerate(list_of_sets):
# initially update each frame as seen 0 ago (now)
seen_frames_ago = {i: 0 for i in elements}
for j in range(1, intermittency + 2):
for element in seen_frames_ago.keys():
# no more frames
if i + j >= len(list_of_sets):
continue

# if the element is absent now
if element not in list_of_sets[i + j]:
# increase its absence counter
seen_frames_ago[element] += 1
continue

# the element is present
if seen_frames_ago[element] == 0:
# the element was present in the last frame
continue

# element was absent more times than allowed
if seen_frames_ago[element] > intermittency:
continue

# The element was absent but returned, i.e.
# within <= intermittency_value.
# Add it to the frames where it was absent.
# Introduce the corrections.
for k in range(seen_frames_ago[element], 0, -1):