4.2.6. Calculating path similarity — MDAnalysis.analysis.psa

Author:

Sean Seyler

Year:

2015

Copyright:

GNU Public License v3

New in version 0.10.0.

The module contains code to calculate the geometric similarity of trajectories using path metrics such as the Hausdorff or Fréchet distances [Seyler2015]. The path metrics are functions of two paths and return a nonnegative number, i.e., a distance. Two paths are identical if their distance is zero, and large distances indicate dissimilarity. Each path metric is a function of the individual points (e.g., coordinate snapshots) that comprise each path and, loosely speaking, identify the two points, one per path of a pair of paths, where the paths deviate the most. The distance between these points of maximal deviation is measured by the root mean square deviation (RMSD), i.e., to compute structural similarity.

One typically computes the pairwise similarity for an ensemble of paths to produce a symmetric distance matrix, which can be clustered to, at a glance, identify patterns in the trajectory data. To properly analyze a path ensemble, one must select a suitable reference structure to which all paths (each conformer in each path) will be universally aligned using the rotations determined by the best-fit rmsds. Distances between paths and their structures are then computed directly with no further alignment. This pre-processing step is necessary to preserve the metric properties of the Hausdorff and Fréchet metrics; using the best-fit rmsd on a pairwise basis does not generally preserve the triangle inequality.

Note

The PSAnalysisTutorial outlines a typical application of PSA to a set of trajectories, including doing proper alignment, performing distance comparisons, and generating heat map-dendrogram plots from hierarchical clustering.

References

[Seyler2015] (1,2,3,4)

Sean L. Seyler, Avishek Kumar, M. F. Thorpe, and Oliver Beckstein. Path similarity analysis: a method for quantifying macromolecular pathways. PLOS Computational Biology, 11(10):1–37, 10 2015. doi:10.1371/journal.pcbi.1004568.

4.2.6.1. Helper functions and variables

The following convenience functions are used by other functions in this module.

MDAnalysis.analysis.psa.sqnorm(v, axis=None)[source]

Compute the sum of squares of elements along specified axes.

Parameters:
  • v (numpy.ndarray) – coordinates

  • axes (None / int / tuple (optional)) – Axes or axes along which a sum is performed. The default (axes = None) performs a sum over all the dimensions of the input array. The value of axes may be negative, in which case it counts from the last axis to the zeroth axis.

Returns:

the sum of the squares of the elements of v along axes

Return type:

float

MDAnalysis.analysis.psa.get_msd_matrix(P, Q, axis=None)[source]

Generate the matrix of pairwise mean-squared deviations between paths.

The MSDs between all pairs of points in P and Q are calculated, each pair having a point from P and a point from Q.

P (Q) is a numpy.ndarray of \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g., MDAnalysis.core.groups.AtomGroup.positions). The pairwise MSD matrix has dimensions \(N_p\) by \(N_q\).

Parameters:
Returns:

msd_matrix – matrix of pairwise MSDs between points in P and points in Q

Return type:

numpy.ndarray

Notes

We calculate the MSD matrix

\[M_{ij} = ||p_i - q_j||^2\]

where \(p_i \in P\) and \(q_j \in Q\).

MDAnalysis.analysis.psa.get_coord_axes(path)[source]

Return the number of atoms and the axes corresponding to atoms and coordinates for a given path.

The path is assumed to be a numpy.ndarray where the 0th axis corresponds to a frame (a snapshot of coordinates). The \(3N\) (Cartesian) coordinates are assumed to be either:

  1. all in the 1st axis, starting with the x,y,z coordinates of the first atom, followed by the x,*y*,*z* coordinates of the 2nd, etc.

  2. in the 1st and 2nd axis, where the 1st axis indexes the atom number and the 2nd axis contains the x,*y*,*z* coordinates of each atom.

Parameters:

path (numpy.ndarray) – representing a path

Returns:

the number of atoms and the axes containing coordinates

Return type:

(int, (int, …))

4.2.6.2. Classes, methods, and functions

MDAnalysis.analysis.psa.get_path_metric_func(name)[source]

Selects a path metric function by name.

Parameters:

name (str) – name of path metric

Returns:

path_metric – The path metric function specified by name (if found).

Return type:

function

MDAnalysis.analysis.psa.hausdorff(P, Q)[source]

Calculate the symmetric Hausdorff distance between two paths.

The metric used is RMSD, as opposed to the more conventional L2 (Euclidean) norm, because this is convenient for i.e., comparing protein configurations.

P (Q) is a numpy.ndarray of \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g., MDAnalysis.core.groups.AtomGroup.positions). P (Q) has either shape \(N_p \times N \times 3\) (\(N_q \times N \times 3\)), or \(N_p \times (3N)\) (\(N_q \times (3N)\)) in flattened form.

Note that reversing the path does not change the Hausdorff distance.

Parameters:
Returns:

the Hausdorff distance between paths P and Q

Return type:

float

Example

Calculate the Hausdorff distance between two halves of a trajectory:

>>> import MDAnalysis as mda
>>> import numpy
>>> from MDAnalysis.tests.datafiles import PSF, DCD
>>> from MDAnalysis.analysis import psa
>>> u = mda.Universe(PSF,DCD)
>>> mid = int(len(u.trajectory)/2)
>>> ca = u.select_atoms('name CA')
>>> P = numpy.array([
...                ca.positions for _ in u.trajectory[:mid:]
...              ]) # first half of trajectory
>>> Q = numpy.array([
...                ca.positions for _ in u.trajectory[mid::]
...              ]) # second half of trajectory
>>> psa.hausdorff(P,Q)
4.778663899862152
>>> psa.hausdorff(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory
4.778663899862152

Notes

scipy.spatial.distance.directed_hausdorff() is an optimized implementation of the early break algorithm of [Taha2015]; the latter code is used here to calculate the symmetric Hausdorff distance with an RMSD metric

References

[Taha2015]

Abdel Aziz Taha and Allan Hanbury. An efficient algorithm for calculating the exact hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2153–2163, 2015. doi:10.1109/TPAMI.2015.2408351.