Source code for MDAnalysis.analysis.dssp.dssp
"""
Secondary structure assignment (helix, sheet and loop) --- :mod:`MDAnalysis.analysis.dssp`
==========================================================================================
:Author: Egor Marin
:Year: 2024
:Copyright: LGPL v2.1+
.. versionadded:: 2.8.0
The module contains code to build hydrogend bond contact map,
and use it to assign protein secondary structure (:class:`DSSP`).
This module uses the python version of the original algorithm :footcite:p:`Kabsch1983`,
re-implemented by @ShintaroMinami and available under MIT license from
`ShintaroMinami/PyDSSP <https://github.com/ShintaroMinami/PyDSSP/tree/master#differences-from-the-original-dssp>`_.
.. Note::
This implementation does not discriminate different types of
beta-sheets, as well as different types of helices, meaning you will get
:math:`3_{10}` helices and π-helices labelled as "helix" too.
.. rubric:: Using original `pydssp`
The default implementation uses the original *pydssp* (v.0.9.0) code, rewritten
without usage of the *einops* library and hence having no dependencies. If you want
to explicitly use *pydssp* (or its particular version), install it to your
current environment with ``python -m pip install pydssp``. Please note that the
way MDAnalysis uses *pydssp* does not support *pydssp* 's capability for batch
processing or its use of the *pytorch* library.
When using this module in published work please cite :footcite:p:`Kabsch1983`.
.. rubric:: References
.. footbibliography::
Example applications
--------------------
Assigning secondary structure of a PDB file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example we will simply print a string representing a protein's secondary
structure.
.. code-block:: python
from MDAnalysis.tests.datafiles import PDB
from MDAnalysis.analysis.dssp import DSSP
u = mda.Universe(PDB)
s = ''.join(DSSP(u).run().results.dssp[0])
print(s)
Calculating average secondary structure of a trajectory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we take a trajectory and calculate its average secondary structure, i.e.
assign a secondary structure label 'X' to a residue if most of the frames in the
trajectory got assigned the 'X' label.
.. code-block:: python
from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
long_run = DSSP(u).run()
mean_secondary_structure = translate(long_run.results.dssp_ndarray.mean(axis=0))
print(''.join(mean_secondary_structure))
Running this code produces ::
'--EEEE----------HHHHHHH----EE----HHHHH------HHHHHHHHHH------HHHHHHHHHHH---------EEEE-----HHHHHHHHH------EEEEEE--HHHHHH----EE--------EE---E----------------------HHHHHHHHHHHHHHHHHHHHHHHHHHHH----EEEEE------HHHHHHHHH--'
Find parts of the protein that maintain their secondary structure during simulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example, we will find residue groups that maintain their secondary structure
along the simulation, and have some meaningful ('E' or 'H') secondary structure
during more than set `threshold` fraction of frames. We will call these residues
"persistent", for clarity, and label them according to the structure
that they maintain during the run:
.. code-block:: python
from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
threshold = 0.8
long_run = DSSP(u).run()
persistent_residues = translate(
long_run
.results
.dssp_ndarray
.mean(axis=0) > threshold
)
print(''.join(persistent_residues)[:20])
Running this code produces ::
'--EEEE----------HHHH'
Analysis classes
----------------
.. autoclass:: DSSP
:members:
:inherited-members:
.. attribute:: results.dssp
Contains the time series of the DSSP assignment as a
:class:`numpy.ndarray` array of shape ``(n_frames, n_residues)`` where each row
contains the assigned secondary structure character for each residue (whose
corresponding resid is stored in :attr:`results.resids`). The three characters
are ['H', 'E', '-'] and representi alpha-helix, sheet and loop, respectively.
.. attribute:: results.dssp_ndarray
Contains the one-hot encoding of the time series of the DSSP assignment
as a :class:`numpy.ndarray` Boolean array of shape ``(n_frames, n_residues, 3)``
where for each residue the encoding is stored as ``(3,)`` shape
:class:`numpy.ndarray` of Booleans so that ``True`` at index 0 represents loop
('-'), ``True`` at index 1 represents helix ('H'), and ``True`` at index 2
represents sheet 'E'.
.. SeeAlso:: :func:`translate`
.. attribute:: results.resids
A :class:`numpy.ndarray` of length ``n_residues`` that contains the residue IDs
(resids) for the protein residues that were assigned a secondary structure.
Functions
---------
.. autofunction:: assign
.. autofunction:: translate
"""
from typing import Union
import numpy as np
from MDAnalysis import Universe, AtomGroup
from ..base import AnalysisBase, ResultsGroup
from ...due import due, Doi
due.cite(
Doi("10.1002/bip.360221211"),
description="DSSP algorithm description",
path="MDAnalysis.analysis.dssp",
cite_module=True,
)
del Doi
try: # pragma: no cover
from pydssp.pydssp_numpy import (
assign,
_get_hydrogen_atom_position,
)
HAS_PYDSSP = True
except ModuleNotFoundError:
HAS_PYDSSP = False
from .pydssp_numpy import (
assign,
_get_hydrogen_atom_position,
)
[docs]
class DSSP(AnalysisBase):
"""Assign secondary structure using the DSSP algorithm.
Analyze a selection containing a protein and assign secondary structure
using the Kabsch-Sander algorithm :footcite:p:`Kabsch1983`. Only a subset
of secondary structure categories are implemented:
- 'H' represents a generic helix (α-helix, π-helix or :math:`3_{10}` helix)
- 'E' represents 'extended strand', participating in beta-ladder (parallel
or antiparallel)
- '-' represents unordered part ("loop")
The implementation was taken from the pydssp package (v. 0.9.0)
https://github.com/ShintaroMinami/PyDSSP by Shintaro Minami under the
MIT license.
.. Warning::
For DSSP to work properly, your atoms must represent a protein. The
hydrogen atom bound to the backbone nitrogen atom is matched by name
as given by the keyword argument `hydrogen_atom`. There may only be
a single backbone nitrogen hydrogen atom per residue; the one exception
is proline, for which there should not exist any such hydrogens.
The default value of `hydrogen_atom` should handle the common naming
conventions in the PDB and in force fields but if you encounter an error
or unusual results during your run, try to figure out how to select the
correct hydrogen atoms and report an issue in the MDAnalysis
`issue tracker <https://github.com/MDAnalysis/mdanalysis/issues>`_.
Parameters
----------
atoms : Union[Universe, AtomGroup]
input Universe or AtomGroup. In both cases, only protein residues will
be chosen prior to the analysis via `select_atoms('protein')`.
Heavy atoms of the protein are then selected by name
`heavyatom_names`, and hydrogens are selected by name
`hydrogen_name`.
guess_hydrogens : bool, optional
whether you want to guess hydrogens positions, by default ``True``.
Guessing is made assuming perfect 120 degrees for all bonds that N
atom makes, and a N-H bond length of 1.01 A.
If ``guess_hydrogens`` is False, hydrogen atom positions on N atoms
will be parsed from the trajectory, except for the "hydrogen" atom
positions on PRO residues, and an N-terminal residue.
heavyatom_names : tuple[str], default ("N", "CA", "C", "O O1 OT1")
selection names that will be used to select "N", "CA", "C" and "O"
atom coordinates for the secondary structure determination. The last
string contains multiple values for "O" to account for C-term residues.
hydrogen_name : str, default "H HN HT1 HT2 HT3"
This selection should only select a single hydrogen atom in each residue
(except proline), namely the one bound to the backbone nitrogen.
.. Note::
To work with different hydrogen-naming conventions by default, the
default selection is broad but if hydrogens are incorrectly selected
(e.g., a :exc:`ValueError` is raised) you must customize `hydrogen_name`
for your specific case.
Raises
------
ValueError
if ``guess_hydrogens`` is True but some non-PRO hydrogens are missing.
Examples
--------
For example, you can assign secondary structure for a single PDB file:
>>> from MDAnalysis.analysis.dssp import DSSP
>>> from MDAnalysisTests.datafiles import PDB
>>> import MDAnalysis as mda
>>> u = mda.Universe(PDB)
>>> run = DSSP(u).run()
>>> print("".join(run.results.dssp[0, :20]))
--EEEEE-----HHHHHHHH
The :attr:`results.dssp` holds the time series of assigned secondary
structure, with one character for each residue.
(Note that for displaying purposes we only print the first 20 residues
of frame 0 with ``run.results.dssp[0, :20]`` but one would typically look
at all residues ``run.results.dssp[0]``.)
The :attr:`results.dssp_ndarray` attribute holds a
``(n_frames, n_residues, 3)`` shape ndarray with a *one-hot encoding*
of *loop* '-' (index 0), *helix* 'H' (index 1), and *sheet* 'E'
(index 2), respectively for each frame of the trajectory. It can be
used to compute, for instance, the **average secondary structure**:
>>> from MDAnalysis.analysis.dssp import translate, DSSP
>>> from MDAnalysisTests.datafiles import TPR, XTC
>>> u = mda.Universe(TPR, XTC)
>>> run = DSSP(u).run()
>>> mean_secondary_structure = translate(run.results.dssp_ndarray.mean(axis=0))
>>> print(''.join(mean_secondary_structure)[:20])
-EEEEEE------HHHHHHH
.. versionadded:: 2.8.0
Enabled **parallel execution** with the ``multiprocessing`` and ``dask``
backends; use the new method :meth:`get_supported_backends` to see all
supported backends.
"""
_analysis_algorithm_is_parallelizable = True
[docs]
@classmethod
def get_supported_backends(cls):
return (
"serial",
"multiprocessing",
"dask",
)
def __init__(
self,
atoms: Union[Universe, AtomGroup],
guess_hydrogens: bool = True,
*,
heavyatom_names: tuple[str] = ("N", "CA", "C", "O O1 OT1"),
hydrogen_name: str = "H HN HT1 HT2 HT3",
):
self._guess_hydrogens = guess_hydrogens
ag: AtomGroup = atoms.select_atoms("protein")
super().__init__(ag.universe.trajectory)
# define necessary selections
self._heavy_atoms: dict[str, "AtomGroup"] = {
t: ag.atoms[
np.isin(
ag.names, t.split()
) # need split() since `np.isin` takes an iterable as second argument
# and "N".split() -> ["N"]
]
for t in heavyatom_names
}
self._hydrogens: list["AtomGroup"] = [
res.atoms.select_atoms(f"name {hydrogen_name}") for res in ag.residues
]
# can't do it the other way because I need missing values to exist
# so that I could fill them in later
if not self._guess_hydrogens:
# zip() assumes that _heavy_atoms and _hydrogens is ordered in the
# same way. This is true as long as the original AtomGroup ag is
# sorted. With the hard-coded protein selection for ag this is always
# true but if the code on L277 ever changes, make sure to sort first!
for calpha, hydrogen in zip(
self._heavy_atoms["CA"][1:], self._hydrogens[1:]
):
if (calpha.resname != "PRO" and len(hydrogen) != 1) or (
calpha.resname == "PRO" and hydrogen
):
raise ValueError(
(
f"Residue {calpha.residue} contains wrong number of hydrogens: "
"exactly 1 hydrogen is expected for non-PRO residues, and "
"zero hydrogens for PRO residues."
)
)
positions = [group.positions for group in self._heavy_atoms.values()]
if len(set(map(lambda arr: arr.shape[0], positions))) != 1:
raise ValueError(
(
"Universe contains unequal numbers of (N,CA,C,O) atoms ('name' field)."
" Please select appropriate AtomGroup manually."
)
)
def _prepare(self):
self.results.dssp_ndarray = []
def _get_coords(self) -> np.ndarray:
"""Returns coordinates of (N,CA,C,O,H) atoms, as required by
:func:`get_hbond_map` and :func:`assign` functions.
Returns
-------
np.ndarray
coordinates of (N,CA,C,O,H) atoms
Raises
------
ValueError
if input Universe contains different number of (N,CA,C,O) atoms
"""
# NOTE: here we explicitly rely on the fact that `self._heavy_atoms`
# dictionary maintains order of the keys since python 3.7
positions = [group.positions for group in self._heavy_atoms.values()]
coords = np.array(positions)
if not self._guess_hydrogens:
guessed_h_coords = _get_hydrogen_atom_position(coords.swapaxes(0, 1))
h_coords = np.array(
[
group.positions[0] if group else guessed_h_coords[idx]
for idx, group in enumerate(self._hydrogens)
]
)
h_coords = np.expand_dims(h_coords, axis=0)
coords = np.vstack([coords, h_coords])
coords = coords.swapaxes(0, 1)
return coords
def _single_frame(self):
coords = self._get_coords()
dssp = assign(coords)
self.results.dssp_ndarray.append(dssp)
def _conclude(self):
self.results.dssp = translate(np.array(self.results.dssp_ndarray))
self.results.dssp_ndarray = np.array(self.results.dssp_ndarray)
self.results.resids = self._heavy_atoms["CA"].resids
def _get_aggregator(self):
return ResultsGroup(
lookup={"dssp_ndarray": ResultsGroup.flatten_sequence},
)
[docs]
def translate(onehot: np.ndarray) -> np.ndarray:
"""Translate a one-hot encoding summary into char-based secondary structure
assignment.
One-hot encoding corresponds to C3 notation:
'-', 'H', 'E' are loop, helix and sheet, respectively. Input array must
have its last axis of shape 3: ``(n_residues, 3)`` or ``(n_frames, n_residues, 3)``
Examples
--------
.. code-block:: python
from MDAnalysis.analysis.dssp import translate
import numpy as np
# encoding 'HE-'
onehot = np.array([[False, True, False], # 'H'
[False, False, True], # 'E'
[True, False, False]]) # '-'
''.join(translate(onehot))
print(''.join(translate(onehot)))
Running this code produces ::
HE-
Parameters
----------
onehot : np.ndarray
input array of one-hot encoding in ('-', 'H', 'E') order
Returns
-------
np.ndarray
array of '-', 'H' and 'E' symbols with secondary structure
.. versionadded:: 2.8.0
"""
C3_ALPHABET = np.array(["-", "H", "E"])
index = np.argmax(onehot, axis=-1)
return C3_ALPHABET[index]