4.8.1.4. Secondary structure assignment (helix, sheet and loop) — MDAnalysis.analysis.dssp

Author:

Egor Marin

Year:

2024

Copyright:

LGPL v2.1+

Added in version 2.8.0.

The module contains code to build hydrogend bond contact map, and use it to assign protein secondary structure (DSSP).

This module uses the python version of the original algorithm [1], re-implemented by @ShintaroMinami and available under MIT license from ShintaroMinami/PyDSSP.

Note

This implementation does not discriminate different types of beta-sheets, as well as different types of helices, meaning you will get \(3_{10}\) helices and π-helices labelled as “helix” too.

Using original pydssp

The default implementation uses the original pydssp (v.0.9.0) code, rewritten without usage of the einops library and hence having no dependencies. If you want to explicitly use pydssp (or its particular version), install it to your current environment with python -m pip install pydssp. Please note that the way MDAnalysis uses pydssp does not support pydssp ‘s capability for batch processing or its use of the pytorch library.

When using this module in published work please cite [1].

References

4.8.1.4.1. Example applications

4.8.1.4.1.1. Assigning secondary structure of a PDB file

In this example we will simply print a string representing a protein’s secondary structure.

from MDAnalysis.tests.datafiles import PDB
from MDAnalysis.analysis.dssp import DSSP
u = mda.Universe(PDB)
s = ''.join(DSSP(u).run().results.dssp[0])
print(s)

4.8.1.4.1.2. Calculating average secondary structure of a trajectory

Here we take a trajectory and calculate its average secondary structure, i.e. assign a secondary structure label ‘X’ to a residue if most of the frames in the trajectory got assigned the ‘X’ label.

from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
long_run = DSSP(u).run()
mean_secondary_structure = translate(long_run.results.dssp_ndarray.mean(axis=0))
print(''.join(mean_secondary_structure))

Running this code produces

'--EEEE----------HHHHHHH----EE----HHHHH------HHHHHHHHHH------HHHHHHHHHHH---------EEEE-----HHHHHHHHH------EEEEEE--HHHHHH----EE--------EE---E----------------------HHHHHHHHHHHHHHHHHHHHHHHHHHHH----EEEEE------HHHHHHHHH--'

4.8.1.4.1.3. Find parts of the protein that maintain their secondary structure during simulation

In this example, we will find residue groups that maintain their secondary structure along the simulation, and have some meaningful (‘E’ or ‘H’) secondary structure during more than set threshold fraction of frames. We will call these residues “persistent”, for clarity, and label them according to the structure that they maintain during the run:

from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
threshold = 0.8

long_run = DSSP(u).run()
persistent_residues = translate(
    long_run
    .results
    .dssp_ndarray
    .mean(axis=0) > threshold
)
print(''.join(persistent_residues)[:20])

Running this code produces

'--EEEE----------HHHH'

4.8.1.4.2. Analysis classes

class MDAnalysis.analysis.dssp.dssp.DSSP(atoms: Universe | AtomGroup, guess_hydrogens: bool = True, *, heavyatom_names: tuple[str] = ('N', 'CA', 'C', 'O O1 OT1'), hydrogen_name: str = 'H HN HT1 HT2 HT3')[source]

Assign secondary structure using the DSSP algorithm.

Analyze a selection containing a protein and assign secondary structure using the Kabsch-Sander algorithm [1]. Only a subset of secondary structure categories are implemented:

  • ‘H’ represents a generic helix (α-helix, π-helix or \(3_{10}\) helix)

  • ‘E’ represents ‘extended strand’, participating in beta-ladder (parallel or antiparallel)

  • ‘-’ represents unordered part (“loop”)

The implementation was taken from the pydssp package (v. 0.9.0) https://github.com/ShintaroMinami/PyDSSP by Shintaro Minami under the MIT license.

Warning

For DSSP to work properly, your atoms must represent a protein. The hydrogen atom bound to the backbone nitrogen atom is matched by name as given by the keyword argument hydrogen_atom. There may only be a single backbone nitrogen hydrogen atom per residue; the one exception is proline, for which there should not exist any such hydrogens. The default value of hydrogen_atom should handle the common naming conventions in the PDB and in force fields but if you encounter an error or unusual results during your run, try to figure out how to select the correct hydrogen atoms and report an issue in the MDAnalysis issue tracker.

Parameters:
  • atoms (Union[Universe, AtomGroup]) – input Universe or AtomGroup. In both cases, only protein residues will be chosen prior to the analysis via select_atoms(‘protein’). Heavy atoms of the protein are then selected by name heavyatom_names, and hydrogens are selected by name hydrogen_name.

  • guess_hydrogens (bool, optional) – whether you want to guess hydrogens positions, by default True. Guessing is made assuming perfect 120 degrees for all bonds that N atom makes, and a N-H bond length of 1.01 A. If guess_hydrogens is False, hydrogen atom positions on N atoms will be parsed from the trajectory, except for the “hydrogen” atom positions on PRO residues, and an N-terminal residue.

  • heavyatom_names (tuple[str], default ("N", "CA", "C", "O O1 OT1")) – selection names that will be used to select “N”, “CA”, “C” and “O” atom coordinates for the secondary structure determination. The last string contains multiple values for “O” to account for C-term residues.

  • hydrogen_name (str, default "H HN HT1 HT2 HT3") –

    This selection should only select a single hydrogen atom in each residue (except proline), namely the one bound to the backbone nitrogen.

    Note

    To work with different hydrogen-naming conventions by default, the default selection is broad but if hydrogens are incorrectly selected (e.g., a ValueError is raised) you must customize hydrogen_name for your specific case.

Raises:

ValueError – if guess_hydrogens is True but some non-PRO hydrogens are missing.

Examples

For example, you can assign secondary structure for a single PDB file:

>>> from MDAnalysis.analysis.dssp import DSSP
>>> from MDAnalysisTests.datafiles import PDB
>>> import MDAnalysis as mda
>>> u = mda.Universe(PDB)
>>> run = DSSP(u).run()
>>> print("".join(run.results.dssp[0, :20]))
--EEEEE-----HHHHHHHH

The results.dssp holds the time series of assigned secondary structure, with one character for each residue.

(Note that for displaying purposes we only print the first 20 residues of frame 0 with run.results.dssp[0, :20] but one would typically look at all residues run.results.dssp[0].)

The results.dssp_ndarray attribute holds a (n_frames, n_residues, 3) shape ndarray with a one-hot encoding of loop ‘-’ (index 0), helix ‘H’ (index 1), and sheet ‘E’ (index 2), respectively for each frame of the trajectory. It can be used to compute, for instance, the average secondary structure:

>>> from MDAnalysis.analysis.dssp import translate, DSSP
>>> from MDAnalysisTests.datafiles import TPR, XTC
>>> u = mda.Universe(TPR, XTC)
>>> run = DSSP(u).run()
>>> mean_secondary_structure = translate(run.results.dssp_ndarray.mean(axis=0))
>>> print(''.join(mean_secondary_structure)[:20])
-EEEEEE------HHHHHHH

Added in version 2.8.0.

results.dssp

Contains the time series of the DSSP assignment as a numpy.ndarray array of shape (n_frames, n_residues) where each row contains the assigned secondary structure character for each residue (whose corresponding resid is stored in results.resids). The three characters are [‘H’, ‘E’, ‘-’] and representi alpha-helix, sheet and loop, respectively.

results.dssp_ndarray

Contains the one-hot encoding of the time series of the DSSP assignment as a numpy.ndarray Boolean array of shape (n_frames, n_residues, 3) where for each residue the encoding is stored as (3,) shape numpy.ndarray of Booleans so that True at index 0 represents loop (‘-‘), True at index 1 represents helix (‘H’), and True at index 2 represents sheet ‘E’.

See also

translate()

results.resids

A numpy.ndarray of length n_residues that contains the residue IDs (resids) for the protein residues that were assigned a secondary structure.

classmethod get_supported_backends()

Tuple with backends supported by the core library for a given class. User can pass either one of these values as backend=... to run() method, or a custom object that has apply method (see documentation for run()):

  • ‘serial’: no parallelization

  • ‘multiprocessing’: parallelization using multiprocessing.Pool

  • ‘dask’: parallelization using dask.delayed.compute(). Requires installation of mdanalysis[dask]

If you want to add your own backend to an existing class, pass a backends.BackendBase subclass (see its documentation to learn how to implement it properly), and specify unsupported_backend=True.

Returns:

names of built-in backends that can be used in run(backend=...)()

Return type:

tuple

Added in version 2.8.0.

property parallelizable

Boolean mark showing that a given class can be parallelizable with split-apply-combine procedure. Namely, if we can safely distribute _single_frame() to multiple workers and then combine them with a proper _conclude() call. If set to False, no backends except for serial are supported.

Note

If you want to check parallelizability of the whole class, without explicitly creating an instance of the class, see _analysis_algorithm_is_parallelizable. Note that you setting it to other value will break things if the algorithm behind the analysis is not trivially parallelizable.

Returns:

if a given AnalysisBase subclass instance is parallelizable with split-apply-combine, or not

Return type:

bool

Added in version 2.8.0.

run(start: int | None = None, stop: int | None = None, step: int | None = None, frames: Iterable | None = None, verbose: bool | None = None, n_workers: int | None = None, n_parts: int | None = None, backend: str | BackendBase | None = None, *, unsupported_backend: bool = False, progressbar_kwargs=None)

Perform the calculation

Parameters:
  • start (int, optional) – start frame of analysis

  • stop (int, optional) – stop frame of analysis

  • step (int, optional) – number of frames to skip between each analysed frame

  • frames (array_like, optional) –

    array of integers or booleans to slice trajectory; frames can only be used instead of start, stop, and step. Setting both frames and at least one of start, stop, step to a non-default value will raise a ValueError.

    Added in version 2.2.0.

  • verbose (bool, optional) – Turn on verbosity

  • progressbar_kwargs (dict, optional) – ProgressBar keywords with custom parameters regarding progress bar position, etc; see MDAnalysis.lib.log.ProgressBar for full list. Available only for backend='serial'

  • backend (Union[str, BackendBase], optional) –

    By default, performs calculations in a serial fashion. Otherwise, user can choose a backend: str is matched to a builtin backend (one of serial, multiprocessing and dask), or a MDAnalysis.analysis.results.BackendBase subclass.

    Added in version 2.8.0.

  • n_workers (int) –

    positive integer with number of workers (processes, in case of built-in backends) to split the work between

    Added in version 2.8.0.

  • n_parts (int, optional) –

    number of parts to split computations across. Can be more than number of workers.

    Added in version 2.8.0.

  • unsupported_backend (bool, optional) –

    if you want to run your custom backend on a parallelizable class that has not been tested by developers, by default False

    Added in version 2.8.0.

Changed in version 2.2.0: Added ability to analyze arbitrary frames by passing a list of frame indices in the frames keyword argument.

Changed in version 2.5.0: Add progressbar_kwargs parameter, allowing to modify description, position etc of tqdm progressbars

Changed in version 2.8.0: Introduced backend, n_workers, n_parts and unsupported_backend keywords, and refactored the method logic to support parallelizable execution.

4.8.1.4.3. Functions

MDAnalysis.analysis.dssp.dssp.assign(coord: ndarray) ndarray[source]

Assigns secondary structure for a given coordinate array, either with or without assigned hydrogens

Parameters:

coord (np.ndarray) – input coordinates in either (n, 4, 3) or (n, 5, 3) shape, without or with hydrogens, respectively. Second dimension k represents (N, CA, C, O) atoms coordinates (if k=4), or (N, CA, C, O, H) coordinates (when k=5).

Returns:

output (n,) array with one-hot labels in C3 notation (‘-’, ‘H’, ‘E’), representing loop, helix and sheet, respectively.

Return type:

np.ndarray

Added in version 2.8.0.

MDAnalysis.analysis.dssp.dssp.translate(onehot: ndarray) ndarray[source]

Translate a one-hot encoding summary into char-based secondary structure assignment.

One-hot encoding corresponds to C3 notation: ‘-’, ‘H’, ‘E’ are loop, helix and sheet, respectively. Input array must have its last axis of shape 3: (n_residues, 3) or (n_frames, n_residues, 3)

Examples

from MDAnalysis.analysis.dssp import translate
import numpy as np
# encoding 'HE-'
onehot = np.array([[False, True, False],  # 'H'
                   [False, False, True],  # 'E'
                   [True, False, False]]) # '-'
''.join(translate(onehot))
print(''.join(translate(onehot)))

Running this code produces

HE-
Parameters:

onehot (np.ndarray) – input array of one-hot encoding in (‘-’, ‘H’, ‘E’) order

Returns:

array of ‘-’, ‘H’ and ‘E’ symbols with secondary structure

Return type:

np.ndarray

Added in version 2.8.0.