4.8.1.4. Secondary structure assignment (helix, sheet and loop) — MDAnalysis.analysis.dssp
- Author:
Egor Marin
- Year:
2024
- Copyright:
LGPL v2.1+
Added in version 2.8.0.
The module contains code to build hydrogend bond contact map,
and use it to assign protein secondary structure (DSSP
).
This module uses the python version of the original algorithm [1], re-implemented by @ShintaroMinami and available under MIT license from ShintaroMinami/PyDSSP.
Note
This implementation does not discriminate different types of beta-sheets, as well as different types of helices, meaning you will get \(3_{10}\) helices and π-helices labelled as “helix” too.
Using original pydssp
The default implementation uses the original pydssp (v.0.9.0) code, rewritten
without usage of the einops library and hence having no dependencies. If you want
to explicitly use pydssp (or its particular version), install it to your
current environment with python -m pip install pydssp
. Please note that the
way MDAnalysis uses pydssp does not support pydssp ‘s capability for batch
processing or its use of the pytorch library.
When using this module in published work please cite [1].
References
4.8.1.4.1. Example applications
4.8.1.4.1.1. Assigning secondary structure of a PDB file
In this example we will simply print a string representing a protein’s secondary structure.
from MDAnalysis.tests.datafiles import PDB
from MDAnalysis.analysis.dssp import DSSP
u = mda.Universe(PDB)
s = ''.join(DSSP(u).run().results.dssp[0])
print(s)
4.8.1.4.1.2. Calculating average secondary structure of a trajectory
Here we take a trajectory and calculate its average secondary structure, i.e. assign a secondary structure label ‘X’ to a residue if most of the frames in the trajectory got assigned the ‘X’ label.
from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
long_run = DSSP(u).run()
mean_secondary_structure = translate(long_run.results.dssp_ndarray.mean(axis=0))
print(''.join(mean_secondary_structure))
Running this code produces
'--EEEE----------HHHHHHH----EE----HHHHH------HHHHHHHHHH------HHHHHHHHHHH---------EEEE-----HHHHHHHHH------EEEEEE--HHHHHH----EE--------EE---E----------------------HHHHHHHHHHHHHHHHHHHHHHHHHHHH----EEEEE------HHHHHHHHH--'
4.8.1.4.1.3. Find parts of the protein that maintain their secondary structure during simulation
In this example, we will find residue groups that maintain their secondary structure along the simulation, and have some meaningful (‘E’ or ‘H’) secondary structure during more than set threshold fraction of frames. We will call these residues “persistent”, for clarity, and label them according to the structure that they maintain during the run:
from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
threshold = 0.8
long_run = DSSP(u).run()
persistent_residues = translate(
long_run
.results
.dssp_ndarray
.mean(axis=0) > threshold
)
print(''.join(persistent_residues)[:20])
Running this code produces
'--EEEE----------HHHH'
4.8.1.4.2. Analysis classes
- class MDAnalysis.analysis.dssp.dssp.DSSP(atoms: Universe | AtomGroup, guess_hydrogens: bool = True, *, heavyatom_names: tuple[str] = ('N', 'CA', 'C', 'O O1 OT1'), hydrogen_name: str = 'H HN HT1 HT2 HT3')[source]
Assign secondary structure using the DSSP algorithm.
Analyze a selection containing a protein and assign secondary structure using the Kabsch-Sander algorithm [1]. Only a subset of secondary structure categories are implemented:
‘H’ represents a generic helix (α-helix, π-helix or \(3_{10}\) helix)
‘E’ represents ‘extended strand’, participating in beta-ladder (parallel or antiparallel)
‘-’ represents unordered part (“loop”)
The implementation was taken from the pydssp package (v. 0.9.0) https://github.com/ShintaroMinami/PyDSSP by Shintaro Minami under the MIT license.
Warning
For DSSP to work properly, your atoms must represent a protein. The hydrogen atom bound to the backbone nitrogen atom is matched by name as given by the keyword argument hydrogen_atom. There may only be a single backbone nitrogen hydrogen atom per residue; the one exception is proline, for which there should not exist any such hydrogens. The default value of hydrogen_atom should handle the common naming conventions in the PDB and in force fields but if you encounter an error or unusual results during your run, try to figure out how to select the correct hydrogen atoms and report an issue in the MDAnalysis issue tracker.
- Parameters:
atoms (Union[Universe, AtomGroup]) – input Universe or AtomGroup. In both cases, only protein residues will be chosen prior to the analysis via select_atoms(‘protein’). Heavy atoms of the protein are then selected by name heavyatom_names, and hydrogens are selected by name hydrogen_name.
guess_hydrogens (bool, optional) – whether you want to guess hydrogens positions, by default
True
. Guessing is made assuming perfect 120 degrees for all bonds that N atom makes, and a N-H bond length of 1.01 A. Ifguess_hydrogens
is False, hydrogen atom positions on N atoms will be parsed from the trajectory, except for the “hydrogen” atom positions on PRO residues, and an N-terminal residue.heavyatom_names (tuple[str], default ("N", "CA", "C", "O O1 OT1")) – selection names that will be used to select “N”, “CA”, “C” and “O” atom coordinates for the secondary structure determination. The last string contains multiple values for “O” to account for C-term residues.
hydrogen_name (str, default "H HN HT1 HT2 HT3") –
This selection should only select a single hydrogen atom in each residue (except proline), namely the one bound to the backbone nitrogen.
Note
To work with different hydrogen-naming conventions by default, the default selection is broad but if hydrogens are incorrectly selected (e.g., a
ValueError
is raised) you must customize hydrogen_name for your specific case.
- Raises:
ValueError – if
guess_hydrogens
is True but some non-PRO hydrogens are missing.
Examples
For example, you can assign secondary structure for a single PDB file:
>>> from MDAnalysis.analysis.dssp import DSSP >>> from MDAnalysisTests.datafiles import PDB >>> import MDAnalysis as mda >>> u = mda.Universe(PDB) >>> run = DSSP(u).run() >>> print("".join(run.results.dssp[0, :20])) --EEEEE-----HHHHHHHH
The
results.dssp
holds the time series of assigned secondary structure, with one character for each residue.(Note that for displaying purposes we only print the first 20 residues of frame 0 with
run.results.dssp[0, :20]
but one would typically look at all residuesrun.results.dssp[0]
.)The
results.dssp_ndarray
attribute holds a(n_frames, n_residues, 3)
shape ndarray with a one-hot encoding of loop ‘-’ (index 0), helix ‘H’ (index 1), and sheet ‘E’ (index 2), respectively for each frame of the trajectory. It can be used to compute, for instance, the average secondary structure:>>> from MDAnalysis.analysis.dssp import translate, DSSP >>> from MDAnalysisTests.datafiles import TPR, XTC >>> u = mda.Universe(TPR, XTC) >>> run = DSSP(u).run() >>> mean_secondary_structure = translate(run.results.dssp_ndarray.mean(axis=0)) >>> print(''.join(mean_secondary_structure)[:20]) -EEEEEE------HHHHHHH
Added in version 2.8.0: Enabled parallel execution with the
multiprocessing
anddask
backends; use the new methodget_supported_backends()
to see all supported backends.- results.dssp
Contains the time series of the DSSP assignment as a
numpy.ndarray
array of shape(n_frames, n_residues)
where each row contains the assigned secondary structure character for each residue (whose corresponding resid is stored inresults.resids
). The three characters are [‘H’, ‘E’, ‘-’] and representi alpha-helix, sheet and loop, respectively.
- results.dssp_ndarray
Contains the one-hot encoding of the time series of the DSSP assignment as a
numpy.ndarray
Boolean array of shape(n_frames, n_residues, 3)
where for each residue the encoding is stored as(3,)
shapenumpy.ndarray
of Booleans so thatTrue
at index 0 represents loop (‘-‘),True
at index 1 represents helix (‘H’), andTrue
at index 2 represents sheet ‘E’.See also
- results.resids
A
numpy.ndarray
of lengthn_residues
that contains the residue IDs (resids) for the protein residues that were assigned a secondary structure.
- classmethod get_supported_backends()[source]
Tuple with backends supported by the core library for a given class. User can pass either one of these values as
backend=...
torun()
method, or a custom object that hasapply
method (see documentation forrun()
):‘serial’: no parallelization
‘multiprocessing’: parallelization using multiprocessing.Pool
‘dask’: parallelization using dask.delayed.compute(). Requires installation of mdanalysis[dask]
If you want to add your own backend to an existing class, pass a
backends.BackendBase
subclass (see its documentation to learn how to implement it properly), and specifyunsupported_backend=True
.- Returns:
names of built-in backends that can be used in
run(backend=...)()
- Return type:
Added in version 2.8.0.
- property parallelizable
Boolean mark showing that a given class can be parallelizable with split-apply-combine procedure. Namely, if we can safely distribute
_single_frame()
to multiple workers and then combine them with a proper_conclude()
call. If set toFalse
, no backends except forserial
are supported.Note
If you want to check parallelizability of the whole class, without explicitly creating an instance of the class, see
_analysis_algorithm_is_parallelizable
. Note that you setting it to other value will break things if the algorithm behind the analysis is not trivially parallelizable.- Returns:
if a given
AnalysisBase
subclass instance is parallelizable with split-apply-combine, or not- Return type:
Added in version 2.8.0.
- run(start: int | None = None, stop: int | None = None, step: int | None = None, frames: Iterable | None = None, verbose: bool | None = None, n_workers: int | None = None, n_parts: int | None = None, backend: str | BackendBase | None = None, *, unsupported_backend: bool = False, progressbar_kwargs=None)
Perform the calculation
- Parameters:
start (int, optional) – start frame of analysis
stop (int, optional) – stop frame of analysis
step (int, optional) – number of frames to skip between each analysed frame
frames (array_like, optional) –
array of integers or booleans to slice trajectory;
frames
can only be used instead ofstart
,stop
, andstep
. Setting bothframes
and at least one ofstart
,stop
,step
to a non-default value will raise aValueError
.Added in version 2.2.0.
verbose (bool, optional) – Turn on verbosity
progressbar_kwargs (dict, optional) – ProgressBar keywords with custom parameters regarding progress bar position, etc; see
MDAnalysis.lib.log.ProgressBar
for full list. Available only forbackend='serial'
backend (Union[str, BackendBase], optional) –
By default, performs calculations in a serial fashion. Otherwise, user can choose a backend:
str
is matched to a builtin backend (one ofserial
,multiprocessing
anddask
), or aMDAnalysis.analysis.results.BackendBase
subclass.Added in version 2.8.0.
n_workers (int) –
positive integer with number of workers (processes, in case of built-in backends) to split the work between
Added in version 2.8.0.
n_parts (int, optional) –
number of parts to split computations across. Can be more than number of workers.
Added in version 2.8.0.
unsupported_backend (bool, optional) –
if you want to run your custom backend on a parallelizable class that has not been tested by developers, by default False
Added in version 2.8.0.
Changed in version 2.2.0: Added ability to analyze arbitrary frames by passing a list of frame indices in the frames keyword argument.
Changed in version 2.5.0: Add progressbar_kwargs parameter, allowing to modify description, position etc of tqdm progressbars
Changed in version 2.8.0: Introduced
backend
,n_workers
,n_parts
andunsupported_backend
keywords, and refactored the method logic to support parallelizable execution.
4.8.1.4.3. Functions
- MDAnalysis.analysis.dssp.dssp.assign(coord: ndarray) ndarray [source]
Assigns secondary structure for a given coordinate array, either with or without assigned hydrogens
- Parameters:
coord (np.ndarray) – input coordinates in either (n, 4, 3) or (n, 5, 3) shape, without or with hydrogens, respectively. Second dimension k represents (N, CA, C, O) atoms coordinates (if k=4), or (N, CA, C, O, H) coordinates (when k=5).
- Returns:
output (n,) array with one-hot labels in C3 notation (‘-’, ‘H’, ‘E’), representing loop, helix and sheet, respectively.
- Return type:
np.ndarray
Added in version 2.8.0.
- MDAnalysis.analysis.dssp.dssp.translate(onehot: ndarray) ndarray [source]
Translate a one-hot encoding summary into char-based secondary structure assignment.
One-hot encoding corresponds to C3 notation: ‘-’, ‘H’, ‘E’ are loop, helix and sheet, respectively. Input array must have its last axis of shape 3:
(n_residues, 3)
or(n_frames, n_residues, 3)
Examples
from MDAnalysis.analysis.dssp import translate import numpy as np # encoding 'HE-' onehot = np.array([[False, True, False], # 'H' [False, False, True], # 'E' [True, False, False]]) # '-' ''.join(translate(onehot)) print(''.join(translate(onehot)))
Running this code produces
HE-
- Parameters:
onehot (np.ndarray) – input array of one-hot encoding in (‘-’, ‘H’, ‘E’) order
- Returns:
array of ‘-’, ‘H’ and ‘E’ symbols with secondary structure
- Return type:
np.ndarray
Added in version 2.8.0.