7.1. Trajectory Readers and Writers — MDAnalysis.coordinates
The coordinates submodule contains code to read, write and store coordinate
information, either single frames (e.g., the GRO
format) or trajectories (such as the DCD
format);
see the Table of supported coordinate formats for all formats.
MDAnalysis calls the classes that read a coordinate trajectory and make the data available “Readers”. Similarly, classes that write out coordinates are called “Writers”. Readers and Writers provide a common interface to the underlying coordinate data. This abstraction of coordinate access through an object-oriented interface is one of the key capabilities of MDAnalysis.
7.1.1. Readers
All Readers are based on a ProtoReader
class that defines a common Trajectory API and allows other code to
interface with all trajectory formats in the same way, independent of the
details of the trajectory format itself.
The Universe
contains the API entry point
attribute Universe.trajectory
that points to the actual
ProtoReader
object; all Readers are accessible
through this entry point in the same manner (”duck typing”).
There are three types of base Reader which act as starting points for each specific format. These are:
ReaderBase
A standard multi frame Reader which allows iteration over a single file to provide multiple frames of data. This is used by formats such as TRR and DCD.
SingleFrameReaderBase
A simplified Reader which reads a file containing only a single frame of information. This is used with formats such as GRO and CRD
ChainReader
An advanced Reader designed to read a sequence of files, to provide iteration over all the frames in each file seamlessly. This Reader can also provide this functionality over a sequence of files in different formats.
Normally, one does not explicitly need to select a reader. This is handled
automatically when creating a Universe
and
the appropriate reader for the file type is selected (typically by the file
extension but this choice can be overriden with the format
argument to
Universe
).
If additional simulation data is available, it may be added to and read
alongside a trajectory using
add_auxiliary()
as described in
the Auxiliary API.
7.1.2. Writers
In order to write coordinates, a factory function is provided
(MDAnalysis.coordinates.core.writer()
, which is also made available as
MDAnalysis.Writer()
) that returns a Writer appropriate for the desired
file format (as indicated by the filename suffix). Furthermore, a trajectory
ProtoReader
can also have a method
Writer()
that returns an appropriate
WriterBase
for the file format of the
trajectory.
In analogy to MDAnalysis.coordinates.core.writer()
, there is also a
MDAnalysis.coordinates.core.reader()
function available to return a
trajectory ProtoReader
instance although this
is often not needed because the Universe
class can choose an appropriate reader automatically.
A typical approach is to generate a new trajectory from an old one, e.g., to only keep the protein:
u = MDAnalysis.Universe(PDB, XTC)
protein = u.select_atoms("protein")
with MDAnalysis.Writer("protein.xtc", protein.n_atoms) as W:
for ts in u.trajectory:
W.write(protein)
Using the with()
statement will automatically close the trajectory when
the last frame has been written.
7.1.3. Timesteps
Both Readers and Writers use Timesteps as their working object. A
Timestep
represents all data for a given
frame in a trajectory. The data inside a
Timestep
is often accessed indirectly
through a AtomGroup
but it is also possible to
manipulate Timesteps directly.
The current Timestep
can be accessed
through the ts
attribute of
the trajectory attached to the active
Universe
:
ts = u.trajectory.ts
ts.positions # returns a numpy array of positions
Most individual formats have slightly different data available in each Timestep due to differences in individual simulation packages, but all share in common a broad set of basic data, detailed in Timestep API
7.1.4. Supported coordinate formats
The table below lists the coordinate file formats understood by MDAnalysis. The
emphasis is on formats that are used in popular molecular dynamics codes. By
default, MDAnalysis figures out formats by looking at the extension. Thus, a
DCD file always has to end with “.dcd” to be recognized as such unless the
format is explicitly specified with the format keyword to
Universe
or
load_new()
. A number of files are
also recognized when they are compressed with gzip or
bzip2 such as “.xyz.bz2”.
Name |
extension |
IO |
remarks |
---|---|---|---|
CHARMM, NAMD |
dcd |
r/w |
standard CHARMM binary trajectory; endianness is
autodetected. Fixed atoms may not be handled
correctly (requires testing). Module
|
LAMMPS |
dcd |
r/w |
CHARMM-style binary trajectory; endianness is
autodetected. Units are appropriate for LAMMPS.
Module |
LAMMPS [1] |
data |
r |
Single frame of coordinates read from .data files |
LAMMPS [1] |
lammpsdump |
r |
Ascii trajectory in atom style |
Gromacs |
xtc |
r/w |
Compressed (lossy) xtc trajectory format. Module
|
Gromacs |
trr |
r/w |
Full precision trr trajectory. Coordinates and
velocities are processed. Module
|
Gromacs |
tng |
r |
Variable precision tng trajectory. Coordinates,
velocities and forces are processed along with any
additional tng block data requested for reading.
Uses the PyTNG package for tng file reading.
Module |
XYZ [1] |
xyz |
r/w |
Generic white-space separate XYZ format; can be
compressed (gzip or bzip2). Module
|
TXYZ [1] |
txyz, arc |
r |
Tinker XYZ format.
Module |
HOOMD [1] |
gsd |
r |
HOOMD GSD format (using |
GAMESS |
gms, log, out |
r |
Generic semi-formatted GAMESS output log; can be
compressed (gzip or bzip2). Module
|
AMBER |
trj, mdcrd |
r |
formatted (ASCII) trajectories; the presence of a
periodic box is autodetected (experimental).
Module |
AMBER |
inpcrd, restrt |
r |
formatted (ASCII) coordinate/restart file
Module |
AMBER |
ncdf, nc |
r/w |
binary (NetCDF) trajectories are fully supported with
optional netcdf4-python module (coordinates and
velocities). Module |
Brookhaven [1] |
pdb/ent |
r/w |
a relaxed PDB format (as used in MD simulations)
is read by default; Multiple frames (MODEL)
are supported but require the multiframe keyword.
Module |
XPDB |
pdb |
r |
Extended PDB format (can use 5-digit residue
numbers). To use, specify the format “XPBD”
explicitly: |
PDBQT [1] |
pdbqt |
r/w |
file format used by AutoDock with atom types t
and partial charges q. Module:
|
PQR [1] |
pqr |
r/w |
PDB-like but whitespace-separated files with charge
and radius information. Module
|
GROMOS96 |
gro |
r/w |
basic GROMOS96 format (velocities as well). Only
the first frame present will be read.
Module |
GROMOS11 |
trc |
r |
Basic GROMOS11 trajectory format.
Can read positions, box-sizes and timesteps.
Module |
CHARMM CARD [1] |
crd |
r/w |
“CARD” coordinate output from CHARMM; deals with
either standard or EXTended format. Module
|
DESRES [1] |
dms |
r |
DESRES Molecular Structure file format reader.
Module |
IBIsCO/YASP |
trz |
r/w |
Binary IBIsCO or YASP trajectories Module
|
MOL2 |
mol2 |
r/w |
Text-based Tripos molecular structure format
|
DL_Poly [1] |
config |
r |
DL_Poly ascii config file
|
DL_Poly [1] |
history |
r |
DL_Poly ascii history file
|
MMTF [1] |
mmtf |
r |
Macromolecular Transmission Format
|
NAMD |
coor, namdbin |
r/w |
NAMD binary file format for coordinates
|
FHIAIMS |
in |
r/w |
FHI-AIMS file format for coordinates
|
H5MD |
h5md |
r |
H5MD file format for coordinates
|
chemfiles library |
CHEMFILES |
r/w |
interface to chemfiles, see the list of chemfiles
file formats and
|
7.1.5. Trajectory API
The Trajectory API defines how classes have to be structured that allow reading and writing of coordinate files. By following the API it is possible to seamlessly enhance the I/O capabilities of MDAnalysis. The actual underlying I/O code can be written in C or python or a mixture thereof.
Typically, each format resides in its own module, named by the format specifier (and using upper case by convention).
Reader and Writer classes are derived from base classes in
MDAnalysis.coordinates.base
.
7.1.5.1. Registry
In various places, MDAnalysis tries to automatically select appropriate formats
(e.g. by looking at file extensions). In order to allow it to choose the
correct format, all I/O classes must subclass either
MDAnalysis.coordinates.base.ReaderBase
,
MDAnalysis.coordinates.base.SingleFrameReaderBase
,
or MDAnalysis.coordinates.base.WriterBase
and set the
format
attribute with a string
defining the expected suffix. To assign multiple suffixes to an I/O class, a
list of suffixes can be given.
In addition to this, a Reader may define a _format_hint
staticmethod, which
returns a boolean of if it can process a given object. E.g. the
MDAnalysis.coordinates.memory.MemoryReader
identifies itself as
capable of reading numpy arrays. This functionality is used in
MDAnalysis.core._get_readers.get_reader_for()
when figuring out how to
read an object (which was usually supplied to mda.Universe).
To define that a Writer can write multiple trajectory frames, set the
multiframe attribute to True
. The default is False
.
To define that a Writer does not support single frame writing the
singleframe attribute can be set to False
. This is True
by default, ie we assume all Writers can also do a single frame.
7.1.5.2. Timestep class
A Timestep instance holds data for the current frame. It is updated whenever a new frame of the trajectory is read.
Timestep classes are derived from
MDAnalysis.coordinates.timestep.Timestep
, which is the primary
implementation example (and used directly for the DCDReader).
The discussion on this format is detailed in Issue 250
7.1.5.2.1. Methods
__init__(n_atoms, positions=True, velocities=False, forces=False)
Define the number of atoms this Timestep will hold and whether or not it will have velocity and force information
__eq__
Compares a Timestep with another
__getitem__(atoms)
position(s) of atoms; can be a slice or numpy array and then returns coordinate array
__len__()
number of coordinates (atoms) in the frame
__iter__()
iterator over all coordinates
copy()
deep copy of the instance
_init_unitcell
hook that returns empty data structure for the unitcell representation of this particular file format; called from within
__init__()
to initializeTimestep._unitcell
.
7.1.5.2.2. Attributes
n_atoms
number of atoms in the frame
frame
current frame number (0-based)
_frame
The native frame number of the trajectory. This can differ from
frame
as that will always count sequentially from 0 on iteration, whilst_frame
is taken directly from the trajectory.time
The current system time in ps. This value is calculated either from a time set as the Timestep attribute, or from frame * dt. Either method allows allows an offset to be applied to the time.
dt
The change in system time between different frames. This can be set as an attribute, but defaults to 1.0 ps.
data
A dictionary containing all miscellaneous information for the current Timestep.
positions
A numpy array of all positions in this Timestep, otherwise raises a
NoDataError
velocities
If present, returns a numpy array of velocities, otherwise raises a
NoDataError
forces
If present, returns a numpy array of forces, otherwise raises a
NoDataError
has_positions
Boolean of whether position data is available
has_velocities
Boolean of whether velocity data is available
has_forces
Boolean of whether force data is available
dimensions
system box dimensions (x, y, z, alpha, beta, gamma) Also comes with a setter that takes a MDAnalysis box so that one can do
Timestep.dimensions = [A, B, C, alpha, beta, gamma]
which then converts automatically to the underlying representation and stores it in
Timestep._unitcell
.volume
system box volume (derived as the determinant of the box vectors of
dimensions
)aux
namespace for the representative values of any added auxiliary data.
7.1.5.2.3. Private attributes
These attributes are set directly by the underlying trajectory readers. Normally the user should not have to directly access those, but instead should use the attribute above.
_pos
raw coordinates, a
numpy.float32
array;X = _pos[:,0], Y = _pos[:,1], Z = _pos[:,2]
_velocities
raw velocities, a
numpy.float32
array containing velocities (similar to_pos
)_forces
forces, similar to velocities above.
_unitcell
native unit cell description; the format depends on the underlying trajectory format. A user should use the
dimensions
attribute to access the data in a canonical format instead of accessingTimestep._unitcell
directly.The method
Timestep._init_unitcell()
is a hook to initialize this attribute.
7.1.5.3. Trajectory Reader class
Trajectory readers are derived from
MDAnalysis.coordinates.base.ReaderBase
(or from
MDAnalysis.coordinates.base.ProtoReader
if they do not required
Reader.__del__()
method). A special case are SingleFrame readers for
formats that contain only a single coordinate frame. These readers are derived
from a subclass of ProtoReader
named
MDAnalysis.coordinates.base.SingleFrameReaderBase
.
Typically, many methods and attributes are overriden but the ones listed below must be implemented.
See also
See the section on Readers in MDAnalysis.coordinates.base
for implementation details.
7.1.5.3.1. Methods
The MDAnalysis.coordinates.DCD.DCDReader
class is the primary
implementation example.
Mandatory methods
The following methods must be implemented in a Reader class.
__init__(filename, **kwargs)
open filename; other kwargs are processed as needed and the Reader is free to ignore them. Typically, when MDAnalysis creates a Reader from
MDAnalysis.Universe
it supplies as much information as possible in kwargs; at the moment the following data are supplied:- n_atoms: the number of atoms from the supplied topology. This is
not required for all readers and can be ignored if not required.
__iter__()
allow iteration from beginning to end:
for ts in trajectory: print(ts.frame)
Readers will automatically rewind the trajectory to before the initial frame (often by re-opening the file) before starting the iteration. Multi frame readers (see Readers) will also rewind the trajectory after the iteration so that the current trajectory frame is set to the first trajectory frame. Single frame readers do not explicitly rewind after iteration but simply remain on the one frame in the trajectory.
close()
close the file and cease I/O
next()
advance to next time step or raise
IOError
when moving past the last framerewind()
reposition to first frame
__entry__()
entry method of a Context Manager (returns self)
__exit__()
exit method of a Context Manager, should call
close()
.
Note
a __del__()
method should also be present to ensure that the
trajectory is properly closed. However, certain types of Reader can ignore
this requirement. These include the SingleFrameReaderBase
(file reading
is done within a context manager and needs no closing by hand) and the ChainReader
(it is a collection of Readers, each already with its own __del__
method).
Optional methods
Not all trajectory formats support the following methods, either because the data are not available or the methods have not been implemented. Code should deal with missing methods gracefully.
__len__()
number of frames in trajectory
__getitem__(arg)
advance to time step arg = frame and return
Timestep
; or if arg is a slice, then return an iterable over that part of the trajectory.The first functionality allows one to randomly access frames in the trajectory:
universe.trajectory[314]
would load frame 314 into the current
Timestep
.Using slices allows iteration over parts of a trajectory
for ts in universe.trajectory[1000:2000]: process_frame(ts) # do some analysis on ts
or skipping frames
for ts in universe.trajectory[1000::100]: process_frame(ts) # do some analysis on ts
The last example starts reading the trajectory at frame 1000 and reads every 100th frame until the end.
A sequence of indices or a mask of booleans can also be provided to index a trajectory.
The performance of the
__getitem__()
method depends on the underlying trajectory reader and if it can implement random access to frames. All readers in MDAnalysis should support random access.For external custom readers this may not be easily (or reliably) implementable and thus one is restricted to sequential iteration. If the Reader is not able to provide random access to frames then it should raise
TypeError
on indexing. It is possible to partially implement__getitem__
(as done onMDAnalysis.coordinates.base.ProtoReader.__getitem__
where slicing the full trajectory is equivalent toMDAnalysis.coordinates.base.ProtoReader.__iter__
(which is always implemented) and other slices raiseTypeError
.When indexed with a slice, a sequence of indices, or a mask of booleans, the return value is an instance of
FrameIteratorSliced
orFrameIteratorIndices
. See FrameIterators for more details.parse_n_atoms(filename, **kwargs)
Provide the number of atoms in the trajectory file, allowing the Reader to be used to provide an extremely minimal Topology. Must be implemented as either a staticmethod or a classmethod.
Writer(filename, **kwargs)
returns a
WriterBase
which is set up with the same parameters as the trajectory that is being read (e.g. time step, length etc), which facilitates copying and simple on-the-fly manipulation.If no Writer is defined then a
NotImplementedError
is raised.The kwargs can be used to customize the Writer as they are typically passed through to the init method of the Writer, with sensible defaults filled in; the actual keyword arguments depend on the Writer.
timeseries(atomGroup, [start[,stop[,skip[,format]]]])
returns a subset of coordinate data
7.1.5.3.2. Attributes
filename
filename of the trajectory
n_atoms
number of atoms (coordinate sets) in a frame (constant)
n_frames
total number of frames (if known) –
None
if not knownts
the
Timestep
object; typically customized for each trajectory format and derived fromtimestep.Timestep
.units
dictionary with keys time, length, speed, force and the appropriate unit (e.g. ‘AKMA’ and ‘Angstrom’ for Charmm dcds, ‘ps’ and ‘nm’ for Gromacs trajectories,
None
and ‘Angstrom’ for PDB). Any field not used should be set toNone
.format
string that identifies the file format, e.g. “DCD”, “PDB”, “CRD”, “XTC”, “TRR”; this is typically the file extension in upper case.
dt
time between frames in ps; a managed attribute (read only) that computes on the fly
skip_timestep * delta
and converts to the MDAnalysis base unit for time (pico seconds by default)totaltime
total length of the trajectory =
n_frames * dt
time
time of the current time step, in MDAnalysis time units (ps)
frame
frame number of the current time step (0-based)
aux_list
list of the names of any added auxiliary data.
_auxs
dictionary of the
AuxReader
instances for any added auxiliary data.
Optional attributes
delta
integrator time step (in native units); hence the “length” of a trajctory frame is
skip_timestep*delta
time unitscompressed
string that identifies the compression (e.g. “gz” or “bz2”) or
None
.fixed
bool, saying if there are fixed atoms (e.g. dcds)
periodic
boolean saying if contains box information for periodic boundary conditions unit cell information is stored in attribute dimensions
skip_timestep
number of integrator steps between frames + 1 (i.e. the stride at which the MD simulation was sampled)
7.1.5.4. Trajectory Writer class
Trajectory writers are derived from
MDAnalysis.coordinates.base.WriterBase
. They are used to write
multiple frames to a trajectory file. Every time the
write()
method is called,
another frame is appended to the trajectory.
Typically, many methods and attributes are overriden.
Signature:
with TrajectoryWriter(filename, n_atoms, **kwargs) as w:
w.write(Universe) # write a whole universe
or:
w.write(AtomGroup) # write a selection of Atoms from Universe
See also
See the section on Writers in MDAnalysis.coordinates.base
for implementation details.
7.1.5.4.1. Methods
__init__(filename, n_atoms, **kwargs)
Set-up the reader. This may open file filename and may write content to it such as headers immediately but the writer is allowed to delay I/O up to the first call of
write()
.Any
**kwargs
that are not processed by the writer must be silently ignored.write(obj)
write Timestep data in obj
convert_dimensions_to_unitcell(timestep)
take the dimensions from the timestep and convert to the native unitcell representation of the format
close()
close file and finish I/O
__del__()
ensures that close() is called
7.1.5.4.2. Attributes
filename
name of the trajectory file
units
dictionary with keys time, length, speed, force and the appropriate unit (e.g. ‘AKMA’ and ‘Angstrom’ for Charmm dcds, ‘ps’ and ‘nm’ for Gromacs trajectories,
None
and ‘Angstrom’ for PDB). Any field not used should be set toNone
.format
string that identifies the file format, e.g. “DCD”, “PDB”, “CRD”, “XTC”, “TRR”
Optional
ts
Timestep
instance
7.1.5.5. Single Frame Writer class
A single frame writer is a special case of a trajectory writer in that it writes only a single coordinate frame to a file, for instance, a pdb or gro file. Unlike trajectory formats, which only contains coordinates, single frame formats contain much more information (e.g. atom and residue names and numbers) and hence it is possible to write selections of atoms in a meaningful way.
Signature:
W = FrameWriter(filename, **kwargs)
W.write(AtomGroup)
W.write(Universe)
The blanket kwargs is required so that one can pass the same kind of
arguments (filename and n_atoms) as for the Trajectory writers. In
this way, the simple writer()
factory function can be used for all writers.
7.1.5.5.1. Methods
__init__(filename, **kwargs)
opens filename for writing; kwargs are typically ignored
write(obj)
writes the object obj, containing a
AtomGroup
group of atoms (typically obtained from a selection) orUniverse
to the file and closes the file
Note
Trajectory and Frame writers can be used in almost exactly the same
manner with the one difference that Frame writers cannot deal with
raw Timestep
objects.