6.1. Trajectory Readers and Writers — MDAnalysis.coordinates

The coordinates submodule contains code to read, write and store coordinate information, either single frames (e.g., the GRO format) or trajectories (such as the DCD format); see the Table of supported coordinate formats for all formats.

MDAnalysis calls the classes that read a coordinate trajectory and make the data available “Readers”. Similarly, classes that write out coordinates are called “Writers”. Readers and Writers provide a common interface to the underlying coordinate data. This abstraction of coordinate access through an object-oriented interface is one of the key capabilities of MDAnalysis.

6.1.1. Readers

All Readers are based on a ProtoReader class that defines a common Trajectory API and allows other code to interface with all trajectory formats in the same way, independent of the details of the trajectory format itself.

The Universe contains the API entry point attribute Universe.trajectory that points to the actual ProtoReader object; all Readers are accessible through this entry point in the same manner (“duck typing”).

There are three types of base Reader which act as starting points for each specific format. These are:

ReaderBase

A standard multi frame Reader which allows iteration over a single file to provide multiple frames of data. This is used by formats such as TRR and DCD.

SingleFrameReaderBase

A simplified Reader which reads a file containing only a single frame of information. This is used with formats such as GRO and CRD

ChainReader

An advanced Reader designed to read a sequence of files, to provide iteration over all the frames in each file seamlessly. This Reader can also provide this functionality over a sequence of files in different formats.

Normally, one does not explicitly need to select a reader. This is handled automatically when creating a Universe and the appropriate reader for the file type is selected (typically by the file extension but this choice can be overriden with the format argument to Universe).

If additional simulation data is available, it may be added to and read alongside a trajectory using add_auxiliary() as described in the Auxiliary API.

6.1.2. Writers

In order to write coordinates, a factory function is provided (MDAnalysis.coordinates.core.writer(), which is also made available as MDAnalysis.Writer()) that returns a Writer appropriate for the desired file format (as indicated by the filename suffix). Furthermore, a trajectory ProtoReader can also have a method Writer() that returns an appropriate WriterBase for the file format of the trajectory.

In analogy to MDAnalysis.coordinates.core.writer(), there is also a MDAnalysis.coordinates.core.reader() function available to return a trajectory ProtoReader instance although this is often not needed because the Universe class can choose an appropriate reader automatically.

A typical approach is to generate a new trajectory from an old one, e.g., to only keep the protein:

u = MDAnalysis.Universe(PDB, XTC)
protein = u.select_atoms("protein")
with MDAnalysis.Writer("protein.xtc", protein.n_atoms) as W:
    for ts in u.trajectory:
        W.write(protein)

Using the with() statement will automatically close the trajectory when the last frame has been written.

6.1.3. Timesteps

Both Readers and Writers use Timesteps as their working object. A Timestep represents all data for a given frame in a trajectory. The data inside a Timestep is often accessed indirectly through a AtomGroup but it is also possible to manipulate Timesteps directly.

The current Timestep can be accessed through the ts attribute of the trajectory attached to the active Universe:

ts = u.trajectory.ts
ts.positions  # returns a numpy array of positions

Most individual formats have slightly different data available in each Timestep due to differences in individual simulation packages, but all share in common a broad set of basic data, detailed in Timestep API

6.1.4. Supported coordinate formats

The table below lists the coordinate file formats understood by MDAnalysis. The emphasis is on formats that are used in popular molecular dynamics codes. By default, MDAnalysis figures out formats by looking at the extension. Thus, a DCD file always has to end with “.dcd” to be recognized as such unless the format is explicitly specified with the format keyword to Universe or load_new(). A number of files are also recognized when they are compressed with gzip or bzip2 such as “.xyz.bz2”.

Table of supported coordinate formats

Name

extension

IO

remarks

CHARMM, NAMD

dcd

r/w

standard CHARMM binary trajectory; endianness is autodetected. Fixed atoms may not be handled correctly (requires testing). Module MDAnalysis.coordinates.DCD

LAMMPS

dcd

r/w

CHARMM-style binary trajectory; endianness is autodetected. Units are appropriate for LAMMPS. Module MDAnalysis.coordinates.LAMMPS

LAMMPS 1

data

r

Single frame of coordinates read from .data files

LAMMPS 1

lammpsdump

r

Ascii trajectory in atom style

Gromacs

xtc

r/w

Compressed (lossy) xtc trajectory format. Module MDAnalysis.coordinates.XTC

Gromacs

trr

r/w

Full precision trr trajectory. Coordinates and velocities are processed. Module MDAnalysis.coordinates.TRR

XYZ 1

xyz

r/w

Generic white-space separate XYZ format; can be compressed (gzip or bzip2). Module MDAnalysis.coordinates.XYZ

TXYZ 1

txyz, arc

r

Tinker XYZ format. Module MDAnalysis.coordinates.TXYZ

HOOMD 1

gsd

r

HOOMD GSD format (using gsd.hoomd). Module MDAnalysis.coordinates.GSD

GAMESS

gms, log, out

r

Generic semi-formatted GAMESS output log; can be compressed (gzip or bzip2). Module MDAnalysis.coordinates.GMS

AMBER

trj, mdcrd

r

formatted (ASCII) trajectories; the presence of a periodic box is autodetected (experimental). Module MDAnalysis.coordinates.TRJ

AMBER

inpcrd, restrt

r

formatted (ASCII) coordinate/restart file Module MDAnalysis.coordinates.INPCRD

AMBER

ncdf, nc

r/w

binary (NetCDF) trajectories are fully supported with optional netcdf4-python module (coordinates and velocities). Module MDAnalysis.coordinates.TRJ

Brookhaven 1

pdb/ent

r/w

a relaxed PDB format (as used in MD simulations) is read by default; Multiple frames (MODEL) are supported but require the multiframe keyword. Module MDAnalysis.coordinates.PDB

XPDB

pdb

r

Extended PDB format (can use 5-digit residue numbers). To use, specify the format “XPBD” explicitly: Universe(..., format="XPDB"). Module MDAnalysis.coordinates.PDB

PDBQT 1

pdbqt

r/w

file format used by AutoDock with atom types t and partial charges q. Module: MDAnalysis.coordinates.PDBQT

PQR 1

pqr

r/w

PDB-like but whitespace-separated files with charge and radius information. Module MDAnalysis.coordinates.PQR

GROMOS96

1

gro

r/w

basic GROMOS96 format (velocities as well). Only the first frame present will be read. Module MDAnalysis.coordinates.GRO

CHARMM CARD 1

crd

r/w

“CARD” coordinate output from CHARMM; deals with either standard or EXTended format. Module MDAnalysis.coordinates.CRD

DESRES 1

dms

r

DESRES Molecular Structure file format reader. Module MDAnalysis.coordinates.DMS

IBIsCO/YASP

trz

r/w

Binary IBIsCO or YASP trajectories Module MDAnalysis.coordinates.TRZ

MOL2

mol2

r/w

Text-based Tripos molecular structure format MDAnalysis.coordinates.MOL2

DL_Poly 1

config

r

DL_Poly ascii config file MDAnalysis.coordinates.DLPOLY

DL_Poly 1

history

r

DL_Poly ascii history file MDAnalysis.coordinates.DLPOLY

MMTF 1

mmtf

r

Macromolecular Transmission Format MDAnalysis.coordinates.MMTF

1(1,2,3,4,5,6,7,8,9,10,11,12,13,14)

This format can also be used to provide basic topology information (i.e. the list of atoms); it is possible to create a full Universe by simply providing a file of this format: u = Universe(filename)

6.1.5. Trajectory API

The Trajectory API defines how classes have to be structured that allow reading and writing of coordinate files. By following the API it is possible to seamlessly enhance the I/O capabilities of MDAnalysis. The actual underlying I/O code can be written in C or python or a mixture thereof.

Typically, each format resides in its own module, named by the format specifier (and using upper case by convention).

Reader and Writer classes are derived from base classes in MDAnalysis.coordinates.base.

6.1.5.1. History

  • 2010-04-30 Draft [orbeckst]

  • 2010-08-20 added single frame writers to API [orbeckst]

  • 2010-10-09 added write() method to Writers [orbeckst]

  • 2010-10-19 use close() instead of close_trajectory() [orbeckst]

  • 2010-10-30 clarified Writer write() methods (see also Issue 49)

  • 2011-02-01 extended call signature of Reader class

  • 2011-03-30 optional Writer() method for Readers

  • 2011-04-18 added time and frame managed attributes to Reader

  • 2011-04-20 added volume to Timestep

  • 2012-02-11 added _velocities to Timestep

  • 2012-05-24 multiframe keyword to distinguish trajectory from single frame writers

  • 2012-06-04 missing implementations of Reader.__getitem__ should raise TypeError

  • 2013-08-02 Readers/Writers must conform to the Python Context Manager API

  • 2015-01-15 Timestep._init_unitcell() method added

  • 2015-06-11 Reworked Timestep init. Base Timestep now does Vels & Forces

  • 2015-07-21 Major changes to Timestep and Reader API (release 0.11.0)

  • 2016-04-03 Removed references to Strict Readers for PDBS [jdetle]

6.1.5.2. Registry

In various places, MDAnalysis tries to automatically select appropriate formats (e.g. by looking at file extensions). In order to allow it to choose the correct format, all I/O classes must subclass either MDAnalysis.coordinates.base.ReaderBase, MDAnalysis.coordinates.base.SingleFrameReaderBase, or MDAnalysis.coordinates.base.WriterBase and set the format attribute with a string defining the expected suffix. To assign multiple suffixes to an I/O class, a list of suffixes can be given.

To define that a Writer can write multiple trajectory frames, set the multiframe attribute to True. The default is False. To define that a Writer does not support single frame writing the singleframe attribute can be set to False. This is True by default, ie we assume all Writers can also do a single frame.

6.1.5.3. Timestep class

A Timestep instance holds data for the current frame. It is updated whenever a new frame of the trajectory is read.

Timestep classes are derived from MDAnalysis.coordinates.base.Timestep, which is the primary implementation example (and used directly for the DCDReader).

The discussion on this format is detailed in Issue 250

6.1.5.3.1. Methods

__init__(n_atoms, positions=True, velocities=False, forces=False)

Define the number of atoms this Timestep will hold and whether or not it will have velocity and force information

__eq__

Compares a Timestep with another

__getitem__(atoms)

position(s) of atoms; can be a slice or numpy array and then returns coordinate array

__len__()

number of coordinates (atoms) in the frame

__iter__()

iterator over all coordinates

copy()

deep copy of the instance

_init_unitcell

hook that returns empty data structure for the unitcell representation of this particular file format; called from within __init__() to initialize Timestep._unitcell.

6.1.5.3.2. Attributes

n_atoms

number of atoms in the frame

frame

current frame number (0-based)

_frame

The native frame number of the trajectory. This can differ from frame as that will always count sequentially from 0 on iteration, whilst _frame is taken directly from the trajectory.

time

The current system time in ps. This value is calculated either from a time set as the Timestep attribute, or from frame * dt. Either method allows allows an offset to be applied to the time.

dt

The change in system time between different frames. This can be set as an attribute, but defaults to 1.0 ps.

data

A dictionary containing all miscellaneous information for the current Timestep.

positions

A numpy array of all positions in this Timestep, otherwise raises a NoDataError

velocities

If present, returns a numpy array of velocities, otherwise raises a NoDataError

forces

If present, returns a numpy array of forces, otherwise raises a NoDataError

has_positions

Boolean of whether position data is available

has_velocities

Boolean of whether velocity data is available

has_forces

Boolean of whether force data is available

dimensions

system box dimensions (x, y, z, alpha, beta, gamma) (typically implemented as a property because it needs to translate whatever is in the underlying _unitcell attribute. Also comes with a setter that takes a MDAnalysis box so that one can do

Timestep.dimensions = [A, B, C, alpha, beta, gamma]

which then converts automatically to the underlying representation and stores it in Timestep._unitcell.

volume

system box volume (derived as the determinant of the box vectors of dimensions)

aux

namespace for the representative values of any added auxiliary data.

6.1.5.3.3. Private attributes

These attributes are set directly by the underlying trajectory readers. Normally the user should not have to directly access those, but instead should use the attribute above.

_pos

raw coordinates, a numpy.float32 array; X = _pos[:,0], Y = _pos[:,1], Z = _pos[:,2]

_velocities

raw velocities, a numpy.float32 array containing velocities (similar to _pos)

_forces

forces, similar to velocities above.

_unitcell

native unit cell description; the format depends on the underlying trajectory format. A user should use the dimensions attribute to access the data in a canonical format instead of accessing Timestep._unitcell directly.

The method Timestep._init_unitcell() is a hook to initialize this attribute.

6.1.5.4. Trajectory Reader class

Trajectory readers are derived from MDAnalysis.coordinates.base.ReaderBase. Typically, many methods and attributes are overriden.

6.1.5.4.1. Methods

The MDAnalysis.coordinates.DCD.DCDReader class is the primary implementation example.

Mandatory methods

The following methods must be implemented in a Reader class.

__init__(filename, **kwargs)

open filename; other kwargs are processed as needed and the Reader is free to ignore them. Typically, when MDAnalysis creates a Reader from MDAnalysis.Universe it supplies as much information as possible in kwargs; at the moment the following data are supplied:

  • n_atoms: the number of atoms from the supplied topology. This is

    not required for all readers and can be ignored if not required.

__iter__()

allow iteration from beginning to end:

for ts in trajectory:
    print(ts.frame)
close()

close the file and cease I/O

next()

advance to next time step or raise IOError when moving past the last frame

rewind()

reposition to first frame

__entry__()

entry method of a Context Manager (returns self)

__exit__()

exit method of a Context Manager, should call close().

Note

a __del__() method should also be present to ensure that the trajectory is properly closed. However, certain types of Reader can ignore this requirement. These include the SingleFrameReaderBase (file reading is done within a context manager and needs no closing by hand) and the ChainReader (it is a collection of Readers, each already with its own __del__ method).

Optional methods

Not all trajectory formats support the following methods, either because the data are not available or the methods have not been implemented. Code should deal with missing methods gracefully.

__len__()

number of frames in trajectory

__getitem__(arg)

advance to time step arg = frame and return Timestep; or if arg is a slice, then return an iterable over that part of the trajectory.

The first functionality allows one to randomly access frames in the trajectory:

universe.trajectory[314]

would load frame 314 into the current Timestep.

Using slices allows iteration over parts of a trajectory

for ts in universe.trajectory[1000:2000]:
    process_frame(ts)   # do some analysis on ts

or skipping frames

for ts in universe.trajectory[1000::100]:
    process_frame(ts)   # do some analysis on ts

The last example starts reading the trajectory at frame 1000 and reads every 100th frame until the end.

A sequence of indices or a mask of booleans can also be provided to index a trajectory.

The performance of the __getitem__() method depends on the underlying trajectory reader and if it can implement random access to frames. In many cases this is not easily (or reliably) implementable and thus one is restricted to sequential iteration.

If the Reader is not able to provide random access to frames then it should raise TypeError on indexing. It is possible to partially implement __getitem__ (as done on MDAnalysis.coordinates.base.ProtoReader.__getitem__ where slicing the full trajectory is equivalent to MDAnalysis.coordinates.base.ProtoReader.__iter__ (which is always implemented) and other slices raise TypeError.

When indexed with a slice, a sequence of indices, or a mask of booleans, the return value is an instance of FrameIteratorSliced or FrameIteratorIndices.

parse_n_atoms(filename, **kwargs)

Provide the number of atoms in the trajectory file, allowing the Reader to be used to provide an extremely minimal Topology. Must be implemented as either a staticmethod or a classmethod.

Writer(filename, **kwargs)

returns a WriterBase which is set up with the same parameters as the trajectory that is being read (e.g. time step, length etc), which facilitates copying and simple on-the-fly manipulation.

If no Writer is defined then a NotImplementedError is raised.

The kwargs can be used to customize the Writer as they are typically passed through to the init method of the Writer, with sensible defaults filled in; the actual keyword arguments depend on the Writer.

timeseries(atomGroup, [start[,stop[,skip[,format]]]])

returns a subset of coordinate data

6.1.5.4.2. Attributes

filename

filename of the trajectory

n_atoms

number of atoms (coordinate sets) in a frame (constant)

n_frames

total number of frames (if known) – None if not known

ts

the Timestep object; typically customized for each trajectory format and derived from base.Timestep.

units

dictionary with keys time, length, speed, force and the appropriate unit (e.g. ‘AKMA’ and ‘Angstrom’ for Charmm dcds, ‘ps’ and ‘nm’ for Gromacs trajectories, None and ‘Angstrom’ for PDB). Any field not used should be set to None.

format

string that identifies the file format, e.g. “DCD”, “PDB”, “CRD”, “XTC”, “TRR”; this is typically the file extension in upper case.

dt

time between frames in ps; a managed attribute (read only) that computes on the fly skip_timestep * delta and converts to the MDAnalysis base unit for time (pico seconds by default)

totaltime

total length of the trajectory = n_frames * dt

time

time of the current time step, in MDAnalysis time units (ps)

frame

frame number of the current time step (0-based)

aux_list

list of the names of any added auxiliary data.

_auxs

dictionary of the AuxReader instances for any added auxiliary data.

Optional attributes

delta

integrator time step (in native units); hence the “length” of a trajctory frame is skip_timestep*delta time units

compressed

string that identifies the compression (e.g. “gz” or “bz2”) or None.

fixed

bool, saying if there are fixed atoms (e.g. dcds)

periodic

boolean saying if contains box information for periodic boundary conditions unit cell information is stored in attribute dimensions

skip_timestep

number of integrator steps between frames + 1 (i.e. the stride at which the MD simulation was sampled)

6.1.5.5. Trajectory Writer class

Trajectory writers are derived from MDAnalysis.coordinates.base.WriterBase. They are used to write multiple frames to a trajectory file. Every time the write() method is called, another frame is appended to the trajectory.

Typically, many methods and attributes are overriden.

Signature:

W = TrajectoryWriter(filename,n_atoms,**kwargs)
W.write_next_timestep(Timestep)

or:

W.write(AtomGroup)   # write a selection
W.write(Universe)    # write a whole universe
W.write(Timestep)    # same as write_next_timestep()

6.1.5.5.1. Methods

__init__(filename,n_atoms[,start[,step[,delta[,remarks]]]])

opens filename and writes header if required by format

write(obj)

write Timestep data in obj

write_next_timestep([timestep])

write data in timestep to trajectory file

convert_dimensions_to_unitcell(timestep)

take the dimensions from the timestep and convert to the native unitcell representation of the format

close()

close file and finish I/O

__del__()

ensures that close() is called

6.1.5.5.2. Attributes

filename

name of the trajectory file

start, stop, step

first and last frame number (0-based) and step

units

dictionary with keys time, length, speed, force and the appropriate unit (e.g. ‘AKMA’ and ‘Angstrom’ for Charmm dcds, ‘ps’ and ‘nm’ for Gromacs trajectories, None and ‘Angstrom’ for PDB). Any field not used should be set to None.

format

string that identifies the file format, e.g. “DCD”, “PDB”, “CRD”, “XTC”, “TRR”

Optional

ts

Timestep instance

6.1.5.6. Single Frame Writer class

A single frame writer is a special case of a trajectory writer in that it writes only a single coordinate frame to a file, for instance, a pdb or gro file. Unlike trajectory formats, which only contains coordinates, single frame formats contain much more information (e.g. atom and residue names and numbers) and hence it is possible to write selections of atoms in a meaningful way.

Signature:

W = FrameWriter(filename, **kwargs)
W.write(AtomGroup)
W.write(Universe)

The blanket kwargs is required so that one can pass the same kind of arguments (filename and n_atoms) as for the Trajectory writers. In this way, the simple writer() factory function can be used for all writers.

6.1.5.6.1. Methods

__init__(filename, **kwargs)

opens filename for writing; kwargs are typically ignored

write(obj)

writes the object obj, containing a AtomGroup group of atoms (typically obtained from a selection) or Universe to the file and closes the file

Note

Trajectory and Frame writers can be used in almost exactly the same manner with the one difference that Frame writers cannot deal with raw Timestep objects.