An efficient and extensible format, library, and API for binary trajectory data from molecular simulations.

SOFTWARE NEWS AND UPDATES

WWW.C-CHEM.ORG

An Efficient and Extensible Format, Library, and API for Binary Trajectory Data from Molecular Simulations

Magnus Lundborg,[a] Rossen Apostolov,[b] Daniel Spångberg,[c] Anders Gärdenäs,[d] David van der Spoel,[d] and Erik Lindahl*[a,e]

Molecular dynamics simulations are an important application in theoretical chemistry, and with the large high-performance computing resources available today the programs also generate huge amounts of output data. In the life sciences in particular, with complex biomolecules such as proteins, simulation projects regularly deal with several terabytes of data. Apart from the need for more cost-efficient storage, it is increasingly important to be able to archive data, to secure its integrity against disk or file transfer errors, to provide rapid access, and to facilitate exchange of data through open interfaces. There is already a whole range of different formats in use, but few if any of them (including our previous ones) fulfill all these goals. To address these shortcomings, we present “Trajectory Next Generation” (TNG), a flexible but highly optimized and efficient file format designed with interoperability in mind. TNG provides both state-of-the-art multiframe compression and a container framework that makes it possible to add new compression algorithms without modifying the programs that use it. TNG will be the new file format in the next major release of the GROMACS package, but it has been implemented as a separate library and API with liberal licensing to enable wide adoption in both academic and commercial codes. © 2013 Wiley Periodicals, Inc.

Introduction

Computer simulations constitute a major and powerful tool for investigating the atomistic behavior of molecular systems, and the rapid growth of computational power means many applications are generating more output data than ever. Both massively parallel single simulations on supercomputers and distributed computing projects relying on ensemble modeling can easily generate tens of terabytes of data for a single project. In the past few decades, dozens of software packages have been developed that implement methods such as molecular dynamics (MD) or Monte Carlo for molecular simulations. Because both these applications rely on inherently stochastic processes to generate sufficient sampling of a complex system, this data expansion is a natural consequence of advances in the field, and if anything the growth rate is increasing. In many cases, the transfer, storage, analysis, archival, and postsimulation manipulation of data have become just as challenging as the simulation itself. Both for our own molecular simulation code and in the community, there is a shortage of extensible file formats that combine the highest compression possible, quick random access, a single easily exchanged file containing all necessary information, strong integrity checks, and the ability to, for example, validate data with modern digital signatures. There are many universal data exchange formats that are highly flexible (and even preferable in some cases), but molecular simulation data also have very special requirements and lossy compression possibilities that make more specific formats attractive. Development and adoption of a common, well-designed standard for data storage in particle simulations will thus bring great benefits to users and developers alike.

In this work, we present the specifications of a new strictly specified format for storage of data obtained from molecular simulations: Trajectory Next Generation (TNG). The standard builds on a container-payload framework. It is flexible, extensible, and optimized for parallel I/O and multiframe compression, and it aims to address the shortcomings of existing formats such as the XTC format previously used in GROningen MAchine for Chemical Simulations (GROMACS). Both the TNG application programming interface (API) and its implementation are open source and released under the revised BSD license, which in particular makes it possible to use TNG as a linked library in commercial codes without any requirements on the license of those codes.

Journal of Computational Chemistry 2014, 35, 260–269

DOI: 10.1002/jcc.23495

[a] M. Lundborg, E. Lindahl, Department of Theoretical Physics and Swedish e-Science Research Center, Royal Institute of Technology, Science for Life Laboratory, Box 1031, SE-171 21 Solna, Sweden
[b] R. Apostolov, PDC Center for High Performance Computing, Royal Institute of Technology, Teknikringen 14, SE-100 44 Stockholm, Sweden and Science for Life Laboratory, Box 1031, SE-171 21 Solna, Sweden
[c] D. Spångberg, Department of Chemistry - Ångström Laboratory, Uppsala Multidisciplinary Center for Advanced Computational Methods (UPPMAX), Uppsala University, Box 523, SE-751 20 Uppsala, Sweden
[d] A. Gärdenäs, D. van der Spoel, Department of Cell and Molecular Biology, Uppsala Center for Computational Chemistry, Uppsala University, Box 596, SE-751 24 Uppsala, Sweden
[e] E. Lindahl, Department of Biochemistry and Biophysics, Center for Biomembrane Research, Stockholm University, 106 91, Stockholm, Sweden
E-mail: [email protected]
Contract grant sponsor: ERC award 209825; Contract grant sponsor: ScalaLife project; Contract grant number: EU contract INFSO-RI-261523; Contract grant sponsor: Swedish e-Science Research Center; Contract grant sponsor: Swedish research council; Contract grant number: 2010491, 2010-5107
© 2013 Wiley Periodicals, Inc.

WWW.CHEMISTRYVIEWS.COM

MD Applications and Their File Formats

The MD method is a rather intuitive direct implementation of Newton’s equations of motion and is applicable to a wide range of molecular systems, which has led to the emergence of many software applications that implement the method. Wikipedia lists over 40[1] packages, although the real number of existing codes is likely to be much larger; most scientists in the field have written their own implementation at some point. Many of these applications adopt their own formats for storage of trajectory data. This is natural, as the codes have had slightly different goals and historically storage was never a significant problem, but in our opinion none of them (including GROMACS) fulfills all the criteria we need from a modern, simple day-to-day file format. Below we give an overview of some of the most popular applications, the formats they use, and the reason why (in our opinion) each format was not sufficient for our use case.

AMBER formats

AMBER[2,3] (Assisted Model Building with Energy Refinement) is both a software package and a collection of force fields for molecular mechanics modeling. It is one of the most widely used programs, and many research groups build additional functionality on top of the core distribution. Trajectory data from AMBER simulations are stored in the NetCDF[4,5] (Network Common Data Form) format developed by Unidata.[6] In fact, the format simply specifies a set of conventions that have to be used along with the NetCDF libraries. Those libraries can represent arbitrary array-based data and have bindings for many languages, such as C/C++, Fortran (F77 and F90), Java, and Python. The NetCDF libraries are portable and the format is extensible, and a huge advantage is that arbitrary subsets of data can be read into the many analysis programs that support NetCDF. It also supports basic lossless compression of data using Zlib. However, the NetCDF library is large; the source code is in fact even larger than AMBER itself (see Table 1). A few megabytes of source code is not a problem for storage or transfer today, but with GROMACS’ requirements to run on nonstandard platforms such as PlayStation 3 (natively, not Linux), Field-Programmable Gate Arrays, and future Application-Specific Integrated Circuit chips, we might need to support every line of such libraries ourselves, which is not realistic. The generality also makes it difficult to use molecular simulation-specific lossy compression without cancelling the advantages of NetCDF.

Table 1. Approximate sizes of code bases of three major MD programs and some data or trajectory I/O libraries (note that NetCDF and HDF5 are obviously much more general than TNG).

Software    Version    Size (MB)
AMBER       11         4
GROMACS     4.6.3      11
NAMD        2.8        8
NetCDF                 5
HDF5                   8
TNG         1.4        0.2

The size of AMBER includes only the compute engine sources; tests, benchmarks, etc. are excluded.

CHARMM formats

CHARMM[7] was one of the very first packages for molecular simulation and has very broad adoption, with hundreds of modules implemented. Trajectory data are stored in the DCD format: binary FORTRAN files with atom coordinates (and optionally velocities and forces). The format does not use compression and does not support storage of additional data types. The files are not transportable between big-endian and little-endian computer architectures and require additional tools for conversion,[8] but the DCD format, or slight variations of it, has been adopted by a large number of codes, including, for example, the NAMD[9,10] simulation package (focused on high-performance massively parallel simulation) and X-PLOR[11] (structure refinement). It is also supported by a wide range of analysis and display programs.

Desmond formats

Desmond[12,13] is a software package developed to perform high-speed MD simulations of biological systems on conventional commodity clusters. Desmond stores trajectory data not in a single file but in a collection of files, each containing a number of trajectory frames.[14] When the number of files is large, they are organized in a directory structure. Metadata about the simulation are stored in a separate metadata file. The topology of the molecular system is saved separately. Various tools are used to inspect and modify the data in the files, as the internal binary structure of the files is difficult to interpret. This is a very powerful choice for the very largest simulations, when the data are only handled in the program itself and are not realistic to compress on the fly. However, for more common use cases, it is a bit of a hurdle that there is no single file that can be transferred, and it is not trivial to adopt the Desmond format in other applications as there is no open library/API for it.

GROMACS formats

GROMACS[15,16] is an MD package mainly used for biomolecular simulations. It focuses on achieving high performance and portability across hardware systems, and for full disclosure it should be noted that it is developed by our team. GROMACS uses two kinds of formats for storage of output data. TRR is a full-precision portable binary data format, while XTC[17] is a lossy compression format. The latter is available as part of the xdrfile library. The XTC format is portable and offers very good compression of the data. In fact, to the best of our knowledge, it is more efficient than any other available format: if an XTC file is compressed with gzip, the file size increases by a small fraction. However, it has a number of drawbacks. There is no possibility to store arbitrary user data or metadata, there is no indexing for fast searches (which is complicated when the size of frames varies with lossy compression), and the topology of the simulated system has to be read from a separate file. In addition, both the TRR and XTC formats were developed in the 1990s and are limited to 2^32 particles.

LAMMPS formats

LAMMPS[18,19] (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical MD code. It is designed to be highly flexible and adaptable for simulations not only of biomolecular systems but also of polymers, solid-state materials, and coarse-grained/mesoscopic systems, and it supports a large number of very special potentials. Due to the more universal nature of LAMMPS compared to other MD packages, output data are stored in ASCII text files with a very flexible format that allows detailed description of arbitrary kinds of data. The data are not compressed, although gzipped files can be processed. From our point of view, the LAMMPS file format is thus impractical for storage of huge amounts of trajectory data.

Requirements

Because none of the present formats met the needs we had for a future standard file format, we have developed a new container-type format named Trajectory Next Generation (hereafter called TNG) that fulfills the following requirements:

- Fully architecture-independent, regarding both endianness and the ability to mix single/double precision trajectories and I/O libraries.
- Self-sufficient: it should not require any other files for reading, and all the data should be contained in a single file for easy transport.
- Small footprint and high portability of the library, which should be easy for third parties to bundle or even to compile as a built-in part of an MD package.
- Built-in support for storage of different data types, for example, arbitrary vectors or floats.
- Custom data storage. The format should be extensible (say, if a user wants to store distance restraint statistics or something else), and other versions of the library should be able to skip blocks they cannot interpret.
- Support for future compression algorithms. Inspired by current multimedia formats, we want a format container that is easy to read with a standard API, but it should be possible to alter the payload itself under the hood as new compression algorithms are implemented.
- To improve over XTC, temporal compression of data similar to multimedia formats, that is, compressing several frames as a block.
- Integrity check of data blocks using hashes. (Version 1 only supports MD5 sums in the block headers, but future versions will support digital signatures in each block.)
- Digital signatures to make it possible to guarantee which program and user generated a particular trajectory. This is critical to ensure data integrity, for example, in distributed computing projects where users receive credits for the amount of simulations they produce.
- Possibility to store extended metadata with full information about simulated systems and conditions.
- Random access, using file pointers to quickly locate frames, even when the frames are of nonconstant size due to high compression.
- Efficient parallel I/O.

There are many other features that could be implemented in a trajectory, in particular a complete description of the topology of the system, but our aim here is to create a greatest common divisor that can be used in lots of programs with very little work rather than aiming to describe every single component of any simulation. To fulfill the above requirements, TNG has been developed according to the following specifications, as released in version 1 of the format. Whereas a normal developer should likely just use the public API and open library, the specification is intended to make it possible to reimplement support for the format from scratch.

Methods

Specifications

A TNG file is made up of a number of data blocks. Numerical values for fields can be integers (64 bit), floats (32 bit), or doubles (64 bit), and are stored in big- or little-endian byte order (constant throughout the file, with automatic conversion if the hardware architecture changes). Some flags are also stored as a single byte (8 bits). When creating a TNG file, the endianness defaults to that of the hardware creating the file. This is a change from XTC, which always used network (big) endian. The reason is that most current architectures are little endian, and in benchmarking we realized that an unacceptably large part of the I/O time during both writing and reading was spent on entirely unnecessary endian conversions. TNG still performs automatic transparent conversion to and from the native endianness for all numerical fields, but only when necessary. Strings are encoded as UTF-8 and limited to 1024 bytes, including the terminating null character. Empty strings consist of only the terminating null character. This limit is deliberate, as it makes it possible to use the format even in languages that do not support runtime memory allocation, and arbitrary-length strings can still be stored as an array of strings. Each block contains an MD5 hash of the block contents to verify the integrity of the data, enabling checks against corruption during file transfer or hard disk failures. Each block contains the following fields as a header:

- Size of the block header (integer).
- Size of the block contents (except header) (integer).
- Block identifier (integer).
- MD5 hash (16 characters).
- Block name (string).
- Version of the block with respect to the block identifier (integer).
- ID of alternative hash (integer) (if not using MD5).
- Length of alternative hash in bytes (integer).
- Alternative hash (number of bytes determined by the previous value).
- ID of digital signature (integer).
- Length of digital signature in bytes (integer).
- Digital signature (number of bytes determined by the previous value).

The header is directly followed by the block contents, which vary from block to block. Please note that the current version of the API does not support an alternative hash (only MD5) or a digital signature. This will be added soon.

Description of blocks

There are two general kinds of blocks in a TNG file, namely information, or metadata, blocks (see Table 2), and data blocks (see Table 3). The information blocks describe the TNG file and divide it into different sections, whereas the data blocks reflect the state of the simulation, or of the particles, at a specific point in time or in general.

Table 2. Information blocks containing general information about the file and also dividing the file into frame sets, that is, chunks of frames and their related data blocks.

Block name              Block identifier      Block description
GENERAL INFO            0x0000000000000000    Information about the file creation, internal file pointers to first/last frame sets, and so forth.
MOLECULES               0x0000000000000001    Descriptions of molecules, chains, residues and atoms.
TRAJECTORY FRAME SET    0x0000000000000002    Marks the beginning of a frame set (a sequence of frames).
PARTICLE MAPPING        0x0000000000000003    A mapping between the particle numbering inside the frame set and the molecule particle numbers.

Initially, the most obvious way of storing data would appear to be a single block for each timestep, but to enable temporal compression of data, and to access data from a specific frame more quickly, we have designed the trajectory to be divided into frame sets, each containing a number of frames or timesteps. The numbering of frames is zero-based, meaning that the first frame is frame index 0. The TNG format does not strictly define what an individual frame has to be. From the file point of view, it does not matter if the time between frames is set according to the data with the highest output frequency or if there is, for example, one frame per MD simulation step (which would mean that not all frame indices are present in the trajectory). The latter is recommended, but as long as the Time per frame in TRAJECTORY FRAME SET blocks and the stride length in data blocks are properly set, any scheme can be used for storing the data, and it will always be possible to read it back.

A TRAJECTORY FRAME SET block marks the beginning of a frame set, and all subsequent blocks until the next TRAJECTORY FRAME SET block are considered part of that frame set. Data blocks that contain data that do not change during the simulation (frame-independent data) should be placed before the first TRAJECTORY FRAME SET block, whereas frame-dependent data blocks are placed inside frame sets (i.e., after a TRAJECTORY FRAME SET block). Data can be particle specific, such as POSITIONS, VELOCITIES, or PARTIAL CHARGES, or general, for example, BOX SHAPE. Frame sets can also be divided by separate particle mapping blocks, allowing parallel writing of data from distributed simulations where particles can move between nodes. In that case, there are one or more PARTICLE MAPPING blocks in the frame set, each followed by the data blocks related to those particles. If parallel writing is not required, there is no need for more than one PARTICLE MAPPING block, and usually none whatsoever. It is also possible to change the number of molecules from one frame set to another, which is necessary to simulate grand canonical ensembles. In that case, the number of frames per frame set must be set to 1 (if the number of particles can change every frame), which means that multiframe compression cannot be used. Figure 1 provides an overview of the file structure.
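As a rough illustration of the block layout given in the Specifications, the following Python sketch writes and reads blocks whose header carries the listed fields in the listed order. It is only a sketch under stated assumptions: the function names `pack_block` and `iter_blocks` are hypothetical, little-endian byte order is hard-coded, the alternative hash and digital signature are written as "ID 0, length 0", and the exact byte layout is our reading of the field list, so the output is not claimed to be byte-compatible with files produced by the TNG library.

```python
import hashlib
import struct


def pack_block(block_id: int, name: str, contents: bytes, version: int = 1) -> bytes:
    """Serialize one block: header fields in the order listed in the text, then contents."""
    name_bytes = name.encode("utf-8") + b"\x00"  # strings are null-terminated UTF-8
    if len(name_bytes) > 1024:
        raise ValueError("strings are limited to 1024 bytes including the null")
    md5 = hashlib.md5(contents).digest()  # 16-byte hash of the block contents
    # Every header field that follows the two size fields.
    tail = (
        struct.pack("<q", block_id)
        + md5
        + name_bytes
        + struct.pack("<q", version)
        + struct.pack("<qq", 0, 0)  # alternative hash: ID 0, length 0 (assumed "unused")
        + struct.pack("<qq", 0, 0)  # digital signature: ID 0, length 0 (assumed "unused")
    )
    header_size = 16 + len(tail)  # two 64-bit size fields plus the rest of the header
    return struct.pack("<qq", header_size, len(contents)) + tail + contents


def iter_blocks(buf: bytes):
    """Yield (block_id, name, contents, md5_ok) for every block found in buf."""
    pos = 0
    while pos < len(buf):
        header_size, contents_size, block_id = struct.unpack_from("<qqq", buf, pos)
        md5 = buf[pos + 24 : pos + 40]
        name_end = buf.index(b"\x00", pos + 40)
        name = buf[pos + 40 : name_end].decode("utf-8")
        start = pos + header_size
        contents = buf[start : start + contents_size]
        yield block_id, name, contents, hashlib.md5(contents).digest() == md5
        # The two size fields are enough to jump to the next block, so a reader
        # can step over block types it does not know how to interpret.
        pos = start + contents_size


if __name__ == "__main__":
    data = pack_block(0x0, "GENERAL INFO", b"abc") + pack_block(
        0x10000001, "POSITIONS", struct.pack("<3f", 1.0, 2.0, 3.0)
    )
    for block_id, name, contents, ok in iter_blocks(data):
        print(hex(block_id), name, len(contents), "MD5 ok" if ok else "MD5 MISMATCH")
```

Because both sizes sit at the start of the header, the reader never needs to understand a block's contents to move past it, which is the property the custom data storage requirement relies on.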

The GENERAL INFO block is the only mandatory block and contains the following data fields:

- Name and version of the program used to perform the simulation (on file creation) (string).
- Name and version of the program used when last modifying the file (string).
- Name of the person who created the file (string).
- Name of the person who last modified the file (string).
- Name of computer/other info where the file was created (string).
- Name of computer/other info where the file was last modified (string).
- PGP signature of the user who created the file (string).
- PGP signature of the user who last modified the file (string).
- Name of the force field used to perform the simulation (string).
- Time of initial file creation, in seconds since 1970-01-01 00:00:00 UTC (also called Unix seconds) (integer). The 64-bit representation makes sure that the format can be used for another 500 billion years.
- Time of completing the simulation, in seconds since 1970-01-01 00:00:00 UTC (integer).
- Use variable number of atoms in frames (8 bit flag). If set to TRUE, the number of each molecule is specified in the TRAJECTORY FRAME SET block.
- Number of frames in each frame set (integer). This is the expected number of frames in each frame set, but it does not have to be constant. It is OK to have frame sets with fewer or more frames, for example, after concatenating multiple trajectory files. This avoids the need to recompress all data after a concatenation, but it means that searching for a specific frame might need a few more steps between frame sets. For simulations using a grand canonical ensemble, it is best to set this to 1 so that the number of atoms in the frame sets can be updated regularly.
- Pointer to the file position of the beginning of the first TRAJECTORY FRAME SET block (integer).
- Pointer to the file position of the beginning of the last TRAJECTORY FRAME SET block (integer). (Updated when finishing writing the trajectory file; otherwise set to −1.)
- Length of steps (number of TRAJECTORY FRAME SET blocks) for medium stride pointers (integer). By default, it is set to 100 TRAJECTORY FRAME SET blocks.
- Length of steps (number of TRAJECTORY FRAME SET blocks) for long stride pointers (integer). By default, it is set to 10 000 TRAJECTORY FRAME SET blocks.
- Exponential of unit used for distance measurements (integer). By default, it is set to −9 (i.e., nm).

Table 3. Data blocks that can contain general data or particle related data.

Block name               Block identifier      Block description
BOX SHAPE                0x0000000010000000    Dimensions of the periodic box (nine values per frame).
POSITIONS                0x0000000010000001    Particle coordinates (three values per particle and frame).
VELOCITIES               0x0000000010000002    Particle velocities (three values per particle and frame).
FORCES                   0x0000000010000003    Forces on particles (three values per particle and frame).
PARTIAL CHARGES          0x0000000010000004    Partial charges of particles (one value per particle and frame).
FORMAL CHARGES           0x0000000010000005    Formal charges of particles (one value per particle and frame).
B-FACTORS                0x0000000010000006    B-factors (temperature factors) of particles (one value per particle and frame).
ANISOTROPIC B-FACTORS    0x0000000010000007    Anisotropic B-factors of particles (six or nine values per particle and frame).
OCCUPANCY                0x0000000010000008    Occupancy of particles (one value per particle and frame).

Block name is a descriptive name of the block, whereas the block identifier is a unique enumeration of the block.

Figure 1. Schematic overview of the TNG file structure. The blocks with a dashed outline can be any number of data blocks. “Constant data” represents data blocks containing data that do not change during the trajectory. “Variable data” contains data that are modified in the trajectory, for example, particle positions. FRAME SET represents TRAJECTORY FRAME SET blocks and PARTICLE MAP. corresponds to PARTICLE MAPPING blocks. (a) and (b) show files without and with PARTICLE MAPPING blocks, respectively.

The MOLECULES block contains a description of the molecules in the system. The number of copies of each molecule can change during the simulation, but the composition must be constant. The block contains the following data fields:


- Number of molecules (integer).
- For each molecule:
  - Molecule ID (integer).
  - Molecule name (string).
  - Quaternary structure, for example, one means monomeric, four means tetrameric, and so forth (integer).
  - Number of molecules of this kind, only used if not using “variable number of atoms” in the GENERAL INFO block (integer).
  - Number of chains in the molecule (integer).
  - Number of residues in the molecule (integer).
  - Number of atoms in the molecule (integer).
  - For each chain:
    - Chain ID (integer). Unique in each molecule.
    - Chain name (string).
    - Number of residues in the chain (integer).
    - For each residue:
      - Residue ID (integer). Unique in the chain (or in the molecule if there is no chain).
      - Residue name (string).
      - Number of atoms in the residue (integer).
      - For each atom (in the molecule if there is no residue):
        - Atom ID (integer). Unique in the molecule.
        - Atom name (string).
        - Atom type (string).
  - Number of bonds in the molecule (integer).
  - For each bond:
    - From Atom ID (integer).
    - To Atom ID (integer).

This first version of the TNG format is somewhat biomolecule centric, mainly because this is the area in which we have both most experience and needs. This means that chains and residues need to be recorded without names to, for example, store crystals of atoms that do not use any residue or chain information. However, this is a good example of the extensibility of the format: the MOLECULES block of future versions will be made more general, but the API will handle different versions without user intervention.

TRAJECTORY FRAME SET blocks indicate the beginning of a frame set and divide the trajectory data into smaller chunks. The pointers enable fast access to any frame set, or frame, in a large trajectory (see Fig. 2). A TRAJECTORY FRAME SET block contains the following fields:

- Number of the first frame of the frame set (integer).
- Number of frames in the frame set (integer).
- If the Variable number of atoms flag in the GENERAL INFO block is set to TRUE: array of integers specifying the count of each molecule type. The molecule types are listed in the MOLECULES block and should be listed in the same order here. This is used, for example, for simulations using a grand canonical ensemble (in which case the number of frames in each frame set should be 1).
- Pointer to the next TRAJECTORY FRAME SET block (integer).
- Pointer to the previous TRAJECTORY FRAME SET block (integer).
- Medium stride pointer to the next, for example, 100th TRAJECTORY FRAME SET block (integer). (Medium stride length specified in the GENERAL INFO block.)
- Medium stride pointer to the previous, for example, 100th TRAJECTORY FRAME SET block (integer).
- Long stride pointer to the next, for example, 10 000th TRAJECTORY FRAME SET block (integer). (Long stride length specified in the GENERAL INFO block.)
- Long stride pointer to the previous, for example, 10 000th TRAJECTORY FRAME SET block (integer).
- Time stamp (in seconds) of first frame in frame set (double).
- Time (in seconds) per frame (double).

Figure 2. Pointer structure of a TNG file. From the GENERAL INFO block, there are pointers to the first and last frame set. In each TRAJECTORY FRAME SET block, there are pointers to the next and previous frame sets and also certain numbers of frame sets ahead and back (determined by the length of steps of medium and long stride pointers, set in the GENERAL INFO block). The numbers indicate the frame set number.

PARTICLE MAPPING blocks contain a mapping between the particle numbers in the molecular system and the order in which the particles are written in the frame set. If no PARTICLE MAPPING block is present in a frame set, the particle numbering is the same as the particle numbering in the molecular system. If there is at least one PARTICLE MAPPING block in a frame set, all particles with data stored in the frame set must be present in a PARTICLE MAPPING block. This in turn makes it possible to store only data of specific particles, for example, charges in a protein but not for any water molecules. A PARTICLE MAPPING block contains the following fields:

- Number of first particle (particle number as stored in the molecular system, zero-based numbering) (integer).
- Number of particles in this particle mapping block (integer).
- Array of particle numbers. Each value is the number of the particle in the molecular system corresponding to the particle number as stored in the trajectory (integer).
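The role of the array of particle numbers can be illustrated with a small Python sketch. It assumes, as our reading of the field list, that the i-th value in the array names the molecular-system particle to which the i-th entry of the following data block belongs. The function name `to_system_order` and the representation of a frame set as a list of (mapping, values) pairs are hypothetical and chosen only for this example; the sketch does not use the TNG library.

```python
def to_system_order(n_system_particles, chunks):
    """Scatter per-particle values from trajectory order into molecular-system order.

    chunks is a list of (mapping, values) pairs, one per PARTICLE MAPPING block:
    mapping[i] is the particle number in the molecular system of values[i].
    Particles that are not covered by any mapping block stay None.
    """
    out = [None] * n_system_particles
    for mapping, values in chunks:
        if len(mapping) != len(values):
            raise ValueError("mapping and data block must cover the same particles")
        for system_number, value in zip(mapping, values):
            if out[system_number] is not None:
                raise ValueError(f"particle {system_number} appears in two mapping blocks")
            out[system_number] = value
    return out


if __name__ == "__main__":
    # Two nodes wrote their own particles in parallel; particles 1 and 4
    # (say, water molecules) were not stored at all.
    node_a = ([2, 0], [-0.3, 0.5])  # partial charges of system particles 2 and 0
    node_b = ([3], [0.1])           # partial charge of system particle 3
    print(to_system_order(5, [node_a, node_b]))
```

Leaving unmapped particles as None mirrors the case described above in which data are stored for only a subset of the particles.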

Data blocks are used for storing generic data. Frame-dependent data blocks are located after the TRAJECTORY FRAME SET block to which they belong. Frame- and particle-dependent data blocks should follow the relevant PARTICLE MAPPING block (if any PARTICLE MAPPING block is used). Data blocks contain the following fields:

- Data type flag (8 bit flag). 0 = character/string data, 1 = integer data, 2 = float data (32 bit), and 3 = double data (64 bit).
- Dependency flag (8 bit flag). 1 = frame-dependent, 2 = particle-dependent. Can be combined, that is, 3 = frame- and particle-dependent.
- Sparse data flag (8 bit flag) to signify if not all frames in the frame set have data entries in this data block; for example, energies and positions might be saved at different intervals, meaning that at least one of them would be saved as sparse data. Only present if the data are frame-dependent.
- Number of values (integer). If the data are frame-dependent, this is the number of values per frame. If the data are particle-dependent, this is the number of values per particle (per frame).
- ID of the CODEC used to compress the positions (integer).
- Multiplier for integers to obtain the appropriate floating point number, for compressed frames (double). Only present if the above CODEC ID is >0 and if the data type is double or float.
- If using sparse data, the following fields are required:
  - Number of first frame containing data (integer).
  - Number of frames between data points (integer).
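The flag values in the list above can be decoded as in the following Python sketch. The flag values are taken from the list; the function names and the interpretation of "number of frames between data points" as the difference between the frame numbers of two consecutive entries are assumptions made for this illustration.

```python
FRAME_DEPENDENT = 1     # dependency flag value given in the field list
PARTICLE_DEPENDENT = 2  # dependency flag value given in the field list
DATA_TYPES = {0: "char", 1: "int", 2: "float", 3: "double"}


def describe_block(data_type_flag: int, dependency_flag: int) -> dict:
    """Translate the two 8-bit flags into a readable description."""
    return {
        "type": DATA_TYPES[data_type_flag],
        "frame_dependent": bool(dependency_flag & FRAME_DEPENDENT),
        "particle_dependent": bool(dependency_flag & PARTICLE_DEPENDENT),
    }


def frames_with_data(first_frame_with_data, frames_between, first_frame_of_set, n_frames_in_set):
    """List the frame numbers of one frame set that have an entry in a sparse data block."""
    end = first_frame_of_set + n_frames_in_set  # one past the last frame of the set
    return [
        frame
        for frame in range(first_frame_with_data, end, frames_between)
        if frame >= first_frame_of_set
    ]


if __name__ == "__main__":
    # POSITIONS stored as floats: both frame- and particle-dependent.
    print(describe_block(2, 3))
    # A frame set holding frames 100..111, with data written every 5th frame from 102.
    print(frames_with_data(102, 5, 100, 12))
```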


Particle-dependent data blocks contain the following fields:

- Number of the first particle in the data block (integer). This must be the same as in the preceding PARTICLE MAPPING block, if present. This allows writing data starting from a certain atom in the molecular system.
- Number of particles in the data block (integer). This must be the same as in the preceding PARTICLE MAPPING block, if present.
- Continuous field of data.

Data stored in data blocks (see Table 3) can be compressed. The first temporal multiframe trajectory NG compression format (which we term TNG-MF1 to avoid confusion with the container format)[20] can be used for efficiently compressing positions and velocities. Zlib compression is included as a lossless alternative that is not as efficient but can be used for compressing any kind of data.

In addition to the data block types mentioned in Table 3, any number (limited only by the 64-bit block identifier) of custom data blocks can be added and used to store strings or numerical data. To identify blocks, we use a hexadecimal 64-bit representation, where the upper 32 bits denote the developer, or group, responsible for the data block type, and the lower 32 bits denote the type of data. User IDs in the range 0x0–0x0FFFFFFF are reserved for current and future internal TNG user specifications. A second range, 0x10000000–0x1FFFFFFF, is reserved for program-specific, or other registered, users. An official user ID in this range can be reserved by contacting the authors, and in the future by registering through a web page that can also be queried. This way it will be possible for anybody to identify the author of an arbitrary block, and it is also guaranteed that registered user IDs will never clash. Finally, the user ID range 0x20000000–0xFFFFFFFF is unofficial, freely available, and can be used by anybody without registering, but on the other hand, it will not be possible to predict who the user of a block is.
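The split of the 64-bit block identifier into a user ID and a data ID can be written out in a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the range labels simply restate the user ID ranges given in the text.

```python
def make_block_id(user_id: int, data_id: int) -> int:
    """Combine a 32-bit user ID (upper half) and a 32-bit data ID (lower half)."""
    if not (0 <= user_id <= 0xFFFFFFFF and 0 <= data_id <= 0xFFFFFFFF):
        raise ValueError("user ID and data ID must each fit in 32 bits")
    return (user_id << 32) | data_id


def split_block_id(block_id: int):
    """Return (user_id, data_id) for a 64-bit block identifier."""
    return (block_id >> 32) & 0xFFFFFFFF, block_id & 0xFFFFFFFF


def user_id_class(user_id: int) -> str:
    """Classify a user ID according to the three ranges described in the text."""
    if user_id <= 0x0FFFFFFF:
        return "reserved for TNG"
    if user_id <= 0x1FFFFFFF:
        return "registered"
    return "unofficial"


if __name__ == "__main__":
    # The POSITIONS identifier from Table 3.
    user_id, data_id = split_block_id(0x0000000010000001)
    print(hex(user_id), hex(data_id), user_id_class(user_id))
```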
Similarly, the data ID part of the block ID is divided into four ranges. The range 0x0–0x0FFFFFFF denotes reserved information blocks and the range 0x10000000–0x1FFFFFFF reserved data blocks. For registered user IDs, all blocks in this range should also be registered so they can be queried. In contrast, the ranges 0x20000000–0x2FFFFFFF and 0x30000000–0x3FFFFFFF can be used freely for nonregistered information and data blocks, respectively.

Compression of trajectory data

The compression algorithms by Spångberg et al.[20] are included in the TNG library. These compression algorithms (now called TNG-MF1) work on the atomic positions and velocities. The single or double precision required for conservation of energy when performing the MD simulations is usually unnecessarily high for the properties determined directly from the trajectory, such as distribution functions and correlation functions. Therefore, the compression algorithms use user-configurable reduced precision, by quantizing the data into integers. Atoms close in order in each trajectory frame are often close in space, so the differences in coordinates are often smaller than the absolute values of the coordinates. Also, a given atom typically does not move very far between two frames. In the case of velocities, the values do not change much between two frames, given that frames are stored often. Therefore, both spatial and temporal compression can be used to reduce the size of the trajectory data. Although this is a very efficient compression scheme when the properties of the data are known, it is also orthogonal to the concepts used in general data formats such as NetCDF or HDF5, which is one of the main reasons for an MD-specific format.

There are four basic algorithms used to compress either absolute values of coordinates or velocities or differences of coordinates or velocities: variable number of bits storage of individual values; storing three values (x, y, and z) together as a single number with variable base; grouping several consecutive atoms close in space together and coding the storage of them efficiently; and finally, utilizing block-sorting algorithms. The actual compression algorithms implemented in TNG-MF1 are combinations of the above basic algorithms. The speeds of the algorithms are quite different, and therefore they are sorted into different compression levels, allowing the user to choose the appropriate trade-off between compression ratio and compression speed. The compression algorithms to use for a given set of coordinates and velocities are determined automatically, and the algorithm chosen is returned to the caller, to allow the use of the same predetermined algorithms on subsequent data blocks. To the best of our knowledge, TNG-MF1 is the most efficient molecular simulation data compression format available to date. It is also possible to compress any data block with other compression algorithms.
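The two ideas underlying the scheme, quantization to a user-chosen precision and storage of small differences instead of absolute values, can be sketched as follows. This is a simplified illustration and not the TNG-MF1 algorithm itself: it shows only temporal differences (spatial differences between consecutive atoms work analogously), the function names are hypothetical, and the coordinates are made-up values in nm.

```python
def quantize(frames, precision):
    """Map floating point coordinates to integers on a grid of the given precision."""
    return [[round(x / precision) for x in frame] for frame in frames]


def temporal_deltas(qframes):
    """Keep the first frame as is; store each later frame as the change from the previous one."""
    deltas = [list(qframes[0])]
    for previous, current in zip(qframes, qframes[1:]):
        deltas.append([c - p for p, c in zip(previous, current)])
    return deltas


def undo_temporal_deltas(deltas):
    """Rebuild the quantized frames by accumulating the stored differences."""
    frames = [list(deltas[0])]
    for delta in deltas[1:]:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames


def dequantize(qframes, precision):
    """Convert grid integers back to floating point coordinates."""
    return [[q * precision for q in frame] for frame in qframes]


precision = 0.001  # nm
# Three frames of one atom's x, y, z coordinates (illustrative values only).
frames = [
    [1.2343, 0.5002, 2.0009],
    [1.2351, 0.4998, 2.0013],
    [1.2360, 0.4991, 2.0020],
]

quantized = quantize(frames, precision)
deltas = temporal_deltas(quantized)
restored = dequantize(undo_temporal_deltas(deltas), precision)
max_error = max(
    abs(a - b) for fa, fb in zip(frames, restored) for a, b in zip(fa, fb)
)
print(deltas)
print(max_error)
```

The differences after the first frame are much smaller in magnitude than the absolute grid values, so they need fewer bits, while the reconstruction error stays within half the chosen precision.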
Currently, Zlib is the only additional algorithm supported by the API, but we expect this to be extended in the future, for instance by spectral recompression algorithms that can only be used after the simulation has been completed.[21]

API Implementation

The TNG format is released as a standalone library with the main API written in C. On top of this, there are thin C++ and FORTRAN 77 layers to allow access from a variety of programming languages. Note that the FORTRAN implementation requires CRAY pointers, which are technically not part of the FORTRAN 77 standard, but are available in virtually all compilers. The API is divided into a low-level and a high-level set of functions. These two sets of functions can be referred to as a low-level and a high-level API, but they can be used interchangeably. The low-level API provides functions granting fine-grained control of the data, whereas the functions of the high-level API are simplified and use sensible default values to make them easier to use. All functions of the API use a "tng_" prefix and in the high-level API the prefix is further extended to "tng_util_". The routines in the TNG-MF1 compression library, which is bundled with the TNG API, are prefixed "tng_compress_". They all perform memory-to-memory compression and decompression. When compressing, it is necessary to specify whether positions or velocities are compressed, as different algorithms are effective for positions and velocities. The type of compressed data is stored in the compressed data block, so for decompression only a single routine is necessary. The data type to compress or decompress can be double precision, single precision, or (prequantized) integer data.

Currently, the released GROMACS 4.6 uses the XTC and TRR formats for storing trajectories, but from version 5.0 the TNG format will be used instead. In addition to making the API and library available as open source, we have chosen the revised BSD license to make it possible to link any code with the library without any restrictions on redistribution, while still encouraging other teams to contribute new compression algorithms to the library itself. We also provide converters to and from commonly used file formats, such as AMBER NetCDF and PDB.
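The lossless, memory-to-memory character of the Zlib option can be illustrated with Python's standard zlib and struct modules. This sketch does not use the TNG API; the buffer contents and variable names are illustrative assumptions, and the compression level of 6 is simply zlib's usual default.

```python
import struct
import zlib

# An illustrative buffer of 3000 single-precision values with a repeating pattern.
values = [0.001 * (i % 50) for i in range(3000)]
raw = struct.pack(f"<{len(values)}f", *values)  # little-endian 32-bit floats

compressed = zlib.compress(raw, 6)       # buffer in, buffer out
restored = zlib.decompress(compressed)   # exact bytes come back

print(len(raw), len(compressed), restored == raw)
```

Unlike a reduced-precision scheme, the restored buffer is byte-for-byte identical to the input, which is why this kind of compression is applicable to any data block regardless of its content.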

Results

Output benchmarks

To test the performance of the file format and the API, GROMACS 4.6.3 was modified to enable writing TNG files, mainly using the high-level API. Four test systems of varying sizes were used, one of which was run for two different simulation lengths:

One molecule of ethanol was solvated by 640 water molecules. The simulation was run for 5 ns. The positions and the shape of the periodic box were saved every 10 ps.
The RNase ZF-1A (2026 atoms) was solvated by approximately 4900 water molecules. The system charge was neutralized by six Cl− ions. The simulation was only run for 400 ps. Trajectory snapshots were saved to file every 200 fs, which is considerably more frequent than in most normal simulations of a biomolecular system, although, for example, velocity autocorrelation analysis might require even denser frames.
N-acetylaladanamide (NAAA) in a POPC bilayer has previously been simulated by Murugan et al.[22] The system consisted of 17,417 atoms, including 3572 water molecules. In this study, the simulation time was 2 ns. Positions and the shape of the periodic box were saved every 4 ps.
The Kv1.2 ion channel[23] was simulated for both 100 ps and 500 ps with 5 fs time steps and trajectory snapshots were saved every 1 ps. There were approximately 121,000 particles (including 8920 virtual sites) in the system, two thirds of which were water.

For all simulations, the stride lengths of the data blocks were set to the writing frequency (in MD steps) of the simulation, that is, 5000, 100, 2000, and 200 frames, respectively. The number of frames per frame set was set to have 100 written frames per frame set, that is, 100 times the writing frequency of the simulation (500,000, 10,000, 200,000, and 20,000 frames per frame set). The precision of the compressed coordinates in the TNG file was set to 0.001 nm, which was also the precision of the XTC compression.
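The relation between writing interval, stride length, and frame set size used in these setups is plain integer arithmetic, sketched below. The function names are hypothetical and the code is not part of the TNG API; the only inputs taken from the text are the four writing intervals in MD steps, the choice of 100 written frames per frame set, and the Kv1.2 values of a 5 fs time step and 1 ps between snapshots.

```python
def steps_between_writes(write_interval_fs, time_step_fs):
    """Number of MD steps between two written snapshots."""
    if write_interval_fs % time_step_fs != 0:
        raise ValueError("writing interval must be a multiple of the time step")
    return write_interval_fs // time_step_fs


def frames_per_frame_set(stride, written_frames=100):
    """Frame set length in MD steps so that it holds the requested number of written frames."""
    return written_frames * stride


def written_steps(first_step, stride, frame_set_length):
    """MD steps inside one frame set that actually carry data."""
    return list(range(first_step, first_step + frame_set_length, stride))


# Kv1.2: snapshots every 1 ps (1000 fs) with a 5 fs time step.
kv_stride = steps_between_writes(1000, 5)

# Writing intervals in MD steps as listed in the text.
strides = [5000, 100, 2000, 200]
frame_set_lengths = [frames_per_frame_set(s) for s in strides]

first_set = written_steps(0, kv_stride, frames_per_frame_set(kv_stride))
print(kv_stride, frame_set_lengths, len(first_set), first_set[-1])
```

With these settings every frame set contains exactly 100 stored snapshots, independent of how often the simulation writes output.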
For comparison, a standard GROMACS 4.6.3 version was used with the same setups, but writing to a compressed XTC file instead of a TNG file. For each system, the simulation was performed three times and the average times are reported in Table 4. The RNase ZF-1A TRR file was also converted to DCD, NetCDF, and LAMMPS formats to compare the resulting file size (see Fig. 3). Apparently, mdconvert[24] does not compress the NetCDF data, but TNG clearly performs well. The benchmarks show that the molecular system has a large impact on the efficiency of the TNG-MF1 compression. Even denser frame storage or a smaller fraction of water could amplify the advantage, but possibly at the cost of time.

Table 4. Comparison of times and file sizes using a modified version of GROMACS writing TNG files compared to a standard version of GROMACS writing XTC files.

| System | Output type | Total CPU time (s) | Trajectory writing time (s)[a] | % of time spent writing trajectory | Output file size (MB)[b] |
| --- | --- | --- | --- | --- | --- |
| Ethanol | TNG | 1046.4 | 0.8 | 0.07 | 3.3 |
| Ethanol | XTC | 1073.5 | 0.6 | 0.06 | 3.3 |
| RNase ZF-1A | TNG | 360.1 | 8.7 | 2.42 | 99.8 |
| RNase ZF-1A | XTC | 360.0 | 7.5 | 2.10 | 116.8 |
| NAAA in POPC membrane | TNG | 3046.0 | 3.1 | 0.10 | 29.2 |
| NAAA in POPC membrane | XTC | 3020.7 | 2.1 | 0.07 | 30.8 |
| Kv1.2 ion channel 100 ps | TNG | 530.4 | 6.4 | 1.21 | 74.8 |
| Kv1.2 ion channel 100 ps | XTC | 513.8 | 3.7 | 0.72 | 88.7 |
| Kv1.2 ion channel 500 ps | TNG | 2450.7 | 21.6 | 0.88 | 370.5 |
| Kv1.2 ion channel 500 ps | XTC | 2481.1 | 16.0 | 0.64 | 441.7 |

[a] Includes writing the TRR file. [b] The TRR file sizes were 11, 385, 100, 279, and 1391 MB, respectively. The reported times are averaged from three simulation runs. Because the simulation time varies from one simulation to another (load balancing), the percentage of time spent writing is the most relevant column for comparing the time differences.

Figure 3. Comparison of file sizes of different file formats. mdconvert[24] (from the MDTraj library) was used for converting the TRR file from the RNase simulation to DCD and NetCDF, whereas VMD was used for converting to LAMMPS. The LAMMPS file in this case was 10.63 times larger than the TNG file, whereas for TRR, DCD, and NetCDF the size factor to TNG was 3.86 and for XTC it was 1.17.

Input benchmarks

There is not yet any tool that reads both XTC and TNG files. This will be amended in the near future. To compare the reading speeds of the trajectory files from the simulations, tng_io_read_pos_util (from the example files in the TNG library) was used to read TNG files and gmxdump, from GROMACS 4.6.3, was used to read XTC and TRR files. Both these tools read coordinates from the files as well as simulation box shapes. The TNG reading tool does not output the simulation box shape every frame, whereas gmxdump does. The output from both programs was directed to the null device to reduce the influence of terminal output on the results. Each file was read three times and the execution times were measured using the Debian GNU/Linux tool "time". The best "real" time output is reported in Table 5. In four of the five cases, the TNG files were quicker to read and in one case slower. There are currently no benchmarks comparing TNG reading speeds to formats other than XTC and TRR, but it is expected that reading other uncompressed formats (e.g., DCD) would be comparable to reading TRR.

Table 5. Time to read TNG, XTC, and TRR (uncompressed single precision) files from the five benchmark simulations.

| System | N particles | N frames | TNG (s) | XTC (s) | TRR (s) |
| --- | --- | --- | --- | --- | --- |
| Ethanol | 1929 | 501 | 1.4 | 1.0 | 1.0 |
| RNase ZF-1A | 16,816 | 2001 | 24.6 | 33.6 | 33.7 |
| NAAA in POPC | 17,417 | 501 | 6.6 | 8.7 | 8.8 |
| Kv1.2 100 ps | 121,449 | 201 | 21.2 | 26.0 | 26.4 |
| Kv1.2 500 ps | 121,449 | 1001 | 104.8 | 129.3 | 131.1 |
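The size factors quoted in the Figure 3 caption and the reading-speed comparison can be checked directly from the numbers in Tables 4 and 5. The short script below only restates those published values and divides them; the dictionary layout and the helper name ratio are illustrative choices.

```python
# Output file sizes in MB (Table 4).
file_size_mb = {
    "Ethanol": {"TNG": 3.3, "XTC": 3.3},
    "RNase ZF-1A": {"TNG": 99.8, "XTC": 116.8},
    "NAAA in POPC": {"TNG": 29.2, "XTC": 30.8},
    "Kv1.2 100 ps": {"TNG": 74.8, "XTC": 88.7},
    "Kv1.2 500 ps": {"TNG": 370.5, "XTC": 441.7},
}

# Best "real" read times in seconds (Table 5).
read_time_s = {
    "Ethanol": {"TNG": 1.4, "XTC": 1.0},
    "RNase ZF-1A": {"TNG": 24.6, "XTC": 33.6},
    "NAAA in POPC": {"TNG": 6.6, "XTC": 8.7},
    "Kv1.2 100 ps": {"TNG": 21.2, "XTC": 26.0},
    "Kv1.2 500 ps": {"TNG": 104.8, "XTC": 129.3},
}

rnase_trr_size_mb = 385  # Table 4, footnote [b]


def ratio(table, system, numerator, denominator="TNG"):
    """Value for one format divided by the TNG value for the same system."""
    return table[system][numerator] / table[system][denominator]


for system in file_size_mb:
    size_factor = ratio(file_size_mb, system, "XTC")
    time_factor = ratio(read_time_s, system, "XTC")
    print(f"{system:14s} XTC/TNG size {size_factor:.2f}  XTC/TNG read time {time_factor:.2f}")

rnase_trr_factor = rnase_trr_size_mb / file_size_mb["RNase ZF-1A"]["TNG"]
faster_with_tng = [
    s for s in read_time_s if read_time_s[s]["TNG"] < read_time_s[s]["XTC"]
]
print(f"RNase TRR/TNG size {rnase_trr_factor:.2f}; TNG faster to read in {len(faster_with_tng)} of 5 cases")
```

A factor above 1 means the XTC file is larger or slower to read than the corresponding TNG file; only the small ethanol system falls below 1 for reading time.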

API Examples

The Supporting Information contains a number of brief examples of writing and reading TNG files using the high- and low-level APIs (Listings S1–S4).

Conclusions

This work presents both the development and full specification of the first public release of the portable binary trajectory file format TNG, along with an API to facilitate easy access to data. The file format supports state-of-the-art compression, by using the previously described Trajectory NG compression algorithm,[20] referred to as TNG-MF1. The file format is flexible enough to ensure that custom data can easily be stored. The small benchmarks performed here show that writing a TNG file with compressed data blocks is only slightly slower than writing an XTC file, but the resulting file is clearly smaller for some of the systems tested, more noticeably when saving frames often. Systems consisting mainly of water are almost as well compressed by XTC, whereas systems with a smaller fraction of water are more efficiently compressed by TNG-MF1. Picking which TNG-MF1 compression algorithm to use, which is only done once, takes longer than compressing the data. Writing very short TNG trajectories is therefore not as efficient, as can be seen by comparing the two simulation lengths of the Kv1.2 system in Table 4. The data storage saved by switching to this new file format can be considerable in most MD groups. Using a system with more atoms and writing output less often, the time required for writing trajectories (regardless of format) will be overshadowed by the calculation times. The TNG file includes a description of the molecular system, but that accounts for a negligible portion of the file size.

Because the Trajectory NG compression algorithm[20] is dependent on the number of frames in each compressed block, it is possible that it could be made more efficient by tuning the number of frames per frame set in the TNG file or the selection of compression algorithms used. Such optimization was beyond the scope of this work, but an advantage of the format is that this type of modification can easily be implemented in future versions without requiring developers or users to alter their API calls. No significant effort has been put into optimizing the TNG API or the compression algorithm. It is possible that future versions can be made faster than the current version, but currently this is not a bottleneck.

Version 1.4 of the API has been released and is available through the git repository of the project, git://git.gromacs.org/tng.git. Future versions of the API will be fully backwards compatible and the version numbering of the blocks in the TNG file ensures that future modifications of the file format can be handled gracefully.

Acknowledgment

Computational resources were provided by the Swedish National Infrastructure for Computing (025/12-32).


Keywords: molecular dynamics, simulation, file format, API, compression

How to cite this article: M. Lundborg, R. Apostolov, D. Spångberg, A. Gärdenäs, D. van der Spoel, E. Lindahl. J. Comput. Chem. 2014, 35, 260–269. DOI: 10.1002/jcc.23495


Additional Supporting Information may be found in the online version of this article.

[1] List of Software for Molecular Mechanics Modeling. Available at: http://en.wikipedia.org/w/index.php?title=List_of_software_for_molecular_mechanics_modeling&oldid=564406774. Accessed on November 17, 2013.
[2] R. Salomon-Ferrer, D. Case, R. Walker, WIREs Comput. Mol. Sci. 2013, 3, 198.
[3] The Amber Molecular Dynamics Package. Available at: http://ambermd.org/. Accessed on November 17, 2013.
[4] R. Rew, G. Davis, IEEE Comput. Graph. Appl. 1990, 10, 76.
[5] S. A. Brown, M. Folk, G. Goucher, R. Rew, Comput. Phys. 1993, 7, 304.
[6] Unidata | NetCDF. Available at: http://www.unidata.ucar.edu/software/netcdf/. Accessed on November 17, 2013.
[7] B. R. Brooks, C. L. Brooks, III, A. D. MacKerell Jr., L. Nilsson, R. J. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. R. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, R. W. Pastor, C. B. Post, J. Z. Pu, M. Schaefer, B. Tidor, R. M. Venable, H. L. Woodcock, X. Wu, W. Yang, D. M. York, M. Karplus, J. Comput. Chem. 2009, 30, 1545.
[8] File formats. Available at: http://www.ks.uiuc.edu/Research/namd/2.6/ug/node13.html. Accessed on November 17, 2013.
[9] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, K. Schulten, J. Comput. Chem. 2005, 26, 1781.
[10] NAMD - Scalable Molecular Dynamics. Available at: http://www.ks.uiuc.edu/Research/namd/. Accessed on November 17, 2013.
[11] A. T. Brünger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 1998, 54, 905.
[12] K. J. Bowers, E. Chow, H. Xu, R. O. Dror, M. P. Eastwood, B. A. Gregersen, J. L. Klepeis, I. Kolossvary, M. A. Moraes, F. D. Sacerdoti, J. K. Salmon, Y. Shan, D. E. Shaw, In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM: New York, 2006.
[13] D. E. Shaw Research. Available at: http://www.deshawresearch.com/resources_desmond.html. Accessed on November 17, 2013.
[14] Desmond Users Guide, Section 13.1, Desmond Molecular Dynamics System, Version 3.4.0/0.7.2, D. E. Shaw Research: New York, 2013.
[15] B. Hess, C. Kutzner, D. van der Spoel, E. Lindahl, J. Chem. Theory Comput. 2008, 4, 435.
[16] S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, E. Lindahl, Bioinformatics 2013, 29, 845.
[17] D. G. Green, K. E. Meacham, M. Surridge, F. van Hoesel, H. J. C. Berendsen, In Methods and Techniques in Computational Chemistry: METECC-95; E. Clementi, G. Corongiu, Eds.; STEF: Cagliari, 1995; pp. 435–463.
[18] S. Plimpton, J. Comput. Phys. 1995, 117, 1.
[19] LAMMPS Molecular Dynamics Simulator. Available at: http://lammps.sandia.gov/. Accessed on November 17, 2013.
[20] D. Spångberg, D. S. D. Larsson, D. van der Spoel, J. Mol. Model. 2011, 17, 2669.
[21] T. Meyer, C. Ferrer-Costa, A. Pérez, M. Rueda, A. Bidon-Chanal, F. J. Luque, C. A. Laughton, M. Orozco, J. Chem. Theory Comput. 2006, 2, 251.
[22] N. A. Murugan, R. Apostolov, Z. Rinkevicius, J. Kongsted, E. Lindahl, H. Ågren, J. Am. Chem. Soc. 2013, 135, 13590.
[23] P. Bjelkmar, P. S. Niemelä, I. Vattulainen, E. Lindahl, PLoS Comput. Biol. 2009, 5, e1000289.
[24] Command-Line Trajectory Conversion: mdconvert. Available at: http://mdtraj.s3.amazonaws.com/mdconvert.html. Accessed on November 17, 2013.

Received: 31 August 2013 Revised: 24 October 2013 Accepted: 1 November 2013 Published online on 20 November 2013

