h5py file writing - flush and update

Technical background

In a previous blog post, we described how to use VMD to visualize trajectory files in the standardized H5MD format. An H5MD file is essentially a trajectory file stored as an HDF5 binary file with a standardized layout. This article introduces two content-update operations on HDF5 files.

Write and update data

The usual workflow is to open or create an HDF5 file with h5py.File, create datasets in it with create_dataset, and then fill those datasets with data. So what if you want to update data that is already in the file? The logic is simple: load the corresponding dataset from the file, and then assign the new data directly to the returned object. The following is a code example:

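import h5py
import numpy as np
import os

h5_name = 'example.h5'
if os.path.exists(h5_name):
    # The file already exists: open it in read/write mode and update it
    with h5py.File(h5_name, 'r+') as file:
        dataset = file['my_dataset']
        new_data = np.array([0])
        # The one-element array is broadcast over the whole dataset
        dataset[...] = new_data

else:
    # Create a new HDF5 file
    with h5py.File(h5_name, 'w') as f:
        dset = f.create_dataset("my_dataset", (10,), dtype='f')
        data = np.arange(10)
        dset[...] = data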

This code has two branches. If the HDF5 file does not yet exist in the specified directory, the else branch creates it first and fills the dataset with ten consecutive numbers. The file can be inspected with a VSCode plug-in together with H5Web.

If the HDF5 file already exists at that path, the code modifies the dataset in it instead. For example, after running the test code above twice, every element of the dataset has been overwritten with 0, because the one-element array is broadcast across the whole dataset.
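To confirm what each run did, the dataset can be read back after the write. The following is a minimal sketch of the same create-or-update pattern, not the original test code: the helper name update_and_read, the temporary directory, and the choice to overwrite only the first three elements are assumptions made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def update_and_read(path):
    """Create the file on the first call, update it on later calls.

    Hypothetical helper for illustration; returns the dataset contents.
    """
    if os.path.exists(path):
        # 'r+' opens an existing file for reading and writing.
        with h5py.File(path, "r+") as f:
            dset = f["my_dataset"]
            dset[:3] = -1.0  # update only the first three elements
    else:
        # 'w' creates a new file, replacing any existing one.
        with h5py.File(path, "w") as f:
            dset = f.create_dataset("my_dataset", (10,), dtype="f")
            dset[...] = np.arange(10)
    # Reopen read-only and copy the data into a NumPy array.
    with h5py.File(path, "r") as f:
        return f["my_dataset"][...]


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.h5")
first = update_and_read(path)   # first call: the file is created
second = update_and_read(path)  # second call: the file is updated in place
print(first)
print(second)
```

Assigning to a slice such as dset[:3] touches only the selected elements, so a partial update does not require rewriting the whole dataset.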

Flushing the file

As a binary file with a standardized format, an HDF5 file is subject to strict integrity checks. This raises a problem: if the process is interrupted in the middle of a write, the HDF5 file is left corrupted and reports an error when it is opened.
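One simple way to test whether a file survived is to try opening it read-only and catch the resulting error. The sketch below is only an illustration: the helper name can_open_hdf5 and the file names are hypothetical, and the damaged file is simulated with arbitrary bytes rather than with a real interrupted write.

```python
import os
import tempfile

import h5py


def can_open_hdf5(path):
    """Return True if h5py can open the file read-only, else False."""
    try:
        with h5py.File(path, "r"):
            return True
    except OSError:
        # h5py raises OSError when the file cannot be opened as HDF5.
        return False


tmp_dir = tempfile.mkdtemp()

# A valid file, written and closed normally.
good = os.path.join(tmp_dir, "good.h5")
with h5py.File(good, "w") as f:
    f.create_dataset("my_dataset", data=[1.0, 2.0, 3.0])

# Stand-in for a damaged file: arbitrary bytes that are not valid HDF5.
bad = os.path.join(tmp_dir, "bad.h5")
with open(bad, "wb") as fh:
    fh.write(b"not an hdf5 file")

print(can_open_hdf5(good))
print(can_open_hdf5(bad))
```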

Of course, if the process is stopped manually with Ctrl+C, we can listen for and handle the termination signal, as described in an earlier blog post. The problem is that a forced termination with the system's kill -9 gives the process no chance to catch the signal at all. One option, therefore, is to save the intermediate state by flushing. For example:

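import h5py
import numpy as np
import time

h5_name = 'example.h5'
# Create a new HDF5 file
with h5py.File(h5_name, 'w') as f:
    dset = f.create_dataset("my_dataset", (10,), dtype='f')
    data = np.arange(10)
    dset[...] = data
    new_data = np.array([0])
    # Flush the HDF5 buffers to disk before the long pause
    f.flush()
    # Window in which the process can be killed with kill -9
    time.sleep(30)
    dset[...] = new_data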

In this example the process sleeps for 30 seconds, and during that time we kill the Python process with kill -9. Without the f.flush() line, the file shows the corruption error described above. With that line, the file remains readable and contains the data written before the flush.

Note that, in this test, corruption only occurred while the HDF5 file was being written for the first time. A secondary write that reopens an existing file in r+ mode, as in the first section, did not corrupt the file.
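In a long-running job the same idea is usually applied periodically rather than once. The following is a sketch under stated assumptions, not a definitive recipe: the function name write_frames, the dataset name frames, and the parameters n_frames, n_values and flush_every are all hypothetical. It writes one row per step and calls flush every few steps, so that HDF5 is asked to push its buffers to disk at regular intervals.

```python
import os
import tempfile

import h5py
import numpy as np


def write_frames(path, n_frames=20, n_values=3, flush_every=5):
    """Write one row per step, flushing every `flush_every` steps.

    All names and sizes here are illustrative assumptions.
    """
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("frames", (n_frames, n_values), dtype="f")
        for step in range(n_frames):
            # Fill row `step` with the step index.
            dset[step, :] = np.full(n_values, step)
            if (step + 1) % flush_every == 0:
                # Ask HDF5 to flush its buffers to disk.
                f.flush()


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "frames.h5")
write_frames(path)

# Read the data back to check what was written.
with h5py.File(path, "r") as f:
    frames = f["frames"][...]
print(frames.shape)
```

The flush interval is a trade-off: flushing more often narrows the window of unsaved work, at the cost of more frequent disk writes.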

Summary

HDF5 is a data storage format that is often used in quantum chemistry and molecular dynamics simulations. Its good compression ratio and integrity checks help guarantee, to a certain extent, the reliability of data and trajectory storage. This article described two operations on HDF5 files: updating the data in an existing HDF5 file, and using flush to synchronize updates to disk.

This article was first published at: /dechinphy/p/

Author ID: DechinPhy