
Storing, Reading, and Writing Binary Files in Python

2024-09-09 17:51:55

Technical background

Data is usually stored in plain-text formats such as json, txt, or csv. When a higher compression ratio is needed, formats such as hdf5 or npz are also an option. An even more compact approach is to store the data directly in binary form. In this format there are no delimiters between the stored values, so it should be the smallest storage format short of compression.
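
As a rough illustration of the size difference, the sketch below serializes the same float32 array once as plain text and once as raw bytes, entirely in memory. The array contents and variable names are assumptions made for this example and do not come from the article.

```python
import io
import numpy as np

# Illustrative data: 1000 single-precision values (an assumption for this sketch)
arr = np.linspace(0.0, 1.0, 1000, dtype=np.float32)

# Plain-text form: one value per line, written to an in-memory text buffer
text_buf = io.StringIO()
np.savetxt(text_buf, arr)
text_size = len(text_buf.getvalue().encode("utf-8"))

# Raw binary form: values packed back to back, with no delimiters or header
raw_size = len(arr.tobytes())

print("text bytes:", text_size)
print("raw bytes: ", raw_size)  # arr.size * 4 for float32
```

The raw size is always the element count times the element width, whereas the text size depends on the chosen number format.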

Usage

In Python, the tofile() method stores a numpy array directly in a binary file. The file can be read back with a plain open(file_name, 'rb'), but to cope with IO-heavy scenarios, here we read the data through a memory map (mmap) instead.
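
For comparison, here is a minimal sketch of the plain open(file_name, 'rb') route. It writes a small array with tofile() and reads it back in one go with np.frombuffer. The scratch file comes from tempfile and is a hypothetical path used only for this example.

```python
import os
import tempfile
import numpy as np

# Little-endian float32, so the bytes on disk match the '<f4' dtype used when reading
arr = np.arange(8, dtype="<f4")

# Hypothetical scratch file created just for this sketch
fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
try:
    # tofile() writes only the raw element bytes: no shape, dtype, or byte-order header
    arr.tofile(path)

    # Plain read: load all bytes, then reinterpret them as little-endian float32
    with open(path, "rb") as f:
        restored = np.frombuffer(f.read(), dtype="<f4")
finally:
    os.remove(path)

print(restored)
```

Because the file carries no metadata, the reader has to know the dtype (and the shape, for multi-dimensional data) in advance.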

Full Example

The following is a complete example; the relevant steps are explained in the code comments:

import numpy as np
import mmap
import resource
# Get the system memory page size in bytes
PAGE_SIZE = resource.getpagesize()
# Size of one single-precision floating-point number in bytes
DATA_SIZE = 4
# Number of float32 values that fit in one page
PAGE_FNUM = PAGE_SIZE // DATA_SIZE
print("The PAGE_SIZE is: {}".format(PAGE_SIZE))
print("Corresponding float32 numbers should be: {}".format(PAGE_FNUM))
# Generate example data: PAGE_FNUM+4 values, which span two pages
tmp_arr = np.arange(PAGE_FNUM+4).astype(np.float32)
# Data storage path
tmp_file = '/tmp/'
# Store the array to a binary file
tmp_arr.tofile(tmp_file)
# Number of values to read at a time from the binary file
READ_NUM = 4
with open(tmp_file, 'rb') as file:
    # Memory map starting at the first page of data
    mm = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ, offset=0)
    # Values 1 to 4 of the first page
    print(np.frombuffer(mm.read(DATA_SIZE*READ_NUM), dtype='<f4'))
    # Values 5 to 8 of the first page
    print(np.frombuffer(mm.read(DATA_SIZE*READ_NUM), dtype='<f4'))
    # Close the first mapping before creating the next one
    mm.close()
    # Memory map starting at the second page of data
    mm = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ, offset=PAGE_SIZE)
    # Values 1 to 4 of the second page
    print(np.frombuffer(mm.read(DATA_SIZE*READ_NUM), dtype='<f4'))
    # Values 5 to 8 of the second page (none are left)
    print(np.frombuffer(mm.read(DATA_SIZE*READ_NUM), dtype='<f4'))
    # Close the memory map
    mm.close()
# Leaving the with block closes the file

The output of the script is:

The PAGE_SIZE is: 4096
Corresponding float32 numbers should be: 1024
[0. 1. 2. 3.]
[4. 5. 6. 7.]
[1024. 1025. 1026. 1027.]
[]
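
The page-wise reads in the script can also be wrapped in a small helper. The sketch below is one possible way to do it, not the article's own code: read_page is a hypothetical function, the scratch file is created by tempfile, and the page size is taken from mmap.ALLOCATIONGRANULARITY because mmap requires the offset to be a multiple of that value.

```python
import mmap
import os
import tempfile
import numpy as np

PAGE = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be a multiple of this value
ITEM = 4                           # bytes per float32

def read_page(path, page_index):
    """Hypothetical helper: return the float32 values stored in one page of a raw file."""
    offset = page_index * PAGE
    remaining = os.path.getsize(path) - offset
    if remaining <= 0:
        # Nothing stored at or beyond this page
        return np.empty(0, dtype="<f4")
    with open(path, "rb") as f:
        # Map at most one page, starting at the page boundary
        with mmap.mmap(f.fileno(), min(PAGE, remaining),
                       access=mmap.ACCESS_READ, offset=offset) as mm:
            # mm[:] copies the bytes, so the array stays valid after the map is closed
            return np.frombuffer(mm[:], dtype="<f4")

# Scratch file for the demonstration (hypothetical path chosen by tempfile)
fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
try:
    # One full page of values plus 4 extra, mirroring the layout of the script
    np.arange(PAGE // ITEM + 4, dtype="<f4").tofile(path)
    first = read_page(path, 0)
    second = read_page(path, 1)
    third = read_page(path, 2)
finally:
    os.remove(path)

print(first[:4], second, third)
```

Mapping only one page at a time keeps the mapped region small regardless of the total file size.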

Analysis of results

The first value printed is the page size, here 4096 bytes. A single-precision floating-point number takes up 4 bytes, so one page holds 1024 of them, which is the second value printed. Because the numpy array is defined as an increasing sequence starting from 0, the first 8 values of the first page are 0 to 7. The second page holds only 1024 to 1027, a total of 4 floating-point numbers occupying 16 bytes, so the second read from the second page returns an empty array. In addition, we can check the size of the binary file:

In [1]: import os

In [2]: os.path.getsize('/tmp/')
Out[2]: 4112

That is 4112 bytes in total, exactly 4096 + 16 bytes: one full page plus four float32 values.
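
The same arithmetic can be checked without touching the disk. The sketch below assumes a 4096-byte page, the value reported in the output above; other systems may use a different page size.

```python
PAGE_SIZE = 4096   # assumed page size in bytes, as reported in the output above
DATA_SIZE = 4      # bytes per float32
n_values = PAGE_SIZE // DATA_SIZE + 4   # 1028 values, as in the example

# Total file size, then how it splits into full pages and a remainder
total_bytes = n_values * DATA_SIZE
full_pages, tail_bytes = divmod(total_bytes, PAGE_SIZE)

print(total_bytes)              # 4112
print(full_pages)               # 1
print(tail_bytes)               # 16
print(tail_bytes // DATA_SIZE)  # 4
```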

Summary

This article describes a scheme for dumping a Numpy array into a compact binary format in Python and reading it back through a memory map. A binary data stream not only lends itself to page-wise memory mapping, but is also hashable, unlike an ordinary Numpy single-precision floating-point array. Overall it is a storage format well suited to high-performance computing, and it is used in cudaSPONGE as an output format for molecular dynamics simulation trajectories.
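
To make the hashability point concrete, here is a small sketch; the variable names and the use of SHA-256 are illustrative choices, not part of the article. A Numpy array cannot be hashed, while its raw bytes can be used as a dictionary key or fed to a checksum.

```python
import hashlib
import numpy as np

arr = np.arange(4, dtype="<f4")

# A Numpy array is mutable and therefore not hashable
try:
    hash(arr)
    array_hashable = True
except TypeError:
    array_hashable = False

# The raw byte stream is an immutable bytes object, so it can be hashed,
# used as a dict key, or fed to a checksum
raw = arr.tobytes()
cache = {raw: "seen"}
digest = hashlib.sha256(raw).hexdigest()

print(array_hashable)  # False
print(raw in cache)    # True
print(len(digest))     # 64
```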

Copyright statement

This article was first published at: /dechinphy/p/

Author ID: DechinPhy
