This article focuses only on how to use zstd; it does not cover the implementation of the underlying algorithm.
I have not tried every functional mode zstd offers, but I will briefly introduce the application scenario for each; consult the API documentation when you actually use one.
step 0: Why zstd?
zstd is a compression algorithm open-sourced by Facebook; the source code lives at https://github.com/facebook/zstd
Official usage documentation: https://facebook.github.io/zstd/zstd_manual.html
Reportedly (do your own evaluation), zstd offers an excellent balance of compression speed and compression ratio in general task scenarios, and is worth a try compared with peers such as gzip and zlib.
That said, the right compression algorithm depends on your own workload; ideally, benchmark with some representative sample files before committing.
The code here is C/C++; zstd also has versions for other languages, which you can find on the official website.
step 1: Download and install zstd
git clone https://github.com/facebook/zstd.git
cd zstd/
make
sudo make install
make check
At this point, the build, installation, and testing of zstd should complete normally. Then cd examples/
to try compiling and running some of the demo code.
step 2: Use of zstd
step 2.1: Basic Function API
step 2.1.1: General Compression
Single-file compression: ZSTD_compress and ZSTD_decompress are used as a pair; the input and output buffers must be allocated in advance. See the demo code.
Batch compression of multiple files: the most common misunderstanding here concerns ZSTD_compressCCtx. This function does not pack multiple files into a single .zst archive; compressing multiple files still yields multiple .zst files. Its real purpose is to reuse one ZSTD_CCtx context across calls, avoiding repeated allocation and freeing of memory when batch-compressing individual files.
step 2.1.2: Streaming compression
Streaming compression essentially compresses a file in chunks, and is usually suited to large files rather than small ones: logs, video streams, databases, and so on.
Reading data in chunks reduces memory usage compared with reading everything at once, and chunking also enables multi-threaded concurrency.
Decompression works the same way, reading and decompressing chunk by chunk. ZSTD_compressStream and ZSTD_decompressStream are used as a pair.
Note that both ZSTD_decompress and ZSTD_decompressStream can decompress a file correctly, but their performance differs.
ZSTD_c_compressionLevel: a parameter that trades compression speed against compression ratio. The default level is 3; the current range is 1 ~ ZSTD_maxCLevel().
step 2.1.3: Dictionary compression
The official documentation makes the following point:
The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms, and the reason is that compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no "past" to build upon.
In essence, a compression algorithm learns patterns from the content it has already seen, which is intuitively similar to pattern recognition in AI.
The proposed remedy: compress the files using a pre-trained dictionary, which improves both compression ratio and speed.
This applies to sets of many small files, where it is noticeably more efficient than standard compression.
The required dictionary file is trained ahead of time with:
zstd --train FullPathToTrainingSet/* -o dictionaryName
The API used in the code is:
- ZSTD_createCDict: create a dictionary structure
- ZSTD_compress_usingCDict: the actual compression
- ZSTD_decompress_usingDDict: the actual decompression
step 2.2: Folder compression that preserves the original file structure
The compression results above are all one-to-one. The existing zstd API does not support packing multiple files into a single file while keeping their original tree structure.
Q: So if you want to compress multiple files into a single archive, surely there is a way?
A: Of course!
The common approach is to combine tar and zstd: the tar tool preserves the original file structure and packs everything into a single file, and zstd then compresses that single file (e.g. tar -cf archive.tar mydir/ followed by zstd archive.tar, which yields archive.tar.zst).
Another way is to use streaming compression: recursively traverse the directory and write the path structure into the compressed content, which requires adding delimiters or metadata between files. This approach is more complicated, and when compression is not a serious performance bottleneck I personally avoid it.
step 3: A small case
Note that the suffix of a zstd-compressed file is .zst.
One detail worth noting: the example code wraps its utility functions with check macros that centralize error judgment and return handling for the corresponding environment. This saves you from writing the same error handling at every call site; in real programming practice it significantly reduces line count and improves readability, and is well worth borrowing.
In zstd it looks like this:
/*! CHECK
 * Check that the condition holds. If it doesn't print a message and die.
 */
#define CHECK(cond, ...)                        \
    do {                                        \
        if (!(cond)) {                          \
            fprintf(stderr,                     \
                    "%s:%d CHECK(%s) failed: ", \
                    __FILE__,                   \
                    __LINE__,                   \
                    #cond);                     \
            fprintf(stderr, "" __VA_ARGS__);    \
            fprintf(stderr, "\n");              \
            exit(1);                            \
        }                                       \
    } while (0)
/*! CHECK_ZSTD
* Check the zstd error code and die if an error occurred after printing a
* message.
*/
#define CHECK_ZSTD(fn)                                           \
    do {                                                         \
        size_t const err = (fn);                                 \
        CHECK(!ZSTD_isError(err), "%s", ZSTD_getErrorName(err)); \
    } while (0)
and in CUDA:
#define CHECK(call)                                                      \
{                                                                        \
    const cudaError_t error = call;                                      \
    if (error != cudaSuccess)                                            \
    {                                                                    \
        printf("ERROR: %s:%d,", __FILE__, __LINE__);                     \
        printf("code:%d,reason:%s\n", error, cudaGetErrorString(error)); \
        exit(1);                                                         \
    }                                                                    \
}