ggml
ggml is a machine learning library written in C and C++ that focuses on inference for the Transformer architecture. The project is fully open source, under active development, and its community of contributors is growing. ggml is similar to machine learning libraries such as PyTorch and TensorFlow, but since it is still at an early stage of development, some of the underlying design is still being improved.
Alongside projects such as llama.cpp and whisper.cpp, ggml itself has been steadily gaining popularity. To bring large language model inference to end devices, many projects, including ollama, jan, and LM Studio, use ggml under the hood.
Compared to other libraries, ggml has the following advantages:
- Minimal implementation: the core library is self-contained in only 5 files. If you want GPU support, you can add your own backend implementation; it is not mandatory.
- Simple compilation: you don't need fancy build tools; if you don't need a GPU, plain GCC or Clang will do the job.
- Lightweight: the compiled binary is less than 1 MB, which is tiny compared to PyTorch (which takes hundreds of MB).
- Good compatibility: supports all kinds of hardware, including x86_64, ARM, Apple Silicon, CUDA, and more.
- Support for tensor quantization: tensors can be quantized to save memory, and in some cases this even improves performance.
- Highly efficient memory use: the overhead of storing tensors and performing computations is kept to a minimum.
Of course, ggml also has some drawbacks at the moment. If you choose to develop with ggml, these are things you should be aware of (and that may improve over time):
- Not every tensor operation can be executed on the backend you expect. For example, some operations that run on the CPU may not yet be supported on CUDA.
- Developing with ggml may not be simple and straightforward, as it requires some deeper knowledge of low-level programming.
- The project is still under active development, so significant changes are possible.
This article will get you started with ggml development. It does not cover advanced topics such as LLM inference; instead, we focus on ggml's core concepts and basic usage, so that developers who want to use ggml have a foundation for more advanced development.
Start Learning
Let's start with compilation. For simplicity, we will use Ubuntu as the example platform; of course, ggml can be compiled on all kinds of platforms (including Windows, macOS, BSD, etc.). The commands are as follows:
# Start by installing build dependencies
# "gdb" is optional, but is recommended
sudo apt install build-essential cmake git gdb
# Then, clone the repository
git clone https://github.com/ggerganov/ggml.git
cd ggml
# Try compiling one of the examples
cmake -B build
cmake --build build --config Release --target simple-ctx
# Run the example
./build/bin/simple-ctx
Expected output:
mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
90.00 54.00 54.00 126.00
42.00 29.00 28.00 64.00 ]
If you see the expected output, everything works and we can continue.
Terminology and concepts
First, let's learn some core concepts of ggml. If you're familiar with PyTorch or TensorFlow, this may feel like a bit of a jump. But since ggml is a low-level library, understanding these concepts gives you a much greater degree of control over performance:
- ggml_context: a "container" that holds objects such as tensors, computation graphs, and other data.
- ggml_cgraph: the representation of a computation graph; think of it as the "order of computation" that will be handed over to the backend.
- ggml_backend: the interface that executes computation graphs. There are many kinds of backend: CPU (default), CUDA, Metal (Apple Silicon), Vulkan, RPC, and so on.
- ggml_backend_buffer_type: represents a buffer type; you can think of it as the "memory allocator" tied to each ggml_backend. For example, to perform calculations on a GPU you need to allocate the memory on the GPU through a buffer_type (often abbreviated as buft).
- ggml_backend_buffer: represents a buffer allocated through a buffer_type. Note that a single buffer can hold the data of multiple tensors.
- ggml_gallocr: a graph memory allocator, used to allocate memory efficiently for the tensors of a computation graph.
- ggml_backend_sched: a scheduler that lets multiple backends be used concurrently, distributing computation across hardware (for example, mixing CPU and GPU computation) when working with large models or multiple GPUs. The scheduler also automatically moves operators that a GPU backend does not support back to the CPU, ensuring optimal resource utilization and compatibility.
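To make the scheduler a bit more concrete, here is a minimal sketch of how a CUDA backend and the CPU backend can be combined so that operators the GPU does not support fall back to the CPU. This is only an illustration: it assumes ggml.h, ggml-backend.h and ggml-cuda.h are included, gf is an already-built ggml_cgraph, and the exact signature of ggml_backend_sched_new may vary between ggml versions.
// Sketch: run an existing graph on CUDA where possible, falling back to the CPU.
static void compute_with_sched(struct ggml_cgraph * gf) {
    ggml_backend_t backends[2] = {
        ggml_backend_cuda_init(0),   // GPU backend (device 0)
        ggml_backend_cpu_init(),     // CPU backend, listed last so it acts as the fallback
    };
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        backends, NULL /* use each backend's default buffer type */,
        2, GGML_DEFAULT_GRAPH_SIZE, false /* no parallel compute */);
    ggml_backend_sched_graph_compute(sched, gf); // allocates memory and runs the graph across both backends
    ggml_backend_sched_free(sched);
    ggml_backend_free(backends[0]);
    ggml_backend_free(backends[1]);
}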
Simple example
This simple example replicates the sample program we compiled and ran at the end of the first section. We first create two matrices and then multiply them to get the result. With PyTorch, the code might look like this:
import torch

# Create two matrices
matrix1 = torch.tensor([
    [2, 8],
    [5, 1],
    [4, 2],
    [8, 6],
])
matrix2 = torch.tensor([
    [10, 5],
    [9, 9],
    [5, 4],
])

# Perform matrix multiplication
result = torch.matmul(matrix1, matrix2.T)
print(result.T)
To do the same with ggml, follow these steps:
- Allocate a ggml_context to store the tensor data
- Create the tensors and set their data
- Create a ggml_cgraph for the matrix multiplication operation
- Run the computation
- Retrieve the results (the output tensors)
- Free memory and exit
Please note: in this example, we allocate the tensor data directly inside the ggml_context for simplicity. In practice, the memory should be allocated as a device-side buffer, which we will cover in the next section.
We'll start by executing the following commands to create a new folder, examples/demo, along with the C source file and the CMake file inside it:
cd ggml # make sure you're in the project root
# create C source and CMakeLists file
mkdir -p examples/demo
touch examples/demo/demo.c
touch examples/demo/CMakeLists.txt
The code for this example is based on simple-ctx.cpp. Edit examples/demo/demo.c and write the following code:
#include ""
#include <>
#include <>
int main(void) {
// initialize data of matrices to perform matrix multiplication
const int rows_A = 4, cols_A = 2;
float matrix_A[rows_A * cols_A] = {
2, 8,
5, 1,
4, 2,
8, 6
};
const int rows_B = 3, cols_B = 2;
float matrix_B[rows_B * cols_B] = {
10, 5,
9, 9,
5, 4
};
// 1. Allocate `ggml_context` to store tensor data
// Calculate the size needed to allocate
size_t ctx_size = 0;
ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // tensor a
ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // tensor b
ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result
ctx_size += 3 * ggml_tensor_overhead(); // metadata for 3 tensors
ctx_size += ggml_graph_overhead(); // compute graph
ctx_size += 1024; // some overhead (exact calculation omitted for simplicity)
// Allocate `ggml_context` to store tensor data
struct ggml_init_params params = {
/*.mem_size =*/ ctx_size,
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ false,
};
struct ggml_context * ctx = ggml_init(params);
// 2. Create tensors and set data
struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
memcpy(tensor_a->data, matrix_A, ggml_nbytes(tensor_a));
memcpy(tensor_b->data, matrix_B, ggml_nbytes(tensor_b));
// 3. Create a `ggml_cgraph` for mul_mat operation
struct ggml_cgraph * gf = ggml_new_graph(ctx);
// result = a*b^T
// Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
// the result is transposed
struct ggml_tensor * result = ggml_mul_mat(ctx, tensor_a, tensor_b);
// Mark the "result" tensor to be computed
ggml_build_forward_expand(gf, result);
// 4. Run the computation
int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
ggml_graph_compute_with_ctx(ctx, gf, n_threads);
// 5. Retrieve results (output tensors)
float * result_data = (float *) result->data;
printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
for (int j = 0; j < result->ne[1]/* rows */; j++) {
if (j > 0) {
printf("\n");
}
for (int i = 0; i < result->ne[0]/* cols */; i++) {
printf(" %.2f", result_data[j * result->ne[0] + i]);
}
}
printf(" ]\n");
// 6. Free memory and exit
ggml_free(ctx);
return 0;
}
Then write the following code into examples/demo/CMakeLists.txt:
set(TEST_TARGET demo)
add_executable(${TEST_TARGET} demo.c)
target_link_libraries(${TEST_TARGET} PRIVATE ggml)
Edit examples/CMakeLists.txt and add this line at the end:
add_subdirectory(demo)
Then compile and run.
cmake -B build
cmake --build build --config Release --target demo
# Run it
./build/bin/demo
The expected output should look like this:
mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
90.00 54.00 54.00 126.00
42.00 29.00 28.00 64.00 ]
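One detail of the example above is worth spelling out: in ggml, ne[0] is the innermost, contiguous dimension, so a row-major matrix with 4 rows and 2 columns is created with ne[0] = 2 and ne[1] = 4. A minimal sketch, assuming a valid ggml_context ctx:
// ne[0] is the fastest-changing dimension (the row length), ne[1] is the number of rows
struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2 /* cols */, 4 /* rows */);
// for a contiguous F32 tensor, element (row j, column i) lives at:
// ((float *) t->data)[j * t->ne[0] + i]
This is why the demo creates tensor_a with (cols_A, rows_A) and indexes the result as result_data[j * result->ne[0] + i].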
Example of using the backend
In ggml, a "backend" refers to an interface that can handle tensor operations; examples include CPU, CUDA, Vulkan, etc.
A backend abstracts the execution of a computation graph. Once defined, a graph can be computed on the relevant hardware with the corresponding backend implementation. Note that during this process, ggml automatically reserves memory for the intermediate results it needs and optimizes memory usage based on their lifetimes.
The basic steps for performing computation or inference with a backend are as follows:
- Initialize the ggml_backend
- Allocate a ggml_context to hold the tensors' metadata (at this point we do not need to allocate the tensor data directly)
- Create the tensors' metadata (i.e., their shapes and data types)
- Allocate a ggml_backend_buffer to store all of the tensors
- Copy the tensors' data from main memory (RAM) into the backend buffer
- Create a ggml_cgraph for the matrix multiplication
- Create a ggml_gallocr to allocate memory for the computation graph
- Optional: schedule the graph with ggml_backend_sched
- Run the computation graph
- Retrieve the results, i.e., the outputs of the computation graph
- Free memory and exit
The code for this example is based on simple-backend.cpp:
#include ""
#include ""
#include ""
#ifdef GGML_USE_CUDA
#include ""
#endif
#include <>
#include <>
#include <>
int main(void) {
// initialize data of matrices to perform matrix multiplication
const int rows_A = 4, cols_A = 2;
float matrix_A[rows_A * cols_A] = {
2, 8,
5, 1,
4, 2,
8, 6
};
const int rows_B = 3, cols_B = 2;
float matrix_B[rows_B * cols_B] = {
10, 5,
9, 9,
5, 4
};
// 1. Initialize backend
ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
fprintf(stderr, "%s: using CUDA backend\n", __func__);
backend = ggml_backend_cuda_init(0); // init device 0
if (!backend) {
fprintf(stderr, "%s: ggml_backend_cuda_init() failed\n", __func__);
}
#endif
// if there is no GPU backend, fall back to the CPU backend
if (!backend) {
backend = ggml_backend_cpu_init();
}
// Calculate the size needed to allocate
size_t ctx_size = 0;
ctx_size += 2 * ggml_tensor_overhead(); // tensors
// no need to allocate anything else!
// 2. Allocate `ggml_context` to store tensor metadata
struct ggml_init_params params = {
/*.mem_size =*/ ctx_size,
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ true, // the tensors will be allocated later by ggml_backend_alloc_ctx_tensors()
};
struct ggml_context * ctx = ggml_init(params);
// 3. Create tensor metadata (only their shapes and data types)
struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
// 4. Allocate a `ggml_backend_buffer` to store all tensors
ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);
// 5. Copy tensor data from main memory (RAM) to backend buffer
ggml_backend_tensor_set(tensor_a, matrix_A, 0, ggml_nbytes(tensor_a));
ggml_backend_tensor_set(tensor_b, matrix_B, 0, ggml_nbytes(tensor_b));
// 6. Create a `ggml_cgraph` for mul_mat operation
struct ggml_cgraph * gf = NULL;
struct ggml_context * ctx_cgraph = NULL;
{
// create a temporary context to build the graph
struct ggml_init_params params0 = {
/*.mem_size =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ true, // the tensors will be allocated later by ggml_gallocr_alloc_graph()
};
ctx_cgraph = ggml_init(params0);
gf = ggml_new_graph(ctx_cgraph);
// result = a*b^T
// Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
// the result is transposed
struct ggml_tensor * result0 = ggml_mul_mat(ctx_cgraph, tensor_a, tensor_b);
// Add "result" tensor and all of its dependencies to the cgraph
ggml_build_forward_expand(gf, result0);
}
// 7. Create a `ggml_gallocr` for cgraph computation
ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(allocr, gf);
// (we skip step 8. Optionally: schedule the cgraph using `ggml_backend_sched`)
// 9. Run the computation
int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
if (ggml_backend_is_cpu(backend)) {
ggml_backend_cpu_set_n_threads(backend, n_threads);
}
ggml_backend_graph_compute(backend, gf);
// 10. Retrieve results (output tensors)
// in this example, output tensor is always the last tensor in the graph
struct ggml_tensor * result = gf->nodes[gf->n_nodes - 1];
float * result_data = malloc(ggml_nbytes(result));
// because the tensor data is stored in device buffer, we need to copy it back to RAM
ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));
printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
for (int j = 0; j < result->ne[1]/* rows */; j++) {
if (j > 0) {
printf("\n");
}
for (int i = 0; i < result->ne[0]/* cols */; i++) {
printf(" %.2f", result_data[j * result->ne[0] + i]);
}
}
printf(" ]\n");
free(result_data);
// 11. Free memory and exit
ggml_free(ctx_cgraph);
ggml_gallocr_free(allocr);
ggml_free(ctx);
ggml_backend_buffer_free(buffer);
ggml_backend_free(backend);
return 0;
}
Compile and run.
cmake -B build
cmake --build build --config Release --target demo
# Run it
./build/bin/demo
The expected output should be the same as in the example above:
mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
90.00 54.00 54.00 126.00
42.00 29.00 28.00 64.00 ]
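A nice property of this backend-based setup is that the graph and its buffers stay allocated after the computation, so you can update the input tensors and run the same graph again without rebuilding anything. A minimal sketch, assuming tensor_a, backend, gf, result and result_data from the example above are still in scope and have not yet been freed (i.e., placed before step 11):
// Overwrite the data of an input tensor in the backend buffer
float new_matrix_A[4 * 2] = { 1, 1, 1, 1, 1, 1, 1, 1 };
ggml_backend_tensor_set(tensor_a, new_matrix_A, 0, ggml_nbytes(tensor_a));
// Re-run the same, already-allocated graph
ggml_backend_graph_compute(backend, gf);
// Copy the new result back from the device buffer to RAM
ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));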
Printing the computation graph
A ggml_cgraph represents the computation graph, which defines the order in which the backend will execute the computation. Printing the computation graph is a very useful debugging tool, especially when the model is complex.
You can use ggml_graph_print to print the computation graph:
...
// Mark the "result" tensor to be computed
ggml_build_forward_expand(gf, result0);
// Print the cgraph
ggml_graph_print(gf);
Running the program prints:
=== GRAPH ===
n_nodes = 1
- 0: [ 4, 3, 1] MUL_MAT
n_leafs = 2
- 0: [ 2, 4] NONE leaf_0
- 1: [ 2, 3] NONE leaf_1
========================================
Additionally, you can dump the computation graph in graphviz dot format:
ggml_graph_dump_dot(gf, NULL, "debug.dot");
Then use the dot command (for example, dot -Tpng debug.dot -o debug.png) or an online graphviz renderer to turn the dot file into an image.
Summary
This article has introduced ggml, covering its basic concepts, a simple example, and an example using a backend. Beyond these basics, there is still a lot more to learn about ggml.
We will follow up with more articles covering ggml in depth, including GGUF-format models, model quantization, and how multiple backends can be coordinated to work together. In addition, you can refer to the ggml examples folder to learn about more advanced usage and sample programs. Stay tuned for more ggml content.
Original article (in English): https://huggingface.co/blog/introduction-to-ggml
Original authors: Xuan Son NGUYEN, Georgi Gerganov, slaren
Translator: hugging-hoi2022