llama.cpp is a high-performance CPU/GPU large language model inference framework for consumer-grade and edge devices. Developers can use it to convert and quantize all kinds of open-source large language models into GGUF format files and then run inference locally. Based on my research, compared with other LLM deployment options, it may be the only viable product deployment solution for small and medium-sized R&D enterprises. The key phrases here are "small and medium-sized R&D enterprises" and "product deployment solution".
Small and medium-sized R&D enterprises: compared with the 10-million-plus hardware investment that large-scale deployments require, small and medium-sized R&D enterprises can only afford a modest amount of hardware, and they usually also lack dedicated LLM R&D staff.
Product deployment solution: projects need to be deployable in vertical domains, and in most cases they also require private (on-premises) deployment.
There are quite a few articles on the internet that introduce llama.cpp, and there are even some paid courses on Bilibili, but they target fairly old versions and have little reference value by now. This article is based on version b3669, released in September 2024, with the reference code under examples/. The learning curve is fairly steep because the author (Georgi Gerganov) provides no detailed interface documentation, and the code quality of the examples is honestly low. This article aims to explain how to use llama.cpp for inference and to introduce the key functions so that developers can get started; the more advanced functionality remains to be investigated.
I. Inference process
1. Process description
Taking common interactive inference as an example, the program can be divided into roughly five functional modules, described below; a minimal control-flow sketch follows the list.
Initialization: model and system prompt initialization. In fact, judging from how the program processes them, there is no special distinction between the system prompt and user input, and in actual project development they can be handled together. The conceptual difference between them is explained later.
User input: wait for the user to enter text. A large language model is essentially a process of analyzing and understanding human text, and productization essentially means using that understanding to complete specific tasks. Around this process the internet has produced many concepts: agents, function calling, and so on. In essence they all study how to further combine the model with a program and complete the interaction. At least for now, my view is that a large model only has the ability to analyze and reason over semantics.
Analysis and prediction: this is one of the core capabilities of a large language model. It analyzes the context (system prompt, user input, and content already generated) and predicts the next token.
Inference sampling: this is the other core capability of a large language model. It selects one token (with some randomness) from the results of analysis and prediction and feeds it back into the prediction module as input, repeating until the end-of-output token (EOS) is produced.
Output: strictly speaking this module is not part of the model, but it is necessary for user interaction. In terms of product design, you can choose token-by-token output or one-time output of the whole reply.
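To make the relationship between these five modules concrete, here is a minimal control-flow sketch. The functions in it are hypothetical stand-ins for the modules above, not llama.cpp APIs; the real calls (llama_decode, llama_sampling_sample, and so on) are covered in the source analysis below.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the five modules above; NOT llama.cpp APIs.
static bool initialize_model()                               { return true; }        // 1. initialization
static std::vector<int> read_user_input()                    { return {1, 2}; }      // 2. user input, already tokenized
static std::vector<float> predict(const std::vector<int> &)  { return {0.9f, 0.1f}; }// 3. analysis / prediction
static int  sample(const std::vector<float> &)               { return 0; }           // 4. inference sampling
static void emit(int token)                                  { std::cout << token << ' '; } // 5. output
static bool is_eos(int token)                                { return token == 0; }  // end-of-output check

int main() {
    if (!initialize_model()) return 1;
    std::vector<int> context = read_user_input();   // system prompt + user input form the context
    while (true) {
        int token = sample(predict(context));       // predict the next token, then sample one
        if (is_eos(token)) break;                   // an EOS token ends this round of inference
        emit(token);                                // token-by-token output
        context.push_back(token);                   // feed the sampled token back as new input
    }
    return 0;
}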
2. Key concepts
Roles: large language models usually have three built-in roles: system, user, and assistant. These three roles are not a standard mandated for every model, but basically all current open-source models support them; they help the model understand the human context and complete the task. The system role carries the system prompt, which is what people usually mean by "prompt". Quite a few online courses dress up the writing of system prompts as "prompt engineering" and build elaborate taxonomies around it, which is not necessary. In my experience, a good system prompt needs three things: clear semantics, a clear format, and a simple task. Clear semantics means avoiding ambiguous wording in the system prompt; in plain terms, "state the problem clearly". A clear format means using markdown or JSON to pin down the important parts: if you need the model to follow a fixed analysis procedure, use markdown numbered lists; if you need the model to return structured results, specify the JSON layout. A simple task means not asking the model to handle tasks with overly complex logic or too many steps. The model's reasoning ability is based entirely on semantic understanding; it has no strict program-execution logic or mathematical logic. That is why, when you ask a model which is larger, 1.11 or 1.8, it may tell you in all seriousness that when the integer parts are equal you only need to compare the fractional parts, and since 11 is larger than 8, 1.11 is larger than 1.8. So what if we really do have computational tasks or complex workflows to handle? My solution is to interact with the program and switch contexts dynamically. Besides the system role, user generally represents input and assistant generally represents output.
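As an illustration of these three requirements, here is an invented example of what such a system prompt might look like, written as a C++ raw string literal so it could be passed straight to the prompt-initialization code. The task, categories, and field names are made up for demonstration only.

// Example system prompt (invented): clear semantics, a fixed markdown-numbered
// procedure, and a JSON output format.
static const char * SYSTEM_PROMPT = R"(You are a customer-service ticket classifier.
Follow these steps:
1. Read the user's message.
2. Decide the category: one of "billing", "technical", "other".
3. Summarize the problem in one sentence.

Return the result as JSON only:
{"category": "<category>", "summary": "<one-sentence summary>"}
)";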
token: a token should not be understood simply as a word; it is better understood as the id of a vector, i.e. an entry in the model's vocabulary, and it is the common unit for measuring the length of a model's context. What does a token correspond to? There are many wrong explanations online; a common claim is that one English word is 1 token and one Chinese character is usually 2-3 tokens. In the process description above I already explained how "analysis and prediction" interacts with "inference sampling": sampling generates one token, which is fed back to the prediction module to predict the next token, and the output module can choose to emit these tokens to the user one by one. In fact, for Chinese a token usually corresponds to a word segment. For example, "我爱中国" ("I love China") may be tokenized as "我", "爱", "中国" (3 tokens), or as "我", "爱", "中", "国" (4 tokens); the exact split depends on how the model was trained on Chinese text. Besides ordinary tokens that represent text, there is also a class of special tokens. For example, since the model generates its output one token at a time, how does the program know when to stop? There is an EOS token; when that token is produced, this round of inference is over.
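To see how a given string is actually split, you can tokenize it and print each token back as text using the llama_tokenize and llama_token_to_piece helpers listed in the key-functions section below. A minimal sketch, assuming a llama_context * ctx has already been created as described in the initialization section:

#include <cstdio>
#include <string>
#include <vector>

#include "common.h"   // llama_tokenize / llama_token_to_piece wrappers from common
#include "llama.h"

// Print how a piece of text is split into tokens. The split you see depends on
// how the model's vocabulary was trained.
static void dump_tokens(llama_context * ctx, const std::string & text) {
    // add_special = true lets the model add BOS/EOS if it expects them,
    // parse_special = false treats the text as plain user content
    std::vector<llama_token> tokens = llama_tokenize(ctx, text, /*add_special=*/true, /*parse_special=*/false);

    printf("'%s' -> %zu tokens\n", text.c_str(), tokens.size());
    for (llama_token tok : tokens) {
        printf("  %6d : '%s'\n", tok, llama_token_to_piece(ctx, tok).c_str());
    }
}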
3. Program structure
The program structure is clear: the core modules are llama and ggml. ggml is called through llama and is not normally used by developers directly. The common structures and functions are defined in llama; common is a wrapper around the llama functions that sometimes makes them easier to call. However, as versions iterate, the functions in common change quickly, so the best approach is to read the code and call the underlying functions directly.
4. Source code analysis
I'm going to use examples/main/ as the basis for a focused analysis.
(1) Initialization
Global parameters: this structure mainly receives user-supplied settings and is then used to initialize the model and the inference context.
gpt_params params;
System initialization functions:
llama_backend_init();
llama_numa_init(params.numa);
System resource release function:
llama_backend_free();
Creating models and inference contexts:
llama_init_result llama_init = llama_init_from_gpt_params(params);
llama_model *model = llama_init.model;
llama_context *ctx = llama_init.context;
These helpers are declared in common. If you need to create the model and context separately, you can use the following two pairs of functions instead:
llama_model_params model_params = llama_model_params_from_gpt_params(gpt_params_);
llama_model * llama_model_ = llama_load_model_from_file(gpt_params_.model.c_str(), model_params); // gpt_params_.model holds the GGUF model path

llama_context_params ctx_eval_params = llama_context_params_from_gpt_params(gpt_params_);
llama_context * ctx_eval = llama_new_context_with_model(llama_model_, ctx_eval_params);
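If you create the model and context separately like this, you are also responsible for releasing them yourself. To my knowledge the matching release calls in this version are llama_free for the context and llama_free_model for the model, called before llama_backend_free(); treat the exact order below as a sketch to verify against main.cpp.

// Sketch of manual cleanup for the objects created above.
llama_free(ctx_eval);            // release the inference context first
llama_free_model(llama_model_);  // then release the model
llama_backend_free();            // finally release the backend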
Create a thread pool for ggml; this is presumably related to model acceleration and is not explained in detail in the code:
// tpp (a struct ggml_threadpool_params) and threadpool_batch are set up earlier in main.cpp
struct ggml_threadpool * threadpool = ggml_threadpool_new(&tpp);
llama_attach_threadpool(ctx, threadpool, threadpool_batch);
In addition to general inference tasks, main also implements saving and loading the context (prompt cache). The precondition for switching contexts this way is that the model itself is not switched, and the prompt is only consumed during the first round of inference. Using this feature, dynamic context switching can be implemented.
std::string path_session = params.path_prompt_cache;
std::vector<llama_token> session_tokens;
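The saving side appears later in the sampling code (llama_state_save_file). For loading a previously saved cache, main.cpp uses llama_state_load_file; the sketch below follows that pattern, with the signature taken from my reading of llama.h for this version, so treat it as an assumption to verify.

// Sketch: load a previously saved prompt cache into session_tokens (declared above).
// Assumed signature: llama_state_load_file(ctx, path, tokens_out, n_token_capacity, n_token_count_out).
if (!path_session.empty()) {
    session_tokens.resize(llama_n_ctx(ctx));      // reserve up to the full context size
    size_t n_token_count_out = 0;
    if (llama_state_load_file(ctx, path_session.c_str(),
                              session_tokens.data(), session_tokens.size(),
                              &n_token_count_out)) {
        session_tokens.resize(n_token_count_out); // keep only the tokens actually loaded
    } else {
        session_tokens.clear();                   // no usable cache, start fresh
    }
}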
At this point, the process regarding the system initialization module is complete.
(2) User input
To receive user input and carry the inference output, the source code defines the following variables:
std::vector<llama_token> embd_inp;  // tokenized prompt / user input waiting to be consumed
std::vector<llama_token> embd;      // tokens fed to the model in the current step
Check for an encoder; most modern models are decoder-only and do not define one:
if (llama_model_has_encoder(model)) {
    int enc_input_size = embd_inp.size();
    llama_token * enc_input_buf = embd_inp.data();

    if (llama_encode(ctx, llama_batch_get_one(enc_input_buf, enc_input_size, 0, 0))) {
        LOG_TEE("%s : failed to eval\n", __func__);
        return 1;
    }

    llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
    if (decoder_start_token_id == -1) {
        decoder_start_token_id = llama_token_bos(model);
    }

    embd_inp.clear();
    embd_inp.push_back(decoder_start_token_id);
}
(3) Analysis and prediction
The core code of the analysis-and-prediction part is shown below; I removed the logic dealing with attention and sessions and kept only the inference logic.
// predict
if (!embd.empty()) {
    // Note: (n_ctx - 4) here is to match the logic for commandline prompt handling via
    // --prompt or --file which uses the same value.
    int max_embd_size = n_ctx - 4;

    // Ensure the input doesn't exceed the context size by truncating embd if necessary.
    if ((int) embd.size() > max_embd_size) {
        const int skipped_tokens = (int) embd.size() - max_embd_size;
        embd.resize(max_embd_size);

        console::set_display(console::error);
        printf("<<input too long: skipped %d token%s>>", skipped_tokens, skipped_tokens != 1 ? "s" : "");
        console::set_display(console::reset);
        fflush(stdout);
    }

    for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
        int n_eval = (int) embd.size() - i;
        if (n_eval > params.n_batch) {
            n_eval = params.n_batch;
        }

        LOG("eval: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd).c_str());

        if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) {
            LOG_TEE("%s : failed to eval\n", __func__);
            return 1;
        }

        n_past += n_eval;

        LOG("n_past = %d\n", n_past);
        // Display total tokens alongside total time
        if (params.n_print > 0 && n_past % params.n_print == 0) {
            LOG_TEE("\n\033[31mTokens consumed so far = %d / %d \033[0m\n", n_past, n_ctx);
        }
    }
}

embd.clear();
The logic focuses on the following points. First, if the inference context exceeds the length limit, the excess is discarded; consider refactoring this part in real development. Second, each inference step processes at most n_batch tokens, mainly so the system does not become unresponsive for a long time when there is a lot of input at once. Finally, embd is cleared after every inference step; the resulting state is kept inside ctx.
(4) Inference sampling
The source code for the inference-sampling part is divided into two branches:
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
    // optionally save the session on first sample (for faster prompt loading next time)
    if (!path_session.empty() && need_to_save_session && !params.prompt_cache_ro) {
        need_to_save_session = false;
        llama_state_save_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());

        LOG("saved session to %s\n", path_session.c_str());
    }

    const llama_token id = llama_sampling_sample(ctx_sampling, ctx, ctx_guidance);

    llama_sampling_accept(ctx_sampling, ctx, id, /* apply_grammar= */ true);

    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev).c_str());

    embd.push_back(id);

    // echo this to console
    input_echo = true;

    // decrement remaining sampling budget
    --n_remain;

    LOG("n_remain: %d\n", n_remain);
} else {
    // some user input remains from prompt or interaction, forward it to processing
    LOG("embd_inp.size(): %d, n_consumed: %d\n", (int) embd_inp.size(), n_consumed);
    while ((int) embd_inp.size() > n_consumed) {
        embd.push_back(embd_inp[n_consumed]);

        // push the prompt in the sampling context in order to apply repetition penalties later
        // for the prompt, we don't apply grammar rules
        llama_sampling_accept(ctx_sampling, ctx, embd_inp[n_consumed], /* apply_grammar= */ false);

        ++n_consumed;
        if ((int) embd.size() >= params.n_batch) {
            break;
        }
    }
}
Focus on the second branch first: its logic is to load the user's input into the context, and since the user's input does not need to be predicted, only llama_sampling_accept needs to be called. The first branch is entered only after all user input has been consumed; it samples one token at a time and writes it to embd. This process alternates with analysis and prediction until an EOS token is encountered. A condensed sketch of this alternation is given below, followed by the code main.cpp uses to detect the end-of-generation token.
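This sketch strings the two core modules together using only the key functions listed in section II. It is not the full main.cpp logic: the prompt cache, n_batch chunking, and interactive handling are omitted, and it assumes the model, context, and a sampling context already exist.

#include <cstdio>
#include <vector>

#include "common.h"    // headers as used by main.cpp (assumed)
#include "sampling.h"
#include "llama.h"

// Condensed sketch of how analysis/prediction and inference sampling alternate.
static int generate(llama_model * model, llama_context * ctx,
                    llama_sampling_context * ctx_sampling,
                    std::vector<llama_token> & embd_inp) {
    int n_past = 0;

    // feed the tokenized prompt to the model in one go (prediction only, no sampling yet)
    if (llama_decode(ctx, llama_batch_get_one(embd_inp.data(), (int) embd_inp.size(), n_past, 0))) {
        return 1;
    }
    n_past += (int) embd_inp.size();

    // push the prompt tokens into the sampling context so repetition penalties apply,
    // without grammar rules (same as the second branch above)
    for (llama_token tok : embd_inp) {
        llama_sampling_accept(ctx_sampling, ctx, tok, /* apply_grammar= */ false);
    }

    // alternate sampling and prediction until an end-of-generation token appears
    while (true) {
        llama_token id = llama_sampling_sample(ctx_sampling, ctx, nullptr);
        llama_sampling_accept(ctx_sampling, ctx, id, /* apply_grammar= */ true);

        if (llama_token_is_eog(model, id)) {
            break;                                              // EOS/EOT ends this round
        }
        printf("%s", llama_token_to_piece(ctx, id).c_str());    // token-by-token output
        fflush(stdout);

        // feed the sampled token back for the next prediction
        if (llama_decode(ctx, llama_batch_get_one(&id, 1, n_past, 0))) {
            return 1;
        }
        n_past += 1;
    }
    return 0;
}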
if (llama_token_is_eog(model, llama_sampling_last(ctx_sampling))) {
    LOG("found an EOG token\n");

    if (params.interactive) {
        if (params.enable_chat_template) {
            chat_add_and_format(model, chat_msgs, "assistant", assistant_ss.str());
        }
        is_interacting = true;
        printf("\n");
    }
}
The chat_add_and_format function only records the interaction in chat_msgs and has no effect on the overall inference flow. If you want to deliver output to the user, this is a good place to handle it.
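For example, if the product design calls for one-time output rather than token-by-token output, the accumulated assistant_ss string can be handed to the application here instead of printing each piece immediately. A minimal sketch of that idea; hand_over_to_application is a hypothetical application callback, not a llama.cpp API.

// Sketch: deliver the whole reply at once when the end-of-generation token is found.
if (llama_token_is_eog(model, llama_sampling_last(ctx_sampling))) {
    const std::string reply = assistant_ss.str();   // full assistant output accumulated so far
    hand_over_to_application(reply);                // hypothetical: send to UI, parse JSON, switch context...
    assistant_ss.str("");                           // reset the buffer for the next round
    is_interacting = true;
}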
II. Key functions
Initialize llama_model_params with gpt_params
struct llama_model_params llama_model_params_from_gpt_params (const gpt_params & params);
Load the model from file and obtain a model pointer
LLAMA_API struct llama_model * llama_load_model_from_file( const char * path_model, struct llama_model_params params);
Create a ggml thread pool and attach it to the context
GGML_API struct ggml_threadpool * ggml_threadpool_new(struct ggml_threadpool_params * params);

LLAMA_API void llama_attach_threadpool(
        struct llama_context * ctx,
        ggml_threadpool_t threadpool,
        ggml_threadpool_t threadpool_batch);
Initialize llama_context_params with gpt_params and create the inference context
struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params);

LLAMA_API struct llama_context * llama_new_context_with_model(
        struct llama_model * model,
        struct llama_context_params params);
Tokenize the input text (split it and convert it to tokens)
std::vector<llama_token> llama_tokenize( const struct llama_context * ctx, const std::string & text, bool add_special, bool parse_special = false);
Get special tokens
LLAMA_API llama_token llama_token_bos(const struct llama_model * model); // beginning-of-sentence
LLAMA_API llama_token llama_token_eos(const struct llama_model * model); // end-of-sentence
LLAMA_API llama_token llama_token_cls(const struct llama_model * model); // classification
LLAMA_API llama_token llama_token_sep(const struct llama_model * model); // sentence separator
LLAMA_API llama_token llama_token_nl (const struct llama_model * model); // next-line
LLAMA_API llama_token llama_token_pad(const struct llama_model * model); // padding
Build a batch of tokens and run prediction (decode)
LLAMA_API struct llama_batch llama_batch_get_one(
        llama_token * tokens,
        int32_t n_tokens,
        llama_pos pos_0,
        llama_seq_id seq_id);

LLAMA_API int32_t llama_decode(
        struct llama_context * ctx,
        struct llama_batch batch);
Sample a token and accept it into the sampling context
llama_token llama_sampling_sample(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        struct llama_context * ctx_cfg,
        int idx = -1);

void llama_sampling_accept(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        llama_token id,
        bool apply_grammar);
Convert a token back to text
std::string llama_token_to_piece( const struct llama_context * ctx, llama_token token, bool special = true);
Determine whether generation has ended. Note that the end-of-generation token may not be the same as the one returned by llama_token_eos, so always use this function to check:
// Check if the token is supposed to end generation (end-of-generation, eg. EOS, EOT, etc.) LLAMA_API bool llama_token_is_eog(const struct llama_model * model, llama_token token);
III. Summary
The purpose of this article is to present the basic usage of llama.cpp. Because Georgi Gerganov updates it quickly and provides little documentation, some of the explanations here may not be accurate. If you are interested in the framework and in this article, feel free to message me for an in-depth discussion.