After the open-source carnival of DeepSeek-R1, I feel many friends are stuck in a technical comfort zone. In truth, large-model technology has only just entered the application stage, and there are still plenty of areas worth exploring. So in this chapter we won't talk about papers; we'll occasionally drift off the ground, simply look up at the sky, and chat about whatever interesting directions seem worth exploring. Haha, this may just be the product of reading too many science fiction novels recently~
Continual learning: not yet conquered
Today's large-model training is still phased: OpenAI retrains its models every few months to push forward the knowledge cutoff. Bluntly, each run is a from-scratch rewrite of the model's knowledge. Like Sisyphus pushing his boulder, every full training run means systematically forgetting what came before. True continual, lifelong learning still looks like an unsolved mystery. Of course, some argue that biological evolution follows a completely different path from the evolution of the inorganic, so many people also question whether large models really need continual learning at all.
The only place I've personally seen genuine online updating is recommendation systems, where models are continuously trained and iterated on users' long- and short-term behavior sequences as they happen in real time. But those models are essentially fitting behavioral representations, which is quite different from today's large models. Although NLP has produced plenty of meta-learning and continual-learning papers, if you compare them with RL training and ChatGPT's SFT instruction tuning, you'll find they may not have found the right way in yet. Indeed, looking from Word2Vec, BERT, CLIP, ChatGPT to R1, the technique behind each epoch-making model basically follows the principle of simplicity: less craftsmanship, longer scaling curves.
Continual learning actually covers several aspects. One of the more important is replenishing purely incremental world knowledge, i.e., the knowledge and information the world has generated since the model's last training run ended. The biggest problem with continued training under the old paradigm is catastrophic forgetting: learn the new, forget the old; pick up the sesame seeds, drop the watermelon. One possible cause (pure personal conjecture) lies in the current Transformer structure: the model's language ability, world knowledge, task-completion ability, and reasoning ability are all entangled and stored together in the Transformer's parameters. So if we continue training on new knowledge, the model may forget how to complete tasks; if we only top up task-completion ability without updating knowledge, we add to the model's hallucinations (the model thinks it can, when in fact it can't). But suppose a structure could decouple these abilities in layers: knowledge stored as objective facts, reasoning ability relying more on the model's feedback-driven exploration and optimization, while language ability barely needs updating at all. The model could then continuously update its knowledge while keeping its reasoning and language skills intact, or periodically distill and compress its knowledge store. Earlier knowledge-editing papers have in fact studied how large models store knowledge and found that in the MLP layers, knowledge is effectively stored as key-value pairs.
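To make the key-value view concrete, here is a minimal toy sketch of how the knowledge-editing literature (ROME/MEMIT and friends) reads a Transformer FFN: the first projection's rows act as keys matched against the hidden state, the second projection's rows act as values written back. All dimensions and names below are illustrative, not from any specific model.

```python
import torch
import torch.nn.functional as F

d_model, d_ffn = 512, 2048

W_in = torch.randn(d_ffn, d_model)   # rows act as "keys": patterns matched against the hidden state
W_out = torch.randn(d_ffn, d_model)  # rows act as "values": content written back when a key fires

def ffn_as_memory(hidden: torch.Tensor) -> torch.Tensor:
    # Key matching: how strongly the current hidden state activates each memory slot
    scores = F.relu(hidden @ W_in.T)   # (batch, d_ffn)
    # Value readout: a weighted sum of the stored value vectors
    return scores @ W_out              # (batch, d_model)

# Under this view, "editing a fact" means locating the slot whose key fires
# on the subject and rewriting the matching row of W_out, while language and
# reasoning abilities stored elsewhere stay untouched.
h = torch.randn(1, d_model)
print(ffn_as_memory(h).shape)  # torch.Size([1, 512])
```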
The other direction of continual learning is reasoning and task-completion ability, driven by the feedback the environment gives the model as it completes tasks with tools. The model needs to optimize its action paths and task-completion strategies based on that feedback, so its success rate gradually improves through continuous practice. Haha, borrowing the civilization-evolution mechanism from "The Three-Body Problem": could we build a virtual ecosystem for the model, an "AI sandbox" along the lines of Stanford Town? The large model itself is the policy; the sandbox generates the model's todo tasks and evaluates how well it completes them to produce feedback signals. The sandbox also lets the model connect to various MCP interfaces to interact with the environment, and constraints and competitive pressures can be added dynamically, for example (a toy sketch follows the list below):
- Dynamic rewards: allocate inference resources dynamically based on task completion, encouraging the model to solve more complex problems with fewer resources
- Population competition: multiple agents attempt the same task and their results are compared against each other
- Environment mutation: randomly modify the MCP interfaces so the model must dynamically adapt to different ways of interacting with the environment
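Here is a toy sketch of what such a sandbox loop might look like. Everything in it (the task format, the scoring, the agent stub) is hypothetical scaffolding purely to show the control flow, not a real system.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    difficulty: float  # used for dynamic reward shaping

def generate_task() -> Task:
    # The sandbox generates the policy model's todo list
    return Task(prompt="book a flight via the mock MCP interface",
                difficulty=random.uniform(0.1, 1.0))

def run_agent(agent_id: int, task: Task) -> tuple[bool, int]:
    # Placeholder for the policy model interacting with (possibly mutated)
    # MCP interfaces; returns (success, tokens_spent).
    return random.random() < 0.6, random.randint(100, 2000)

def reward(task: Task, success: bool, tokens: int) -> float:
    # Dynamic reward: solving harder tasks with fewer resources pays more
    if not success:
        return 0.0
    return task.difficulty * (1000.0 / tokens)

for step in range(3):
    task = generate_task()
    # Population competition: several agents attempt the same task
    results = {aid: run_agent(aid, task) for aid in range(4)}
    rewards = {aid: reward(task, ok, tok) for aid, (ok, tok) in results.items()}
    best = max(rewards, key=rewards.get)
    print(f"step {step}: winner=agent{best}, rewards={rewards}")
    # Environment mutation would go here: randomly perturb the MCP schema
    # so agents must re-adapt their tool-calling behavior.
```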
What would endogenous RAG look like?
Beyond the continual evolution of model capabilities, another thing that seems to have reached technical consensus is RAG, retrieval-augmented generation. The way models obtain real-time information today is still a fairly traditional search stack: build a knowledge base, rewrite the query, run multi-channel recall, then coarse and fine ranking. Although each step of the retriever can be strengthened with a large model, the whole module for acquiring knowledge and real-time information is still bolted on completely outside the model; it is really a splice of last-generation search technology onto this generation's large-model technology. So what's wrong with this solution?
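To make the "bolted-on" nature concrete, here is a toy version of that classic pipeline. Every component below is a trivial stand-in (keyword overlap instead of real recall and ranking) purely to show the data flow; none of it is a real retrieval stack.

```python
# Knowledge base -> query rewriting -> multi-channel recall -> coarse/fine ranking
KB = ["R1 was released in 2025.", "RAG splices search with generation.",
      "Transformers use attention.", "Star Attention encodes context in blocks."]

def rewrite(query: str) -> list[str]:
    return [query, query.lower()]           # stand-in for LLM query rewriting

def recall(q: str) -> list[str]:
    # stand-in for BM25 + vector recall channels
    return [d for d in KB if any(w in d.lower() for w in q.lower().split())]

def rank(docs: list[str], query: str) -> list[str]:
    # coarse + fine ranking collapsed into one overlap score for brevity
    score = lambda d: len(set(d.lower().split()) & set(query.lower().split()))
    return sorted(set(docs), key=score, reverse=True)

query = "what is RAG"
context = rank([d for q in rewrite(query) for d in recall(q)], query)[:3]
# Everything above happens outside the model; the LLM only ever sees
# this linearly tiled context string.
print(context)
```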
First, the model's finite context length. Through various attention-mechanism improvements and long-context training, context windows have grown from the earliest 1024 tokens to tens of K, yet answer quality still degrades on longer inputs. And the reason contexts keep growing is that search-recall results and multi-turn dialogue history are tiled linearly into the prompt without any compression.
After R1 I also kept wondering whether the form of a problem and the form of its solution can differ. When we saw reflection, correction, and the generation of new hypotheses in model reasoning, we assumed this required a tree-shaped thinking structure; R1 proved that a linear chain of thought plus the attention mechanism can achieve the same thing. So could the compression above likewise be achieved directly through attention? For now (haha, I may well change my mind later) I think attention isn't enough, because every improvement to attention has improved its ability to efficiently locate and select information at any position and length, but that is only information selection, not information compression. Selection merely splices information together, while compression can produce intelligence and abstract concepts beyond the raw information, something like GraphRAG's node-and-relation abstractions, but without the graph being confined to triple form.
So I wondered: while the model reasons with search, could this part of the context be re-compressed, re-encoded, and stored in a separate, independent storage module? From then on, every time the model answers, it answers with both the storage module and external search, and as it keeps answering questions, the storage module's coverage keeps expanding. Each update of the storage module is a fresh round of knowledge compression and disambiguation, reflecting on the knowledge and forming new conclusions, so the density of stored knowledge keeps rising while its length stops growing linearly. NVIDIA's recently released Star Attention actually carries a similar idea of encoding the context first and then doing inference, but it compresses information only once, without deeper multi-step compression and reflection; it is closer to on-the-fly compression of inference-time information. The open-source project Mem0 has similar ideas too: through engineering design it continuously summarizes and abstracts dialogue history, resolves conflicts, and forms long-term, short-term, and other types of memory storage.
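A minimal sketch of that independent storage module, assuming a `summarize` placeholder that stands in for an LLM call doing the merging, deduplication, and conflict resolution: new retrieved context is appended, and the store periodically re-compresses itself so density rises without linear growth.

```python
def summarize(texts: list[str], budget: int = 500) -> str:
    # Placeholder: a real system would call an LLM to merge, deduplicate,
    # resolve conflicts, and abstract new conclusions from the texts.
    return " | ".join(texts)[:budget]

class CompressiveMemory:
    def __init__(self, compress_every: int = 4):
        self.entries: list[str] = []
        self.compress_every = compress_every

    def add(self, retrieved_context: str) -> None:
        self.entries.append(retrieved_context)
        if len(self.entries) >= self.compress_every:
            # Each update is a fresh round of compression, not plain appending
            self.entries = [summarize(self.entries)]

    def context(self) -> str:
        return "\n".join(self.entries)

mem = CompressiveMemory()
for turn in ["fact A from search", "fact B", "fact A restated", "fact C"]:
    mem.add(turn)
print(mem.context())
```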
Second, the mismatch between search capability and model capability. The depth and breadth of information a search engine returns in a single query are limited. Last year's mainstream plan was to have the large model rewrite the query and search from multiple angles, but the drawback is that you're casting a net with your eyes closed and relying entirely on luck: rewrite well and you can answer; rewrite badly and you're sunk. So as model capability (mainly reflection) has improved, chained search-reasoning schemes driven by model reflection have emerged, including OpenAI's Deep Research, as well as the open-source Deep Search implementations with similar designs from Jina, Dify, and Hugging Face (the boundary between Research and Search is genuinely blurry; don't agonize over it, what matters is the effect and the concrete solution). The idea is easy to grasp: each round performs a bounded search, then the model judges what information is still needed to answer the user's question, generates a new search query, searches again, supplements and updates the information, and iterates until the model judges it has enough.
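The loop itself fits in a few lines. Below is a hedged sketch of that iterate-until-satisfied control flow; `search` and `llm_judge` are stand-ins for a real search API and a reflective model, and the stopping rule here is canned just to make the sketch runnable.

```python
def search(query: str) -> str:
    return f"[results for: {query}]"          # stand-in for a search engine call

def llm_judge(question: str, evidence: list[str]) -> tuple[bool, str]:
    # Stand-in for the model reflecting on what is still missing; here we
    # simply stop after three rounds and emit a canned follow-up query.
    done = len(evidence) >= 3
    return done, f"follow-up about {question} (round {len(evidence) + 1})"

def deep_search(question: str, max_rounds: int = 6) -> list[str]:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence.append(search(query))        # one bounded search per round
        done, query = llm_judge(question, evidence)
        if done:                              # the model decides it has enough
            break
    return evidence

print(deep_search("who founded DeepSeek"))
```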
After testing this approach: with thinking models at or above o1 and R1, information density and richness improve markedly; with non-thinking models, the result is basically no better than our hand-tuned multi-step RAG, just noticeably slower. The reason is simple: most questions can be handled with upfront planning plus at most two rounds of information supplementation, and anything more complex requires the model's own thinking and reasoning ability. The trouble with this method is that total latency becomes uncontrollable, anywhere from a few minutes to tens of minutes. That's still faster than collecting the information by hand, of course, but it seems a long way from our ideal Jarvis.
So if we want to speed up information collection, we have to break out of the Deep Research framework: can the model switch from passively obtaining information to actively acquiring and storing it? Something like plugging the model directly into data streams and continuously processing, filtering, integrating, compressing, and encoding them. Searching would then no longer mean calling a search engine to reach external data, but using attention to pull the relevant information directly out of an encoded database, with better extraction efficiency and richer information acquisition. The biggest difficulty is not building the real-time data streams; you could start with a sub-domain, and scenarios like financial news already have plenty of streaming data. The hard part is how to compress and encode streaming data into a database living in the same high-dimensional space as the model's endogenous parameters. Since this data isn't trained jointly with the model, ensuring the consistency of the vector spaces is the core problem, or else training an Adapter-like bridge model as in multimodal setups. Google's recently released Titan has in fact already begun exploring these directions; we'll talk about memory in the next chapter.
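A sketch of the Adapter-like bridge idea under stated assumptions: map embeddings from a separate streaming encoder into the space the LLM's own hidden states live in, training a small MLP with a cosine-alignment objective. The dimensions, the objective, and both embedding sources below are illustrative placeholders, not any particular system.

```python
import torch
import torch.nn as nn

stream_dim, model_dim = 768, 4096

bridge = nn.Sequential(                  # a small MLP adapter, in the spirit
    nn.Linear(stream_dim, model_dim),    # of multimodal bridging setups
    nn.GELU(),
    nn.Linear(model_dim, model_dim),
)

# Toy training step: pull the bridged stream embedding toward the LLM's own
# representation of the same text (the "consistency of vector space" goal).
stream_emb = torch.randn(32, stream_dim)   # from the streaming encoder
target_emb = torch.randn(32, model_dim)    # LLM hidden states for the same text

opt = torch.optim.AdamW(bridge.parameters(), lr=1e-4)
loss = 1 - nn.functional.cosine_similarity(bridge(stream_emb), target_emb).mean()
loss.backward()
opt.step()
print(f"alignment loss: {loss.item():.4f}")
```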
Hahaha, that's a lot of rambling for one chapter. I've been reading a lot of code lately and fewer papers, and really haven't seen anything worth sharing, so consider this chapter a bit of filler!
Want a more complete collection of LLM papers · fine-tuning and pretraining data · open-source frameworks · AIGC applications >> DecryPrompt