Through a rigorous evaluation of diverse benchmarks, the paper demonstrates the shortcomings of existing domain-specific approaches in achieving cross-domain reasoning, as well as their tendency to fit biases in the training data. Visual reasoning is revisited from a two-stage perspective: (1) symbolization and (2) logical reasoning based on symbols or their representations. Since reasoning is found to generalize far better than symbolization, it is more efficient to implement symbolization with separate encoders for different data domains while sharing a single reasoner.
Paper: Take A Step Back: Rethinking the Two Stages in Visual Reasoning
- Paper address: /abs/2407.19666
- Code: /projects/TwoStageReason
Introduction
Reasoning ability is a concentrated manifestation of human intelligence and the basis for concept formation, cognitive understanding of the world, and interaction with the environment. Visual reasoning in particular, as one of the main ways humans acquire information and understanding, has become the focus of extensive research. In recent years, with the advancement of deep learning, many research works on visual reasoning have emerged, along with various datasets for evaluating reasoning models.
However, a notable limitation of existing visual reasoning efforts is that they rely on end-to-end deep learning models to perform both the recognition and reasoning stages, e.g., recognizing concepts in an image while answering logical questions. This paradigm has obvious limitations: (1) reasoning annotations (rules, relations) are much more expensive and difficult to obtain than symbolic annotations (triangles, cubes, apples), so rigorous visual reasoning datasets are typically small and current methods tend to be task-specific models on small datasets, hindering their generalization potential; (2) a generalized model that pursues both symbolic recognition and logical reasoning can be inefficient and challenging, and even recent large language models (LLMs) find it difficult to handle diverse visual reasoning tasks.
The paper argues that visual reasoning should first obtain a symbolic representation from the visual signal and then perform logical reasoning, as shown in Figure 1. A question then surfaces: should these two stages be entangled or disentangled? Reasoning naturally generalizes better than symbolization: similar logic can be used to analyze the rules of different tasks (e.g., playing Go, doing math, and spotting anomalies), whereas recognizing letters and objects requires completely different knowledge. The paper therefore argues that disentangling symbolization and reasoning is the wiser choice. The success of recent large language models (LLMs) on text-based reasoning tasks also supports this, as LLMs directly utilize abstract symbols (language) derived from human observations and focus on high-level linguistic tasks. In contrast, multimodal large language models (MLLMs) still encounter difficulties in visual reasoning even with more parameters. Another relevant recent research trend is neural-symbolic methods, which convert raw inputs into explicit symbols for subsequent reasoning and analysis. However, neural-symbolic methods are usually limited to a single dataset, making it challenging to generalize across different tasks.
The paper conducts comprehensive experiments on multiple benchmark tasks with significant domain gaps to test these hypotheses. The symbolization stage is defined as representation extraction by a deep neural network (DNN), and a variety of architectures (e.g., MLP, CNN, GNN, Transformer, neural-symbolic models, LLMs) are used to implement the logical reasoner. The paper focuses on two key questions: (1) where does the symbolization stage end in a trained DNN model, i.e., what are the appropriate symbols (representations) for reasoning, in terms of model depth, feature characteristics, and so on? (2) Which types of models and training strategies are best suited for reasoning over abstract symbols and conferring generalization ability?
For the first question, the paper finds that different tasks and domains require very different parameter scales or model depths for good symbolization. Thus, for a specific domain, a small stand-alone in-domain encoder is sufficient to extract symbols from the data for the subsequent reasoning stage. Generalized large-scale foundation models such as CLIP perform well on some tasks, but still face challenges on tasks with large domain gaps from their training data. For the second question, experimental results show that existing methods often struggle with cross-domain reasoning, preferring instead to fit biases consistent with the training data. A generalizable shared reasoner may therefore only be achievable by training it on a variety of reasoning tasks (e.g., puzzle solving, physics prediction, visual question answering) and data domains (2D, 3D, text), which the paper calls the "approximation principle".
Based on these experimental findings, the paper builds a concise framework that employs separate encoders for optimal symbolization of different data domains and follows the "approximation principle" to build a shared reasoner. The method performs well on cross-domain benchmarks and achieves excellent performance with fewer parameters.
Overall, the contributions of the paper are as follows:
- An efficient two-stage approach to visual reasoning is summarized, drawing on ideas from previous visual reasoning networks.
- Optimal design principles for the symbolization and reasoning stages in visual reasoning are explored.
- A concise framework is introduced that performs well on multiple datasets with domain gaps.
Preliminary
Two Stages
As mentioned above, visual reasoning can be divided into two phases: a symbolization phase to extract symbolic representations of the underlying data, and an inference phase for logical reasoning.
For humans, visual, auditory, and other modalities of information collected by the sensory organs are converted into electrical signals through different pathways and then sent to the cerebral cortex for logical reasoning. Analogously, for a general-purpose visual reasoning machine, separated task-specific symbolizers and a shared domain-independent reasoner are a reasonable choice. Furthermore, the reasoner should be able to reason uniformly over input information from various modalities. In other words, the essence of reasoning lies in its ability to generalize.
Symbolization Stage

In the symbolization stage, various task-oriented feature extraction networks convert multimodal inputs (text, images, video) into symbolic representations, using a symbolic encoder customized for each task. Specifically, assume there are \(n\) tasks. For the \(i\)-th task with input data \(\mathbf{x}^{i}\), task \(t^{i}\), and task-oriented encoder \(E^{i}\), the set of symbolic representations \(\mathbf{f}^{i}\) is obtained by:

\[ \mathbf{f}^{i} = E^{i}(\mathbf{x}^{i}, t^{i}) \]
Reasoning Stage

The reasoner receives the symbolic representation of each task and is designed to capture a deeper, more comprehensive understanding of the patterns and relationships embedded in the data. The set of symbolic representations of all tasks \(\{\mathbf{f}^{i}\}_{i=1}^{n}\) is fed into the reasoner \(R\), which processes it logically to obtain the set of reasoning results \(\{\mathbf{c}^{i}\}_{i=1}^{n}\), facilitating cross-modal problem solving:

\[ \{\mathbf{c}^{i}\}_{i=1}^{n} = R(\{\mathbf{f}^{i}\}_{i=1}^{n}) \]
Task-specific Heads

The final part of the framework is the task-specific head, which takes the reasoner's output as input and generates a task-specific answer. For each task, a task-specific classification or regression head \(H_{i}\) is built to produce the final output \(s^{i}\):

\[ s^{i} = H_{i}(\mathbf{c}^{i}) \]
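To make the three stages concrete, below is a minimal PyTorch sketch of the framework (module and task names are hypothetical, dimensions illustrative; this is not the paper's actual implementation): separate encoders \(E^{i}\), one shared reasoner \(R\), and task-specific heads \(H_{i}\).

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Separate per-task encoders, one shared reasoner, per-task heads."""
    def __init__(self, encoders: dict, reasoner: nn.Module, heads: dict):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # E^i: one per task
        self.reasoner = reasoner                 # R: shared by all tasks
        self.heads = nn.ModuleDict(heads)        # H_i: one per task

    def forward(self, task: str, x: torch.Tensor) -> torch.Tensor:
        f = self.encoders[task](x)  # symbolization (task conditioning is
                                    # folded into the per-task encoder)
        c = self.reasoner(f)        # reasoning: c^i = R(f^i)
        return self.heads[task](c)  # answer:    s^i = H_i(c^i)

# Toy instantiation: two tasks sharing one MLP reasoner.
model = TwoStageModel(
    encoders={"raven": nn.Linear(256, 128), "cvr": nn.Linear(512, 128)},
    reasoner=nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128)),
    heads={"raven": nn.Linear(128, 8), "cvr": nn.Linear(128, 4)},
)
logits = model("raven", torch.randn(2, 256))
```

Note that only the encoder and head dictionaries grow with the number of tasks; the reasoner is a single set of weights updated by every task.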
Symbolization-Reasoning Framework
Entanglement vs. Disentanglement
Considering these two stages, a natural question arises: should the symbolic encoder (symbolization) and the reasoner (reasoning) be shared across tasks or separated? To validate the shared-reasoner hypothesis, four designs are compared (Figure 2; a minimal construction sketch follows the list):
- Both-Separated: both the symbolic encoder and the logical reasoner are separated (a specific model for each task).
- Both-Shared: both the symbolic encoder and the reasoner are shared.
- Shared-Encoder-Only: only the symbolic encoder is shared.
- Shared-Reasoner-Only: only the reasoner is shared.
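For illustration, the four designs differ only in which modules are instantiated once and which per task. A minimal sketch, assuming factory callables `make_enc`/`make_rsn` that return fresh modules:

```python
import torch.nn as nn

def build(design: str, tasks: list, make_enc, make_rsn):
    """Return per-task encoder/reasoner dicts for one of the four designs;
    a shared module is registered under every task key."""
    shared_enc, shared_rsn = make_enc(), make_rsn()
    share_e = design in ("both-shared", "shared-encoder-only")
    share_r = design in ("both-shared", "shared-reasoner-only")
    encoders = nn.ModuleDict({t: shared_enc if share_e else make_enc()
                              for t in tasks})
    reasoners = nn.ModuleDict({t: shared_rsn if share_r else make_rsn()
                               for t in tasks})
    return encoders, reasoners

# The design the paper favors: separate encoders, one shared reasoner.
encoders, reasoners = build("shared-reasoner-only", ["raven", "cvr", "svrt"],
                            make_enc=lambda: nn.Linear(256, 128),
                            make_rsn=lambda: nn.Linear(128, 128))
```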
The four designs are compared on several multimodal visual reasoning benchmarks. For shared encoders/reasoners, more parameters are allotted to balance them against the total parameters of the separated encoders/reasoners. In all benchmarks, the shared-reasoner-only design (type 4) performs far better than the shared-encoder-only (type 3) and fully shared (types 1 and 2) designs. Moreover, the shared-reasoner-only design even outperforms the fully separated ad-hoc model on some benchmarks, validating its superiority and generalization ability.
Symbolization Depth
Next, the appropriate depth of the symbolic encoder for different tasks is explored. The symbolization stage processes inputs from different domains and maps them to the conceptual level, i.e., symbols. Although binary or index-like (one-hot) representations could represent symbols, in the deep learning context the paper opts for the more expressive choice: high-dimensional features extracted from deep neural networks. Intuitively, different tasks may require different levels of symbolization; the question is how to determine the level of abstraction for each task.

To answer this quantitatively, the paper uses the same feature extraction network (ResNet) across the cross-domain tasks to control for variables, while continuously adjusting the network depth. The outputs of the symbolization network at different depths are connected to the same reasoner, and accuracy is measured as an indicator of symbolization completion.
It is assumed that when symbolization is complete, the depth-accuracy curve shows a clear inflection point. By monitoring this inflection point, the appropriate symbolization depth can be selected for each task, as shown in Figure 3. The experimental results match common sense: networks that are either too shallow or too deep hurt the reasoning task. On the one hand, if the encoder is too shallow, symbolization may be incomplete when the features reach the reasoner, which must then finish part of the symbolization itself, hurting reasoning performance. On the other hand, a network that is too deep tends to overfit the single task, weakening a shared reasoner aimed at generalization. Different tasks in different domains also require different symbolization depths, and setting the depth improperly leads to shared-parameter conflicts and thus poor performance at deeper levels.
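The probe itself is easy to picture in code. Below is an illustrative sketch (my construction, not the paper's exact protocol): cut a ResNet-18 after \(k\) of its four residual stages, pool the feature map into a vector, and attach the same shared MLP reasoner at every cut point; plotting accuracy against \(k\) should expose the inflection point.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def truncated_encoder(k: int) -> nn.Module:
    """ResNet-18 cut after k residual stages, pooled to a feature vector."""
    m = resnet18(weights=None)
    stages = [m.layer1, m.layer2, m.layer3, m.layer4][:k]
    return nn.Sequential(m.conv1, m.bn1, m.relu, m.maxpool, *stages,
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

channels = {1: 64, 2: 128, 3: 256, 4: 512}  # feature width at each cut point
for k in (1, 2, 3, 4):
    enc = truncated_encoder(k)
    reasoner = nn.Sequential(nn.Linear(channels[k], 256), nn.ReLU(),
                             nn.Linear(256, 8))  # shared MLP reasoner
    feats = enc(torch.randn(2, 3, 224, 224))
    print(k, feats.shape, reasoner(feats).shape)
```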
Reasoner Architecture
Next, the paper investigates which architecture is best suited for the reasoner, a long-standing question for which many methods and improvements have been proposed on various visual reasoning benchmarks. Here, each task keeps its own encoder and task head, designed to accommodate the inherent characteristics of its data, while a single shared reasoner processes the symbolic representations of all tasks according to Equation 2.
The paper chooses a series of architectures that have succeeded on many tasks as reasoner candidates: the multilayer perceptron (MLP), the convolutional neural network (CNN), and the Transformer. It also explores a hybrid neural-symbolic model, combining the representational power of neural networks with the interpretability of symbolic systems, and employs popular graph and autoregressive models: the graph convolutional network (GCN) and MiniGPT-4. Together these models provide a broad and diverse pool of candidates: if one showed strong, consistent performance on diverse datasets across domains, that would suggest a type of architecture particularly adept at logical reasoning.
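As a concrete reference for two of these candidates, here is a minimal sketch (dimensions are illustrative assumptions) of an MLP reasoner over flattened symbols versus a Transformer reasoner attending across them:

```python
import torch
import torch.nn as nn

d = 128                                # per-symbol feature width
symbols = torch.randn(2, 9, d)         # e.g., 9 panels of a RAVEN problem

mlp_reasoner = nn.Sequential(          # MLP: reasons over flattened symbols
    nn.Flatten(), nn.Linear(9 * d, 256), nn.ReLU(), nn.Linear(256, d))

transformer_reasoner = nn.TransformerEncoder(  # attends across symbols
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2)

print(mlp_reasoner(symbols).shape)          # torch.Size([2, 128])
print(transformer_reasoner(symbols).shape)  # torch.Size([2, 9, 128])
```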
Generalization of Reasoner
Last but not least, the paper aims to validate its "approximation principle": a reasoner with generalized logical reasoning ability can be approached by training on diverse tasks and data from different domains. Since reasoning should encompass both universality and generalization, the complete two-stage model is first trained on one task, and its reasoner is then directly paired with the symbolic encoder of another task. If the reasoner generalizes, it should adapt well to the other task's encoder. In the paper's tests, however, reasoners trained on only one task/domain typically generalized poorly, so the next step is to verify whether training on more tasks/domains yields better generalization.
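A minimal sketch of this cross-task probe (hypothetical helper; the paper's exact protocol may differ): freeze the reasoner trained on task A and stack it with a fresh encoder and head for task B, so only the new modules receive gradients.

```python
import torch.nn as nn

def transfer_reasoner(trained_reasoner: nn.Module,
                      new_encoder: nn.Module,
                      new_head: nn.Module) -> nn.Module:
    """Pair a frozen, already-trained reasoner with a new task's modules."""
    for p in trained_reasoner.parameters():
        p.requires_grad = False          # reasoner weights stay fixed
    return nn.Sequential(new_encoder, trained_reasoner, new_head)
```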
As shown in Figure 4, the overall problem becomes more challenging as data from more tasks and domains is added. However, the reasoner then concentrates on "pure" reasoning rather than task- or domain-specific solutions, which yields better generalization capabilities; in other words, the paper's "approximation principle" is reasonable. It can therefore be predicted that the reasoner will perform better on cross-domain tasks as the training data and tasks grow. In addition, a shared reasoner trained across tasks makes the entire visual reasoning framework lighter and more efficient.
Experiments
Entanglement vs. Disentanglement Analysis
To compare the four designs in Figure 2, models were trained on five datasets covering three task types: RAVEN, CVR, SVRT, Bongard-HOI, and Bongard-LOGO. To control variables and facilitate sharing, ResNet-18 is used as the encoder and an MLP as the reasoner for all designs.
As shown in Table 1, the shared-reasoner-only design performs comparably to the ad-hoc scheme on all five datasets, and even better on RAVEN and SVRT. In addition, the shared-encoder-only and both-shared schemes perform significantly worse on all datasets. This confirms the effectiveness of pairing task-specific symbolizers with a shared reasoner across multiple tasks.
Optimal Symbolization Depth
Next, the boundary between the two stages is located by probing the depth of the symbolic encoder, as shown in Figure 3; the shared reasoner here is an MLP. By observing changes in accuracy, the goal is to find a clear inflection point marking the end of the symbolization stage. To ensure fairness, a ResNet-18 encoder is used on the 2D datasets (RAVEN, CVR, SVRT, Bongard-LOGO, Bongard-HOI). For each benchmark, the model is first trained to optimal performance, and the trained network is then cut off at different depths to detect the symbolization termination point. The outputs of the separated symbolic encoders are connected to the shared reasoner, and the accuracy at each cut point is recorded as evidence of symbolization termination.
As shown in Figure 5 and Table 2, for each benchmark, accuracy first rises with network depth and then enters a plateau. The location of the inflection point varies from task to task with the difficulty and the level of symbolic abstraction required. For example, Bongard-HOI's inflection point is much deeper than RAVEN's, suggesting that the former is harder to symbolize and requires a deeper symbolization network to obtain complex high-dimensional features. These results validate the necessity of using symbolization networks of different depths on datasets of different complexity, and illustrate a reasonable boundary between the two stages.
One-for-All Reasoner Architecture
Next, suitable reasoner architectures are identified and tested in the shared-reasoner-only design. Nine cross-domain datasets and tasks are chosen (including RAVEN, CVR, SVRT, Bongard-HOI, Bongard-LOGO, Filtered-CoPhy, and VQAv2), because solving various reasoning problems on data from different domains better demonstrates a model's reasoning ability.
Task-specific encoders and heads are designed according to the requirements of each task. For the reasoner, CNN, MLP, Transformer, GCN, the hybrid neural-symbolic model, and MiniGPT-4 are tested. Each dataset is first trained individually with separated encoders and heads to obtain the best stand-alone results, and then multiple datasets are trained jointly with a shared reasoner.
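A minimal sketch of such joint training (hypothetical names, reusing the `TwoStageModel` sketch from the Preliminary section): cycle over per-dataset loaders so each step updates the shared reasoner plus that dataset's own encoder and head.

```python
import itertools

def joint_train(model, loaders: dict, optimizer, criterion, steps: int):
    """Round-robin joint training over several task-specific dataloaders."""
    iters = {t: itertools.cycle(dl) for t, dl in loaders.items()}
    task_cycle = itertools.cycle(list(loaders))
    for _ in range(steps):
        task = next(task_cycle)
        x, y = next(iters[task])             # one batch from this task
        loss = criterion(model(task, x), y)  # forward through E^i, R, H_i
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```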
As shown in Table 3, among all architectures the MLP unexpectedly performs best on four datasets and comparably on the other five. The GCN also performs well on three datasets, consistent with prior experience in reasoning work. However, architectures often considered more advanced, such as the Transformer, show no significant advantage. The MLP is therefore chosen as the lightweight reasoner in One-for-All.
One-for-All performs well on most tasks, even when compared with state-of-the-art (SOTA) ad-hoc methods. The paper classifies the SOTA methods by complexity into lightweight and heavyweight, as shown in Table 4. One-for-All performs comparably to the lightweight ad-hoc SOTA, and on some datasets (e.g., RAVEN) even surpasses it. This experiment shows that the reasoning stage relates parameters to performance differently than the recognition task does: a lightweight reasoner can also perform well on reasoning tasks if trained on multi-domain tasks.
Since reasoning ability cannot be measured by accuracy alone, the paper also assesses it via reasoning consistency. For each task, using the same encoder and reasoner parameters, two question-answering modes are used: "What is the answer to this question?" and "Is a particular option correct?". A model that truly reasons should produce consistent results across the two modes, unlike a random model, which may be inconsistent. The paper uses the F1 score to measure consistency between the modes, as shown in Table 5. When trained jointly on multiple datasets, One-for-All shows higher consistency than individually trained models, demonstrating its potential for genuine reasoning.
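One plausible way to compute such a consistency score (my reading of the setup, not necessarily the paper's exact protocol): for each (question, option) probe, the open-ended answer implies a yes/no label, which is compared against the direct yes/no answer with F1.

```python
from sklearn.metrics import f1_score

def consistency_f1(chosen: list, options: list, yes_no: list) -> float:
    """F1 between labels implied by 'what is the answer?' and the direct
    'is this option correct?' judgments, over (question, option) probes."""
    implied = [int(c == o) for c, o in zip(chosen, options)]
    return f1_score(implied, yes_no)

# Toy example: four probes; the last one is inconsistent.
print(consistency_f1(chosen=["B", "B", "A", "C"],
                     options=["B", "A", "A", "C"],
                     yes_no=[1, 0, 1, 0]))  # 0.8
```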
To further assess LLM performance, MiniGPT-4 is used as a shared reasoner. One-for-All also shows advantages at similar model sizes. Surprisingly, the lightweight One-for-All outperforms MiniGPT-4 on specific tasks such as RAVEN and Bongard-HOI, strong evidence that the number of model parameters and reasoning ability are not absolutely positively correlated.
To analyze the performance of LLM-based models, probe tasks are designed around the paper's two-stage framework, examining: (1) symbolization: whether the LLM-based model can identify the elements of the problem; (2) conceptualization: whether it can learn the specific concepts behind the task and reason about them; (3) answer generation: whether it can use the learned concepts to solve the problem. Figure 6 summarizes the typical responses of LLM-based models, represented by MiniGPT-4, to these three levels of questions on RAVEN and Bongard.
The paper finds that LLMs can hallucinate when solving visual reasoning tasks. As shown in Figure 6, on the RAVEN question MiniGPT-4 succeeds at the first stage, recognizing the objects, but fails at the second stage when reasoning about the alignment rule. On the Bongard question, MiniGPT-4 recognizes the human activity in the first stage and grasps the logic in the second, but fails at answer generation, getting lost when applying the rule to answer the question. These cases expose the weakness of LLM-based models on reasoning tasks: good conceptual understanding, but underperformance in logical reasoning and answer generation.
Approximation Principle Verification
Next, it is verified that training the reasoner on data from multiple domains gives it better generalization. Experiments are conducted on SVRT, Bongard-HOI, Filtered-CoPhy's Balls task, Filtered-CoPhy's Collision task, and VQAv2. These datasets cover 2D puzzles, 3D video, and VQA tasks, providing diverse multimodal data; Filtered-CoPhy's Collision task serves as the test benchmark.
The paper trains the reasoner on an increasing number of cross-domain datasets and pairs it with a separated encoder for the target test dataset. Given the inherent differences between the datasets, a highly lightweight MLP-based adapter is introduced before the reasoner. To equalize each dataset's contribution to the reasoner, the number of training samples per dataset is adjusted; specifically, sample sizes of 1,000 and 3,000 are used.
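A minimal sketch of the adapter and the subsampling (names and dimensions are illustrative assumptions):

```python
import torch.nn as nn
from torch.utils.data import Subset

def make_adapter(enc_dim: int, rsn_dim: int) -> nn.Module:
    """Lightweight MLP adapter mapping one encoder's output into the
    shared reasoner's input space."""
    return nn.Sequential(nn.Linear(enc_dim, rsn_dim), nn.ReLU())

def subsample(dataset, n: int) -> Subset:
    """Cap a dataset at n samples to equalize its contribution."""
    return Subset(dataset, list(range(min(n, len(dataset)))))

adapter = make_adapter(enc_dim=512, rsn_dim=128)  # e.g., ResNet-18 features
```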
As shown in Table 6, the reasoner progressively improves as the number of training datasets increases. Although handling more datasets from diverse domains significantly raises the difficulty, the trained reasoner performs better on the cross-domain Filtered-CoPhy task. This suggests that as the domains of the training data grow, the reasoner focuses on task-independent pure reasoning, validating the approximation principle.
Additional Ablation Study
Table 7 shows ablation experiments on whether to use a pre-trained model in the symbolic encoder, conducted on the RAVEN, CVR, and SVRT datasets with ImageNet as the pre-training dataset. The results are very close; a possible reason is the significant domain gap between ImageNet and these three reasoning datasets.
The paper also tests CLIP, a representative generalized large-scale foundation model, as a generalized symbolic encoder: CLIP serves as the visual encoder for the multimodal datasets, followed by an MLP as the reasoner and task-specific head networks. As shown in Table 8, even after fine-tuning, CLIP still falls short of the best One-for-All results. This confirms that even a large model like CLIP cannot handle symbolization for all the different datasets, supporting the rationale for the separated-encoder, shared-reasoner framework design.
If this article helped you, please give it a like or a "Looking" ~~
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].