Through a rigorous evaluation of diverse benchmarks, the paper demonstrates the shortcomings of existing domain-specific approaches in achieving cross-domain reasoning, as well as their tendency to fit biases in the training data. Visual reasoning is revisited from a two-stage perspective: (1) symbolization and (2) logical reasoning based on symbols or their representations. Since reasoning is found to generalize far better than symbolization, it is more efficient to implement symbolization with separate encoders for different data domains while sharing a single reasoner.
Paper: Take A Step Back: Rethinking the Two Stages in Visual Reasoning
- Paper address: /abs/2407.19666
- Code: /projects/TwoStageReason
Introduction
Reasoning ability is a concentrated manifestation of human intelligence and the basis for concept formation, cognitive understanding of the world, and interaction with the environment. Visual reasoning in particular, as one of the main ways humans acquire information and understanding, has become the focus of extensive research. In recent years, with the advancement of deep learning, many research works on visual reasoning have emerged, along with various datasets for evaluating reasoning models.
However, a notable limitation of existing visual reasoning efforts is that they rely on end-to-end deep learning models to perform both the recognition and reasoning stages, e.g., recognizing concepts in an image while answering logical questions. This paradigm has obvious limitations: (1) reasoning annotations (rules, relations) are much more expensive and difficult to obtain than symbolic annotations (triangles, cubes, apples), so rigorous visual reasoning datasets are typically small and current methods tend to be task-specific models on small datasets, hindering their generalization potential; (2) a generalized model that pursues both symbolic recognition and logical reasoning can be inefficient and challenging, and even recent large language models (LLMs) find it difficult to handle diverse visual reasoning tasks.
The paper argues that visual reasoning should first obtain a symbolic representation from the visual signal and then perform logical reasoning, as shown in Figure 1. A question then surfaces: should these two stages be entangled or disentangled? Reasoning naturally generalizes better than symbolization: similar logic can be used to analyze the rules of different tasks (e.g., playing Go, doing math, and spotting anomalies), whereas recognizing letters and objects requires completely different knowledge. The paper therefore argues that disentangling symbolization and reasoning is the wiser choice. The success of recent large language models (LLMs) on text-based reasoning tasks also supports this, as LLMs directly utilize abstract symbols (language) derived from human observations and focus on high-level linguistic tasks. In contrast, multimodal large language models (MLLMs) still encounter difficulties in visual reasoning even with more parameters. Another relevant recent research trend is neural-symbolic methods, which convert raw inputs into explicit symbols for subsequent reasoning and analysis. However, neural-symbolic methods are usually limited to a single dataset, making it challenging to generalize across different tasks.
The paper conducts comprehensive experiments on multiple benchmark tasks with significant domain gaps to test these hypotheses. The symbolization stage is defined as representation extraction by a deep neural network (DNN), and a variety of architectures (e.g., MLP, CNN, GNN, Transformer, neural-symbolic models, LLMs) are used to implement the logical reasoner. The paper focuses on two key questions: (1) where does the symbolization stage end in a trained DNN model, i.e., what are the appropriate symbols (representations) for reasoning, in terms of model depth, feature characteristics, and so on? (2) Which types of models and training strategies are best suited for reasoning over abstract symbols and conferring generalization ability?
For the first question, the paper finds that different tasks and domains require very different parameter scales or model depths for good symbolization. Thus, for a specific domain, a small stand-alone in-domain encoder is sufficient to extract symbols from the data for the subsequent reasoning stage. Generalized large-scale foundation models such as CLIP perform well on some tasks, but still face challenges on tasks with large domain gaps from their training data. For the second question, experimental results show that existing methods often struggle with cross-domain reasoning, preferring instead to fit biases consistent with the training data. A generalizable shared reasoner may therefore only be achievable by training it on a variety of reasoning tasks (e.g., puzzle solving, physics prediction, visual question answering) and data domains (2D, 3D, text), which the paper calls the "approximation principle".
Based on these experimental findings, the paper builds a concise framework that employs separate encoders for optimal symbolization of different data domains and follows the "approximation principle" to build a shared reasoner. The method performs well on cross-domain benchmarks and achieves excellent performance with fewer parameters.
Overall, the contributions of the paper are as follows:
- An efficient two-stage approach to visual reasoning is summarized, drawing on ideas from previous visual reasoning networks.
- Optimal design principles for the symbolization and reasoning stages in visual reasoning are explored.
- A concise framework is introduced that performs well on multiple datasets with domain gaps.
Preliminary
Two Stages
As mentioned above, visual reasoning can be divided into two phases: a symbolization phase to extract symbolic representations of the underlying data, and an inference phase for logical reasoning.
For humans, visual, auditory, and other modalities of information collected by the sensory organs are converted into electrical signals through different pathways and then sent to the cerebral cortex for logical reasoning. Analogously, for a general-purpose visual reasoning machine, separated task-specific symbolizers and a shared domain-independent reasoner are a reasonable choice. Furthermore, the reasoner should be able to reason uniformly over input information from various modalities. In other words, the essence of reasoning lies in its ability to generalize.
Symbolization Stage

In the symbolization stage, various task-oriented feature extraction networks convert multimodal inputs (text, images, video) into symbolic representations, using a symbolic encoder customized for each task. Specifically, assume there are \(n\) tasks. For the \(i\)-th task with input data \(\mathbf{x}^{i}\), task \(t^{i}\), and task-oriented encoder \(E^{i}\), the set of symbolic representations \(\mathbf{f}^{i}\) is obtained by:

\[ \mathbf{f}^{i} = E^{i}(\mathbf{x}^{i}, t^{i}) \]
Reasoning Stage

The reasoner receives the symbolic representation of each task and is designed to capture a deeper, more comprehensive understanding of the patterns and relationships embedded in the data. The set of symbolic representations of all tasks \(\{\mathbf{f}^{i}\}_{i=1}^{n}\) is fed into the reasoner \(R\), which processes it logically to obtain the set of reasoning results \(\{\mathbf{c}^{i}\}_{i=1}^{n}\), facilitating cross-modal problem solving:

\[ \{\mathbf{c}^{i}\}_{i=1}^{n} = R(\{\mathbf{f}^{i}\}_{i=1}^{n}) \]
Task-specific Heads

The final part of the framework is the task-specific head, which takes the reasoner's output as input and generates a task-specific answer. For each task, a task-specific classification or regression head \(H_{i}\) is built to produce the final output \(s^{i}\):

\[ s^{i} = H_{i}(\mathbf{c}^{i}) \]
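To make the three stages concrete, below is a minimal PyTorch sketch of the framework (module and task names are hypothetical, dimensions illustrative; this is not the paper's actual implementation): separate encoders \(E^{i}\), one shared reasoner \(R\), and task-specific heads \(H_{i}\).

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Separate per-task encoders, one shared reasoner, per-task heads."""
    def __init__(self, encoders: dict, reasoner: nn.Module, heads: dict):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # E^i: one per task
        self.reasoner = reasoner                 # R: shared by all tasks
        self.heads = nn.ModuleDict(heads)        # H_i: one per task

    def forward(self, task: str, x: torch.Tensor) -> torch.Tensor:
        f = self.encoders[task](x)  # symbolization (task conditioning is
                                    # folded into the per-task encoder)
        c = self.reasoner(f)        # reasoning: c^i = R(f^i)
        return self.heads[task](c)  # answer:    s^i = H_i(c^i)

# Toy instantiation: two tasks sharing one MLP reasoner.
model = TwoStageModel(
    encoders={"raven": nn.Linear(256, 128), "cvr": nn.Linear(512, 128)},
    reasoner=nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128)),
    heads={"raven": nn.Linear(128, 8), "cvr": nn.Linear(128, 4)},
)
logits = model("raven", torch.randn(2, 256))
```

Note that only the encoder and head dictionaries grow with the number of tasks; the reasoner is a single set of weights updated by every task.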
Symbolization-Reasoning Framework
Entanglement vs. Disentanglement
Considering these two stages, a natural question arises: should the symbolic encoder (symbolization) and the reasoner (reasoning) be shared across tasks or separated? To validate the shared-reasoner hypothesis, four designs are compared (Figure 2; a minimal construction sketch follows the list):
- Both-Separated: both the symbolic encoder and the logical reasoner are separated (a specific model for each task).
- Both-Shared: both the symbolic encoder and the reasoner are shared.
- Shared-Encoder-Only: only the symbolic encoder is shared.
- Shared-Reasoner-Only: only the reasoner is shared.
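For illustration, the four designs differ only in which modules are instantiated once and which per task. A minimal sketch, assuming factory callables `make_enc`/`make_rsn` that return fresh modules:

```python
import torch.nn as nn

def build(design: str, tasks: list, make_enc, make_rsn):
    """Return per-task encoder/reasoner dicts for one of the four designs;
    a shared module is registered under every task key."""
    shared_enc, shared_rsn = make_enc(), make_rsn()
    share_e = design in ("both-shared", "shared-encoder-only")
    share_r = design in ("both-shared", "shared-reasoner-only")
    encoders = nn.ModuleDict({t: shared_enc if share_e else make_enc()
                              for t in tasks})
    reasoners = nn.ModuleDict({t: shared_rsn if share_r else make_rsn()
                               for t in tasks})
    return encoders, reasoners

# The design the paper favors: separate encoders, one shared reasoner.
encoders, reasoners = build("shared-reasoner-only", ["raven", "cvr", "svrt"],
                            make_enc=lambda: nn.Linear(256, 128),
                            make_rsn=lambda: nn.Linear(128, 128))
```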
The four designs are compared on several multimodal visual reasoning benchmarks. For shared encoders/reasoners, more parameters are allotted to balance them against the total parameters of the separated encoders/reasoners. In all benchmarks, the shared-reasoner-only design (type 4) performs far better than the shared-encoder-only (type 3) and fully shared (types 1 and 2) designs. Moreover, the shared-reasoner-only design even outperforms the fully separated ad-hoc model on some benchmarks, validating its superiority and generalization ability.
Symbolization Depth
Next, the appropriate depth of the symbolic encoder for different tasks is explored. The symbolization stage processes inputs from different domains and maps them to the conceptual level, i.e., symbols. Although binary or index-like (one-hot) representations could represent symbols, in the deep learning context the paper opts for the more expressive choice: high-dimensional features extracted from deep neural networks. Intuitively, different tasks may require different levels of symbolization; the question is how to determine the level of abstraction for each task.

To answer this quantitatively, the paper uses the same feature extraction network (ResNet) across the cross-domain tasks to control for variables, while continuously adjusting the network depth. The outputs of the symbolization network at different depths are connected to the same reasoner, and accuracy is measured as an indicator of symbolization completion.
It is assumed that when symbolization is complete, the depth-accuracy curve shows a clear inflection point. By monitoring this inflection point, the appropriate symbolization depth can be selected for each task, as shown in Figure 3. The experimental results match common sense: networks that are either too shallow or too deep hurt the reasoning task. On the one hand, if the encoder is too shallow, symbolization may be incomplete when the features reach the reasoner, which must then finish part of the symbolization itself, hurting reasoning performance. On the other hand, a network that is too deep tends to overfit the single task, weakening a shared reasoner aimed at generalization. Different tasks in different domains also require different symbolization depths, and setting the depth improperly leads to shared-parameter conflicts and thus poor performance at deeper levels.
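The probe itself is easy to picture in code. Below is an illustrative sketch (my construction, not the paper's exact protocol): cut a ResNet-18 after \(k\) of its four residual stages, pool the feature map into a vector, and attach the same shared MLP reasoner at every cut point; plotting accuracy against \(k\) should expose the inflection point.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def truncated_encoder(k: int) -> nn.Module:
    """ResNet-18 cut after k residual stages, pooled to a feature vector."""
    m = resnet18(weights=None)
    stages = [m.layer1, m.layer2, m.layer3, m.layer4][:k]
    return nn.Sequential(m.conv1, m.bn1, m.relu, m.maxpool, *stages,
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

channels = {1: 64, 2: 128, 3: 256, 4: 512}  # feature width at each cut point
for k in (1, 2, 3, 4):
    enc = truncated_encoder(k)
    reasoner = nn.Sequential(nn.Linear(channels[k], 256), nn.ReLU(),
                             nn.Linear(256, 8))  # shared MLP reasoner
    feats = enc(torch.randn(2, 3, 224, 224))
    print(k, feats.shape, reasoner(feats).shape)
```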
Reasoner Architecture
Next, the paper investigates which architecture is best suited for the reasoner, a long-standing question for which many methods and improvements have been proposed on various visual reasoning benchmarks. Here, each task keeps its own encoder and task head, designed to accommodate the inherent characteristics of its data, while a single shared reasoner processes the symbolic representations of all tasks according to Equation 2.
The paper chooses a series of architectures that have succeeded on many tasks as reasoner candidates: the multilayer perceptron (MLP), the convolutional neural network (CNN), and the Transformer. It also explores a hybrid neural-symbolic model, combining the representational power of neural networks with the interpretability of symbolic systems, and employs popular graph and autoregressive models: the graph convolutional network (GCN) and MiniGPT-4. Together these models provide a broad and diverse pool of candidates: if one showed strong, consistent performance on diverse datasets across domains, that would suggest a type of architecture particularly adept at logical reasoning.
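As a concrete reference for two of these candidates, here is a minimal sketch (dimensions are illustrative assumptions) of an MLP reasoner over flattened symbols versus a Transformer reasoner attending across them:

```python
import torch
import torch.nn as nn

d = 128                                # per-symbol feature width
symbols = torch.randn(2, 9, d)         # e.g., 9 panels of a RAVEN problem

mlp_reasoner = nn.Sequential(          # MLP: reasons over flattened symbols
    nn.Flatten(), nn.Linear(9 * d, 256), nn.ReLU(), nn.Linear(256, d))

transformer_reasoner = nn.TransformerEncoder(  # attends across symbols
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2)

print(mlp_reasoner(symbols).shape)          # torch.Size([2, 128])
print(transformer_reasoner(symbols).shape)  # torch.Size([2, 9, 128])
```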
Generalization of Reasoner
Last but not least, the paper aims to validate its "approximation principle": a reasoner with generalized logical reasoning ability can be approached by training on diverse tasks and data from different domains. Since reasoning should encompass both universality and generalization, the complete two-stage model is first trained on one task, and its reasoner is then directly paired with the symbolic encoder of another task. If the reasoner generalizes, it should adapt well to the other task's encoder. In the paper's tests, however, reasoners trained on only one task/domain typically generalized poorly, so the next step is to verify whether training on more tasks/domains yields better generalization.
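A minimal sketch of this cross-task probe (hypothetical helper; the paper's exact protocol may differ): freeze the reasoner trained on task A and stack it with a fresh encoder and head for task B, so only the new modules receive gradients.

```python
import torch.nn as nn

def transfer_reasoner(trained_reasoner: nn.Module,
                      new_encoder: nn.Module,
                      new_head: nn.Module) -> nn.Module:
    """Pair a frozen, already-trained reasoner with a new task's modules."""
    for p in trained_reasoner.parameters():
        p.requires_grad = False          # reasoner weights stay fixed
    return nn.Sequential(new_encoder, trained_reasoner, new_head)
```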
As shown in Figure 4, the overall problem becomes more challenging as data from more tasks and domains is added. However, the reasoner then concentrates on "pure" reasoning rather than task- or domain-specific solutions, which yields better generalization capabilities; in other words, the paper's "approximation principle" is reasonable. It can therefore be predicted that the reasoner will perform better on cross-domain tasks as the training data and tasks grow. In addition, a shared reasoner trained across tasks makes the entire visual reasoning framework lighter and more efficient.
Experiments
Entanglement vs. Disentanglement Analysis
To compare the four designs in Figure 2, models were trained on five datasets covering three task types: RAVEN, CVR, SVRT, Bongard-HOI, and Bongard-LOGO. To control variables and facilitate sharing, ResNet-18 is used as the encoder and an MLP as the reasoner for all designs.
As shown in Table 1, the shared-reasoner-only design performs comparably to the ad-hoc scheme on all five datasets, and even better on RAVEN and SVRT. In addition, the shared-encoder-only and both-shared schemes perform significantly worse on all datasets. This confirms the effectiveness of pairing task-specific symbolizers with a shared reasoner across multiple tasks.
Optimal Symbolization Depth
Next, the boundary between the two stages is located by probing the depth of the symbolic encoder, as shown in Figure 3; the shared reasoner here is an MLP. By observing changes in accuracy, the goal is to find a clear inflection point marking the end of the symbolization stage. To ensure fairness, a ResNet-18 encoder is used on the 2D datasets (RAVEN, CVR, SVRT, Bongard-LOGO, Bongard-HOI). For each benchmark, the model is first trained to optimal performance, and the trained network is then cut off at different depths to detect the symbolization termination point. The outputs of the separated symbolic encoders are connected to the shared reasoner, and the accuracy at each cut point is recorded as evidence of symbolization termination.
As shown in Figure 5 and Table 2, for each benchmark, accuracy first rises with network depth and then enters a plateau. The location of the inflection point varies from task to task with the difficulty and the level of symbolic abstraction required. For example, Bongard-HOI's inflection point is much deeper than RAVEN's, suggesting that the former is harder to symbolize and requires a deeper symbolization network to obtain complex high-dimensional features. These results validate the necessity of using symbolization networks of different depths on datasets of different complexity, and illustrate a reasonable boundary between the two stages.
One-for-All Reasoner Architecture
Next, suitable reasoner architectures are identified and tested in the shared-reasoner-only design. Nine cross-domain datasets and tasks are chosen (including RAVEN, CVR, SVRT, Bongard-HOI, Bongard-LOGO, Filtered-CoPhy, and VQAv2), because solving various reasoning problems on data from different domains better demonstrates a model's reasoning ability.
Task-specific encoders and heads are designed according to the requirements of each task. For the reasoner, CNN, MLP, Transformer, GCN, the hybrid neural-symbolic model, and MiniGPT-4 are tested. Each dataset is first trained individually with separated encoders and heads to obtain the best stand-alone results, and then multiple datasets are trained jointly with a shared reasoner.
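A minimal sketch of such joint training (hypothetical names, reusing the `TwoStageModel` sketch from the Preliminary section): cycle over per-dataset loaders so each step updates the shared reasoner plus that dataset's own encoder and head.

```python
import itertools

def joint_train(model, loaders: dict, optimizer, criterion, steps: int):
    """Round-robin joint training over several task-specific dataloaders."""
    iters = {t: itertools.cycle(dl) for t, dl in loaders.items()}
    task_cycle = itertools.cycle(list(loaders))
    for _ in range(steps):
        task = next(task_cycle)
        x, y = next(iters[task])             # one batch from this task
        loss = criterion(model(task, x), y)  # forward through E^i, R, H_i
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```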
As shown in Table 3, among all architectures the MLP unexpectedly performs best on four datasets and comparably on the other five. The GCN also performs well on three datasets, consistent with prior experience in reasoning work. However, architectures often considered more advanced, such as the Transformer, show no significant advantage. The MLP is therefore chosen as the lightweight reasoner in One-for-All.
One-for-All performs well on most tasks, even when compared with state-of-the-art (SOTA) ad-hoc methods. The paper classifies the SOTA methods by complexity into lightweight and heavyweight, as shown in Table 4. One-for-All performs comparably to the lightweight ad-hoc SOTA, and on some datasets (e.g., RAVEN) even surpasses it. This experiment shows that the reasoning stage relates parameters to performance differently than the recognition task does: a lightweight reasoner can also perform well on reasoning tasks if trained on multi-domain tasks.
Since reasoning ability cannot be measured by accuracy alone, the paper also assesses it via reasoning consistency. For each task, using the same encoder and reasoner parameters, two question-answering modes are used: "What is the answer to this question?" and "Is a particular option correct?". A model that truly reasons should produce consistent results across the two modes, unlike a random model, which may be inconsistent. The paper uses the F1 score to measure consistency between the modes, as shown in Table 5. When trained jointly on multiple datasets, One-for-All shows higher consistency than individually trained models, demonstrating its potential for genuine reasoning.
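One plausible way to compute such a consistency score (my reading of the setup, not necessarily the paper's exact protocol): for each (question, option) probe, the open-ended answer implies a yes/no label, which is compared against the direct yes/no answer with F1.

```python
from sklearn.metrics import f1_score

def consistency_f1(chosen: list, options: list, yes_no: list) -> float:
    """F1 between labels implied by 'what is the answer?' and the direct
    'is this option correct?' judgments, over (question, option) probes."""
    implied = [int(c == o) for c, o in zip(chosen, options)]
    return f1_score(implied, yes_no)

# Toy example: four probes; the last one is inconsistent.
print(consistency_f1(chosen=["B", "B", "A", "C"],
                     options=["B", "A", "A", "C"],
                     yes_no=[1, 0, 1, 0]))  # 0.8
```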
To further assess LLM performance, MiniGPT-4 is used as a shared reasoner. One-for-All also shows advantages at similar model sizes. Surprisingly, the lightweight One-for-All outperforms MiniGPT-4 on specific tasks such as RAVEN and Bongard-HOI, strong evidence that the number of model parameters and reasoning ability are not absolutely positively correlated.
To analyze the performance of LLM-based models, probe tasks are designed around the paper's two-stage framework, examining: (1) symbolization: whether the LLM-based model can identify the elements of the problem; (2) conceptualization: whether it can learn the specific concepts behind the task and reason about them; (3) answer generation: whether it can use the learned concepts to solve the problem. Figure 6 summarizes the typical responses of LLM-based models, represented by MiniGPT-4, to these three levels of questions on RAVEN and Bongard.
The paper finds that LLMs can hallucinate when solving visual reasoning tasks. As shown in Figure 6, on the RAVEN question MiniGPT-4 succeeds at the first stage, recognizing the objects, but fails at the second stage when reasoning about the alignment rule. On the Bongard question, MiniGPT-4 recognizes the human activity in the first stage and grasps the logic in the second, but fails at answer generation, getting lost when applying the rule to answer the question. These cases expose the weakness of LLM-based models on reasoning tasks: good conceptual understanding, but underperformance in logical reasoning and answer generation.
Approximation Principle Verification
Next, it is verified that training the reasoner on data from multiple domains gives it better generalization. Experiments are conducted on SVRT, Bongard-HOI, Filtered-CoPhy's Balls task, Filtered-CoPhy's Collision task, and VQAv2. These datasets cover 2D puzzles, 3D video, and VQA tasks, providing diverse multimodal data; Filtered-CoPhy's Collision task serves as the test benchmark.
The paper trains the reasoner on an increasing number of cross-domain datasets and pairs it with a separated encoder for the target test dataset. Given the inherent differences between the datasets, a highly lightweight MLP-based adapter is introduced before the reasoner. To equalize each dataset's contribution to the reasoner, the number of training samples per dataset is adjusted; specifically, sample sizes of 1,000 and 3,000 are used.
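A minimal sketch of the adapter and the subsampling (names and dimensions are illustrative assumptions):

```python
import torch.nn as nn
from torch.utils.data import Subset

def make_adapter(enc_dim: int, rsn_dim: int) -> nn.Module:
    """Lightweight MLP adapter mapping one encoder's output into the
    shared reasoner's input space."""
    return nn.Sequential(nn.Linear(enc_dim, rsn_dim), nn.ReLU())

def subsample(dataset, n: int) -> Subset:
    """Cap a dataset at n samples to equalize its contribution."""
    return Subset(dataset, list(range(min(n, len(dataset)))))

adapter = make_adapter(enc_dim=512, rsn_dim=128)  # e.g., ResNet-18 features
```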
As shown in Table 6, the reasoner progressively improves as the number of training datasets increases. Although handling more datasets from diverse domains significantly raises the difficulty, the trained reasoner performs better on the cross-domain Filtered-CoPhy task. This suggests that as the domains of the training data grow, the reasoner focuses on task-independent pure reasoning, validating the approximation principle.
Additional Ablation Study
Table 7 shows ablation experiments on whether to use a pre-trained model in the symbolic encoder, conducted on the RAVEN, CVR, and SVRT datasets with ImageNet as the pre-training dataset. The results are very close; a possible reason is the significant domain gap between ImageNet and these three reasoning datasets.
The paper also tests CLIP, a representative generalized large-scale foundation model, as a generalized symbolic encoder: CLIP serves as the visual encoder for the multimodal datasets, followed by an MLP as the reasoner and task-specific head networks. As shown in Table 8, even after fine-tuning, CLIP still falls short of the best One-for-All results. This confirms that even a large model like CLIP cannot handle symbolization for all the different datasets, supporting the rationale for the separated-encoder, shared-reasoner framework design.
If this article helped you, please give it a like or a "Looking" ~~
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].