Instruction-Guided Visual Masking

Jinliang Zheng*1,2, Jianxiong Li*1, Sijie Cheng1, Yinan Zheng1, Jiaming Li1, Jihao Liu3,2,
Yu Liu2, Jingjing Liu1, Xianyuan Zhan1,4

1AIR, Tsinghua University 2SenseTime Research 3CUHK MMLab 4Shanghai AI Lab

*Equal contribution
†Project Lead: zhengjl23@mails.tsinghua.edu.cn
✉Corresponding author: zhanxianyuan@air.tsinghua.edu.cn
Exciting News!
Our paper has been selected as an outstanding paper at the MFM-EAI Workshop @ ICML 2024.

Abstract

Instruction following is crucial in contemporary LLMs. However, when extended to multimodal settings, it often suffers from misalignment between a specific textual instruction and the targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks.
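To make the masking idea concrete, the following is a minimal sketch of how a per-pixel relevance map could be used to suppress instruction-irrelevant regions before an image is passed to a downstream model. It is an illustration only, not the IVM implementation: the function name `apply_instruction_mask`, the `background` and `floor` parameters, and the hand-written relevance map are all assumptions introduced here. In practice the relevance map would come from a grounding model conditioned on the instruction.

```python
import numpy as np


def apply_instruction_mask(image, relevance, background=0.0, floor=0.0):
    """Blend instruction-irrelevant pixels toward a background value.

    image:      H x W x 3 float array with values in [0, 1]
    relevance:  H x W float array in [0, 1]; 1 means "relevant to the instruction"
    background: value used for fully masked pixels (hypothetical parameter)
    floor:      minimum weight kept for every pixel (hypothetical parameter)
    """
    image = np.asarray(image, dtype=np.float32)
    relevance = np.asarray(relevance, dtype=np.float32)
    if image.shape[:2] != relevance.shape:
        raise ValueError("relevance map must match the image height and width")
    # Clip to [floor, 1] and add a channel axis so the map broadcasts over RGB.
    weight = np.clip(relevance, floor, 1.0)[..., None]
    return weight * image + (1.0 - weight) * background


# Toy example: a uniform gray image whose central 2 x 2 patch is marked relevant.
image = np.full((4, 4, 3), 0.8, dtype=np.float32)
relevance = np.zeros((4, 4), dtype=np.float32)
relevance[1:3, 1:3] = 1.0  # hand-written stand-in for a predicted mask

masked = apply_instruction_mask(image, relevance)
```

In this toy case the central patch keeps its original values while every other pixel is replaced by the background value. Setting `floor` above zero would dim irrelevant regions instead of removing them entirely, which is one possible design choice when some surrounding context should stay visible.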

Downstream Control Tasks Results

The IVM model proves valuable in vision-language robotic manipulation tasks, where data collection is notoriously challenging and generalization is a major concern. With IVM integrated, our enhanced robot model achieves higher performance and generalizes better.

Results on Real Robots

Figure 1: Real robot LCBC experimental results. Success rate is averaged over 10 episodes and 3 seeds.

Red cup on red plate

Duck on green plate

Red cup on silver plate

Duck in pot
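The averaging described in the Figure 1 caption (10 episodes per seed, 3 seeds) can be sketched as follows. This is a hedged illustration: the function name `success_rate` and the outcome lists are hypothetical, and the numbers below are made up for the example and are not results from the paper.

```python
from statistics import mean


def success_rate(outcomes_by_seed):
    """Average per-seed success rates.

    outcomes_by_seed: one list per seed, each holding one boolean per episode.
    """
    per_seed = [
        mean(1.0 if ok else 0.0 for ok in episodes)
        for episodes in outcomes_by_seed
    ]
    return mean(per_seed)


# Made-up outcomes for 3 seeds x 10 episodes (not data from the paper).
outcomes = [
    [True] * 8 + [False] * 2,
    [True] * 7 + [False] * 3,
    [True] * 9 + [False] * 1,
]

rate = success_rate(outcomes)
```

Because every seed runs the same number of episodes, averaging the per-seed rates gives the same value as pooling all 30 episodes; keeping the per-seed rates separate also makes it easy to report variation across seeds.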

Results on VQA-type benchmarks

V* bench and visualization results

We evaluate IVM-enhanced GPT4-V on V* bench, a recently proposed, challenging VQA-type benchmark characterized by images with abundant redundancies. Results are presented in Table 1. The accuracy of vanilla GPT4-V is mediocre (55.0%). With our IVM model, however, performance improves significantly (+26.2%). Beyond the reported scores, we provide additional visualization results.

BibTeX


        @misc{zheng2024instructionguided,
            title={Instruction-Guided Visual Masking}, 
            author={Jinliang Zheng and Jianxiong Li and Sijie Cheng and Yinan Zheng and Jiaming Li and Jihao Liu and Yu Liu and Jingjing Liu and Xianyuan Zhan},
            year={2024},
            eprint={2405.19783},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
        }