Instruction-Guided Visual Masking

Jinliang Zheng*1,2, Jianxiong Li*1, Sijie Cheng1, Yinan Zheng1, Jiaming Li1, Jihao Liu3,2, Yu Liu2, Jingjing Liu1, Xianyuan Zhan1,4

1AIR, Tsinghua University 2SenseTime Research 3CUHK MMLab 4Shanghai AI Lab

*Equal contribution
†Project Lead: zhengjl23@mails.tsinghua.edu.cn
✉Corresponding author: zhanxianyuan@air.tsinghua.edu.cn
Exciting News!
Our paper has been accepted by NeurIPS 2024.
Our paper has been selected as an outstanding paper at the MFM-EAI workshop at ICML 2024.

Abstract

Instruction following is crucial for contemporary LLMs. However, when extended to multimodal settings, it often suffers from misalignment between specific textual instructions and the targeted local regions of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks.
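As a rough illustration of the plug-and-play usage described above, the following Python sketch shows how a predicted instruction-guided mask could be applied to an image before it is handed to a downstream LMM. The mask format, the dimming heuristic, and the commented-out ivm_model / lmm calls are assumptions for illustration, not the released IVM API.

    # Minimal sketch (assumed interface): overlay an instruction-guided visual mask
    # on an image as a plug-and-play preprocessing step for a multimodal model.
    import numpy as np
    from PIL import Image

    def apply_ivm_mask(image: Image.Image, mask: np.ndarray, dim: float = 0.8) -> Image.Image:
        # `mask` is assumed to be an HxW array in [0, 1], where 1 marks regions
        # relevant to the instruction, as produced by an IVM-style grounding model.
        rgb = np.asarray(image.convert("RGB"), dtype=np.float32)
        weight = dim * mask[..., None] + (1.0 - dim)      # irrelevant pixels get low weight
        masked = rgb * weight + 127.0 * (1.0 - weight)    # blend irrelevant regions toward gray
        return Image.fromarray(masked.astype(np.uint8))

    # Hypothetical usage; `ivm_model` and `lmm` stand in for actual checkpoints:
    # mask = ivm_model.predict_mask(image, instruction)
    # answer = lmm.generate(apply_ivm_mask(image, mask), instruction)

DWSL is likewise only named above, so the next snippet is a generic sketch of what a discriminator-weighted supervised objective could look like: per-sample mask losses are reweighted by a discriminator's quality score so that high-quality annotations dominate training. The exact objective used in the paper may differ.

    # Illustrative sketch of a discriminator-weighted supervised loss (not the
    # paper's exact DWSL formulation): weight per-sample losses by quality scores.
    import torch
    import torch.nn.functional as F

    def dwsl_loss(pred_logits, target_masks, quality_scores, eps=1e-6):
        # pred_logits, target_masks: (B, H, W); quality_scores: (B,) in [0, 1],
        # assumed to come from a discriminator that rates annotation quality.
        per_sample = F.binary_cross_entropy_with_logits(
            pred_logits, target_masks.float(), reduction="none"
        ).mean(dim=(1, 2))                                      # supervised loss per sample
        weights = quality_scores / (quality_scores.sum() + eps) # normalize over the batch
        return (weights * per_sample).sum()                     # high-quality samples weigh more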

Results on Downstream Control Tasks

The IVM model proves valuable in vision-language robotic manipulation tasks, where data collection is notoriously challenging and generalization is a major concern. With IVM integrated, our enhanced robot model exhibits improved performance and better generalization.

Results on Real Robots

Figure 1: Real robot LCBC experimental results. Success rates are averaged over 10 episodes and 3 seeds. Tasks: red cup on red plate, duck on green plate, red cup on silver plate, duck in pot.

Results on VQA-type benchmarks

V* Bench and Visualization Results

We evaluate IVM-enhanced GPT-4V on V* Bench, a recently proposed, challenging VQA-type benchmark characterized by images with abundant redundant content. Results are presented in Table 1. Vanilla GPT-4V achieves only mediocre accuracy (55.0%), whereas our IVM model significantly improves performance (+26.2%). Beyond the reported scores, we also provide additional visualization results.


BibTeX


@article{zheng2024instruction,
  title={Instruction-Guided Visual Masking},
  author={Zheng, Jinliang and Li, Jianxiong and Cheng, Sijie and Zheng, Yinan and Li, Jiaming and Liu, Jihao and Liu, Yu and Liu, Jingjing and Zhan, Xianyuan},
  journal={arXiv preprint arXiv:2405.19783},
  year={2024}
}