The development of multimodal language models (MLMs) such as GPT-4V and the BLIP family [1,2] has enabled many multimodal applications, such as answering complex image-based queries; for example, "How many students are raising their hands in this image?". These models rely heavily on instruction data: datasets that pair visual content with corresponding questions and answers.
However, generating such data is challenging given the limitations of existing approaches. Manual data collection is expensive and time-consuming, so many approaches instead rely on costly proprietary models to generate instruction data; these models are not only computationally intensive but also prone to hallucinations, scalability constraints, and difficulties in ensuring interpretability and factual accuracy.
To address the challenges in generating multimodal instruction data, we developed ProVision, a scalable, programmatic framework that employs scene graphs and human-written programs to systematically synthesize vision-centric instruction data.
We represent each image as a scene graph, in which objects are nodes, attributes are attached to those nodes, and edges denote the relationships between objects. Using Python programs and textual templates, our data generators create questions and answers directly from the scene graph.
With these data generators, we can automatically synthesize question-answer pairs given an image's scene graph. For example, for an image of a busy street, ProVision can generate questions such as "What is the relationship between the pedestrian and the car?" or "Which object is closer to the red building, the car or the pedestrian?"
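To make the idea concrete, here is a minimal sketch of how a scene graph plus templated Python generators can yield question-answer pairs. The scene-graph schema and generator names below are our own illustrative assumptions, not the exact data structures used in the ProVision codebase:

```python
# Toy scene graph: objects as nodes (with attributes), relations as edges.
# Illustrative sketch only; ProVision's real schema may differ.
import itertools

scene_graph = {
    "objects": {
        "pedestrian": {"attributes": ["walking"]},
        "car": {"attributes": ["red", "parked"]},
        "building": {"attributes": ["red", "tall"]},
    },
    "relations": [
        ("pedestrian", "next to", "car"),
        ("car", "in front of", "building"),
    ],
}

def relation_qa_generator(graph):
    """Yield (question, answer) pairs from a textual template over relation edges."""
    for subj, rel, obj in graph["relations"]:
        question = f"What is the relationship between the {subj} and the {obj}?"
        answer = f"The {subj} is {rel} the {obj}."
        yield question, answer

def attribute_qa_generator(graph):
    """Yield (question, answer) pairs asking about each object's attributes."""
    for name, info in graph["objects"].items():
        question = f"What attributes does the {name} have?"
        answer = ", ".join(info["attributes"])
        yield question, answer

for q, a in itertools.chain(relation_qa_generator(scene_graph),
                            attribute_qa_generator(scene_graph)):
    print(q, "->", a)
```

Because every answer is read directly off the scene graph rather than sampled from a generative model, the resulting pairs are factually grounded in the graph by construction.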
Unlike traditional approaches, ProVision ensures interpretability, factual accuracy, and scalability in generating instruction data for MLMs. Moreover, users can add as many data generators as they wish to synthesize novel types of instruction data.
To synthesize instruction data for images without associated scene graphs, we use a scene graph generation pipeline composed of several state-of-the-art vision models. With this pipeline, we can generate instruction data for any image.
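The sketch below shows the overall structure such a pipeline might take. The stage functions are stand-in placeholders for the detection, attribute, and relation models the pipeline composes; the actual models and interfaces used in ProVision may differ:

```python
# Structural sketch of a scene-graph generation pipeline (placeholder stages).
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # name -> {"box": ..., "attributes": [...]}
    relations: list = field(default_factory=list)  # (subject, predicate, object) triples

def detect_objects(image):
    # Placeholder for an object detector; returns boxes keyed by object name.
    return {"car": {"box": (10, 20, 120, 80)}, "pedestrian": {"box": (130, 25, 170, 90)}}

def predict_attributes(image, objects):
    # Placeholder for an attribute classifier run on each detected region.
    return {"car": ["red", "parked"], "pedestrian": ["walking"]}

def predict_relations(image, objects):
    # Placeholder for a relation / spatial-reasoning model; depth estimation
    # could also be used here to derive "closer to"-style relations.
    return [("pedestrian", "next to", "car")]

def build_scene_graph(image) -> SceneGraph:
    objects = detect_objects(image)
    for name, attrs in predict_attributes(image, objects).items():
        objects[name]["attributes"] = attrs
    relations = predict_relations(image, objects)
    return SceneGraph(objects=objects, relations=relations)

if __name__ == "__main__":
    print(build_scene_graph(image=None))  # pass a real image in practice
```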
The current ProVision integrates a suite of 24 single-image and 14 multi-image instruction generators to create detailed question-answer pairs about objects, attributes, relations, and more. We use these data generators to synthesize over 10M instruction data points, which we release publicly as the ProVision-10M dataset.
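Conceptually, assembling the dataset amounts to running every registered generator over every image's scene graph. A hedged sketch follows; the record fields are illustrative rather than the actual ProVision-10M schema:

```python
# Illustrative dataset assembly loop: all generators over all scene graphs.
def build_dataset(scene_graphs, generators):
    """scene_graphs: {image_id: scene_graph}; generators: callables yielding (q, a)."""
    records = []
    for image_id, graph in scene_graphs.items():
        for generator in generators:
            for question, answer in generator(graph):
                records.append({
                    "image_id": image_id,
                    "question": question,
                    "answer": answer,
                    "generator": generator.__name__,  # provenance for analysis/filtering
                })
    return records
```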
ProVision-10M can enhance the performance of multimodal models during fine-tuning. We incorporate our synthesized single-image and multi-image instruction data into established MLM fine-tuning recipes: LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data.
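Before fine-tuning, the synthesized QA pairs need to be packed into the recipe's conversation format. Below is a hedged sketch using the commonly used LLaVA-style JSON layout; the field names are assumptions to verify against the exact recipe being reproduced:

```python
# Sketch: packing synthesized QA pairs into a LLaVA-style conversation record.
import json

def to_llava_record(image_id, image_path, qa_pairs):
    conversations = []
    for i, (question, answer) in enumerate(qa_pairs):
        # The <image> token is typically prepended only to the first human turn.
        prefix = "<image>\n" if i == 0 else ""
        conversations.append({"from": "human", "value": prefix + question})
        conversations.append({"from": "gpt", "value": answer})
    return {"id": image_id, "image": image_path, "conversations": conversations}

record = to_llava_record(
    "street_0001",
    "images/street_0001.jpg",
    [("What is the relationship between the pedestrian and the car?",
      "The pedestrian is next to the car.")],
)
print(json.dumps(record, indent=2))
```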
The average performance on 8 benchmarks is shown in the following figure. ProVision data built from both synthesized and manually annotated scene graphs improves average performance, with the manually annotated ProVision data yielding the largest improvement in both settings.
In addition, we found that adding ProVision data to both the pretraining and fine-tuning stages of xGen-MM-4B (BLIP-3) leads to an average improvement of 1.6% across 11 benchmarks, outperforming both the baseline trained without our data and variants that add it to only one of the two stages.
Having demonstrated the potential of programmatically synthesized instruction data for training multimodal language models, future work can further improve the system by adding data generators for new types of instruction data or by enhancing the scene graph generation pipeline to produce more accurate scene graphs.
Because the current data generators cover both single-image and multi-image instruction data, future work can also extend the pipeline to synthesize video instruction data and more.
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post. Connect with us on social media and our website to get regular updates on this and other research projects.
Full Author List: Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu.
[1] Li, Junnan, Dongxu Li, Silvio Savarese and Steven C. H. Hoi. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." International Conference on Machine Learning (2023).
[2] Xue, Le, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S. Ryoo, Shrikant B. Kendre, Jieyu Zhang, Can Qin, Shu Zhen Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong and Ran Xu. "xGen-MM (BLIP-3): A Family of Open Large Multimodal Models." (2024).
Jieyu Zhang is a Ph.D. student in Computer Science at the University of Washington, Seattle. Before that, they were an undergraduate student in Computer Science at the University of Illinois Urbana-Champaign. Their research interests are interactive and data-centric AI/ML, with an emphasis on faithful evaluation and effort-light approaches, with applications in natural language processing, computer vision, foundation models, science, and more.
Le Xue is an AI researcher working on multimodal foundation models such as multimodal LLMs and multimodal 3D foundation models. He leads AI research for the series of projects around xGen-MM (BLIP-3), a family of open large multimodal models.
Zeyuan Chen is a Senior Manager of Research at Salesforce AI Research, where he has been contributing since 2019. His work focuses on advancing computer vision, machine learning, multimodal AI, AI agents, and workflow automation through code generation and data visualization. He holds a Bachelor's degree from Huazhong University of Science and Technology, a Master's from Cornell University, and a Ph.D. from North Carolina State University, experiences that have shaped his journey in AI research.
Ran Xu received his Ph.D. in Computer Science from the University at Buffalo in 2015. Currently, he leads a group of exceptional computer vision and multimodal AI researchers at Salesforce to push the boundaries of research and productive AI for CRM.