The development of multimodal language models (MLMs) such as GPT-4V and the BLIP family [1,2] has enabled many multimodal applications, such as answering complex image-based queries; for example, "How many students are raising their hands in this image?". These models rely heavily on instruction data: datasets that pair visual content with corresponding questions and answers.
However, generating such data is challenging given the limitations of existing approaches. Manual data collection is expensive and time-consuming, so many approaches instead rely on costly proprietary models to generate instruction data; these models are not only computationally intensive but also prone to hallucinations, scalability constraints, and difficulties in ensuring interpretability and factual accuracy.
To address the challenges in generating multimodal instruction data, we developed ProVision, a scalable, programmatic framework that employs scene graphs and human-written programs to systematically synthesize vision-centric instruction data.
We represent each image as a scene graph, in which objects are nodes, attributes are attached to those nodes, and edges denote the relationships between objects. Using Python programs and textual templates, our data generators create questions and answers directly from the scene graph.
With these data generators, we can automatically synthesize question-answer pairs given an image's scene graph. For example, for an image of a busy street, ProVision can generate questions such as "What is the relationship between the pedestrian and the car?" or "Which object is closer to the red building, the car or the pedestrian?"
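To make the idea concrete, here is a minimal sketch of how a scene graph plus templated Python generators can yield question-answer pairs. The scene-graph schema and generator names below are our own illustrative assumptions, not the exact data structures used in the ProVision codebase:

```python
# Toy scene graph: objects as nodes (with attributes), relations as edges.
# Illustrative sketch only; ProVision's real schema may differ.
import itertools

scene_graph = {
    "objects": {
        "pedestrian": {"attributes": ["walking"]},
        "car": {"attributes": ["red", "parked"]},
        "building": {"attributes": ["red", "tall"]},
    },
    "relations": [
        ("pedestrian", "next to", "car"),
        ("car", "in front of", "building"),
    ],
}

def relation_qa_generator(graph):
    """Yield (question, answer) pairs from a textual template over relation edges."""
    for subj, rel, obj in graph["relations"]:
        question = f"What is the relationship between the {subj} and the {obj}?"
        answer = f"The {subj} is {rel} the {obj}."
        yield question, answer

def attribute_qa_generator(graph):
    """Yield (question, answer) pairs asking about each object's attributes."""
    for name, info in graph["objects"].items():
        question = f"What attributes does the {name} have?"
        answer = ", ".join(info["attributes"])
        yield question, answer

for q, a in itertools.chain(relation_qa_generator(scene_graph),
                            attribute_qa_generator(scene_graph)):
    print(q, "->", a)
```

Because every answer is read directly off the scene graph rather than sampled from a generative model, the resulting pairs are factually grounded in the graph by construction.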
Unlike traditional approaches, ProVision ensures interpretability, factual accuracy, and scalability in generating instruction data for MLMs. Moreover, users can add as many data generators as they wish to synthesize novel types of instruction data.
To synthesize instruction data for images without associated scene graphs, we use a scene graph generation pipeline composed of several state-of-the-art vision models. With this pipeline, we can generate instruction data for any image.
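The sketch below shows the overall structure such a pipeline might take. The stage functions are stand-in placeholders for the detection, attribute, and relation models the pipeline composes; the actual models and interfaces used in ProVision may differ:

```python
# Structural sketch of a scene-graph generation pipeline (placeholder stages).
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # name -> {"box": ..., "attributes": [...]}
    relations: list = field(default_factory=list)  # (subject, predicate, object) triples

def detect_objects(image):
    # Placeholder for an object detector; returns boxes keyed by object name.
    return {"car": {"box": (10, 20, 120, 80)}, "pedestrian": {"box": (130, 25, 170, 90)}}

def predict_attributes(image, objects):
    # Placeholder for an attribute classifier run on each detected region.
    return {"car": ["red", "parked"], "pedestrian": ["walking"]}

def predict_relations(image, objects):
    # Placeholder for a relation / spatial-reasoning model; depth estimation
    # could also be used here to derive "closer to"-style relations.
    return [("pedestrian", "next to", "car")]

def build_scene_graph(image) -> SceneGraph:
    objects = detect_objects(image)
    for name, attrs in predict_attributes(image, objects).items():
        objects[name]["attributes"] = attrs
    relations = predict_relations(image, objects)
    return SceneGraph(objects=objects, relations=relations)

if __name__ == "__main__":
    print(build_scene_graph(image=None))  # pass a real image in practice
```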
The current ProVision integrates a suite of 24 single-image and 14 multi-image instruction generators to create detailed question-answer pairs about objects, attributes, relations, and more. We use these data generators to synthesize over 10M instruction data points, which we release publicly as the ProVision-10M dataset.
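Conceptually, assembling the dataset amounts to running every registered generator over every image's scene graph. A hedged sketch follows; the record fields are illustrative rather than the actual ProVision-10M schema:

```python
# Illustrative dataset assembly loop: all generators over all scene graphs.
def build_dataset(scene_graphs, generators):
    """scene_graphs: {image_id: scene_graph}; generators: callables yielding (q, a)."""
    records = []
    for image_id, graph in scene_graphs.items():
        for generator in generators:
            for question, answer in generator(graph):
                records.append({
                    "image_id": image_id,
                    "question": question,
                    "answer": answer,
                    "generator": generator.__name__,  # provenance for analysis/filtering
                })
    return records
```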
ProVision-10M can enhance the performance of multimodal models during fine-tuning. We incorporate our synthesized single-image and multi-image instruction data into established MLM fine-tuning recipes: LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data.
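Before fine-tuning, the synthesized QA pairs need to be packed into the recipe's conversation format. Below is a hedged sketch using the commonly used LLaVA-style JSON layout; the field names are assumptions to verify against the exact recipe being reproduced:

```python
# Sketch: packing synthesized QA pairs into a LLaVA-style conversation record.
import json

def to_llava_record(image_id, image_path, qa_pairs):
    conversations = []
    for i, (question, answer) in enumerate(qa_pairs):
        # The <image> token is typically prepended only to the first human turn.
        prefix = "<image>\n" if i == 0 else ""
        conversations.append({"from": "human", "value": prefix + question})
        conversations.append({"from": "gpt", "value": answer})
    return {"id": image_id, "image": image_path, "conversations": conversations}

record = to_llava_record(
    "street_0001",
    "images/street_0001.jpg",
    [("What is the relationship between the pedestrian and the car?",
      "The pedestrian is next to the car.")],
)
print(json.dumps(record, indent=2))
```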
The average performance on 8 benchmarks is shown in the following figure. ProVision data built from both synthesized and manually annotated scene graphs improves average performance, with the manually annotated ProVision data yielding the largest improvement in both settings.
In addition, we found that adding ProVision data to both the pretraining and fine-tuning stages of xGen-MM-4B (BLIP-3) leads to an average improvement of 1.6% across 11 benchmarks, outperforming both the baseline trained without our data and variants that add it to only one of the two stages.
Having demonstrated the potential of programmatically synthesized instruction data for training multimodal language models, future work can further improve the system by adding data generators for new types of instruction data or by enhancing the scene graph generation pipeline to produce more accurate scene graphs.
Because the current data generators cover both single-image and multi-image instruction data, future work can also extend the pipeline to synthesize video instruction data and more.
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post. Connect with us on social media and our website to get regular updates on this and other research projects.
Full Author List: Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu.
[1] Li, Junnan, Dongxu Li, Silvio Savarese and Steven C. H. Hoi. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." International Conference on Machine Learning (2023).
[2] Xue, Le, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S. Ryoo, Shrikant B. Kendre, Jieyu Zhang, Can Qin, Shu Zhen Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong and Ran Xu. "xGen-MM (BLIP-3): A Family of Open Large Multimodal Models." (2024).
Jieyu Zhang is a Ph.D. student in Computer Science at the University of Washington, Seattle. Before that, they were an undergraduate student in Computer Science at the University of Illinois Urbana-Champaign. Their research interests are interactive and data-centric AI/ML, with an emphasis on faithful evaluation and effort-light approaches, with applications in natural language processing, computer vision, foundation models, science, and more.
Le Xue is an AI researcher working on multimodal foundation models such as multimodal LLMs and multimodal 3D foundation models. He leads AI research for the series of projects around xGen-MM (BLIP-3), a family of open large multimodal models.
Zeyuan Chen is a Senior Manager of Research at Salesforce AI Research, where he has been contributing since 2019. His work focuses on advancing computer vision, machine learning, multimodal AI, AI agents, and workflow automation through code generation and data visualization. He holds a Bachelor's degree from Huazhong University of Science and Technology, a Master's from Cornell University, and a Ph.D. from North Carolina State University, experiences that have shaped his journey in AI research.
Ran Xu received his Ph.D. in Computer Science from the University at Buffalo in 2015. Currently, he leads a group of exceptional computer vision and multimodal AI researchers at Salesforce to push the boundaries of research and productive AI for CRM.