01/16/2025 | Press release | Distributed by Public on 01/16/2025 13:16
Imagine taking a photo of a gas station sign displaying various fuel prices and asking an AI system, "How much gas can I buy with $50?" It might sound simple, but for the system, it's a complex task. It has to identify where the prices are on the sign, extract the numbers using text recognition, and then perform the calculations to provide an answer. Real-world problems are often like this: they involve different types of information across multiple modalities and demand multi-step solutions.
However, today's open-source multi-modal models struggle to solve realistic complex problems in a step-by-step manner. This limitation stems from their training, which heavily emphasizes straightforward, single-step problems with brief, direct answers. It's like trying to teach someone to cook by only showing them how to make toast; it doesn't prepare them for more complicated recipes.
Worse, most open-source models struggle to articulate their problem-solving process even when prompted to do so. As a result, when these models make mistakes, it's often difficult to determine which part of the process went wrong. For example, for the question above, was it the text recognition? The reasoning? Or the calculation?
To address these challenges, we present TACO, a family of multimodal large action models designed to improve performance on complex questions that require multiple capabilities and demand multi-step solutions.
To answer such questions, TACO produces chains-of-thought-and-action (CoTA), executes the intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, and then integrates both the thoughts and the action outputs to produce a coherent response (Figure 1).
Figure 1. Example outputs of TACO vs. other multimodal large language models.
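To make the thought-action-observation loop concrete, here is a minimal Python sketch of the gas station example. It is an illustration under stated assumptions, not TACO's implementation: the `Step` structure, the `stub_ocr` and `calculator` tools, the `scripted_policy` that stands in for the model, and the price of 3.50 are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str   # the model's reasoning for this step
    action: str    # a tool name, or "terminate" to give the final answer
    argument: str  # the tool input, or the final answer text


def stub_ocr(image_id: str) -> str:
    # Hypothetical stand-in for a real OCR tool: returns canned text.
    return {"gas_sign.jpg": "Regular 3.50"}.get(image_id, "")


OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    # Tiny calculator for expressions of the form "a op b".
    left, op, right = expression.split()
    return str(round(OPS[op](float(left), float(right)), 2))


TOOLS: Dict[str, Callable[[str], str]] = {"ocr": stub_ocr,
                                          "calculator": calculator}

History = List[Tuple[Step, str]]  # (step, observation) pairs so far


def scripted_policy(history: History) -> Step:
    # Stands in for the model: picks the next step from the history.
    if not history:
        return Step("I need to read the price on the sign.",
                    "ocr", "gas_sign.jpg")
    if len(history) == 1:
        price = history[0][1].split()[-1]  # "Regular 3.50" -> "3.50"
        return Step("Divide the budget by the price per gallon.",
                    "calculator", f"50 / {price}")
    return Step("I have the answer.", "terminate",
                f"About {history[-1][1]} gallons")


def run_chain(policy: Callable[[History], Step],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> Tuple[str, History]:
    history: History = []
    for _ in range(max_steps):
        step = policy(history)
        if step.action == "terminate":
            return step.argument, history
        observation = tools[step.action](step.argument)  # execute the action
        history.append((step, observation))
    raise RuntimeError("no final answer within max_steps")


answer, trace = run_chain(scripted_policy, TOOLS)
print(answer)
```

Because every thought, action, and observation is recorded in `trace`, a wrong answer can be traced back to the step that caused it, whether that was the text recognition, the reasoning, or the calculation.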
To enable TACO to generate chains-of-thought-and-action at inference time, we generate synthetic CoTA data and fine-tune open-source multimodal language models on it (Figure 2).
Figure 2. An overview of the TACO training and inference pipeline
To train TACO, we create a large dataset of 1M+ synthetic CoTA traces generated with a multimodal large language model (e.g., GPT-4o) and Python programs.
Figure 2. Model-based (top) and programmatic (bottom) data generation pipelines.
In model-based generation, we take existing image and QA pairs from instruction tuning datasets as inputs and prompt a multimodal large language model (e.g., GPT-4o) to generate either a chain-of-thought-and-action (CoTA) or a chain-of-thought (CoT) without actions to answer each question. We then verify that each chain parses successfully and leads to the correct final answer; if it does not, we convert the example into the direct answer (Direct) format using the ground-truth answer.
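The verify-or-fall-back rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual pipeline: the JSON step format, the `terminate` action name, and the case-insensitive exact-match check are all assumptions made for the example.

```python
import json
from typing import Optional


def parse_chain(raw: str) -> Optional[list]:
    # Return the steps if `raw` is a non-empty JSON list of dicts that
    # each name an action; otherwise None (the chain failed to parse).
    try:
        steps = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(steps, list) or not steps:
        return None
    for step in steps:
        if not isinstance(step, dict) or "action" not in step:
            return None
    return steps


def final_answer(steps: list) -> Optional[str]:
    # The chain must end with a (hypothetical) "terminate" action.
    last = steps[-1]
    if last["action"] == "terminate":
        return str(last.get("answer", ""))
    return None


def chain_format(steps: list) -> str:
    # A chain with at least one tool call is CoTA; thoughts only is CoT.
    uses_tool = any(step["action"] != "terminate" for step in steps)
    return "CoTA" if uses_tool else "CoT"


def to_training_example(question: str, raw_chain: str,
                        ground_truth: str) -> dict:
    steps = parse_chain(raw_chain)
    if steps is not None:
        answer = final_answer(steps)
        # Assumed correctness check: case-insensitive exact match.
        if (answer is not None
                and answer.strip().lower() == ground_truth.strip().lower()):
            return {"format": chain_format(steps),
                    "question": question, "target": steps}
    # Fallback: keep the question, but train on the direct answer only.
    return {"format": "Direct", "question": question,
            "target": ground_truth}
```

Falling back to the Direct format instead of discarding the example keeps every question in the training set while ensuring that no incorrect or unparseable chain is used as a target.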
In programmatic generation, we first gather image annotations with human annotators or models, and then use the dense annotations to fill in manually written templates and generate QA and the corresponding CoTA with Python programs.
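As a rough illustration of template-based generation, the Python sketch below fills one hand-written question template from an image annotation and emits the matching chain. The annotation schema (`objects`, `name`, `depth_m`), the template wording, and the `estimate_depth` and `terminate` action names are hypothetical and chosen only for this example.

```python
from typing import Optional

# One manually written template; a real pipeline would use many.
QUESTION_TEMPLATE = "Which is closer to the camera, the {a} or the {b}?"


def generate_example(annotation: dict) -> Optional[dict]:
    # Build a QA pair and its CoTA from a two-object depth annotation.
    a, b = annotation["objects"]
    if a["depth_m"] == b["depth_m"]:
        return None  # ambiguous annotation: skip rather than guess
    closer = a if a["depth_m"] < b["depth_m"] else b
    question = QUESTION_TEMPLATE.format(a=a["name"], b=b["name"])
    chain = [
        {"thought": f"I need the depth of the {a['name']}.",
         "action": "estimate_depth", "argument": a["name"],
         "observation": a["depth_m"]},
        {"thought": f"I need the depth of the {b['name']}.",
         "action": "estimate_depth", "argument": b["name"],
         "observation": b["depth_m"]},
        {"thought": f"The {closer['name']} has the smaller depth, "
                    "so it is closer.",
         "action": "terminate", "answer": closer["name"]},
    ]
    return {"question": question, "cota": chain,
            "answer": closer["name"]}


example = generate_example({
    "objects": [{"name": "dog", "depth_m": 2.0},
                {"name": "tree", "depth_m": 5.5}],
})
print(example["question"])
print(example["answer"])
```

Because the answer and every observation come directly from the annotation, chains produced this way are correct by construction and need no separate verification step.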
We show that fine-tuning with CoTA data enables multimodal language models to reason and take actions on complex visual tasks, significantly boosting their average performance across 8 benchmarks. The increase in accuracy is 30-50% compared to few-shot prompting in the CoTA format, and 3-4% compared to fine-tuning with direct answers (Figure 3).
Figure 3. Models' average performance on all 8 benchmarks when prompted or fine-tuned with Direct answer or CoTA format
What's more, TACO consistently beats baseline models that are instruction-tuned with only direct answers, by significant margins of up to 20% on MMVet, regardless of model backbone and starting checkpoint (Figure 4).
Figure 4. Models' accuracy on MMVet after fine-tuning with Direct answers only vs. CoTA data
In conclusion, we propose a new framework to solve complex visual tasks with multimodal action models and introduce a new family of multimodal action models named TACO. We train TACOs with large-scale synthetic Chain-of-Thought-and-Action data. We demonstrate both quantitatively and qualitatively that TACO achieves up to 4% gains on average across 8 benchmarks compared to instruction-tuned baselines, and up to 20% on the challenging MMVet benchmark.
With our framework, future work can train new models with different actions for other applications, such as web navigation, or for other domains, such as medical question answering. We also encourage future work to further improve CoTA data, where the diversity and quality of both thoughts and actions are important.
Full Author List: Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese
Zixian Ma is a Ph.D. student in Computer Science at the University of Washington, Seattle. Before that, she was an undergraduate student in Computer Science at Stanford University. Zixian Ma's research interests are multi-modal models and human-AI interaction.
Zhiwei Liu is a Senior Research Scientist leading multi-agent system design and agent reasoning optimization. He also works on large action model fine-tuning, multi-modal action model training, personalization agent design, and recommender systems.
Jianguo Zhang is a Senior Research Scientist leading large action model (e.g., xLAM) development and large-scale data processing. He also collaborates on agent system design and multi-modal model development.
Juntao Tan is a research scientist leading work on personalized LLMs and personalized agents. He has also collaborated on research involving Large Action Model data collection, multi-agent system design, on-device models, and on-device deployment of multi-modal models. Before joining the company in 2024, he earned a bachelor's degree from Huazhong University of Science and Technology and a Ph.D. from Rutgers University.
Shelby is a Senior AI Research Manager, leading a dynamic team that pushes the boundaries of AI innovation. With a focus on AI agents, on-device AI, efficient AI, small language models, and LLMs, Shelby drives impactful advancements at the intersection of research and product development.
Juan Carlos Niebles earned a degree in Electronics Engineering in 2002 from Universidad del Norte, Colombia. He later received a Master of Science degree in Electrical and Computer Engineering in 2007 from the University of Illinois at Urbana-Champaign, and a Ph.D. in Electrical Engineering from Princeton University in 2011. Since 2021, Juan Carlos has been a Research Director at Salesforce and an Adjunct Professor of Computer Science at Stanford. He is co-Director of the Stanford Vision and Learning Lab and previously served as Associate Director of Research at the Stanford-Toyota Center for AI Research. He was also a Senior Research Scientist at the Stanford AI Lab from 2015 to 2021.