
Gemini 3 Pro: the frontier of vision AI


Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning. It is our most capable multimodal model ever, delivering state-of-the-art performance across document, spatial, screen and video understanding.

This model sets new highs on vision benchmarks such as MMMU Pro and Video MMMU for complex visual reasoning, as well as use-case-specific benchmarks across document, spatial, screen and long video understanding.

1. Document understanding

Real-world documents are messy, unstructured, and difficult to parse - often filled with interleaved images, illegible handwritten text, nested tables, complex mathematical notation and non-linear layouts. Gemini 3 Pro represents a major leap forward in this domain, excelling across the entire document processing pipeline - from highly accurate Optical Character Recognition (OCR) to complex visual reasoning.

Intelligent perception

To truly understand a document, a model must accurately detect and recognize text, tables, math formulas, figures and charts regardless of noise or format.

A fundamental capability is "derendering" - the ability to reverse-engineer a visual document back into structured code (HTML, LaTeX, Markdown) that would recreate it. As illustrated below, Gemini 3 demonstrates accurate perception across diverse modalities, from converting an 18th-century merchant log into a complex table to transforming a raw image of mathematical annotations into precise LaTeX code.

Example 1: Handwritten Complex Table from 18th century Albany Merchant's Handbook (HTML transcription)

Example 2: Reconstructing equations from an image

Example 3: Reconstructing Florence Nightingale's original Polar Area Diagram into an interactive chart (with a toggle!)
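
For developers who want to try a derendering request like Example 1 themselves, here is a minimal sketch using the google-genai Python SDK. The model identifier, file name and prompt wording are assumptions for illustration, not details taken from the examples above.

```python
# Minimal sketch: "derendering" a scanned document image into HTML.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

with open("merchant_log.png", "rb") as f:  # hypothetical scanned page
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model identifier
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Transcribe this scanned table into clean, semantic HTML. "
        "Preserve the row and column structure and mark illegible cells.",
    ],
)
print(response.text)  # HTML intended to recreate the original layout
```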

Sophisticated reasoning

Users can rely on Gemini 3 to perform complex, multi-step reasoning across tables and charts - even in long reports. In fact, the model notably outperforms the human baseline on the CharXiv Reasoning benchmark (80.5%).

To illustrate this, imagine a user analyzing the 62-page U.S. Census Bureau "Income in the United States: 2022" report with the following prompt: "Compare the 2021-2022 percent change in the Gini index for "Money Income" versus "Post-Tax Income", and what caused the divergence in the post-tax measure, and in terms of "Money Income", does it show the lowest quintile's share rising or falling?"

Swipe through the images below to see the model's step-by-step reasoning.

Visual Extraction: To answer the Gini index comparison, Gemini 3 located and cross-referenced Figure 3, which shows that "Money Income decreased by 1.2 percent," and Table B-3, which shows that "Post-Tax Income increased by 3.2 percent."

Causal Logic: Crucially, Gemini 3 does not stop at the numbers; it correlates this gap with the report's policy analysis, correctly identifying the lapse of ARPA policies and the end of stimulus payments as the main causes.

Numerical Comparison: To determine whether the lowest quintile's share was rising or falling, Gemini 3 looked at Table A-3, compared the values 2.9 and 3.0, and concluded that "the share of aggregate household income held by the lowest quintile was rising."

Final Model Answer
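
To reproduce this kind of long-report analysis, a developer could upload the PDF through the Files API and ask the same multi-part question. The sketch below uses the google-genai Python SDK; the model identifier, file name and upload keyword are assumptions to verify against the current documentation.

```python
# Minimal sketch: asking a multi-step question about a long PDF report.
from google import genai

client = genai.Client()

# Upload the 62-page report once; the file name here is a placeholder.
report = client.files.upload(file="income-in-the-united-states-2022.pdf")

prompt = (
    'Compare the 2021-2022 percent change in the Gini index for "Money Income" '
    'versus "Post-Tax Income". What caused the divergence in the post-tax '
    'measure, and does the lowest quintile\'s share of "Money Income" rise or fall?'
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model identifier
    contents=[report, prompt],
)
print(response.text)
```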

2. Spatial understanding

Gemini 3 Pro is our strongest spatial understanding model so far. Combined with its strong reasoning, this enables the model to make sense of the physical world.

  • Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or tracking trajectories over time (see the sketch after this list).
  • Open vocabulary references: Gemini 3 identifies objects and their intent using an open vocabulary. The most direct application is robotics: the user can ask a robot to generate spatially grounded plans like, "Given this messy table, come up with a plan on how to sort the trash." This also extends to AR/XR devices, where the user can request an AI assistant to "Point to the screw according to the user manual."
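
Here is a rough sketch of what a pointing request might look like with the google-genai Python SDK. The model identifier, prompt wording and the JSON point schema (labels plus [y, x] coordinates normalized to a 0-1000 grid, following Google's published spatial-understanding examples) are assumptions rather than guarantees.

```python
# Minimal sketch: asking the model to point at objects and parsing the points.
import json

from google import genai
from google.genai import types

client = genai.Client()

with open("messy_table.jpg", "rb") as f:  # hypothetical input image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model identifier
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Point to every piece of trash on the table. Respond only with JSON: "
        'a list of {"label": str, "point": [y, x]} with coordinates '
        "normalized to a 0-1000 grid.",
    ],
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)

for item in json.loads(response.text):
    y, x = item["point"]
    print(f'{item["label"]}: normalized (y={y}, x={x})')
```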

3. Screen understanding

Gemini 3 Pro's spatial understanding shines when applied to desktop and mobile OS screens. This reliability helps make computer use agents robust enough to automate repetitive tasks, and UI understanding capabilities can also enable tasks like QA testing, user onboarding and UX analytics. The following computer use demo shows the model perceiving and clicking with high precision.

4. Video understanding

Gemini 3 Pro takes a massive leap forward in how AI understands video, the most complex data format we interact with. It is dense, dynamic, multimodal and rich with context.

1. High frame rate understanding: We have optimized the model to be much stronger at understanding fast-paced actions when sampling at more than one frame per second. Gemini 3 Pro can capture rapid details - vital for tasks like analyzing golf swing mechanics.

By processing video at 10 FPS (10x the default sampling rate), Gemini 3 Pro catches every swing and shift in weight, unlocking deep insights into player mechanics. A sketch of how to request denser sampling appears after this list.

2. Video reasoning with "thinking" mode: We upgraded "thinking" mode to go beyond object recognition toward true video reasoning. The model can now better trace complex cause-and-effect relationships over time. Instead of just identifying what is happening, it understands why it is happening.

3. Turning long videos into action: Gemini 3 Pro bridges the gap between video and code. It can extract knowledge from long-form content and immediately translate it into functioning apps or structured code.
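
The sketch below shows one way a developer might request denser frame sampling through the google-genai Python SDK. The model identifier, file name and the assumption that the SDK's VideoMetadata fps field is honored for this model are all placeholders to check against the current documentation.

```python
# Minimal sketch: analyzing a fast-paced clip with denser frame sampling.
import time

from google import genai
from google.genai import types

client = genai.Client()

# Upload the clip via the Files API; the file name is a placeholder.
video = client.files.upload(file="golf_swing.mp4")
while video.state.name == "PROCESSING":  # wait until the upload is processed
    time.sleep(2)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model identifier
    contents=[
        types.Part(
            file_data=types.FileData(file_uri=video.uri, mime_type="video/mp4"),
            # Assumed knob: sample at 10 FPS instead of the default 1 FPS.
            video_metadata=types.VideoMetadata(fps=10),
        ),
        "Break down the swing mechanics: weight shift, hip rotation and "
        "club path, with timestamps.",
    ],
)
print(response.text)
```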

5. Real-world applications

Here are a few ways we think various fields will benefit from Gemini 3's capabilities.

Education

Gemini 3 Pro's enhanced vision capabilities drive significant gains in education, particularly for the diagram-heavy questions central to math and science. It successfully tackles the full spectrum of multimodal reasoning problems found in curricula from middle school through post-secondary, including visual reasoning puzzles (like Math Kangaroo) and complex chemistry and physics diagrams.

Gemini 3's visual intelligence also powers the generative capabilities of Nano Banana Pro. By combining advanced reasoning with precise generation, the model, for example, can help users identify exactly where they went wrong in a homework problem.

Prompt: "Here is a photo of my homework attempt. Please check my steps and tell me where I went wrong. Instead of explaining in text, show me visually on my image." (Note: Student work is shown in blue; model corrections are shown in red). [See prompt in Google AI Studio]

Medical and biomedical imaging

Gemini 3 Pro stands as our most capable general model for medical and biomedical imagery understanding, achieving state-of-the-art performance across major public benchmarks: MedXpertQA-MM (a difficult expert-level medical reasoning exam), VQA-RAD (radiology imagery Q&A) and MicroVQA (a multimodal reasoning benchmark for microscopy-based biological research).

Input image from MicroVQA - a benchmark for microscopy-based biological research

Law and finance

Gemini 3 Pro's enhanced document understanding helps professionals in finance and law tackle highly complex workflows. Finance platforms can seamlessly analyze dense reports filled with charts and tables, while legal platforms benefit from the model's sophisticated document reasoning.

6. Media resolution control

Gemini 3 Pro improves the way it processes visual inputs by preserving the native aspect ratio of images. This drives significant quality improvements across the board.

Additionally, developers gain granular control over performance and cost via the new media_resolution parameter. This allows you to tune visual token usage to balance fidelity against consumption:

  • High resolution: Maximizes fidelity for tasks requiring fine detail, such as dense OCR or complex document understanding.
  • Low resolution: Optimizes for cost and latency on simpler tasks, such as general scene recognition or long-context tasks.
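
A minimal sketch of how this tuning might look with the google-genai Python SDK follows; the model identifier, file name and exact enum spelling are assumptions to verify against the documentation referenced below.

```python
# Minimal sketch: trading visual fidelity for token cost via media_resolution.
from google import genai
from google.genai import types

client = genai.Client()

with open("dense_contract.png", "rb") as f:  # hypothetical document scan
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model identifier
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract every clause heading from this page.",
    ],
    config=types.GenerateContentConfig(
        # Assumed enum name; MEDIA_RESOLUTION_LOW would favor cost and latency.
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
    ),
)
print(response.text)
```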

For specific recommendations, refer to our Gemini 3.0 Documentation Guide.

Build with Gemini 3 Pro

We are excited to see what you build with these new capabilities. To get started, check out our developer documentation or play with the model in Google AI Studio today.
