Splunk LLC

04/22/2025 | News release | Distributed by Public on 04/22/2025 17:09

What Is Computer Vision & How Does It Work

From the domains of artificial intelligence and computer science comes computer vision.

Computer vision allows machines to interpret, infer, and understand visual information. Just like humans can see objects, computer vision can extract knowledge from a visual image, all thanks to applied mathematics.

So, how does computer vision work - how can machines see like humans? And why is computer vision relevant today? Let's begin the discussion with the basic concepts.

How computer vision works

At the most fundamental level, computer vision is about extracting knowledge from an image frame or a sequence of frames (like a video). So, what exactly is an image?

An image is a structured collection of pixels, where each pixel value defines the intensity or color at its location in the image. For example:

  • A grayscale image at 1080p resolution would be a 2D matrix (1080 rows, 1920 columns) with each pixel location containing an intensity value between 0-255. Think of this like a checkerboard, with rows and columns shaded with black, white, and all the shades of gray.
  • A color image, also 1080p resolution, would be a 3D matrix with three channels: Red, Green, Blue (RGB). This is like mixing colors together to create a specific picture or piece of art.

In other words, each RGB channel consists of a 1080x1920 matrix with pixel values (0-255) representing the intensity of that channel.
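As a quick illustration, here is how those matrices look in NumPy - the shapes and values simply mirror the 1080p example above:

```python
import numpy as np

# A 1080p grayscale frame: 1080 rows x 1920 columns, one intensity per pixel.
gray = np.zeros((1080, 1920), dtype=np.uint8)
gray[0, 0] = 255           # top-left pixel set to pure white

# A 1080p color frame: the same grid, but with three channels (R, G, B).
color = np.zeros((1080, 1920, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]  # top-left pixel set to pure red

print(gray.shape)   # (1080, 1920)
print(color.shape)  # (1080, 1920, 3)
```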

Now, to create the image: in simple terms, an LED screen takes this matrix and lights up an LED element according to the corresponding location and power intensity. The result is a visual image - one that our eyes can see and our brain can comprehend based on our knowledge of the world.

That's how images work for humans. But how do machines interpret pixel values?

How machines can "see" pixel values

Computer vision (also known as machine vision) helps machines comprehend and infer knowledge from visual information. But how?

Computer vision is focused on three main problem categories in the pipeline of teaching machines how to see and interpret a visual image:

  • Representation
  • Learning
  • Recognition

Representation

Representation creates a description of the image, either mathematically or in terms of features. The goal of this task is to simplify the problem by converting raw image pixels into meaningful structures.

These structures have "meaning" because we can define them mathematically, as described above - and computer vision learning algorithms can learn from them.

Auto-encoders are a great example. An auto-encoder simplifies complex images into basic shapes or features, similar to identifying key landmarks on a map. An auto-encoder may encode (compress) a raw image matrix into a lower-dimensional latent space that captures features such as:

  • Textures
  • Edges
  • Corners
  • Object parts of the image

These features are abstract and meaningful. They can be used for downstream tasks such as object detection, image classification, and more!
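To make the idea concrete, here is a minimal sketch of a linear auto-encoder in NumPy. It is a toy stand-in for the deep auto-encoders used in practice: made-up 8-pixel "images" generated from 2 hidden factors are compressed into a 2-dimensional latent space and reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "images" of 8 pixels each, generated from 2 hidden factors,
# so a 2-dimensional latent space can capture most of the structure.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = factors @ mixing + 0.01 * rng.normal(size=(200, 8))

# Encoder (8 -> 2) and decoder (2 -> 8) weights of a linear auto-encoder.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

lr = 0.01
for _ in range(2000):
    Z = X @ W_enc        # encode: compress into the latent space
    X_hat = Z @ W_dec    # decode: reconstruct the input
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
print(round(mse, 4))  # small reconstruction error despite 4x compression
```

Real auto-encoders stack nonlinear layers, but the pattern is the same: squeeze the image through a bottleneck, then train until the reconstruction error is low.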

The details: A lot of image preprocessing and image processing goes into the computer vision pipeline before the dataset is ready for extracting representative features and then training an AI model on those features. Preprocessing tasks may include resizing, normalization, noise removal, and color-space conversion.

Image processing tasks may include feature extraction and gradient detection-based techniques to extract features such as corners, edges, contrasts, and distinctive shapes. Common algorithms used here include:

  • SIFT (scale-invariant feature transform)
  • HOG (histogram of oriented gradients)
  • SURF (speeded-up robust features)
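As a simplified sketch of what these gradient-based techniques do, here is a single HOG-style orientation histogram computed with NumPy on a tiny synthetic image (a toy illustration, not a full SIFT/HOG/SURF implementation):

```python
import numpy as np

# A tiny synthetic image: dark left half, bright right half -> one vertical edge.
img = np.zeros((8, 8), dtype=float)
img[:, 4:] = 1.0

# Finite-difference gradients (the core of HOG- and edge-based features).
gx = np.zeros_like(img)
gy = np.zeros_like(img)
gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal gradient
gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical gradient

magnitude = np.hypot(gx, gy)
orientation = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientations

# A single HOG-style cell: 9 orientation bins weighted by gradient magnitude.
hist, _ = np.histogram(orientation, bins=9, range=(0, 180), weights=magnitude)
print(hist)  # all of the gradient energy lands in the first bin: a vertical edge
```

A real HOG descriptor computes many such cell histograms across the image and normalizes them in blocks, but this captures the core idea: edges show up as energy concentrated in particular orientation bins.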

Learning and recognition

Next, your models must learn - that is, they must be trained. Computer vision relies on a variety of AI learning methodologies, including supervised and unsupervised learning.

Supervised learning maps images to their known class labels. These labels may be annotated manually. Datasets such as MNIST, COCO, ImageNet, and other domain-specific datasets work well for training models.

Once your models are trained, you can fine-tune them on problem-specific domains that may be similar to the datasets they are trained on - this is the recognition piece. For example:

  • A facial recognition engine may be trained on the VGG-Face dataset.
  • A general-purpose CV model may be trained on ImageNet that contains labeled images across thousands of object categories.

Any standard deep learning architecture, such as a CNN or a Transformer-based model, may be used for this task.
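As a toy illustration of the supervised train-then-recognize loop, here is a nearest-centroid classifier in NumPy. The synthetic "images" and the centroid rule are hypothetical stand-ins for a real labeled dataset and a CNN; only the structure of the workflow carries over.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy supervised setup: two "classes" of 16-pixel images with known labels.
# Class 0 images are dark on average, class 1 images are bright on average.
X0 = rng.normal(loc=0.2, scale=0.05, size=(50, 16))
X1 = rng.normal(loc=0.8, scale=0.05, size=(50, 16))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# "Training": learn one centroid per labeled class - a stand-in for the
# far richer feature learning a CNN or Transformer would perform.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(images):
    # Recognition: assign each image to the nearest class centroid.
    dists = np.linalg.norm(images[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

accuracy = (predict(X) == y).mean()
print(accuracy)  # 1.0 on this cleanly separated toy data
```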

The unsupervised learning approach does not rely on labeled images. With no ground-truth label, the models learn the patterns and representations across object categories. (The computer figures out patterns on its own, not unlike how a child learns to sort by color or size.)

Common examples include:

  • Image segmentation techniques, such as clustering
  • Dimensionality reduction techniques like PCA, which is similar to a long article summarized into a few key points
  • Generative models, such as GANs and diffusion models, which can create new scenes from "imagination", just as human creators can
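For example, clustering-based segmentation can be sketched with a minimal k-means in NumPy, where synthetic one-channel pixel intensities stand in for a real image and the two groups are never labeled:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled "image": pixel intensities drawn from two unknown groups
# (e.g. foreground vs. background) with no ground-truth labels.
pixels = np.concatenate([rng.normal(0.2, 0.05, 500),
                         rng.normal(0.8, 0.05, 500)]).reshape(-1, 1)

# Minimal k-means: the model discovers the two clusters on its own.
centers = pixels[rng.choice(len(pixels), size=2, replace=False)]
for _ in range(20):
    labels = np.abs(pixels - centers.T).argmin(axis=1)                  # assign
    centers = np.array([[pixels[labels == k].mean()] for k in (0, 1)])  # update

print(np.sort(centers.ravel()).round(2))  # centers settle near 0.2 and 0.8
```

No labels were used, yet the algorithm recovers the two intensity groups - the same idea, scaled up to color and texture features, underlies clustering-based image segmentation.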

Supervised vs unsupervised learning. Supervised learning is more common in real-world applications of machine vision, particularly due to its reliable performance and accuracy.

However, most current state-of-the-art AI techniques focus on unsupervised learning. The reason is simple: image annotation is a tedious manual task and not scalable.

In solving complicated problems - like machine vision for autonomous vehicles - vehicle sensors must extract knowledge from virtually infinite image scenarios. This is where generative modeling techniques such as diffusion models can potentially help produce synthetic data on out-of-distribution image scenarios: all to help train robust models using supervised learning.

Real-world uses for computer vision

So, computer vision - machines seeing images - sounds like it could be very useful in the real world. And indeed it is. Here are some examples.

Healthcare: In medical imaging, computer vision assists in diagnosing diseases by analyzing X-rays, MRIs, and CT scans. It can detect anomalies, measure growths, and track changes over time, aiding healthcare professionals in making informed decisions.

Automotive: Autonomous vehicles rely heavily on computer vision to navigate and interpret the environment. Vision systems identify road signs, detect pedestrians, and monitor traffic conditions, ensuring safe and efficient driving.

Retail: Computer vision enhances the shopping experience by enabling features like virtual try-ons and automated checkout systems. It also helps in inventory management by monitoring stock levels and detecting misplaced items.

Security: In physical security and surveillance, computer vision identifies and tracks individuals, detects unusual activities, and analyzes crowd behavior. Facial recognition systems enhance security by verifying identities and granting access to authorized personnel. (Now, the legal and privacy ramifications of how this is used - that's a different topic.)

Agriculture: Vision technology monitors crop health, detects diseases, and assesses soil conditions. Drones equipped with cameras provide valuable insights into large farming areas, optimizing resource allocation and improving yield.

What's next in computer vision?

The latest trends in computer vision are focused on three research domains: Agentic AI, spatial computing and multimodal LLMs.

  • Agentic AI relies on computer vision algorithms to perform vision-based tasks. Think robot navigation. With vision-based systems, machines can smartly interact with the surrounding environment.
  • Spatial computing is focused on understanding the environment space. Think digital twins: 3D models of ancient buildings from a camera input.
  • Multimodal LLMs combine intelligence from visual data, text, speech, audio, and other formats. They can interface with machines or humans in interactive applications as visual agents - think Google Gemini and the latest OpenAI models.