How does computer vision work?

Computer vision thrives on visual data. In fact, without visual data, computer vision cannot work. That is the simplest explanation, but of course it goes much deeper than that.

There are two key parts to making use of Computer Vision: first the learning stage, and second the processing stage. In the learning stage, a Computer Vision system analyzes hundreds or even thousands of ‘known good’ or ‘known bad’ images and videos over and over, at machine speed, until it learns, for example, the difference between a fake Gucci bag and a real one, or a normal cell and a cancerous cell. In the processing stage, the trained system applies what it has learned to new images and videos.

Deep learning and neural networks are the core technologies that power computer vision, using algorithmic models to enable a system to learn what it should be looking for, apply context and so on. If there isn’t enough data, the system can’t distinguish between a real object and something that looks similar, and will therefore deliver incorrect results. A famous example of this is a computer vision engine being unable to tell a muffin from a chihuahua.

The key point here is the ability of Computer Vision, and AI in general, to self-learn rather than be programmed to identify a specific object. This makes the technology more flexible and versatile: once it has learned an object it can successfully identify it again, even if key aspects differ from one instance of the object to the next.

Computer Vision Object Recognition

In the simple example above we see two images of a Labrador Retriever. A Computer Vision system, once trained, can easily identify both as Labrador Retrievers despite the different angles and how much of each dog is visible. It could also recognise them if the image were zoomed in on the face, showed a side profile, and so on. Hand-coding explicit rules for a computer system to do the same would be virtually impossible.

This learning process is known as ‘Model Training’. Once it is complete, the model can be deployed in a Computer Vision system, allowing it to instantly begin detecting whatever objects are included in the model.
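To make the two stages concrete, here is a minimal sketch in Python. It is not deep learning: a simple nearest-centroid classifier stands in for the neural network, and the two-number feature vectors and the labels are made up for illustration. The shape of the workflow is the same, though: learn from labelled examples first, then apply the model to new input.

```python
def train(examples):
    """Learning stage: average the feature vectors of each label.

    `examples` is a list of (feature_vector, label) pairs.
    Returns a model in the form {label: centroid}.
    """
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, features):
    """Processing stage: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Made-up "known good" and "known bad" examples (illustrative values only).
training_data = [
    ([0.9, 0.8], "genuine"),
    ([0.8, 0.9], "genuine"),
    ([0.2, 0.3], "fake"),
    ([0.3, 0.1], "fake"),
]

model = train(training_data)
prediction = predict(model, [0.85, 0.75])  # a new, unseen example
```

A real system would learn its features from raw pixels rather than receive them as hand-written numbers, which is where deep learning comes in.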

So, although Computer Vision systems don’t need to be constantly reprogrammed by an engineer, they can be improved, whether to increase accuracy or to recognise more attributes of an object. In essence, the more a system “sees”, the more it understands and the better it can do its job.

For Computer Vision to serve its purpose, it must be able to receive data to process and send the results back. To enable this, providers create APIs (Application Programming Interfaces). An API fundamentally allows one system to send a request to another system and get a response. That response will typically be in JSON or XML format (although other formats are also emerging for specific applications).

In the world of computer vision, an API is used to send visual media (images and videos) to a computer vision engine so it can process that media using the applicable technologies. The API also allows the user to define many aspects of how they want the media processed, such as the mix of technologies to use and the parameters to apply during processing, which is then reflected in the data output they get back.
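As a rough illustration of the request/response exchange described above, the sketch below builds a JSON request and parses a JSON response. The field names (`media_url`, `technologies`, `parameters`, `detections`) and the sample response are assumptions made for this example, not the schema of any particular provider's API, and no network call is made.

```python
import json


def build_request(media_url, technologies, min_confidence=0.5):
    """Build a JSON request body (hypothetical schema)."""
    payload = {
        "media_url": media_url,            # where the engine can fetch the media
        "technologies": technologies,      # the mix of technologies to apply
        "parameters": {"min_confidence": min_confidence},
    }
    return json.dumps(payload)


def parse_response(body, min_confidence=0.5):
    """Return (label, confidence) pairs at or above the threshold."""
    data = json.loads(body)
    return [
        (d["label"], d["confidence"])
        for d in data.get("detections", [])
        if d["confidence"] >= min_confidence
    ]


# A made-up response, standing in for what an engine might send back.
sample_response = json.dumps({
    "detections": [
        {"label": "car", "confidence": 0.94},
        {"label": "logo", "confidence": 0.31},
    ]
})

request_body = build_request(
    "https://example.com/photo.jpg",
    ["object_detection", "logo_detection"],
)
results = parse_response(sample_response)
```

In practice the request body would be sent over HTTPS to the provider's endpoint along with credentials; those details vary by provider and are left out here.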

Computer Vision comprises numerous technologies to achieve specific outcomes and these can be used standalone or combined for multiple uses. For example, object detection might be combined with logo recognition to apply context to brand placements on social media posts. 

So far we have only covered the area of extracting data from visual media. This is the enabling technology that creates the labelled/annotated data to be analysed. The more critical element is what happens with that data once it is available. Decisions must be made and actions need to be taken on the data, and again Visual-AI can do this at machine speed. For instance, a Computer Vision component in a self-driving car can detect the road markings in the live video stream it sees with each element labelled with type and position information. The labelled data is then fed into the AI’s decision engine in real-time, allowing it to make precise adjustments in steering to keep the car in the middle of a lane. When the system then sees a slower-moving vehicle ahead, that data is fed into the decision engine and the car can automatically change lanes.
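The lane-keeping example can be sketched in a few lines. This is a deliberately simplified illustration, not how a production vehicle works: the `LaneMarking` type, the label names, the centrally mounted camera and the proportional `gain` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LaneMarking:
    kind: str          # illustrative labels: "left_line" or "right_line"
    x_position: float  # horizontal position in the frame, in pixels


def steering_adjustment(markings, frame_width, gain=0.01):
    """Proportional steering from labelled lane markings.

    Positive means steer right, negative means steer left.
    """
    positions = {m.kind: m.x_position for m in markings}
    if "left_line" not in positions or "right_line" not in positions:
        return 0.0  # hold course if a line is missing (a simplification)
    lane_center = (positions["left_line"] + positions["right_line"]) / 2
    car_center = frame_width / 2  # assumes the camera is mounted centrally
    return gain * (lane_center - car_center)


# Labelled data as it might arrive from the vision component for one frame.
markings = [
    LaneMarking("left_line", 200.0),
    LaneMarking("right_line", 1000.0),
]

# Negative here: the lane centre is left of the car's centre, so steer left.
adjustment = steering_adjustment(markings, frame_width=1280)
```

The point is the separation of roles: the vision component produces labelled positions, and a separate decision step turns them into an action.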

Computer Vision Technologies

Computer Vision is a broad technology area encompassing many technologies, some of which are subsets or supersets of others. They can be used individually or combined for greater effect in an almost limitless number of ways (as covered in our Use Cases page). Below we have outlined the most popular technologies, but a broader list of technologies can be found here.

Object and Scene Detection

One of the most used technologies within the umbrella of Computer Vision is Object and Scene Detection. It is an implementation of Visual-AI which allows a wide range of objects to be detected and tagged in images and videos. It can also apply context to these objects by detecting the setting or scene in which they appear.

This can be as basic as labelling an object as a “Vehicle” or as specific as labelling it as “Car” or even “Car > Ford Escort”. It can provide annotations not just in single frames or stills, but for each frame of video, and also in real-time, which is appropriate for autonomous cars and traffic or productivity analysis.

A great example of this is tracking foot traffic through a building, both to optimise the use of space and for safety and security. Alternatively, a single camera can track all the players in a match, allowing coaches to review good vs. bad play.

The computer vision system not only detects the objects but can then make recommendations and/or predictions about what to do next or what to improve. So when it detects too many people in an area, the system can either send an alert for action to be taken or even autonomously prevent access to that area.
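The crowding example above boils down to counting detections inside a region and comparing the count with a limit. The sketch below assumes each detection is a dictionary with a `label` and a `center` point; that structure, the zone coordinates and the limit are illustrative, not a real engine's output format.

```python
def count_in_zone(detections, zone, label="person"):
    """Count detections of `label` whose centre lies inside the zone.

    `zone` is a rectangle given as (x_min, y_min, x_max, y_max).
    """
    x_min, y_min, x_max, y_max = zone
    count = 0
    for det in detections:
        cx, cy = det["center"]
        if det["label"] == label and x_min <= cx <= x_max and y_min <= cy <= y_max:
            count += 1
    return count


def occupancy_action(detections, zone, max_people):
    """Return "alert" when the zone holds more people than allowed."""
    return "alert" if count_in_zone(detections, zone) > max_people else "ok"


# Made-up detections for a single frame.
frame_detections = [
    {"label": "person", "center": (120, 80)},
    {"label": "person", "center": (150, 95)},
    {"label": "person", "center": (400, 300)},  # outside the zone
    {"label": "trolley", "center": (130, 90)},  # inside, but not a person
]

lobby = (100, 50, 200, 150)
action = occupancy_action(frame_detections, lobby, max_people=1)
```

Whether an "alert" result triggers a notification or locks a door is a decision for the surrounding system.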

Object Tracking

Object Tracking is effectively an add-on to Object Detection that allows a computer vision system to track a detected object within its field of view. Used extensively in autonomous driving applications, it can also be used in manufacturing, supply chain, security and sports analysis applications.
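One simple way to build tracking on top of detection is to match each new bounding box to the track it overlaps most, measured by intersection over union (IoU). The sketch below uses greedy matching with an assumed threshold of 0.3; it is one basic technique among many, and production trackers typically add motion prediction and appearance cues.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def update_tracks(tracks, detections, next_id, threshold=0.3):
    """Greedy matching: each detection joins the best-overlapping unused
    track, or starts a new track if nothing overlaps enough.

    `tracks` maps track id -> last known box. Returns (new_tracks, next_id).
    Tracks with no matching detection are dropped (a simplification).
    """
    updated = {}
    unused = dict(tracks)
    for box in detections:
        best_id, best_score = None, threshold
        for track_id, prev_box in unused.items():
            score = iou(prev_box, box)
            if score >= best_score:
                best_id, best_score = track_id, score
        if best_id is None:
            best_id = next_id
            next_id += 1
        else:
            del unused[best_id]
        updated[best_id] = box
    return updated, next_id


# Two objects appear, then move slightly in the next frame.
frame_1 = [(0, 0, 10, 10), (50, 50, 60, 60)]
frame_2 = [(52, 51, 62, 61), (1, 0, 11, 10)]

tracks, next_id = update_tracks({}, frame_1, next_id=0)
tracks, next_id = update_tracks(tracks, frame_2, next_id)
```

After the second frame each object keeps the id it was given in the first, even though the detections arrive in a different order.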

Logo and Mark Detection/Recognition

Logo and Mark detection is a very popular computer vision technology and it can be applied to a huge number of use cases. It is a specific application of Visual-AI that allows logos, industry marks, icons, and other unique graphical elements to be detected. This can be executed on any visual media so it can be used for applications as diverse as brand monitoring, counterfeit detection and retail shelf tracking. 

Visual Search

Visual Search is an implementation of Visual-AI that allows users to search for visually identical or similar items by uploading or taking a picture or video. Think of Google Lens, which many Android users have on their phones. Visual Search is used extensively in retail applications to allow thematically or stylistically related items to be suggested to shoppers rather than forcing them to struggle with text queries or endless browsing. Simply upload an image of what you’re looking for and visual search can show you where you can buy it or can show you stylistically similar items you might like.

Visual Search is also used in the area of counterfeit detection and copyright infringement where it can detect the unlawful use of an image or portions of an image, or visually similar fake products.
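A common way to implement this kind of search is to represent every image as a numeric feature vector and rank catalogue items by how similar their vectors are to the query's. The sketch below uses cosine similarity over tiny hand-written vectors; the item names and values are invented, and in a real system a trained model would produce much longer vectors from the images themselves.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def visual_search(query_vector, catalogue, top_k=2):
    """Return the names of the `top_k` most similar catalogue items."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in catalogue.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Made-up feature vectors standing in for model output.
catalogue = {
    "red_handbag": [0.9, 0.1, 0.0],
    "red_tote": [0.8, 0.3, 0.1],
    "blue_sneaker": [0.0, 0.2, 0.9],
}

query = [1.0, 0.1, 0.0]  # features of the shopper's uploaded photo
matches = visual_search(query, catalogue)
```

The same ranking can serve counterfeit detection: an unusually high similarity between a marketplace listing and a protected product image is a signal worth reviewing.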

Text Detection

Text Detection is an aspect of Visual-AI which enables the analysis of text embedded in images and videos. Also known as Optical Character Recognition (OCR), it detects the presence of text within this media and converts it into machine-readable text. 

Typically, it would be used in applications of computer vision where analysis of real-world images and videos is needed, such as user-shared social media videos, rather than images of documents and so on. It is often considered a critical piece of technology for phishing detection, brand protection and content moderation, including the detection of hate speech. It should be noted that Text Detection can be used to identify and convert handwritten text, as well as typed text, into machine-readable text.

Hologram Authentication

Holograms have become the authentication device of choice for product manufacturers in recent decades. They have become so ubiquitous, however, that bad actors have started counterfeiting them to make illegitimate products appear closer to the real thing. Computer vision-enabled hologram authentication is a very specialised form of object detection that enables a fast and simple way to authenticate genuine holograms in seconds. This not only hardens the supply chain but also creates opportunities to engage end customers with special offers and up/cross-sell opportunities as well as customer data gathering through connected warranty forms and surveys.

Facial Recognition

Facial recognition is probably the technology that the general public (and the creators of heist movies) most associate with computer vision technology. Facial detection and facial recognition use the principles of object detection to analyse and recognise the key features of a face. This can be applied to security applications on phones and in workplaces, and most recently it has even made its way into payment processing applications, allowing individuals to pay with their face. It’s probably the aspect of computer vision that most divides people, with some thinking it is key for safety and security in many situations, while others feel it is another assault on our personal security and privacy. However, not all implementations of Facial Recognition involve recognizing individual faces. In some applications, it may be sufficient to simply detect a face and individual elements of a face, such as eye movement and mouth movement.

Although this is the computer vision technology most likely to be exploited, there are many legitimate uses that make our lives easier, safer and more productive.

2D Code Reading/Variable Data Label Detection

Being able to detect and decode 2D code labels (QR Codes) and Variable Data Labels (Bar Codes and Data Matrix Codes) can be very useful in manufacturing, supply chain and brand protection/counterfeit detection applications. The first part is the detection of the specific label(s); once a label is detected, the system can decode the data contained within it. A camera can take a live feed of a production line with products flashing past at speed and the system can detect and read the label, no matter its orientation. Similarly, images found on marketplaces and other eCommerce sites can be analysed to detect and decode the data labels in order to check whether the products shown are genuine, counterfeit or grey market.

Unlike dedicated apps that require the label to fill the frame and will only detect these types of labels, Variable Data Label Detection can be combined with other technologies, such as Object Detection, Logo Detection and Visual Search to allow the processing of a single frame or image to deliver multiple interconnected insights, such as the brand related to the label or damage to packaging to be logged against that label, etc.
