Imint Video Primer – Object Tracking and Auto Zoom

Welcome to the second lesson of the Imint Video Primer! This part describes our R&D process in object tracking and Live Auto Zoom in more detail.


The history of object tracking

Object tracking is the process of locating and following one or more objects over time using a camera. It has a variety of uses, including human-computer interaction, security and surveillance, video communication, augmented reality, traffic control, medical imaging, video editing, and even compression – remember we talked about detecting movement from frame to frame?

This is quite different from object detection, generally used to automatically detect a predetermined set of types of objects in a video, and object recognition, used to recognize and identify objects in a video – distinguishing people from cars and so on. It is interesting to note that Machine Learning and Deep Learning (ML/DL) methods completely dominate detection and recognition algorithms today, whereas they play a smaller role for object tracking, although this is bound to increase over time.

All three fields are still open problems in image and video analysis, each with many different approaches. Both detection and recognition build upon techniques of object tracking, but we will limit ourselves to just object tracking in this text. Trackers are steadily improving, but common problems remain. Trackers can easily fail in cluttered scenes, occlusion (the tracked object disappears behind another object) is a difficult problem, and good initialization (the tracker’s starting region) still matters a good deal.

Different object tracking approaches

As stated above, there are many approaches to object tracking, and an abundance of variations exist within each approach. Which method may be considered “best” depends largely on the particular application. The cutting-edge methods constantly change with new research and breakthroughs in related fields.

Below, I will outline four very different object tracking approaches, based on feature clouds, correlation filters, color histograms, and motion, to illustrate the different techniques. Obviously, these are in reality much more technical and complicated, but bear with me. Typically, finding the location of the object and estimating its new size (in case it is moving towards or away from the camera) are considered separate processes.

Different trackers are constantly benchmarked. A paper with a lot of benchmarking details can be found here [1]. More recent, unpublished data is here.

Feature cloud trackers try to follow the target by following many small “features”: points of interest in the object region that are easy to find again in the next frame. These are often sharp edges and corners. Mapping the features from one frame to the next then gives an approximation of how the object has moved: the object's motion is taken to be the average motion of all its features.

This method is very discriminative, which gives it the advantage of knowing when the target is lost. The difficulty lies in handling occlusion, rotation, and motion blur, where the small features may not be visible. Examples of this type of tracker are those based on Lucas-Kanade matching, first presented as a stereo matching solution in 1981.
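To make the idea concrete, here is a minimal sketch of the last step of a feature cloud tracker: turning matched feature positions into one motion estimate for the whole object. It assumes the matching itself (for example by Lucas-Kanade) has already been done. The function names are hypothetical, and the median is used instead of the plain average as an illustrative choice, because it tolerates a few badly matched features.

```python
from statistics import median


def estimate_motion(prev_points, next_points):
    """Estimate object translation as the median displacement of matched features.

    prev_points[i] and next_points[i] are the (x, y) positions of the same
    feature in two consecutive frames.
    """
    if not prev_points or len(prev_points) != len(next_points):
        raise ValueError("need two non-empty, equally long lists of matched points")
    dx = [nx - px for (px, _), (nx, _) in zip(prev_points, next_points)]
    dy = [ny - py for (_, py), (_, ny) in zip(prev_points, next_points)]
    return median(dx), median(dy)


def move_box(box, motion):
    """Shift a tracking box (x, y, width, height) by the estimated motion."""
    x, y, w, h = box
    mx, my = motion
    return (x + mx, y + my, w, h)


# Five features; the last one was matched to the wrong place (an outlier).
prev_points = [(10, 10), (20, 10), (10, 20), (20, 20), (15, 15)]
next_points = [(13, 12), (23, 12), (13, 22), (23, 22), (40, 40)]
motion = estimate_motion(prev_points, next_points)
new_box = move_box((8, 8, 14, 14), motion)
```

With a plain average, the single bad match would pull the estimate away from the true motion, which is one reason robust statistics are popular in this kind of tracker.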

Correlation filter trackers try to identify a template pattern of the current object and then look for the area in the next video frame that is most similar to that pattern, i.e. the region with minimal difference. The center of this new region is then assumed to be the updated center of the object. They are based on the convolution theorem, which allows the template image to be used as a large convolution filter: an element-wise complex multiplication of two images in Fourier space yields the Fourier representation of the convolution between the images. This method more easily determines the accuracy of the estimation and better handles occlusion. The problem with convolution-based filters is finding features that remain stable when the background and light conditions change.
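The Fourier trick can be illustrated in one dimension with nothing but the standard library. The sketch below is an assumption-laden toy and not a real tracker: it uses a naive, slow DFT instead of an FFT, the signals are short lists instead of images, and all function names are hypothetical. For correlation (as opposed to convolution), the spectrum of the template is complex-conjugated before the element-wise multiplication.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for a small demo."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_cross_correlation(template, signal):
    """Correlate two equally long signals via multiplication in Fourier space."""
    f_template = dft(template)
    f_signal = dft(signal)
    product = [a.conjugate() * b for a, b in zip(f_template, f_signal)]
    return [v.real for v in dft(product, inverse=True)]


def find_shift(template, signal):
    """Return the circular shift at which the template best matches the signal."""
    corr = circular_cross_correlation(template, signal)
    return max(range(len(corr)), key=corr.__getitem__)


template = [0, 0, 1, 3, 1, 0, 0, 0]   # the "object" in the previous frame
signal = [0, 0, 0, 0, 0, 1, 3, 1]     # same pattern, moved three steps right
shift = find_shift(template, signal)
```

The position of the correlation peak gives the displacement, and the height and sharpness of the peak give a hint of how trustworthy the match is.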

Kernelized Correlation Filter (KCF) has been a very popular method ever since it started winning tracking competitions in 2014. The method is both accurate and fast enough to run in real time on regular computers. In its original form, KCF does not perform any scale analysis and always assumes the same size and shape as when it was initialized. Of course, the object changes its size and shape during tracking, and a good tracker should follow this change over time.
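As an illustration of what a separate size estimate can look like, the sketch below compares the distances between matched feature points in two frames: if the points have spread apart, the object has grown. This is only one simple, generic option. It is not part of KCF and is not presented as the method used by Imint, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import median


def estimate_scale(prev_points, next_points):
    """Median ratio of pairwise point distances between two frames."""
    ratios = []
    for i, j in combinations(range(len(prev_points)), 2):
        before = dist(prev_points[i], prev_points[j])
        after = dist(next_points[i], next_points[j])
        if before > 0:
            ratios.append(after / before)
    if not ratios:
        raise ValueError("need at least two distinct matched points")
    return median(ratios)


def rescale_box(box, scale):
    """Resize a box (x, y, width, height) around its own center."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale, h * scale
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)


# The object moved and became 1.5 times larger between the frames.
prev_points = [(0, 0), (10, 0), (0, 10), (10, 10)]
next_points = [(5, 5), (20, 5), (5, 20), (20, 20)]
scale = estimate_scale(prev_points, next_points)
```

Because only distances between points are used, the estimate is unaffected by how far the object has moved, which fits the idea of treating position and size as separate problems.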

Color histogram trackers are commonly used for auto focus, where precision matters less. They don’t stand a chance in competitions because they can easily follow something else, but their fast redetection works even if the object has changed shape completely. They are not robust against light changes or backgrounds with similar color patterns.
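A minimal sketch of the comparison at the heart of such a tracker follows. It assumes pixels are given as (r, g, b) tuples with values 0 to 255. The bin count, the function names, and the use of histogram intersection as the similarity measure are illustrative choices rather than a description of any particular product.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Normalized color histogram of a list of (r, g, b) pixels."""
    if not pixels:
        raise ValueError("need at least one pixel")
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {color_bin: n / total for color_bin, n in counts.items()}


def histogram_intersection(hist_a, hist_b):
    """Similarity between 0.0 (no shared colors) and 1.0 (identical mix)."""
    return sum(min(v, hist_b.get(k, 0.0)) for k, v in hist_a.items())


target = [(250, 10, 10)] * 6 + [(10, 10, 250)] * 2   # mostly red, some blue
reshaped = list(reversed(target))                      # same colors, new layout
half_green = [(250, 10, 10)] * 4 + [(10, 250, 10)] * 4
all_green = [(10, 250, 10)] * 8

reference = color_histogram(target)
same = histogram_intersection(reference, color_histogram(reshaped))
partial = histogram_intersection(reference, color_histogram(half_green))
none = histogram_intersection(reference, color_histogram(all_green))
```

The histogram ignores where in the region each pixel sits, which is why a complete change of shape does not matter, and also why a background with a similar color mix can fool the tracker.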

Motion-based trackers are well suited for mounted cameras in security systems and for following fast-moving objects. The method basically just follows a detected region that differs from a learned background model. These trackers are better at handling changes in illumination, but cannot detect objects that are not moving.
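The sketch below shows the principle on tiny grayscale frames stored as lists of rows. The running-average background model, the threshold value, and the function names are assumptions chosen for illustration; real systems use considerably more elaborate background models.

```python
def update_background(background, frame, rate=0.05):
    """Blend the new frame into the running-average background model."""
    return [[(1 - rate) * b + rate * f for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]


def moving_region(background, frame, threshold=30):
    """Bounding box (x, y, width, height) of pixels that differ from the
    background by more than the threshold, or None if nothing differs."""
    xs, ys = [], []
    for y, (b_row, f_row) in enumerate(zip(background, frame)):
        for x, (b, f) in enumerate(zip(b_row, f_row)):
            if abs(f - b) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


background = [[10] * 6 for _ in range(5)]   # empty, static scene
frame = [row[:] for row in background]
frame[1][2] = 200                           # a bright object enters
frame[2][3] = 200

region = moving_region(background, frame)
nothing = moving_region(background, background)
learned = update_background(background, frame, rate=0.5)
```

Because the background keeps absorbing whatever stays in place, an object that stops moving gradually becomes part of the background and disappears from the detection.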

Groundbreaking research at Imint

Most research on visual object tracking is today done on high-performance PCs. There are plenty of competitions, online and offline, that generally focus on accuracy, and most algorithms can perform better simply by increasing the available computing power and/or ignoring any time constraints. Some competitions have sub-competitions for the best real-time trackers, but accuracy is overall the main point of research. Also, most object trackers are benchmarked on datasets with mainly professional still-camera videos, which is quite a different case from using smartphones.

At Imint, we focus on smartphones and other types of low-power hardware. That means we cannot afford any half-measures. Every clock cycle counts, not just for performance, but also for battery consumption. Having to optimize for performance without compromising on accuracy makes the hard problem of tracking even harder. This doesn’t just mean taking state-of-the-art algorithms and making them faster; it also entails inventing new methods to do the same or even better work with less effort.

One example is our own representation of probability-based optical flow, which outperforms state-of-the-art algorithms in quality while consuming 0.08 ms per frame on a mobile CPU, compared to 200 ms for the common deterministic version, which also gives poorer quality.

Video: One of our engineers validating the robustness of our tracker against changes in size, movement, and heavy shaking.

As stated above, there is very little (public) research about trackers taking place on the smartphone side. The curious reader may find one example here [2]. This paper presents how a Kernelized Correlation Filter (KCF) tracker can be used in real time (~30 frames per second) on a phone. It is hard to perform a direct comparison, as the paper is short on exact implementation details. However, our tracker has better performance and can track objects at up to 300 frames per second on a single mobile CPU. Moreover, a KCF tracker does not perform any scale analysis (the tracker size and shape stay the same during the whole video), while our tracker adapts to changes in size and shape with great accuracy.
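To put these frame rates in perspective, they can be converted into a time budget per frame. The small sketch below does only that arithmetic. It assumes that a throughput of N frames per second corresponds to 1000/N ms of processing per frame, and the function names are hypothetical.

```python
def frame_budget_ms(frames_per_second):
    """Time available (or used) per frame, in milliseconds."""
    return 1000.0 / frames_per_second


def tracker_share(tracker_fps, video_fps):
    """Fraction of each video frame interval spent on tracking."""
    return frame_budget_ms(tracker_fps) / frame_budget_ms(video_fps)


realtime_budget = frame_budget_ms(30)    # about 33.3 ms per frame
fast_tracker = frame_budget_ms(300)      # about 3.3 ms per frame
share = tracker_share(300, 30)           # about one tenth of the interval
```

Under that assumption, a tracker capable of 300 frames per second uses roughly a tenth of the frame interval of a 30 frames per second video, leaving the rest for other processing or for saving battery.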

Turning object tracking into a product

Readers familiar with Imint are undoubtedly familiar with our Live Auto Zoom feature. Zooming in on moving objects with smartphones today is practically pointless: it is hard to zoom smoothly, hard to aim at the object, and hard to keep it in focus. On top of this, any unintended motion is amplified with the level of zoom. With Vidhance Live Auto Zoom, the user only needs to select the desired object, and the feature will automatically track and follow it, zooming smoothly and composing the video professionally as the scene changes.
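The general idea of turning a tracked box into a smooth zoom can be sketched as follows. This is a generic illustration and not a description of how Live Auto Zoom is implemented: the margin, the exponential smoothing, and all names are assumptions. The crop window keeps the aspect ratio of the frame, stays inside it, and moves only a fraction of the way towards its target on every frame.

```python
def target_crop(box, frame_size, margin=2.0):
    """Crop window (x, y, width, height) around a tracked box, with the
    frame's aspect ratio and clamped to stay inside the frame."""
    bx, by, bw, bh = box
    fw, fh = frame_size
    aspect = fw / fh
    # Leave some room around the object, then match the frame's aspect ratio.
    crop_h = min(max(bh * margin, bw * margin / aspect), fh)
    crop_w = crop_h * aspect
    cx, cy = bx + bw / 2, by + bh / 2
    x = min(max(cx - crop_w / 2, 0), fw - crop_w)
    y = min(max(cy - crop_h / 2, 0), fh - crop_h)
    return (x, y, crop_w, crop_h)


def smooth(previous, target, alpha=0.2):
    """Move the current crop a fraction alpha of the way towards the target."""
    return tuple(p + alpha * (t - p) for p, t in zip(previous, target))


frame_size = (1920, 1080)
crop = target_crop((900, 500, 120, 80), frame_size)

view = (0.0, 0.0, 1920.0, 1080.0)   # start fully zoomed out
for _ in range(100):                # one smoothing step per video frame
    view = smooth(view, crop)
```

A small alpha gives a slow, calm zoom, while a larger alpha reacts faster but passes on more of the jitter in the tracked box.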

Video input may be huge, but each frame is often downsampled (resized to a smaller resolution) before processing, which makes the quality of the original resolution less important for us. Running software directly integrated into the lower levels of a smartphone fortunately also comes with certain advantages, such as a great motion estimation from the Vidhance stabilizer module.
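Downsampling itself is simple. The sketch below averages blocks of pixels in a grayscale image stored as a list of rows, and maps a box found in the small image back to full-resolution coordinates. The block-averaging filter and the function names are illustrative assumptions, and rows or columns that do not fill a whole block are simply dropped.

```python
def downsample(image, factor):
    """Shrink a grayscale image by averaging factor x factor pixel blocks."""
    height = len(image) // factor
    width = len(image[0]) // factor
    small = []
    for y in range(height):
        row = []
        for x in range(width):
            block = [image[y * factor + dy][x * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def to_full_resolution(box, factor):
    """Scale a box (x, y, width, height) from the small image back up."""
    return tuple(v * factor for v in box)


image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 2, 2],
    [4, 4, 2, 6],
]
small = downsample(image, 2)
full_box = to_full_resolution((1, 2, 3, 4), 4)
```

Halving the resolution in both directions leaves only a quarter of the pixels to process, which is why this step pays off so well on low-power hardware.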

The tracker is an integral component of Live Auto Zoom and a great way to showcase our research work. The robustness of our technology in industrial applications is not in question either. The recent SAAB deal also uses our tracking software, but for a completely different scenario: it is part of their Air Traffic Control solutions in the airport operations industry.

This means that recent versions of Live Auto Zoom use a tracker considered good enough for large-scale air traffic control. That is pretty darn cool to carry with you in a smartphone!


[1] Y. Wu, J. Lim and M. H. Yang, “Object Tracking Benchmark,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834-1848, Sept. 2015. doi: 10.1109/TPAMI.2014.2388226

[2] D. Peso, A. Nischwitz, S. Ippisch and P. Obermeier, “Kernelized correlation tracker on smartphones,” in Pervasive and Mobile Computing, vol. 35, pp. 108-124, 2017. ISSN 1574-1192


Marcus Näslund