Computer Vision for the Masses

Bringing Computer Vision to the Open Web Platform*

The Web is the world’s most universal compute platform and the foundation for the digital economy. Since its birth in the early 1990s, Web capabilities have been increasing in both quantity and quality. But in spite of all the progress, computer vision isn’t yet mainstream on the Web. The reasons include:

  • The lack of sufficient performance of JavaScript*, the standard language of the Web
  • The lack of camera support in the standard Web APIs
  • The lack of comprehensive computer vision libraries

These problems are about to be solved, opening the way to a more immersive and perceptual Web with transformational effects in areas such as online shopping, education, and entertainment.

Over the last decade, the tremendous improvements in JavaScript performance, plus the recent emergence of WebAssembly*, close the Web performance gap with native computing. And the HTML5 WebRTC* API has brought camera support to the Open Web Platform*. Even so, a comprehensive library of computer vision algorithms for the Web was still lacking. This article outlines a solution for the last piece of the problem by bringing OpenCV* to the Open Web Platform.

OpenCV is the most popular computer vision library, with a comprehensive set of vision functions and a large developer community. It’s implemented in C++ and, up until now, was not available in Web browsers without the help of unpopular native plugins.

We’ll show how to leverage OpenCV efficiency, completeness, API maturity, and its community’s collective knowledge to bring hundreds of OpenCV functions to the Open Web Platform. It’s provided in a format that’s easy for JavaScript engines to optimize and has an API that’s easy for Web programmers to adopt and use to develop applications. On top of that, we’ll show how to port OpenCV parallel implementations that target single instruction, multiple data (SIMD) units and multiple processor cores to equivalent Web primitives―providing the high performance required for real-time and interactive use cases.

The Open Web Platform

The Open Web Platform is the most universal computing platform, with billions of connected devices. Its popularity in online commerce, entertainment, science, and education has grown exponentially―as has the amount of multimedia content on the Web. Despite this, computer vision processing on Web browsers hasn’t been a common practice. The lack of client-side vision processing is due to several limitations:

  • A lack of standard Web APIs to access and transfer multimedia content
  • Inferior JavaScript performance
  • Lack of a comprehensive computer vision library to develop apps

The approach we outline here, along with other recent developments on the Web front, will address those limitations and empower the Web with proper computer vision capabilities.

Adding Camera Support and Plugin-Free Multimedia Delivery

HTML5 introduced several Web APIs to capture, transfer, and present multimedia content in browsers without the need for third-party plugins. Among these, Web Real-Time Communication* (WebRTC*) allows acquisition and peer-to-peer transport of multimedia content, and video elements display videos.

Recently, the immersive Web with access to virtual reality (VR) and augmented reality (AR) content has begun delivering new, engaging user experiences.

Improved JavaScript* Performance

JavaScript is the dominant language of the Web. Because it’s a scripting language with dynamic typing, its performance is inferior to that of native languages such as C++. Multimedia processing often involves complex algorithms and massive amounts of computation. With client-side technologies such as just-in-time (JIT) compilation, and with the introduction of WebAssembly* (WASM*), a portable binary format for the Web, Web clients can reach near-native performance with JavaScript and handle more demanding tasks.

A Comprehensive Computer Vision Library

Although there are several computer vision libraries developed in native languages such as C++, they can’t be used in browsers without relying on unpopular browser extensions, which pose security and portability issues. There have been a few efforts to develop computer vision libraries in JavaScript, but these are limited to select categories of vision functions. Expanding those efforts with new algorithms, and optimizing the implementation, are challenging tasks. Previous work lacked functionality, performance, or portability.

As an alternative approach, we take advantage of an existing comprehensive computer vision library developed in C++ (i.e., OpenCV) and make it work on the Web. This approach works great on the Web for several reasons:

  • It provides an expansive set of functions with optimized implementation.
  • It performs more efficiently than normal JavaScript implementations, and performance will further improve through parallelism.
  • Developers can access a large collection of existing resources such as tutorials and examples.

OpenCV.js*

OpenCV [1] is the de facto library for computer vision development. It’s an open-source library that started at Intel Labs back in 2000. OpenCV is very comprehensive and has been implemented as a set of modules (Figure 1). It offers a large number of primitive kernels and vision applications, ranging from image processing, object detection, and tracking to machine learning and deep neural networks (DNN). OpenCV provides efficient implementations for parallel hardware such as multicore processors with vector units. We translated many OpenCV functions into JavaScript and refer to the result as OpenCV.js.

Figure 1 – OpenCV implemented as a collection of modules

Table 1 categorizes and lists the functions currently included in OpenCV.js. It omits several OpenCV modules, for two reasons:

  1. Not all of OpenCV’s offerings are suitable for the Web. For instance, the high-level GUI and I/O module (highgui)―which provides functions to access media devices such as cameras and graphical user interfaces―is platform-dependent and can’t be compiled to the Web. Those functions, however, have alternatives using HTML5 primitives, which are provided by a JavaScript module (utils.js). This works, for instance, to access files hosted on the Web and media devices through getUserMedia and to display graphics using HTML Canvas*.
  2. Some of the OpenCV functions are only used in certain application domains that aren’t common in typical Web applications. For instance, the camera calibration module (calib3d) is often used in robotics. To reduce the size of the generated library for general use cases, based on OpenCV community feedback, we have identified the least commonly used functions from OpenCV and excluded them from the JavaScript version of the library.

Table 1. OpenCV.js provided functionalities

Module            | Provided Function
Core components   | Image manipulation and basic arithmetic
Image processing  | Numerous functions to process and analyze images
Video processing  | Video processing algorithms such as tracking, background segmentation, and optical flow
Object detection  | HAAR*- and HOG*-based cascade classifiers
DNN               | Inference of trained Caffe*, Torch*, or TensorFlow* models
GUI features      | Helper functions to access frames from HTML Canvas*, video elements, and cameras

Because there are still many functions that might be useful for special use cases, we’ve provided a way to build the library with user-selected functions.

Translating OpenCV to JavaScript* and WebAssembly*

The emergence of Emscripten* [2], an LLVM-based source-to-source compiler developed by Mozilla, has made it possible to port many programs and libraries developed in C++ to the Web. Originally, Emscripten targeted a typed subset of JavaScript called asm.js that, because of its simplicity, allows JavaScript engines to perform extra levels of optimization. In fact, it’s even possible to compile asm.js functions before execution.

While the performance is impressive, parsing and compiling large JavaScript files can become a bottleneck, especially on mobile devices with weaker processors. This was one of the main motivations for the development of WASM [3]. WASM is a portable, size- and load-time-efficient binary format designed as a target for Web compilation. We used Emscripten to compile the OpenCV source code into both asm.js and WASM. The two builds offer the same functionality and can be used interchangeably.
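To make the idea of a compact binary target concrete, here is a minimal sketch, independent of OpenCV.js, that loads a hand-assembled WASM module from JavaScript. The 41-byte module and its exported `add` function are illustrative assumptions, not part of the OpenCV.js build output.

```javascript
// Hand-assembled WebAssembly module exporting add(a, b) on 32-bit integers.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic "\0asm" + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local0 + local1
]);

// Synchronous compilation is acceptable for a module this small.
const wasmModule = new WebAssembly.Module(wasmBytes);
const wasmInstance = new WebAssembly.Instance(wasmModule);
const add = wasmInstance.exports.add;

console.log(add(2, 3)); // 5
```

For a multi-megabyte library, the asynchronous `WebAssembly.instantiate` path is usually the better choice because it keeps compilation off the main thread.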

During compilation with Emscripten, C++ high-level language information such as class and function identifiers is replaced with mangled names. Because it’s almost impossible to develop programs using mangled names, we provide binding information for OpenCV entities such as functions and classes and expose them to JavaScript. This gives the library an interface similar to that of normal OpenCV, with which many programmers are already familiar. Because OpenCV is large and grows continuously through new contributions, continuously updating the port by hand is impractical. So we developed a semi-automated approach that takes care of the tedious parts of the translation process while allowing expert insights that can enable high-quality, efficient code production.

Figure 2 lists the steps involved in converting OpenCV C++ code to JavaScript. First, the OpenCV source code is configured to disable components and implementations that are platform-specific or are not optimized for the Web. Next, information about classes and functions that should be exported to JavaScript is extracted from the OpenCV source code. We use a white list of OpenCV classes and functions that should be included in the final JavaScript build. It’s possible to update the list before building to include or exclude OpenCV modules and/or functions. For efficiency, binding information for the OpenCV core module, which includes the OpenCV main data structure (i.e., cv::Mat), is manually provided. Using the binding information and the function white list, we generate glue code that maps JavaScript symbols to C++ symbols and compile it with Emscripten, along with the rest of the OpenCV library, into JavaScript. The output of this process is a JavaScript file (OpenCV.js) that serves as the library interface, along with a WASM or asm.js file that implements the OpenCV functions. utils.js, which includes GUI, I/O, and utility functions, is also linked with OpenCV.js.
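The white-list filtering step can be pictured with a small sketch. Everything here is hypothetical: the `declarations` array, the `whiteList` object, and the `generateGlue` function are invented for illustration, and the emitted text only loosely follows Emscripten's Embind registration syntax. The article does not specify what the real generator emits, and overloaded C++ functions would need wrapper functions that this sketch omits.

```javascript
// Hypothetical declarations, as if extracted from OpenCV headers.
const declarations = [
  { module: "imgproc", name: "Canny" },
  { module: "imgproc", name: "cvtColor" },
  { module: "calib3d", name: "calibrateCamera" },
];

// Hypothetical white list: module name -> function names to export.
const whiteList = { imgproc: ["Canny", "cvtColor"] };

// Keep only white-listed functions and emit one binding line for each.
function generateGlue(decls, allowed) {
  const lines = decls
    .filter((d) => (allowed[d.module] || []).includes(d.name))
    .map((d) => `  function("${d.name}", &cv::${d.name});`);
  return ["EMSCRIPTEN_BINDINGS(opencv_sketch) {", ...lines, "}"].join("\n");
}

const glue = generateGlue(declarations, whiteList);
console.log(glue);
```

Editing the white list before the build is then enough to include or exclude a function, which mirrors the workflow described above.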

Figure 2 – Generating OpenCV.js

Using OpenCV.js in Web Applications

Let’s explore how to use OpenCV.js to develop Web applications. Figure 3 shows an overview of OpenCV.js and its interaction with Web applications. Web applications will use the OpenCV.js API to access the functions as listed in Table 1. While the vision functions from OpenCV are compiled either into WASM or asm.js, we have developed a JavaScript module that provides GUI features and media capture. OpenCV.js utilizes standard Web APIs, such as Web workers and SIMD.js, to achieve high performance, and Canvas and WebRTC* to provide media and GUI capabilities.

The OpenCV.js API is inspired by the OpenCV C++ API and shares many similarities with it. For instance, C++ functions are exported to JavaScript with the same name and signature. Function overloading and default parameters are also supported in the JavaScript version. This makes migration to JavaScript easier for users who are already familiar with OpenCV development in C++.

Although OpenCV C++ classes are ported to JavaScript objects with the same member functions and properties, basic data types differ between the two versions. Table 2 shows the equivalent JavaScript data types for basic C++ data types. JavaScript engines use a garbage collector (GC) to manage program memory. However, GC activity has a negative impact on performance, so OpenCV.js uses static memory management. Programmers are responsible for freeing OpenCV.js objects when they are no longer being used. Because manual memory management is tedious, we’ve used JavaScript types for primitive OpenCV types such as cv::Point. All std::vectors are translated into JavaScript arrays, except for vectors of cv::Mat, which are exported as cv.Vector objects. This is particularly helpful because deleting the vector also deletes all of its cv::Mat elements.
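Because every OpenCV.js object must be freed explicitly, a small wrapper can reduce the risk of leaks when an exception is thrown. The following is a sketch only: `withTracked` is a hypothetical helper, not part of OpenCV.js, and `FakeMat` stands in for cv.Mat so that the example runs without the library loaded. The only assumption about real objects is that they expose a `delete()` method, as described above.

```javascript
// Hypothetical helper: tracks objects that expose delete() and frees them
// even if the processing callback throws.
function withTracked(fn) {
  const tracked = [];
  const track = (obj) => {
    tracked.push(obj);
    return obj;
  };
  try {
    return fn(track);
  } finally {
    // Free in reverse order of allocation.
    while (tracked.length > 0) tracked.pop().delete();
  }
}

// Stand-in for cv.Mat so the sketch runs without OpenCV.js.
class FakeMat {
  constructor() {
    this.deleted = false;
  }
  delete() {
    this.deleted = true;
  }
}

const mats = [];
const result = withTracked((track) => {
  const src = track(new FakeMat());
  const dst = track(new FakeMat());
  mats.push(src, dst);
  return "done";
});

console.log(result, mats.every((m) => m.deleted)); // done true
```

With the real library, the callback would create cv.Mat objects through `track(new cv.Mat())` and no explicit `delete()` calls would be needed at the end of the function.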

Figure 3 – OpenCV.js components and its interaction with applications and web APIs

Table 2. Exported JavaScript types for basic C++ types

C++ Type                                | JavaScript Type
Numerical types (e.g., int and float)   | Number
bool                                    | Boolean
enum                                    | Constant
primitive structures (e.g., cv::Point)  | Value object
std::vector (of cv::Mats)               | cv.Vector
std::vector (of primitive types)        | Array
std::string                             | String

We’ll present several examples to demonstrate various computer vision tasks using OpenCV.js. All of these examples work on top of a simple HTML Web page. We only present the logic part of the programs that deal with the OpenCV.js API.

Figure 4 shows how to apply the Canny algorithm to find the edges in an image. Input images are loaded from an HTML Canvas. For this purpose, we’ve provided a helper function that takes the Canvas name and returns a color image. Because the Canny algorithm works on grayscale images, we have to take the extra step at line 3 of invoking cv.cvtColor to convert the input image from color to grayscale. Finally, after getting the result of the Canny algorithm, we can render the image in the output canvas (line 6). Figure 5 shows a snapshot of this program running inside a browser.

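1  let src = cv.imread('canvasInput');            // read the input canvas as a color image
2  let dst = new cv.Mat();                        // output image for the detected edges
3  cv.cvtColor(src, src, cv.COLOR_RGB2GRAY, 0);   // convert to grayscale in place
4  // You can try different parameters
5  cv.Canny(src, dst, 50, 100, 3, false);
6  cv.imshow('canvasOutput', dst);                // render the edges on the output canvas
7  src.delete(); dst.delete();                    // free OpenCV.js objects manually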

Figure 4 – How to apply the Canny algorithm to find the edges in an image

Figure 5 – Rendering the image in line 6
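To show what the color-to-grayscale step computes per pixel, here is a plain JavaScript sketch that does not use OpenCV.js. The function name `rgbaToGray` is hypothetical. It applies the standard luma weights (0.299, 0.587, 0.114) to RGBA data laid out as in a Canvas ImageData buffer; the library's 8-bit implementation may round slightly differently.

```javascript
// Convert interleaved RGBA bytes to one grayscale byte per pixel.
function rgbaToGray(rgba, width, height) {
  const gray = new Uint8ClampedArray(width * height);
  for (let i = 0; i < width * height; ++i) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    // The alpha channel (index 4 * i + 3) is ignored.
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}

// A 2 x 1 image: one white pixel and one pure red pixel.
const pixels = new Uint8ClampedArray([255, 255, 255, 255, 255, 0, 0, 255]);
const gray = rgbaToGray(pixels, 2, 1);
console.log(Array.from(gray)); // [ 255, 76 ]
```

A loop like this is the kind of primitive kernel that the compiled library replaces with optimized code.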

The next example (Figures 6 and 7) uses Haar cascades to detect faces in an image. Because this algorithm works on grayscale images, the input image is converted at line 3. At line 7, we initialize a cascade classifier and load it with a model for detecting faces. Other models trained to detect different objects, such as cats and dogs, can be used as well. At line 9, we invoke detectMultiScale, which searches multiple copies of the input image scaled to different sizes. When finished, it returns a list of rectangles for possible faces in the image. At line 10, we iterate over those rectangles and use the cv.rectangle function to highlight that part of the image.

Figure 6 – Face detection using cascade classifiers

We’ve seen how to process single images in Web applications using OpenCV.js. Processing video boils down to processing a sequence of individual frames. The next example (Figures 8 and 9) demonstrates:

  • How to capture frames from a video element
  • How to subtract background from input video using the MOG2 algorithm
  • How to display the processed frame on an HTML Canvas

The cv.VideoCapture object provided by utils.js uses WebRTC to access and manage camera resources. This example assumes the input video contains 30 frames per second. So, every 1/30 of a second, it invokes the processVideo function. This function:

  • Reads the next video frame (line 18)
  • Applies background extraction function (line 19)
  • Displays the output foreground mask (line 20)

Finally, at line 23 the next invocation of the function is scheduled.

 1  let src = cv.imread('canvasInput');
 2  let gray = new cv.Mat();
 3  cv.cvtColor(src, gray, cv.COLOR_RGBA2GRAY, 0);
 4  let faces = new cv.RectVector();
 5  let faceCascade = new cv.CascadeClassifier();
 6  // load pre-trained classifiers
 7  faceCascade.load('haarcascade_frontalface_default.xml');
 8  // detect faces
 9  faceCascade.detectMultiScale(gray, faces);
10  for (let i = 0; i < faces.size(); ++i) {
11    let roiGray = gray.roi(faces.get(i));
12    let roiSrc = src.roi(faces.get(i));
13    let point1 = new cv.Point(faces.get(i).x, faces.get(i).y);
14    let point2 = new cv.Point(faces.get(i).x + faces.get(i).width,
15        faces.get(i).y + faces.get(i).height);
16    cv.rectangle(src, point1, point2, [255, 0, 0, 255], 3);
17    roiGray.delete(); roiSrc.delete();
18  }
19  cv.imshow('canvasOutput', src);
20  src.delete(); gray.delete(); faceCascade.delete();
21  faces.delete();

Figure 7 – Subtracting background for different frames of input video using the MOG2 method

Figure 8 – Rendering the image

 1  let video = document.getElementById('videoInput');
 2  let cap = new cv.VideoCapture(video);
 3
 4  let frame = new cv.Mat(video.height, video.width, cv.CV_8UC4);
 5  let fgmask = new cv.Mat(video.height, video.width, cv.CV_8UC1);
 6  let fgbg = new cv.BackgroundSubtractorMOG2(500, 16, true);
 7
 8  const FPS = 30;
 9  function processVideo() {
10    try {
11      if (!streaming) {  // streaming: flag assumed to be maintained by the page
12        // clean and stop
13        frame.delete(); fgmask.delete(); fgbg.delete();
14        return;
15      }
16      let begin = Date.now();
17      // start processing
18      cap.read(frame);
19      fgbg.apply(frame, fgmask);
20      cv.imshow('canvasOutput', fgmask);
21      // schedule the next one
22      let delay = 1000 / FPS - (Date.now() - begin);
23      setTimeout(processVideo, delay);
24    } catch (err) {
25      utils.printError(err);
26    }
27  }
28  // schedule the first frame
29  setTimeout(processVideo, 0);

Figure 9 – Capturing a video frame and subtracting the background
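The scheduling arithmetic in the video loop is worth isolating. The sketch below uses a hypothetical helper, `nextDelay`, to compute how long to wait before the next frame so that the loop targets a fixed frame rate. It clamps the result at zero to make explicit what happens when processing a frame takes longer than the frame budget.

```javascript
// Time to wait before the next frame, in milliseconds.
// fps: target frame rate; elapsedMs: time spent processing the current frame.
function nextDelay(fps, elapsedMs) {
  const budgetMs = 1000 / fps; // about 33.3 ms per frame at 30 FPS
  return Math.max(0, budgetMs - elapsedMs);
}

console.log(nextDelay(30, 10).toFixed(2)); // "23.33": processing was fast, so wait
console.log(nextDelay(30, 50)); // 0: over budget, run the next frame immediately
```

When the delay is zero on every frame, the loop can no longer keep up with the target rate and the effective frame rate is set by the processing time alone.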

The last example (Figures 10 and 11) demonstrates using a pre-trained DNN in Web applications. While in this example we use a DNN to recognize objects, DNNs can also be specialized to do other recognition tasks, such as background segmentation. At line 2, the program reads a Caffe framework model of GoogLeNet*. Other formats, such as Torch and TensorFlow, are also supported. In the next step, at line 6, we convert the input image into a blob that fits the network. Then, at lines 9 and 10, we forward the blob along the network and find the highest probability class. Figure 10 shows a snapshot of the program categorizing a sample image. (You can see more examples of OpenCV.js usage at https://docs.opencv.org.)

Figure 10 – Object detection example using GoogLeNet model

 1  let src = cv.imread('canvasInput');
 2  let net = cv.readNetFromCaffe('bvlc_googlenet.prototxt',
 3                                'bvlc_googlenet.caffemodel');
 4  if (net.empty())
 5    throw new Error("Failed to read net");
 6  let inputBlob = cv.blobFromImage(src, 1, new cv.Size(224, 224),
 7                                   new cv.Scalar(104, 117, 123));
 8
 9  net.setInput(inputBlob, "data");
10  let prob = net.forward("prob");
11  let minMax = cv.minMaxLoc(prob);
12  // keywords: array of class names, assumed to be loaded elsewhere
13  console.log("Best class: #" + minMax.maxLoc.x + " " +
14      keywords[minMax.maxLoc.x] + " Probability: " + minMax.maxVal);
15
16  prob.delete(); inputBlob.delete();
17  net.delete(); src.delete();

Figure 11 – Using a pre-trained DNN in Web applications
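The final step of the classification example, picking the most probable class from the network output, can be sketched without the library. The `argMax` function, the probabilities, and the labels below are made up for illustration. The sketch mirrors what the example obtains from the maximum location and uses as an index into the class-name list.

```javascript
// Return the index and value of the largest element.
function argMax(values) {
  let bestIndex = 0;
  for (let i = 1; i < values.length; ++i) {
    if (values[i] > values[bestIndex]) bestIndex = i;
  }
  return { index: bestIndex, value: values[bestIndex] };
}

// Illustrative network output and class names (not real model data).
const probs = new Float32Array([0.05, 0.7, 0.25]);
const labels = ["cat", "dog", "bird"];

const best = argMax(probs);
console.log("Best class: #" + best.index + " " + labels[best.index]); // Best class: #1 dog
```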

Performance Evaluation

OpenCV.js brings a lot of computer vision capabilities to the Web. To demonstrate their performance, we selected a number of vision benchmarks, from primitive kernels to more sophisticated vision applications, including:

  • Canny’s algorithm for edge detection
  • Finding faces using Haar cascades
  • Finding people using histograms of oriented gradients (HOG)

We used a Firefox* browser running on an Intel® Core™ i7-3770 processor with 8GB of RAM with Ubuntu* 16.04 for our setup and ran experiments over sequences of videos.

Figure 12 shows the average speedup of both simple JavaScript kernels and vision applications compared to their native equivalents, which use an OpenCV scalar build (i.e., one that doesn’t use parallelism). As you can see, the JavaScript performance is competitive. While we found WASM and asm.js performance to be close, the WASM version of the library is significantly faster to initialize and is more compact. Its size is about 5.3 MB, compared to 10.4 MB for the asm.js version.

Figure 12 – Performance evaluation of primitive kernels and vision applications running on Firefox

Making it Even Faster

Computer vision is computationally demanding. A lot of computations need to be performed on a massive number of pixels. For instance, each iteration of the baseline Canny, face, and people benchmarks takes on average 7 ms, 345 ms, and 323 ms, respectively, to process a single 800 x 600 resolution image. While Canny is fast to compute, face and people detection are still very expensive and cannot be used in real-time, interactive use cases.
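Converting these per-frame times into frame rates makes the gap concrete. The sketch below uses only the three timings quoted above; the 30 frames-per-second target is an assumption borrowed from the video example earlier in the article.

```javascript
// Average time per 800 x 600 frame, in milliseconds (from the text above).
const msPerFrame = { canny: 7, face: 345, people: 323 };
const TARGET_FPS = 30; // assumed real-time target

const report = {};
for (const [name, ms] of Object.entries(msPerFrame)) {
  const fps = 1000 / ms;
  report[name] = { fps: fps.toFixed(1), realTime: fps >= TARGET_FPS };
}

console.log(report.canny); // { fps: '142.9', realTime: true }
console.log(report.face); // { fps: '2.9', realTime: false }
console.log(report.people); // { fps: '3.1', realTime: false }
```

Face and people detection would need roughly a tenfold speedup to reach the assumed target, which motivates the parallel implementations discussed next.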

Fortunately, computer vision algorithms are inherently parallel, and good algorithm design and an optimized implementation can lead to significant speedups on parallel hardware. OpenCV comes with parallel implementations of algorithms for different architectures. We take advantage of two methods that target multicore processors and SIMD units to make the JavaScript version faster. We’ve skipped GPU implementations at the moment due to their complexity. The upcoming WebGPU* API can potentially be used to accelerate OpenCV on GPUs.

SIMD.js

SIMD.js [4, 5] is a new Web API that exposes processor vector capabilities to the Web. It is based on a common subset of the Intel SSE2 and ARM NEON* instruction sets that runs efficiently on both architectures. It defines vector instructions that operate on 128-bit-wide vector registers, which can hold four integers, four single-precision floating-point numbers, or 16 characters. Figure 13 shows how vector registers can add four integers with one CPU instruction.

Figure 13 – Scalar versus SIMD for the addition of four integers
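The difference between the scalar and vector forms can be modeled in plain JavaScript. This sketch does not call the SIMD.js API and will not run any faster than the scalar loop; it only illustrates the lane-wise semantics, treating each group of four 32-bit integers as one 128-bit register. The function names are hypothetical.

```javascript
// Scalar form: one addition per loop iteration.
function addScalar(a, b) {
  const out = new Int32Array(a.length);
  for (let i = 0; i < a.length; ++i) out[i] = a[i] + b[i];
  return out;
}

// Lane-wise model: each iteration stands for ONE vector instruction
// that adds four 32-bit lanes (4 x 32 bits = 128 bits).
function addLanes4(a, b) {
  const out = new Int32Array(a.length);
  let i = 0;
  for (; i + 4 <= a.length; i += 4) {
    out[i] = a[i] + b[i];
    out[i + 1] = a[i + 1] + b[i + 1];
    out[i + 2] = a[i + 2] + b[i + 2];
    out[i + 3] = a[i + 3] + b[i + 3];
  }
  // Leftover elements that do not fill a whole register.
  for (; i < a.length; ++i) out[i] = a[i] + b[i];
  return out;
}

const a = Int32Array.from([1, 2, 3, 4, 5, 6]);
const b = Int32Array.from([10, 20, 30, 40, 50, 60]);
console.log(Array.from(addLanes4(a, b))); // [ 11, 22, 33, 44, 55, 66 ]
```

With 8-bit data the same register holds 16 lanes instead of four, so one instruction covers more elements.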

SIMD has proven to be very effective in improving performance, especially for multimedia, graphics, and scientific applications [5, 6]. In fact, many OpenCV functions, such as core routines, are already implemented using vector intrinsics. We’ve adapted the work done by Peter Jensen, Ivan Jibaja, Ningxin Hu, Dan Gohman, and John McCutchan [7] to translate OpenCV vectorized implementations using SSE2 intrinsics into JavaScript with SIMD.js instructions. Inclusion of the SIMD.js implementation does not affect the library interface. Figure 14 shows the speedups obtained with SIMD.js on selected kernels and applications running on Firefox. Up to 8x speedup is obtained for primitive kernels. As expected, the speedup is higher for smaller data types. There are fewer vectorization opportunities in complex functions such as Canny, face, and people detection. Currently, SIMD.js can only be used in the asm.js context and is supported by the Firefox and Microsoft Edge* browsers. SIMD in WebAssembly is currently planned to have the same specification as SIMD.js. Hence, similar performance numbers are expected.

Figure 14 – Performance improvement using SIMD.js

Multithreading using Web Workers

JavaScript programs use Web workers for parallel processing of heavy computing tasks. Web workers communicate by message passing, which can incur significant cost, especially when passing large messages such as images. SharedArrayBuffer [8] has recently been proposed as storage that can be shared between Web workers, which can use it to implement a shared-memory parallel programming model. OpenCV uses its parallel_for_ framework to implement multithreaded versions of vision functions. The parallel_for_ framework can target multiple multithreading models, including POSIX* threads (Pthreads). With recent Emscripten* developments, it’s possible to translate the Pthreads API into equivalent JavaScript using Web workers with shared array buffers. OpenCV.js with multithreading support uses a pool of Web workers and allocates a worker when a new thread is spawned. It also exposes to JavaScript the OpenCV APIs that dynamically adjust concurrency, such as changing the number of concurrent threads (e.g., cv.SetNumThreads).
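The key property of a SharedArrayBuffer is that several typed-array views, potentially held by different workers, refer to the same bytes without copying. The sketch below shows this on a single thread for brevity; in a real application each view would live in a different Web worker, and browsers may require cross-origin isolation before exposing SharedArrayBuffer. The variable names are illustrative.

```javascript
// One block of memory, large enough for four 32-bit integers.
const shared = new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT);

// Two views over the same memory, standing in for two workers.
const viewA = new Int32Array(shared);
const viewB = new Int32Array(shared);

viewA[0] = 42; // a write through one view...
console.log(viewB[0]); // 42 ...is visible through the other, with no copy

// Atomics provide race-free read-modify-write operations on shared memory.
Atomics.add(viewA, 1, 5);
Atomics.add(viewB, 1, 5);
console.log(Atomics.load(viewA, 1)); // 10
```

Passing an image between workers through such a buffer avoids the per-message copy that makes plain message passing expensive.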

To study the performance impact of using multiple Web workers, we measured the performance of three application benchmarks that did not gain significantly from SIMD vectorization using different numbers of workers (up to 8). The OpenCV load balancing algorithm divides the workload evenly among threads. As shown in Figure 15, on a processor with eight logical cores, we obtained a 3x to 4x speedup. Note that a similar trend is observed on native Pthreads implementations of the same functions.

Figure 15 – Speedup achieved using multiple Web workers
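Dividing a workload evenly among workers can be sketched as a partition of image rows. The `splitRange` function below is a hypothetical illustration of even splitting; it is not OpenCV's actual load-balancing code. The 600 rows correspond to the 800 x 600 test images mentioned earlier.

```javascript
// Split the half-open range [0, total) into `workers` contiguous chunks
// whose sizes differ by at most one.
function splitRange(total, workers) {
  const ranges = [];
  const base = Math.floor(total / workers);
  const extra = total % workers; // the first `extra` chunks get one more item
  let start = 0;
  for (let w = 0; w < workers; ++w) {
    const size = base + (w < extra ? 1 : 0);
    ranges.push({ start, end: start + size });
    start += size;
  }
  return ranges;
}

const rows = splitRange(600, 8);
console.log(rows[0], rows[7]); // { start: 0, end: 75 } { start: 525, end: 600 }
```

Each worker would then process only the rows in its own range, reading from and writing to shared memory.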

Computer Vision for the Masses

This work brings years of OpenCV development in computer vision processing to the Web with high efficiency. It provides a collection of carefully selected functions including image processing, object detection, video analysis, feature extraction, and DNNs, among others. The results of our experiments show the framework’s high capability. Thanks to JavaScript portability, for the first time, a large collection of vision functions can be used not only in Web browsers but also on embedded devices and in desktop applications. For instance, it provides computer vision for Node.js-based Internet of Things (IoT) devices and JavaScript desktop development frameworks such as Electron*. Combined with recent developments in the Web platform, it offers a more efficient way to build new Web applications and experiences, such as emerging virtual and augmented reality. We’ve also provided a large collection of computer vision tutorials using OpenCV.js that we hope will be a good asset for education and research purposes.

The authors are grateful to Congxiang Pan, Gang Song, and Wenyao Gan for their contributions through the Google Summer of Code program, and would also like to thank the OpenCV founder, Dr. Gary Bradski, its chief architect, Vadim Pisarevsky, and Alex Alekin for their support and helpful feedback.

Learn More

We’ve developed extensive online resources to help developers and researchers learn more about OpenCV.js and computer vision in general:

OpenCV.js can also be used in Node.js-based environments. It’s published on the Node Package Manager (NPM) at https://www.npmjs.com/package/opencv.js.

References

1. Gary Bradski and Adrian Kaehler. Learning OpenCV: Computer vision with the OpenCV library. O’Reilly Media, Inc., 2008.
2. Alon Zakai. Emscripten: An LLVM-to-JavaScript compiler. In Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion, pages 301–312. ACM, 2011.
3. Andreas Haas, Andreas Rossberg, Derek L Schuff, Ben L Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. Bringing the Web up to speed with WebAssembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 185–200. ACM, 2017.
4. SIMD.js specification: http://tc39.github.io/ecmascript_simd/. 2017.
5. Ivan Jibaja, Peter Jensen, Ningxin Hu, Mohammad R Haghighat, John McCutchan, Dan Gohman, Stephen M Blackburn, and Kathryn S McKinley. Vector parallelism in JavaScript: Language and compiler support for SIMD. In Parallel Architecture and Compilation (PACT), 2015 International Conference on, pages 407–418. IEEE, 2015.
6. Sajjad Taheri. Bringing the power of simd.js to gl-matrix: https://hacks.mozilla.org/2015/12/bringing-the-power-of-simd-js-to-gl-matrix/. 2015.
7. Peter Jensen, Ivan Jibaja, Ningxin Hu, Dan Gohman, and John McCutchan. SIMD in JavaScript via C++ and Emscripten. In Workshop on Programming Models for SIMD/Vector Processing, 2015.
8. ECMAScript 2018 language specification: https://tc39.github.io/ecma262/. 2017.
