Self-Supervised Video Anomaly Detection

Factories could improve worker safety and reduce costs from machine, robot and worker error through incisive use of state-of-the-art deep learning techniques. Specifically, deep learning can be used to detect anomalies in video recordings of factory workers. Such anomalies may include:

  • faults in the machines (e.g., a machine catching fire),
  • anomalous worker behavior (fatigue, unusual latency in movements, or injury),
  • improper use of safety uniforms, and
  • anomalies in robot performance.

Overall, deep learning affords novel and powerful techniques for video prediction and analysis. Accordingly, it is important to summarize the current state of the art in video analysis using deep learning techniques, along with the associated challenges.


PredNet (Lotter, Kreiman, Cox, 2016) applied to factory video. The chart plots the reconstruction error of frame-to-frame prediction. The reconstruction error spikes during an accident.


Single frames – Convolutional Neural Networks

Before discussing videos, we will quickly review analyses of single frames (i.e., images).  Modeled after the biological structure of visual processing found in primates, convolutional neural networks (or CNNs) have become the state of the art for analyzing and categorizing images. CNNs have several layers (e.g., GoogLeNet has 22), and thus they are a subset of what are known as deep learning algorithms. The layers overall have two main phases: a series of convolutions, followed by a classification phase.  In a basic CNN, a single convolution is constructed as two nodes paired across two layers, each consisting of several nodes. The first node of the pair is the actual convolution (roughly, a filtering) over small parts of the input (i.e., the image), applied across the entire image.

The convolution may be thought of as assessing the presence of a specific visual feature (e.g., vertical bars of a certain width) within localized parts of the image (a receptive field).  The second of the node pair is max-pooling; essentially this signals whether the visual feature is found anywhere within a local region of the image.
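The convolution and max-pooling pair described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the vertical-bar filter is written by hand (a trained CNN would learn its filters), the toy image and the helper names `convolve2d_valid` and `max_pool` are hypothetical, and, as in most CNN implementations, the "convolution" is computed as a sliding dot product without flipping the filter.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Response of the filter to one receptive field.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy 6x6 image containing a vertical bar in column 2 (illustrative data).
image = np.zeros((6, 6))
image[:, 2] = 1.0

# Hand-written vertical-bar detector, three pixels wide.
kernel = np.array([[-1.0, 2.0, -1.0]] * 3)

response = convolve2d_valid(image, kernel)  # shape (4, 4)
pooled = max_pool(response)                 # shape (2, 2)
```

The pooled map is large only in the blocks that cover the bar, so it records that the feature is present in that region without recording its exact position.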

For each convolution step (of 2 layers) there are several node pairs with different convolutions.  One important aspect of these node pairs is that they are independent of each other, and thus each represents only the particular convolution associated with that node pair. The output of max-pooling is then fed to all nodes of the subsequent convolutional layer.  At the end of each convolutional step the original images are represented as visual features of increasing complexity.  At the end of the convolutional phase the outputs are typically sent to a fully connected neural network for classification, for example, scene recognition represented as a caption (short text description) of the scene (Karpathy et al., 2014).

Going from single frames to video

While CNNs have been highly successful in the visual processing of images, the addition of temporal information in videos potentially represents a much richer and more complex set of data.  In particular, one might consider the 3D trajectories of objects moving within the video.  The physical constraints of moving objects allow for prediction of future locations of the objects.  Also, moving stimuli might offer cues or features for classification not available in single images.  This might be most clearly demonstrated in the phenomenon of biological motion.


Consider a stimulus in which you only see a number of dots placed upon a person’s limbs and torso (similar to what is done in motion capture for movie special effects).  A single snapshot may not indicate the action of the person, or even that the points represent a person.  However, animating these dots can clearly indicate the action of a person, such as dancing or walking.  We are highly tuned to this kind of motion signal, from a very early age, and we can easily identify the gender and even the individual from this degraded representation.  In terms of factory surveillance, it is highly plausible that the temporal component of video might be key to classifying anomalous human or robot behavior.  

One of the main issues, then, is how to expand or modify existing machine learning algorithms for image processing (i.e., CNNs) to optimally employ the spatiotemporal information (e.g., motion) found in video.  In the following sections we will review some approaches: optic flow, Kalman filters, 3D convolutions, and Long Short-Term Memory (LSTM) models.

Optic Flow

One approach is to add a representation of optic flow fields (Brox, Bruhn, Papenberg, and Weickert, 2004; Simonyan & Zisserman, 2014), which can be described as the localized motion signals across the entire video image.  The main issue in measuring optic flow is tracking the moving objects in the video. This is also known as the correspondence problem, or finding which group of pixels in one frame corresponds to the same group of displaced pixels in the next frame. There are several strategies and assumptions used to solve the correspondence problem (Brox, Bruhn, Papenberg, & Weickert, 2004). First, one might assume the corresponding pixels match or change slowly in grey values (the grey value constancy assumption, the gradient constancy assumption).  Second, one might apply a smoothness constraint on the motion (smoothness assumption). Third, one might assume that neighboring pixels have common motion (in Gestalt psychology, common fate).  

Finally, one might analyze the video at multiple scales, starting with a larger scale to find overall matches, and then to smaller scales to fine-tune the matches (coarse-to-fine).
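To make the correspondence problem concrete, here is a minimal sketch that uses brute-force block matching under the grey value constancy assumption. Note that this is a much simpler technique than the variational method of Brox et al. (2004) and is swapped in purely for illustration; the function name `match_block`, the block and search sizes, and the synthetic frames are all assumptions.

```python
import numpy as np

def match_block(prev_frame, next_frame, top, left, block=4, search=3):
    """Return the (dy, dx) shift that minimises the squared grey-value difference."""
    patch = prev_frame[top:top + block, left:left + block]
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidate blocks that fall outside the next frame.
            if y < 0 or x < 0:
                continue
            if y + block > next_frame.shape[0] or x + block > next_frame.shape[1]:
                continue
            candidate = next_frame[y:y + block, x:x + block]
            err = np.sum((candidate - patch) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

# Synthetic pair of frames: random texture shifted down 1 pixel and right 2 pixels.
rng = np.random.default_rng(0)
prev_frame = rng.random((16, 16))
next_frame = np.roll(prev_frame, shift=(1, 2), axis=(0, 1))

flow = match_block(prev_frame, next_frame, top=6, left=6)  # recovers (1, 2)
```

Repeating the search for every block gives a coarse flow field. Smoothness constraints and coarse-to-fine search, as described above, are what practical methods add to make such matches reliable on real video.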

Kalman Filters

A standard technique for predicting object trajectories is the Kalman filter.  Its use is actually much broader than this and extends to other temporally-based domains, such as econometrics and signal processing.  Generally, the Kalman filter predicts the state of a given moving object based on its current physical characteristics (say position, velocity, and acceleration), and also on any outside agents operating on the object.  It then updates its estimate using information from a noisy sensor before predicting the next time interval.

Some advantages of the Kalman filter include:

  • ability to incorporate a physical model,
  • ability to account for noise and smooth the trajectory, hence its status as a filter,
  • use of Bayesian inference, thus optimally weighting past information to make future predictions, and
  • relatively rapid calculation, as calculations are based only on the current state.
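The predict and update cycle can be sketched as follows for an object moving along one axis. This is a minimal sketch, assuming a constant-velocity physical model and a sensor that measures position only; the noise levels, the initial uncertainty, and the synthetic measurements are illustrative choices, not values from any of the cited work.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle; returns the new state estimate and covariance."""
    # Predict: push the current state through the physical model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: weight the noisy measurement z against the prediction.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

dt = 1.0                                   # one frame per time step (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])      # state is [position, velocity]
H = np.array([[1.0, 0.0]])                 # the sensor observes position only
Q = 1e-4 * np.eye(2)                       # process noise (assumed)
R = np.array([[0.25]])                     # measurement noise variance (assumed)

# Synthetic object moving 2 units per frame, observed with noise.
rng = np.random.default_rng(0)
true_positions = 2.0 * np.arange(1, 51)
measurements = true_positions + rng.normal(0.0, 0.5, size=50)

x, P = np.zeros(2), 10.0 * np.eye(2)       # vague initial guess
estimates = []
for z in measurements:
    x, P = kalman_step(x, P, np.array([z]), F, H, Q, R)
    estimates.append(x.copy())
estimates = np.array(estimates)            # columns: position, velocity
```

Each step uses only the previous estimate `x`, its covariance `P`, and the newest measurement, which is why the computation stays cheap no matter how long the track is. The velocity is never measured directly; the filter infers it from the sequence of positions.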

3D Convolutions

One approach is to apply spatiotemporal filtering or 3D convolutions to the video input.  The main principle of 3D convolutions is to simply extend the 2D convolutions in space in CNNs to the third dimension of time.  As CNNs do with spatial filters (i.e., convolutions), the model learns the spatiotemporal filters (i.e., integrating temporal information with 2D spatial information) for the given classification task.  

Also, the moment-to-moment positions of objects (i.e., the correspondence problem) generally do not need to be explicitly specified.  The 3D convolutions may be fixed in time, or might increase in spatial and temporal extent with time to incorporate more global information in the later frames (Karpathy et al., 2014).
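A spatiotemporal filter can be illustrated with a direct NumPy loop. In this sketch the filter is hand-made to respond to rightward motion, whereas a trained network would learn its filters from data; the helper name `convolve3d_valid` and the toy clips are assumptions for illustration.

```python
import numpy as np

def convolve3d_valid(video, kernel):
    """Slide a (time, height, width) kernel over a (T, H, W) clip without padding."""
    kt, kh, kw = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                window = video[t:t + kt, i:i + kh, j:j + kw]
                out[t, i, j] = np.sum(window * kernel)
    return out

# Toy clip: a bright dot moving one pixel to the right per frame.
video = np.zeros((5, 5, 8))
for t in range(5):
    video[t, 2, t + 1] = 1.0

# Hand-made filter spanning two frames: it matches a dot at x in one frame
# and at x + 1 in the next frame, i.e. rightward motion.
kernel = np.zeros((2, 1, 2))
kernel[0, 0, 0] = 1.0
kernel[1, 0, 1] = 1.0

response = convolve3d_valid(video, kernel)          # shape (4, 5, 7)

# The same dot held still gives a weaker peak response.
static = np.repeat(video[:1], 5, axis=0)
static_response = convolve3d_valid(static, kernel)
```

The moving dot drives the filter to its maximum of 2, while the stationary dot reaches only 1, so the filter separates the two clips without any explicit tracking of the dot.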

LSTM (Long Short-Term Memory) models

A final approach has been to incorporate Long Short-Term Memory (LSTM) models with CNNs and the approaches above. LSTMs are in a class of neural networks known as Recurrent Neural Networks (RNNs) that allow for the passing of information (the "memory") from one iteration to the next.  The iterations in this case would be the different frames of the video.

Specifically, LSTMs explicitly control the passing of information through input, output, and forget gates, the strengths of which are learned to optimize the length and decay of the remembered information.  The explicit gating of information in LSTMs has generally been more effective than standard RNNs, and in theory it is an ideal structure to integrate information across the frames of a video sequence.
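The gating can be written out as a single LSTM step applied frame by frame. This is a sketch of the standard LSTM equations with random, untrained weights; the sizes and the random "frame features" (standing in for per-frame CNN outputs) are assumptions, and in practice the weights would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the input, forget, output and candidate blocks."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])           # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])       # forget gate: how much old memory to keep
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values to write into memory
    c = f * c_prev + i * g        # updated memory cell
    h = o * np.tanh(c)            # updated hidden state
    return h, c

rng = np.random.default_rng(0)
d, n, T = 8, 4, 10                # feature size, hidden size, frames (illustrative)
W = rng.normal(0.0, 0.1, (4 * n, d))
U = rng.normal(0.0, 0.1, (4 * n, n))
b = np.zeros(4 * n)

# Stand-in for the CNN features of T consecutive video frames.
frame_features = rng.normal(size=(T, d))

h, c = np.zeros(n), np.zeros(n)
for x in frame_features:
    # The memory (h, c) is carried from one frame to the next.
    h, c = lstm_step(x, h, c, W, U, b)
```

After the loop, `h` summarises the whole sequence and could be passed to a classifier.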

In particular, LRCNs (Long-term Recurrent Convolutional Networks) combine hierarchical visual feature extractors like CNNs with LSTM models that can recognize and analyze temporal dynamics for sequential data (Donahue, Hendricks, Guadarrama and Rohrbach, 2015).  LRCNs are useful for video activity recognition, image caption generation and video description tasks. They are particularly useful for videos where there are long latencies of unknown length between important events.

Experimental Findings


Donahue, Hendricks, Guadarrama and Rohrbach (2015) explored two variants of the LRCN architecture for activity recognition: one where the LSTM is placed after the first fully connected layer of the CNN (LRCN-fc6), and another where the LSTM is placed after the second fully connected layer of the CNN (LRCN-fc7). The LRCN-fc6 network yielded the best results for both RGB and flow inputs, improving upon the baseline network by 0.49% and 5.27%. For image description, the Flickr30k and COCO2014 data sets were used, both of which have five sentence annotations per image.

The authors' integrated LRCN approach outperformed baseline models (m-RNN, deFrag) on retrieval and sentence generation metrics. For video description on the TACoS data set, the authors' LSTM decoder model outperformed SMT architectures on BLEU-4 score, with simpler decoders giving better performance.

Supervised LSTMs/Optical Flow

Ng et al. (2015) examined the efficacy of various pooling architectures in CNNs, and of an LSTM connected to a CNN, on the Sports-1M and UCF-101 data sets. Their goals were to quantify the effect of the number of frames and frame rates on classification performance, and to understand the importance of motion information through optical flow models. The authors posit that incorporating information across longer video sequences will enable better video classification. They explore two major classes of models to achieve this end: 1) feature pooling methods, which max-pool local information through time, and 2) LSTMs, whose hidden state evolves with each subsequent frame. State-of-the-art performance was achieved on the Sports-1M and UCF-101 benchmarks, supporting their claim that learning should occur over longer video sequences. The authors found that taking optical flow into account was necessary to obtain such results, and that in some cases use of an LSTM was critical for obtaining the best results possible.
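The feature pooling idea can be illustrated with a few lines of NumPy. This is only a sketch of max-pooling over time, not the specific architectures that Ng et al. compared; the random features stand in for per-frame CNN outputs, and the sizes are arbitrary.

```python
import numpy as np

def temporal_max_pool(frame_features):
    """Collapse (frames, features) into one clip descriptor by taking the max over time."""
    return frame_features.max(axis=0)

rng = np.random.default_rng(0)
frame_features = rng.random((30, 16))       # stand-in CNN features for 30 frames
clip_descriptor = temporal_max_pool(frame_features)

# Max-pooling is blind to frame order: shuffling the frames changes nothing.
shuffled = frame_features[rng.permutation(30)]
same_descriptor = np.array_equal(temporal_max_pool(shuffled), clip_descriptor)
```

This order invariance is the key contrast with the second class of models: an LSTM's hidden state evolves frame by frame, so it can in principle distinguish sequences that contain the same frames in a different order.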

Unsupervised Learning

Srivastava, Mansimov and Salakhutdinov (2015) explored unsupervised LSTM models for learning video representations. The authors claim that current supervised techniques, while useful, have gaps that only unsupervised learning could fill. Since videos are inherently structured spatially and temporally, they are ideal candidates for such models. Specifically, four variants of the LSTM Encoder-Decoder model were explored:

  • LSTM Autoencoder
  • LSTM Future Predictor
  • Conditional Decoder
  • Composite Model

The UCF-101 and HMDB-51 data sets were used, and improved performance over baseline was found for action recognition. For MNIST and image patches, the four model variants were compared. The authors found the following ranking: conditional model with conditional future predictor > composite model > future predictor. On the moving MNIST digits data set, even though the model was trained on much shorter time scales, it was able to generate persistent motion over long periods of time into the future. Similar results were found when comparing variants on unsupervised pretraining methods. When compared with state-of-the-art action recognition benchmarks, the Composite LSTM model performed on par but not always better.

PredNet (Lotter, Kreiman, Cox, 2016) is an unsupervised predictive neural network that utilizes four basic parts: an input convolutional layer, a recurrent representation layer, a prediction layer, and an error representation. The recurrent convolutional representation layer predicts the layer output for the subsequent frame. The difference between the input and prediction layers is then taken to generate an error representation, which is split into positive and negative error populations. This error is passed forward through a convolutional layer to become the input layer of the next level, and a copy of the error signal is sent to the representation layer along with input from the representation of the next level of the network. PredNet is tolerant of object transformations and performed well compared to other models on the training set. To test natural image sequences, PredNet was trained on the KITTI data set, which was captured by a roof-mounted camera on a car driving around an urban environment in Germany. It outperformed baselines. Overall, PredNet demonstrates the potential of unsupervised learning techniques for many facets of video prediction.
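The split error representation, and its use as an anomaly signal, can be sketched as follows. This is not PredNet itself: a trivial "next frame equals current frame" predictor stands in for the trained network, the frames are synthetic, and the threshold rule (mean plus three standard deviations over a calibration window assumed to be normal) is an illustrative assumption.

```python
import numpy as np

def split_error(target, prediction):
    """Stack the rectified positive and negative prediction errors."""
    positive = np.maximum(target - prediction, 0.0)
    negative = np.maximum(prediction - target, 0.0)
    return np.stack([positive, negative])

# Synthetic clip: 60 nearly constant 8x8 frames with a sudden event in frame 40.
rng = np.random.default_rng(0)
frames = 0.5 + 0.01 * rng.normal(size=(60, 8, 8))
frames[40] += 0.5

# Stand-in predictor (an assumption): predict that each frame repeats the previous one.
predictions = frames[:-1]
targets = frames[1:]

# One error score per predicted frame; scores[i] belongs to frame i + 1.
scores = np.array([split_error(t, p).mean() for t, p in zip(targets, predictions)])

# Calibrate a threshold on the first 30 scores, which are assumed to be normal.
calibration = scores[:30]
threshold = calibration.mean() + 3.0 * calibration.std()
flagged = np.flatnonzero(scores > threshold) + 1
```

Frame 40 is flagged because it cannot be predicted from frame 39. Frame 41 is flagged as well, since the stand-in predictor simply carries the event frame forward. A learned predictor would replace the stand-in, but the scoring and thresholding logic would stay the same.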

In the Wild Actions

Guadarrama et al. (2016) explored annotating video clips for out-of-domain, “in the wild” actions utilizing broad verb and object coverage. Here, training did not have to occur on the exact videos for a given activity. If an accurate prediction could not be generated by a pretrained model, a less specific answer that is also plausible is found. Semantic hierarchies of subject-verb-object (SVO) triplets and zero-shot detection were utilized as the basis for these models. A corpus of short, diverse YouTube videos of activities “in the wild” was chosen as the data set. With regard to binary accuracy, the authors found that visual classifiers performed significantly better than a triplet-prior baseline, and that semantic grouping improved performance. Their model also performed significantly better than baseline on WUP accuracy scores, sentence generation, activity recognition, and human evaluation using Mechanical Turk.


Video offers a potentially richer and more complex data source than still images because of its temporal information, but it also poses the machine learning challenge of how to capture that information.  Generally CNNs have been successful with image processing, while attempts at video processing have integrated CNNs with several approaches.  These include Kalman filters, optic flow, 3D convolutions, and LSTMs, with LSTMs seeming to hold the most promise.  In the context of factory surveillance, such algorithms could be highly beneficial in the detection of anomalous behavior of humans and robots.  This could lead to systemwide improvements in maintenance, repairs, safety compliance, and worker injury detection.


Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., & Baskurt, A. (2011, November). Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding (pp. 29-39). Springer Berlin Heidelberg.

Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004, May). High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision (pp. 25-36). Springer Berlin Heidelberg.

Ciresan, D. C., Meier, U., Masci, J., Maria Gambardella, L., & Schmidhuber, J. (2011, July). Flexible, high performance convolutional neural networks for image classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence (Vol. 22, No. 1, p. 1237).

Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2625-2634).

Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1), 221-231.

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1725-1732).

Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576).

Sonntag, D., Zillner, S., van der Smagt, P., & Lörincz, A. (2016). Overview of the CPS for Smart Factories Project: Deep Learning, Knowledge Acquisition, Anomaly Detection and Intelligent User Interfaces. In Industrial Internet of Things (pp. 487-504). Springer International Publishing.

Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2.

Wang, L., & Sng, D. (2015). Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey. arXiv preprint arXiv:1512.03131.

Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4694-4702).


Satyugjit Virk, Gurkaran Buxi, Arshak Navruzyan

Steve Shimozaki