Multimodal video classification with deep neural networks

Postgraduate Thesis uoadl:2800089

Unit:
Track / specialization: Signal and Information Processing and Learning (NTUA)
Informatics
Deposit date:
2018-09-29
Year:
2018
Author:
Pittaras Nikiforos
Supervisors info:
Stavros Perantonis, Research Director, Institute of Informatics and Telecommunications, NCSR "Demokritos"
Original Title:
Πολυτροπική κατηγοριοποίηση βίντεο με βαθιά νευρωνικά δίκτυα
Languages:
English
Greek
Translated title:
Multimodal video classification with deep neural networks
Summary:
Given the recent abundance of video data, automatic video classification tools have become important components of many video machine learning tasks. Video is inherently multimodal and thus offers a variety of information sources that can be exploited to further aid classification. In this study we examine research questions concerning the effect of the visual, audio and temporal video modalities on video classification.

To process the visual and audio modalities, we extract frame and audio-spectrogram sequences from random video segments. We adopt a shared deep representation approach for the visual and audio data, using deep features extracted from a fully-connected layer of an AlexNet-based DCNN. For multimodal fusion, we examine a variety of early direct-fusion methods, i.e. approaches that aggregate information from the visual and audio modalities into a single multimodal representation; specifically, we use averaging, concatenation and max pooling. In addition, we apply sequence-bias methods borrowed from image description, which we call input-bias and state-bias fusion. Finally, we perform late fusion of video-level classification scores, examining linear combination and max pooling of the marginal predictions.

Regarding the temporal information present in video data, we examine its contribution by comparing the fully-connected, feed-forward softmax classification layer, which processes the input sequence in an aggregation-based manner, to the sequence-aware LSTM model, which is able to model temporal interdependencies in the input. We apply these approaches (named the FC and LSTM workflows, respectively) both to the separate visual and audio modalities and to the multimodal fusion schemes.

A set of experimental evaluations is performed on multiple video classification datasets to examine each research question. The results indicate that the LSTM workflow performs better on visual data, while the FC approach fares better on the audio modality. The relative performance of the visual and audio modalities depends on the underlying dataset and annotation, as reflected by the superiority of the audio modality in the audio-oriented AudioSet and its inferior results, compared to the visual modality, on the other datasets in the multimodal experiments. Regarding multimodal fusion, simple late video-level linear-combination fusion works best, despite its practical disadvantages, with the max-pooling variant performing close to the single-modality baselines. Among the remaining methods, averaging and concatenation of modality encodings work best for the FC and LSTM workflows respectively, while the sequence-bias approaches do not perform as well as in the image description task. We verify the complementarity of the visual and audio modalities, with multimodal techniques outperforming the single-modality baselines on each dataset, extract guidelines towards achieving this, and establish a multimodal DNN baseline per dataset and workflow.
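To make the fusion terminology in the summary concrete, the following is a minimal NumPy sketch, not the thesis implementation, of the early direct-fusion operators (averaging, concatenation, max pooling of modality encodings) and of the late fusion of video-level classification scores (linear combination, max pooling of the marginal predictions). The mean-pooling temporal aggregation, the feature dimension 4096, the weight w and the function names are illustrative assumptions, not details taken from the thesis.

```python
# Minimal sketch (not the thesis code) of the fusion operators described above,
# assuming per-video deep features have already been extracted: a (frames, d_v)
# matrix for the visual modality and a (segments, d_a) matrix for audio.
import numpy as np

def early_fusion(visual_feats, audio_feats, mode="avg"):
    """Early direct fusion: merge modality encodings into one multimodal vector.
    Each modality is first aggregated over time by mean pooling (an assumption
    made for this sketch), then combined by averaging, concatenation or max."""
    v = visual_feats.mean(axis=0)          # (d_v,)
    a = audio_feats.mean(axis=0)           # (d_a,)
    if mode == "avg":                      # element-wise average, needs d_v == d_a
        return (v + a) / 2.0
    if mode == "concat":
        return np.concatenate([v, a])      # (d_v + d_a,)
    if mode == "max":                      # element-wise max, needs d_v == d_a
        return np.maximum(v, a)
    raise ValueError(f"unknown fusion mode: {mode}")

def late_fusion(visual_scores, audio_scores, mode="linear", w=0.5):
    """Late fusion of per-modality video-level class scores: linear combination
    or element-wise max pooling of the marginal predictions (w is illustrative)."""
    if mode == "linear":
        return w * visual_scores + (1.0 - w) * audio_scores
    if mode == "max":
        return np.maximum(visual_scores, audio_scores)
    raise ValueError(f"unknown fusion mode: {mode}")

# Usage with random stand-in features and scores:
rng = np.random.default_rng(0)
fused_vec = early_fusion(rng.normal(size=(16, 4096)),
                         rng.normal(size=(20, 4096)), mode="concat")
fused_scores = late_fusion(rng.random(10), rng.random(10), mode="linear", w=0.6)
print(fused_vec.shape, fused_scores.shape)   # (8192,) (10,)
```

Note that the element-wise operators (averaging, max) require the two modality encodings to share a dimensionality, which is consistent with the shared deep representation for visual and audio data mentioned in the summary; concatenation carries no such constraint.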
Main subject category:
Science
Keywords:
Machine Learning, Neural Networks, Multimodal, Classification, Deep Learning
Index:
Yes
Number of index pages:
4
Contains images:
Yes
Number of references:
141
Number of pages:
100
msc_thesis.pdf (2 MB)