Feature extraction and encoding for video action recognition

Zuo, Zheming (2020) Feature extraction and encoding for video action recognition. Doctoral thesis, Northumbria University.

Text (Doctoral Thesis)
zuo.zheming_phd.pdf - Submitted Version



Video action recognition, including Third-person Action Recognition (TAR) and Egocentric Action Recognition (EAR), is one of the essential tasks in computer vision. It can be regarded as the capability of determining whether or not a given human action occurs in a video. Although advances in machine learning and deep learning have significantly improved action classification performance, several open questions remain far from fully resolved, such as: (1) uncertainty management in feature extraction; (2) estimated attention may not be concordant with subjective attention during feature extraction; and (3) dimensionality may blow up during feature encoding, limiting the practicality and applicability of visual systems.
In the first part, this PhD project presents a Histogram of Fuzzy Local Spatio-Temporal Descriptors (HFLSTD) to support uncertainty management when extracting a conventional gradient-based local feature descriptor, by estimating the contribution of each pixel towards each angular bin, adaptively controlled by a penalty parameter. The efficiency and efficacy of the HFLSTD have been confirmed on the domain benchmarks of two large-scale data sets, yielding competitive performance even in comparison with some recently proposed deep feature descriptors. Then, in order to extract feature descriptors from more informative 3-D attentional regions, the Gaze-Informed Descriptors (GD) are devised in a sparse manner by utilising human eye fixation in conjunction with estimated attention to inform the generation of a 3-D region of interest, and hence help to extract more informative visual feature descriptors in the context of EAR. The Saliency Descriptors (SD), on which the membership is based, are also developed in a dense manner for situations where human eye fixation information is not available. The effectiveness of GD and SD in enhancing classification performance is demonstrated not only on a collected EAR data set but also in a real-time memory aid system for Dementia and Parkinson’s patients to support health care.
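The idea of letting each pixel contribute to every angular bin, with a penalty parameter controlling how sharply that contribution falls off, can be illustrated with a small sketch. This is not the HFLSTD formulation from the thesis: the function name `fuzzy_orientation_histogram`, the exponential membership `exp(-penalty * d^2)`, and the default values of `n_bins` and `penalty` are all assumptions chosen for illustration, and the sketch works on a single 2-D patch rather than a spatio-temporal volume.

```python
import numpy as np

def fuzzy_orientation_histogram(patch, n_bins=8, penalty=4.0):
    """Soft-binned gradient orientation histogram for a 2-D patch.

    Assumption: membership decays as exp(-penalty * d^2), where d is the
    circular angular distance to a bin centre. This is an illustrative
    choice, not the membership function defined in the thesis.
    """
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy).ravel()
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi).ravel()

    # Bin centres evenly spaced over [0, 2*pi).
    centres = (np.arange(n_bins) + 0.5) * (2 * np.pi / n_bins)

    # Circular distance from every pixel orientation to every bin centre.
    diff = np.abs(angle[:, None] - centres[None, :])
    dist = np.minimum(diff, 2 * np.pi - diff)

    # Fuzzy membership of each pixel in each bin, normalised per pixel.
    membership = np.exp(-penalty * dist ** 2)
    membership /= membership.sum(axis=1, keepdims=True)

    # Accumulate magnitude-weighted memberships and L1-normalise.
    hist = (magnitude[:, None] * membership).sum(axis=0)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Example: a horizontal intensity ramp has gradients pointing along +x (angle 0).
patch = np.tile(np.arange(16.0), (16, 1))
hist = fuzzy_orientation_histogram(patch)
```

In this sketch a larger `penalty` pushes the histogram towards conventional hard binning, while a smaller one spreads each pixel's vote across neighbouring bins.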
In the second part of this work, the Saliency-Informed Spatio-Temporal Vector of Locally Aggregated Descriptor and Fisher Vector (SST-VLAD and SST-FV) are developed to address the inherent redundancy of both video action data sets and extracted feature descriptors, mitigating the curse of dimensionality in super-vector-based encoding schemes. This is achieved through a tentative proposition: select a minimum number of videos from the data set, and thereby only a small portion of the feature descriptors, using ranked video-wise saliency-based spatio-temporal scores, which in turn guide codebook generation. Extensive experimental results show that SST-VLAD and SST-FV have much lower space and time complexity and relatively higher action classification performance than VLAD and FV on one TAR and one EAR data set.
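The pipeline outlined above (rank videos by a saliency score, build the codebook from only the top-ranked videos, then encode) might look roughly like the following sketch. It uses plain VLAD with a small hand-written k-means. The abstract does not say how the saliency-based spatio-temporal scores are computed, so `saliency_scores` is a hypothetical, precomputed input here, and the helper names `select_top_videos`, `kmeans` and `vlad_encode` are likewise assumptions rather than the thesis implementation.

```python
import numpy as np

def select_top_videos(saliency_scores, n_keep):
    """Return indices of the n_keep highest-scoring videos (scores are assumed given)."""
    order = np.argsort(saliency_scores)[::-1]
    return order[:n_keep]

def kmeans(data, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means, used here only to build a small codebook."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres

def vlad_encode(descriptors, centres):
    """Standard VLAD: sum residuals per codeword, then power- and L2-normalise."""
    d = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    k, dim = centres.shape
    v = np.zeros((k, dim))
    for j in range(k):
        members = descriptors[labels == j]
        if len(members):
            v[j] = (members - centres[j]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))  # signed square-root (power) normalisation
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy data: 6 "videos", each with 50 four-dimensional descriptors.
rng = np.random.default_rng(42)
videos = [rng.normal(size=(50, 4)) for _ in range(6)]
saliency_scores = np.array([0.1, 0.9, 0.5, 0.7, 0.2, 0.3])  # hypothetical scores

kept = select_top_videos(saliency_scores, n_keep=2)
codebook = kmeans(np.vstack([videos[i] for i in kept]), k=3)
encoding = vlad_encode(videos[0], codebook)
```

The saving in this sketch comes from clustering only the descriptors of the kept videos; every video is still encoded against the resulting codebook.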

Item Type: Thesis (Doctoral)
Uncontrolled Keywords: feature extraction, feature encoding, action recognition, uncertainty management, visual attention
Subjects: G400 Computer Science
Department: Faculties > Engineering and Environment > Computer and Information Sciences
University Services > Graduate School > Doctor of Philosophy
Depositing User: John Coen
Date Deposited: 04 Jan 2021 11:25
Last Modified: 31 Jul 2021 14:18
URI: http://nrl.northumbria.ac.uk/id/eprint/45073
