A Hierarchical and Regional Deep Learning Architecture for Image Description Generation

Kinghorn, Philip, Zhang, Li and Shao, Ling (2019) A Hierarchical and Regional Deep Learning Architecture for Image Description Generation. Pattern Recognition Letters, 119. pp. 77-85. ISSN 0167-8655

Accepted_manuscript.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives 4.0.

Official URL: https://doi.org/10.1016/j.patrec.2017.09.013


This research proposes a distinctive deep learning network architecture for image captioning and description generation. Specifically, we propose a hierarchically trained deep network to increase the fluency and descriptiveness of the generated image captions. The proposed deep network consists of an initial region proposal step followed by two key stages for image description generation. The initial step is based upon the Region Proposal Network from Faster R-CNN. It generates regions of interest, which are then used to annotate and classify human and object attributes. The first key stage of the proposed system generates a detailed label description for each region of interest. The second stage uses a Recurrent Neural Network (RNN)-based encoder-decoder structure to translate these regional descriptions into a full image description. In particular, the proposed deep network model can label scenes, objects, and human and object attributes simultaneously, which is achieved through multiple individually trained RNNs.
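The data flow described above (region proposals, then per-region descriptions, then a single composed description) can be pictured with the minimal Python sketch below. It is not the authors' implementation: `describe_image`, `RegionDescription`, the box convention, and the three stand-in functions are all hypothetical names introduced only for illustration, and the stand-ins return fixed strings where the paper uses a Region Proposal Network and trained RNNs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box convention: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class RegionDescription:
    box: Box
    text: str  # label description produced for this region


def describe_image(
    image: object,
    propose_regions: Callable[[object], List[Box]],
    describe_region: Callable[[object, Box], str],
    compose: Callable[[List[RegionDescription]], str],
) -> str:
    """Run the three steps in order and return the full description."""
    boxes = propose_regions(image)                        # initial step
    regional = [RegionDescription(b, describe_region(image, b))
                for b in boxes]                           # stage 1
    return compose(regional)                              # stage 2


# Stand-ins for the trained models (assumptions, fixed outputs only).
def fake_proposals(image: object) -> List[Box]:
    return [(0, 0, 50, 100), (60, 20, 40, 40)]


def fake_region_labeller(image: object, box: Box) -> str:
    return "young man, smiling" if box[0] == 0 else "red ball"


def fake_composer(regions: List[RegionDescription]) -> str:
    return "; ".join(r.text for r in regions)


caption = describe_image(None, fake_proposals, fake_region_labeller,
                         fake_composer)
print(caption)
```

Passing the three steps in as callables mirrors the abstract's point that the components are trained individually: each stand-in could be swapped for a real model without changing the surrounding flow.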

The empirical results indicate that our work is comparable to existing research and considerably outperforms existing state-of-the-art methods when evaluated with out-of-domain images from the IAPR TC-12 dataset, even though our system is not trained on images from any of the image captioning datasets. When evaluated with several well-known evaluation metrics, the proposed system achieves an improvement of ∼60% in BLEU-1 over existing methods on the IAPR TC-12 dataset. Moreover, compared with related methods, the proposed deep network requires substantially fewer data samples for training, leading to a much-reduced computational cost.
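For readers unfamiliar with the BLEU-1 metric mentioned above, the following is a minimal sentence-level sketch in Python: clipped unigram precision multiplied by the standard brevity penalty. It is an illustration only, not the evaluation script used in the paper. The function name `bleu1`, the whitespace tokenisation, and the lowercasing are assumptions, and published scores are normally computed at corpus level with an established toolkit.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: str, references: List[str]) -> float:
    """Sentence-level BLEU-1: clipped unigram precision x brevity penalty."""
    cand = candidate.lower().split()          # naive tokenisation (assumption)
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each word, the highest count seen in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)

    # Clip each candidate word count by that reference maximum.
    clipped = sum(min(count, max_ref[word])
                  for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Reference length closest to the candidate length (ties: shorter one).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))

    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) > ref_len:
        bp = 1.0
    else:
        bp = math.exp(1 - ref_len / len(cand))
    return bp * precision


score = bleu1("a man rides a horse", ["a man rides a horse"])
```

Clipping is what stops a caption that repeats one common word from scoring well: the candidate "the the the" against the reference "the cat" receives credit for only one of its three words.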

Item Type: Article
Subjects: G400 Computer Science
Department: Faculties > Engineering and Environment > Computer and Information Sciences
Depositing User: Becky Skoyles
Date Deposited: 02 Oct 2017 11:22
Last Modified: 01 Aug 2021 13:05
URI: http://nrl.northumbria.ac.uk/id/eprint/32015
