Deep Learning-based Regional Image Caption Generation with Refined Descriptions

Kinghorn, Phil (2017) Deep Learning-based Regional Image Caption Generation with Refined Descriptions. Doctoral thesis, Northumbria University.

Kinghorn.philip_phd.pdf - Submitted Version


Recent research in image captioning generally focuses on short, relatively high-level captions. These captions generally lack detail and insight, omitting information that humans could easily, and generally would, report. This restricts the usefulness of existing systems in real-world applications. In this thesis, we propose the following solutions to address these problems.

The first stage proposes a region-based approach, which focuses on regions within images and describes them with attributes. These attributes add more meaning to standard classification labels, refining the label ‘dog’ produced by existing systems into the more detailed label ‘white spotted dog’. This adds a large degree of detail when used within template-based description generation. An application in healthcare is also explored, in which the system is paired with a visual agent. The agent can describe the environment and report potential hazards, as well as socialise through conversation.
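The idea of refining a class label with attributes and slotting it into a template can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Region` structure, the function names, and the sentence template are all hypothetical, chosen only to show how attribute-refined labels add detail to a template-based description.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A detected image region: class label plus predicted attributes.

    Hypothetical structure for illustration; not taken from the thesis.
    """
    label: str
    attributes: list[str] = field(default_factory=list)


def refine_label(region: Region) -> str:
    """Prefix the class label with its attributes, e.g. 'dog' -> 'white spotted dog'."""
    return " ".join([*region.attributes, region.label])


def describe(regions: list[Region]) -> str:
    """Fill a fixed sentence template (an assumed one) with the refined region labels."""
    if not regions:
        return "The image contains no detected objects."
    # Simplification: always uses the article "a"; a real system would handle "an" and plurals.
    phrases = [f"a {refine_label(r)}" for r in regions]
    if len(phrases) == 1:
        listing = phrases[0]
    else:
        listing = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return f"The image contains {listing}."


if __name__ == "__main__":
    regions = [Region("dog", ["white", "spotted"]), Region("ball", ["red"])]
    print(describe(regions))
```

The fixed template also shows the rigidity that such an approach carries: every description follows the same sentence pattern regardless of image content.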

The second stage improves upon the previous architecture by proposing another region-based architecture that removes the rigidity of templates. Instead, sentences are generated by a Recurrent Neural Network. Training this architecture on multiple smaller datasets allows for a quicker training stage, with less computing power required during both training and testing. An encoder-decoder structure is proposed to translate the detailed region labels into full image descriptions. This produces natural-sounding descriptive phrases that accurately depict the contents of an image.

The third stage proposes a hierarchically trained, end-to-end style system that generates an image description with the same ability to describe detections in detail, but without the need for multiple models. This system can utilise a humanoid robot’s vision and voice synthesis capabilities. Overall, the systems proposed in this research outperform many state-of-the-art methods for the refined image description generation task, especially on complex and out-of-domain images, such as images of paintings.

Item Type: Thesis (Doctoral)
Subjects: G400 Computer Science
Department: Faculties > Engineering and Environment > Computer and Information Sciences
University Services > Graduate School > Doctor of Philosophy
Depositing User: Ellen Cole
Date Deposited: 18 Feb 2019 15:35
Last Modified: 20 Sep 2022 10:00
