Automatic generation of image descriptions: a qualitative analysis of the descriptions
Natural language description of images: a qualitative analysis
Abstract
Image captioning is a challenging problem that lies at the intersection of computer vision
and natural language generation. The task involves generating a fully-fledged natural
language sentence that accurately summarizes the contents of an image. Image captioning
is also a cornerstone of real-world applications with significant practical impact,
ranging from aiding visually impaired users to personal assistants to intuitive human-robot
interaction.
Progress in image captioning has been hailed as a prominent success of Artificial
Intelligence. It has been reported that, under certain metrics such as BLEU or CIDEr, state-of-the-art
techniques surpass human performance. Thus, a natural question arises: do
humans and machines speak the same language?
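For illustration, a minimal sketch of how one such metric is computed, using NLTK's sentence-level BLEU; the candidate and reference captions below are invented for the example:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Hypothetical human-written reference captions, tokenized.
    references = [
        "a dog runs across the grass".split(),
        "a brown dog is running on a lawn".split(),
    ]
    # Hypothetical machine-generated caption.
    candidate = "a dog is running on the grass".split()

    # BLEU scores n-gram overlap between the candidate and the references;
    # smoothing avoids zero scores when higher-order n-grams do not match.
    score = sentence_bleu(references, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")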
An observation that is well established in linguistics is that different speakers, or even the
same speaker on different occasions, produce different descriptions when presented with the same image. This observation
has been overlooked by today's systems, yet it poses serious questions for both
the development of algorithms and their evaluation. This thesis therefore examines
the premises on which state-of-the-art image captioning algorithms
are built. Are they trying to emulate or predict the behaviour of individual speakers
in a given situation? With the aim of shedding light on this question, a model based on
the encoder-decoder architecture was implemented. The output of the model was qualitatively
analyzed with respect to two factors: (1) whether it is biased towards frequent captions in the training
set; and (2) whether better image representations enrich the language production.
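The thesis does not spell out the implementation here; as a rough orientation, a minimal sketch of the encoder-decoder idea in PyTorch follows, in which a precomputed image feature vector conditions an LSTM that emits the caption word by word. All dimensions and names are illustrative assumptions, not the thesis's actual model:

    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        """Minimal encoder-decoder captioner: a projected image feature
        is fed as the first input step of an LSTM language model."""
        def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.img_proj = nn.Linear(feat_dim, embed_dim)  # encoder side
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image_feats, captions):
            # Prepend the projected image feature as a pseudo-token, then
            # let the LSTM predict each next word of the caption.
            img = self.img_proj(image_feats).unsqueeze(1)    # (B, 1, E)
            words = self.embed(captions)                     # (B, T, E)
            inputs = torch.cat([img, words], dim=1)          # (B, T+1, E)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                          # (B, T+1, V)

    # Toy usage with random data (dimensions are placeholders).
    model = CaptionDecoder(feat_dim=2048, vocab_size=1000)
    feats = torch.randn(4, 2048)             # e.g. CNN image features
    caps = torch.randint(0, 1000, (4, 12))   # token ids of captions
    logits = model(feats, caps)
    print(logits.shape)                      # torch.Size([4, 13, 1000])

Under this framing, the two analysis questions correspond to inspecting whether the decoder's outputs collapse onto high-frequency training captions, and to swapping in richer image features (a stronger CNN encoder) and observing the effect on the generated language.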