Authors: Desmond Elliott and Ákos Kádár
In this paper, the authors propose an extension of an NMT system that uses image information to learn visually grounded representations.
The system is a multi-task model with two decoders: the first generates a textual translation of the source text, and the second predicts a vector representation of the image that the source text describes. The first decoder corresponds to standard NMT; the second corresponds to image prediction from a caption (the input text sequence). The image decoder is a simple feed-forward layer that transforms the average of the bi-directional annotations of the source text into an image vector and is trained with a discriminative (margin-based) objective function. The goal is to minimize the cosine distance between the estimated image vector and the ground-truth image vector while maximizing the cosine distance between the estimated vector and other contrastive instances (see eq. 15).
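To make the image-decoder objective concrete, here is a minimal NumPy sketch of a margin-based cosine loss of the kind described above. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tanh non-linearity, the margin value, and the summation over contrastive instances are all my own choices, and the exact form of eq. 15 in the paper may differ.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two non-zero vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def predict_image_vector(annotations, W):
    """Single feed-forward layer over the mean encoder annotation.

    annotations: (T, d_h) bi-directional encoder states for T source tokens.
    W:           (d_img, d_h) projection matrix (hypothetical parameter name).
    The tanh non-linearity is an assumption made for this sketch.
    """
    return np.tanh(W @ annotations.mean(axis=0))

def margin_loss(v_hat, v_true, contrastive, margin=0.1):
    """Hinge loss: the true image should be closer to v_hat than each
    contrastive image by at least `margin` (value chosen arbitrarily here)."""
    d_pos = cosine_distance(v_hat, v_true)
    return sum(
        max(0.0, margin + d_pos - cosine_distance(v_hat, v_neg))
        for v_neg in contrastive
    )

# Toy usage with random data; the dimensions are arbitrary.
rng = np.random.default_rng(0)
annotations = rng.normal(size=(7, 16))   # 7 source tokens, 16-dim states
W = rng.normal(size=(8, 16))             # project to an 8-dim image space
v_hat = predict_image_vector(annotations, W)
v_true = rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(3)]
loss = margin_loss(v_hat, v_true, negatives)
```

The loss is zero once the predicted vector is closer to its own image than to every contrastive image by the margin, so only the violating contrastive instances contribute gradient in training.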
They reach state-of-the-art performance (Moses PBSMT).
The results depend on which image features are predicted: Inception-v3 features give better results than ResNet and VGG features.
My two cents:
This work is an example of transfer learning in which one task helps another. In our system, we are not satisfied with the attention over the images: we found that the most relevant attention is captured by the input text (the image is not so important). This model avoids that problem.
I am surprised that generating image vectors with a simple feed-forward network from averaged textual annotations yields an improvement.