The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with 5 reference sentences written by human annotators.
Source: Guiding Long-Short Term Memory for Image Caption Generation
Image Source: Dual-Path Convolutional Image-Text Embedding with Instance Loss