Automated Image Captioning

Published: August 20, 2025

This project explores the use of Vision Language Models for Automated Image Captioning. We stack a ResNet image encoder with Attention and LSTMs to predict token sequences from image inputs. We successfully implement the architecture from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, train the model on the Flickr8k dataset with a custom tokenizer and observe a 25% improvement on the BLEU metric.

Tech Stack: Python, PyTorch, Kubernetes, NLTK

Detailed report and code here.

Share on

Twitter Facebook LinkedIn

Pratik Doshi

Share on