Video captioning aims to describe every scene given the video using natural language, and it is one of the most challenging tasks in computer vision because it requires the association of the video to ...