3260 papers • 126 benchmarks • 313 datasets
Dense captioning is the task of detecting salient regions or events in an image, video, or 3D scene and describing each of them with a short natural-language phrase.
These leaderboards are used to track progress in Dense Captioning.
Use these libraries to find Dense Captioning models and implementations.
This work proposes to inject the 3D world into large language models, introducing a new family of 3D-LLMs that take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, and navigation.
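A minimal PyTorch sketch of the core idea (not the paper's code, with hypothetical class names and dimensions): per-point 3D features are linearly projected into the language model's embedding space and prepended to the text-token embeddings.

```python
import torch
import torch.nn as nn

class PointFeatureProjector(nn.Module):
    """Maps per-point 3D features into the LLM embedding space (sketch)."""
    def __init__(self, point_dim=256, llm_dim=4096):  # hypothetical sizes
        super().__init__()
        self.proj = nn.Linear(point_dim, llm_dim)

    def forward(self, point_feats):       # (batch, n_points, point_dim)
        return self.proj(point_feats)     # (batch, n_points, llm_dim)

proj = PointFeatureProjector()
point_tokens = proj(torch.randn(1, 1024, 256))   # 3D "tokens"
text_embeds = torch.randn(1, 32, 4096)           # stand-in for LLM text embeddings
llm_input = torch.cat([point_tokens, text_embeds], dim=1)  # joint input sequence
```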
This work proposes a new model that identifies all events in a single pass of the video while simultaneously describing them in natural language, and introduces a captioning module that uses contextual information from past and future events to jointly describe all events.
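The context idea can be sketched with self-attention over all detected events, so each event's caption conditions on both past and future events. This is an illustration under assumed feature sizes, not the paper's actual module.

```python
import torch
import torch.nn as nn

class EventContextFusion(nn.Module):
    """Self-attention across detected events: each event attends to every
    other event, including future ones, before captioning (sketch)."""
    def __init__(self, dim=512, heads=8):  # hypothetical sizes
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, event_feats):       # (batch, n_events, dim)
        ctx, _ = self.attn(event_feats, event_feats, event_feats)
        return ctx                        # context-enriched event features

fused = EventContextFusion()(torch.randn(2, 5, 512))  # 5 events per video
```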
A model that decomposes both images and paragraphs into their constituent parts is developed, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language.
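The hierarchical decoding can be sketched as follows, assuming PyTorch and hypothetical dimensions: a sentence-level RNN run over pooled region features emits one topic vector per sentence, and a word-level RNN seeded by each topic decodes that sentence. This is a simplified illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalParagraphRNN(nn.Module):
    """Sentence RNN emits one topic vector per sentence from pooled region
    features; a word RNN seeded by each topic decodes the words (sketch)."""
    def __init__(self, feat_dim=512, hidden=512, vocab=10000, max_sents=6):
        super().__init__()
        self.sent_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.word_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.word_out = nn.Linear(hidden, vocab)
        self.max_sents = max_sents

    def forward(self, region_feats, word_embeds):
        # region_feats: (batch, n_regions, feat_dim) from detected regions
        # word_embeds:  (batch, max_sents, n_words, hidden), teacher forcing
        pooled = region_feats.mean(dim=1, keepdim=True)        # crude pooling
        topics, _ = self.sent_rnn(pooled.repeat(1, self.max_sents, 1))
        logits = []
        for s in range(self.max_sents):
            h0 = topics[:, s].unsqueeze(0)       # topic seeds the word RNN
            out, _ = self.word_rnn(word_embeds[:, s], h0)
            logits.append(self.word_out(out))
        return torch.stack(logits, dim=1)  # (batch, max_sents, n_words, vocab)
```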
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
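A rough sketch of the single-forward-pass structure, in PyTorch with made-up layer sizes: a convolutional backbone computes a feature map once, region features are pooled from it, and a recurrent head decodes captions. The real FCLN predicts its own region proposals with a differentiable localization layer; here proposals are passed in to keep the sketch short.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class DenseCapSketch(nn.Module):
    """One forward pass: backbone feature map -> pooled region features ->
    recurrent caption decoder, plus a per-location box-regression head."""
    def __init__(self, hidden=512, vocab=10000):  # hypothetical sizes
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU())
        self.box_head = nn.Conv2d(256, 4, 1)     # per-location box offsets
        self.caption_rnn = nn.LSTM(256, hidden, batch_first=True)
        self.word_out = nn.Linear(hidden, vocab)

    def forward(self, images, boxes):
        # boxes: list of (n_i, 4) tensors; the real FCLN predicts these
        # internally with its differentiable localization layer.
        feats = self.backbone(images)
        regions = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=0.25)
        region_vecs = regions.mean(dim=(2, 3)).unsqueeze(1)   # (n_boxes, 1, 256)
        words, _ = self.caption_rnn(region_vecs)
        return self.box_head(feats), self.word_out(words)
```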
A new model pipeline based on two novel ideas, joint inference and context fusion, is proposed, which achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm.
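Context fusion can be illustrated with a small gated unit that mixes each region descriptor with a global image feature; this is a hypothetical simplification, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Gated mix of a region descriptor with a global context vector (sketch)."""
    def __init__(self, dim=512):  # hypothetical size
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, region_feat, global_feat):   # both (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([region_feat, global_feat], dim=-1)))
        return g * region_feat + (1 - g) * global_feat  # fused descriptor
```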
The Joint Event Detection and Description Network (JEDDi-Net) is presented, which encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions.
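A minimal PyTorch sketch of the encoder side, with hypothetical shapes: 3D convolutions encode the clip, and a proposed temporal segment is pooled from the resulting feature sequence. The proposal network and the two-level hierarchical LSTM captioner are omitted.

```python
import torch
import torch.nn as nn

class C3DEncoder(nn.Module):
    """3D-convolutional clip encoding followed by temporal pooling of one
    proposed segment (sketch; proposal net and caption LSTM omitted)."""
    def __init__(self, dim=256):  # hypothetical size
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(64, dim, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, clip, start, end):
        # clip: (batch, 3, T, H, W); [start, end) is a proposed temporal span
        feats = self.conv(clip).mean(dim=(3, 4))   # spatial pooling -> (batch, dim, T)
        return feats[:, :, start:end].mean(dim=2)  # pooled proposal feature

enc = C3DEncoder()
proposal_feat = enc(torch.randn(1, 3, 16, 64, 64), start=4, end=12)
```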
This technical report briefly describes the submission of a multi-event captioning model to the dense video captioning task of the ActivityNet Challenge 2020, which achieves a 9.28 METEOR score on the test set.
This paper proposes MORE, a Multi-Order RElation mining model, to generate more descriptive and comprehensive captions in 3D dense captioning, outperforming the current state-of-the-art method.
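First-order relation mining of this kind can be sketched as a pairwise MLP over object features; this is a hypothetical module, not MORE's actual architecture, and higher orders would aggregate over triples and beyond.

```python
import torch
import torch.nn as nn

class PairwiseRelationModule(nn.Module):
    """First-order relation mining: an MLP over every ordered pair of object
    features; re-aggregating the result would give higher orders (sketch)."""
    def __init__(self, dim=256):  # hypothetical size
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, obj_feats):                  # (n_objects, dim)
        n = obj_feats.size(0)
        a = obj_feats.unsqueeze(1).expand(n, n, -1)
        b = obj_feats.unsqueeze(0).expand(n, n, -1)
        rel = self.mlp(torch.cat([a, b], dim=-1))  # (n, n, dim) pairwise relations
        return obj_feats + rel.mean(dim=1)         # relation-enhanced objects
```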