3260 papers • 126 benchmarks • 313 datasets
Self-Instruct is a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. It generates instruction, input, and output samples from a language model, then filters out invalid or similar ones before using the remainder to finetune the original model.
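The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-overlap (Jaccard) similarity measure, and the 0.7 threshold are all assumptions chosen for brevity.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately crude assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-set overlap between two instructions, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_instructions(candidates, pool, max_similarity=0.7):
    """Return the candidates that are non-empty and not too similar to
    anything already in the pool or already accepted in this pass."""
    kept = list(pool)   # everything a new candidate is compared against
    added = []          # only the newly accepted candidates
    for cand in candidates:
        if not cand.strip():
            continue    # invalid: empty generation
        if any(jaccard(cand, seen) >= max_similarity for seen in kept):
            continue    # too similar to an existing instruction
        kept.append(cand)
        added.append(cand)
    return added
```

In a bootstrapping loop, the accepted instructions would be added to the pool before the next round of generation, so the pool grows only with instructions that differ from what it already holds.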
The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM if scaled to an order of magnitude more experience than in previous investigations. The first cross-dataset generalization experiments are also conducted.
QLoRA finetuning on a small, high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. The work also finds that current chatbot benchmarks cannot be trusted to accurately evaluate the performance levels of chatbots.
This paper presents LLaVA (Large Language and Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. It also introduces GPT-4-generated visual instruction tuning data and makes the data, the model, and the code base publicly available.
Tk-Instruct is a transformer model trained to follow a variety of in-context instructions (plain-language task definitions or k-shot examples). It outperforms existing instruction-following models such as InstructGPT by over 9% on the authors' benchmark despite being an order of magnitude smaller.
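To make "in-context instructions" concrete, here is a minimal sketch of how a task definition and k demonstration examples might be assembled into a single model input. The field labels ("Definition:", "Input:", "Output:") and the `build_prompt` helper are hypothetical and are not taken from the paper.

```python
def build_prompt(definition, examples, query, k=2):
    """Combine a plain-language task definition, up to k (input, output)
    demonstrations, and the query into one prompt string.
    The layout is an illustrative assumption, not the authors' format."""
    parts = [f"Definition: {definition.strip()}"]
    for i, (inp, out) in enumerate(examples[:k], start=1):
        parts.append(f"Example {i}\nInput: {inp}\nOutput: {out}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Classify the sentiment of the review as positive or negative.",
    [("Great film, loved it.", "positive"), ("A waste of time.", "negative")],
    "The plot was dull.",
    k=1,
)
```

Setting `k=0` gives a definition-only (zero-shot) prompt, while a larger `k` adds more demonstrations.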
A zero-initialized attention mechanism with zero gating is proposed, which adaptively injects new instructional cues into LLaMA while effectively preserving its pre-trained knowledge on traditional vision and language tasks, demonstrating the superior generalization capacity of the approach.
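The idea of zero gating can be illustrated with a small single-query sketch in plain Python. It assumes that attention over the learnable prompt tokens is computed separately from attention over the ordinary word tokens and is scaled by a gate that starts at zero. The function names, the separate softmax, and the `tanh` squashing are assumptions made for illustration, not a reproduction of the authors' code.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Standard scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gated_prompt_attention(query, word_keys, word_values,
                           prompt_keys, prompt_values, gate=0.0):
    """Add the prompt tokens' contribution, scaled by tanh(gate).
    With gate == 0 the output equals ordinary attention over the words,
    so the pretrained behaviour is untouched at the start of training."""
    base = attend(query, word_keys, word_values)
    prompt = attend(query, prompt_keys, prompt_values)
    g = math.tanh(gate)
    return [b + g * p for b, p in zip(base, prompt)]
```

Because the gate is a learnable scalar initialized to zero, the new instructional signal enters gradually as training moves the gate away from zero, rather than disturbing the pretrained model from the first step.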
A model is designed that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them.
Point-LLM is presented as the first 3D large language model (LLM) to follow 3D multi-modal instructions. It injects the semantics of Point-Bind into pre-trained LLMs such as LLaMA, requiring no 3D instruction data while exhibiting superior 3D and multi-modal question-answering capacity.
This work augments LLaMA-Adapter by unlocking more learnable parameters and proposes an early fusion strategy that feeds visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. It achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.