Generative Multimodal Models are In-Context Learners
The main objective of this work is to demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be substantially enhanced through effective scaling-up.
The central problem tackled is that current multimodal systems struggle to match the human ability to solve multimodal tasks in context, i.e., with only a few demonstrations or simple instructions.
The authors propose Emu2, a 37B-parameter generative multimodal model trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 consists of a visual encoder, a visual decoder, and a multimodal transformer. Images are tokenized by the visual encoder into a continuous embedding space and interleaved with text tokens for autoregressive modeling.
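To make the interleaving idea concrete, here is a minimal sketch of autoregressive modeling over a sequence that mixes continuous image embeddings with text tokens. This is not the released Emu2 code; the module names, dimensions, and the simple transformer backbone are illustrative assumptions.

```python
# Illustrative sketch: interleave projected image embeddings with text token
# embeddings and model the joint sequence autoregressively.
import torch
import torch.nn as nn

class InterleavedMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, vis_dim=1024, n_layers=4, n_heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Projects continuous visual features into the transformer's embedding space.
        self.visual_proj = nn.Linear(vis_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)   # predicts the next text token
        self.vis_head = nn.Linear(d_model, vis_dim)     # regresses the next visual embedding

    def forward(self, text_ids, image_feats):
        txt = self.text_embed(text_ids)                 # (B, T_txt, d_model)
        img = self.visual_proj(image_feats)             # (B, T_img, d_model)
        # Interleave along the sequence dimension: image tokens followed by text tokens.
        seq = torch.cat([img, txt], dim=1)
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.backbone(seq, mask=mask)
        return self.lm_head(hidden), self.vis_head(hidden)

# Toy usage with random inputs.
model = InterleavedMultimodalLM()
text_ids = torch.randint(0, 32000, (2, 16))             # 16 text tokens per sample
image_feats = torch.randn(2, 64, 1024)                  # 64 visual tokens per image
text_logits, visual_preds = model(text_ids, image_feats)
```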
Emu2 is first pretrained on the captioning task alone, using both image-text and video-text paired datasets. Its visual decoder is initialized from SDXL-base and acts as a visual detokenizer implemented as a diffusion model: the VAE is kept frozen while the weights of the diffusion U-Net are updated. Emu2-Chat is derived from Emu2 by fine-tuning on conversational data, and Emu2-Gen is fine-tuned on complex compositional generation tasks.
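A hedged sketch of this decoder setup follows: initialize from SDXL-base, freeze the VAE, and train only the diffusion U-Net. It uses the Hugging Face diffusers API; the model id, learning rate, and training loop details are assumptions, not the authors' recipe.

```python
# Sketch: load SDXL-base components, freeze the VAE, update only the U-Net.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

base = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed SDXL-base checkpoint id
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")

# The VAE stays frozen; only the U-Net receives gradient updates, so the
# diffusion model learns to decode visual embeddings back into images.
vae.requires_grad_(False)
vae.eval()
unet.train()

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)
```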
Results indicate that Emu2 achieves state-of-the-art few-shot performance on multiple visual question-answering datasets, with performance improving as the number of in-context examples increases. Emu2 also learns to follow visual prompts in context, showcasing strong multimodal reasoning on in-the-wild tasks. When instruction-tuned to follow specific instructions, Emu2 further sets new state-of-the-art results on challenging tasks such as question-answering benchmarks for large multimodal models and open-ended subject-driven generation.