
Sequential Modeling in Large Vision Models

The authors introduce a Large Vision Model (LVM) trained on 1.64 billion unlabeled images, without using any linguistic data, that performs conditional image generation through visual prompting at test time. To do this, they define a common format, “visual sentences”, in which raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions can all be represented without any meta-knowledge beyond the pixels. The model is then trained to minimize a cross-entropy loss for next-token prediction. By training across various model sizes and levels of data diversity, they provide empirical evidence that the models scale effectively.
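
Concretely, if x_1, …, x_T are the discrete tokens of one visual sentence, the training objective is the standard autoregressive log-loss (generic notation, not taken verbatim from the paper):

L(θ) = − Σ_{t=1}^{T} log p_θ(x_t | x_1, …, x_{t−1})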

Today’s large vision models rely heavily on language data, which limits how directly they learn from visual information itself. This research tackles the need for a more direct, pixel-based route to training capable vision models.

The authors tokenize raw images into discrete visual tokens using VQGAN, concatenate tokens from images and annotations into “visual sentences”, and train a 3 billion parameter Transformer model to predict the next token.
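
The following is a minimal sketch of that training setup, not the authors' code: ToyVQGANTokenizer, ToyLVM, VOCAB_SIZE, and TOKENS_PER_IMAGE are illustrative stand-ins with assumed interfaces, and a real pipeline would use a pretrained, frozen VQGAN codebook and a full-size decoder-only Transformer.

```python
# Minimal sketch of the training setup described above (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192        # size of the visual codebook (assumed)
TOKENS_PER_IMAGE = 256   # e.g. a 16x16 grid of codes per image (assumed)


class ToyVQGANTokenizer(nn.Module):
    """Stand-in for a frozen VQGAN encoder mapping an image to discrete codes."""
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # A real tokenizer quantizes encoder features against a learned codebook;
        # here we only return random code indices with the right shape.
        return torch.randint(0, VOCAB_SIZE, (images.shape[0], TOKENS_PER_IMAGE))


class ToyLVM(nn.Module):
    """Stand-in decoder-only Transformer over visual tokens."""
    def __init__(self, d_model: int = 256, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        hidden = self.blocks(self.embed(tokens), mask=causal_mask)
        return self.head(hidden)              # logits over the visual vocabulary


def training_step(model, tokenizer, image_sequence):
    """Build one visual sentence (e.g. an image followed by its segmentation map,
    or consecutive video frames) and apply next-token cross-entropy."""
    parts = [tokenizer(img) for img in image_sequence]
    sentence = torch.cat(parts, dim=1)        # (batch, length) visual sentence
    logits = model(sentence[:, :-1])          # predict token t from tokens < t
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE),
                           sentence[:, 1:].reshape(-1))


tokenizer, model = ToyVQGANTokenizer(), ToyLVM()
frames = [torch.randn(2, 3, 256, 256) for _ in range(3)]   # dummy image sequence
print(training_step(model, tokenizer, frames).item())
```

The design choice this illustrates is that every data source, whether a raw frame or an annotation rendered as an image, is reduced to the same stream of discrete codes, so a single next-token objective covers all of them.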

The LVM demonstrates strong scalability as model size and data diversity increase. At test time, it can perform visual reasoning for various downstream tasks through suitable visual prompt construction, including video prediction (49.8 perplexity), semantic segmentation (50.8 mIoU), and object detection (49.6 mIoU).
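
Below is a minimal sketch of how such a visual prompt could be assembled at test time, again not the authors' code; it assumes the toy `model` and `tokenizer` interfaces from the training sketch above, and a real system would decode the generated codes back to pixels with the VQGAN decoder.

```python
# Minimal sketch of test-time visual prompting (not the authors' code). It
# assumes `model` and `tokenizer` follow the toy interfaces from the training
# sketch above; TOKENS_PER_IMAGE must match the tokenizer's output length.
import torch

TOKENS_PER_IMAGE = 256


@torch.no_grad()
def visual_prompt_generate(model, tokenizer, demo_pairs, query_image):
    """Concatenate (input, annotation) demonstration pairs and the query image
    into one visual sentence, then greedily generate the query's 'answer' tokens."""
    parts = []
    for image, annotation in demo_pairs:      # e.g. (frame, segmentation map) pairs
        parts += [tokenizer(image), tokenizer(annotation)]
    parts.append(tokenizer(query_image))
    tokens = torch.cat(parts, dim=1)          # (batch, length) conditioning prefix

    for _ in range(TOKENS_PER_IMAGE):         # generate one image's worth of codes
        logits = model(tokens)                # (batch, length, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy decoding
        tokens = torch.cat([tokens, next_token], dim=1)

    return tokens[:, -TOKENS_PER_IMAGE:]      # codes for the predicted image
```

With the toy components above, calling visual_prompt_generate(model, tokenizer, [(frame, seg_map)], new_frame) returns a block of code indices that plays the role of the predicted segmentation for new_frame; a real implementation would also cache attention keys and values rather than re-running the full prefix at every step.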

