Large World Model and YOLOv9

A new open-source bomb dropped when researchers from the Berkeley AI Research lab, led by Prof. Pieter Abbeel, released the Large World Model (LWM), a family of general-purpose, large-context, multimodal autoregressive models.

These models were trained on a mix of multimodal datasets (text, images, and videos). Using next-token prediction, they can generate data across all of these modalities over a context of up to 1M tokens.
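To make "next-token prediction across modalities" concrete, here is a minimal sketch of autoregressive sampling. The `model(tokens)` interface is a hypothetical stand-in, not LWM's actual API; it assumes logits over a shared vocabulary of text tokens and tokenized image/video patches:

```python
import torch

def sample_next_token(model, tokens, temperature=1.0):
    # Hypothetical interface: model(tokens) returns logits of shape
    # (batch, seq_len, vocab) over a shared text + visual-token vocabulary.
    logits = model(tokens)[:, -1, :] / temperature
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

def generate(model, prompt_tokens, max_new_tokens=64):
    # Standard autoregressive loop: append one sampled token at a time.
    tokens = prompt_tokens
    for _ in range(max_new_tokens):
        tokens = torch.cat([tokens, sample_next_token(model, tokens)], dim=1)
    return tokens
```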

Problem: 

Current language models are limited in their understanding of the world because they are restricted to short sequences of text, images, and short clips. Learning from long videos remains a challenge due to memory constraints, computational complexity, and a lack of suitable datasets.

Solution: 

The RingAttention technique was used to gradually and cost-efficiently scale the training context from 4K to 1M tokens. To handle the challenges of training on both video and language, the models were trained on progressively longer sequences, with a weighted loss that balances the contributions of language and vision (sketched below).
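As an illustration, such a progressive schedule might look like the following sketch. The stage lengths and the loss weight here are placeholders, not the paper's exact values:

```python
# Placeholder context-length stages (in tokens); the training context
# grows gradually so early stages stay cheap.
stages = [4_096, 32_768, 131_072, 1_048_576]

def mixed_loss(text_loss, vision_loss, w_text=0.5):
    # Weighted contribution of language and vision to the objective;
    # w_text is a hypothetical knob, not the paper's reported weight.
    return w_text * text_loss + (1.0 - w_text) * vision_loss

for seq_len in stages:
    # Resume from the previous stage's checkpoint and train on
    # sequences of length `seq_len` before moving to the next stage.
    ...
```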

RingAttention is an attention mechanism that improves how language models handle very large context sizes. By distributing the input sequence across multiple devices and rotating key/value blocks around a ring, the attention output can be computed without ever materializing the full attention matrix on a single device.
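The core idea can be simulated on a single machine. Below is a rough sketch of blockwise ring attention with an online softmax, where each "device" holds one query block and key/value blocks rotate around the ring; this is a didactic simplification, not the distributed TPU implementation:

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Single-process simulation of ring attention. Each 'device' i holds
    one query block and receives key/value blocks from its ring neighbor,
    so the full attention matrix is never materialized in one place."""
    n = len(q_blocks)
    outputs = []
    for i in range(n):
        q = q_blocks[i]
        # Online-softmax accumulators for numerically stable streaming.
        m = np.full(q.shape[0], -np.inf)   # running row max
        l = np.zeros(q.shape[0])           # running softmax normalizer
        acc = np.zeros_like(q)             # running weighted sum of values
        for step in range(n):
            j = (i + step) % n             # K/V block arriving this step
            s = q @ k_blocks[j].T / np.sqrt(q.shape[-1])
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)      # rescale old accumulators
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / l[:, None])
    return np.concatenate(outputs, axis=0)
```

Because each device only ever holds one query block and one key/value block at a time, memory scales with the block size rather than the full sequence length, which is what makes a 1M-token context feasible.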

Performance: 

LWM beats Gemini Pro on single-needle retrieval and ties with GPT-4. The authors also report results on multi-needle retrieval and long-video understanding.
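For context, a single-needle retrieval test hides one fact at a random depth inside long filler text and asks the model to fetch it; accuracy across depths and context lengths gives the retrieval score. A minimal sketch of how such a prompt might be built (the details are an assumption, not the authors' exact harness):

```python
import random

def make_needle_prompt(haystack_paragraphs, needle, question):
    # Insert the 'needle' fact at a random depth in the filler text,
    # then append the retrieval question at the end.
    pos = random.randint(0, len(haystack_paragraphs))
    context = haystack_paragraphs[:pos] + [needle] + haystack_paragraphs[pos:]
    return "\n\n".join(context) + "\n\n" + question
```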

Like its closed-source counterparts, LWM is capable of generating high-quality videos (as Sora does) and answering questions over an hour-long video (as Gemini 1.5 Pro does).

The TPU-optimized code, along with reproducible tests and model checkpoints, is available.

YOLOv9 is out! It's a real-time object detection model that surpasses prior convolution- and transformer-based real-time detectors.

Problem: 

Current deep learning methods lose critical information as data passes through successive layers, leading to suboptimal predictions. This issue stems from the information bottleneck: network architectures fail to preserve the input information through layer transformations, hurting both accuracy and efficiency.
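In information-theoretic terms, this is the data processing inequality: each successive transformation can only lose information about the input. With $I$ denoting mutual information and $f_{\theta}$, $g_{\phi}$ successive network stages:

```latex
I(X, X) \ge I\big(X, f_{\theta}(X)\big) \ge I\big(X, g_{\phi}(f_{\theta}(X))\big)
```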

Solution: 

The researchers developed Programmable Gradient Information (PGI) to preserve complete input information, ensuring reliable gradient updates for weight optimization. They then designed the Generalized Efficient Layer Aggregation Network (GELAN), a lightweight architecture that uses gradient path planning for efficient information flow. Together, these innovations address information loss and optimize network performance (a rough sketch follows).
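Below is a loose PyTorch sketch of a GELAN-style block based on my reading of the paper, not the official implementation: split the channels CSP-style, run one branch through a stack of convolutional blocks, and aggregate every intermediate output so gradients have short, planned paths back to the input. PGI's auxiliary reversible branch, which supplies reliable gradients during training only, is omitted for brevity:

```python
import torch
import torch.nn as nn

class GELANBlock(nn.Module):
    """Illustrative GELAN-style block (an approximation, not the
    official YOLOv9 module). `channels` must be even."""
    def __init__(self, channels, depth=2):
        super().__init__()
        half = channels // 2
        self.split = nn.Conv2d(channels, channels, 1)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1),
                nn.SiLU(),
            )
            for _ in range(depth)
        )
        # Aggregate the bypass branch plus every intermediate feature map,
        # so each stage keeps a direct gradient path to the output.
        self.fuse = nn.Conv2d(half * (depth + 2), channels, 1)

    def forward(self, x):
        x = self.split(x)
        a, b = x.chunk(2, dim=1)   # bypass branch, compute branch
        feats = [a, b]
        for block in self.blocks:
            b = block(b)
            feats.append(b)
        return self.fuse(torch.cat(feats, dim=1))

# Usage: y = GELANBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```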

Performance: 

YOLOv9, leveraging PGI and GELAN, demonstrated remarkable improvements on the MS COCO dataset, showcasing better parameter utilization and outperforming existing models. Specifically, it achieved higher accuracy with fewer parameters and less computation, validating the effectiveness of the proposed methods on real-world object detection tasks.

Join Upaspro to get email updates on news in AI and finance.
