
Deep dive: Llama 3 from scratch, LinearBoost, LoRA Learns Less and Forgets Less

In this post, we’ll explore three advancements pushing the boundaries of AI and machine learning. First, we dive into the intricacies of Llama 3, implemented from scratch in Python, where every component, from attention mechanisms to tokenization, is explained step by step, making it a must-read for anyone interested in model architecture. Next, we look at LinearBoost, a new boosting algorithm built on a linear classifier, which is reported to outperform traditional GBDTs such as CatBoost and XGBoost in both accuracy and response time across five benchmark datasets. Lastly, we delve into the debate on Low-Rank Adaptation (LoRA) for fine-tuning large language models, examining why LoRA may not match full fine-tuning in specialized domains yet offers notable regularization benefits. These insights are not only educational but also essential for staying at the forefront of AI research and application.

Llama 3 Implemented From Scratch in Python

A new repository, praised by Andrej Karpathy, provides a detailed implementation of Llama 3, breaking down each component, from the matrix multiplications in the attention mechanism to positional encoding. Users can load the tensors directly from Meta’s official Llama 3 model files after downloading the weights.

Core Components Covered

  • Tokenization Process: Uses the tiktoken library with Meta-Llama-3-8B’s tokenizer model. Special tokens and mergeable ranks are loaded to convert text into tokens, and the resulting token embeddings are RMS-normalized before entering the first layer (see the tokenizer sketch after this list).
  • Attention Mechanism: Unrolls the weight matrices for queries, keys, values, and outputs. Explains splitting each query vector into pairs, viewing them as complex numbers, and rotating them with RoPE (rotary positional embedding) using the rope_theta parameter from the model’s config, which yields the rotated query vector (see the RoPE sketch below).
  • Multi-Head Attention Operations: Covers the query-key score matrix multiplication and the causal mask that hides future tokens, so each position attends only to itself and earlier tokens. The attention-score matrix maps token relationships, and multiplying the scores by the values produces the final attention vector per head (see the masked-attention sketch below).
  • Feedforward Network: Uses a SwiGLU feed-forward network to further process the attention-updated embeddings (see the SwiGLU sketch below). Every one of the 32 transformer layers repeats these operations, and the final embedding is normalized and decoded into the next-token prediction.
  • Visualizations and Practical Examples: Offers practical examples and visualizations, such as heatmaps of attention scores, to aid in understanding Llama 3’s architecture.
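
To make the tokenization step concrete, here is a minimal sketch of wiring Meta’s tokenizer.model file into a tiktoken Encoding. The file path and the special-token subset are illustrative assumptions, not copied from the repo; the split pattern is the one published in Meta’s Llama 3 reference tokenizer.

```python
# Sketch: building a tiktoken Encoding from Meta's tokenizer.model file.
# Path and special-token list are assumptions for illustration.
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"   # assumed location of the downloaded weights
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)

special_tokens = [
    "<|begin_of_text|>", "<|end_of_text|>",
    "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>",
]  # subset of Llama 3's special tokens, for illustration

tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    # regex split pattern from Meta's Llama 3 reference tokenizer
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={tok: len(mergeable_ranks) + i for i, tok in enumerate(special_tokens)},
)

tokens = tokenizer.encode("the answer to the ultimate question of life, the universe, and everything is ")
print(tokens[:10])
```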
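
The RoPE rotation can be sketched for a single head as below. Shapes and the rope_theta value follow the Llama-3-8B config (head_dim=128, rope_theta=500000.0); the sequence length and random queries are placeholders.

```python
# Sketch: rotating one head's query vectors with RoPE via complex multiplication.
import torch

seq_len, head_dim = 17, 128
rope_theta = 500000.0
q = torch.randn(seq_len, head_dim)            # queries for one head: [tokens, head_dim]

# one frequency per pair of dimensions: theta^(-2i/d) for i = 0 .. d/2 - 1
freqs = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(torch.arange(seq_len).float(), freqs)  # [tokens, head_dim/2]
freqs_cis = torch.polar(torch.ones_like(angles), angles)    # complex e^{i * m * theta_i}

# split each query vector into pairs, view as complex numbers, rotate, flatten back
q_pairs = q.float().view(seq_len, -1, 2)                    # [tokens, head_dim/2, 2]
q_complex = torch.view_as_complex(q_pairs)                  # [tokens, head_dim/2]
q_rotated = torch.view_as_real(q_complex * freqs_cis).reshape(seq_len, head_dim)
print(q_rotated.shape)  # torch.Size([17, 128])
```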
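
The masked attention step for a single head reduces to a few lines; this sketch uses random tensors in place of the projected, rotated queries, keys, and values.

```python
# Sketch: per-head attention with a causal mask.
import math
import torch

seq_len, head_dim = 17, 128
q = torch.randn(seq_len, head_dim)   # rotated queries for one head
k = torch.randn(seq_len, head_dim)   # rotated keys for one head
v = torch.randn(seq_len, head_dim)   # values for one head

# query-key scores, scaled by sqrt(head_dim)
scores = (q @ k.T) / math.sqrt(head_dim)                     # [tokens, tokens]

# causal mask: -inf above the diagonal so each token only sees itself and earlier tokens
mask = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)
scores = torch.softmax(scores + mask, dim=-1)

attention = scores @ v                                        # [tokens, head_dim]
print(attention.shape)
```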
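
Finally, a sketch of the SwiGLU feed-forward block, using the w1/w2/w3 naming from the Llama weights. Dimensions follow Llama-3-8B (model dim 4096, FFN hidden dim 14336); the random weights stand in for the loaded tensors.

```python
# Sketch: SwiGLU feed-forward applied to the attention-updated, RMS-normalized embeddings.
import torch
import torch.nn.functional as F

dim, hidden_dim, seq_len = 4096, 14336, 17
x = torch.randn(seq_len, dim)          # normalized embeddings after attention

w1 = torch.randn(hidden_dim, dim)      # gate projection
w3 = torch.randn(hidden_dim, dim)      # up projection
w2 = torch.randn(dim, hidden_dim)      # down projection

# SwiGLU: silu(x @ w1.T) gates (x @ w3.T), then project back down with w2
out = F.silu(x @ w1.T) * (x @ w3.T)
out = out @ w2.T                        # [tokens, dim], added back to the residual stream
print(out.shape)
```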

This detailed, step-by-step implementation makes Llama 3’s architecture and inner workings accessible for further experimentation and research.

LinearBoost outperforms CatBoost, XGBoost, LightGBM on five benchmark datasets

LinearBoost is based on boosting a linear classifier to significantly enhance performance. The reported benchmarks show it outperforming traditional GBDT algorithms in both accuracy and response time across five well-known datasets.
The key to LinearBoost’s performance lies in what happens at each estimator stage. Decision trees, the building blocks of GBDTs, split on features sequentially; LinearBoost instead uses a linear classifier as its base learner, weighing all available features simultaneously at every boosting step. This comprehensive feature integration allows for more robust decisions at each stage, as the rough sketch below illustrates.
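
The sketch below illustrates only the general idea of boosting a linear base learner: LinearBoost itself builds on the SEFR linear classifier, whereas this example substitutes scikit-learn’s LogisticRegression inside AdaBoost (assuming scikit-learn ≥ 1.2 for the `estimator` parameter). It is not the LinearBoost implementation.

```python
# Rough illustration only: boosting a linear base learner so each stage
# weighs all features at once (LinearBoost's actual base learner is SEFR).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

boosted_linear = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(
        estimator=LogisticRegression(max_iter=1000),  # linear base learner: all features at once
        n_estimators=100,
        algorithm="SAMME",
    ),
)

print(cross_val_score(boosted_linear, X, y, cv=5, scoring="f1").mean())
```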

LoRA Learns Less and Forgets Less

Problem: Low-Rank Adaptation (LoRA) is a parameter-efficient finetuning method for large language models, but its performance compared to full finetuning is unclear, particularly in specialized domains like programming and mathematics.

Solution: The study compares LoRA and full finetuning on programming and mathematics, using instruction finetuning (≈100K prompt-response pairs) and continued pretraining (≈10B tokens). It evaluates performance and regularization effects, analyzing perturbation ranks and comparing with techniques like weight decay and dropout.

Results: LoRA underperforms full finetuning in the target domains but better preserves the base model’s performance outside them. The weight perturbations learned by full finetuning have a rank 10-100X higher than typical LoRA configurations, which helps explain the performance gap. Applying LoRA to all layers yields better results, and LoRA offers stronger regularization, reducing overfitting more effectively than dropout and weight decay.
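
For readers unfamiliar with the mechanism behind these rank comparisons, here is a minimal sketch of a LoRA-style linear layer in PyTorch: the pretrained weight is frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The rank, alpha, and layer sizes are illustrative choices, not the paper’s configuration.

```python
# Minimal sketch of a LoRA-style linear layer: freeze W, learn a low-rank update (alpha/r) * B @ A.
# Rank, alpha, and layer sizes are illustrative; this is not the paper's implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                     # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(4096, 4096, r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")  # the low-rank update is a tiny fraction
```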
