
Deep Dive: Multiprocessing with Pool, a Pure C Implementation of GPT-2, and Attention in Transformers

Unlock the power of Python’s multiprocessing library to execute tasks in parallel, effectively utilizing multiple CPU cores and bypassing the Global Interpreter Lock (GIL). By leveraging the Process and Pool classes, you can significantly reduce execution times for parallelizable tasks. Learn how to manage processes and facilitate inter-process communication through practical examples. Discover how multiprocessing can enhance your Python applications’ performance with this comprehensive guide.

Multiprocessing with Pool

Python’s multiprocessing library enables task execution in parallel by creating separate processes for each task, effectively utilizing multiple CPU cores.

This method bypasses the Global Interpreter Lock (GIL), allowing for concurrent task processing. The library includes Process and Pool classes for managing processes, and it supports inter-process communication through pipes and queues.

By dividing tasks into independently executable units, multiprocessing can significantly reduce execution times for parallelizable tasks.

from multiprocessing import Pool
import time

def process_request(request):
    time.sleep(1)  # simulate one second of work (e.g. I/O or computation)
    return f"Processed {request}"

requests = ['req1', 'req2', 'req3', 'req4', 'req5']

# Sequential processing
start = time.time()
results_seq = [process_request(req) for req in requests]
print(f"Sequential: {time.time() - start:.2f} seconds")

# Output: Sequential: 5.00 seconds

Processing the same requests in parallel speeds up execution considerably. Here is how to do it with multiprocessing.Pool.

# Concurrent processing with multiprocessing Pool
# (when running as a script, wrap this in an `if __name__ == "__main__":` guard,
#  which the 'spawn' start method used on Windows and macOS requires)
start = time.time()
with Pool(5) as p:
    results_pool = p.map(process_request, requests)
print(f"Pool: {time.time() - start:.2f} seconds")

# Output: Pool: 1.11 seconds
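
Beyond Pool, the Process and Queue classes mentioned earlier cover explicit inter-process communication. Here is a minimal sketch, separate from the benchmark above (the worker function and request names are illustrative):

from multiprocessing import Process, Queue
import time

def worker(request, queue):
    time.sleep(1)                      # simulate one second of work
    queue.put(f"Processed {request}")  # send the result back to the parent process

if __name__ == "__main__":
    queue = Queue()
    requests = ['req1', 'req2', 'req3', 'req4', 'req5']
    processes = [Process(target=worker, args=(req, queue)) for req in requests]
    for p in processes:
        p.start()
    results = [queue.get() for _ in requests]  # blocks until each result arrives
    for p in processes:
        p.join()
    print(results)

Unlike Pool.map, which returns results in the order of the inputs, results pulled from a Queue arrive in whatever order the workers finish.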

Karpathy is Back with a Pure C Implementation of GPT-2 in <1000 Lines

Andrej Karpathy, previously a member of OpenAI’s founding team and Director of AI at Tesla, released his second educational project on large language models (LLMs).

This project focuses on training a GPT-2 model with 124 million parameters on a CPU using only C/CUDA, avoiding PyTorch.

The codebase contains around 1,000 lines of code in a single file, allowing for the training of GPT-2 on a CPU with 32-bit precision.
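
For reference, the 124M figure follows from the standard GPT-2 “small” configuration (12 layers, 12 attention heads, 768-dimensional embeddings, 1,024-token context, 50,257-token vocabulary). A quick back-of-the-envelope count in Python, assuming the usual weight tying between the token embedding and the output head:

# Back-of-the-envelope parameter count for GPT-2 "small"
V, T, C, L = 50257, 1024, 768, 12    # vocab size, context length, embedding dim, layers

embeddings = V * C + T * C           # token + positional embeddings
per_layer = (
    2 * C                            # LayerNorm 1 (scale + shift)
    + C * 3 * C + 3 * C              # attention QKV projection (weights + bias)
    + C * C + C                      # attention output projection
    + 2 * C                          # LayerNorm 2
    + C * 4 * C + 4 * C              # MLP up-projection
    + 4 * C * C + C                  # MLP down-projection
)
final_ln = 2 * C
total = embeddings + L * per_layer + final_ln  # output head shares the token embedding
print(f"{total:,}")                  # ~124.4 million parameters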

This is a phenomenal resource for understanding how language models are trained under the hood.

Karpathy selected GPT-2 because its model weights are publicly available. The project uses C for its simplicity and direct hardware interaction.

First, the repo lets you download and tokenize a small dataset on which the model is trained. In principle, the model could be trained directly on this dataset.

However, the current CPU/fp32 implementation is still too inefficient to make training these models from scratch practical. Instead, the model is initialized with the GPT-2 weights released by OpenAI and fine-tuned on the tokenized dataset.
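
To make the tokenization step concrete: GPT-2 operates on integer token ids produced by a byte-pair-encoding (BPE) tokenizer with 50,257 entries. A rough Python sketch of that mapping, using the tiktoken package as an assumption rather than the repo’s own preprocessing script:

import tiktoken

# GPT-2's BPE tokenizer (50,257-token vocabulary)
enc = tiktoken.get_encoding("gpt2")

text = "Hello, world!"
ids = enc.encode(text)     # list of integer token ids
print(ids)
print(enc.decode(ids))     # round-trips back to the original text

# Training then consumes a flat stream of such ids written to disk, not raw text.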

Karpathy is currently working on:

  • a direct CUDA implementation, which will be significantly faster and probably come close to PyTorch;
  • speeding up the CPU version with SIMD instructions: AVX2 on x86, NEON on ARM (e.g. Apple Silicon);
  • more modern architectures, e.g. Llama 2, Gemma, etc.


Karpathy’s work contributes significantly to the open-source community and the field of AI. This second educational project goes one step further in democratizing AI by showing how a model can be trained and optimized in a single file of code.

“Writing the llm.c training code would in my opinion be a very interesting, impressive, self-contained and very meta challenge for LLM agents.” (Andrej Karpathy)

“I hope more developers rediscover the elegant efficiency of C, especially now that LLM copilots help reduce the memory-intensive barriers of recalling syntax and the many built-in functions.” (Dave Deriso)

3Blue1Brown Episode 2: Attention in Transformers, Visually Explained

The latest @3blue1brown video on YouTube delves into the attention mechanism of transformers. It explains how the model represents tokens as vectors and how these vectors gain meaning from context. 

This episode focuses on the technical aspects of the attention mechanism in the transformer architecture.
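
At the heart of that mechanism is scaled dot-product attention: each token’s query is compared against every token’s key, the scores are normalized with a softmax, and the weights mix the value vectors. A minimal NumPy sketch with toy matrices (not the video’s example):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scores[i, j]: how much token i attends to token j
    weights = softmax(scores, axis=-1)
    return weights @ V                # each output is a context-weighted mix of value vectors

# Toy example: 4 tokens, 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)       # (4, 8): one context-enriched vector per token

In a real transformer, Q, K, and V are learned linear projections of the token embeddings, and this computation is repeated in parallel across multiple attention heads.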

