
Better Images, Less Training

The longer text-to-image models train, the better their output — but the training is costly. Researchers built a system that produced superior images after far less training.

Independent researcher Pablo Pernías and colleagues at Technische Hochschule Ingolstadt, Université de Montréal, and Polytechnique Montréal built Würstchen, a system that divided the task of image generation between two diffusion models. 

Diffusion model basics: During training, a text-to-image generator based on diffusion takes a noisy image and a text embedding. The model learns to use the embedding to remove the noise in successive steps. At inference, it produces an image by starting with pure noise and a text embedding, and removing noise iteratively according to the text embedding. A variant known as a latent diffusion model uses less processing power by removing noise from a noisy image embedding instead of a noisy image.
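
In code, that iterative denoising loop might look roughly like the sketch below. It is a minimal illustration only: `text_encoder`, `denoiser`, and `decoder` are hypothetical stand-ins for the corresponding components, and the update rule is a crude placeholder for a real noise schedule.

```python
import torch

def generate(prompt, text_encoder, denoiser, decoder,
             num_steps=50, latent_shape=(4, 64, 64)):
    """Minimal latent-diffusion sampling sketch: start from pure noise in
    latent space and iteratively remove noise, guided by the text embedding."""
    text_emb = text_encoder(prompt)           # embed the prompt
    latent = torch.randn(1, *latent_shape)    # start from pure noise
    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(latent, t, text_emb)
        # Crude update for illustration only; real samplers follow a noise schedule.
        latent = latent - predicted_noise / num_steps
    return decoder(latent)                    # map the clean latent back to pixels
```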

Key insight: A latent diffusion model typically learns to remove noise from an embedding of an input image based solely on a text prompt. It can learn much more quickly if, in addition to the text prompt, a separate model supplies a smaller, noise-free version of the image embedding. Working as a system, the two models can learn their tasks in a fraction of the usual time.
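
A rough sketch of what that extra conditioning signal changes in the training loop follows. All names here are illustrative placeholders rather than the authors' code, and the noising rule is simplified for clarity.

```python
import torch
import torch.nn.functional as F

def training_step(image, caption, image_encoder, small_encoder, text_encoder, denoiser):
    """One training step for the larger latent diffusion model: the denoiser is
    conditioned on the text embedding AND a small, noise-free image embedding."""
    latent = image_encoder(image)     # large image embedding (e.g., from an image encoder)
    small = small_encoder(image)      # much smaller, noise-free embedding of the same image
    text = text_encoder(caption)

    noise = torch.randn_like(latent)
    t = torch.rand(latent.shape[0]).view(-1, 1, 1, 1)   # random noise level in [0, 1)
    # Simplified noising rule for illustration; real noise schedules differ.
    noisy_latent = (1 - t) * latent + t * noise

    pred = denoiser(noisy_latent, t, text, small)       # note the extra conditioning input
    return F.mse_loss(pred, noise)                      # learn to predict the added noise
```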

How it works: Würstchen comprises three components that required training: the encoder-decoder from VQGAN, a latent diffusion model based on U-Net, and another latent diffusion model based on ConvNeXt. The authors trained the models separately on subsets of LAION-5B, which contains matched images and text descriptions scraped from the web.

  • The authors trained VQGAN to reproduce input images. The encoder produced embeddings, to which the authors added noise.
  • To train U-Net, the authors used EfficientNetV2 (a convolutional neural network pretrained on ImageNet) to produce embeddings one-tenth the size of the VQGAN embeddings. Given this smaller embedding, a noisy VQGAN embedding, and a text description, U-Net learned to remove noise from the VQGAN embedding. 
  • To train ConvNeXt, EfficientNetV2 once again produced small embeddings from input images, to which the authors added noise. Given a noisy EfficientNetV2 embedding and a text description, ConvNeXt learned to remove the noise.
  • At inference, the components worked in the opposite order of training: (i) Given noise and a text prompt, ConvNeXt produced a small, EfficientNetV2-sized image embedding. (ii) Given that embedding, noise, and the same text prompt, U-Net produced a larger, VQGAN-sized embedding. (iii) Given the larger embedding, VQGAN produced an image. (This cascade is sketched in code after this list.)
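
Putting the three inference stages together, the cascade might look like the following sketch. Component names, call signatures, and tensor shapes are illustrative assumptions, not Würstchen's actual interfaces.

```python
import torch

def cascade_inference(prompt, text_encoder, convnext_sampler, unet_sampler, vqgan_decoder):
    """Three-stage inference, mirroring steps (i)-(iii) above."""
    text = text_encoder(prompt)

    # (i) ConvNeXt-based diffusion model: pure noise + text -> small image embedding.
    small_emb = convnext_sampler(torch.randn(1, 16, 24, 24), text)        # shape is illustrative

    # (ii) U-Net-based diffusion model: noise + small embedding + text -> large embedding.
    large_emb = unet_sampler(torch.randn(1, 4, 256, 256), text, small_emb)  # shape is illustrative

    # (iii) VQGAN decoder: large embedding -> final image.
    return vqgan_decoder(large_emb)
```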

Results: The authors compared Würstchen (trained on subsets of LAION-5B for 25,000 GPU hours) to Stable Diffusion 2.1 (trained on subsets of LAION-5B for 200,000 GPU hours). The authors generated images based on captions from MS COCO (human-written captions of pictures scraped from the web) and Parti-prompts (human-written captions designed to reflect common prompts for generative models). They asked 90 people which output they preferred. The judges expressed little preference regarding renderings of MS COCO captions: They chose Würstchen 41.3 percent of the time, Stable Diffusion 40.6 percent of the time, and neither 18.1 percent of the time. However, presented with renderings of Parti-prompts, they preferred Würstchen 49.5 percent of the time, Stable Diffusion 32.8 percent of the time, and neither 17.7 percent of the time.

Why it matters: Training a latent diffusion model to denoise smaller embeddings accelerates training, but it tends to produce lower-quality images. Stacking two diffusion models, one to generate smaller embeddings and the other to generate larger embeddings based on the smaller ones, enabled Würstchen to match or exceed the output quality of models with large embeddings while achieving the training speed of models with small embeddings.

Stacking models in this fashion could be useful in training video generators. They could use a speed boost because video is much more data-intensive than still images.
