The Stable Diffusion Model: An Introductory Guide

Shunya Vichaar
7 min read · May 16, 2024



Stable Diffusion falls under a category of deep learning models called diffusion models. These models are designed to generate new data similar to what they’ve seen in training, specifically images in the case of Stable Diffusion.

Why “Diffusion”?

The name “diffusion” comes from the mathematical resemblance to diffusion in science. Let’s break down the idea.

Imagine training a diffusion model with just two types of images: penguin and bear.

Forward Diffusion

Forward diffusion turns a photo into noise. During this process, noise is gradually added to a training image, transforming it into unrecognizable noise. This process will turn any penguin or bear image into noise, making it impossible to distinguish between the two. This is analogous to how an ink drop diffuses in water, spreading randomly throughout the water, making it impossible to trace its initial location.
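To make the forward process concrete, here is a minimal PyTorch sketch of the closed-form noising step used by DDPM-style diffusion models. The beta schedule values are illustrative, not Stable Diffusion's exact configuration.

```python
import torch

# Illustrative linear beta schedule (typical DDPM-style values, not SD's exact config)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(x0, t, alphas_cumprod):
    """Jump straight to timestep t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise

image = torch.randn(1, 3, 512, 512)   # stand-in for a normalized training image
noisy, true_noise = forward_diffuse(image, t=750, alphas_cumprod=alphas_cumprod)
```

The larger the timestep, the more the original image is drowned out by noise, which is exactly the ink-in-water behaviour described above.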

Reverse Diffusion

The reverse diffusion process restores an image. Starting from a noisy, meaningless image, reverse diffusion recovers either a penguin or a bear image. The process drifts towards penguin images or bear images, but nothing in between, ending in a clear picture of one or the other.

Training Process

To reverse the diffusion, a neural network (a U-Net), known as the noise predictor in Stable Diffusion, is trained to predict the noise that was added to an image. During training it is shown noisy images at known noise levels, and its weights are tuned so that its predictions match the noise that was actually added. After training, the noise predictor can estimate the noise in any noisy image.
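A minimal sketch of that training objective, assuming a `unet(noisy, t)` callable that returns a noise estimate of the same shape as its input: corrupt a clean image at a random timestep, then penalize the difference between the predicted and the true noise.

```python
import torch
import torch.nn.functional as F

def training_step(unet, x0, alphas_cumprod):
    """One step: corrupt a clean image, then train the U-Net to recover the noise that was added."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))       # random timestep per image
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise        # forward diffusion
    predicted_noise = unet(noisy, t)                                # the noise predictor
    return F.mse_loss(predicted_noise, noise)                       # "show it the correct answer"
```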

Using the Noise Predictor

Once the noise predictor is trained, it can be used to generate images. A completely random noise image is generated, and the noise predictor estimates the noise in it. The estimated noise is then subtracted, and this estimate-and-subtract step is repeated many times until an image of either a penguin or a bear emerges.
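A simplified denoising loop in that spirit, using a deterministic DDIM-style update; real samplers differ in their exact step rule, and `unet` and the schedule here are placeholders.

```python
import torch

@torch.no_grad()
def generate(unet, alphas_cumprod, shape=(1, 3, 512, 512), steps=50):
    """Start from pure noise and repeatedly estimate, remove, and partially re-add noise."""
    x = torch.randn(shape)
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    for i, t in enumerate(timesteps):
        eps = unet(x, t)                                             # predicted noise
        a_bar = alphas_cumprod[t]
        x0_hat = (x - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()     # implied clean image
        a_bar_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
        x = a_bar_prev.sqrt() * x0_hat + (1.0 - a_bar_prev).sqrt() * eps
    return x
```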

Stable Diffusion Model

However, the diffusion process described above is not quite how Stable Diffusion works. Running diffusion directly in image (pixel) space is computationally very expensive, making it impractical on a single GPU. Models like Imagen and DALL-E do operate in pixel space and use tricks to speed things up, but the computation remains heavy.

Stable Diffusion tackles the issue of speed by working smarter, not harder. Here’s how it does it.

Instead of dealing directly with high-dimensional images, Stable Diffusion compresses them into a smaller, more manageable space called the latent space. For a 512×512 image, this space is 48 times smaller than pixel space, making computations much faster.

This compression is achieved using a technique called the variational autoencoder (VAE). This neural network has two parts: an encoder and a decoder. The encoder compresses an image into the latent space, while the decoder restores the image from the latent space.
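For illustration, here is how an encode/decode round trip might look with the Hugging Face `diffusers` `AutoencoderKL` class. The checkpoint name and the 0.18215 scaling factor are the ones commonly used with SD v1 and are assumptions here, not part of this article.

```python
import torch
from diffusers import AutoencoderKL

# Load a VAE checkpoint (repo name is an assumption; any SD v1-compatible VAE works the same way)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)                    # stand-in for a normalized RGB image
latents = vae.encode(image).latent_dist.sample()       # encoder: 1x3x512x512 -> 1x4x64x64
latents = latents * 0.18215                            # latent scaling factor used with SD v1
reconstructed = vae.decode(latents / 0.18215).sample   # decoder: back to 1x3x512x512
```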

The latent space in Stable Diffusion is significantly smaller than the image pixel space. All the image transformations, like forward and reverse diffusions, happen in this compact latent space.

During training, instead of generating noisy images directly, Stable Diffusion generates random tensors in the latent space (latent noise). This noise is then used to corrupt the image representation in the latent space. This process is much faster because the latent space is smaller.

Image resolution affects the size of the latent image tensor. Generating larger images requires more time and memory.
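The arithmetic behind the 48× figure and the resolution scaling, assuming the standard SD v1 mapping of a 512×512×3 image to a 64×64×4 latent:

```python
def latent_shape(height, width, channels=4, downscale=8):
    """Stable Diffusion's VAE shrinks each spatial side by 8x and uses 4 latent channels."""
    return (channels, height // downscale, width // downscale)

pixels = 512 * 512 * 3                 # 786,432 values in pixel space
c, h, w = latent_shape(512, 512)
latents = c * h * w                    # 4 * 64 * 64 = 16,384 values in latent space
print(pixels / latents)                # 48.0 -> the "48 times smaller" latent space

print(latent_shape(768, 768))          # (4, 96, 96): larger images mean larger latent tensors
```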

The VAE can compress images into a smaller latent space without losing much information because natural images have high regularity. For example, a face follows a specific spatial relationship between features like eyes, nose, and ears. This property allows the VAE to compress images efficiently.

Reverse diffusion in Stable Diffusion works by generating a random latent space matrix, estimating its noise, subtracting the noise, and repeating this process to generate the final image.
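Schematically, the whole generation loop lives in latent space and only touches pixels once at the end. In the sketch below, `unet`, `scheduler`, and `vae` are placeholders for the trained components, not a specific library API.

```python
import torch

@torch.no_grad()
def text_to_image(unet, vae, text_embeddings, scheduler, steps=50):
    """All denoising happens on a 4x64x64 latent; pixels only appear in the final decode."""
    latents = torch.randn(1, 4, 64, 64)                # random latent matrix, not a pixel image
    for t in scheduler.timesteps(steps):
        eps = unet(latents, t, text_embeddings)        # noise estimate, steered by the prompt
        latents = scheduler.step(eps, t, latents)      # subtract (part of) the estimated noise
    return vae.decode(latents / 0.18215).sample        # one VAE decode back to pixel space
```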

Conditioning is crucial for Stable Diffusion to generate specific images based on text prompts. The text is tokenized, converted into embeddings, processed by a text transformer, and then used to steer the noise predictor for image generation.

Understanding these concepts is key to unlocking the full potential of Stable Diffusion for text-to-image generation.

Conditioning

In Stable Diffusion, the text prompt guides the model on what image to create. Without this guidance, the model might produce random images that don’t match the prompt. Conditioning ensures that the model generates the desired image by using the text prompt as a reference.

Here’s how it works:

  1. Tokenization: The text prompt is first broken down into tokens using a CLIP tokenizer. Each word in the prompt is converted into a token, which is a numerical representation of the word.
  2. Embedding: These tokens are then converted into embeddings, which are fixed-length numerical vectors that capture the meaning of the words. For example, words like “woman” and “queen” might have similar embeddings because they are related concepts.
  3. Text Transformer: The embeddings are processed by a text transformer, which further refines them and prepares them for use by the noise predictor. This transformer helps the model understand the context and meaning of the words in the prompt.
  4. Noise Predictor: The processed embeddings are then fed into the noise predictor, which uses this information to steer the image generation process. By conditioning the noise predictor with the text embeddings, the model can generate images that align with the text prompt. (A code sketch of steps 1–3 follows this list.)
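A sketch of steps 1–3 using the Hugging Face `transformers` CLIP classes; the checkpoint name is the ViT-L/14 text encoder commonly paired with SD v1 and is an assumption here.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD v1 is commonly paired with OpenAI's CLIP ViT-L/14 text encoder (checkpoint name assumed)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of a penguin on an iceberg"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state   # (1, 77, 768) per-token vectors
```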

CLIP, or Contrastive Language-Image Pre-training, is a powerful neural network model developed by OpenAI. It’s designed to understand the relationship between images and natural language text. Here’s a simpler explanation:

What CLIP Does: CLIP learns to associate images with relevant text descriptions. For example, it can understand that a picture of a cat should be associated with the word “cat.”

How CLIP Works:

Training Data: CLIP is trained on a large dataset of image and text pairs. It learns to map both images and text descriptions to high-dimensional vectors in a shared space.

Contrastive Learning: CLIP uses a contrastive learning algorithm. It learns to make similar images and text descriptions closer to each other in the shared space, while pushing dissimilar pairs apart.

Shared Space: In this shared space, similar images and text descriptions are located close to each other. This allows CLIP to understand the relationship between them.
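A toy version of that contrastive objective: compute cosine similarities between every image/text pair in a batch, then pull the matching pairs (the diagonal) together while pushing everything else apart.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: (batch, dim); row i of each side is a matching pair."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature     # cosine similarity of every pair
    targets = torch.arange(logits.shape[0])                 # true matches sit on the diagonal
    loss_images = F.cross_entropy(logits, targets)          # pull each image toward its caption
    loss_texts = F.cross_entropy(logits.T, targets)         # and each caption toward its image
    return (loss_images + loss_texts) / 2
```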

CLIP can be used for various tasks like image classification, object detection, and image captioning. It’s particularly useful because it can generalize to new image and text domains, even if it hasn’t seen them during training.

Overall, conditioning plays a crucial role in ensuring that Stable Diffusion generates the desired images based on the given text prompt, making it a powerful tool for text-to-image generation.

Cross-attention

Cross-attention in Stable Diffusion is a critical mechanism where the text prompt influences the image generation process. Here’s a simplified explanation:

Cross-attention allows the model to focus on specific parts of the text prompt while generating the image. It ensures that the image aligns with the meaning of the text.

How it Works:

  • Example: Consider the prompt “A woman with green eyes.” The model pairs the words “green” and “eyes” together to understand that the woman should have green eyes, not a green shirt.
  • Steering the Generation: This pairing steers the image generation process towards creating an image of a woman with green eyes, rather than attaching the green to some other detail. (A sketch of the attention computation follows below.)
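A bare-bones sketch of the attention computation itself: queries come from the image latents, keys and values from the prompt embeddings, so each spatial location can “look at” the relevant words. The shapes and projection matrices below are illustrative, not Stable Diffusion's actual sizes.

```python
import torch
import torch.nn.functional as F

def cross_attention(image_tokens, text_embeddings, w_q, w_k, w_v):
    """Queries come from image latents; keys and values come from the prompt embeddings."""
    q = image_tokens @ w_q                          # (num_image_positions, d)
    k = text_embeddings @ w_k                       # (num_prompt_tokens, d)
    v = text_embeddings @ w_v                       # (num_prompt_tokens, d)
    scores = q @ k.T / (q.shape[-1] ** 0.5)         # how strongly each position attends to each word
    weights = F.softmax(scores, dim=-1)
    return weights @ v                              # prompt information blended into the image path

# Illustrative shapes: 64x64 latent positions, 77 prompt tokens, 320 channels
image_tokens = torch.randn(4096, 320)
text_embeddings = torch.randn(77, 320)
w_q, w_k, w_v = (torch.randn(320, 320) for _ in range(3))
out = cross_attention(image_tokens, text_embeddings, w_q, w_k, w_v)   # (4096, 320)
```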

Cross-attention is crucial for generating accurate and relevant images based on the text prompt. Techniques like LoRA use this mechanism to fine-tune the model and insert styles, highlighting its significance in the image generation process.

LORA

LoRA (Low-Rank Adaptation) is a method used to fine-tune Stable Diffusion models effectively. LoRA models are compact (2–200 MB) and offer decent training power.

Stable Diffusion enthusiasts often face storage issues due to the large file sizes of different models. LoRA helps alleviate this problem by being more storage-friendly while still providing good customization capabilities. However, LoRA models need to be used with a model checkpoint file, as they only modify styles by making small changes to the main model.

How does LoRA work?

LoRA fine-tunes the most critical part of Stable Diffusion models: the cross-attention layers, where the image and text prompt intersect. By focusing on these layers, LoRA can effectively train the model. The key to LoRA’s smaller file size lies in how it stores data. Instead of keeping large matrices (big collections of numbers), LoRA breaks them into two smaller matrices, significantly reducing the amount of data stored.

The weights of a cross-attention layer are arranged in matrices. Matrices are just a bunch of numbers arranged in columns and rows, like on an Excel spreadsheet. A LoRA model fine-tunes a model by adding its weights to these matrices.

For example, if a model has a matrix with 1,000 rows and 2,000 columns (2,000,000 numbers), LoRA breaks it into two matrices of 1,000x2 and 2x2,000 (only 6,000 numbers), making storage 333 times smaller. This technique, called low-rank adaptation, doesn’t compromise much on training power, making LoRA an efficient way to fine-tune models without taking up excessive storage space.
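A minimal sketch of that low-rank idea on a single linear layer; real LoRA implementations attach such factors to the cross-attention projection layers and use an alpha/rank scaling, which is simplified here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Keep the original weight frozen and learn a small low-rank update B @ A on top of it."""
    def __init__(self, base: nn.Linear, rank: int = 2, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                    # checkpoint weights stay frozen
        out_features, in_features = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)   # small matrix #1
        self.B = nn.Parameter(torch.zeros(out_features, rank))         # small matrix #2
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# 1,000 x 2,000 weights (2,000,000 numbers) vs. two rank-2 factors (6,000 numbers)
layer = LoRALinear(nn.Linear(1000, 2000), rank=2)
print(layer.A.numel() + layer.B.numel())   # 6000
```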

Reference — https://www.youtube.com/watch?v=ltLNYA3lWAQ
