Training a Custom Model with Stable Diffusion

By Daniel Voigt Godoy

Embarking on the journey of diffusion models, this post kicks off a seven-article series and sets the stage for the in-depth exploration that will unfold in subsequent posts. Throughout the series, we’ll delve into an array of topics: implementing, fine-tuning, guiding, and conditioning diffusion models.

Let’s start with the Stable Diffusion model.

Diffusion models power several popular image generation tools. This post, along with the succeeding ones in the series, is designed to explain the individual components powering the diffusion model. We’ll guide you step by step with examples on how to implement, fine-tune, guide, and condition a diffusion model using PyTorch and Hugging Face.

The Power of Stable Diffusion

Behold the mighty Stable Diffusion model! Open source and versatile, Stable Diffusion distinguishes itself from other image generation models by giving users access to its source code and the ability to train custom models.

The model has several moving parts, and may look intimidating at first sight, but we’ll take it apart, piece by piece, to make it more digestible.

Before we start dissecting this model, let’s see what it is capable of with a short example.

This example assumes we are running the code in Google Colab, which already has PyTorch and other typical packages installed. You will need a Google account. Let’s use the “Article_1_Diffusion_Process” Google Colab notebook to follow along.

First, we need to install a few packages:

!pip install diffusers==0.16.1 accelerate open_clip_torch transformers

The following code is a slightly modified version of a notebook from Unit 3 of the Hugging Face Diffusion Models class.

import torch
from diffusers import DiffusionPipeline

device = 'cuda' if torch.cuda.is_available() else 'cpu'
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
# Set up a generator for reproducibility
generator = torch.Generator(device=device).manual_seed(42)

# Run the pipeline, showing some of the available arguments
pipe_output = pipe(
    prompt="impressionist painting of an autumn cityscape", # What to generate
    negative_prompt="Oversaturated, blurry, low quality", # What NOT to generate
    height=480, width=640,     # Specify the image size
    guidance_scale=8,          # How strongly to follow the prompt
    num_inference_steps=35,    # How many steps to take
    generator=generator        # Fixed random seed
)

# View the resulting image:
pipe_output.images[0]

Executing this code allows us to produce images similar to the one presented below.

Output

Amazing, isn’t it? Let’s unpack the components that make up this pipeline. First, we’ll import the libraries and modules useful for diffusion models, such as PyTorch and Hugging Face’s diffusers, and load the Modified National Institute of Standards and Technology (MNIST) dataset, which we’ll use throughout the series. Once we have our imports and dataset, it’s time to work with the model.

During training, a diffusion model learns how noise transforms an image: it starts with data, such as an image, and progressively adds noise to it. During generation, the model works together with a scheduler to gradually remove noise from a noisy image until a clear image emerges. In the following sections, we’ll cover these concepts in more detail with examples.

Imports

To create our own image generation model, we’ll first import necessary packages.  Let’s take care of the imports and a helper function to plot the images we’ll generate along the way:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader

import torchvision
from torchvision.transforms import Compose, Resize, ToTensor, ToPILImage

from matplotlib import pyplot as plt
from PIL import Image
from tqdm import tqdm
import numpy as np

from diffusers import DDIMScheduler, DDPMPipeline, DDIMPipeline

def plot_images(images, n=8, axs=None):
    if axs is None:
        fig, axs = plt.subplots(1, n, figsize=(10, 3))
    assert len(axs) == len(images)
    for i, img in enumerate(images):
        axs[i].axis('off')
        if isinstance(img, torch.Tensor):
            img = ToPILImage()((img/2+0.5).clamp(0, 1))
        axs[i].imshow(img.resize((64, 64), resample=Image.NEAREST), cmap='gray_r', vmin=0, vmax=255)

Now we’re good to go!

Dataset

Diffusion models can be large and training can be time consuming. For educational purposes, we will utilize a dataset comprising small images. This way, we can train and fine-tune models using Google Colab in a matter of minutes instead of hours, allowing us to experiment with diverse setups and configurations.

In the following example, we are utilizing the frequently used MNIST dataset, which stands for Modified National Institute of Standards and Technology.

Let’s resize the images from their original 28×28 pixels to 32×32 pixels and create tensors from them using Torchvision transforms directly within the dataset:

composed = Compose([Resize(32), ToTensor()])
dataset = torchvision.datasets.MNIST(root="mnist/", train=True, download=True, transform=composed)
train_dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

Let's use the plot_images helper function to visualize eight digits from the dataset:

images = next(iter(train_dataloader))[0][:8]
plot_images(images)

This code produces the following set of images.

Output

Typical MNIST data, nothing new to see here, so let’s move on!

Noise

Let’s make some noise, literally! The snippet below generates eight images of pure Gaussian noise:

torch.manual_seed(13)
noise = torch.randn_like(images)
noise.shape
Output
torch.Size([8, 1, 32, 32])
Now, let’s add a line to the end of the code to visualize the Gaussian noise:
plot_images(noise)

Output

Believe it or not, diffusion models are able to transform images of pure Gaussian noise into real images, like the handwritten digits of MNIST above, or fancy artwork like the first example from Hugging Face’s Stable Diffusion pipeline.

You can’t help but be amazed by the thought of this incredible transformation from noise to image – it’s kind of like “discovering” images inside the noise!

Let’s dig a bit deeper into this process. First, let’s imagine that noise is incrementally added to an image (we’ll start with a blank image for now) in 1,000 small steps. Every time we take a step, we get a little bit closer to pure Gaussian noise.

steps = 1000
# Five evenly spaced fractions between 0 and 1 (how far along the 1,000-step process we are)
fractions = torch.linspace(0, steps-1, 5)/(steps-1)
fractions
Output
tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])

We can illustrate this process through a few selected steps. Initially, the image is blank. It gets progressively noisier, until it reaches the full level of noise we generated at the beginning. It’s akin to the noise gradually “fading in”!

fig, axs = plt.subplots(len(fractions), 8, figsize=(10, 5))
for i, f in enumerate(fractions):
    plot_images(noise*f, axs=axs[i])

Output

Now, let’s do the exact opposite with the original MNIST images. We’ll “fade out” the images over 1,000 steps. The images start as the original ones, and get progressively fainter, until they disappear completely into a blank image.

fig, axs = plt.subplots(len(fractions), 8, figsize=(10, 5))
for i, f in enumerate(fractions):
    plot_images((1-f)*images, axs=axs[i])

Output

Diffusion

Now, what happens if we add the two together: the progressively noisier images and the progressively fainter digits?

That’s a simplified diffusion process!

fig, axs = plt.subplots(len(fractions), 8, figsize=(10, 5))
for i, f in enumerate(fractions):
    plot_images(noise*f+(1-f)*images, axs=axs[i])

Output

There is more to this process than meets the eye, though. The diffusion process does not have to follow a linear path like the one in the example above. There are different ways to weigh images and noise, and these are called schedules.

Unsurprisingly, the objects managing these schedules are called schedulers, marking the first component within the Stable Diffusion pipeline that we’ll explore. These schedulers handle the intricate task of both adding and removing noise to and from images, shouldering the heavy lifting for us (more on this later).

Scheduler

The image below, from the “Denoising Diffusion Probabilistic Models” paper by Jonathan Ho, Ajay Jain, and Pieter Abbeel, illustrates both processes:

  • Adding noise to a clean image, from right to left, using q
  • Removing noise from a noisy image, from left to right, using p

Adding noise is relatively straightforward, thanks to some convenient mathematical properties that simplify determining the amount of noise to be added based on the current timestep. While we won’t delve into the details here, you can check Lilian Weng’s insightful blog post, “What Are Diffusion Models?” for more specifics.

Source: Denoising Diffusion Probabilistic Models by Jonathan Ho, Ajay Jain, Pieter Abbeel
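For reference, the convenient property mentioned above is that the noisy version of an image at any timestep t has a closed form, so there is no need to simulate every intermediate step (standard DDPM notation, where $\bar{\alpha}_t$ is the cumulative product of the alphas):

$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$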

Hugging Face’s diffusers library implements several schedulers, so we can leverage them to seamlessly add noise to our images. First, let’s create a scheduler that uses 1,000 timesteps:

from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

Then, let's pick eight evenly spaced timesteps between 0 and 999:

timesteps = torch.linspace(0, noise_scheduler.config.num_train_timesteps-1, 8).long()
timesteps
Output
tensor([  0, 142, 285, 428, 570, 713, 856, 999])

Next, let’s use the eight images we retrieved from our dataset and, for each image, add the noise corresponding to a given timestep using the aptly named add_noise() method.

It takes three arguments:

  • clean images
  • generated noise
  • timesteps

torch.manual_seed(13)
noise = torch.randn_like(images)

noisy_images = noise_scheduler.add_noise(images, noise, timesteps)
fig, axs = plt.subplots(3, 8, figsize=(10, 4))
plot_images(images, axs=axs[0])
plot_images(noisy_images, axs=axs[2])
for i, ax in enumerate(axs[1]):
    ax.axis('off')
    ax.text(.35, .5, str(timesteps[i].item()))

Output

As we progress from left to right, there’s more and more noise added to the image. Additionally, it appears that the transition from “clean” to “noisy” image happens more rapidly than in our previous example. Why is that the case? It’s because the “fading in” of the noise and the “fading out” of the original image do not follow a linear schedule.

Let’s go through some mathematical details after all to illustrate the process.

The expression below shows us how a given (noisy) image at timestep t is a composition of both the original image (x0) and pure Gaussian noise (epsilon):
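$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$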

They are weighted by coefficients based on the cumulative product of alpha. But what are these alphas? They are computed by the scheduler based on the defined schedule. Let’s take a look at them:

noise_scheduler.alphas
Output
tensor([0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9997,
        0.9997, 0.9997, 0.9997, 0.9997, 0.9996, 0.9996, 0.9996, 0.9996, 0.9996,
        ...
        0.9804, 0.9803, 0.9803, 0.9803, 0.9803, 0.9803, 0.9802, 0.9802, 0.9802,
        0.9802, 0.9802, 0.9801, 0.9801, 0.9801, 0.9801, 0.9801, 0.9800, 0.9800,
        0.9800])
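As a quick sanity check (using the betas, alphas, and alphas_cumprod attributes exposed by DDPMScheduler), we can confirm that each alpha is simply 1 − beta and that alphas_cumprod is their running product:

# alpha_t = 1 - beta_t, and alphas_cumprod is the cumulative product of the alphas
print(torch.allclose(noise_scheduler.alphas, 1.0 - noise_scheduler.betas))
print(torch.allclose(noise_scheduler.alphas_cumprod,
                     torch.cumprod(noise_scheduler.alphas, dim=0)))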

Let’s plot a graph to visualize the relationship between the original image and the noise as t increases:

plt.plot(noise_scheduler.alphas_cumprod ** 0.5, label=r"${\sqrt{\bar{\alpha}_t}}$")
plt.plot((1 - noise_scheduler.alphas_cumprod) ** 0.5, label=r"$\sqrt{(1 - \bar{\alpha}_t)}$")
plt.legend(fontsize="x-large");

Output

The original image (x0) is weighted by the blue line, while the yellow line drives the noise (epsilon). In the beginning (t = 0), there’s only the original image. In the end (t = 1000), there’s only noise. You may also notice that this schedule is quite different from the naive linear schedule we used to illustrate the diffusion process.
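If you’re curious how other schedules behave, here’s a short optional sketch (assuming DDPMScheduler’s beta_schedule argument) that plots the weight given to the original image, the square root of the cumulative product of alpha, for a few built-in choices:

# Compare how quickly the original image "fades out" under different built-in schedules
for schedule in ["linear", "scaled_linear", "squaredcos_cap_v2"]:
    sched = DDPMScheduler(num_train_timesteps=1000, beta_schedule=schedule)
    plt.plot(sched.alphas_cumprod ** 0.5, label=schedule)
plt.legend(fontsize="x-large");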

Reverse Diffusion

That’s where the magic, or better yet, the model, comes in! As we saw above, a noisy image at a given timestep is the weighted sum of the original image and the noise.

In the diffusion process, we use two variables (the original image and the noise we create) to obtain a third variable (the noisy image).

In the reverse diffusion process, we have one variable (the noisy image) and we’d like to obtain another (the clean image).

But we’re still missing the third variable: the noise. In order to generate a clean image, we need to know the noise, but how?

What if we build a model to predict the noise?

Easy enough, right? Well, in theory, yes. In practice, the model won’t be good enough to predict the right amount of noise in one shot! So, it is actually done incrementally: we move one step at a time, from the noisy image towards the clean image, using a weighted sum of the noisy image (xt) and the predicted clean image (x̂0).

Both coefficients are based on the alpha variable that drives the schedule, but we won’t be going into any further details here.
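For the curious: once the model outputs a noise estimate, the predicted clean image follows from rearranging the expression we saw earlier,

$\hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}}{\sqrt{\bar{\alpha}_t}}$

and the scheduler then mixes $\hat{x}_0$ and $x_t$, with weights derived from the alphas, to produce the sample for step t−1. That’s exactly the heavy lifting the scheduler’s step() method (coming up shortly) does for us.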

Cheating

Now, let’s take a slight shortcut and construct a “model” that impeccably predicts the noise being added to the images. This approach is considered a bit of a cheat, as we use this “model” both as a noise generator AND a predictor!

First, we use it to generate some noise, and feed the noise to the scheduler.

def model(x, t):
    # "Cheating" model: fixing the seed makes the "predicted" noise exactly
    # match the noise we generate and add to the images below
    torch.manual_seed(13)
    noise = torch.randn_like(x)
    return noise

noise = model(images, None)
sample = noise_scheduler.add_noise(images, noise, torch.ones(8).long()*999)
plot_images(sample)

Output

These are noisy MNIST images and they are unrecognizable.

Then, let’s use our “model” to predict epsilon, and feed it to the scheduler’s step() method, which also takes three arguments:

  1. predicted noise
  2. timesteps
  3. noisy images

After calling the step method, we can either retrieve the noisy image at the previous (t-1) step using the prev_sample attribute or the predicted original (clean) image using the pred_original_sample attribute.

We know our model is perfect, so let’s take the predicted original sample right away:

t = 999
epsilon = model(sample, t)
pred_x0 = noise_scheduler.step(epsilon, t, sample).pred_original_sample
plot_images(pred_x0)

Output

Perfect digits! The noise was completely removed, as expected, since we’re “cheating”.

Less Noise in the Previous Step

Of course, perfect models do not exist. In reality, we would be iteratively generating better and better samples as we move backwards in time, using the (hopefully) less noisy sample predicted for t-1 as input for the next step, until we reach t = 0.

In code, it looks like this:

# Walk backwards through the timesteps, replacing the sample with the
# scheduler's (slightly less noisy) prediction for the previous step
for i, t in enumerate(noise_scheduler.timesteps):
    with torch.no_grad():
        epsilon = model(sample, t)

    sample = noise_scheduler.step(epsilon, t, sample).prev_sample

It is time for a real model now, and the typical model used with diffusion processes is the UNet. In our next post in this series, we’ll explore the diffusers library’s UNet model class and break down examples showcasing the model’s efficiency.
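As a small teaser, here is a minimal sketch of what plugging in a real noise-prediction model looks like. It uses diffusers’ UNet2DModel with an assumed small configuration for 32×32 single-channel images, not necessarily the exact setup we’ll use next time:

from diffusers import UNet2DModel

# A small UNet: takes a noisy 32x32 single-channel image plus a timestep
# and outputs a noise prediction with the same shape as its input
unet = UNet2DModel(
    sample_size=32, in_channels=1, out_channels=1,
    block_out_channels=(32, 64, 64),
    down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
)

noisy = torch.randn(8, 1, 32, 32)
with torch.no_grad():
    predicted_noise = unet(noisy, timestep=999).sample
predicted_noise.shape   # torch.Size([8, 1, 32, 32])

This untrained UNet predicts random noise, of course; training it to predict the actual noise added by the scheduler is precisely what we’ll tackle in the next article.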

We included a bonus section below, highlighting a few more impressive use cases of diffusion models. We will share content like this throughout the series, so be sure to check eviltux.com for more in-depth guides on diffusion models.

Bonus

Generating images out of pure noise is incredible, but it is only the tip of the iceberg! Here we present several other amazing use cases of diffusion models.

Image2Image

As the name implies, it begins with an existing image and transforms it into a different one. We can think of it as giving the model a helping hand, so it doesn’t have to start from pure noise.

Before following the example below, ensure your Google Colab notebook is using a GPU. To check, click on the RAM and Disk icons in the top right corner of the page. A “Resources” sidebar will appear; at its bottom left, click “Change runtime type” and make sure the runtime type is set to T4 GPU.

The following example, drawn from a fast.ai notebook on diffusion, utilizes a rough sketch of a wolf howling at the moon as a starting point. Through the Image2Image process, it evolves into a refined image:

from diffusers import StableDiffusionImg2ImgPipeline

device = 'cuda' if torch.cuda.is_available() else 'cpu'

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16)
pipe.to(device)
Output
StableDiffusionImg2ImgPipeline {
  "_class_name": "StableDiffusionImg2ImgPipeline",
  "_diffusers_version": "0.16.1",
  "feature_extractor": [
    "transformers",
    "CLIPFeatureExtractor"
  ],
  "requires_safety_checker": true,
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
Below, the example downloads and opens the wolf sketch:

import shutil
import requests

# Placeholder URL: replace 'c' with the address of the wolf sketch
# (or any starting image of your own)
url = 'c'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)

from PIL import Image
init_image = Image.open('img.png').convert("RGB")
init_image

Output

torch.manual_seed(1000)
prompt = "Wolf howling at the moon, photorealistic 4K"
images = pipe(prompt=prompt, num_images_per_prompt=3, image=init_image, strength=0.8, num_inference_steps=50).images
init_image = images[2]
init_image

Output

torch.manual_seed(1000)
prompt = "Oil painting of wolf howling at the moon by Van Gogh"
new_images = pipe(prompt=prompt, num_images_per_prompt=3, image=init_image, strength=1, num_inference_steps=70).images
new_images[2]

Output

Textual Inversion

Textual inversion, proposed in the paper “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion” and its accompanying repo, is a technique that allows you to consistently place yourself, or anything you want, in the generated images. So, how would you condition a model to generate such images? Unless you’re famous enough to be part of the training set used to train a CLIP model, chances are you’re unknown to CLIP.

And even if you want to place your dog instead of yourself, the diffusion process will generate “a” dog, not “your” dog. The generated image may get the breed right, but it is only a generic depiction of that breed.

The underlying issue is that both you and your dog are unfamiliar to CLIP. Textual inversion fixes that! How? It takes a specific, rarely used token (for whatever reason, “sks” is a popular choice) and overfits it to a selection of images of yourself, or your dog. That way, CLIP gets to know you, and you can start calling yourself “sks” for the purpose of image generation.

You can also check Hugging Face’s textual inversion fine-tuning example, but the example below comes from a fast.ai notebook on diffusion:

pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16) 
pipe = pipe.to(device)

# Download a pre-trained textual-inversion embedding from the SD concepts library
embeds_url = "https://huggingface.co/sd-concepts-library/indian-watercolor-portraits/resolve/main/learned_embeds.bin"
response = requests.get(embeds_url, stream=True)
with open('learned_embeds.bin', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)

# Load the learned embedding and grab the new token and its embedding vector
embeds_dict = torch.load('learned_embeds.bin', map_location=device)
tokenizer = pipe.tokenizer
text_encoder = pipe.text_encoder
new_token, embeds = next(iter(embeds_dict.items()))
embeds = embeds.to(text_encoder.dtype)
new_token
Output
'<watercolor-portrait>'
# Register the new token with the tokenizer and plug its learned embedding
# into the text encoder's embedding table
assert tokenizer.add_tokens(new_token) == 1, "The token already exists!"
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids(new_token)
text_encoder.get_input_embeddings().weight.data[new_token_id] = embeds
torch.manual_seed(1000)
image = pipe("Woman reading in the style of <watercolor-portrait>").images[0]
image

Output

DreamBooth

DreamBooth is named after the idea of having a photobooth where you enter your dreams, placing yourself (or your dog) anywhere you want using image generation. It is the evolution of the textual inversion idea, but it fine-tunes the whole model instead of only the textual embeddings. Before, you focused on making CLIP aware of your identity; now, you can make the entire Stable Diffusion pipeline aware of your existence. 🙂

Check the official paper, “DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation,” for more details, and Hugging Face’s notebook from the Diffusers course for a working example.

In this article, we dissected each component of the diffusion model and generated images using the MNIST dataset to illustrate the diffusion process. In the next post in the Diffusion Model series, we’ll build a model from scratch and leverage noise prediction to accelerate image generation.

About the Author

Daniel is a data scientist, developer, speaker, writer, and teacher. He is the author of the “Deep Learning with PyTorch Step-by-Step” series of books, and for several years he taught machine learning and distributed computing technologies at Data Science Retreat, the longest-running Berlin-based bootcamp, helping more than 150 students advance their careers.
He has been a speaker at the Open Data Science Conference since 2019, delivering PyTorch and Generative Adversarial Networks (GANs) workshops for beginners. Daniel is also the main contributor to HandySpark, a Python package developed to allow easier data exploration using Apache Spark.
His professional background includes 20 years of experience working for companies in several industries: banking, government, fintech, retail, and mobility. He won four consecutive awards (2012, 2013, 2014, 2015) at the prestigious Prêmio do Tesouro Nacional (the Brazilian National Treasury Award).
