Transform-a-Pic: Stable Diffusion and ControlNet based model to transform Asset Images

Shunya Vichaar
5 min read · Jun 10, 2024

source — DALLE

Ever wondered how your favorite jacket would look if it had a different style or pattern? Well, hold on to your hats (or jackets, in this case) because we’re diving into a nifty tool that does just that! Let’s break down the magic of the “Change Asset Style Tool” and see how it works with some fun examples.

Tech Behind the Tool

Our tool combines several advanced technologies, including computer vision, machine learning, and natural language processing. Here’s a breakdown of the components used:

  1. ControlNet Model and Stable Diffusion Pipeline: ControlNet is a deep learning model that conditions image generation on the structural features of an input image (here, its edge map), so the content and layout are preserved while the style changes. The Stable Diffusion pipeline generates high-quality images from text prompts. Together, they form a duo capable of creating striking visual transformations.
  2. Canny Edge Detection: A popular edge detection algorithm that highlights the edges within an image, which helps in defining the structure that needs to be preserved during the style transformation.
  3. Morphological Operations: Techniques used to process images based on shapes. These operations are critical for closing small gaps in the edges detected by the Canny algorithm.

Step-by-Step Breakdown

Loading the Models

The first step involves loading the pre-trained ControlNet and Stable Diffusion models. These models are capable of generating high-quality images based on given prompts and are optimized for efficient performance using techniques like model CPU offload and memory-efficient attention.

import cv2
import numpy as np
from PIL import Image
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import gradio as gr

# Load the Canny-conditioned ControlNet and attach it to a Stable Diffusion pipeline
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

# Use the UniPC scheduler for faster sampling
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Offload idle model components to CPU and use memory-efficient attention to reduce VRAM usage
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()

Closing the Boundary: A Lesson in Image Processing

Our first function, close_boundary, closes small gaps in the Canny edge-detected image. This is essential for creating a clean, continuous edge map, which is crucial for accurate cropping later.

def close_boundary(canny_image, kernel_size, iterations):
    # Morphological closing (dilation followed by erosion) fills small breaks in the detected edges
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    closed_boundary = cv2.morphologyEx(canny_image, cv2.MORPH_CLOSE, kernel, iterations=iterations)
    return closed_boundary

Kernel Size and Iterations:

  • Kernel Size: Think of the kernel as a tiny, square cookie cutter that moves over your image, shaping it according to your needs. A larger kernel size means a bigger cookie cutter, affecting more pixels at once.
  • Iterations: This parameter tells us how many times we apply the cookie cutter action. More iterations result in more pronounced effects; the quick sketch below shows how both parameters change the result.
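To build intuition for these two parameters, here is a minimal, standalone sketch (the file name sample.png and the Canny thresholds are placeholder assumptions) that closes the same edge map with a few kernel size and iteration combinations:

import cv2
import numpy as np

# Placeholder input file; any photo with a clear object works
image = cv2.imread("sample.png")
edges = cv2.Canny(image, 100, 200)

# Try a few kernel size / iteration combinations and compare how much gets filled in
for kernel_size, iterations in [(2, 1), (5, 1), (5, 3)]:
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel, iterations=iterations)
    print(kernel_size, iterations, int(np.count_nonzero(closed)))  # more edge pixels means more gaps closed

Larger kernels and extra iterations close bigger gaps, but they can also merge edges that should stay separate, so modest values (the sliders in the UI below default to 2) are usually a good starting point.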

Cropping the Object: Precision and Detail

Next, we need to extract the object from the generated image using the edge map. This is where our crop_object_from_generated function comes into play.

def crop_object_from_generated(generated_image, canny_image, kernel_size, iterations):
    generated_image_np = np.array(generated_image)
    closed_canny_image = close_boundary(canny_image, kernel_size, iterations)

    # The largest external contour is assumed to be the main object
    contours, _ = cv2.findContours(closed_canny_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest_contour = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest_contour)
    cropped_image = generated_image_np[y:y+h, x:x+w]

    # Build a filled mask of the contour and use it as the alpha channel
    mask = np.zeros_like(closed_canny_image)
    cv2.drawContours(mask, [largest_contour], -1, 255, thickness=cv2.FILLED)
    mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGRA)
    mask[:, :, 3] = mask[:, :, 0]

    cropped_image_with_alpha = cv2.cvtColor(cropped_image, cv2.COLOR_RGB2RGBA)
    cropped_image_with_alpha[:, :, 3] = mask[y:y+h, x:x+w, 3]
    return cropped_image_with_alpha

This function identifies the largest contour in the edge map, which usually corresponds to the main object. It then creates a mask and extracts the object from the image, preserving its alpha (transparency) channel.
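For reference, a cropped result can be written out as a transparent PNG roughly like this; generated_image and canny_image here stand in for one image from the pipeline output and its Canny edge map, both produced later in process_image:

# Placeholder inputs: one PIL image from the pipeline and the matching Canny edge map
cropped = crop_object_from_generated(generated_image, canny_image, kernel_size=3, iterations=2)

# The result is an RGBA numpy array; the alpha channel hides everything outside the object
Image.fromarray(cropped).save("cropped_object.png")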

Resizing Images: Keeping It Proportional

To ensure that our images look good and fit within our interface, we need to resize them while maintaining their aspect ratios.

def resize_image(image, target_height):
    # Scale the width by the original aspect ratio so proportions are preserved
    aspect_ratio = image.width / image.height
    new_width = int(target_height * aspect_ratio)
    return image.resize((new_width, target_height))

This function keeps the proportions of the image intact, preventing any awkward stretching or squishing.
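As a quick sanity check, a hypothetical 1024 × 768 image resized to a target height of 256 keeps its 4:3 proportions:

img = Image.new("RGB", (1024, 768))  # placeholder image
resized = resize_image(img, 256)
print(resized.size)  # (341, 256), since int(256 * 1024 / 768) == 341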

The Main Event: Processing the Image

Now comes the main function that ties everything together: process_image. This function takes the input image and style prompts, processes them, and returns the stylized and cropped images.

def process_image(input_image, major_style, style1, style2, style3, style4, kernel_size, iterations):
    # Open and resize the original image to a height of 512 while maintaining aspect ratio
    original_image = Image.open(input_image).convert("RGB")
    aspect_ratio = original_image.width / original_image.height
    new_width = int(512 * aspect_ratio)
    original_image = original_image.resize((new_width, 512))

    # Convert the image to a numpy array for processing
    image_np = np.array(original_image)

    # Create the Canny edge image and close small gaps in its boundary
    low_threshold = 100
    high_threshold = 200
    canny_image = cv2.Canny(image_np, low_threshold, high_threshold)
    closed_canny_image = close_boundary(canny_image, kernel_size, iterations)

    # Convert the single-channel Canny image to RGB for ControlNet input
    canny_image_rgb = np.stack([closed_canny_image] * 3, axis=-1)
    canny_image_pil = Image.fromarray(canny_image_rgb)

    # Build one prompt per style and use a fixed seed per prompt for reproducibility
    prompt_base = major_style
    styles = [style1, style2, style3, style4]
    prompts = [style + " " + prompt_base for style in styles]
    generator = [torch.Generator(device="cuda").manual_seed(2) for _ in range(len(prompts))]

    # Generate images using the ControlNet-conditioned pipeline
    output = pipe(
        prompts,
        canny_image_pil,
        negative_prompt=["human hands, human dummy, human body, face, neck, arms, monochrome, lowres, worst quality, low quality, hallucinate"] * len(prompts),
        num_inference_steps=50,
        generator=generator,
    )

    # Crop the main object out of each generated image
    cropped_images = [crop_object_from_generated(img, closed_canny_image, kernel_size, iterations) for img in output.images]

    # Resize results to a height of 256 pixels while maintaining aspect ratio
    canny_image_resized = resize_image(canny_image_pil, 256)
    cropped_images_resized = [resize_image(Image.fromarray(img), 256) for img in cropped_images]

    # Return the Canny preview and the gallery of cropped results
    return canny_image_resized, cropped_images_resized

This function does the heavy lifting: it preprocesses the image, creates edge maps, generates stylized images, crops the objects, and resizes the results for display.
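If you want to exercise the pipeline without the UI, a direct call would look roughly like this (the file name and prompts are made-up examples, and a CUDA-capable GPU is assumed, matching the generator setup above):

# Placeholder asset image and style prompts
canny_preview, variations = process_image(
    "jacket.png",
    "denim jacket on a plain background",
    "floral pattern", "black leather", "camouflage print", "plaid",
    kernel_size=3,
    iterations=2,
)

# canny_preview is a PIL image; variations is a list of resized RGBA PIL images
for i, img in enumerate(variations):
    img.save(f"variation_{i}.png")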

Building the Interface: A Gateway to Art

Finally, we create a user-friendly interface using Gradio. This allows anyone to interact with our tool without needing to understand the underlying code.

interface = gr.Interface(
    fn=process_image,
    inputs=[
        gr.Image(type="filepath", label="Upload Image"),
        gr.Textbox(label="Major Style Prompt"),
        gr.Textbox(label="Style 1 Prompt"),
        gr.Textbox(label="Style 2 Prompt"),
        gr.Textbox(label="Style 3 Prompt"),
        gr.Textbox(label="Style 4 Prompt"),
        gr.Slider(1, 10, step=1, value=2, label="Kernel Size"),
        gr.Slider(1, 10, step=1, value=2, label="Iterations"),
    ],
    outputs=[
        gr.Image(label="Canny Image"),
        gr.Gallery(label="Cropped Images", columns=2),
    ],
    title="Stylize and Crop Images",
    description="Upload an image, enter style prompts, and get stylized and cropped images.",
)

# Launch the interface with a public share link
interface.launch(share=True)

This interface includes:

  • Image Upload: Upload your image file.
  • Text Prompts: Enter the style prompts for image transformation.
  • Sliders: Adjust kernel size and iterations for boundary closing.
  • Outputs: View the Canny image and the gallery of stylized and cropped images.

Gradio UI for the Asset Style Tool

Wrapping Up

In a nutshell, the “Change Asset Style Tool” is a blend of computer vision, machine learning, and user interface design in Python. It’s an example of how technology can be used to create art, making it accessible to everyone.

Whether you’re a machine learning engineer marveling at the intricate algorithms or an artist looking to transform your work, this tool offers something for everyone. And remember, as they say in the world of AI, “May the models be ever in your favor!”

So, what are you waiting for? Upload an image, play around with styles, and watch as your creations come to life.

Written by Shunya Vichaar

Imagination + Science = Discovery
