LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing

1University of Verona, 2Fondazione Bruno Kessler, 3Polytechnic of Turin

LOTS: Compose your outfit through text and sketch pairing.

Our Contributions

Localized sketch-text image generation

Advancing state-of-the-art conditioning with multiple localized sketch-text pairs and a global description.

LOcalized Text and Sketch adapter

A novel diffusion adapter mitigating attribute confusion via modularized, paired attention-based processing.

The Sketchy dataset

A new fashion dataset to facilitate model training and evaluation for the localized sketch-text image generation problem.

State-of-the-Art performance

LOTS achieves state-of-the-art performance in image quality, sketch-text conditioning and attribute localization.


Abstract

Fashion design is a complex creative process that blends visual and textual expression. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, which capture material, texture, and stylistic details.

In this paper, we present LOcalized Text and Sketch (LOTS), an approach for compositional sketch-text based generation of complete fashion outfits. LOTS conditions generation on a global description together with paired localized sketch-text information, and introduces a novel multi-step merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process.

To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple sketch-text pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.


Background

In fashion design, sketches and natural-language descriptions associated with the same garment convey complementary information for depicting the final design. Since a complete outfit is composed of several clothing garments, multiple sketch-text descriptions are typically collected together to outline it. Each sketch-text pair specifies a localized part of the design, in terms of silhouette shape, materials, and textural details, allowing fine-grained localized control over the generation.

We frame this problem as a conditional image generation task, where the conditioning consists of a set of localized sketch-text pairs. LOTS is designed to enable fashion image generation with an unprecedented level of localized control.
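Concretely, the conditioning for a single image can be pictured as a set of per-garment sketch-text pairs plus one global description. The minimal Python sketch below illustrates this structure; the class and field names are our own illustrative assumptions, not the actual LOTS interface.

from dataclasses import dataclass
from typing import List

import torch


@dataclass
class SketchTextPair:
    # One localized conditioning unit: a single garment's sketch and its description.
    sketch: torch.Tensor  # (1, H, W) grayscale garment sketch
    text: str             # localized description, e.g. "red silk blouse with puff sleeves"


@dataclass
class LOTSCondition:
    # Full conditioning for one generated outfit image.
    pairs: List[SketchTextPair]  # one entry per garment (top, skirt, shoes, ...)
    global_text: str             # overall description: style, background, pose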

Figure 1: Fundamental difference between previous methods and our approach. LOTS represents the natural evolution of fashion design methodologies, progressing from global text and sketches (IP-Adapter) to localized sketches with global text (Multi-T2I). Our approach leverages a global description (omitted here for brevity) alongside a set of localized sketch-text pairs (the coloured boxes), effectively defining both the layout and appearance of individual garment items.


Method

We propose LOTS, a novel approach leveraging multiple localized sketch-text pairs for image conditioning.

[Figure: LOTS method overview]

1. Pair-Centric Representation

The Modularized Pair-Centric Representation module independently encodes sketches and text into a shared latent space, preserving localized semantics and minimizing cross-pair information leakage. The Pair-former then integrates sketch and text features within each pair, enabling spatially grounded alignment and accurate modeling of fine-grained, instance-specific attributes through sketch-informed structural guidance.
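To make the pair-centric processing concrete, the sketch below shows one plausible attention block that fuses text and sketch tokens strictly within each pair: pairs are stacked on the batch axis, so attention can never mix features across garments. Module names and dimensions are assumptions for illustration, not the released LOTS code.

import torch
import torch.nn as nn


class PairFormer(nn.Module):
    # Fuses sketch and text tokens within each pair only; stacking pairs on the
    # batch axis keeps attention strictly intra-pair, which is what mitigates
    # attribute confusion across garments (illustrative sketch).

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens: torch.Tensor, sketch_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (P, Nt, D), sketch_tokens: (P, Ns, D), with P = number of pairs.
        attended, _ = self.cross_attn(
            query=self.norm_q(text_tokens), key=sketch_tokens, value=sketch_tokens
        )
        x = text_tokens + attended  # text queries attend to sketch structure
        return x + self.ffn(self.norm_ffn(x))

The resulting per-pair token sequences stay separate until the diffusion stage, where they are handed over as localized conditioning.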

2. Deferred Diffusion Pair Guidance

The localized pair representations are fed as conditioning inputs to a pre-trained diffusion model, alongside a global textual representation specifying general appearance properties (style, background). Rather than merging the pairs before generation, our approach defers the merging to the diffusion process itself, distributing it across multiple denoising steps via a cross-attention strategy.
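The loop below is a minimal sketch of this idea, assuming a diffusers-style UNet and scheduler interface (encoder_hidden_states, scheduler.step); the actual LOTS guidance schedule and attention routing may differ.

import torch


@torch.no_grad()
def denoise_with_pair_guidance(unet, scheduler, latents, pair_tokens, global_tokens):
    # pair_tokens:   (B, P * N, D) localized tokens from the Pair-former
    # global_tokens: (B, Ng, D)    encoded global description (style, background)
    # Pairs are merged *during* denoising: every step cross-attends to the
    # concatenated global and per-pair tokens instead of a pre-fused vector.
    cond = torch.cat([global_tokens, pair_tokens], dim=1)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents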


Sketchy Dataset

[Figure: Sketchy dataset construction pipeline]

Starting from whole-body garment (light colors) and garment-part (dark shades) annotations, we build a hierarchical structure by pairing each garment-part annotation with its related whole-body garment. We then use this structure to generate garment-level sketches and natural-language descriptions with off-the-shelf models.
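As an illustration, one simple way to realize this pairing is to assign each garment-part box to the whole-body garment that covers it most. The heuristic and annotation field names below are our assumptions, not Fashionpedia's exact schema or the pairing rule used for Sketchy.

def part_coverage(part_box, garment_box):
    # Fraction of the part box (x, y, w, h) covered by the garment box.
    px, py, pw, ph = part_box
    gx, gy, gw, gh = garment_box
    ix = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    iy = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    return (ix * iy) / (pw * ph) if pw * ph > 0 else 0.0


def build_hierarchy(whole_body_anns, part_anns, min_coverage=0.5):
    # Attach every garment-part annotation to the whole-body garment that best
    # contains it, yielding the garment -> parts hierarchy described above.
    hierarchy = {ann["id"]: {"garment": ann, "parts": []} for ann in whole_body_anns}
    for part in part_anns:
        best_id, best_cov = None, min_coverage
        for ann in whole_body_anns:
            cov = part_coverage(part["bbox"], ann["bbox"])
            if cov > best_cov:
                best_id, best_cov = ann["id"], cov
        if best_id is not None:
            hierarchy[best_id]["parts"].append(part)
    return hierarchy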


Qualitative Results

[Figure: qualitative results]

Promo Video


Related Links

AIDA Project

AI-Driven Intelligent Diagnostics and Analytics in Healthcare.

Acknowledgements

This study was supported by LoCa AI, funded by Fondazione CariVerona (Bando Ricerca e Sviluppo 2022/23), PNRR FAIR - Future AI Research (PE00000013), and Italiadomani (PNRR, M4C2, Investimento 3.3), funded by NextGeneration EU. We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support. We also acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum5, hosted by BSC, Spain. Finally, we acknowledge HUMATICS, a SYS-DAT Group company, for their valuable contribution to this research.

BibTeX

@inproceedings{girella2025lots,
  author    = {Girella, Federico and Talon, Davide and Liu, Ziyue and Ruan, Zanxi and Wang, Yiming and Cristani, Marco},
  title     = {LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing},
  booktitle = {Proceedings of the International Conference on Computer Vision},
  year      = {2025},
}