📚 Infinite-Story: A Training-Free Consistent Text-to-Image Generation

DGIST
🎉 AAAI 2026 Oral 🎉

*Equal Contribution, Corresponding Author
Infinite-Story Teaser

Abstract

We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored to multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts, and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance while offering over 6× faster inference (1.72 seconds per image) than the fastest existing consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.
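
A rough sketch of the Identity Prompt Replacement idea is given below. It is a hedged reading of the abstract, not the released implementation: the identity tokens of every story prompt are overwritten with the embedding of a single, context-free anchor prompt, so that the surrounding prompt context cannot drift the identity attributes produced by the text encoder. The function name, tensor shapes, and the identity_slices argument are hypothetical.

    # Minimal sketch of the Identity Prompt Replacement idea, written against plain PyTorch.
    # This is an assumption drawn from the abstract, not the authors' code; the function
    # name, tensor shapes, and the identity_slices argument are hypothetical.
    import torch

    def identity_prompt_replacement(prompt_embs: torch.Tensor,
                                    anchor_emb: torch.Tensor,
                                    identity_slices: list) -> torch.Tensor:
        """Overwrite each prompt's identity tokens with the anchor prompt's embedding.

        prompt_embs:     (N, L, D) contextual embeddings of the N story prompts.
        anchor_emb:      (L, D) embedding of an identity-only anchor prompt.
        identity_slices: token span of the identity phrase in each prompt.
        """
        out = prompt_embs.clone()
        for i, span in enumerate(identity_slices):
            # Re-using the anchor's (context-free) identity embedding keeps identity
            # attributes aligned regardless of the surrounding prompt context.
            out[i, span] = anchor_emb[span]
        return out

    # Toy usage with random tensors standing in for real text-encoder outputs.
    N, L, D = 4, 77, 768
    prompts = torch.randn(N, L, D)
    anchor = torch.randn(L, D)
    spans = [slice(2, 6)] * N   # assume the identity phrase occupies tokens 2..5
    aligned = identity_prompt_replacement(prompts, anchor, spans)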

✨ Method Overview

Overall pipeline of our method. The text encoder \( E_T \) processes a set of text prompts \( \mathbf{t} \), producing contextual embeddings \( \mathbf{T} \) that condition the transformer. Before generation, Identity Prompt Replacement is applied to \( \mathbf{T} \) to ensure consistent identity representation across prompts. During generation, Unified Attention Guidance (UAG), which consists of Adaptive Style Injection and Synchronized Guidance Adaptation, is applied to early-stage self-attention layers to achieve consistent identity appearance and overall style alignment while preserving prompt fidelity. The transformer autoregressively produces residual feature maps, which are decoded into the final image \( \mathbf{I} \) via the image decoder.
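
The caption above states that Unified Attention Guidance (UAG) is applied to early-stage self-attention layers. The sketch below is a hedged, self-contained illustration of what such a hook could look like: keys and values cached from a reference image's attention are appended to the current image's attention (standing in for Adaptive Style Injection), and a scalar weight rescales how much probability mass those reference tokens receive (standing in for Synchronized Guidance Adaptation). The function name, tensor shapes, and the log-bias weighting are assumptions, not the authors' implementation.

    # Hedged sketch of a UAG-style self-attention hook. Everything here (names, shapes,
    # the log-bias weighting) is an assumption for illustration, not the released code.
    import torch

    def uag_self_attention(q, k, v, ref_k=None, ref_v=None, inject_weight=1.0):
        """q, k, v: (B, H, T, d). ref_k, ref_v: cached reference keys/values (B, H, Tr, d)."""
        scale = q.shape[-1] ** -0.5
        if ref_k is not None:
            # Adaptive Style Injection (sketch): also attend over the reference tokens.
            k = torch.cat([k, ref_k], dim=2)
            v = torch.cat([v, ref_v], dim=2)
            scores = (q @ k.transpose(-2, -1)) * scale
            # Synchronized Guidance Adaptation (sketch): a log-space bias controls how much
            # attention the reference tokens receive; inject_weight=0 disables them entirely.
            bias = torch.zeros(k.shape[2], device=q.device)
            bias[-ref_k.shape[2]:] = torch.log(torch.tensor(float(inject_weight)))
            scores = scores + bias
        else:
            scores = (q @ k.transpose(-2, -1)) * scale
        return scores.softmax(dim=-1) @ v

    # Toy usage: 64 query tokens of the current image attend over their own 64 tokens
    # plus 64 cached reference tokens, with the reference down-weighted by 0.5.
    B, H, T, d = 1, 8, 64, 40
    q, k, v = torch.randn(B, H, T, d), torch.randn(B, H, T, d), torch.randn(B, H, T, d)
    ref_k, ref_v = torch.randn(B, H, T, d), torch.randn(B, H, T, d)
    out = uag_self_attention(q, k, v, ref_k, ref_v, inject_weight=0.5)   # (1, 8, 64, 40)

Consistent with the caption, such a hook would be installed only on the early-stage self-attention layers, leaving the later stages free to follow each individual prompt; the scheduling of inject_weight across scales is left to the caller in this sketch.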

Key Components

BibTeX


      @article{park2025infinite,
        title={Infinite-Story: A Training-Free Consistent Text-to-Image Generation},
        author={Park, Jihun and Lee, Kyoungmin and Gim, Jongmin and Jo, Hyeonseo and Oh, Minseok and Choi, Wonhyeok and Hwang, Kyumin and Kim, Jaeyeul and Choi, Minwoo and Im, Sunghoon},
        journal={arXiv preprint arXiv:2511.13002},
        year={2025}
      }