LEDiT: Your Length-Extrapolatable Diffusion Transformer
without Positional Encoding

Shen Zhang1*, Yaning Tan2*, Siyuan Liang1*, Zhaowei Chen1, Linze Li1, Ge Wu3, Yuhao Chen1,
Shuheng Li1, Zhenyu Zhao1, Caihua Chen2, Jiajun Liang1, Yao Tang1
1JIIOV Technology, 2Nanjing University, 3Nankai University
* Indicates equal contribution, † Indicates corresponding author

High-quality Images beyond the Training Resolution

Arbitrary-resolution samples (512x512, 512x256, 256x512, 384x384, 256x256, 128x128). Generated by LEDiT trained on ImageNet 256x256.
LEDiT can generate high-quality images with fine details beyond the training resolution.

Abstract

Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolution. The primary obstacle is that explicit positional encodings (PEs), such as RoPE, must be extrapolated, which degrades performance when the inference resolution differs from the training resolution. In this paper, we propose the Length-Extrapolatable Diffusion Transformer (LEDiT), a simple yet powerful architecture that overcomes this limitation. LEDiT needs no explicit PEs, thereby avoiding extrapolation altogether. Its key innovations are introducing causal attention to implicitly impart global positional information to tokens, while enhancing locality to precisely distinguish adjacent tokens. Experiments on 256x256 and 512x512 ImageNet show that LEDiT can scale the inference resolution to 512x512 and 1024x1024, respectively, while achieving better image quality than current state-of-the-art length-extrapolation methods (NTK-aware scaling, YaRN). Moreover, LEDiT achieves strong extrapolation performance with only 100k fine-tuning steps on a pretrained DiT, demonstrating its potential for integration into existing text-to-image DiTs.
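To make the extrapolation issue concrete, the minimal sketch below shows how a method such as NTK-aware scaling rescales RoPE frequencies once the inference length exceeds the training length; LEDiT sidesteps this rescaling entirely by removing explicit PEs. Function and variable names here are our own illustrative assumptions, not code from the paper.

# A minimal sketch (not from the paper) of NTK-aware RoPE frequency scaling.
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for one attention head."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def ntk_aware_inv_freq(head_dim: int, train_len: int, infer_len: int,
                       base: float = 10000.0) -> torch.Tensor:
    """NTK-aware variant: enlarge the base so the lowest frequency spans the
    longer sequence while the highest frequencies stay nearly unchanged."""
    scale = max(infer_len / train_len, 1.0)
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (new_base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Example: extrapolating from a 16-token to a 32-token grid along one axis.
plain = rope_inv_freq(64)
ntk = ntk_aware_inv_freq(64, train_len=16, infer_len=32)
print(plain[-1], ntk[-1])  # lowest frequency is stretched (longer wavelength) under NTK-aware scaling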


Method


Our LEDiT model does not require explicit positional encoding. The main difference from a standard DiT lies in the incorporation of causal attention and a convolution after patchification.
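As a rough illustration of these two ingredients, the sketch below combines causal self-attention (with no positional encoding added) and a depthwise convolution over the reshaped token grid. This is a minimal sketch under our own assumptions; module names, hyperparameters, and layer placement are illustrative and do not reproduce the authors' released implementation.

# Illustrative block: causal attention for implicit global position + conv for locality.
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depthwise conv over the 2D token grid injects locality without any PE.
        self.conv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, grid_hw):
        # x: (B, N, C) patch tokens in raster order; no positional encoding is added.
        b, n, c = x.shape
        h, w = grid_hw
        # Causal mask: token i attends only to tokens <= i, so its position in the
        # raster order is implicitly encoded by how many tokens it can see.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        y = self.norm1(x)
        y, _ = self.attn(y, y, y, attn_mask=mask)
        x = x + y
        # Convolution on the reshaped grid sharpens locality between adjacent tokens.
        g = x.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.conv(g).flatten(2).transpose(1, 2)
        return x + self.mlp(self.norm2(x))

tokens = torch.randn(2, 16 * 16, 384)  # e.g. a 256x256 image with patch size 16
out = CausalConvBlock(384, num_heads=6)(tokens, grid_hw=(16, 16))
print(out.shape)  # torch.Size([2, 256, 384])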

Comparison

Qualitative comparison with other methods beyond the training resolution.
Qualitative comparison with other methods when generating non-square images.

Samples beyond the Training Resolution

Arbitrary-resolution samples (512x512, 512x384, 384x512, 384x384). Generated from LEDiT-XL/2 trained on ImageNet 256x256.
Arbitrary-resolution samples (1024x1024, 1024x768, 768x1024, 768x768). Generated from LEDiT-XL/2 trained on ImageNet 512x512.

Component Ablations

Incorporating causal attention yields structurally coherent objects, while further introducing convolution provides adequate high-frequency details.

BibTeX

@article{zhang2025ledit,
  title={LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding},
  author={Zhang, Shen and Tan, Yaning and Liang, Siyuan and Chen, Zhaowei and Li, Linze and Wu, Ge and Chen, Yuhao and Li, Shuheng and Zhao, Zhenyu and Chen, Caihua and Liang, Jiajun and Tang, Yao},
  journal={arXiv preprint arXiv:2503.04344},
  year={2025}
}