BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Yuang Ai*1,2,4, Jiaming Han*1,2, Shaobin Zhuang*1,3, Weijia Mao1,5, Xuefeng Hu1,
Ziyan Yang1, Zhenheng Yang1, Huaibo Huang4†, Xiangyu Yue2†, Hao Chen*1†‡
*Equal contribution · †Corresponding author · ‡Project lead
1ByteDance, 2MMLab, The Chinese University of Hong Kong, 3Shanghai Jiao Tong University
4Institute of Automation, Chinese Academy of Sciences, 5National University of Singapore
High-resolution BitDance samples

Abstract

For visual generation, discrete autoregressive models often struggle with poor tokenizer reconstruction, difficulty sampling from large vocabularies, and slow token-by-token generation. We present BitDance, which addresses these challenges with a large-vocabulary binary tokenizer, a binary diffusion head for sampling in a large discrete space, and a next-patch diffusion paradigm that enables efficient multi-token prediction. BitDance is an open-source discrete autoregressive foundation model with 14B parameters, trained on large-scale multimodal tokens. While keeping the standard language modeling paradigm for text tokens, BitDance uses next-patch diffusion for visual tokens to predict multiple tokens in parallel, up to 64 per step. This unified multimodal framework is simple, scalable, and capable of efficiently generating high-resolution, photorealistic images.


Speed of BitDance
Performance vs. efficiency compared with SOTA diffusion models and autoregressive models.

Method

Architecture of BitDance

BitDance unifies a binary visual tokenizer, a diffusion-based sampling head, and a next-patch prediction strategy into a single autoregressive pipeline. Images are encoded into compact binary latents, flattened into patch-wise sequences, and decoded with joint multi-token sampling to balance quality and speed at high resolution.

The design targets three bottlenecks in autoregressive vision models: limited token expressiveness, sampling instability in large vocabularies, and slow decoding. BitDance addresses these with a 2^256 binary latent space, a diffusion head that models bit correlations directly, and patch-wise parallel prediction that preserves spatial dependencies while accelerating generation.

Binary Visual Tokenizer

BitDance scales token entropy via group-wise lookup-free quantization. By partitioning channels into groups for the entropy calculation, the tokenizer can grow to a 2^256 vocabulary while staying memory efficient and preserving reconstruction fidelity comparable to continuous VAEs. Tokens are learned with entropy regularization to prevent codebook collapse, and the grouping strategy keeps training tractable even as the vocabulary grows exponentially. This yields compact sequences with strong reconstruction quality, reducing error accumulation in long autoregressive chains. Reconstruction results are reported in the table below.
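The idea of lookup-free quantization with a group-wise entropy estimate can be illustrated with a small sketch. It assumes sign quantization per channel and an entropy computed from hard code counts over a batch; the function names and the group size are illustrative and not taken from the BitDance code, and the actual training loss may use soft assignments instead.

```python
import math
from collections import Counter

def lfq_quantize(latent):
    """Lookup-free quantization: each channel becomes one bit (+1 or -1)."""
    return [1 if x >= 0 else -1 for x in latent]

def bits_to_index(bits):
    """Implicit codebook index of a bit vector (bit i has weight 2**i)."""
    return sum((1 << i) for i, b in enumerate(bits) if b > 0)

def groupwise_entropy(codes, group_size):
    """Average per-group entropy (in bits) of the empirical code usage.

    Each code is split into groups of `group_size` channels, so every
    histogram has at most 2**group_size bins instead of 2**len(code).
    """
    num_groups = len(codes[0]) // group_size
    total = 0.0
    for g in range(num_groups):
        lo, hi = g * group_size, (g + 1) * group_size
        counts = Counter(bits_to_index(c[lo:hi]) for c in codes)
        n = sum(counts.values())
        total += -sum((k / n) * math.log2(k / n) for k in counts.values())
    return total / num_groups

# Four 4-bit codes that use every 2-bit pattern once in each group.
codes = [lfq_quantize(v) for v in (
    [0.3, 0.8, 0.1, 0.5],
    [0.2, -0.4, 0.9, -0.6],
    [-0.7, 0.1, -0.2, 0.3],
    [-0.5, -0.9, -0.1, -0.8],
)]
print(groupwise_entropy(codes, group_size=2))  # 2.0, the maximum for 2-bit groups
```

With a 256-bit code, a single histogram over all codes would need 2^256 bins, whereas splitting into groups keeps each histogram small; a higher group-wise entropy indicates that more of the code space is in use.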

Method Tokenizer Type Downsample Ratio Codebook Size Compression Ratio PSNR ↑ SSIM ↑
SD-VAE Continuous 8 - 24 23.54 0.68
Cosmos Discrete 16 65536 384 19.93 0.49
Show-o Discrete 16 8192 473 21.34 0.59
LlamaGen Discrete 16 16384 439 20.65 0.54
Open-MAGVIT2 Discrete 16 2^18 341 22.70 0.64
Infinity Discrete 16 2^32 192 22.70 -
BitDance-Tok (Ours) Discrete 16 2^32 192 24.90 0.72
WeTok Discrete 32 2^32 768 20.77 0.55
DC-AE Continuous 32 - 48 24.81 0.69
DC-AE-SANA Continuous 32 - 48 24.72 0.69
BitDance-Tok (Ours) Discrete 32 2^128 192 23.26 0.67
BitDance-Tok (Ours) Discrete 32 2^256 96 25.29 0.74
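The compression ratios of the discrete tokenizers in the table appear consistent with a simple bit count: the raw bits of one RGB patch (8 bits per channel) divided by the bits carried by one token, where a codebook of 2^b entries carries b bits. The sketch below reproduces those values under that assumption; the helper name is illustrative and the formula is inferred from the table rather than stated by the authors.

```python
import math

def compression_ratio(downsample, codebook_bits, channels=3, bits_per_channel=8):
    """Raw bits in one downsample x downsample patch divided by bits per token."""
    raw_bits = downsample * downsample * channels * bits_per_channel
    return raw_bits / codebook_bits

print(round(compression_ratio(16, 32)))                # 192 (BitDance-Tok, 16x, 2^32)
print(round(compression_ratio(32, 128)))               # 192 (BitDance-Tok, 32x, 2^128)
print(round(compression_ratio(32, 256)))               # 96  (BitDance-Tok, 32x, 2^256)
print(round(compression_ratio(16, math.log2(65536))))  # 384 (Cosmos, 65536 entries)
```

Read this way, the 2^256 tokenizer at 32× downsampling spends twice as many bits per unit of image area as the 2^32 tokenizer at 16×, which fits its higher PSNR and SSIM.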

Binary Diffusion Head

Instead of predicting a categorical index, BitDance embeds binary tokens on a continuous hypercube and trains a diffusion head with a velocity-matching objective. This models all bits jointly without exponential classifier parameters and enables high-precision sampling in massive discrete spaces.

The head predicts clean binary latents from noisy inputs and applies a hard sign projection at the end of sampling. This preserves discrete structure while avoiding the independence assumptions and parameter blowups of conventional classification heads.
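A toy version of this sampling loop is sketched below. It assumes a linear noise-to-data path x_t = (1 - t)·noise + t·x1, so that a clean estimate implies the velocity (x1 - x_t) / (1 - t), and it uses plain Euler steps; `predict_clean` is a hypothetical stand-in for the trained head, and the step count is arbitrary. None of these choices are confirmed details of BitDance.

```python
import random

def sample_binary_token(predict_clean, num_bits, num_steps=10, seed=0):
    """Euler-integrate a velocity field from noise, then snap to {-1, +1}.

    `predict_clean(x_t, t)` is a hypothetical stand-in for the trained head:
    it returns an estimate of the clean latent given the noisy latent x_t.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_bits)]  # start from pure noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays below 1, so the division below is safe
        x1_hat = predict_clean(x, t)
        # Velocity implied by the clean estimate on the linear path.
        velocity = [(c - xi) / (1.0 - t) for c, xi in zip(x1_hat, x)]
        x = [xi + dt * v for xi, v in zip(x, velocity)]
    # Hard sign projection onto a vertex of the hypercube.
    return [1 if xi >= 0 else -1 for xi in x]

# Usage with an oracle "head" that always predicts a fixed target token.
target = [1, -1, -1, 1, 1, -1, 1, -1]
print(sample_binary_token(lambda x_t, t: target, num_bits=len(target)))
```

The cost of the loop grows linearly with the number of bits, whereas a categorical classifier over the same space would need one logit per codebook entry.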


Next-Patch Diffusion

For faster decoding, BitDance predicts a patch of tokens in parallel using a block-wise causal mask that preserves dependencies within a patch. A lightweight DiT-style head aligns the training objective with joint multi-token sampling, improving both quality and throughput. Patch-wise raster ordering maintains global autoregressive structure while enabling intra-patch visibility. This closes the training-inference gap seen in parallel AR methods that sample tokens independently at inference time.
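The ordering and masking described above can be sketched as follows, assuming square patches and a token grid whose sides are divisible by the patch side. Both helper names are illustrative, and the mask shows only the visibility pattern, not how BitDance wires it into the transformer.

```python
def patch_raster_order(height, width, patch):
    """Grid coordinates in patch-wise raster order.

    Patches are visited left to right, top to bottom; tokens inside a
    patch are visited in raster order as well.
    """
    order = []
    for pr in range(0, height, patch):
        for pc in range(0, width, patch):
            for r in range(pr, pr + patch):
                for c in range(pc, pc + patch):
                    order.append((r, c))
    return order

def block_causal_mask(num_tokens, tokens_per_patch):
    """mask[i][j] is True when token i may attend to token j.

    A token sees every token in its own patch (intra-patch visibility)
    and every token in earlier patches (causality across patches).
    """
    return [
        [(j // tokens_per_patch) <= (i // tokens_per_patch) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

# A 4x4 token grid with 2x2 patches gives 4 patches of 4 tokens each.
order = patch_raster_order(4, 4, 2)
mask = block_causal_mask(len(order), tokens_per_patch=4)
print(order[:4])   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(mask[0][3])  # True: same patch
print(mask[0][4])  # False: a later patch is hidden
```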


Results

Class-conditional Image Generation

ImageNet 256×256 (Class-Conditional)

Evaluation of class-conditional image generation on ImageNet 256×256.

Model Type Order Tokenizer #Params FID ↓ IS ↑ Pre. ↑ Rec. ↑
Continuous Tokens
DiT-XL/2 Diff. - VAE 675M 2.27 278.2 0.83 0.57
SiT-XL/2 Diff. - VAE 675M 2.06 277.5 0.83 0.59
DiCo-XL Diff. - VAE 701M 2.05 282.2 0.83 0.59
MDTv2 Mask.+Diff. - VAE 675M 1.58 314.7 0.79 0.65
REPA Diff. - VAE 675M 1.42 305.7 0.80 0.65
RAE Diff. - RAE 675M 1.13 262.6 0.78 0.67
MAR-B Mask. random VAE 208M 2.31 281.7 0.82 0.57
MAR-L Mask. random VAE 479M 1.78 296.0 0.81 0.60
MAR-H Mask. random VAE 943M 1.55 303.7 0.81 0.62
SphereAR-B AR raster VAE 208M 1.92 277.8 0.81 0.61
SphereAR-L AR raster VAE 479M 1.54 295.9 0.80 0.63
SphereAR-H AR raster VAE 943M 1.34 300.0 0.80 0.64
xAR-B AR+Diff. raster VAE 172M 1.72 280.4 0.82 0.59
xAR-L AR+Diff. raster VAE 608M 1.28 292.5 0.82 0.62
xAR-H AR+Diff. raster VAE 1.1B 1.24 301.6 0.83 0.64
Discrete Tokens
LlamaGen-L AR raster VQ 343M 3.07 256.1 0.83 0.52
LlamaGen-XL AR raster VQ 775M 2.62 244.1 0.80 0.57
LlamaGen-XXL AR raster VQ 1.4B 2.34 253.9 0.80 0.59
RandAR-L AR random VQ 343M 2.55 288.8 0.81 0.58
RandAR-XL AR random VQ 775M 2.22 314.2 0.80 0.60
RandAR-XXL AR random VQ 1.4B 2.15 322.0 0.79 0.62
RAR-L AR hybrid VQ 461M 1.70 299.5 0.81 0.60
RAR-XL AR hybrid VQ 955M 1.50 306.9 0.80 0.62
RAR-XXL AR hybrid VQ 1.5B 1.48 326.0 0.80 0.63
OpenMAGVIT2-XL AR raster LFQ 804M 2.51 271.7 0.84 0.54
MAGVIT-v2 Mask. random LFQ 307M 1.78 319.4 - -
VAR-d20 VAR - VQ 600M 2.57 302.6 0.83 0.56
VAR-d30 VAR - VQ 2B 1.92 323.1 0.82 0.59
BitDance-B-1x AR raster LFQ 242M 1.68 297.1 0.80 0.62
BitDance-L-1x AR raster LFQ 527M 1.31 300.2 0.80 0.64
BitDance-H-1x AR raster LFQ 1.0B 1.24 304.4 0.81 0.64

ImageNet 256×256 (Parallel Image Generation)

Overall comparison with parallel image generation methods on ImageNet 256×256.

Model Type Order #Params Steps Throughput ↑ FID ↓ IS ↑ Pre. ↑ Rec. ↑
DiT-XL/2 Diff. - 675M 250 1.06 img/s 2.27 278.2 0.83 0.57
DiCo-XL Diff. - 701M 250 2.62 img/s 2.05 282.2 0.83 0.59
MaskGIT Mask. random 227M 8 50.73 img/s 6.18 182.1 0.80 0.51
MAR-B Mask. random 208M 256 1.83 img/s 2.31 281.7 0.82 0.57
MAR-L Mask. random 479M 256 1.39 img/s 1.78 296.0 0.81 0.60
VAR-d20 VAR - 600M 10 71.31 img/s 2.57 302.6 0.83 0.56
VAR-d24 VAR - 1.0B 10 47.22 img/s 2.09 312.9 0.82 0.59
PAR-L AR hybrid 343M 147 15.01 img/s 3.76 218.9 0.81 0.60
PAR-XL AR hybrid 775M 147 8.09 img/s 2.61 259.2 0.80 0.62
PAR-XXL AR hybrid 1.4B 147 5.17 img/s 2.35 263.2 0.80 0.62
NAR-L AR hybrid 372M 31 40.03 img/s 3.06 263.9 0.81 0.53
NAR-XL AR hybrid 816M 31 23.12 img/s 2.70 277.5 0.81 0.58
NAR-XXL AR hybrid 1.5B 31 15.37 img/s 2.58 293.5 0.82 0.57
RandAR-L AR random 343M 88 25.12 img/s 2.55 288.8 0.81 0.58
RandAR-XL AR random 775M 88 16.01 img/s 2.25 317.8 0.80 0.60
RandAR-XXL AR random 1.4B 88 10.39 img/s 2.15 322.0 0.79 0.62
BitDance-B-4x AR raster 260M 64 24.18 img/s 1.69 291.2 0.79 0.63
BitDance-B-16x AR raster 260M 16 90.26 img/s 1.91 283.8 0.78 0.62
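The step counts of the two BitDance rows are consistent with a simple token count, assuming a tokenizer with 16× downsampling at 256×256 resolution and reading the "4x" and "16x" suffixes as the number of tokens predicted per step. Both assumptions are inferred from the Steps column and are not stated explicitly here.

```python
def decoding_steps(image_size, downsample, tokens_per_step):
    """Number of autoregressive steps needed to emit every visual token."""
    side = image_size // downsample
    num_tokens = side * side
    return -(-num_tokens // tokens_per_step)  # ceiling division

print(decoding_steps(256, 16, 4))   # 64, matching BitDance-B-4x
print(decoding_steps(256, 16, 16))  # 16, matching BitDance-B-16x
```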

Text-to-Image Generation

DPG-Bench

Evaluation of text-to-image generation on DPG-Bench.

Model Global Entity Attribute Relation Other Overall ↑
Proprietary Models
GPT Image 1 88.89 88.94 89.84 92.63 90.96 85.15
Seedream 3.0 94.31 92.65 91.36 92.78 88.24 88.27
Diffusion Models
PixArt-α 86.89 82.89 88.94 86.59 87.68 80.54
FLUX.1-Dev 74.35 90.00 88.96 90.87 88.33 83.84
SD3 Medium 87.90 91.01 88.83 80.70 88.68 84.08
Z-Image-Turbo 91.29 89.59 90.14 92.16 88.68 84.86
BAGEL - - - - - 85.07
HiDream-I1-Full 76.44 90.22 89.48 93.74 91.83 85.89
Lumina-Image-2.0 - 91.97 90.20 94.85 - 87.20
Z-Image 93.39 91.22 93.16 92.22 91.52 88.14
Qwen-Image 91.32 91.56 92.02 94.31 92.73 88.32
Autoregressive Models
Emu3-Gen 85.21 86.68 86.84 90.22 83.15 80.60
Infinity 93.11 - - 90.76 - 83.46
Janus-Pro 86.90 88.90 89.40 89.32 89.48 84.19
Tar 83.98 88.62 88.05 93.98 84.86 84.19
NextStep-1 - - - - - 85.28
GLM-Image 87.74 90.25 89.08 92.15 90.17 84.78
BitDance 89.53 93.76 92.47 91.81 90.26 88.28

GenEval

Evaluation of text-to-image generation on GenEval.

Model Single Obj. Two Obj. Count Colors Pos. Color Attri. Overall ↑
Proprietary Models
GPT Image 1 0.99 0.92 0.85 0.92 0.75 0.61 0.84
Seedream 3.0 0.99 0.96 0.91 0.93 0.47 0.80 0.84
Diffusion Models
PixArt-α 0.98 0.50 0.44 0.80 0.08 0.07 0.48
SD3 Medium 0.98 0.74 0.63 0.67 0.34 0.36 0.62
JanusFlow 0.97 0.59 0.45 0.83 0.53 0.42 0.63
FLUX.1-Dev 0.98 0.81 0.74 0.79 0.22 0.45 0.66
SD3.5-Large 0.98 0.89 0.73 0.83 0.34 0.47 0.71
Lumina-Image-2.0 - 0.87 0.67 - - 0.62 0.73
Show-o2 1.00 0.87 0.58 0.92 0.52 0.62 0.76
Z-Image-Turbo 1.00 0.95 0.77 0.89 0.65 0.68 0.82
HiDream-I1-Full 1.00 0.98 0.79 0.91 0.60 0.72 0.83
Z-Image 1.00 0.94 0.78 0.93 0.62 0.77 0.84
Qwen-Image 0.99 0.92 0.89 0.88 0.76 0.77 0.87
BAGEL 0.98 0.95 0.84 0.95 0.78 0.77 0.88
Autoregressive Models
Emu3-Gen 0.98 0.71 0.34 0.81 0.17 0.21 0.54
Infinity - 0.85 - - 0.49 0.57 0.73
Janus-Pro 0.99 0.89 0.59 0.90 0.79 0.66 0.80
Tar 0.98 0.92 0.83 0.85 0.80 0.65 0.84
NextStep-1 - - - - - - 0.73
BitDance 1.00 0.96 0.71 0.95 0.72 0.83 0.86

OneIG-EN

Evaluation of text-to-image generation on OneIG-EN.

Model Alignment Text Reasoning Style Diversity Overall ↑
Proprietary Models
Imagen 4 0.857 0.805 0.338 0.377 0.199 0.515
Seedream 3.0 0.818 0.865 0.275 0.413 0.277 0.530
GPT Image 1 0.851 0.857 0.345 0.462 0.151 0.533
Diffusion Models
Show-o2 0.817 0.002 0.226 0.317 0.177 0.308
SANA-1.5 0.765 0.069 0.217 0.401 0.216 0.334
BAGEL 0.769 0.244 0.173 0.367 0.251 0.361
FLUX.1-Dev 0.786 0.523 0.253 0.368 0.238 0.434
OmniGen2 0.804 0.680 0.271 0.377 0.242 0.475
HiDream-I1-Full 0.829 0.707 0.317 0.347 0.186 0.477
Z-Image-Turbo 0.840 0.994 0.298 0.368 0.139 0.528
Qwen-Image 0.882 0.891 0.306 0.418 0.197 0.539
Z-Image 0.881 0.987 0.280 0.387 0.194 0.546
Autoregressive Models
Janus-Pro 0.553 0.001 0.139 0.276 0.365 0.267
NextStep-1 0.826 0.507 0.224 0.332 0.199 0.418
GLM-Image 0.805 0.969 0.298 0.353 0.213 0.528
BitDance 0.853 0.937 0.297 0.395 0.177 0.532

OneIG-ZH

Evaluation of text-to-image generation on OneIG-ZH.

Model Alignment Text Reasoning Style Diversity Overall ↑
Proprietary Models
Kolors 2.0 0.738 0.502 0.226 0.331 0.333 0.426
GPT Image 1 0.812 0.650 0.300 0.449 0.159 0.474
Seedream 3.0 0.793 0.928 0.281 0.397 0.243 0.528
Diffusion Models
HiDream-I1-Full 0.620 0.205 0.256 0.304 0.300 0.337
CogView4 0.700 0.193 0.236 0.348 0.214 0.338
BAGEL 0.672 0.365 0.186 0.357 0.268 0.370
Z-Image-Turbo 0.782 0.982 0.276 0.361 0.134 0.507
Qwen-Image 0.825 0.963 0.267 0.405 0.279 0.548
Z-Image 0.793 0.988 0.266 0.386 0.243 0.535
Autoregressive Models
Janus-Pro 0.324 0.148 0.104 0.264 0.358 0.240
GLM-Image 0.738 0.976 0.284 0.335 0.221 0.511
BitDance 0.786 0.961 0.276 0.376 0.159 0.512

TIIF Bench (testmini)

Evaluation of text-to-image generation on TIIF Bench testmini.

Model Overall (Short) Overall (Long) Basic Avg (Short) Basic Avg (Long) Attr. (Short) Attr. (Long) Relation (Short) Relation (Long) Reasoning (Short) Reasoning (Long) Advanced Avg (Short) Advanced Avg (Long) Attr+Rela (Short) Attr+Rela (Long) Attr+Reas (Short) Attr+Reas (Long) Rela+Reas (Short) Rela+Reas (Long) Style (Short) Style (Long) Text (Short) Text (Long) Real World (Short) Real World (Long)
Proprietary Models
Midjourney V7 68.74 65.69 77.41 76.00 77.58 81.83 82.07 76.82 72.57 69.32 64.66 60.53 67.20 62.70 81.22 71.59 60.72 64.59 83.33 80.00 24.83 20.83 68.83 63.61
DALL-E 3 74.96 70.81 78.72 78.50 79.50 79.83 80.82 78.82 75.82 76.82 73.39 67.27 73.45 67.20 72.01 71.34 63.59 60.72 89.66 86.67 66.83 54.83 72.93 60.99
Seedream 3.0 86.02 84.31 87.07 84.93 90.50 90.00 89.85 85.94 80.86 78.86 79.16 80.60 79.76 81.82 77.23 78.85 75.64 78.64 100.00 93.33 97.17 87.78 83.21 83.58
GPT Image 1 89.15 88.29 90.75 89.66 91.33 87.08 84.57 84.57 96.32 97.32 88.55 88.35 87.07 89.44 87.22 83.96 85.59 83.21 90.00 93.33 89.83 86.83 89.73 93.46
Diffusion Models
Lumina-Next 50.93 52.46 64.58 66.08 56.83 59.33 67.57 71.82 69.32 67.07 44.75 45.63 51.44 43.20 51.09 59.72 44.72 54.46 70.00 66.67 0.00 0.83 47.56 49.05
Hunyuan-DiT 51.38 53.28 69.33 69.00 65.83 69.83 78.07 73.82 64.07 63.32 42.62 45.45 50.20 41.57 59.22 61.84 47.84 51.09 56.67 73.33 0.00 0.83 40.10 44.20
PixArt-Σ 62.00 58.12 70.66 75.25 69.33 78.83 75.07 77.32 67.57 69.57 57.65 49.50 65.20 56.57 66.96 61.72 66.59 54.59 83.33 70.00 1.83 1.83 62.11 52.41
SANA 1.5 67.15 65.73 79.66 77.08 79.83 77.83 85.57 83.57 73.57 69.82 61.50 60.67 65.32 56.57 69.96 73.09 62.96 65.84 80.00 80.00 17.83 15.83 71.07 68.83
SD 3 67.46 66.09 78.32 77.75 83.33 79.83 82.07 78.82 71.07 74.07 61.46 59.56 61.07 64.07 68.84 70.34 50.96 57.84 66.67 76.67 59.83 20.83 63.23 67.34
FLUX.1-dev 71.09 71.78 83.12 78.65 87.05 83.17 87.25 80.39 75.01 72.39 65.79 68.54 67.07 73.69 73.84 73.34 69.09 71.59 66.67 66.67 43.83 52.83 70.72 71.47
Z-Image-Turbo 77.73 80.05 81.85 81.59 86.50 87.00 82.88 79.99 76.17 77.77 68.32 74.69 72.04 75.24 60.22 73.33 68.90 71.92 83.33 93.33 83.71 84.62 85.82 77.24
Z-Image 80.20 83.04 78.36 82.79 79.50 86.50 80.45 79.94 75.13 81.94 72.89 77.02 72.91 77.56 66.99 73.82 73.89 75.62 90.00 93.33 94.84 93.21 88.06 85.45
Qwen-Image 86.14 86.83 90.18 87.22 90.50 91.50 88.22 90.78 79.81 79.38 79.30 80.88 79.21 78.94 78.85 81.69 75.57 78.59 100.00 100.00 92.76 89.14 90.30 91.42
Autoregressive Models
LightGen 53.22 43.41 66.58 47.91 55.83 47.33 74.82 45.82 69.07 50.57 46.74 41.53 62.44 40.82 61.71 50.47 50.34 45.34 53.33 53.33 0.00 6.83 50.92 50.55
Infinity 62.07 62.32 73.08 75.41 74.33 76.83 72.82 77.57 72.07 71.82 56.64 54.98 60.44 55.57 74.22 64.71 60.22 59.71 80.00 73.33 10.83 23.83 54.28 56.89
Janus-Pro 66.50 65.02 79.33 78.25 79.33 82.33 78.32 73.32 80.32 79.07 59.71 58.82 66.07 56.20 70.46 70.84 67.22 59.97 60.00 70.00 28.83 33.83 65.84 60.25
GLM-Image 81.01 81.02 - - - - - - - - - - - - - - - - - - - - - -
BitDance 79.64 78.12 78.79 80.44 83.50 87.50 77.22 77.62 75.64 76.21 72.19 71.66 72.89 78.88 68.06 66.10 70.63 67.21 96.67 90.00 87.78 75.57 84.33 83.96

Citation

@article{ai2026bitdance,
  title   = {BitDance: Scaling Autoregressive Generative Models with Binary Tokens},
  author  = {Ai, Yuang and Han, Jiaming and Zhuang, Shaobin and Mao, Weijia and Hu, Xuefeng and Yang, Ziyan and Yang, Zhenheng and Huang, Huaibo and Yue, Xiangyu and Chen, Hao},
  journal = {arXiv preprint arXiv:xxxx.xxxx},
  year    = {2026}
}