2026-01-14 · Research

GLM-Image: Auto-regressive for Dense-Knowledge and High-Fidelity Image Generation

Figure 1: GLM-Image General Showcase

Figure 2: GLM-Image Dense-Knowledge Showcase

Today we are excited to introduce GLM-Image, the first open-source, industrial-grade discrete auto-regressive image generation model. GLM-Image adopts a hybrid architecture that combines an auto-regressive module with a diffusion decoder. The auto-regressive part is partially based on, and initialized from, [GLM-4-9B-0414][1] with 9 billion parameters, while the diffusion decoder follows [CogView4][2] in adopting a single-stream DiT structure with 7 billion parameters. In overall image generation quality, GLM-Image is on par with mainstream latent-diffusion approaches, but it shows significant advantages in text rendering and knowledge-intensive generation scenarios. It performs especially well in tasks requiring precise semantic understanding and complex information expression, while maintaining strong capabilities in high-fidelity, fine-grained detail generation. Beyond text-to-image generation, GLM-Image also supports a rich set of image-to-image tasks, including image editing, style transfer, identity-preserving generation, and multi-subject consistency.

Background: In recent years, diffusion models have become the mainstream approach to image generation thanks to their training stability and strong generalization. Yet even with substantial improvements in diffusion modeling and VAE formulation[3][4][5], end-to-end diffusion models still fall short in complex instruction following and knowledge-intensive scenarios, both in information expression and in semantic alignment. At the same time, some recently released high-quality image generation models have demonstrated outstanding performance in such knowledge-dense cases, producing visually rich detail while exhibiting auto-regressive modeling characteristics. Drawing inspiration from these developments, GLM-Image was designed from the beginning around two decoupled objectives: robust understanding of complex information, and the ability to produce high-quality image details. In our approach, the auto-regressive generator produces tokens carrying low-frequency semantic signals, while the diffusion decoder refines high-frequency details to deliver the final image. This hybrid architecture not only performs reliably in general image generation tasks, but also shows clear advantages in creative work that demands intricate knowledge representation, pushing image generation toward a new stage that combines artistic aesthetics with precision in conveying information.

Techniques

Figure 3: General Pipeline

Visual Token Selection

In previous visual auto-regressive generation models, the token types used have typically fallen into three categories:

  • Visual codes obtained via discrete reconstruction training (VQVAE[6])
  • Visual codes obtained via discrete semantic training (semantic-VQ[7])
  • Statistical semantic features extracted from 1D vectors (as in DALL-E 2[8])

From an information-completeness standpoint, these approaches rank from high to low in the order above, whereas their semantic relevance tends to increase in the reverse order. For visual generation models, the correlation between tokens (or patches) is a crucial factor influencing both model convergence and final output quality. In latent diffusion models, works such as VA-VAE[5] and SSVAE[9] have demonstrated its significance. For auto-regressive generation, a training-loss comparison shows a clearly different magnitude (~7 vs. ~3) for tokens derived from VQVAE versus semantic-VQ at a similar codebook size, suggesting that modeling with semantic tokens offers superior convergence properties for visual generation training. 1D vectors, on the other hand, lack sufficient information completeness and correspondence to a specific image, and are more commonly used in subsequent works for tasks such as subject-consistency generation (e.g., FLUX.1 Redux[10]).

Building on these observations, GLM-Image adopts semantic-VQ as its primary tokenization strategy. Specifically, we implement the tokenizer scheme from X-Omni[7] for better semantic correlation during token modeling, combined with a diffusion decoder that subsequently decodes these tokens into the final image outputs.
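
Conceptually, inference is then a two-stage procedure: the auto-regressive model predicts a sequence of discrete semantic-VQ token ids, and the diffusion decoder reconstructs pixels conditioned on those ids. The following minimal sketch illustrates this flow; `ar_model`, `diffusion_decoder`, and their methods are hypothetical stand-ins, not the released API.

```python
import torch

@torch.no_grad()
def generate_image(prompt_ids, ar_model, diffusion_decoder, num_tokens=1024):
    """Minimal two-stage sketch: AR model -> semantic-VQ ids -> diffusion decoder.

    `ar_model` and `diffusion_decoder` are hypothetical stand-ins for the
    9B auto-regressive module and the 7B single-stream DiT decoder.
    """
    # Stage 1: auto-regressively sample discrete semantic-VQ token ids that
    # encode the low-frequency layout/semantics of the image.
    vq_ids = ar_model.generate(prompt_ids, max_new_tokens=num_tokens)

    # Stage 2: the diffusion decoder treats the VQ ids as conditioning and
    # synthesizes the high-frequency details via flow-matching sampling.
    image = diffusion_decoder.sample(condition_ids=vq_ids)
    return image
```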

Auto-regressive Pre-training

The auto-regressive part of GLM-Image is initialized from GLM-4-9B-0414 and jointly trained on text-to-image and image-to-image generation. We freeze the model's text word embedding layer while keeping the remaining parameters trainable, append an extra vision word embedding layer for vision token projection, and replace the original LM head with a vision LM head for the new task. We adopt MRoPE as the positional embedding to handle interleaved images and text across both text-to-image and image-to-image generation, as illustrated in the figure.
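
The sketch below illustrates this input/output surgery, assuming a HuggingFace-style language model with a `get_input_embeddings()` accessor; the helper name and sizes are hypothetical.

```python
import torch.nn as nn

def attach_vision_io(lm, vision_vocab_size: int, hidden_size: int):
    """Hypothetical helper sketching the I/O changes described above.

    - Freeze the pretrained text word embeddings.
    - Add a separate embedding table for discrete vision tokens.
    - Replace the text LM head with a vision LM head that predicts
      semantic-VQ codes instead of text tokens.
    """
    # Keep the text embedding weights fixed during image-generation training.
    lm.get_input_embeddings().weight.requires_grad_(False)

    # New trainable embedding table for semantic-VQ token ids.
    lm.vision_embed = nn.Embedding(vision_vocab_size, hidden_size)

    # New output head over the vision codebook (replaces the text LM head).
    lm.vision_lm_head = nn.Linear(hidden_size, vision_vocab_size, bias=False)
    return lm
```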

We train the model in multiple resolution stages: 256 px, 512 px, and a mixed-resolution stage spanning 512 px to 1024 px. The X-Omni tokenizer patchifies the image at a 16× compression ratio, so the token count per sample is 256, 1024, and 1024 to 4096 for the three stages, respectively. Since the upscaling factor of the diffusion decoder's final output is set to 32, the resulting image resolution ranges from 1024 px to 2048 px.
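
The token counts and output resolutions quoted above follow directly from the 16× tokenizer compression and the decoder's 32× upscaling factor, as the small calculation below illustrates (square images assumed for simplicity).

```python
def tokens_and_output_size(side_px: int, tokenizer_stride: int = 16,
                           decoder_upscale: int = 32):
    """Token count per (square) sample and final output side length.

    e.g. 512 px  -> (512/16)^2  = 1024 tokens -> 32 * 32 = 1024 px output,
         1024 px -> (1024/16)^2 = 4096 tokens -> 64 * 32 = 2048 px output.
    """
    tokens_per_side = side_px // tokenizer_stride
    num_tokens = tokens_per_side ** 2
    output_side_px = tokens_per_side * decoder_upscale
    return num_tokens, output_side_px

for side in (256, 512, 1024):
    print(side, tokens_and_output_size(side))
# 256 -> (256 tokens, 512 px), 512 -> (1024 tokens, 1024 px), 1024 -> (4096 tokens, 2048 px)
```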

In the initial 256-token stage of training, we used a straightforward raster-scan token generation order. However, as we advanced to higher-resolution stages, we observed a drop in the controllability of model outputs with the same generation approach. To address this, we adopted a progressive generation strategy[11]: before generating the high-resolution image tokens, we first generate approximately 256 tokens of the same aspect ratio, obtained by tokenizing a down-sampled version of the target image. Because these preliminary tokens largely determine the final image layout yet, due to their small number, might receive insufficient attention, we increased their training weight in subsequent stages, which effectively improved the overall generation quality.
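
A minimal sketch of the up-weighted loss over the preliminary (preview) tokens is shown below; the weighting factor is an illustrative assumption, as the actual value is not stated here.

```python
import torch
import torch.nn.functional as F

def progressive_ar_loss(logits, targets, num_preview_tokens, preview_weight=2.0):
    """Cross-entropy over [preview tokens | high-res tokens], with the
    ~256 low-resolution preview tokens up-weighted (the weight value is an
    illustrative assumption, not the released setting).

    logits:  (seq_len, vocab)  next-token predictions from the AR model
    targets: (seq_len,)        ground-truth semantic-VQ ids
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_token)
    weights[:num_preview_tokens] = preview_weight  # emphasize layout tokens
    return (weights * per_token).sum() / weights.sum()
```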

Decoder Formulation

Figure 4: Decoder Formulation

The diffusion decoder receives semantic-VQ tokens generated by the auto-regressive model as conditional inputs to reconstruct the target image. While semantic-VQ tokens carry rich semantic information, they discard high-frequency image details and primarily present relatively low-frequency image layout information. As a result, the diffusion decoder must retain a certain generative capacity to synthesize and recover the missing fine-grained details.

For the backbone design, we follow CogView4 to adopt a single-stream DiT architecture. The decoder employs flow matching as its diffusion scheduling strategy, ensuring stable training and efficient convergence for high-fidelity image generation. For integration, the semantic-VQ tokens are first passed through a projection layer and then concatenated with the VAE latent representation along the channel dimension. This preserves the input sequence length and incurs almost no extra computational overhead. Since the semantic-VQ tokens already provide sufficient semantic information, we remove the prompt input from the decoder’s conditioning. This design eliminates the need for a large-parameter text encoder, thereby reducing both memory usage and computational cost. To strengthen the decoder’s ability to render complex textual content—particularly Chinese characters—we introduce a lightweight Glyph-byT5[12] model that performs character-level encoding for rendered text regions. The resulting glyph embeddings are concatenated with the vision embeddings along the sequence dimension.
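
The sketch below illustrates this conditioning path under assumed tensor shapes: projected semantic-VQ embeddings are concatenated with the (patchified) VAE latent along the channel dimension, and Glyph-byT5 embeddings are appended along the sequence dimension. Module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class DecoderConditioning(nn.Module):
    """Sketch of the decoder conditioning described above (shapes are assumptions).

    - `vq_embed` maps semantic-VQ code ids to embeddings.
    - Channel-wise concat with the patchified VAE latent keeps the sequence
      length unchanged.
    - Glyph-byT5 embeddings are appended along the sequence dimension.
    """
    def __init__(self, codebook_size, vq_dim, latent_dim, glyph_dim, model_dim):
        super().__init__()
        self.vq_embed = nn.Embedding(codebook_size, vq_dim)
        self.in_proj = nn.Linear(latent_dim + vq_dim, model_dim)
        self.glyph_proj = nn.Linear(glyph_dim, model_dim)

    def forward(self, noisy_latent_tokens, vq_ids, glyph_embeddings):
        # noisy_latent_tokens: (B, N, latent_dim), vq_ids: (B, N)
        cond = self.vq_embed(vq_ids)                                       # (B, N, vq_dim)
        x = self.in_proj(torch.cat([noisy_latent_tokens, cond], dim=-1))   # (B, N, model_dim)
        glyph = self.glyph_proj(glyph_embeddings)                          # (B, T, model_dim)
        return torch.cat([x, glyph], dim=1)                                # sequence-dim concat
```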

For image editing tasks, it is often critical to preserve the high-frequency details present in the reference images. The semantic information provided by semantic-VQ tokens alone is insufficient for modeling fine-grained detail preservation. Therefore, in GLM-Image we use both the semantic-VQ tokens and the VAE latents of the reference images as additional conditioning inputs for the diffusion decoder, as illustrated in Figure 4. Unlike concurrent image editing models such as Qwen-Image-Edit[13], which apply full attention between the reference images and the generated image, we adopt a block-causal attention mechanism between the reference and the generated image, following the attention design pattern of ControlNet-Reference-Only[14]. Block-causal attention significantly reduces the computational overhead on the reference image tokens by enabling KV caching, while maintaining competitive detail preservation.
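
As an illustration, the block-causal pattern can be expressed as an attention mask in which reference tokens attend only within the reference block (so their keys/values can be computed once and cached), while generated tokens attend to everything. This is a sketch of the pattern, not the model's actual implementation.

```python
import torch

def block_causal_mask(num_ref_tokens: int, num_gen_tokens: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over [reference | generated] tokens.

    Reference tokens attend only within the reference block, so their
    keys/values are independent of the generated image and can be cached
    across diffusion steps; generated tokens attend to both blocks.
    """
    n = num_ref_tokens + num_gen_tokens
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_ref_tokens, :num_ref_tokens] = True   # ref -> ref only
    mask[num_ref_tokens:, :] = True                 # gen -> ref + gen (full)
    return mask
```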

Disentangled Rewards for AR+Diffusion Post-training

In the post-training stage, GLM-Image employs a decoupled reinforcement learning strategy to separately optimize its auto-regressive generator and diffusion decoder, enabling improvements to both semantic alignment and visual detail quality. Both modules are trained with GRPO[15] optimization. For the diffusion decoder specifically, GLM-Image adopts flow-GRPO[16], a variant of the standard LLM GRPO adapted for diffusion models.
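
For reference, the core of GRPO-style optimization is a group-relative advantage: rewards for several rollouts of the same prompt are normalized against their own group statistics. The snippet below sketches this generic computation; GLM-Image-specific hyper-parameters are not disclosed here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for rewards of shape (num_groups, group_size),
    where each group contains rollouts generated from the same prompt.

    This is the generic group-relative baseline used by GRPO / Flow-GRPO,
    not the exact recipe used for GLM-Image.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```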

The auto-regressive module focuses on low-frequency rewards that guide semantic consistency and aesthetics, thereby improving instruction following and artistic expressiveness. It combines multiple reward sources, including HPSv3[17] for aesthetic scoring, OCR for text-rendering accuracy, and a VLM for the overall semantic correctness of the generated content. The decoder module targets high-frequency rewards that refine fine-detail fidelity and text precision. It leverages LPIPS[18] to improve perceptual texture and detail similarity, integrates OCR signals to further boost text accuracy, and employs a dedicated hand-scoring model to improve the correctness of generated hands.
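
As a sketch of how such reward sources might be combined for the auto-regressive module, the function below mixes the three signals named above with illustrative weights; the scorer interfaces and weights are assumptions, not the released setup.

```python
def ar_reward(image, prompt, hps_model, ocr_model, vlm_judge,
              w_hps=1.0, w_ocr=1.0, w_vlm=1.0):
    """Illustrative combination of the low-frequency reward sources listed above.

    The individual scorers and the mixing weights are assumptions for this
    sketch; only the reward types (HPSv3, OCR accuracy, VLM judgment) come
    from the text.
    """
    r_hps = hps_model.score(image, prompt)           # aesthetics / preference
    r_ocr = ocr_model.text_accuracy(image, prompt)   # rendered-text correctness
    r_vlm = vlm_judge.semantic_score(image, prompt)  # instruction following
    return w_hps * r_hps + w_ocr * r_ocr + w_vlm * r_vlm
```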

Evaluation metrics

Text-rendering benchmarks

CVTG-2k

| Model | NED | CLIPScore | Word Acc. (2 regions) | Word Acc. (3 regions) | Word Acc. (4 regions) | Word Acc. (5 regions) | Word Acc. (avg.) |
|---|---|---|---|---|---|---|---|
| GLM-Image | 0.9557 | 0.7877 | 0.9103 | 0.9209 | 0.9169 | 0.8975 | 0.9116 |
| Seedream 4.5 | 0.9483 | 0.8069 | 0.8778 | 0.8952 | 0.9083 | 0.9008 | 0.899 |
| Z-Image | 0.9367 | 0.7969 | 0.9006 | 0.8722 | 0.8652 | 0.8512 | 0.8671 |
| Qwen-Image-2512 | 0.929 | 0.7819 | 0.863 | 0.8571 | 0.861 | 0.8618 | 0.8604 |
| Z-Image-Turbo | 0.9281 | 0.8048 | 0.8872 | 0.8662 | 0.8628 | 0.8347 | 0.8585 |
| GPT Image 1 [High] | 0.9478 | 0.7982 | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 |
| Seedream 4.0 | 0.9224 | 0.7975 | 0.8585 | 0.8484 | 0.8538 | 0.8269 | 0.8451 |
| Qwen-Image | 0.9116 | 0.8017 | 0.837 | 0.8364 | 0.8313 | 0.8158 | 0.8288 |
| Nano Banana 2.0 | 0.8754 | 0.7372 | 0.7368 | 0.7748 | 0.7863 | 0.7926 | 0.7788 |
| TextCrafter | 0.8679 | 0.7868 | 0.7628 | 0.7628 | 0.7406 | 0.6977 | 0.737 |
| SD3.5 Large | 0.847 | 0.7797 | 0.7293 | 0.6825 | 0.6574 | 0.594 | 0.6548 |
| Seedream 3.0 | 0.8537 | 0.7821 | 0.6282 | 0.5962 | 0.6043 | 0.561 | 0.5924 |
| FLUX.1 [dev] | 0.6879 | 0.7401 | 0.6089 | 0.5531 | 0.4661 | 0.4316 | 0.4965 |
| 3DIS | 0.6505 | 0.7767 | 0.4495 | 0.3959 | 0.388 | 0.3303 | 0.3813 |
| RAG-Diffusion | 0.4498 | 0.7797 | 0.4388 | 0.3316 | 0.2116 | 0.191 | 0.2648 |
| TextDiffuser-2 | 0.4353 | 0.6765 | 0.5322 | 0.3255 | 0.1787 | 0.0809 | 0.2326 |
| AnyText | 0.4675 | 0.7432 | 0.0513 | 0.1739 | 0.1948 | 0.2249 | 0.1804 |

LongText-Bench

| Model | LongText-Bench-EN | LongText-Bench-ZH |
|---|---|---|
| Seedream 4.5 | 0.989 | 0.9873 |
| GLM-Image | 0.9524 | 0.9788 |
| Nano Banana 2.0 | 0.9808 | 0.9491 |
| Qwen-Image-2512 | 0.9561 | 0.9647 |
| Qwen-Image | 0.943 | 0.946 |
| Z-Image | 0.935 | 0.936 |
| Seedream 4.0 | 0.9214 | 0.9261 |
| Z-Image-Turbo | 0.917 | 0.926 |
| Seedream 3.0 | 0.896 | 0.878 |
| X-Omni | 0.9 | 0.814 |
| GPT Image 1 [High] | 0.956 | 0.619 |
| Kolors 2.0 | 0.258 | 0.329 |
| BAGEL | 0.373 | 0.31 |
| OmniGen2 | 0.561 | 0.059 |
| HiDream-I1-Full | 0.543 | 0.024 |
| BLIP3-o | 0.021 | 0.018 |
| Janus-Pro | 0.019 | 0.006 |
| FLUX.1 [Dev] | 0.607 | 0.005 |

General benchmarks

OneIG_EN

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.888 | 0.944 | 0.334 | 0.481 | 0.245 | 0.578 |
| Seedream 4.5 | 0.891 | 0.998 | 0.35 | 0.434 | 0.207 | 0.576 |
| Seedream 4.0 | 0.892 | 0.983 | 0.347 | 0.453 | 0.191 | 0.573 |
| Z-Image | 0.881 | 0.987 | 0.28 | 0.387 | 0.194 | 0.546 |
| Qwen-Image | 0.882 | 0.891 | 0.306 | 0.418 | 0.197 | 0.539 |
| GPT Image 1 [High] | 0.851 | 0.857 | 0.345 | 0.462 | 0.151 | 0.533 |
| Qwen-Image-2512 | 0.876 | 0.99 | 0.292 | 0.338 | 0.151 | 0.53 |
| Seedream 3.0 | 0.818 | 0.865 | 0.275 | 0.413 | 0.277 | 0.53 |
| GLM-Image | 0.805 | 0.969 | 0.298 | 0.353 | 0.213 | 0.528 |
| Z-Image-Turbo | 0.84 | 0.994 | 0.298 | 0.368 | 0.139 | 0.528 |
| Imagen 4 | 0.857 | 0.805 | 0.338 | 0.377 | 0.199 | 0.515 |
| Recraft V3 | 0.81 | 0.795 | 0.323 | 0.378 | 0.205 | 0.502 |
| HiDream-I1-Full | 0.829 | 0.707 | 0.317 | 0.347 | 0.186 | 0.477 |
| OmniGen2 | 0.804 | 0.68 | 0.271 | 0.377 | 0.242 | 0.475 |
| SD3.5 Large | 0.809 | 0.629 | 0.294 | 0.353 | 0.225 | 0.462 |
| CogView4 | 0.786 | 0.641 | 0.246 | 0.353 | 0.205 | 0.446 |
| FLUX.1 [Dev] | 0.78 | 0.532 | 0.253 | 0.368 | 0.238 | 0.434 |
| Kolors 2.0 | 0.82 | 0.427 | 0.262 | 0.36 | 0.3 | 0.434 |
| Imagen 3 | 0.843 | 0.343 | 0.313 | 0.359 | 0.188 | 0.409 |
| BAGEL | 0.769 | 0.244 | 0.173 | 0.367 | 0.251 | 0.361 |
| Lumina-Image 2.0 | 0.806 | 0.27 | 0.27 | 0.354 | 0.216 | 0.353 |
| SANA-1.5-4.8B | 0.675 | 0.069 | 0.217 | 0.401 | 0.216 | 0.334 |
| SANA-1.5-1.6B | 0.733 | 0.054 | 0.209 | 0.387 | 0.222 | 0.327 |
| BAGEL+CoT | 0.745 | 0.174 | 0.206 | 0.39 | 0.209 | 0.324 |
| SD 1.5 | 0.69 | 0.207 | 0.207 | 0.383 | 0.429 | 0.319 |
| SDXL | 0.688 | 0.029 | 0.237 | 0.332 | 0.296 | 0.316 |
| Show-o2-7B | 0.817 | 0.002 | 0.226 | 0.317 | 0.177 | 0.308 |
| BLIP3-o | 0.711 | 0.133 | 0.223 | 0.361 | 0.229 | 0.307 |
| Show-o2-1.5B | 0.798 | 0.002 | 0.219 | 0.317 | 0.186 | 0.304 |
| Janus-Pro | 0.553 | 0.001 | 0.139 | 0.276 | 0.365 | 0.267 |

OneIG_ZH

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.843 | 0.983 | 0.311 | 0.461 | 0.236 | 0.567 |
| Seedream 4.0 | 0.836 | 0.986 | 0.304 | 0.443 | 0.2 | 0.554 |
| Seedream 4.5 | 0.832 | 0.986 | 0.3 | 0.426 | 0.213 | 0.551 |
| Qwen-Image | 0.825 | 0.963 | 0.267 | 0.405 | 0.279 | 0.548 |
| Z-Image | 0.793 | 0.988 | 0.266 | 0.386 | 0.243 | 0.535 |
| Seedream 3.0 | 0.793 | 0.928 | 0.281 | 0.397 | 0.243 | 0.528 |
| Qwen-Image-2512 | 0.823 | 0.983 | 0.272 | 0.342 | 0.157 | 0.515 |
| GLM-Image | 0.738 | 0.976 | 0.284 | 0.335 | 0.221 | 0.511 |
| Z-Image-Turbo | 0.782 | 0.982 | 0.276 | 0.361 | 0.134 | 0.507 |
| GPT Image 1 [High] | 0.812 | 0.65 | 0.3 | 0.449 | 0.159 | 0.474 |
| Kolors 2.0 | 0.738 | 0.502 | 0.226 | 0.331 | 0.333 | 0.426 |
| BAGEL | 0.672 | 0.365 | 0.186 | 0.357 | 0.268 | 0.37 |
| CogView4 | 0.7 | 0.193 | 0.236 | 0.348 | 0.214 | 0.338 |
| HiDream-I1-Full | 0.62 | 0.205 | 0.256 | 0.304 | 0.3 | 0.337 |
| Lumina-Image 2.0 | 0.731 | 0.136 | 0.221 | 0.343 | 0.24 | 0.334 |
| BAGEL+CoT | 0.719 | 0.127 | 0.219 | 0.385 | 0.197 | 0.329 |
| BLIP3-o | 0.608 | 0.092 | 0.213 | 0.369 | 0.233 | 0.303 |
| Janus-Pro | 0.324 | 0.148 | 0.104 | 0.264 | 0.358 | 0.24 |

DPG Bench

| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Seedream 4.5 | 89.24 | 94.3 | 92.14 | 92.23 | 93.83 | 88.63 |
| Seedream 4.0 | 93.86 | 92.24 | 90.74 | 93.87 | 94.16 | 88.54 |
| Qwen-Image | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Seedream 3.0 | 94.31 | 92.65 | 91.36 | 92.78 | 88.24 | 88.27 |
| Z-Image | 93.39 | 91.22 | 93.16 | 92.22 | 91.52 | 88.14 |
| Qwen-Image-2512 | 89.04 | 91.91 | 92.39 | 90.85 | 93.07 | 87.2 |
| Lumina-Image 2.0 | - | 91.97 | 90.2 | 94.85 | - | 87.2 |
| Nano Banana 2.0 | 91 | 92.85 | 91.56 | 92.39 | 89.93 | 87.16 |
| HiDream-I1-Full | 76.44 | 90.22 | 89.48 | 93.74 | 91.83 | 85.89 |
| GPT Image 1 [High] | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| Z-Image-Turbo | 91.29 | 89.59 | 90.14 | 92.16 | 88.68 | 84.86 |
| GLM-Image | 87.74 | 90.25 | 89.08 | 92.15 | 90.17 | 84.78 |
| Janus-Pro-7B | 86.9 | 88.9 | 89.4 | 89.32 | 89.48 | 84.19 |
| SD3 Medium | 87.9 | 91.01 | 88.83 | 80.7 | 88.68 | 84.08 |
| FLUX.1 [Dev] | 74.35 | 90 | 88.96 | 90.87 | 88.33 | 83.52 |
| DALL-E 3 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.5 |
| Janus-Pro-1B | 87.58 | 88.63 | 88.17 | 89.98 | 88.3 | 82.65 |
| Emu3-Gen | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.6 |
| PixArt-Σ | 86.89 | 82.89 | 88.94 | 86.59 | 87.68 | 80.54 |
| Janus | 82.33 | 87.38 | 87.7 | 85.46 | 86.41 | 79.66 |
| Hunyuan-DiT | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 | 78.47 |
| Playground v2.5 | 83.06 | 82.59 | 81.2 | 84.08 | 83.5 | 75.47 |
| SDXL | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| Lumina-Next | 82.82 | 88.65 | 86.44 | 80.53 | 81.82 | 74.63 |
| PixArt-α | 74.97 | 79.32 | 78.6 | 82.57 | 76.96 | 71.11 |
| SD1.5 | 74.63 | 74.23 | 75.39 | 73.49 | 67.81 | 63.18 |

TIFF Bench

| Model | Overall (short) | Overall (long) |
|---|---|---|
| Nano Banana 2.0 | 91 | 88.26 |
| Seedream 4.5 | 90.49 | 88.52 |
| Seedream 4.0 | 90.45 | 88.08 |
| GPT Image 1 [High] | 89.15 | 88.29 |
| Qwen-Image | 86.14 | 86.83 |
| Seedream 3.0 | 86.02 | 84.31 |
| Z-Image | 80.2 | 83.01 |
| Qwen-Image-2512 | 83.24 | 84.93 |
| GLM-Image | 81.01 | 81.02 |
| Z-Image-Turbo | 77.73 | 80.05 |
| DALL-E 3 | 74.96 | 70.81 |
| FLUX.1 [dev] | 71.09 | 71.78 |
| FLUX.1 [Pro] | 67.32 | 69.89 |
| Midjourney V7 | 68.74 | 65.69 |
| SD 3 | 67.46 | 66.09 |
| SANA 1.5 | 67.15 | 65.73 |
| Janus-Pro-7B | 66.5 | 65.01 |
| Infinity | 62.07 | 62.06 |
| PixArt-Σ | 62 | 58.12 |
| Show-o | 59.72 | 58.86 |
| LightGen | 53.22 | 49.41 |
| Hunyuan-DiT | 51.38 | 53.28 |
| Lumina-Next | 50.93 | 52.46 |

References

[1] https://huggingface.co/zai-org/GLM-4-9B-0414
[2] https://huggingface.co/zai-org/CogView4-6B
[3] Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow straight and fast: Learning to generate and transfer data with rectified flow." arXiv preprint arXiv:2209.03003 (2022).
[4] Yu, Sihyun, et al. "Representation alignment for generation: Training diffusion transformers is easier than you think." arXiv preprint arXiv:2410.06940 (2024).
[5] Yao, Jingfeng, Bin Yang, and Xinggang Wang. "Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[6] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).
[7] Geng, Zigang, et al. "X-omni: Reinforcement learning makes discrete autoregressive image generative models great again." arXiv preprint arXiv:2507.22058 (2025).
[8] Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 1.2 (2022): 3.
[9] Liu, Shizhan, et al. "Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability." arXiv preprint arXiv:2512.05394 (2025).
[10] https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev
[11] Zheng, Wendi, et al. "Cogview3: Finer and faster text-to-image generation via relay diffusion." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
[12] Liu, Zeyu, et al. "Glyph-byt5: A customized text encoder for accurate visual text rendering." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
[13] Wu, Chenfei, et al. "Qwen-image technical report." arXiv preprint arXiv:2508.02324 (2025).
[14] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF international conference on computer vision. 2023.
[15] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).
[16] Liu, Jie, et al. "Flow-grpo: Training flow matching models via online rl." arXiv preprint arXiv:2505.05470 (2025).
[17] Ma, Yuhang, et al. "Hpsv3: Towards wide-spectrum human preference score." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025.
[18] Zhang, Richard, et al. "The unreasonable effectiveness of deep features as a perceptual metric." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.