Moving Image Gen from Flux Schnell → Z-Image Turbo: a local stack update

📚 Previous part: When You Want to Run Image Gen AI on Your Own Machine (why I set up a local pipeline, plus the first Flux Schnell setup)

I don't usually write “update” posts, because most of what I publish stays valid as written. This time, though, I decided to switch models within 2 weeks, because Z-Image Turbo from Tongyi-MAI (an Alibaba team) came out and beats Flux Schnell on nearly every dimension that matters to me.

This post covers why I switched, what changed, and the hard-won lessons along the way.

Why switch: 3 reasons

1. Clearly better text rendering

Flux Schnell is distilled down to 4 steps, which hurts its ability to render text inside images. Large headlines are passable, but small text on a sticky note or in a flowchart regularly comes out garbled.

Z-Image Turbo (a modified Qwen3-4B encoder plus 8-step distillation) renders headlines precisely. I tried generating an image with a large “AI Revolution 2026” headline and it was spelled exactly right. Schnell got it wrong in 4 out of 5 tries.

2. Same license: Apache 2.0, commercial-safe

This one is essential. Both models are open and allow commercial use. Flux Dev does not (it is non-commercial), and most Flux LoRAs inherit Dev’s license, which makes them a big trap: anyone writing a monetized blog can't use most Flux LoRAs.

Z-Image under Apache 2.0 is straightforward: usable anywhere.

3. Smaller stack: 17 GB vs 22 GB

Flux Schnell stack:            Z-Image Turbo stack:
├── flux1-schnell (17 GB)      ├── z_image_turbo_bf16 (12 GB)
├── t5xxl_fp8 (5 GB)           ├── qwen_3_4b_fp8 (5 GB)
├── clip_l (250 MB)            └── ae.safetensors (350 MB)  ← shared with Flux
└── ae.safetensors (350 MB)

Z-Image’s Qwen encoder works alone, so there is no need to load T5 and CLIP together as with Flux. Overall the stack comes down by about 5 GB.

The VAE is the same file (ae.safetensors), so there is nothing new to download.
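The totals in the heading can be checked by summing the file sizes from the comparison. A quick sketch, assuming the rounded GB figures listed above rather than measured file sizes:

```python
# Rounded file sizes in GB, copied from the stack comparison (not measured).
flux_schnell = {"flux1-schnell": 17, "t5xxl_fp8": 5, "clip_l": 0.25, "ae.safetensors": 0.35}
z_image_turbo = {"z_image_turbo_bf16": 12, "qwen_3_4b_fp8": 5, "ae.safetensors": 0.35}

flux_total = sum(flux_schnell.values())
z_total = sum(z_image_turbo.values())

print(f"Flux Schnell:  {flux_total:.2f} GB")
print(f"Z-Image Turbo: {z_total:.2f} GB")
print(f"Difference:    {flux_total - z_total:.2f} GB")
```

Most of the difference comes from the diffusion model itself (17 GB vs 12 GB); the encoders are nearly the same size.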

What changed in the setup

Models on D:

D:\LLM\comfyui\models\
├── diffusion_models\
│   └── z_image_turbo_bf16.safetensors        (12 GB)
├── text_encoders\                              ← new path (Z-Image, FLUX.2)
│   └── qwen_3_4b_fp8_mixed.safetensors        (5 GB)
└── vae\
    └── ae.safetensors                          (350 MB)

There is one new folder that Schnell didn't have: text_encoders/. Z-Image keeps its encoder here, not in clip/ as Flux does.

That means adding a fifth models junction, text_encoders, alongside the output/ junction from the previous part:

# Map each ComfyUI folder on C: to its real location on D:
$pairs = @(
  @{ Src = "C:\ComfyUI\models\diffusion_models"; Dst = "D:\LLM\comfyui\models\diffusion_models" },
  @{ Src = "C:\ComfyUI\models\clip";             Dst = "D:\LLM\comfyui\models\clip" },
  @{ Src = "C:\ComfyUI\models\text_encoders";    Dst = "D:\LLM\comfyui\models\text_encoders" },
  @{ Src = "C:\ComfyUI\models\vae";              Dst = "D:\LLM\comfyui\models\vae" },
  @{ Src = "C:\ComfyUI\models\loras";            Dst = "D:\LLM\comfyui\models\loras" },
  @{ Src = "C:\ComfyUI\output";                  Dst = "D:\LLM\comfyui\output" }
)
foreach ($p in $pairs) {
  # Create the target folder on D: if it does not exist yet
  if (-not (Test-Path $p.Dst)) { New-Item -ItemType Directory -Path $p.Dst -Force | Out-Null }
  $src = Get-Item $p.Src -Force -ErrorAction SilentlyContinue
  # Skip paths that are already junctions, so re-running the script is harmless
  if ($src -and $src.LinkType -eq 'Junction') { continue }
  # WARNING: this deletes whatever is in the C: folder; move files to D: first
  if ($src) { Remove-Item $p.Src -Force -Recurse }
  New-Item -ItemType Junction -Path $p.Src -Target $p.Dst | Out-Null
}

Workflow JSON

Z-Image uses different nodes from Flux (excerpt):

{
  "1": { "class_type": "UNETLoader",
         "inputs": { "unet_name": "z_image_turbo_bf16.safetensors" }},
  "2": { "class_type": "CLIPLoader",
         "inputs": { "clip_name": "qwen_3_4b_fp8_mixed.safetensors",
                     "type": "lumina2" }},     ← the type matters a lot
  "5": { "class_type": "ConditioningZeroOut" }, ← cfg=1.0 uses zero conditioning
  "7": { "class_type": "ModelSamplingAuraFlow",
         "inputs": { "shift": 3.0 }},
  "8": { "class_type": "KSampler",
         "inputs": { "steps": 8, "cfg": 1.0,
                     "sampler_name": "res_multistep",
                     "scheduler": "simple" }}
}

Schnell settings that don't work with Z-Image:

❌ Schnell: type: "flux" / Z-Image: type: "lumina2"
❌ Schnell: DualCLIPLoader / Z-Image: CLIPLoader (a single encoder)
❌ Schnell: 4 steps euler / Z-Image: 8 steps res_multistep
❌ Schnell: no ModelSamplingAuraFlow needed / Z-Image: required, with shift=3.0
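The four differences above are easy to miss when adapting an old workflow, so a small check can help. This is only a sketch: `find_schnell_leftovers` is a hypothetical helper, not part of ComfyUI, and it assumes the workflow is already loaded as a Python dict in the same shape as the JSON excerpt.

```python
def find_schnell_leftovers(workflow: dict) -> list[str]:
    """Return readable problems for settings carried over from Flux Schnell."""
    problems = []
    classes = {node.get("class_type") for node in workflow.values()}
    if "DualCLIPLoader" in classes:
        problems.append("DualCLIPLoader found: Z-Image uses a single CLIPLoader")
    if "ModelSamplingAuraFlow" not in classes:
        problems.append("ModelSamplingAuraFlow missing: Z-Image needs shift=3.0")
    for node in workflow.values():
        kind = node.get("class_type")
        inputs = node.get("inputs", {})
        if kind == "CLIPLoader" and inputs.get("type") != "lumina2":
            problems.append('CLIPLoader type should be "lumina2"')
        if kind == "KSampler":
            if inputs.get("steps") != 8:
                problems.append("KSampler steps should be 8")
            if inputs.get("sampler_name") != "res_multistep":
                problems.append("KSampler sampler_name should be res_multistep")
    return problems


# A workflow fragment that still carries Schnell-style settings.
schnell_style = {
    "2": {"class_type": "DualCLIPLoader", "inputs": {"type": "flux"}},
    "8": {"class_type": "KSampler",
          "inputs": {"steps": 4, "cfg": 1.0, "sampler_name": "euler"}},
}
for problem in find_schnell_leftovers(schnell_style):
    print(problem)
```

An empty list means none of the known Schnell leftovers were found; it does not prove the workflow is otherwise valid.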

Lesson 1: Layer A v2 (which worked on Schnell) fails on 3D animation

Layer A v2 is the snippet I add to every prompt to force the subject into the safezone:

"medium shot framing, headroom above subject must equal subject body height,
no faces or hands at frame edges"

On Schnell with photographic prompts: passed 5/5 on the safezone test.

On Z-Image with 3D animation prompts: failed 1 of 2, because:

3D characters have chibi proportions, so the “body height” calculation breaks (the head is bigger than the body)
“medium shot” is photographic terminology that the 3D style interprets differently

Layer A v3: anatomical anchor approach

"composition: subject's eyes positioned at the exact vertical center line of frame,
crown of head at the upper one-third horizontal line of frame,
chin at the lower one-third horizontal line of frame,
upper third of frame above the head entirely atmospheric — sky, ceiling, blank background,
lower third of frame below the chin entirely environmental — ground, desk, foreground,
no faces, hands, or critical subject elements above the upper third line or below the lower third line"

Why is an anatomical anchor better?

“Eyes at center, crown at upper 1/3, chin at lower 1/3” is universal vocabulary that every style understands: photo, 3D, watercolor, and sketch all interpret it the same way. “Body height” does not generalize.
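To see where those anchors land in pixels, here is a small worked example. It assumes the 1344×768 generation size and the card-view crop used elsewhere in this post (rows from 30% to 70% of the height); the variable names are illustrative only.

```python
HEIGHT = 768  # generation height used in this post (1344x768)

# Layer A v3 anchors as fractions of the frame height (0 = top edge).
anchors = {"crown": 1 / 3, "eyes": 1 / 2, "chin": 2 / 3}

# Home card view: the middle 40% of the height (30% to 70%).
card_top, card_bottom = 0.30, 0.70

for name, frac in anchors.items():
    inside = card_top <= frac <= card_bottom
    print(f"{name}: y = {round(frac * HEIGHT)} px, inside card view: {inside}")
```

If the model honors the anchors, crown, eyes, and chin all sit inside the card band, with only a small margin at the crown and chin.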

Lesson 2: Z-Image 3D animation has a subject-scale bias that prompts struggle to fix

Even with Layer A v3, single-subject portraits in 3D animation often come out with the head cut off, because Z-Image reads “single person” as close-up framing.

Tested on the 7 episodes of the Org Anatomy series:

Subject count	Safezone risk	Strategy
Single subject portrait	⚠️ HIGH	wide shot: “small figure in plaza”
2 figures	✓ low	Layer A v3 is enough
Group 4+	✓ low	Layer A v3 is enough
Empty scene	✓ low	atmospheric wide

→ A wide shot is not a universal solution: it loses the human emotion of posts that call for a close-up portrait.

The trade-off I live with for now: choose the composition by the post's subject count. Single subject = wide-shot scene; group = normal medium shot.
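The rule of thumb can be written down as a tiny lookup. This is a sketch only: `composition_for` is a hypothetical helper, and treating 3 subjects like the tested "2 figures" and "group 4+" rows is an assumption, since that count was not tested.

```python
def composition_for(subject_count: int) -> str:
    """Pick a framing strategy from the number of people in the scene."""
    if subject_count < 0:
        raise ValueError("subject_count must be non-negative")
    if subject_count == 0:
        return "atmospheric wide"                  # empty scene, low risk
    if subject_count == 1:
        return "wide shot, small figure in plaza"  # single portrait, high risk
    return "medium shot + Layer A v3"              # 2 figures or a group


for count in (0, 1, 2, 5):
    print(count, "->", composition_for(count))
```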

Lesson 3: Trigger words that make Z-Image draw a crown

I first hit this while generating the EP 04 (C-Suite) image: it showed a man wearing a golden crown, because the prompt used “elevated CEO” and “prominent leader”.

Trigger words:
❌ "elevated"
❌ "prominent"
❌ "owner vs decision-maker"  (royal connotation)
❌ "leadership presence"

Safe alternatives:
✅ "standing in foreground"
✅ "seated at head of table"
✅ "presenting to colleagues"
✅ "addressing audience"

Universal anti-crown suffix for governance topics:

"plain hairstyles no headwear no crowns no hats"

Appending it to the end of the prompt prevents the crown.
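Both the trigger list and the suffix are simple enough to automate before a prompt is sent. A minimal sketch, assuming plain substring matching is good enough; the function names are hypothetical and not taken from the actual pipeline:

```python
CROWN_TRIGGERS = ("elevated", "prominent", "owner vs decision-maker", "leadership presence")
ANTI_CROWN_SUFFIX = "plain hairstyles no headwear no crowns no hats"


def find_crown_triggers(prompt: str) -> list[str]:
    """Return the known trigger phrases that appear in the prompt."""
    lowered = prompt.lower()
    return [trigger for trigger in CROWN_TRIGGERS if trigger in lowered]


def with_anti_crown_suffix(prompt: str) -> str:
    """Append the suffix once, at the very end of the prompt."""
    if ANTI_CROWN_SUFFIX in prompt:
        return prompt
    return f"{prompt.rstrip(' ,')}, {ANTI_CROWN_SUFFIX}"


prompt = "An elevated CEO, a prominent leader"
print(find_crown_triggers(prompt))
print(with_anti_crown_suffix("CEO presenting to colleagues"))
```

Substring matching will also flag harmless uses of these words, so the output is better treated as a prompt for manual review than as an automatic rewrite.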

Lesson 4: “Phase One quality” gets hallucinated as a text label

With camera-spec phrases like "Phase One quality, sharp focus, dramatic lighting" (which I used routinely with Schnell), Z-Image reads “Phase One” as a scene label and writes “Phase One Quality” on a whiteboard in the image.

Mitigation: use plain photographic descriptors instead of brand-name camera specs.

❌ "Phase One quality, Hasselblad lens"
✅ "editorial photograph, sharp focus, dramatic lighting"

Lesson 5: The Senior Designer agent's safezone check is not empirical

In .claude/agents/senior-designer-image-reviewer.md there is an agent that reviews each cover before it is presented, using 4 questions (story / polish / composition / consistency).

The agent visually estimates composition from the full image. It does not extract the card view and look at the actual crop.

The result: the agent says PASS even when the image has a head cut off in the home card view, so its safezone check can't be trusted.

Fix: the AI assistant (me) must extract the card view itself (3 lines of PIL) before presenting to the user, and use the agent only for editorial / communication review.

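from PIL import Image

cover_path = "cover.png"  # placeholder: path to the generated cover
im = Image.open(cover_path)
# The home card shows the middle 40% of the image height (30% to 70%)
top, bot = int(im.height * 0.30), int(im.height * 0.70)
card = im.crop((0, top, im.width, bot))
card.save("card_view.png")
# Look at card_view.png and judge for yourself: are heads and key elements intact?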

Lesson 6: Layer A v4, subject-agnostic instead of anatomy-specific

Layer A v3 (anatomical anchor) has a problem: it is hardcoded for human subjects.

v3: "subject's eyes at vertical center, crown at upper 1/3, chin at lower 1/3,
     no faces or hands at frame edges"

With an equipment-only prompt such as “PC tower at center, no humans”, Z-Image gets contradictory signals:

Prompt: “no humans”
Layer A: “subject’s eyes/crown/chin/faces/hands”
→ The model gets confused and either tries to insert a human element or shrinks the subject into the lower half

Fix in v4: load-bearing language:

"subject positioned at exact geometric center of frame,
upper third entirely atmospheric — no critical content,
lower third entirely environmental — no critical content,
all load-bearing visual elements (faces if any, key objects, important text)
within the central middle horizontal third"

This works across subject types: human portrait, equipment hero, abstract scene, group composition.
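Because v4 no longer mentions anatomy, the same text can be attached to any prompt. The sketch below shows one way that could look. It is not the actual gen-cover.mjs logic: `apply_layer_a` is a hypothetical name, appending at the end is an assumption, and the constant is an abbreviated excerpt of the v4 text.

```python
# Abbreviated excerpt of the Layer A v4 text; paste the full text in real use.
LAYER_A_V4 = (
    "subject positioned at exact geometric center of frame, "
    "all load-bearing visual elements (faces if any, key objects, important text) "
    "within the central middle horizontal third"
)


def apply_layer_a(prompt: str, layer: str = LAYER_A_V4) -> str:
    """Append the composition layer once, after the scene description."""
    prompt = prompt.strip().rstrip(",")
    if layer in prompt:
        return prompt  # already applied; avoid duplicating the constraint
    return f"{prompt}, {layer}"


print(apply_layer_a("PC tower on a desk, no humans"))
```

The layer never names eyes, faces, or hands as required elements, so it does not contradict a “no humans” scene.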

Lesson 7: “Define center precisely”, because the AI interprets vague words its own way

❌ "centered" / "in middle" / "at center"
   → AI reads it as "centered on the floor" or "in the room", not "in the canvas"

✅ "at exact geometric center of canvas"
   "centered on both X-axis AND Y-axis"
   "subject's center-of-mass aligns with canvas mathematical center"

Per-shot templates verified to work:

Shot	Prompt language
Single subject	`"wide establishing shot in rich detailed environment, subject in central horizontal third, eyes at exact vertical center line of canvas, decorative environment fills rest"`
Group	`"medium shot in rich setting, group across central third, faces at vertical center line"`
Equipment hero	`"frontal hero shot, subject at exact geometric canvas center, rich detail surrounds symmetrically X+Y"`
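The table maps naturally onto a lookup keyed by shot type. A minimal sketch: the template strings are copied from the table, while the keys and the `build_prompt` helper are hypothetical names.

```python
SHOT_TEMPLATES = {
    "single": (
        "wide establishing shot in rich detailed environment, "
        "subject in central horizontal third, "
        "eyes at exact vertical center line of canvas, "
        "decorative environment fills rest"
    ),
    "group": (
        "medium shot in rich setting, group across central third, "
        "faces at vertical center line"
    ),
    "equipment": (
        "frontal hero shot, subject at exact geometric canvas center, "
        "rich detail surrounds symmetrically X+Y"
    ),
}


def build_prompt(scene: str, shot: str) -> str:
    """Combine a scene description with the template for the given shot type."""
    if shot not in SHOT_TEMPLATES:
        raise ValueError(f"unknown shot type: {shot!r}")
    return f"{scene.strip().rstrip(',')}, {SHOT_TEMPLATES[shot]}"


print(build_prompt("three auditors reviewing documents", "group"))
```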

Lesson 8: Rich, full scenes beat empty atmospheric ones

At first I designed Layer A to ask for “atmospheric padding”: empty sky on top, empty floor below.

The result: minimalist images with few objects that look empty and tell no story.

Master feedback (verbatim, translated from Thai): “It doesn't have to be empty. If the main substance isn't hit, that's fine; the rest can get cropped, because it's just composition, so no problem.”

Adjustment: generate rich, detailed scenes, with props, ambient lights, plants, and background activity filling the frame.

Load-bearing content (faces, key objects) → inside the safezone (safe)
Decorative environment → fills the rest of the frame (fine if it gets cropped, because it is decorative)

→ The images look like “professional editorial”, not “stock photo with white space”.

Lesson 9: The concept has to fit the aspect ratio

EP 05 (Three Lines Model) initially used a “3-floor vertical building” concept:

At a 1.75:1 aspect, 3 floors stacked vertically means the bottom floor falls outside the safezone
Card crop = lose 1 of 3 floors = lose the “Three Lines” thesis

Fix: change the concept to a horizontal triptych, 3 scenes side by side:

Left: operations team (Line 1)
Center: oversight team (Line 2)
Right: audit team (Line 3)

→ A horizontal layout matches a horizontal aspect ratio → the card crop of the middle 40% still shows all 3 scenes.

Rule: before choosing a cover's subject/concept, check whether the concept's natural composition fits 1.75:1. If it doesn't → reframe the concept.
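The rule can be checked with a little arithmetic before generating anything. The sketch below assumes the card view keeps the middle 40% of the height and, as a simplification, that the three floors are equal-height bands; `visible_fraction` is a hypothetical helper.

```python
CARD = (0.30, 0.70)  # card view keeps rows from 30% to 70% of the height


def visible_fraction(band: tuple[float, float],
                     crop: tuple[float, float] = CARD) -> float:
    """Fraction of a vertical band (top, bottom in 0..1) that survives the crop."""
    top, bottom = band
    overlap = min(bottom, crop[1]) - max(top, crop[0])
    return max(0.0, overlap) / (bottom - top)


# Vertical concept: three equal floors stacked top to bottom (assumed equal).
floors = [(0, 1 / 3), (1 / 3, 2 / 3), (2 / 3, 1)]
print("floors:", [round(visible_fraction(f), 2) for f in floors])

# Triptych: panels sit side by side, so each one spans the full height.
panels = [(0, 1)] * 3
print("panels:", [round(visible_fraction(p), 2) for p in panels])
```

Under these assumptions only the middle floor survives intact and the outer floors keep about a tenth of their height, while every triptych panel keeps the same 40% band, so all three scenes stay visible.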

Lesson 10: ControlNet, tested and not worth it

Z-Image-Turbo-Fun-Controlnet-Union (Alibaba PAI) is designed to guarantee composition, but for the blog cover use case the trade-off wasn't worth it:

Test: ControlNet + Canny + composition reference template
Result:
✗ "Filled rectangle" reference → AI renders a literal cyan box artifact in the image
✗ Speed 80s/image (vs 30s without ControlNet), 2.6× slower
✓ Y-centering improved 95%, but the visual artifacts outweigh the gain

→ Drop ControlNet for blog covers; base Z-Image plus good prompt patterns is good enough.

If I try again later: use the Depth preprocessor (not Canny) + a soft gradient reference (not a filled rectangle) + a lower strength of 0.3-0.5.

Production-ready stack (May 2026)

Hardware:  RTX 4060 Laptop 8 GB VRAM, 32 GB RAM, Windows 11
Model:     Z-Image Turbo bf16 (Apache 2.0)
Encoder:   Qwen 3 4B fp8_mixed
VAE:       ae.safetensors (shared with Flux)
Storage:   D:\LLM\comfyui\models\ via 5 junctions
Output:    D:\LLM\comfyui\output\ via junction (important: otherwise Comfy hard-codes C:)
Pipeline:  gen-cover.mjs auto-applies Layer A v4 (subject-agnostic)
Speed:     ~30s per 1344×768 → 800×457 webp (16-32 KB)
Disk:      17 GB models + output images (stored on D: via junction)

Results: regenerated 7 covers for the Org Anatomy series

After moving the stack, I regenerated all 7 covers of the Organization Anatomy 101 mini-series in a 3D modern animation theme (it suits the plain-language tone and an accessible audience).

Every cover was safezone-verified empirically (I extracted the card view and checked it myself) before uploading to R2. No cover has a head cut off on the home card.

The pipeline finishes in rounds of about 5 minutes: gen 1344×768 → resize 800×457 → webp 16-32 KB → R2 upload → the frontmatter URL doesn't need to change (the new file overwrites the old one).
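The two sizes in the pipeline are the same 1.75:1 shape, which a quick calculation confirms. A small sketch; `resized_height` is a hypothetical helper, not a function from the pipeline:

```python
def resized_height(src_w: int, src_h: int, dst_w: int) -> int:
    """Height that preserves the source aspect ratio at the new width."""
    return round(src_h * dst_w / src_w)


print(1344 / 768)                      # source aspect ratio -> 1.75
print(resized_height(1344, 768, 800))  # height for an 800 px wide cover -> 457
```

The exact height at 800 px wide is about 457.14, so 800×457 is off from 1.75:1 by well under a pixel.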

TL;DR for people in a hurry

Z-Image Turbo > Flux Schnell for monetized blog covers: Apache 2.0, better text rendering, smaller stack
Layer A v4 is subject-agnostic: "load-bearing content within central third" instead of anatomy-specific (works for human/equipment/abstract)
Define center precisely: "exact geometric center of canvas, X+Y axes", not a vague "centered"
Rich detailed scenes, not minimalist atmosphere: load-bearing content in the safezone, decorative bleed is OK to crop
The concept has to fit the aspect: 3-floor vertical → horizontal triptych for 1.75:1
3D animation has a subject-scale bias: single portraits have safezone problems, group/scene is OK
Crown trigger words: “elevated”/“prominent”/“leadership” make the AI draw a crown; use physical position instead
Don't trust the Senior agent on safezone: the AI has to extract the card view and check it itself (3-line PIL script)
ControlNet, tried and not worth it: artifacts + 2.6× slower; base Z-Image + good prompts is good enough

node scripts/gen-cover.mjs --provider=comfyui <slug> "<prompt>"
# Z-Image + Layer A v4 auto-applied → 800×457 webp at tmp-covers/<slug>.webp
# ~30s/image, 100% free

Still 100% free, as before, but with better quality and a fully disciplined workflow.