Multi-Panel Comics with ERNIE-Image: Step by Step
Build a four-panel comic with ERNIE-Image from a single scene layout prompt. Keep your character consistent, land the text bubbles, and see why this beats Flux or SDXL for comic work.
Comics are where most diffusion models fall apart. You want four panels that read left to right, the same character in every frame, and speech bubbles whose glyphs actually spell the words. SDXL hands you mushy text. Flux gives you gorgeous single frames that never match each other. ERNIE-Image, the 8B DiT Baidu released under Apache 2.0, was trained on dense multi-panel layouts and glyph-heavy posters from day one.
The workflow in one breath
You pass ERNIE a scene layout prompt that treats the image as a comic page, describe each panel as a sub-scene with identical character cues, and let the model render art and bubble glyphs in one pass. No typography layer, no compositing. The 50-step base endpoint handles this at 1024x1024.

Step 1: lock your character sheet
Before you render a single panel, write a character block you paste verbatim into every prompt. Three visual hooks that survive compression:
- Silver trench coat with torn left sleeve
- Round wire glasses, cracked right lens
- Red scarf, always visible
Ask for nine properties and the model drops half. Three is the sweet spot.
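One way to keep the character block honest is to build it in code, so the three-hook limit is enforced rather than remembered. This is a minimal sketch: `buildCharacterSheet` and `MAX_HOOKS` are hypothetical names invented for illustration, not part of the fal client or ERNIE-Image.

```typescript
// Hypothetical helper, not part of @fal-ai/client or ERNIE-Image.
// It joins visual hooks into one reusable string and enforces the
// three-hook limit suggested in the text.
const MAX_HOOKS = 3;

function buildCharacterSheet(hooks: string[]): string {
  // Drop empty entries and stray whitespace before counting.
  const cleaned = hooks
    .map((hook) => hook.trim())
    .filter((hook) => hook.length > 0);
  if (cleaned.length === 0) {
    throw new Error("Character sheet needs at least one visual hook");
  }
  if (cleaned.length > MAX_HOOKS) {
    throw new Error(
      `Got ${cleaned.length} hooks; keep it to ${MAX_HOOKS} so none get dropped`
    );
  }
  return cleaned.join(", ");
}

const detectiveSheet = buildCharacterSheet([
  "silver trench coat with torn left sleeve",
  "round wire glasses with cracked right lens",
  "red scarf",
]);
console.log(detectiveSheet);
```

The returned string is the block you paste verbatim; a fourth hook throws instead of silently diluting the prompt.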
Step 2: the scene layout prompt
ERNIE reads panel language natively:
import { fal } from "@fal-ai/client";

// Read the API key from the environment rather than hard-coding it.
fal.config({ credentials: process.env.FAL_KEY });

// Character block from Step 1, repeated verbatim in every panel below.
const sheet = "silver trench coat with torn left sleeve, round wire glasses with cracked right lens, red scarf";

const prompt = `Four-panel comic page, 2x2 grid layout, hand-inked line art, noir palette with teal accents.
Panel 1 top-left: detective (${sheet}) stands in rain outside a neon ramen shop, speech bubble reads "She was here an hour ago."
Panel 2 top-right: close-up of the face of the detective (${sheet}), glasses reflecting neon, thought bubble reads "The napkin was still warm."
Panel 3 bottom-left: detective (${sheet}) pushes through the ramen shop door, owner at the counter, speech bubble reads "You again?"
Panel 4 bottom-right: detective (${sheet}) slides a photograph across the counter, speech bubble reads "Tell me everything."
Clean gutters, consistent character across all four frames.`;

// One call to the 50-step base endpoint renders the whole page.
const result = await fal.subscribe("fal-ai/ernie-image", {
  input: { prompt, image_size: "square_hd", num_inference_steps: 50, guidance_scale: 4.5, seed: 42 },
  logs: true
});

console.log(result.data.images[0].url);
The character sheet repeats inside every panel, so the model has four chances to bind the same visual. A guidance_scale of 4.5 is the sweet spot; higher values bleed panels together.
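The per-panel repetition can be generated rather than pasted by hand. The sketch below assembles a layout prompt from a list of panels and injects the character sheet wherever a `{character}` placeholder appears. The `Panel` type, the placeholder convention, and `buildComicPrompt` are all assumptions made for illustration; only the prompt wording pattern comes from this article.

```typescript
// Hypothetical types and helper for illustration only.
interface Panel {
  position: string;             // e.g. "top-left"
  action: string;               // scene text; "{character}" marks where the character appears
  bubble: "speech" | "thought"; // bubble style named in the prompt
  text: string;                 // exact words to render inside the bubble
}

function buildComicPrompt(
  style: string,
  role: string,
  sheet: string,
  panels: Panel[]
): string {
  const lines = panels.map((panel, index) => {
    // Replace the first placeholder with the role plus the full character sheet.
    const action = panel.action.replace("{character}", () => `${role} (${sheet})`);
    return `Panel ${index + 1} ${panel.position}: ${action}, ${panel.bubble} bubble reads "${panel.text}"`;
  });
  const closing = "Clean gutters, consistent character across all frames.";
  return [style, ...lines, closing].join("\n");
}

const noirSheet =
  "silver trench coat with torn left sleeve, round wire glasses with cracked right lens, red scarf";

const noirPrompt = buildComicPrompt(
  "Four-panel comic page, 2x2 grid layout, hand-inked line art, noir palette with teal accents.",
  "detective",
  noirSheet,
  [
    { position: "top-left", action: "{character} stands in rain outside a neon ramen shop", bubble: "speech", text: "She was here an hour ago." },
    { position: "top-right", action: "close-up of {character}, glasses reflecting neon", bubble: "thought", text: "The napkin was still warm." },
    { position: "bottom-left", action: "{character} pushes through the ramen shop door, owner at the counter", bubble: "speech", text: "You again?" },
    { position: "bottom-right", action: "{character} slides a photograph across the counter", bubble: "speech", text: "Tell me everything." },
  ]
);
console.log(noirPrompt);
```

Keeping the sheet in one variable means a wardrobe change is a one-line edit that propagates to all four panels.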
Step 3: bubble glyphs inside the prompt
ERNIE's typography head wins here. Short bubbles of four to eight words land clean nine times out of ten. CJK glyphs render at higher fidelity than Latin ones; a Chinese-language noir strip is the best demo this model has.
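Because bubble length matters, it can help to check a prompt before spending a render on it. The sketch below pulls every `bubble reads "..."` string out of a prompt and flags any over eight words. It assumes the upper bound is the limit that matters (very short bubbles such as "You again?" are treated as fine), and `checkBubbles` and `MAX_BUBBLE_WORDS` are hypothetical names.

```typescript
// Hypothetical pre-flight check; the eight-word ceiling comes from the text,
// treating shorter bubbles as acceptable is an assumption.
const MAX_BUBBLE_WORDS = 8;

interface BubbleReport {
  text: string;  // bubble text as written in the prompt
  words: number; // whitespace-separated word count
  ok: boolean;   // true when the bubble is within the ceiling
}

function checkBubbles(prompt: string): BubbleReport[] {
  const pattern = /bubble reads "([^"]*)"/g;
  const reports: BubbleReport[] = [];
  let match: RegExpExecArray | null = pattern.exec(prompt);
  while (match !== null) {
    const text = match[1];
    const words = text
      .trim()
      .split(/\s+/)
      .filter((word) => word.length > 0).length;
    reports.push({ text, words, ok: words <= MAX_BUBBLE_WORDS });
    match = pattern.exec(prompt);
  }
  return reports;
}

const samplePrompt =
  'Panel 1: speech bubble reads "She was here an hour ago." ' +
  'Panel 2: thought bubble reads "I never should have taken this case from the start, not ever."';

const bubbleReports = checkBubbles(samplePrompt);
bubbleReports
  .filter((report) => !report.ok)
  .forEach((report) => console.log(`Too long (${report.words} words): ${report.text}`));
```

Anything flagged is a candidate for splitting across two panels or trimming before the render.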

Step 4: iterate on seed, not prompt
If panel three looks off, rerun with a new seed. The character sheet stays anchored. Run five in parallel:
const seeds = [42, 117, 220, 555, 901];
const results = await Promise.all(
  seeds.map((seed) => fal.subscribe("fal-ai/ernie-image", {
    input: { prompt, image_size: "square_hd", num_inference_steps: 50, seed }
  }))
);
At $0.03 per megapixel, a 1024x1024 render (about 1.05 megapixels) costs roughly $0.03. Five parallel runs cost about $0.15 and you pick the best. For drafts drop to /turbo at 8 steps: $0.01 per megapixel, a full strip for about a cent.
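The arithmetic behind those figures is easy to sketch. The two rates are the ones quoted above; how the service rounds megapixels when billing is not stated here, so this computes the unrounded figure. `estimateCost` is a hypothetical helper.

```typescript
// Rates quoted in the text. Actual billing may round differently (assumption).
const BASE_RATE_PER_MP = 0.03;
const TURBO_RATE_PER_MP = 0.01;

function estimateCost(
  width: number,
  height: number,
  ratePerMegapixel: number,
  runs: number = 1
): number {
  // One megapixel is one million pixels.
  const megapixels = (width * height) / 1e6;
  return megapixels * ratePerMegapixel * runs;
}

const singleRender = estimateCost(1024, 1024, BASE_RATE_PER_MP);
const fiveSeeds = estimateCost(1024, 1024, BASE_RATE_PER_MP, 5);
const turboDraft = estimateCost(1024, 1024, TURBO_RATE_PER_MP);

console.log(`single: $${singleRender.toFixed(3)}`);
console.log(`five seeds: $${fiveSeeds.toFixed(3)}`);
console.log(`turbo draft: $${turboDraft.toFixed(3)}`);
```

Since 1024x1024 is slightly more than one megapixel, the unrounded numbers land a hair above the round figures in the text.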
Why this beats Flux and SDXL
Flux 2 Pro is sharper on single-subject photorealism. Ask it for a four-panel page and one panel lands, the other three smear. SDXL gives you four distinct panels but every character looks like a different person. ERNIE renders the whole page in one call, respects your grid, keeps the character anchored, and draws the bubbles. One call, one file, $0.03.
Try it with the character you have been sketching. Four panels, fifty steps, one seed. Three cents.