
CJK Text Rendering: Where ERNIE-Image Beats Flux 2 Pro and GPT Image 2

Flux 2 Pro and GPT Image 2 are frontier general purpose models. ERNIE-Image is a specialist. On dense Chinese signage, bilingual menus, and Japanese posters, the specialist wins by a visible margin.

By ernie-api editorial · 7 min read

You can measure image models on GenEval or OneIG scores and get a reasonable sense of general quality. You cannot measure typography fidelity from those numbers. The scores that matter for CJK rendering are LongTextBench, where ERNIE-Image posts 0.9733, and the bilingual split of OneIG, where the Chinese half (OneIG-ZH at 0.5543) runs close to the English half (OneIG-EN at 0.5750). Flux 2 Pro and GPT Image 2 do not have comparable public numbers on those benchmarks because CJK fidelity was not a primary training objective for either model.

The training story matters here. ERNIE-Image was trained on a Baidu-heavy corpus that overweights Chinese web typography, Chinese poster design, product packaging, street signage, and manga panels. The 8B DiT saw more actual Chinese glyphs during pretraining than any Western frontier image model. That corpus shapes what the model can render, not just what it can describe.

Three test cases

The gap is most visible on three specific rendering tasks. Bilingual menu cards with Simplified Chinese dish names and English translations. Chinese titles with English subtitles in a poster layout. Japanese signage with mixed kanji and kana on a photographic background. In every one of these, you can prompt Flux 2 Pro or GPT Image 2 and get a result that looks stylistically fine but has glyph errors that a native reader will catch instantly.

Bilingual menu cards

You want a menu card with five dishes. Each dish has a Chinese name of roughly four characters and an English translation underneath. The Chinese names are specific. 宫保鸡丁 for Kung Pao Chicken. 麻婆豆腐 for Mapo Tofu. 东坡肉 for Dongpo Pork. 鱼香茄子 for Fish Fragrant Eggplant. 担担面 for Dan Dan Noodles.

Flux 2 Pro will render the English translations cleanly. The Chinese characters will look approximately right but will frequently contain noise glyphs, missing strokes, and in some cases completely invented characters that do not exist in Unicode. GPT Image 2 is better than Flux on simpler Chinese strings but falls apart at the density of five dishes on one card. ERNIE-Image, called with enable_prompt_enhancer: true and the strings in quotes, produces clean renderings on the first pass most of the time.

Menu card comparison

Chinese titles with English subtitles

The test prompt is a concert poster. Title 东方之声 in large bold Simplified Chinese. Subtitle EASTERN SOUND 2026 in thin English caps, 40 percent smaller, directly below the title. Date line March 14 to 16 placed at the bottom.

The hierarchy is the interesting part. Flux 2 Pro often renders the Chinese title cleanly when it is by itself, but gets confused when you add an English subtitle in a smaller weight. The kerning on the subtitle drifts, and the model sometimes substitutes a different Chinese title to match the visual rhythm of the English. GPT Image 2 holds hierarchy better but still produces character errors on the Chinese side. ERNIE-Image handles the layered bilingual layout because it saw thousands of Chinese poster designs with English subtitles during training. This pattern is how Chinese consumer goods have been marketed for the last 15 years, and the Baidu corpus is saturated with it.
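As a sketch, the layered poster prompt can be assembled the same way as the menu call later in the article. The array-join pattern and the convention of quoting the Chinese title are carried over from that example; the exact sentence wording here is illustrative:

```typescript
// Sketch: assemble the layered bilingual poster prompt for this test.
// The double-quoted Chinese string follows the quoting convention used
// in the menu example later in the article.
function buildPosterPrompt(): string {
  return [
    'A concert poster.',
    'Title "东方之声" in large bold Simplified Chinese.',
    'Subtitle "EASTERN SOUND 2026" in thin English caps, 40 percent smaller, directly below the title.',
    'Date line "March 14 to 16" placed at the bottom.',
  ].join(' ');
}
```

The resulting string goes into the prompt field of the fal.subscribe call shown in the fal call section.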

Japanese signage

Japanese is harder than Chinese because mixing kanji, hiragana, and katakana in the same string adds two more scripts to render. A storefront sign that reads 駅前居酒屋さくら mixes kanji with hiragana and needs to look hand painted and weathered.

Flux 2 Pro produces beautiful photorealism with often incorrect glyphs. GPT Image 2 is stylistically weaker on the hand painted look but slightly more accurate on glyphs. ERNIE-Image is less deeply trained on Japanese than on Chinese, but the shared CJK base corpus still gives it a clear edge on mixed script strings.
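A hedged sketch of the request payload for this sign, reusing the parameter names from the fal call shown later in the piece. The landscape image_size value is an assumption for a storefront framing, not something the article specifies:

```typescript
// Sketch: input payload for the Japanese storefront test.
// Parameter names mirror the fal call later in the article; the
// landscape image_size value is an assumption, not from the article.
const izakayaInput = {
  prompt: [
    'A weathered, hand painted storefront sign reading "駅前居酒屋さくら",',
    'mixed kanji and hiragana, photographic street background, dusk light.',
  ].join(' '),
  image_size: 'landscape_4_3',
  num_inference_steps: 50,
  enable_prompt_enhancer: true,
};
```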

Why Flux and GPT struggle

The short version is sample density. Flux 2 Pro and GPT Image 2 are frontier general models trained on globally scraped corpora. The ratio of English to Chinese typography samples in those corpora skews 20 to 1 or higher. The models learn Chinese as a visual texture pattern rather than a character by character glyph system. That is enough to pass a casual look, but a reader who can actually parse the characters will spot the errors on inspection.

ERNIE-Image was trained by a team whose first product is a Chinese search engine. Every Chinese character that appears in a Baidu web index was available during pretraining at native density. You are not getting a better general model. You are getting a specialist on the one axis that matters for CJK typography.

Script density comparison

The fal call

Here is the call for a bilingual menu card. Enhancer on, 50 steps, quoted strings for the Chinese names so the model knows exactly what glyphs to render.

```ts
import { fal } from '@fal-ai/client';

fal.config({ credentials: process.env.FAL_KEY });

const menu = await fal.subscribe('fal-ai/ernie-image', {
  input: {
    prompt: [
      'A minimalist bilingual chalkboard menu card for a Sichuan restaurant.',
      'Header reads "川菜小馆" in bold Simplified Chinese, subheader "Sichuan Kitchen" in thin English caps.',
      'Five dishes listed in two columns, each with a Chinese name in quotes followed by an English translation.',
      'Dishes: "宫保鸡丁" Kung Pao Chicken, "麻婆豆腐" Mapo Tofu, "东坡肉" Dongpo Pork, "鱼香茄子" Fish Fragrant Eggplant, "担担面" Dan Dan Noodles.',
      'Deep black slate background, warm chalk white typography, soft grain texture, no photograph.',
    ].join(' '),
    image_size: 'portrait_4_3',
    num_inference_steps: 50,
    enable_prompt_enhancer: true,
  },
  logs: true,
});

console.log(menu.data.images[0].url);
```

When to still reach for Flux or GPT

ERNIE-Image is not the answer for every image. If your prompt has no CJK text, Flux 2 Pro is a stronger general photographic model and the choice is obvious. If your prompt depends on GPT Image 2 specific capabilities, such as the style and rendering instructions it follows well, that is the right call. The rule is narrow. When CJK glyph fidelity is a must pass criterion, ERNIE-Image is where your pipeline routes. For everything else, you pick on general quality.
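That routing rule can be expressed as a small sketch: if the prompt contains any Han, hiragana, or katakana glyphs, send it to ERNIE-Image; otherwise fall back to a general model. The Flux endpoint id below is a placeholder assumption, not something the article specifies:

```typescript
// Sketch of the narrow routing rule: CJK glyphs in the prompt route to
// ERNIE-Image, everything else to a general model. The fallback endpoint
// id is a placeholder assumption.
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;

function pickModel(prompt: string): string {
  return CJK.test(prompt) ? 'fal-ai/ernie-image' : 'fal-ai/flux-pro';
}
```

The returned id drops straight into the first argument of fal.subscribe.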

