A new evaluation platform attempts to answer a question most LLM leaderboards ignore: which models write fiction people actually want to read.
Narrator.sh's leaderboard ranks models based on reader engagement metrics (views, bookmarks, comments, forks) rather than traditional NLP scores like BLEU or ROUGE. The platform tests three specialized roles: brainstorming (plot and worldbuilding), writing (chapter generation), and memory (context retention). Nine novels and fifteen chapters currently inform the rankings.
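Narrator.sh hasn't published a scoring formula, so the Python sketch below is only a guess at what an engagement-based leaderboard like this might compute: group stories by which model held a given role, then average a single engagement signal (views, here). Every field name, model label, and number is illustrative, not the platform's actual schema or data.

```python
from collections import defaultdict
from statistics import mean

# Illustrative records only: field names, model labels, and numbers are invented,
# not Narrator.sh's actual schema or data.
stories = [
    {"roles": {"brainstorming": "model-a", "writing": "model-b", "memory": "model-c"},
     "views": 40, "bookmarks": 3, "comments": 1, "forks": 0},
    {"roles": {"brainstorming": "model-a", "writing": "model-d", "memory": "model-c"},
     "views": 62, "bookmarks": 5, "comments": 2, "forks": 1},
]

def leaderboard(stories, role, metric="views"):
    """Rank models within one role by their mean per-story value on one engagement metric."""
    per_model = defaultdict(list)
    for story in stories:
        per_model[story["roles"][role]].append(story[metric])
    return sorted(((mean(vals), model) for model, vals in per_model.items()), reverse=True)

print(leaderboard(stories, "brainstorming"))  # -> [(51, 'model-a')]
```

A real implementation would presumably blend the other signals (bookmarks, comments, forks) rather than averaging raw views, but the structure is the point: the score comes from reader behavior, not a reference-text comparison.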
Early leaders
Qwen3-8B tops the brainstorming category with an average of 51 views per story. For actual chapter writing, Qwen3 Next 80B leads at 51 views, followed by GLM-4-32B at 22. OpenAI's o1-pro ranks second for brainstorming but fifth for writing.
The memory rankings show zero engagement across all models, suggesting either that there isn't yet enough data or that readers don't notice context issues until later chapters.
The behavioral benchmark thesis
This approach addresses a genuine gap. As of early 2026, at least 10 specialized LLM benchmarks exist for coding, math, and function-calling, but none formally evaluate long-form narrative quality. Traditional leaderboards test intelligence through academic proxies (MMLU, TruthfulQA), not creative output that holds reader attention.
Chatbot Arena pioneered user preference scoring for conversational quality. Narrator extends that methodology to fiction, where coherence over thousands of words matters more than single-turn responses.
Trade-offs worth noting
The sample size creates obvious validity concerns. Fifteen chapters across nine novels isn't enough to separate model capability from genre preferences or viral randomness. Reader demographics matter: a platform attracting sci-fi enthusiasts will produce different rankings than one serving romance readers.
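A rough back-of-the-envelope check (with made-up numbers, not platform data) shows why: when per-story view counts swing as widely as viral dynamics make them, a handful of stories per model yields a confidence interval far wider than the gaps the current rankings report.

```python
from statistics import mean, stdev
from math import sqrt

# Illustrative view counts for one model's stories (made up, not platform data):
# one mildly viral story dominates the average.
views = [5, 12, 8, 140]

n = len(views)
avg = mean(views)
se = stdev(views) / sqrt(n)                    # standard error of the mean
low, high = avg - 1.96 * se, avg + 1.96 * se   # rough 95% interval, assuming normality

print(f"mean={avg:.1f}, 95% CI ~ ({low:.1f}, {high:.1f})")
# mean=41.2, 95% CI ~ (-23.3, 105.8) -- far wider than the 51-vs-22 gap in the writing rankings
```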
The architecture also limits comparability. Each novel uses three different models in fixed roles, so you can't directly compare, say, GPT-4o's brainstorming to its writing performance: the two roles would appear in different stories with different prompts.
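Concretely, and again with invented names rather than the platform's real data model, fixed per-novel role assignments look something like this; pulling one model's numbers across roles ends up comparing different novels.

```python
# Hypothetical role assignments (not Narrator.sh's actual data model):
# each novel fixes one model per role.
novels = {
    "novel-1": {"brainstorming": "model-a", "writing": "model-b", "memory": "model-c"},
    "novel-2": {"brainstorming": "model-b", "writing": "model-d", "memory": "model-a"},
}

def novels_featuring(model: str, role: str) -> set[str]:
    """Return the novels in which `model` held `role`."""
    return {nid for nid, roles in novels.items() if roles[role] == model}

# model-b's brainstorming and writing numbers come from disjoint novels, so any
# cross-role comparison also changes the story, the prompts, and the co-models.
print(novels_featuring("model-b", "brainstorming"))  # {'novel-2'}
print(novels_featuring("model-b", "writing"))        # {'novel-1'}
```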
What this signals
The platform's existence matters more than its current rankings. It reflects growing enterprise frustration with benchmarks that ignore deployment realities: latency, cost per token, and whether outputs actually work for the intended task.
For organizations evaluating models for content generation (marketing copy, training materials, summarization), engagement-based metrics are more relevant than academic test scores. A model that aces MMLU but writes boring prose doesn't help you.
The question is whether reader engagement on a niche fiction platform generalizes to other creative writing tasks. We'll see.