Tokenizers are the unglamorous infrastructure behind every generative AI deployment. They convert text, code, and images into numerical tokens that models like GPT-4 can process. Get them wrong and your model struggles with basic tasks, especially outside English.
The basics matter
Tokenizers split input into units (words, subwords, or characters) and assign each a unique ID. The choice affects everything: context window usage, processing costs, and output quality. Subword tokenizers like Byte Pair Encoding (BPE) and WordPiece dominate because they balance granularity and vocabulary size. Split text too finely and every request burns through the context window; split it too coarsely and the model misses nuances in rare words and morphology.
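As a concrete illustration, here is a minimal sketch using OpenAI's tiktoken library (assumed installed via pip); the exact IDs and splits depend on which encoding you load.

    import tiktoken

    # Load the BPE encoding used by GPT-3.5/GPT-4-era models.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenizers split input into subword units."
    token_ids = enc.encode(text)                        # integer IDs the model sees
    pieces = [enc.decode([tid]) for tid in token_ids]   # the subword each ID maps back to

    print(len(token_ids), token_ids)
    print(pieces)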
The original ChatGPT model (GPT-3.5-turbo) worked within a 4,096-token context window, shared between the prompt and the reply. That sounds generous until you run non-English text.
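A quick way to see how fast that budget goes is to count tokens before sending a prompt. A rough sketch, again assuming tiktoken, with an illustrative 4,096-token budget:

    import tiktoken

    CONTEXT_BUDGET = 4096  # illustrative window size; real limits vary by model
    enc = tiktoken.get_encoding("cl100k_base")

    def fits_in_context(prompt: str, reserved_for_output: int = 512) -> bool:
        # The reply shares the same window, so leave headroom for it.
        return len(enc.encode(prompt)) + reserved_for_output <= CONTEXT_BUDGET

    print(fits_in_context("Summarize the attached meeting notes."))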
The APAC problem
Languages like Thai, Chinese, and Turkish need 6-10x more tokens than English for the same meaning. This isn't academic: it directly impacts API costs and context window utilization. A Thai government agency running chatbots pays more per interaction and hits token limits faster than an English equivalent.
Logographic scripts (Chinese) and agglutinative languages (Turkish) break tokenizer assumptions built around English. The result: higher costs, worse performance, and subtle biases in model behavior.
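You can measure the effect on your own content by comparing token counts across languages. A rough sketch with tiktoken and placeholder sample sentences; the multipliers you get will depend on the encoding and the text:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # Placeholder sentences expressing roughly the same meaning in each language.
    samples = {
        "English": "The weather is very nice today.",
        "Chinese": "今天天气很好。",
        "Thai": "วันนี้อากาศดีมาก",
        "Turkish": "Bugün hava çok güzel.",
    }

    baseline = len(enc.encode(samples["English"]))
    for lang, text in samples.items():
        n = len(enc.encode(text))
        print(f"{lang}: {n} tokens ({n / baseline:.1f}x English)")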
Math still doesn't work
Tokenizers also explain why GPT-4 struggles with arithmetic. Depending on the tokenizer, the number "380" might be one token while "381" splits into two. Because digits aren't segmented consistently, the model never sees numbers as a stable positional representation, which makes reliable arithmetic harder. It's why, in testing, GPT-4 has incorrectly claimed that 7,735 is greater than 7,926.
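You can check how your tokenizer splits numbers directly. A small sketch with tiktoken; how any given number splits depends on the encoding, so treat the exact counts as illustrative:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for number in ["380", "381", "7735", "7926"]:
        ids = enc.encode(number)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{number!r} -> {len(ids)} token(s): {pieces}")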
What to watch
Enterprises deploying LLMs should audit tokenizer performance across their actual language mix. If you're running multilingual support or serving APAC markets, tokenizer inefficiency is a hidden cost multiplier. SentencePiece and custom BPE implementations can help, but the trade-offs between vocabulary size and token efficiency remain.
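A minimal audit could replay a sample of real prompts per language and report average tokens per request, which translates directly into cost. A sketch under the assumption that you have such a sample on hand; the prompts and the price constant below are hypothetical:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # Hypothetical per-language prompt samples pulled from production logs.
    prompts_by_language = {
        "en": ["Where is my order?", "Cancel my subscription."],
        "th": ["คำสั่งซื้อของฉันอยู่ที่ไหน", "ยกเลิกการสมัครสมาชิกของฉัน"],
    }

    PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical price in USD

    for lang, prompts in prompts_by_language.items():
        avg = sum(len(enc.encode(p)) for p in prompts) / len(prompts)
        cost = avg / 1000 * PRICE_PER_1K_INPUT_TOKENS
        print(f"{lang}: {avg:.0f} tokens/request on average (~${cost:.6f}/request)")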
A February 2025 arXiv survey notes ongoing research into discrete tokenizers for multimodal models (text, image, video), but production improvements remain incremental. The fundamental constraints haven't changed since 2023.
Tokenizers aren't sexy infrastructure, but they're load-bearing. Ignore them and you'll wonder why your deployment costs are higher and performance is worse than benchmarks suggested.