Tokenizers are the unglamorous infrastructure behind every generative AI deployment. They convert text, code, and images into numerical tokens that models like GPT-4 can process. Get them wrong and your model struggles with basic tasks, especially outside English.
The basics matter
Tokenizers split input into units (words, subwords, or characters) and assign each a unique ID. The choice affects everything: context window usage, processing costs, and output quality. Subword tokenizers like Byte Pair Encoding (BPE) and WordPiece dominate because they balance granularity and vocabulary size. Split text too finely and every request burns through the context window; split it too coarsely and the model misses nuances in rare words and morphology.
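As a concrete illustration, here is a minimal sketch using OpenAI's tiktoken library (assumed installed via pip); the exact IDs and splits depend on which encoding you load.

    import tiktoken

    # Load the BPE encoding used by GPT-3.5/GPT-4-era models.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenizers split input into subword units."
    token_ids = enc.encode(text)                        # integer IDs the model sees
    pieces = [enc.decode([tid]) for tid in token_ids]   # the subword each ID maps back to

    print(len(token_ids), token_ids)
    print(pieces)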
The original ChatGPT model (GPT-3.5-turbo) worked within a 4,096-token context window, shared between the prompt and the reply. That sounds generous until you run non-English text.
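A quick way to see how fast that budget goes is to count tokens before sending a prompt. A rough sketch, again assuming tiktoken, with an illustrative 4,096-token budget:

    import tiktoken

    CONTEXT_BUDGET = 4096  # illustrative window size; real limits vary by model
    enc = tiktoken.get_encoding("cl100k_base")

    def fits_in_context(prompt: str, reserved_for_output: int = 512) -> bool:
        # The reply shares the same window, so leave headroom for it.
        return len(enc.encode(prompt)) + reserved_for_output <= CONTEXT_BUDGET

    print(fits_in_context("Summarize the attached meeting notes."))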
The APAC problem
Languages like Thai, Chinese, and Turkish need 6-10x more tokens than English for the same meaning. This isn't academic: it directly impacts API costs and context window utilization. A Thai government agency running chatbots pays more per interaction and hits token limits faster than an English equivalent.
Logographic scripts (Chinese) and agglutinative languages (Turkish) break tokenizer assumptions built around English. The result: higher costs, worse performance, and subtle biases in model behavior.
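You can measure the effect on your own content by comparing token counts across languages. A rough sketch with tiktoken and placeholder sample sentences; the multipliers you get will depend on the encoding and the text:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # Placeholder sentences expressing roughly the same meaning in each language.
    samples = {
        "English": "The weather is very nice today.",
        "Chinese": "今天天气很好。",
        "Thai": "วันนี้อากาศดีมาก",
        "Turkish": "Bugün hava çok güzel.",
    }

    baseline = len(enc.encode(samples["English"]))
    for lang, text in samples.items():
        n = len(enc.encode(text))
        print(f"{lang}: {n} tokens ({n / baseline:.1f}x English)")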
Math still doesn't work
Tokenizers also explain why GPT-4 struggles with arithmetic. Depending on the tokenizer, the number "380" might be one token while "381" splits into two. Because digits aren't segmented consistently, the model never sees numbers as a stable positional representation, which makes reliable arithmetic harder. It's why, in testing, GPT-4 has incorrectly claimed that 7,735 is greater than 7,926.
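You can check how your tokenizer splits numbers directly. A small sketch with tiktoken; how any given number splits depends on the encoding, so treat the exact counts as illustrative:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for number in ["380", "381", "7735", "7926"]:
        ids = enc.encode(number)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{number!r} -> {len(ids)} token(s): {pieces}")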
What to watch
Enterprises deploying LLMs should audit tokenizer performance across their actual language mix. If you're running multilingual support or serving APAC markets, tokenizer inefficiency is a hidden cost multiplier. SentencePiece and custom BPE implementations can help, but the trade-offs between vocabulary size and token efficiency remain.
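A minimal audit could replay a sample of real prompts per language and report average tokens per request, which translates directly into cost. A sketch under the assumption that you have such a sample on hand; the prompts and the price constant below are hypothetical:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # Hypothetical per-language prompt samples pulled from production logs.
    prompts_by_language = {
        "en": ["Where is my order?", "Cancel my subscription."],
        "th": ["คำสั่งซื้อของฉันอยู่ที่ไหน", "ยกเลิกการสมัครสมาชิกของฉัน"],
    }

    PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical price in USD

    for lang, prompts in prompts_by_language.items():
        avg = sum(len(enc.encode(p)) for p in prompts) / len(prompts)
        cost = avg / 1000 * PRICE_PER_1K_INPUT_TOKENS
        print(f"{lang}: {avg:.0f} tokens/request on average (~${cost:.6f}/request)")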
A February 2025 arXiv survey notes ongoing research into discrete tokenizers for multimodal models (text, image, video), but production improvements remain incremental. The fundamental constraints haven't changed since 2023.
Tokenizers aren't sexy infrastructure, but they're load-bearing. Ignore them and you'll wonder why your deployment costs are higher and performance is worse than benchmarks suggested.