Running llama.cpp binaries directly on consumer hardware is becoming a testbed for understanding how small language models actually behave under constraints.
A developer compiled llama.cpp with AVX-512 flags on a laptop i7-1165G7, hitting ~390% CPU load (roughly half the 4-core/8-thread chip's capacity) and around 20 tokens per second with a quantized Llama 3.1 8B. No wrappers like LM Studio or Ollama, just Linux commands and hardware. This is what "bare metal inference" looks like for folks testing SLMs without GPU budgets.
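For reference, a minimal sketch of that kind of setup, assuming a recent llama.cpp checkout. The exact CMake option names have shifted across versions (older builds used LLAMA_-prefixed flags), and the GGUF path and the -t 4 thread count are placeholders chosen to match the ~390% CPU load described above.

```bash
# Build llama.cpp with native CPU optimizations (picks up AVX-512 on a Tiger Lake i7-1165G7).
# Older revisions used LLAMA_NATIVE / LLAMA_AVX512 instead of the GGML_* options.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j

# One-shot prompt with a quantized Llama 3.1 8B GGUF (path is a placeholder);
# -t 4 pins work to the four physical cores, matching the ~390% load reported above.
./build/bin/llama-cli -m ./models/llama-3.1-8b-instruct-q4_k_m.gguf -t 4 -n 128 \
  -p "Summarize what the llama.cpp project does."

# Same binary in interactive chat: -cnv is llama.cpp's conversation-mode flag,
# the one the model later misidentified as "convolutional groups".
./build/bin/llama-cli -m ./models/llama-3.1-8b-instruct-q4_k_m.gguf -t 4 -cnv
```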
The more interesting part: the conversation itself. When asked what the "-cnv" flag does in llama-cli, the model fabricated an answer about "convolutional groups," then "CPU cores for conversion." Both wrong; the flag simply switches llama-cli into interactive conversation mode, as in the sketch above. When corrected and told honesty matters more than guessing, it acknowledged the behavior and adjusted.
This maps to a documented pattern in SLM evaluation. Some models prioritize being "helpful" (the "Solver" approach), inventing plausible-sounding answers. Others prioritize accuracy (the "Judge" approach), admitting uncertainty. Neither is universally better. For technical queries, judges win. For creative tasks, solvers are the point.
For enterprise deployments, this matters. If you're running SLMs for documentation lookup or compliance checks, fabricated answers are a liability. If you're prototyping creative content or brainstorming, an over-cautious model that refuses to speculate is the enemy. The model's default behavior affects which use case it suits.
The performance context: CPU-only inference with AVX-512 optimizations can deliver 2.8x token generation speedups on recent AMD and Intel chips. Prompt evaluation hit 25-29 tokens/s in this test, generation around 6 tokens/s. That's usable for single-user, offline tasks, edge deployments, or testing. It's not production-grade for high concurrency, where vLLM on GPUs scales better.
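Those per-phase numbers are the kind of thing llama.cpp's bundled llama-bench reports directly, splitting prompt processing (pp) from text generation (tg). A quick sketch, with a placeholder model path and illustrative -p/-n sizes:

```bash
# llama-bench reports prompt-processing and text-generation tokens/s separately,
# which is how figures like "25-29 t/s prompt eval, ~6 t/s generation" are measured.
# -p 512 sets the prompt length, -n 128 the number of generated tokens.
./build/bin/llama-bench -m ./models/llama-3.1-8b-instruct-q4_k_m.gguf -t 4 -p 512 -n 128
```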
llama.cpp excels at predictable, low-resource scenarios. The trade-off is throughput that stays flat as concurrent load grows, as the sketch below illustrates. For enterprises testing local SLM deployments or edge use cases, that's often acceptable. For customer-facing applications at scale, it's not.
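One rough way to see that scaling behavior yourself is to point a few concurrent requests at llama.cpp's bundled llama-server. A sketch under assumed defaults, with a placeholder model path and an example prompt:

```bash
# Start llama.cpp's HTTP server with 4 parallel slots sharing one context window.
./build/bin/llama-server -m ./models/llama-3.1-8b-instruct-q4_k_m.gguf \
  -t 4 -c 4096 -np 4 --port 8080 &
sleep 30  # give the server time to load the model

# Fire four requests at once and time the batch; on a 4-core laptop the aggregate
# tokens/s barely improves with concurrency, unlike GPU batching in vLLM.
time ( for i in 1 2 3 4; do
  curl -s http://localhost:8080/completion \
    -d '{"prompt": "Explain quantization in one paragraph.", "n_predict": 64}' > /dev/null &
done; wait )
```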
The conversation itself is worth reading. It shows how SLMs respond to correction and context-switching between factual and creative modes. That adaptability, or lack of it, determines whether a model fits your workflow.