Voice bot latency under 500ms: Vapi's concurrency tweaks for appointment setters

Most AI appointment bots hit 400-800ms delays when chaining calendar lookups, breaking conversation flow. Vapi's stack (concurrent function calls, Twilio connection pooling, async webhooks) cuts the lookup overhead to sub-200ms. The trade-offs matter.

The Problem

Voice AI appointment setters typically spike to 400-800ms latency when querying calendar APIs mid-conversation. That's the difference between natural flow and users hanging up. Three bottlenecks kill performance: blocking speech-to-text while waiting for availability checks, creating new Twilio connections per request, and synchronous webhook processing.
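The third bottleneck is the cheapest to fix: acknowledge the webhook before doing the work. A minimal sketch, assuming an Express backend; the endpoint name and payload shape are illustrative, not Vapi's actual schema, and this only applies to webhooks that don't need to return data into the conversation:

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhooks/booking", (req, res) => {
  // Respond immediately so the voice pipeline isn't blocked on our backend.
  res.status(200).json({ received: true });

  // Do the slow work after the response has gone out.
  processBooking(req.body).catch((err) => {
    console.error("background booking failed", err);
  });
});

// Hypothetical background worker: CRM writes, confirmation emails, etc.
// None of it holds up the call.
async function processBooking(payload: unknown): Promise<void> {}

app.listen(3000);
```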

What Works

Vapi's recommended stack gets to ~465ms end-to-end on web and 965ms+ on telephony networks (PSTN adds ~600ms of overhead versus ~100ms for WebRTC). The architecture: Deepgram Nova-2 STT (80-120ms), GPT-4o-mini as the LLM (180-250ms instead of GPT-4's 400ms), ElevenLabs Turbo v2 TTS (140ms vs 280ms standard). The key optimization is concurrent function execution: calendar lookups run async while STT continues listening, shaving 150-400ms.
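The pattern is simple in code: fire the lookup on the first partial transcript that signals scheduling intent, keep consuming STT, and only await the result when the reply needs it. A sketch, not Vapi's internal implementation; fetchAvailability and the URL are placeholders:

```typescript
async function handleTurn(transcriptStream: AsyncIterable<string>) {
  let availability: Promise<string[]> | null = null;

  for await (const partial of transcriptStream) {
    // Kick off the lookup on the first scheduling-shaped partial --
    // don't wait for the user to finish speaking.
    if (availability === null && /appointment|book|schedule/i.test(partial)) {
      availability = fetchAvailability();
    }
  }

  // By utterance end, the lookup has had the whole utterance duration
  // to complete, hiding 150-400ms of API latency behind the speech.
  return availability ? await availability : [];
}

// Hypothetical helper against a placeholder calendar endpoint.
async function fetchAvailability(): Promise<string[]> {
  const res = await fetch("https://calendar.example.com/slots");
  return res.json();
}
```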

Connection pooling to Twilio matters. Establishing fresh connections per call adds 40-100ms. Vapi's multi-region edges and provider flexibility (bring your own AssemblyAI, Groq, etc.) let you pin infrastructure close to users.
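In Node, pooling mostly means module-scope singletons: construct the client once at boot and reuse sockets via a keep-alive agent. A sketch; the Twilio constructor is the standard Node helper library, while the calendar client and its base URL are assumptions:

```typescript
import twilio from "twilio";
import https from "node:https";
import axios from "axios";

// One Twilio client for the process lifetime -- never per request.
const twilioClient = twilio(
  process.env.TWILIO_ACCOUNT_SID,
  process.env.TWILIO_AUTH_TOKEN
);

// Keep-alive agent: reusing sockets skips the TCP+TLS handshake that
// costs 40-100ms on every fresh connection.
const calendarHttp = axios.create({
  baseURL: "https://calendar.example.com", // placeholder
  httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 20 }),
  timeout: 1000,
});

export { twilioClient, calendarHttp };
```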

The timeout pattern is critical: abort calendar API calls after 180ms, fall back to cached generic slots. Users notice delays over 200ms; at 500ms, conversation length drops 20% according to Vapi's data.
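Here is the abort-and-fall-back pattern as a sketch; the cached slots and endpoint are placeholders:

```typescript
const CACHED_GENERIC_SLOTS = ["Tuesday 10am", "Wednesday 2pm"];

async function availabilityWithDeadline(): Promise<string[]> {
  const controller = new AbortController();
  const deadline = setTimeout(() => controller.abort(), 180);

  try {
    const res = await fetch("https://calendar.example.com/slots", {
      signal: controller.signal,
    });
    return await res.json();
  } catch {
    // Deadline blown (or any failure): answer with generic slots now,
    // reconcile against the real calendar later in the call.
    return CACHED_GENERIC_SLOTS;
  } finally {
    clearTimeout(deadline);
  }
}
```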

The Trade-offs

Pushing optimizeStreamingLatency: 4 in ElevenLabs trades a slight drop in audio quality for 40-60ms of gain. Disabling STT formatting can add 1.5 seconds if misconfigured. Semantic caching risks inconsistent personalization: "I have openings Tuesday" works until your cache expires mid-call.
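For reference, the latency knob lives in the assistant's voice config. A sketch whose field names follow Vapi's documented 11labs voice settings as of writing; verify against the current API reference before shipping:

```typescript
const voiceConfig = {
  provider: "11labs",
  voiceId: "your-voice-id", // placeholder
  model: "eleven_turbo_v2",
  // 0-4: higher values trade audio fidelity for lower time-to-first-byte.
  // 4 is the aggressive setting discussed above (~40-60ms saved).
  optimizeStreamingLatency: 4,
};
```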

Retell and Bland claim ~600ms without Vapi's full stack, which sharpens the vendor lock-in question: the deeper integration buys roughly 135ms on web. Packet loss (even 1%) can double delays despite every optimization. Network jitter on PSTN remains the wild card: your 465ms web latency becomes 965ms+ on real phone lines.

What This Means

For CTOs evaluating voice AI: sub-500ms is achievable with discipline (region pinning, streaming ASR, async everything), but telephony networks add unavoidable overhead. Budget under 1200ms per turn to avoid flow breaks. Test on actual phone lines, not just SIP clients. Watch Vapi's Call Logs API for bottleneck isolation: if calendar lookups consistently time out, your backend is the problem, not the AI stack.
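A triage sketch against Vapi's call-listing endpoint (GET https://api.vapi.ai/call): the route and auth scheme match the public docs, but the per-tool timing field pulled out below is hypothetical; inspect a real response to see what your account actually returns.

```typescript
async function findSlowLookups(apiKey: string) {
  const res = await fetch("https://api.vapi.ai/call?limit=100", {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const calls: any[] = await res.json();

  for (const call of calls) {
    // Hypothetical field name for per-tool-call timing on the record.
    const toolCalls = call.toolCallTimings ?? [];
    const slow = toolCalls.filter((t: any) => t.durationMs > 180);
    if (slow.length > 0) {
      console.log(call.id, "calendar lookups over budget:", slow.length);
    }
  }
}
```

If most calls show lookups past the 180ms budget, fix the calendar backend before touching the STT/LLM/TTS chain.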

The real question: is shaving 200ms worth the engineering complexity? For high-volume appointment setting in healthcare or logistics, probably. For low-frequency scheduling, maybe not.