Voice bot latency under 500ms: Vapi's concurrency tweaks for appointment setters

Most AI appointment bots hit 400-800ms delays when chaining calendar lookups, breaking conversation flow. Vapi's stack (concurrent function calls, Twilio connection pooling, async webhooks) cuts the lookup overhead to sub-200ms. The trade-offs matter.

The Problem

Voice AI appointment setters typically spike to 400-800ms latency when querying calendar APIs mid-conversation. That's the difference between natural flow and users hanging up. Three bottlenecks kill performance: blocking speech-to-text while waiting for availability checks, creating new Twilio connections per request, and synchronous webhook processing.
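The third bottleneck is the cheapest to fix: acknowledge the webhook before doing the work. A minimal sketch, assuming an Express backend; the endpoint name and payload shape are illustrative, not Vapi's actual schema, and this only applies to webhooks that don't need to return data into the conversation:

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhooks/booking", (req, res) => {
  // Respond immediately so the voice pipeline isn't blocked on our backend.
  res.status(200).json({ received: true });

  // Do the slow work after the response has gone out.
  processBooking(req.body).catch((err) => {
    console.error("background booking failed", err);
  });
});

// Hypothetical background worker: CRM writes, confirmation emails, etc.
// None of it holds up the call.
async function processBooking(payload: unknown): Promise<void> {}

app.listen(3000);
```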

What Works

Vapi's recommended stack gets to ~465ms end-to-end on web and 965ms+ on telephony networks (PSTN adds ~600ms of overhead versus ~100ms for WebRTC). The architecture: Deepgram Nova-2 STT (80-120ms), GPT-4o-mini as the LLM (180-250ms instead of GPT-4's 400ms), ElevenLabs Turbo v2 TTS (140ms vs 280ms standard). The key optimization is concurrent function execution: calendar lookups run async while STT continues listening, shaving 150-400ms.
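The pattern is simple in code: fire the lookup on the first partial transcript that signals scheduling intent, keep consuming STT, and only await the result when the reply needs it. A sketch, not Vapi's internal implementation; fetchAvailability and the URL are placeholders:

```typescript
async function handleTurn(transcriptStream: AsyncIterable<string>) {
  let availability: Promise<string[]> | null = null;

  for await (const partial of transcriptStream) {
    // Kick off the lookup on the first scheduling-shaped partial --
    // don't wait for the user to finish speaking.
    if (availability === null && /appointment|book|schedule/i.test(partial)) {
      availability = fetchAvailability();
    }
  }

  // By utterance end, the lookup has had the whole utterance duration
  // to complete, hiding 150-400ms of API latency behind the speech.
  return availability ? await availability : [];
}

// Hypothetical helper against a placeholder calendar endpoint.
async function fetchAvailability(): Promise<string[]> {
  const res = await fetch("https://calendar.example.com/slots");
  return res.json();
}
```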

Connection pooling to Twilio matters. Establishing fresh connections per call adds 40-100ms. Vapi's multi-region edges and provider flexibility (bring your own AssemblyAI, Groq, etc.) let you pin infrastructure close to users.
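In Node, pooling mostly means module-scope singletons: construct the client once at boot and reuse sockets via a keep-alive agent. A sketch; the Twilio constructor is the standard Node helper library, while the calendar client and its base URL are assumptions:

```typescript
import twilio from "twilio";
import https from "node:https";
import axios from "axios";

// One Twilio client for the process lifetime -- never per request.
const twilioClient = twilio(
  process.env.TWILIO_ACCOUNT_SID,
  process.env.TWILIO_AUTH_TOKEN
);

// Keep-alive agent: reusing sockets skips the TCP+TLS handshake that
// costs 40-100ms on every fresh connection.
const calendarHttp = axios.create({
  baseURL: "https://calendar.example.com", // placeholder
  httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 20 }),
  timeout: 1000,
});

export { twilioClient, calendarHttp };
```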

The timeout pattern is critical: abort calendar API calls after 180ms, fall back to cached generic slots. Users notice delays over 200ms; at 500ms, conversation length drops 20% according to Vapi's data.
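Here is the abort-and-fall-back pattern as a sketch; the cached slots and endpoint are placeholders:

```typescript
const CACHED_GENERIC_SLOTS = ["Tuesday 10am", "Wednesday 2pm"];

async function availabilityWithDeadline(): Promise<string[]> {
  const controller = new AbortController();
  const deadline = setTimeout(() => controller.abort(), 180);

  try {
    const res = await fetch("https://calendar.example.com/slots", {
      signal: controller.signal,
    });
    return await res.json();
  } catch {
    // Deadline blown (or any failure): answer with generic slots now,
    // reconcile against the real calendar later in the call.
    return CACHED_GENERIC_SLOTS;
  } finally {
    clearTimeout(deadline);
  }
}
```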

The Trade-offs

Pushing optimizeStreamingLatency: 4 in ElevenLabs trades a slight drop in audio quality for 40-60ms of gain. Disabling STT formatting can add 1.5 seconds if misconfigured. Semantic caching risks inconsistent personalization: "I have openings Tuesday" works until your cache expires mid-call.
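For reference, the latency knob lives in the assistant's voice config. A sketch whose field names follow Vapi's documented 11labs voice settings as of writing; verify against the current API reference before shipping:

```typescript
const voiceConfig = {
  provider: "11labs",
  voiceId: "your-voice-id", // placeholder
  model: "eleven_turbo_v2",
  // 0-4: higher values trade audio fidelity for lower time-to-first-byte.
  // 4 is the aggressive setting discussed above (~40-60ms saved).
  optimizeStreamingLatency: 4,
};
```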

Retell and Bland claim ~600ms without Vapi's full stack, which sharpens the vendor lock-in question: the deeper integration buys roughly 135ms on web. Packet loss (even 1%) can double delays despite every optimization. Network jitter on PSTN remains the wild card: your 465ms web latency becomes 965ms+ on real phone lines.

What This Means

For CTOs evaluating voice AI: sub-500ms is achievable with discipline (region pinning, streaming ASR, async everything), but telephony networks add unavoidable overhead. Budget under 1200ms per turn to avoid flow breaks. Test on actual phone lines, not just SIP clients. Watch Vapi's Call Logs API for bottleneck isolation: if calendar lookups consistently time out, your backend is the problem, not the AI stack.
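A triage sketch against Vapi's call-listing endpoint (GET https://api.vapi.ai/call): the route and auth scheme match the public docs, but the per-tool timing field pulled out below is hypothetical; inspect a real response to see what your account actually returns.

```typescript
async function findSlowLookups(apiKey: string) {
  const res = await fetch("https://api.vapi.ai/call?limit=100", {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const calls: any[] = await res.json();

  for (const call of calls) {
    // Hypothetical field name for per-tool-call timing on the record.
    const toolCalls = call.toolCallTimings ?? [];
    const slow = toolCalls.filter((t: any) => t.durationMs > 180);
    if (slow.length > 0) {
      console.log(call.id, "calendar lookups over budget:", slow.length);
    }
  }
}
```

If most calls show lookups past the 180ms budget, fix the calendar backend before touching the STT/LLM/TTS chain.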

The real question: is shaving 200ms worth the engineering complexity? For high-volume appointment setting in healthcare or logistics, probably. For low-frequency scheduling, maybe not.