Complete Setup Guide: AI Providers, SIP & Voice Agents
Introduction
VOCALS is an AI-powered telephony platform that manages real-time voice conversations. It integrates SIP providers for telephony and supports interchangeable AI providers for speech recognition (STT), language models (LLM), and speech synthesis (TTS).
In this guide you will learn how to configure VOCALS from scratch in three major steps: setting up your AI providers, connecting a SIP provider (Netelip), and creating an intelligent voice agent.
Prerequisites
Before you begin, make sure you have:
- A VOCALS account (dashboard.usevocals.com)
- API keys from at least one provider in each category (STT, LLM, TTS)
- A Netelip account with a DID (phone number) and SIP credentials
How the voice pipeline works
Every call flows through a three-stage pipeline in real time. Incoming audio is transcribed with STT, the transcription is sent to the LLM to generate a response, and the response is synthesized into audio with TTS — all in under 2 seconds.
| Stage | Function | Description |
|---|---|---|
| STT | Speech-to-Text | Converts the caller’s audio into text in real time |
| LLM | Language Model | Generates conversational responses based on the transcription and system prompt |
| TTS | Text-to-Speech | Converts the response text into audio the caller hears |
For a deeper dive into how each stage works, see our guide to AI voice agents.
Step 1: Configure AI Providers
AI providers are the core of the voice pipeline. You need to configure at least one provider for each stage: STT, LLM, and TTS. VOCALS lets you mix and match providers per agent to optimize for latency, accuracy, cost, or language support.
Accessing provider configuration
- Log in to the VOCALS dashboard (dashboard.usevocals.com).
- In the sidebar, navigate to Configuration > Providers.
- Click Add Provider.
- Select the provider type (STT, LLM, or TTS) and the specific service.
- Enter your API key and configure the provider-specific settings.
- Click Save. VOCALS will validate the key by making a test request.
Tip: Create separate API keys for VOCALS instead of reusing keys from other projects. This makes it easier to track usage and rotate credentials.
STT Providers (Speech-to-Text)
STT providers transcribe the caller’s audio to text in real time:
| Provider | Models | Notes |
|---|---|---|
| Deepgram | nova-2, nova-2-general, nova-2-phonecall | Recommended. Low latency, excellent streaming support. |
| OpenAI Whisper | whisper-1 | Batch mode. Higher latency but good accuracy in noisy environments. |
| Alibaba Qwen | qwen-audio | Strong multilingual support, especially Chinese and Asian languages. |
| Fish Audio | transcribe-1 | Batch mode, 30+ language support. In beta. |
Recommended for Deepgram: Model nova-2, language en-US (or es-ES for Spanish), Smart Format enabled, Endpointing at 300 ms, and Interim Results enabled for faster partial transcriptions.
LLM Providers (Language Model)
LLM providers generate the agent’s conversational responses based on the transcription and system prompt.
| Provider | Models | Notes |
|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo | Good quality/speed balance. gpt-4o-mini for efficient general use. |
| Anthropic Claude | claude-sonnet-4-20250514, claude-haiku-4-20250414 | Excellent at following detailed prompts and maintaining consistent personas. |
| Google Gemini | gemini-2.5-flash, gemini-2.5-pro | Very low latency at competitive pricing. Ideal for high volume. |
| Moonshot Kimi | moonshot-v1-8k, moonshot-v1-32k | Strong Chinese support. Competitive pricing for Asian markets. |
Recommended: Temperature 0.7 and Max Tokens 256 are the defaults and work well for most conversational use cases.
TTS Providers (Text-to-Speech)
TTS providers convert the LLM’s response text into audio the caller hears.
| Provider | Models / Voices | Notes |
|---|---|---|
| ElevenLabs | eleven_turbo_v2_5, eleven_multilingual_v2 | Most natural voices. Supports voice cloning. Use turbo for telephony. |
| OpenAI TTS | tts-1, tts-1-hd (voices: alloy, echo, fable, onyx, nova, shimmer) | Simple to configure. tts-1 for telephony (lower latency). |
| Resemble AI | Custom voices by UUID | Specialized in brand voice cloning. |
| Fish Audio | s2, s1, speech-1.6, speech-1.5 | Natural voice with emotion control. 30+ languages. |
Recommended for ElevenLabs: Model eleven_turbo_v2_5, Stability 0.5, Similarity Boost 0.75, Optimize Streaming Latency 3. To find your Voice ID, go to your ElevenLabs dashboard > Voices > select a voice > copy the Voice ID.
Recommended combinations by use case
| Use case | STT | LLM | TTS |
|---|---|---|---|
| General (low latency) | Deepgram nova-2 | OpenAI gpt-4o-mini | ElevenLabs turbo v2.5 |
| High quality | Deepgram nova-2 | Anthropic Claude Sonnet | ElevenLabs multilingual v2 |
| Budget | Deepgram nova-2 | Google Gemini Flash | OpenAI tts-1 |
| Multilingual (30+ languages) | Fish Audio | Google Gemini Flash | Fish Audio s2 |
To understand how the BYOK model lets you control costs for these providers, check out our dedicated guide.
Step 2: Configure the SIP Provider (Netelip)
The SIP provider connects your phone numbers to VOCALS. Netelip is a European SIP trunk provider with coverage in Spain and Latin America. VOCALS uses Asterisk as a SIP gateway for generic providers like Netelip.
Before you begin, make sure you have:
- An active Netelip account.
- A DID (phone number) assigned in your Netelip dashboard.
- Your SIP credentials (username and password) from the Netelip control panel.
Adding Netelip in the VOCALS dashboard
- In the VOCALS dashboard, navigate to Settings > SIP Providers.
- Click Add SIP Provider.
- Select Netelip as the type. This will pre-fill the SIP server and port, but you can change them if needed.
- Fill in the connection details (see table below).
- (Optional) Configure inbound call filtering under Allowed IPs to restrict which IPs can send calls to your trunk.
- Click Save.
| Field | Value | Notes |
|---|---|---|
| SIP Server | sip.netelip.com | Pre-filled. Also available: sip-eu.netelip.com for European regional servers. |
| SIP Port | 5060 | Default port for Netelip. |
| Transport | UDP | Default transport protocol for Netelip. |
| Username | (your SIP username) | The SIP username provided by Netelip in your control panel. |
| Password | (your SIP password) | The SIP password provided by Netelip. |
| Allowed IPs | (optional) | IPs allowed to send calls. Leave empty to allow all. Netelip sends from its published IP ranges. |
Verifying registration status
After saving, VOCALS registers your trunk with the SIP server. Check the status indicator on the SIP provider card:
| Status | Meaning |
|---|---|
| Green (Registered) | The trunk is connected and ready to receive calls. |
| Red (Unregistered) | Registration failed. Verify credentials and SIP server address. |
| Yellow (Unknown) | Status could not be determined. The trunk may be initializing. Wait 10-30 seconds. |
Important: If the status is Red, verify: (1) the SIP server and port are correct, (2) the username and password match your Netelip panel, (3) the firewall is not blocking port 5060 or UDP ports 10000-10100 for RTP media.
Configuring the DID in Netelip
You need to configure your DID number in the Netelip panel so incoming calls are routed to VOCALS:
- Log in to your Netelip account.
- Navigate to DID Numbers or the equivalent section.
- Set the destination of your number to your VOCALS server IP on port 5060.
- Make sure the codec is set to G.711a (alaw) or G.711u (ulaw). VOCALS auto-detects both.
Tip: When configuring a new SIP provider, start by making an outbound test call to verify audio quality and latency before configuring inbound routing.
For more details on the Netelip integration, visit our Netelip integration page.
Step 3: Create a Voice Agent
An agent is the central unit in VOCALS. It defines how an AI voice assistant behaves on a call: what it says, how it sounds, and which providers power it. Each phone number is assigned to a single agent.
Creating the agent
- In the dashboard, navigate to Agents.
- Click Create Agent.
- Give the agent a descriptive name (for example, “Customer Support - English” or “Inbound Sales”).
- Configure the settings described below.
- Click Save to create the agent.
System Prompt
The system prompt is the most important setting. It defines the agent’s personality, instructions, and constraints. It determines everything about how the agent behaves in conversation.
Best practices for writing the system prompt:
- Be specific about response length. Phone conversations need short responses (1-2 sentences per turn).
- Define a persona. Give the agent a name, tone, and personality. Callers are more comfortable with a consistent character.
- Set boundaries. Explicitly state which topics the agent should and should not discuss. List topics that should be escalated to a human.
- Include example phrases for greetings, confirmations, and sign-offs.
- Handle edge cases: what to do when the agent does not know the answer, when the caller is upset, or when the conversation goes off-track.
- Use clear structure with sections, bullet points, and numbered steps. LLMs follow structured prompts better.
Important: Avoid excessively long prompts. Every token adds latency and cost to each LLM call. Aim for 200-500 words. If you need more content, consider using the Knowledge Base for reference information.
Welcome Message
This is the first thing the agent says when a call connects. It is played as TTS audio before the agent starts listening. Example: “Hello, thank you for calling [Company]. How can I help you today?”
Leave it blank if you want the agent to wait for the caller to speak first (useful for outbound calls).
Language configuration
Configure the primary language of the conversation. This setting is passed to the STT provider to improve transcription accuracy. Common values: en-US (English), en-GB (British English), es-ES (Spanish), pt-BR, fr-FR, de-DE.
Barge-in sensitivity
Controls how easily the caller can interrupt the agent while it is speaking:
| Level | Behavior |
|---|---|
| Very Low | Requires sustained, clear speech to interrupt. Ideal for noisy environments. |
| Low | Caller must speak louder or longer. Reduces false positives from ambient noise. |
| Medium | Balanced setting. Works well for most environments. |
| High | Agent stops speaking quickly when voice is detected. For quiet environments. |
| Very High | Agent stops at the first sign of voice. Fast, quiet conversations. |
Tip: If the agent gets interrupted by background noise, lower the barge-in sensitivity. If callers complain the agent talks over them, raise it.
Other agent settings
| Setting | Description |
|---|---|
| Interruptible | Enables/disables barge-in. Disable for messages that must be heard in full (legal notices, disclaimers). |
| Max Call Duration | Maximum call length in seconds. Default: 600 (10 minutes). Reduce for simple cases (surveys, confirmations). |
| Silence Threshold | Voice activity detection (VAD) sensitivity for barge-in. Default: 0.5. High values (0.7-0.9) require more confidence; low values (0.2-0.4) are more sensitive. |
Assigning providers to the agent
Each agent needs a provider for each stage of the pipeline:
- In the agent configuration, find the Providers section.
- Select an STT provider from your configured providers.
- Select an LLM provider.
- Select a TTS provider.
Tip: You can assign different providers to different agents. For example, your English sales agent could use Deepgram + GPT-4o + ElevenLabs, while your Spanish support agent uses Deepgram + Claude Sonnet + Fish Audio.
Assigning a phone number
For the agent to receive calls, it needs a phone number assigned:
- Go to Phone Numbers in the dashboard.
- Click on the number you want to assign.
- Select the agent you just created from the dropdown.
- Click Save. The change applies on the next incoming call.
You can assign the same agent to multiple phone numbers, useful when you have local numbers from different regions that should go to the same agent.
Step 4: Make a Test Call
With everything configured, it is time to verify the system works correctly:
- Call the phone number you configured.
- You should hear the welcome message and then be able to have a conversation with your AI agent.
- After the call, check the Dashboard to review: duration, conversation transcript, latency metrics (STT, LLM, TTS), and cost breakdown.
Troubleshooting Common Issues
| Problem | Solution |
|---|---|
| SIP registration fails (Red) | Verify credentials, SIP server, port, and transport. Make sure the firewall allows traffic on port 5060 and UDP 10000-10100. |
| No audio (silent call) | Verify RTP ports (UDP 10000-10100) are open. Check NAT configuration and codec (alaw or ulaw). |
| Agent does not respond | Verify the phone number is assigned to the agent and that STT, LLM, and TTS providers have valid API keys. |
| Echo or feedback | VOCALS applies adaptive jitter buffer automatically. If it persists, it may be on the SIP provider side. Reduce TTS volume if possible. |
| Authentication errors | The call will drop gracefully and the error will appear in the logs. Rotate the API key of the affected provider. |
Conclusion
In this guide you have learned how to configure VOCALS end to end: from connecting your AI providers to creating and testing your first voice agent. With the right providers, a well-crafted system prompt, and a configured SIP trunk, your agent is ready to handle calls autonomously.
Full documentation at docs.usevocals.com. If you need to explore more integration options, visit our integrations page. To choose the plan that best fits your call volume, check our plans and pricing.