Guide Setup SIP AI Providers

Complete Setup Guide: AI Providers, SIP & Voice Agents

Vocals Team | March 31, 2026

Introduction

VOCALS is an AI-powered telephony platform that manages real-time voice conversations. It integrates SIP providers for telephony and supports interchangeable AI providers for speech recognition (STT), language models (LLM), and speech synthesis (TTS).

In this guide you will learn how to configure VOCALS from scratch in three major steps: setting up your AI providers, connecting a SIP provider (Netelip), and creating an intelligent voice agent.

Prerequisites

Before you begin, make sure you have:

A VOCALS account (dashboard.usevocals.com)
API keys from at least one provider in each category (STT, LLM, TTS)
A Netelip account with a DID (phone number) and SIP credentials

How the voice pipeline works

Every call flows through a three-stage pipeline in real time. Incoming audio is transcribed with STT, the transcription is sent to the LLM to generate a response, and the response is synthesized into audio with TTS — all in under 2 seconds.

Stage	Function	Description
STT	Speech-to-Text	Converts the caller’s audio into text in real time
LLM	Language Model	Generates conversational responses based on the transcription and system prompt
TTS	Text-to-Speech	Converts the response text into audio the caller hears

For a deeper dive into how each stage works, see our guide to AI voice agents.

Step 1: Configure AI Providers

AI providers are the core of the voice pipeline. You need to configure at least one provider for each stage: STT, LLM, and TTS. VOCALS lets you mix and match providers per agent to optimize for latency, accuracy, cost, or language support.

Accessing provider configuration

Log in to the VOCALS dashboard (dashboard.usevocals.com).
In the sidebar, navigate to Configuration > Providers.
Click Add Provider.
Select the provider type (STT, LLM, or TTS) and the specific service.
Enter your API key and configure the provider-specific settings.
Click Save. VOCALS will validate the key by making a test request.

Tip: Create separate API keys for VOCALS instead of reusing keys from other projects. This makes it easier to track usage and rotate credentials.

STT Providers (Speech-to-Text)

STT providers transcribe the caller’s audio to text in real time:

Provider	Models	Notes
Deepgram	nova-2, nova-2-general, nova-2-phonecall	Recommended. Low latency, excellent streaming support.
OpenAI Whisper	whisper-1	Batch mode. Higher latency but good accuracy in noisy environments.
Alibaba Qwen	qwen-audio	Strong multilingual support, especially Chinese and Asian languages.
Fish Audio	transcribe-1	Batch mode, 30+ language support. In beta.

Recommended for Deepgram: Model nova-2, language en-US (or es-ES for Spanish), Smart Format enabled, Endpointing at 300 ms, and Interim Results enabled for faster partial transcriptions.

LLM Providers (Language Model)

LLM providers generate the agent’s conversational responses based on the transcription and system prompt.

Provider	Models	Notes
OpenAI	gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo	Good quality/speed balance. gpt-4o-mini for efficient general use.
Anthropic Claude	claude-sonnet-4-20250514, claude-haiku-4-20250414	Excellent at following detailed prompts and maintaining consistent personas.
Google Gemini	gemini-2.5-flash, gemini-2.5-pro	Very low latency at competitive pricing. Ideal for high volume.
Moonshot Kimi	moonshot-v1-8k, moonshot-v1-32k	Strong Chinese support. Competitive pricing for Asian markets.

Recommended: Temperature 0.7 and Max Tokens 256 are the defaults and work well for most conversational use cases.

TTS Providers (Text-to-Speech)

TTS providers convert the LLM’s response text into audio the caller hears.

Provider	Models / Voices	Notes
ElevenLabs	eleven_turbo_v2_5, eleven_multilingual_v2	Most natural voices. Supports voice cloning. Use turbo for telephony.
OpenAI TTS	tts-1, tts-1-hd (voices: alloy, echo, fable, onyx, nova, shimmer)	Simple to configure. tts-1 for telephony (lower latency).
Resemble AI	Custom voices by UUID	Specialized in brand voice cloning.
Fish Audio	s2, s1, speech-1.6, speech-1.5	Natural voice with emotion control. 30+ languages.

Recommended for ElevenLabs: Model eleven_turbo_v2_5, Stability 0.5, Similarity Boost 0.75, Optimize Streaming Latency 3. To find your Voice ID, go to your ElevenLabs dashboard > Voices > select a voice > copy the Voice ID.

Recommended combinations by use case

Use case	STT	LLM	TTS
General (low latency)	Deepgram nova-2	OpenAI gpt-4o-mini	ElevenLabs turbo v2.5
High quality	Deepgram nova-2	Anthropic Claude Sonnet	ElevenLabs multilingual v2
Budget	Deepgram nova-2	Google Gemini Flash	OpenAI tts-1
Multilingual (30+ languages)	Fish Audio	Google Gemini Flash	Fish Audio s2

To understand how the BYOK model lets you control costs for these providers, check out our dedicated guide.

Step 2: Configure the SIP Provider (Netelip)

The SIP provider connects your phone numbers to VOCALS. Netelip is a European SIP trunk provider with coverage in Spain and Latin America. VOCALS uses Asterisk as a SIP gateway for generic providers like Netelip.

Before you begin, make sure you have:

An active Netelip account.
A DID (phone number) assigned in your Netelip dashboard.
Your SIP credentials (username and password) from the Netelip control panel.

Adding Netelip in the VOCALS dashboard

In the VOCALS dashboard, navigate to Settings > SIP Providers.
Click Add SIP Provider.
Select Netelip as the type. This will pre-fill the SIP server and port, but you can change them if needed.
Fill in the connection details (see table below).
(Optional) Configure inbound call filtering under Allowed IPs to restrict which IPs can send calls to your trunk.
Click Save.

Field	Value	Notes
SIP Server	sip.netelip.com	Pre-filled. Also available: sip-eu.netelip.com for European regional servers.
SIP Port	5060	Default port for Netelip.
Transport	UDP	Default transport protocol for Netelip.
Username	(your SIP username)	The SIP username provided by Netelip in your control panel.
Password	(your SIP password)	The SIP password provided by Netelip.
Allowed IPs	(optional)	IPs allowed to send calls. Leave empty to allow all. Netelip sends from its published IP ranges.

Verifying registration status

After saving, VOCALS registers your trunk with the SIP server. Check the status indicator on the SIP provider card:

Status	Meaning
Green (Registered)	The trunk is connected and ready to receive calls.
Red (Unregistered)	Registration failed. Verify credentials and SIP server address.
Yellow (Unknown)	Status could not be determined. The trunk may be initializing. Wait 10-30 seconds.

Important: If the status is Red, verify: (1) the SIP server and port are correct, (2) the username and password match your Netelip panel, (3) the firewall is not blocking port 5060 or UDP ports 10000-10100 for RTP media.

Configuring the DID in Netelip

You need to configure your DID number in the Netelip panel so incoming calls are routed to VOCALS:

Log in to your Netelip account.
Navigate to DID Numbers or the equivalent section.
Set the destination of your number to your VOCALS server IP on port 5060.
Make sure the codec is set to G.711a (alaw) or G.711u (ulaw). VOCALS auto-detects both.

Tip: When configuring a new SIP provider, start by making an outbound test call to verify audio quality and latency before configuring inbound routing.

For more details on the Netelip integration, visit our Netelip integration page.

Step 3: Create a Voice Agent

An agent is the central unit in VOCALS. It defines how an AI voice assistant behaves on a call: what it says, how it sounds, and which providers power it. Each phone number is assigned to a single agent.

Creating the agent

In the dashboard, navigate to Agents.
Click Create Agent.
Give the agent a descriptive name (for example, “Customer Support - English” or “Inbound Sales”).
Configure the settings described below.
Click Save to create the agent.

System Prompt

The system prompt is the most important setting. It defines the agent’s personality, instructions, and constraints. It determines everything about how the agent behaves in conversation.

Best practices for writing the system prompt:

Be specific about response length. Phone conversations need short responses (1-2 sentences per turn).
Define a persona. Give the agent a name, tone, and personality. Callers are more comfortable with a consistent character.
Set boundaries. Explicitly state which topics the agent should and should not discuss. List topics that should be escalated to a human.
Include example phrases for greetings, confirmations, and sign-offs.
Handle edge cases: what to do when the agent does not know the answer, when the caller is upset, or when the conversation goes off-track.
Use clear structure with sections, bullet points, and numbered steps. LLMs follow structured prompts better.

Important: Avoid excessively long prompts. Every token adds latency and cost to each LLM call. Aim for 200-500 words. If you need more content, consider using the Knowledge Base for reference information.

Welcome Message

This is the first thing the agent says when a call connects. It is played as TTS audio before the agent starts listening. Example: “Hello, thank you for calling [Company]. How can I help you today?”

Leave it blank if you want the agent to wait for the caller to speak first (useful for outbound calls).

Language configuration

Configure the primary language of the conversation. This setting is passed to the STT provider to improve transcription accuracy. Common values: en-US (English), en-GB (British English), es-ES (Spanish), pt-BR, fr-FR, de-DE.

Barge-in sensitivity

Controls how easily the caller can interrupt the agent while it is speaking:

Level	Behavior
Very Low	Requires sustained, clear speech to interrupt. Ideal for noisy environments.
Low	Caller must speak louder or longer. Reduces false positives from ambient noise.
Medium	Balanced setting. Works well for most environments.
High	Agent stops speaking quickly when voice is detected. For quiet environments.
Very High	Agent stops at the first sign of voice. Fast, quiet conversations.

Tip: If the agent gets interrupted by background noise, lower the barge-in sensitivity. If callers complain the agent talks over them, raise it.

Other agent settings

Setting	Description
Interruptible	Enables/disables barge-in. Disable for messages that must be heard in full (legal notices, disclaimers).
Max Call Duration	Maximum call length in seconds. Default: 600 (10 minutes). Reduce for simple cases (surveys, confirmations).
Silence Threshold	Voice activity detection (VAD) sensitivity for barge-in. Default: 0.5. High values (0.7-0.9) require more confidence; low values (0.2-0.4) are more sensitive.

Assigning providers to the agent

Each agent needs a provider for each stage of the pipeline:

In the agent configuration, find the Providers section.
Select an STT provider from your configured providers.
Select an LLM provider.
Select a TTS provider.

Tip: You can assign different providers to different agents. For example, your English sales agent could use Deepgram + GPT-4o + ElevenLabs, while your Spanish support agent uses Deepgram + Claude Sonnet + Fish Audio.

Assigning a phone number

For the agent to receive calls, it needs a phone number assigned:

Go to Phone Numbers in the dashboard.
Click on the number you want to assign.
Select the agent you just created from the dropdown.
Click Save. The change applies on the next incoming call.

You can assign the same agent to multiple phone numbers, useful when you have local numbers from different regions that should go to the same agent.

Step 4: Make a Test Call

With everything configured, it is time to verify the system works correctly:

Call the phone number you configured.
You should hear the welcome message and then be able to have a conversation with your AI agent.
After the call, check the Dashboard to review: duration, conversation transcript, latency metrics (STT, LLM, TTS), and cost breakdown.

Troubleshooting Common Issues

Problem	Solution
SIP registration fails (Red)	Verify credentials, SIP server, port, and transport. Make sure the firewall allows traffic on port 5060 and UDP 10000-10100.
No audio (silent call)	Verify RTP ports (UDP 10000-10100) are open. Check NAT configuration and codec (alaw or ulaw).
Agent does not respond	Verify the phone number is assigned to the agent and that STT, LLM, and TTS providers have valid API keys.
Echo or feedback	VOCALS applies adaptive jitter buffer automatically. If it persists, it may be on the SIP provider side. Reduce TTS volume if possible.
Authentication errors	The call will drop gracefully and the error will appear in the logs. Rotate the API key of the affected provider.

Conclusion

In this guide you have learned how to configure VOCALS end to end: from connecting your AI providers to creating and testing your first voice agent. With the right providers, a well-crafted system prompt, and a configured SIP trunk, your agent is ready to handle calls autonomously.

Full documentation at docs.usevocals.com. If you need to explore more integration options, visit our integrations page. To choose the plan that best fits your call volume, check our plans and pricing.

Back to blog