How PrepPilot Knows When the Interviewer Stops Talking

Technical · March 12, 2026 · 17 min read

One of the most common questions we receive about PrepPilot's stealth mode is deceptively simple: how does it know when the interviewer has finished asking their question? The answer involves a fascinating intersection of signal processing, machine learning, and real-time systems engineering. In this technical deep-dive, we explain the pipeline from raw audio to AI response trigger, covering Voice Activity Detection, endpointing thresholds, and Deepgram's UtteranceEnd event system.

The Problem: Detecting the End of Speech

Detecting when someone has finished talking sounds trivial but is actually one of the harder problems in conversational AI. Humans are remarkably good at it because we use a combination of linguistic context, prosody (pitch and rhythm), syntax, and social cues to predict when a speaker is wrapping up. Machines must replicate this capability using only the audio signal.

The core challenge is distinguishing between three types of silence:

  1. Intra-utterance pauses: brief breaks for breath or emphasis in the middle of a sentence.
  2. Inter-utterance pauses: gaps between sentences within the same speaker's turn.
  3. Turn-final silence: the speaker has genuinely finished and is waiting for a response.

If PrepPilot triggers the AI response too early (during an intra-utterance pause), it might miss the end of the question and generate an incomplete or irrelevant response. If it triggers too late, the candidate experiences an uncomfortable delay while waiting for the AI suggestion to appear.

Stage 1: Audio Capture and System Audio Isolation

The pipeline begins with audio capture. PrepPilot's desktop application (built on Tauri) captures system audio separately from microphone audio. This is a critical architectural decision. System audio contains what the interviewer is saying (since their voice comes through your speakers or headphones), while the microphone captures your own voice.

By processing only the system audio channel, PrepPilot avoids confusion between the interviewer's speech and the candidate's speech. This channel separation is one of the key advantages of the desktop application over browser extensions, which have more limited access to system-level audio routing.

Audio Format and Streaming

The captured audio is encoded as 16-bit PCM at 16kHz sample rate and streamed to the speech recognition service via WebSocket. The choice of 16kHz provides sufficient frequency resolution for speech recognition while keeping bandwidth requirements reasonable for real-time streaming. Audio chunks are sent in frames of approximately 100ms duration, creating a stream of small audio packets that can be processed incrementally.
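The numbers above imply a fixed frame size and bandwidth. A quick back-of-envelope check (these figures follow directly from the format described, not from PrepPilot's source code):

```python
# Sizing for a 16-bit mono PCM stream at 16 kHz, chunked into ~100 ms frames.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2          # 16-bit PCM
FRAME_MS = 100

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 1600 samples
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE  # 3200 bytes
frames_per_second = 1000 // FRAME_MS                    # 10 frames/s
bandwidth_kbps = bytes_per_frame * frames_per_second * 8 / 1000

print(samples_per_frame, bytes_per_frame, bandwidth_kbps)
```

At 256 kbps raw, the stream is small enough to send uncompressed over a WebSocket without noticeable overhead on typical connections.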

Stage 2: Voice Activity Detection (VAD)

Voice Activity Detection is the first signal processing stage. VAD determines which audio frames contain speech and which contain silence or background noise. Modern VAD systems use deep neural networks (typically small CNNs or RNNs) trained on diverse audio datasets to classify each frame as speech or non-speech.

How VAD Works Under the Hood

A VAD model takes a short audio frame (typically 10-30ms) and produces a probability score indicating the likelihood that the frame contains speech. The model considers features such as the distribution of energy across frequency bands, the presence of harmonic (pitched) structure, overall spectral shape, and how these properties evolve across neighboring frames.

Deepgram's VAD model operates in real-time with a latency of under 10ms per frame, meaning the system knows almost immediately when speech starts and stops.
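To illustrate how per-frame probabilities become clean speech/non-speech decisions, here is a minimal "hangover" smoothing scheme. This is a teaching sketch, not Deepgram's actual model, which handles smoothing internally:

```python
def vad_flags(frame_probs, threshold=0.5, hangover_frames=3):
    """Convert per-frame speech probabilities into boolean speech flags.

    A simple 'hangover' keeps the speech state active for a few frames
    after the probability drops, so momentary dips inside a word don't
    fragment the utterance. Illustrative values only.
    """
    flags = []
    hang = 0
    for p in frame_probs:
        if p >= threshold:
            hang = hangover_frames   # refresh the hangover window
            flags.append(True)
        elif hang > 0:
            hang -= 1                # ride out a brief dip
            flags.append(True)
        else:
            flags.append(False)
    return flags

probs = [0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1]
print(vad_flags(probs))
```

Note how the single low-probability frame in the middle is bridged, while the sustained run of low probabilities at the end eventually flips the state to non-speech.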

VAD Challenges in Interview Contexts

Interview audio presents specific challenges for VAD. Video call audio often includes compression artifacts, echo cancellation residuals, and notification sounds that can confuse simpler VAD models. Background noise from the interviewer's environment (typing, air conditioning, other people) must be filtered out. Deepgram's VAD handles these challenges through training on a wide variety of real-world audio conditions, including video call recordings.

Stage 3: Endpointing (Silence Detection)

Once VAD identifies that speech has stopped, the endpointing module begins counting the duration of the silence. This is where the critical threshold decisions happen. PrepPilot configures Deepgram with an endpointing threshold of 1200 milliseconds.

The 1200ms Endpointing Threshold

This means that after the last detected speech frame, the system waits 1200ms of continuous silence before considering the speech segment complete. If speech resumes within that window, the counter resets and the current utterance continues. Only if 1200ms of silence passes without any speech detection does the system finalize the current transcript segment.

Why 1200ms? This value was determined through testing across hundreds of interview recordings. The distribution of natural pauses within English sentences shows that most intra-sentence pauses are under 800ms, with a sharp drop-off after 1000ms. By setting the threshold at 1200ms, we capture the vast majority of sentence completions while rarely triggering during natural mid-sentence pauses. The remaining edge cases (very deliberate speakers who pause longer between clauses) are handled by the utterance end stage.
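The counter-reset behavior described above can be simulated directly. This sketch consumes VAD flags (not raw audio) and reports where a 1200ms endpointer would finalize a segment; frame duration and threshold are the values quoted in this post:

```python
def finalize_points(frame_flags, frame_ms=20, endpoint_ms=1200):
    """Return frame indices at which the endpointer finalizes a segment.

    The silence counter resets whenever speech resumes; a segment is
    finalized only once `endpoint_ms` of continuous silence accumulates.
    """
    silence_ms = 0
    in_segment = False
    points = []
    for i, is_speech in enumerate(frame_flags):
        if is_speech:
            in_segment = True
            silence_ms = 0           # speech resumed: reset the counter
        elif in_segment:
            silence_ms += frame_ms
            if silence_ms >= endpoint_ms:
                points.append(i)     # 1200 ms of silence: finalize
                in_segment = False
                silence_ms = 0
    return points
```

A 600ms mid-sentence pause never trips the threshold, while a sustained silence after the final word does, exactly once.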

How Endpointing Differs from Simple Silence Detection

Naive silence detection simply checks if the audio amplitude is below a threshold. This approach fails in real-world conditions because background noise rarely produces true silence. Even in a quiet room, there is always ambient noise. True endpointing uses the VAD output (not raw amplitude) as its input, making it robust against background noise. The question is not whether the audio is quiet, but whether the audio contains speech.

Stage 4: Deepgram's UtteranceEnd Event

The final and most important stage in the pipeline is the UtteranceEnd event. PrepPilot configures Deepgram with utterance_end_ms: 1500, which creates a second, independent layer of end-of-speech detection.
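For concreteness, here is roughly what that configuration looks like when opening a Deepgram live-transcription WebSocket. Parameter names follow Deepgram's streaming API; the two timing values are the ones quoted in this post, and the rest are illustrative choices matching the audio format described earlier:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.deepgram.com/v1/listen"

params = {
    "encoding": "linear16",     # 16-bit PCM
    "sample_rate": 16000,
    "channels": 1,
    "interim_results": "true",  # UtteranceEnd requires interim results
    "endpointing": 1200,        # finalize transcript after 1200 ms silence
    "utterance_end_ms": 1500,   # fire UtteranceEnd after 1500 ms
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

The connection is then fed the 100ms PCM frames described in Stage 1, and transcript and UtteranceEnd messages arrive back over the same socket.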

What UtteranceEnd Does

When Deepgram detects that 1500ms have passed since the last word was transcribed, it sends a special UtteranceEnd event through the WebSocket connection. This event is distinct from the regular transcript events and serves as a definitive signal that the speaker has finished their current utterance.

The relationship between endpointing (1200ms) and UtteranceEnd (1500ms) creates a two-stage system:

  1. At 1200ms of silence: The endpointing system finalizes the current transcript segment. The transcript for the current utterance is considered complete, but the system has not yet confirmed this is a true turn ending.
  2. At 1500ms of silence: The UtteranceEnd event fires. This confirms that the speaker has genuinely finished their turn. PrepPilot uses this event as the trigger to send the accumulated transcript to the AI model for response generation.

Why Two Stages?

The two-stage approach provides both quick transcript finalization and reliable turn detection. The 1200ms endpointing ensures that partial transcripts are finalized quickly (important for displaying real-time text to the user). The 1500ms UtteranceEnd provides the more conservative check that triggers the AI response. This separation means the user sees the transcribed text quickly but the AI response is only triggered when we are confident the speaker has finished.
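The two stages map naturally onto two message types on the WebSocket. The sketch below shows the shape of that handler: finalized transcript messages accumulate text, and the UtteranceEnd event flushes the buffer to the AI trigger. Message fields follow Deepgram's live API; the handler itself is a simplified illustration, not PrepPilot's production code:

```python
import json

def make_handler(on_ai_trigger):
    """Build a WebSocket message handler implementing the two-stage logic."""
    segments = []

    def handle(raw_message):
        msg = json.loads(raw_message)
        if msg.get("type") == "UtteranceEnd":
            # Stage 2 (1500 ms): the speaker's turn is over -- trigger the AI.
            if segments:
                on_ai_trigger(" ".join(segments))
                segments.clear()
        elif msg.get("is_final"):
            # Stage 1 (1200 ms): a transcript segment was finalized.
            text = msg["channel"]["alternatives"][0]["transcript"]
            if text:
                segments.append(text)

    return handle
```

Because segments accumulate until UtteranceEnd fires, a question finalized in two pieces ("Tell me about" / "a time you failed.") still reaches the AI as one coherent string.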

Stage 5: AI Response Trigger

When the UtteranceEnd event fires, the frontend triggers the AI response pipeline. Here is what happens in the approximately 300ms between the UtteranceEnd event and the first AI-generated tokens appearing on screen:

  1. Transcript assembly (5ms): The accumulated transcript segments from the current conversational turn are assembled into a single coherent text string.
  2. Context packaging (10ms): The transcript is packaged with conversation context, including any resume data, job description, and previous Q&A pairs from the same session.
  3. API call initiation (50ms): The packaged prompt is sent to the AI model (Claude or GPT-5.3) via streaming API call.
  4. First token generation (200-400ms): The AI model processes the prompt and begins streaming response tokens.
  5. Display (immediate): Response tokens are displayed on the stealth overlay as they arrive, providing a typing-like appearance.
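Steps 1 and 2 amount to joining the finalized segments and wrapping them with session context. A minimal sketch of that packaging step, with illustrative field names rather than PrepPilot's actual schema:

```python
def package_prompt(segments, context):
    """Assemble transcript segments and session context into one payload.

    `segments` are the finalized transcript pieces from the current turn;
    `context` holds session data. Field names here are hypothetical.
    """
    question = " ".join(s.strip() for s in segments if s.strip())
    return {
        "question": question,
        "resume": context.get("resume", ""),
        "job_description": context.get("job_description", ""),
        "history": context.get("qa_pairs", []),  # previous Q&A from this session
    }
```

The resulting payload is what gets sent to the streaming API call in step 3.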

The total latency from the end of the interviewer's speech to the first visible AI suggestion is approximately 1800-2200ms (1500ms UtteranceEnd delay plus 300-700ms processing). This means within about two seconds of the interviewer finishing their question, the candidate begins seeing AI-generated response suggestions.

Edge Cases and How We Handle Them

The Deliberate Pauser

Some interviewers pause deliberately within their questions for emphasis or to gather their thoughts. A 1500ms pause might occur mid-question. PrepPilot handles this by continuing to listen after the UtteranceEnd fires. If more speech is detected within a few seconds, the AI response is updated with the complete question. The streaming nature of the AI response means it can be regenerated quickly with the additional context.
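The "keep listening after firing" behavior reduces to a simple grace-window check: if new speech arrives soon enough after a trigger, it belongs to the same question and the response is regenerated. The 3-second window below is an illustrative value, not PrepPilot's actual constant:

```python
def should_regenerate(trigger_time_ms, new_speech_time_ms, grace_ms=3000):
    """Decide whether speech detected after an UtteranceEnd trigger
    should fold into the same question and regenerate the AI response.

    Returns True when the new speech falls within the grace window.
    """
    return (new_speech_time_ms - trigger_time_ms) <= grace_ms
```

Speech resuming 2.5 seconds after the trigger updates the existing response; speech after the window starts a new conversational turn instead.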

Crosstalk and Echo

When the candidate starts speaking before the interviewer finishes (crosstalk), or when audio echo creates the appearance of continued speech, the system may delay the UtteranceEnd event. This is actually desirable behavior because it prevents the AI from triggering during conversational overlap. The system audio channel separation helps minimize echo issues.

Network Jitter and Packet Loss

Real-time audio streaming over WebSocket is susceptible to network issues. If packets are delayed or lost, the VAD may produce inaccurate results. PrepPilot includes a client-side audio buffer that smooths out brief network interruptions (up to 500ms). For longer interruptions, the system gracefully degrades by working with partial transcripts rather than failing entirely.
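A buffer with that behavior can be sketched as follows: frames queue on arrival, gaps up to 500ms are bridged with silence frames, and longer outages surface as missing data the rest of the pipeline must tolerate. This is an illustration of the concept, not PrepPilot's implementation:

```python
from collections import deque

class JitterBuffer:
    """Smooth brief network gaps in a 16-bit 16 kHz PCM frame stream."""

    def __init__(self, frame_ms=100, max_gap_ms=500):
        self.frames = deque()
        self.max_missing = max_gap_ms // frame_ms   # bridgeable frame count
        self.silence = b"\x00" * 3200               # one 100 ms silent frame
        self.missing = 0

    def push(self, frame):
        self.frames.append(frame)
        self.missing = 0                            # stream recovered

    def pop(self):
        if self.frames:
            return self.frames.popleft()
        if self.missing < self.max_missing:
            self.missing += 1
            return self.silence                     # bridge a short gap
        return None                                 # long outage: degrade
```

Returning `None` after the gap budget is spent is what lets downstream code fall back to working with partial transcripts instead of stalling.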

Background Noise Spikes

Sudden noises in the interviewer's environment (a door closing, a phone buzzing, typing) can be mistakenly classified as speech by the VAD, preventing the UtteranceEnd from firing. Deepgram's model is robust against most common office noises, but particularly loud or speech-like sounds can cause brief delays. In practice, these delays add 200-500ms and rarely affect the user experience.

Performance Metrics

Based on our internal testing across 500+ simulated interview sessions, the neural VAD and two-stage detection system delivers substantially more reliable turn detection than simple timer-based approaches, which typically achieve only 85-90% true positive rates along with higher false positive rates.

Comparison with Other Approaches

Browser-Based Silence Detection

Browser extensions that attempt similar functionality are limited to the Web Audio API, which provides raw audio data but no built-in VAD. These tools must implement their own silence detection, typically using simple amplitude thresholds. This approach is far less accurate than neural VAD, especially in noisy environments or with compressed video call audio.

Server-Side Full Processing

Some competing tools process all audio server-side, introducing additional network latency. By using Deepgram's real-time streaming API with client-side audio capture, PrepPilot minimizes the total latency pipeline.

The architecture decisions behind PrepPilot's question detection system reflect a deep understanding of the requirements for real-time conversational AI. Every parameter, from the 1200ms endpointing threshold to the 1500ms utterance end trigger, was tuned through extensive testing to provide the most natural and responsive experience possible during phone screens and video interviews alike.

Try Stealth Mode Free

50 free credits. No credit card required. Works on Windows and macOS.

Download PrepPilot