How PrepPilot Knows When the Interviewer Stops Talking

Technical · March 12, 2026 · 17 min read

One of the most common questions we receive about PrepPilot's stealth mode is deceptively simple: how does it know when the interviewer has finished asking their question? The answer involves a fascinating intersection of signal processing, machine learning, and real-time systems engineering. In this technical deep-dive, we explain the pipeline from raw audio to AI response trigger, covering Voice Activity Detection, endpointing thresholds, and Deepgram's UtteranceEnd event system.

The Problem: Detecting the End of Speech

Detecting when someone has finished talking sounds trivial but is actually one of the harder problems in conversational AI. Humans are remarkably good at it because we use a combination of linguistic context, prosody (pitch and rhythm), syntax, and social cues to predict when a speaker is wrapping up. Machines must replicate this capability using only the audio signal.

The core challenge is distinguishing between three types of silence:

  1. Intra-utterance pauses: brief breaks for breath or emphasis in the middle of a sentence.
  2. Inter-utterance pauses: gaps between sentences within the same speaker's turn.
  3. Turn-final silence: the speaker has genuinely finished and is waiting for a response.

If PrepPilot triggers the AI response too early (during an intra-utterance pause), it might miss the end of the question and generate an incomplete or irrelevant response. If it triggers too late, the candidate experiences an uncomfortable delay while waiting for the AI suggestion to appear.

Stage 1: Audio Capture and System Audio Isolation

The pipeline begins with audio capture. PrepPilot's desktop application (built on Tauri) captures system audio separately from microphone audio. This is a critical architectural decision. System audio contains what the interviewer is saying (since their voice comes through your speakers or headphones), while the microphone captures your own voice.

By processing only the system audio channel, PrepPilot avoids confusion between the interviewer's speech and the candidate's speech. This channel separation is one of the key advantages of the desktop application over browser extensions, which have more limited access to system-level audio routing.

Audio Format and Streaming

The captured audio is encoded as 16-bit PCM at 16kHz sample rate and streamed to the speech recognition service via WebSocket. The choice of 16kHz provides sufficient frequency resolution for speech recognition while keeping bandwidth requirements reasonable for real-time streaming. Audio chunks are sent in frames of approximately 100ms duration, creating a stream of small audio packets that can be processed incrementally.
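The numbers above imply a fixed frame size and bandwidth. A quick back-of-envelope check (these figures follow directly from the format described, not from PrepPilot's source code):

```python
# Sizing for a 16-bit mono PCM stream at 16 kHz, chunked into ~100 ms frames.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2          # 16-bit PCM
FRAME_MS = 100

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 1600 samples
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE  # 3200 bytes
frames_per_second = 1000 // FRAME_MS                    # 10 frames/s
bandwidth_kbps = bytes_per_frame * frames_per_second * 8 / 1000

print(samples_per_frame, bytes_per_frame, bandwidth_kbps)
```

At 256 kbps raw, the stream is small enough to send uncompressed over a WebSocket without noticeable overhead on typical connections.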

Stage 2: Voice Activity Detection (VAD)

Voice Activity Detection is the first signal processing stage. VAD determines which audio frames contain speech and which contain silence or background noise. Modern VAD systems use deep neural networks (typically small CNNs or RNNs) trained on diverse audio datasets to classify each frame as speech or non-speech.

How VAD Works Under the Hood

A VAD model takes a short audio frame (typically 10-30ms) and produces a probability score indicating the likelihood that the frame contains speech. The model considers features such as the distribution of energy across frequency bands, the presence of harmonic (pitched) structure, overall spectral shape, and how these properties evolve across neighboring frames.

Deepgram's VAD model operates in real-time with a latency of under 10ms per frame, meaning the system knows almost immediately when speech starts and stops.
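To illustrate how per-frame probabilities become clean speech/non-speech decisions, here is a minimal "hangover" smoothing scheme. This is a teaching sketch, not Deepgram's actual model, which handles smoothing internally:

```python
def vad_flags(frame_probs, threshold=0.5, hangover_frames=3):
    """Convert per-frame speech probabilities into boolean speech flags.

    A simple 'hangover' keeps the speech state active for a few frames
    after the probability drops, so momentary dips inside a word don't
    fragment the utterance. Illustrative values only.
    """
    flags = []
    hang = 0
    for p in frame_probs:
        if p >= threshold:
            hang = hangover_frames   # refresh the hangover window
            flags.append(True)
        elif hang > 0:
            hang -= 1                # ride out a brief dip
            flags.append(True)
        else:
            flags.append(False)
    return flags

probs = [0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1]
print(vad_flags(probs))
```

Note how the single low-probability frame in the middle is bridged, while the sustained run of low probabilities at the end eventually flips the state to non-speech.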

VAD Challenges in Interview Contexts

Interview audio presents specific challenges for VAD. Video call audio often includes compression artifacts, echo cancellation residuals, and notification sounds that can confuse simpler VAD models. Background noise from the interviewer's environment (typing, air conditioning, other people) must be filtered out. Deepgram's VAD handles these challenges through training on a wide variety of real-world audio conditions, including video call recordings.

Stage 3: Endpointing (Silence Detection)

Once VAD identifies that speech has stopped, the endpointing module begins counting the duration of the silence. This is where the critical threshold decisions happen. PrepPilot configures Deepgram with an endpointing threshold of 1200 milliseconds.

The 1200ms Endpointing Threshold

This means that after the last detected speech frame, the system waits 1200ms of continuous silence before considering the speech segment complete. If speech resumes within that window, the counter resets and the current utterance continues. Only if 1200ms of silence passes without any speech detection does the system finalize the current transcript segment.

Why 1200ms? This value was determined through testing across hundreds of interview recordings. The distribution of natural pauses within English sentences shows that most intra-sentence pauses are under 800ms, with a sharp drop-off after 1000ms. By setting the threshold at 1200ms, we capture the vast majority of sentence completions while rarely triggering during natural mid-sentence pauses. The remaining edge cases (very deliberate speakers who pause longer between clauses) are handled by the utterance end stage.
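The counter-reset behavior described above can be simulated directly. This sketch consumes VAD flags (not raw audio) and reports where a 1200ms endpointer would finalize a segment; frame duration and threshold are the values quoted in this post:

```python
def finalize_points(frame_flags, frame_ms=20, endpoint_ms=1200):
    """Return frame indices at which the endpointer finalizes a segment.

    The silence counter resets whenever speech resumes; a segment is
    finalized only once `endpoint_ms` of continuous silence accumulates.
    """
    silence_ms = 0
    in_segment = False
    points = []
    for i, is_speech in enumerate(frame_flags):
        if is_speech:
            in_segment = True
            silence_ms = 0           # speech resumed: reset the counter
        elif in_segment:
            silence_ms += frame_ms
            if silence_ms >= endpoint_ms:
                points.append(i)     # 1200 ms of silence: finalize
                in_segment = False
                silence_ms = 0
    return points
```

A 600ms mid-sentence pause never trips the threshold, while a sustained silence after the final word does, exactly once.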

How Endpointing Differs from Simple Silence Detection

Naive silence detection simply checks if the audio amplitude is below a threshold. This approach fails in real-world conditions because background noise rarely produces true silence. Even in a quiet room, there is always ambient noise. True endpointing uses the VAD output (not raw amplitude) as its input, making it robust against background noise. The question is not whether the audio is quiet, but whether the audio contains speech.

Stage 4: Deepgram's UtteranceEnd Event

The final and most important stage in the pipeline is the UtteranceEnd event. PrepPilot configures Deepgram with utterance_end_ms: 1500, which creates a second, independent layer of end-of-speech detection.
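For concreteness, here is roughly what that configuration looks like when opening a Deepgram live-transcription WebSocket. Parameter names follow Deepgram's streaming API; the two timing values are the ones quoted in this post, and the rest are illustrative choices matching the audio format described earlier:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.deepgram.com/v1/listen"

params = {
    "encoding": "linear16",     # 16-bit PCM
    "sample_rate": 16000,
    "channels": 1,
    "interim_results": "true",  # UtteranceEnd requires interim results
    "endpointing": 1200,        # finalize transcript after 1200 ms silence
    "utterance_end_ms": 1500,   # fire UtteranceEnd after 1500 ms
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

The connection is then fed the 100ms PCM frames described in Stage 1, and transcript and UtteranceEnd messages arrive back over the same socket.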

What UtteranceEnd Does

When Deepgram detects that 1500ms have passed since the last word was transcribed, it sends a special UtteranceEnd event through the WebSocket connection. This event is distinct from the regular transcript events and serves as a definitive signal that the speaker has finished their current utterance.

The relationship between endpointing (1200ms) and UtteranceEnd (1500ms) creates a two-stage system:

  1. At 1200ms of silence: The endpointing system finalizes the current transcript segment. The transcript for the current utterance is considered complete, but the system has not yet confirmed this is a true turn ending.
  2. At 1500ms of silence: The UtteranceEnd event fires. This confirms that the speaker has genuinely finished their turn. PrepPilot uses this event as the trigger to send the accumulated transcript to the AI model for response generation.

Why Two Stages?

The two-stage approach provides both quick transcript finalization and reliable turn detection. The 1200ms endpointing ensures that partial transcripts are finalized quickly (important for displaying real-time text to the user). The 1500ms UtteranceEnd provides the more conservative check that triggers the AI response. This separation means the user sees the transcribed text quickly but the AI response is only triggered when we are confident the speaker has finished.
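The two stages map naturally onto two message types on the WebSocket. The sketch below shows the shape of that handler: finalized transcript messages accumulate text, and the UtteranceEnd event flushes the buffer to the AI trigger. Message fields follow Deepgram's live API; the handler itself is a simplified illustration, not PrepPilot's production code:

```python
import json

def make_handler(on_ai_trigger):
    """Build a WebSocket message handler implementing the two-stage logic."""
    segments = []

    def handle(raw_message):
        msg = json.loads(raw_message)
        if msg.get("type") == "UtteranceEnd":
            # Stage 2 (1500 ms): the speaker's turn is over -- trigger the AI.
            if segments:
                on_ai_trigger(" ".join(segments))
                segments.clear()
        elif msg.get("is_final"):
            # Stage 1 (1200 ms): a transcript segment was finalized.
            text = msg["channel"]["alternatives"][0]["transcript"]
            if text:
                segments.append(text)

    return handle
```

Because segments accumulate until UtteranceEnd fires, a question finalized in two pieces ("Tell me about" / "a time you failed.") still reaches the AI as one coherent string.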

Stage 5: AI Response Trigger

When the UtteranceEnd event fires, the frontend triggers the AI response pipeline. Here is what happens in the approximately 300ms between the UtteranceEnd event and the first AI-generated tokens appearing on screen:

  1. Transcript assembly (5ms): The accumulated transcript segments from the current conversational turn are assembled into a single coherent text string.
  2. Context packaging (10ms): The transcript is packaged with conversation context, including any resume data, job description, and previous Q&A pairs from the same session.
  3. API call initiation (50ms): The packaged prompt is sent to the AI model (Claude or GPT-5.3) via streaming API call.
  4. First token generation (200-400ms): The AI model processes the prompt and begins streaming response tokens.
  5. Display (immediate): Response tokens are displayed on the stealth overlay as they arrive, providing a typing-like appearance.
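Steps 1 and 2 amount to joining the finalized segments and wrapping them with session context. A minimal sketch of that packaging step, with illustrative field names rather than PrepPilot's actual schema:

```python
def package_prompt(segments, context):
    """Assemble transcript segments and session context into one payload.

    `segments` are the finalized transcript pieces from the current turn;
    `context` holds session data. Field names here are hypothetical.
    """
    question = " ".join(s.strip() for s in segments if s.strip())
    return {
        "question": question,
        "resume": context.get("resume", ""),
        "job_description": context.get("job_description", ""),
        "history": context.get("qa_pairs", []),  # previous Q&A from this session
    }
```

The resulting payload is what gets sent to the streaming API call in step 3.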

The total latency from the end of the interviewer's speech to the first visible AI suggestion is approximately 1800-2200ms (1500ms UtteranceEnd delay plus 300-700ms processing). This means within about two seconds of the interviewer finishing their question, the candidate begins seeing AI-generated response suggestions.

Edge Cases and How We Handle Them

The Deliberate Pauser

Some interviewers pause deliberately within their questions for emphasis or to gather their thoughts. A 1500ms pause might occur mid-question. PrepPilot handles this by continuing to listen after the UtteranceEnd fires. If more speech is detected within a few seconds, the AI response is updated with the complete question. The streaming nature of the AI response means it can be regenerated quickly with the additional context.
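The "keep listening after firing" behavior reduces to a simple grace-window check: if new speech arrives soon enough after a trigger, it belongs to the same question and the response is regenerated. The 3-second window below is an illustrative value, not PrepPilot's actual constant:

```python
def should_regenerate(trigger_time_ms, new_speech_time_ms, grace_ms=3000):
    """Decide whether speech detected after an UtteranceEnd trigger
    should fold into the same question and regenerate the AI response.

    Returns True when the new speech falls within the grace window.
    """
    return (new_speech_time_ms - trigger_time_ms) <= grace_ms
```

Speech resuming 2.5 seconds after the trigger updates the existing response; speech after the window starts a new conversational turn instead.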

Crosstalk and Echo

When the candidate starts speaking before the interviewer finishes (crosstalk), or when audio echo creates the appearance of continued speech, the system may delay the UtteranceEnd event. This is actually desirable behavior because it prevents the AI from triggering during conversational overlap. The system audio channel separation helps minimize echo issues.

Network Jitter and Packet Loss

Real-time audio streaming over WebSocket is susceptible to network issues. If packets are delayed or lost, the VAD may produce inaccurate results. PrepPilot includes a client-side audio buffer that smooths out brief network interruptions (up to 500ms). For longer interruptions, the system gracefully degrades by working with partial transcripts rather than failing entirely.
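A buffer with that behavior can be sketched as follows: frames queue on arrival, gaps up to 500ms are bridged with silence frames, and longer outages surface as missing data the rest of the pipeline must tolerate. This is an illustration of the concept, not PrepPilot's implementation:

```python
from collections import deque

class JitterBuffer:
    """Smooth brief network gaps in a 16-bit 16 kHz PCM frame stream."""

    def __init__(self, frame_ms=100, max_gap_ms=500):
        self.frames = deque()
        self.max_missing = max_gap_ms // frame_ms   # bridgeable frame count
        self.silence = b"\x00" * 3200               # one 100 ms silent frame
        self.missing = 0

    def push(self, frame):
        self.frames.append(frame)
        self.missing = 0                            # stream recovered

    def pop(self):
        if self.frames:
            return self.frames.popleft()
        if self.missing < self.max_missing:
            self.missing += 1
            return self.silence                     # bridge a short gap
        return None                                 # long outage: degrade
```

Returning `None` after the gap budget is spent is what lets downstream code fall back to working with partial transcripts instead of stalling.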

Background Noise Spikes

Sudden noises in the interviewer's environment (a door closing, a phone buzzing, typing) can be mistakenly classified as speech by the VAD, preventing the UtteranceEnd from firing. Deepgram's model is robust against most common office noises, but particularly loud or speech-like sounds can cause brief delays. In practice, these delays add 200-500ms and rarely affect the user experience.

Performance Metrics

Based on our internal testing across 500+ simulated interview sessions, the neural VAD and two-stage detection system delivers substantially more reliable turn detection than simple timer-based approaches, which typically achieve only 85-90% true positive rates along with higher false positive rates.

Comparison with Other Approaches

Browser-Based Silence Detection

Browser extensions that attempt similar functionality are limited to the Web Audio API, which provides raw audio data but no built-in VAD. These tools must implement their own silence detection, typically using simple amplitude thresholds. This approach is far less accurate than neural VAD, especially in noisy environments or with compressed video call audio.

Server-Side Full Processing

Some competing tools process all audio server-side, introducing additional network latency. By using Deepgram's real-time streaming API with client-side audio capture, PrepPilot minimizes the total latency pipeline.

The architecture decisions behind PrepPilot's question detection system reflect a deep understanding of the requirements for real-time conversational AI. Every parameter, from the 1200ms endpointing threshold to the 1500ms utterance end trigger, was tuned through extensive testing to provide the most natural and responsive experience possible during phone screens and video interviews alike.

Try Stealth Mode Free

50 free credits. No credit card required. Works on Windows and macOS.

Download PrepPilot