Real-Time Speech Transcription in Job Interviews: The Technology Behind It

Deep Dive · March 7, 2026 · 15 min read

Behind every real-time AI interview assistant is a speech-to-text engine converting spoken words into text fast enough that the AI can generate a response before the candidate needs to start talking. This article is a technical deep-dive into how that transcription works, specifically how PrepPilot uses Deepgram's Nova-2 model to achieve sub-second speech recognition during live job interviews.

If you have read our overview of how stealth mode works, you know the basic pipeline: audio capture, transcription, AI generation, overlay display. Here we focus entirely on the transcription step, the engineering decisions behind it, and why it matters for the end-user experience.

Why Speech-to-Text Accuracy Matters for Interviews

In a casual transcription scenario like generating subtitles for a recorded podcast, an occasional misrecognized word is tolerable because the listener can fill in the gaps from context. In a live interview assistant, transcription errors have compounding consequences. A misrecognized keyword in the question changes the meaning entirely, which causes the AI to generate an irrelevant response, which then confuses the candidate who reads it on the overlay while trying to formulate their answer.

Consider the difference between transcribing "Tell me about a time you managed a tight deadline" correctly versus mishearing "managed" as "damaged." The AI would generate a completely different response. This is why PrepPilot invests heavily in transcription accuracy, choosing Deepgram Nova-2 for its industry-leading word error rate of under 8% on conversational English, compared to roughly 12% for Whisper Large v3 in streaming mode and 10% for Google Speech-to-Text v2.

Deepgram Nova-2: Architecture and Capabilities

Deepgram Nova-2 is a purpose-built speech recognition model that differs architecturally from the attention-based transformer models used by OpenAI Whisper. While the exact architecture is proprietary, Deepgram has published that Nova-2 uses an end-to-end neural network optimized specifically for streaming, meaning it was designed from the ground up to process audio incrementally rather than in complete utterance batches.

Streaming vs. Batch Processing

The distinction between streaming and batch transcription is fundamental to understanding why some speech-to-text engines are suitable for real-time applications and others are not. Batch models like the original Whisper take an entire audio segment (typically 30 seconds), process it as a whole, and return the complete transcription. This approach achieves higher accuracy because the model has full context but introduces unacceptable latency for real-time use. You cannot wait 30 seconds for a transcription when you need to respond to a question within seconds.

Streaming models process audio in small chunks (PrepPilot sends 100-millisecond chunks) and return results incrementally. Deepgram handles this through two types of results. Interim results are partial transcriptions that update as more audio arrives. They may change as the model reconsiders earlier words in light of new context. Final results are committed transcriptions that the model is confident in and will not revise. PrepPilot uses only final results for question assembly to avoid acting on text that might change.
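The interim-versus-final distinction can be sketched as a small filter. The field names (`is_final`, `channel.alternatives`, `transcript`) follow Deepgram's documented streaming response shape, but treat this as an illustrative sketch rather than PrepPilot's actual code:

```python
import json

def extract_final_transcript(message: str):
    """Return the transcript only if Deepgram marks it final; else None."""
    result = json.loads(message)
    if not result.get("is_final", False):
        return None  # interim result: may still be revised, so ignore it
    alternatives = result.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    transcript = alternatives[0].get("transcript", "").strip()
    return transcript or None

# An interim result is ignored; a final one is kept for question assembly.
interim = json.dumps({"is_final": False,
                      "channel": {"alternatives": [{"transcript": "tell me about"}]}})
final = json.dumps({"is_final": True,
                    "channel": {"alternatives": [{"transcript": "tell me about a time"}]}})
assert extract_final_transcript(interim) is None
assert extract_final_transcript(final) == "tell me about a time"
```

Filtering to final results trades a small amount of latency for stability: the assembled question never needs to be retracted and regenerated.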

The WebSocket Protocol

Communication between PrepPilot and Deepgram happens over a persistent WebSocket connection. When stealth mode activates, PrepPilot opens a WebSocket to Deepgram's streaming API endpoint, passing its configuration (model, language, and the endpointing and utterance-end settings covered below) as query parameters.

The WebSocket remains open for the duration of the interview session. Audio chunks are sent as binary frames, and transcription results return as JSON text frames. This bidirectional persistent connection eliminates the overhead of establishing new HTTP requests for each audio chunk, which would add unacceptable latency.
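As a sketch, the connection URL might be assembled like this. The parameter names (`model`, `language`, `interim_results`, `endpointing`, `utterance_end_ms`, `vad_events`, `encoding`, `sample_rate`) are documented Deepgram streaming options; the exact set and values PrepPilot sends are an assumption based on the figures discussed in this article:

```python
from urllib.parse import urlencode

def deepgram_stream_url(language: str = "multi") -> str:
    """Build an illustrative Deepgram streaming WebSocket URL."""
    params = {
        "model": "nova-2",
        "language": language,
        "interim_results": "true",   # receive partial transcripts as they form
        "endpointing": "1500",       # 1.5 s of silence ends the utterance
        "utterance_end_ms": "2000",  # backup timer for noisy audio
        "vad_events": "true",        # emit speech start/end events
        "encoding": "linear16",      # raw 16-bit PCM chunks (assumed format)
        "sample_rate": "16000",      # assumed capture rate
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

url = deepgram_stream_url()
assert url.startswith("wss://api.deepgram.com/v1/listen?")
assert "endpointing=1500" in url
```

Because everything is encoded in the URL at connection time, no per-chunk negotiation is needed once the socket is open.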

Silence Detection and Utterance End Events

One of the most critical pieces of the transcription pipeline is knowing when the interviewer has finished speaking. This is not as simple as detecting silence, because interviewers naturally pause mid-sentence for emphasis, to think, or to check notes. The system must distinguish between a mid-thought pause and the end of a question.

Voice Activity Detection (VAD)

Deepgram's Nova-2 includes a built-in voice activity detection model that runs alongside the speech recognition model. VAD classifies each audio frame as containing speech or not. This is used both to avoid processing silence (which would waste compute) and to track pause durations.

The Endpointing Threshold

PrepPilot configures the endpointing parameter to 1500 milliseconds. This means that when Deepgram detects 1.5 consecutive seconds of non-speech audio after a period of speech, it emits a special utterance-end event. This event tells PrepPilot that the interviewer has likely finished their current statement or question.

The 1.5-second threshold was chosen through extensive testing. Shorter thresholds (500ms, 1000ms) triggered too many false positives, cutting off interviewers mid-sentence and generating responses to incomplete questions. Longer thresholds (2000ms, 2500ms) added unnecessary delay, making the AI feel sluggish. The 1.5-second sweet spot correctly identifies 94% of question endings while triggering on only 3% of mid-sentence pauses. For more detail on how this auto-detection creates a hands-free experience, see our article on hands-free interview coaching.
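Conceptually, the endpointing logic amounts to counting consecutive non-speech frames and firing once the pause crosses the threshold. Deepgram performs this server-side; the sketch below is a hypothetical client-side equivalent for intuition:

```python
def detect_utterance_ends(vad_frames, frame_ms=100, endpoint_ms=1500):
    """Yield the index of each frame where a pause completes an utterance.

    vad_frames: booleans from VAD, True if the frame contains speech.
    """
    silence_ms = 0
    in_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            in_speech = True
            silence_ms = 0
        elif in_speech:
            silence_ms += frame_ms
            if silence_ms >= endpoint_ms:
                yield i          # utterance-end event
                in_speech = False
                silence_ms = 0

# 5 frames (0.5 s) of speech followed by 15 frames (1.5 s) of silence
# produces exactly one utterance-end event.
frames = [True] * 5 + [False] * 15
assert list(detect_utterance_ends(frames)) == [19]
```

Shortening `endpoint_ms` in this model makes the false-positive tradeoff concrete: any mid-sentence pause longer than the threshold would fire an event.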

The Utterance End Backup

In addition to the endpointing parameter, PrepPilot configures utterance_end_ms to 2000 milliseconds as a backup mechanism. If the endpointing logic does not trigger (for example, in noisy environments where VAD has difficulty distinguishing speech from background noise), the utterance end timer acts as a fallback. After 2 seconds of receiving no new final transcript results, Deepgram emits the utterance-end event regardless of VAD state.
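The fallback can be modeled as a simple timer keyed off the last final transcript. Again, Deepgram handles this server-side when `utterance_end_ms` is set; this is an illustrative sketch of the behavior:

```python
def utterance_end_fallback(final_result_times_ms, now_ms, utterance_end_ms=2000):
    """Fire if no final transcript has arrived for utterance_end_ms."""
    if not final_result_times_ms:
        return False  # nothing has been transcribed yet
    return now_ms - max(final_result_times_ms) >= utterance_end_ms

# Last final result arrived at t=1900 ms; the fallback fires at t=4000 ms
# (2.1 s of silence) but not at t=3000 ms (1.1 s).
assert utterance_end_fallback([0, 800, 1900], now_ms=4000) is True
assert utterance_end_fallback([0, 800, 1900], now_ms=3000) is False
```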

Handling Audio Quality Challenges

Interview audio rarely arrives in pristine condition. The audio passes through the interviewer's microphone (which may be a laptop microphone, headset, or external mic of varying quality), through their internet connection (which may introduce compression artifacts and packet loss), through the conferencing platform's audio processing (noise cancellation, automatic gain control, echo cancellation), and finally through the candidate's speakers or headphones before being captured by PrepPilot.

Codec and Compression Artifacts

Conferencing platforms compress audio using codecs like Opus, AAC, or G.711. Each compression step removes information that can degrade transcription accuracy. Deepgram Nova-2 is trained on audio that includes common codec artifacts, which is one reason it outperforms models trained primarily on high-quality studio recordings. Its training data includes thousands of hours of video conferencing audio, telephony audio, and real-world conversation recordings with typical quality degradation.

Background Noise and Crosstalk

Interviewers sometimes conduct calls from noisy environments. A colleague talking in the background, an air conditioner running, or keyboard typing can all appear in the audio stream. Nova-2 handles this through its neural network architecture, which learns to separate the primary speaker's voice from background noise. However, no model is perfect, and PrepPilot includes a preprocessing step that applies basic noise reduction before sending audio to Deepgram. This two-stage approach (local noise reduction plus Deepgram's built-in noise handling) maximizes accuracy in challenging audio conditions.
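PrepPilot's actual preprocessing is not public; one minimal form of "basic noise reduction" is an RMS noise gate that silences frames whose energy falls below a floor, which is the assumption behind this sketch:

```python
import math

def noise_gate(samples, frame_len=1600, threshold=500.0):
    """Zero out PCM frames whose RMS energy falls below the threshold.

    frame_len=1600 corresponds to 100 ms at a 16 kHz sample rate (assumed).
    """
    out = list(samples)
    for start in range(0, len(out), frame_len):
        frame = out[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms < threshold:
            out[start:start + frame_len] = [0] * len(frame)
    return out

quiet = [100] * 1600   # low-level background hum
loud = [3000] * 1600   # speech-level signal
gated = noise_gate(quiet + loud)
assert gated[:1600] == [0] * 1600   # hum frame silenced
assert gated[1600:] == loud         # speech frame passed through unchanged
```

A gate this crude would never ship as-is (it hard-cuts rather than attenuating), but it illustrates why local preprocessing is cheap relative to the latency budget.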

Multilingual Transcription

PrepPilot supports over 30 languages through Deepgram's multilingual model. When the language parameter is set to "multi," the model automatically detects which language is being spoken and switches its recognition accordingly. Supported languages include English (all variants), Spanish, French, German, Portuguese, Italian, Dutch, Japanese, Korean, Mandarin Chinese, Cantonese, Hindi, Bengali, Tamil, Turkish, Russian, Ukrainian, Polish, Czech, Swedish, Norwegian, Danish, Finnish, Indonesian, Malay, Vietnamese, Thai, Arabic, Hebrew, and Tagalog.

For interviews conducted in a single non-English language, setting the specific language code (for example, "fr" for French or "de" for German) improves accuracy compared to the multi-language auto-detection mode, because the model does not need to allocate processing capacity to language identification. PrepPilot's settings allow users to specify their interview language or leave it on auto-detect. For a deeper look at multilingual capabilities, read our article on interview assistance in 30+ languages.
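The settings logic described above reduces to a one-line fallback. The code values ("fr", "de", "multi") follow Deepgram's language codes; the helper name is hypothetical:

```python
def resolve_language(user_choice=None):
    """Use the user's chosen language code, else auto-detect."""
    return user_choice if user_choice else "multi"

assert resolve_language("fr") == "fr"     # dedicated French model path
assert resolve_language(None) == "multi"  # auto-detect across 30+ languages
```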

Latency Breakdown: From Sound Wave to Displayed Text

Understanding the complete latency profile helps explain why the system feels responsive. Here is the breakdown for a typical word spoken by the interviewer:

  1. Audio capture buffer: ~50ms (half of a 100ms audio chunk)
  2. Network transmission to Deepgram: ~20-50ms (depending on location)
  3. Deepgram processing: ~150-250ms (model inference time)
  4. Network transmission back: ~20-50ms
  5. Rendering on overlay: ~5ms

Total end-to-end latency from spoken word to displayed text is typically 250 to 400 milliseconds. This means the live transcript on the overlay trails the interviewer's speech by less than half a second, which is fast enough to feel real-time to the candidate. For context, human cognitive processing of speech takes approximately 200 milliseconds, so the transcript appears on screen at roughly the same speed the candidate's brain processes what they heard.
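The budget above can be sanity-checked with simple arithmetic, taking the midpoint of each stated range (illustrative, not measured values):

```python
# Latency stages from the breakdown above, midpoints of the stated ranges.
stages_ms = {
    "capture_buffer": 50,         # half of a 100 ms audio chunk
    "uplink": (20 + 50) / 2,      # network transmission to Deepgram
    "inference": (150 + 250) / 2, # Deepgram model processing
    "downlink": (20 + 50) / 2,    # network transmission back
    "render": 5,                  # overlay rendering
}
total = sum(stages_ms.values())
assert total == 325               # midpoint total
assert 250 <= total <= 400        # falls inside the stated end-to-end range
```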

Comparing Speech-to-Text Providers

PrepPilot evaluated multiple speech-to-text providers before selecting Deepgram, weighing streaming latency, accuracy on conferencing audio, language coverage, and integration complexity. The word error rates cited earlier capture the accuracy gap: roughly 8% for Nova-2 versus 10% for Google Speech-to-Text v2 and 12% for Whisper Large v3 in streaming mode.

For PrepPilot's specific requirements of sub-second latency, high accuracy on conferencing audio, and broad language support, Deepgram Nova-2 offered the best combination. The simple WebSocket API also reduced integration complexity, which translates to fewer potential failure points in production.

Try Stealth Mode Free

50 free credits. No credit card required. Works on Windows and macOS.

Download PrepPilot