How speech recognition actually works
A plain-language explanation of how modern ASR systems convert audio into text — from sound waves to speaker-labeled transcripts.
Speech recognition has improved dramatically in recent years, but the underlying process remains mysterious to most people. Here's how it actually works, without the jargon.
From sound to signal
When you speak, you produce sound waves — pressure variations in the air. A microphone converts these into an electrical signal, which is then digitized into a stream of numbers (samples) at a fixed rate, typically 16,000 times per second.
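To make "16,000 samples per second" concrete, here is a minimal Python sketch that digitizes a pure synthetic tone instead of a real microphone signal (the function name and rate constant are illustrative, not from any particular library):

```python
import math

SAMPLE_RATE = 16_000  # samples per second, a common rate for speech audio

def digitize_tone(freq_hz: float, duration_s: float) -> list[float]:
    """Simulate digitization: one amplitude reading per sample instant."""
    n_samples = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]

samples = digitize_tone(440.0, 0.5)  # half a second of a 440 Hz tone
print(len(samples))  # 8000 numbers: 16,000 per second x 0.5 s
```

A real recording is the same thing at scale: a one-hour call at this rate is about 57.6 million numbers, which is why the next steps compress it.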
This raw waveform is too noisy and unstructured for direct analysis. The first processing step converts it into a spectrogram — a visual representation of frequencies over time. Think of it as a heat map showing which sound frequencies are active at each moment.
Feature extraction
The spectrogram is further processed into mel-frequency features — a compressed representation that mirrors how the human ear perceives sound. Low frequencies get more resolution than high ones, because that's where most speech information lives.
These features become the input to the neural network. Each small time window (typically 20-30 milliseconds) produces one feature vector describing what the audio sounds like at that moment.
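The "more resolution at low frequencies" idea comes from the mel scale itself. A commonly used formula (there are minor variants) converts hertz to mels, and a quick sketch shows how it compresses the high end:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """A widely used mel-scale formula: roughly linear below ~1 kHz,
    logarithmic above, mirroring the ear's frequency resolution."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The same 1 kHz of bandwidth "costs" many mels at the bottom of the range
# but few at the top, so mel features spend more detail on low frequencies:
low_band = hz_to_mel(1000) - hz_to_mel(0)
high_band = hz_to_mel(8000) - hz_to_mel(7000)
print(low_band, high_band)  # the low band spans several times more mels
```

A mel filterbank simply groups spectrogram bins into bands that are evenly spaced on this scale rather than in raw hertz.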
The neural network
Modern speech recognition uses encoder-decoder transformer models. The encoder reads the audio features and builds an internal representation of the entire recording. The decoder then generates text token by token, predicting what comes next based on everything it has seen so far.
This is the same architecture used in large language models, but adapted for audio input instead of text input. The model learns patterns from thousands of hours of transcribed audio during training.
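The "token by token" decoding loop can be sketched as follows. The `next_token_scores` stub here is entirely hypothetical, it just emits a fixed transcript; in a real system it would be a transformer decoder attending to the encoded audio:

```python
END = "<end>"  # special token the model emits when the transcript is done

def next_token_scores(audio_features, tokens_so_far):
    """Hypothetical stand-in for the decoder: scores for each candidate
    next token, given the audio and everything generated so far."""
    script = ["hello", "world", END]
    return {script[min(len(tokens_so_far), len(script) - 1)]: 1.0}

def greedy_decode(audio_features, max_tokens=50):
    tokens = []
    for _ in range(max_tokens):
        scores = next_token_scores(audio_features, tokens)
        best = max(scores, key=scores.get)  # pick the most likely next token
        if best == END:
            break
        tokens.append(best)
    return " ".join(tokens)

print(greedy_decode(None))  # -> hello world
```

Production systems usually refine this with beam search, which keeps several candidate continuations alive instead of committing to one, but the shape of the loop is the same.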
Speaker diarization
Identifying who is speaking is a separate step called diarization. The system:
- Detects segments where someone is speaking (voice activity detection)
- Extracts a voice "fingerprint" (embedding) for each segment
- Clusters similar fingerprints together — segments with similar voice characteristics get the same speaker label
- Assigns labels like Speaker 1, Speaker 2, etc.
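The clustering step above can be sketched with toy two-dimensional "fingerprints" and a greedy rule: a segment joins the first speaker it closely resembles, otherwise it starts a new one. Real embeddings have hundreds of dimensions, and production systems use more robust clustering, but the idea is the same:

```python
import math

def cosine(a, b):
    """Similarity of two voice fingerprints: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def label_speakers(embeddings, threshold=0.8):
    speakers = []  # one reference fingerprint per discovered speaker
    labels = []
    for emb in embeddings:
        for i, ref in enumerate(speakers):
            if cosine(emb, ref) >= threshold:
                labels.append(f"Speaker {i + 1}")
                break
        else:  # no existing speaker matched: this voice is new
            speakers.append(emb)
            labels.append(f"Speaker {len(speakers)}")
    return labels

segments = [(0.9, 0.1), (0.88, 0.15), (0.1, 0.95), (0.92, 0.05)]
print(label_speakers(segments))
# -> ['Speaker 1', 'Speaker 1', 'Speaker 2', 'Speaker 1']
```

Note the labels are arbitrary: the system knows two voices differ, not who they belong to.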
This happens independently of transcription, and the two results are merged afterward to produce a speaker-labeled transcript.

Language detection
Most modern models can identify the language automatically. They analyze the first few seconds of audio and match acoustic patterns against known languages. Some models handle multiple languages within a single recording, switching labels as the speaker changes language.
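At its simplest, the final step of language identification is an argmax over per-language scores for each stretch of audio. The scores below are made up for illustration; in a real system they would come from an acoustic language-ID model:

```python
def detect_languages(chunk_scores):
    """Pick the top-scoring language for each audio chunk."""
    return [max(scores, key=scores.get) for scores in chunk_scores]

# Hypothetical per-chunk scores for a recording that switches language:
scores = [
    {"en": 0.9, "de": 0.1},  # first chunk sounds English
    {"en": 0.8, "de": 0.2},
    {"en": 0.2, "de": 0.8},  # speaker switches to German
]
print(detect_languages(scores))  # -> ['en', 'en', 'de']
```

Running this per chunk rather than once per file is what lets a model relabel mid-recording when the speaker changes language.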
What affects accuracy
Recognition quality varies based on several factors:
- Background noise — constant noise (fans, traffic) is easier to handle than sudden sounds (doors, coughs)
- Microphone distance — closer is generally better; headset mics outperform room microphones
- Number of speakers — two speakers are easier to separate than five speaking at once
- Audio format — higher bitrate recordings preserve more detail; compressed phone calls lose information
- Accent and speed — models perform best on common speech patterns and moderate tempo
What Mediata does with this
When you upload a recording to Mediata, the system runs the full pipeline:
- Audio preprocessing and format normalization
- Speech recognition to produce raw text
- Speaker diarization to identify who said what
- Timestamp alignment to connect text to specific moments
The result appears as a structured transcript with speaker labels, timestamps, and full text — ready for search and AI analysis.
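The merge between transcription and diarization is essentially a timestamp join: each recognized word is attributed to whichever speaker's turn contains it. A minimal sketch, using illustrative data shapes rather than Mediata's actual format:

```python
def assign_speakers(words, speaker_turns):
    """Attach a speaker label to each timed word by midpoint overlap.
    words: (text, start_s, end_s); speaker_turns: (label, start_s, end_s)."""
    labeled = []
    for text, start, end in words:
        mid = (start + end) / 2  # attribute the word by its midpoint
        label = next((who for who, s, e in speaker_turns if s <= mid < e),
                     "Unknown")
        labeled.append((label, text))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.4)]
turns = [("Speaker 1", 0.0, 1.0), ("Speaker 2", 1.0, 2.0)]
print(assign_speakers(words, turns))
# -> [('Speaker 1', 'hello'), ('Speaker 1', 'there'), ('Speaker 2', 'hi')]
```

Words whose midpoint falls in no turn (silence, overlapping speech) are the hard cases; real pipelines handle them with overlap heuristics rather than a flat "Unknown".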
The entire process runs on specialized GPU infrastructure and typically completes in a fraction of the recording's duration.