
Towards Approximate Fast Diarization: A CPU-Only Alternative to Pyannote 3.1
Author(s): Shashwat (Shawn) Gupta
Originally published on Towards AI.
Speaker diarization doesn’t always require sophisticated multi-stage neural pipelines to deliver practical results. For many real-world applications, particularly those focused on post-processing clean audio content like podcasts or interviews, a surprisingly simple approach using basic clustering techniques can achieve effective “who spoke when” identification. By extracting speaker embeddings from fixed audio chunks and applying straightforward clustering algorithms like MeanShift, this lightweight method trades some precision for remarkable speed and accessibility — running entirely on CPU while processing hours of audio in minutes. This approach proves especially valuable when generating context-aware transcripts, where approximate speaker boundaries are sufficient to transform raw transcription output into properly attributed dialogue without the computational overhead of state-of-the-art models.
The Pyannote 2.1 Approach: Sophisticated Multi-Stage Pipeline
Pyannote 2.1 represents a sophisticated approach to speaker diarization, employing a carefully designed multi-stage pipeline that integrates neural networks with classical clustering techniques:
1. Local Neural Speaker Segmentation
The first stage applies end-to-end neural speaker segmentation using a sliding window approach. A 5-second window slides across the audio with 500ms steps, with the model trained to handle up to 3 simultaneous speakers (Kmax = 3). For each 16ms frame within each window, the model outputs probabilities indicating whether each potential speaker is active.
The key innovation here is the use of much shorter, heavily overlapping windows compared to earlier approaches. This creates several advantages:
- Reduced computational complexity: Shorter sequences are easier to train and process
- Test-time augmentation effect: Overlapping windows provide multiple predictions for the same audio region
- Better speaker segmentation: Dense sampling captures more nuanced speaker transitions
A binarization threshold θ is then applied to convert probabilities into binary speaker activity decisions.
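To make the sliding-window mechanics concrete, here is a minimal NumPy sketch of the bookkeeping and binarization step, not pyannote's actual API: segmentation_model is a hypothetical stand-in for the neural network, and the window, step, frame, and threshold values simply mirror the description above.

import numpy as np

# Hypothetical stand-in for the neural segmentation model: for one 5-second
# window it returns per-frame activity probabilities for up to Kmax speakers.
def segmentation_model(window, n_frames, k_max=3):
    return np.random.rand(n_frames, k_max)  # placeholder probabilities

def sliding_window_segmentation(audio, sr=16000, window_s=5.0, step_s=0.5,
                                frame_s=0.016, theta=0.5):
    # 5-second window, 500 ms step, ~16 ms frames, binarization threshold θ
    win, step = int(window_s * sr), int(step_s * sr)
    n_frames = int(window_s / frame_s)
    decisions = []  # (window start in seconds, binary speaker-activity matrix)
    for start in range(0, max(len(audio) - win, 0) + 1, step):
        probs = segmentation_model(audio[start:start + win], n_frames)
        decisions.append((start / sr, probs > theta))
    return decisions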

2. Local Speaker Embedding Extraction
For each window containing active speakers, the system extracts one embedding per detected speaker. This process is overlap-aware: when multiple speakers are active simultaneously, each speaker’s embedding is computed only from audio segments where that specific speaker talks alone.
This approach offers significant advantages over traditional periodic embedding extraction:
- Pure speaker signals: Embeddings are extracted from audio containing only one speaker’s voice
- Variable-length contexts: Uses up to 5 seconds of speech (the full window) rather than fixed 1–2 second segments
- Overlap awareness: Automatically handles simultaneous speech by separating speakers
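A rough sketch of that overlap-aware selection, assuming the binary activity matrix from the segmentation step and a generic embedding_model function (both hypothetical here): each speaker's embedding is built only from frames where that speaker is active alone.

import numpy as np

def overlap_aware_embeddings(window_audio, activity, embedding_model,
                             sr=16000, frame_s=0.016):
    # activity: (n_frames, k_max) boolean matrix for one 5-second window
    frame_len = int(frame_s * sr)
    embeddings = {}
    for spk in range(activity.shape[1]):
        # keep frames where this speaker is the only active one (no overlap)
        solo = activity[:, spk] & (activity.sum(axis=1) == 1)
        if not solo.any():
            continue
        samples = np.concatenate([
            window_audio[f * frame_len:(f + 1) * frame_len]
            for f in np.where(solo)[0]
        ])
        embeddings[spk] = embedding_model(samples)  # one embedding per local speaker
    return embeddings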

3. Global Agglomerative Clustering
The extracted embeddings are clustered using classical agglomerative hierarchical clustering with centroid linkage. Despite the availability of more sophisticated clustering techniques like spectral clustering or variational Bayesian methods, agglomerative clustering was chosen for practical reasons:
- Simplicity: Only requires one hyperparameter (distance threshold δ)
- Flexibility: Handles the variable number of embeddings per time window
- Reliability: Doesn’t assume chronological ordering or strict periodicity
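For intuition, the clustering step can be approximated with scipy's hierarchical clustering using centroid linkage and a distance-threshold cut; the δ value and the random embeddings below are purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_embeddings(embeddings, delta=1.0):
    # Agglomerative clustering with centroid linkage, cut at distance threshold δ
    Z = linkage(embeddings, method="centroid", metric="euclidean")
    return fcluster(Z, t=delta, criterion="distance")  # global speaker label per embedding

# Example: 20 local embeddings of dimension 192 (ECAPA-TDNN output size)
labels = cluster_embeddings(np.random.randn(20, 192), delta=1.0)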

4. Final Aggregation
The final stage converts clustered local segments into a complete diarization output through:
- Estimating instantaneous speaker counts by averaging over overlapping windows
- Computing speaker activity scores by aggregating clustered segments
- Selecting the most active speakers for each time frame
- Optional gap-filling for segments shorter than a threshold Δ
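As an illustration of the speaker-selection idea only (not pyannote's exact aggregation code), given per-frame activity scores for each global speaker and an estimated speaker count per frame, the final selection could look like this:

import numpy as np

def select_active_speakers(frame_scores, frame_counts):
    # frame_scores: (n_frames, n_speakers) activity scores averaged over
    #               the overlapping windows that cover each frame
    # frame_counts: (n_frames,) estimated number of simultaneously active speakers
    output = np.zeros(frame_scores.shape, dtype=bool)
    for t in range(frame_scores.shape[0]):
        k = int(round(frame_counts[t]))
        if k <= 0:
            continue
        top = np.argsort(frame_scores[t])[::-1][:k]  # k most active speakers
        output[t, top] = True
    return output  # binary diarization matrix (who speaks in each frame)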
This sophisticated pipeline achieves impressive accuracy (DER often below 15% on benchmark datasets) but comes with significant computational costs:
- GPU Dependency: Multiple neural network stages require CUDA-capable hardware for reasonable performance
- Model Complexity: The segmentation model alone requires significant computational resources
- Processing Time: The multi-stage pipeline can be slow, running only around 40x faster than real time even with GPU acceleration
- Memory Usage: High memory requirements, especially for the sliding window approach with 500ms steps
- Setup Complexity: Multiple dependencies and model downloads required
The Simple, Fast, and Surprisingly Effective Alternative

We have adapted the above approach to run efficiently on CPUs with impressive speed improvements. A 90-minute video that previously took around 10 minutes to process using pyannote/speaker-diarization-3.1 on a T4 GPU now takes approximately 1.5 minutes on CPU. The goal is to achieve fast, approximate diarization locally, enabling the generation of context-aware transcripts rather than relying solely on raw transcription output.
Intended Purpose
This approach is designed for context-aware transcript processing of high-quality audio content, such as YouTube podcasts. It assumes that transcript generation has already been completed using tools like Whisper or its faster alternatives. The key advantages of local processing include:
- Reduced costs compared to cloud-based solutions
- Broader applicability without API dependencies
- No expensive API fees or streaming limitations
- No file size restrictions
- Elimination of chunking requirements
- Reduced error accumulation due to fewer processing steps
This method is significantly faster than ElevenLabs diarization, which limits processing to 8 minutes of audio with transcript and diarization combined. Traditional chunking approaches require additional time and post-processing to map local speakers to consistent real-world speaker identities (as described in steps 3 and 4 above).
Limitations
This approach has several constraints:
- May not perform well in noisy, real-world environments with low-quality audio
- Cannot distinguish between multiple speakers speaking simultaneously
- Provides only approximate boundary detection
Key Modifications to the Base Approach
- Window sizing: We use fixed 10-second chunks (instead of 5-second sliding windows) and assign one dominant speaker per chunk.
- Speaker embedding extraction: We employ SpeechBrain’s pretrained ECAPA-TDNN model to extract speaker embeddings. This model runs efficiently on CPU while producing high-quality speaker characteristic representations (similar to PyAnnote’s approach).
- Clustering algorithm: We use MeanShift clustering instead of UPGMC, as empirical testing showed better performance.
- Boundary correction: Speaker annotations are inserted into the transcript at the nearest sentence delimiter (period, question mark, exclamation mark, or the Hindi danda: . ? ! ।).
Here’s a stripped-down approach that trades some accuracy for simplicity and speed:
import librosa
import numpy as np
import torch
from speechbrain.pretrained import SpeakerRecognition
from sklearn.cluster import MeanShift

def chunk_audio(audio, sr, chunk_duration=10):
    # Split the waveform into fixed-length, non-overlapping chunks
    chunk_length = int(sr * chunk_duration)
    return [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]
def extract_embeddings(chunks, model):
    embeddings = []
    for chunk in chunks:
        if len(chunk) == 0:
            continue
        tensor_chunk = torch.tensor(chunk).unsqueeze(0)
        emb = model.encode_batch(tensor_chunk).squeeze().detach().cpu().numpy()
        embeddings.append(emb)
    return np.vstack(embeddings)

def diarize_meanshift(audio_path, chunk_length=10):
    # Load audio (16kHz mono)
    audio, sr = librosa.load(audio_path, sr=16000)

    # Split audio into fixed-length chunks
    chunks = chunk_audio(audio, sr, chunk_length)

    # Load pretrained speaker embedding model
    model = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        run_opts={"device": "cpu"}
    )

    # Extract embeddings per chunk
    embeddings = extract_embeddings(chunks, model)

    # Cluster using MeanShift
    clustering = MeanShift()
    labels = clustering.fit_predict(embeddings)

    # Format diarization results
    segments = []
    for i, label in enumerate(labels):
        start = i * chunk_length
        end = start + chunk_length
        segments.append((start, end, f"Speaker {label + 1}"))
    return segments
def apply_speaker_tags_to_transcript(result_words, speaker_changes, hops_second=5):
    """
    Intelligently insert speaker tags into transcribed text at natural boundaries
    """
    result_words.sort(key=lambda x: x['start'])
    speaker_idx = 0
    for idx, word in enumerate(result_words):
        if (speaker_idx < len(speaker_changes) and
                word['start'] > speaker_changes[speaker_idx][0]):
            # Search for natural punctuation boundaries
            forward_idx = backward_idx = -1

            # Forward search for punctuation
            for curr_idx in range(idx, len(result_words)):
                if result_words[curr_idx]['start'] >= word['start'] + hops_second:
                    break
                if any(delimiter in result_words[curr_idx]['text']
                       for delimiter in '.!?।'):
                    forward_idx = curr_idx
                    break

            # Backward search for punctuation
            for curr_idx in range(idx, -1, -1):
                if result_words[curr_idx]['start'] <= word['start'] - hops_second:
                    break
                if any(delimiter in result_words[curr_idx]['text']
                       for delimiter in '.!?।'):
                    backward_idx = curr_idx
                    break

            # Choose the closest punctuation mark
            if forward_idx != -1 and backward_idx != -1:
                chosen_idx = (forward_idx if forward_idx - idx < idx - backward_idx
                              else backward_idx)
            elif forward_idx != -1:
                chosen_idx = forward_idx
            elif backward_idx != -1:
                chosen_idx = backward_idx
            else:
                chosen_idx = idx  # Fallback to current position

            # Insert speaker tag
            speaker_tag = f'[{speaker_changes[speaker_idx][1]}]'
            result_words[chosen_idx]['text'] += speaker_tag
            speaker_idx += 1
    return result_words
# Complete usage example
audio_file = "./audio.mp3"

# Step 1: Get diarization results
results = diarize_meanshift(audio_file, chunk_length=10)

# Step 2: Extract speaker changes
current_speaker = ''
speaker_changes = []
for start, end, speaker in results:
    if speaker != current_speaker:
        speaker_changes.append((start, speaker))
        current_speaker = speaker

# Step 3: Apply to transcript (assuming you have transcribed words with timestamps)
# result_words = your_transcription_result['words']
# tagged_words = apply_speaker_tags_to_transcript(result_words, speaker_changes)
Conclusion
This fast CPU-based speaker diarization approach demonstrates that efficient local processing can achieve significant performance improvements over existing solutions while maintaining practical accuracy for high-quality audio content. By leveraging SpeechBrain’s ECAPA-TDNN model with MeanShift clustering and strategic boundary correction, the method processes 90-minute videos in approximately 1.5 minutes on standard CPU hardware — representing a 6.7x speed improvement over PyAnnote’s speaker-diarization-3.1. The elimination of chunking requirements and API dependencies makes this approach particularly valuable for cost-effective, scalable transcript processing applications. While limitations exist for noisy environments and overlapping speech scenarios, the method provides an excellent foundation for context-aware transcript generation in controlled audio conditions.
References
- Ravanelli, M., et al. (2021). SpeechBrain: A General-Purpose Speech Toolkit. arXiv preprint arXiv:2106.04624.
- Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proceedings of Interspeech 2020.
- Bredin, H., et al. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on information theory, 21(1), 32–40.
- Radford, A., et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.
- Sell, G., & Garcia-Romero, D. (2014). Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In 2014 IEEE Spoken Language Technology Workshop (SLT).
- Garcia-Romero, D., & Espy-Wilson, C. Y. (2011). Analysis of i-vector length normalization in speaker recognition systems. In Twelfth annual conference of the international speech communication association.