
Towards Approximate Fast Diarization: A CPU-Only Alternative to Pyannote 3.1
Author(s): Shashwat (Shawn) Gupta
Originally published on Towards AI.
Speaker diarization doesn’t always require sophisticated multi-stage neural pipelines to deliver practical results. For many real-world applications, particularly those focused on post-processing clean audio content like podcasts or interviews, a surprisingly simple approach using basic clustering techniques can achieve effective “who spoke when” identification. By extracting speaker embeddings from fixed audio chunks and applying straightforward clustering algorithms like MeanShift, this lightweight method trades some precision for remarkable speed and accessibility — running entirely on CPU while processing hours of audio in minutes. This approach proves especially valuable when generating context-aware transcripts, where approximate speaker boundaries are sufficient to transform raw transcription output into properly attributed dialogue without the computational overhead of state-of-the-art models.
The Pyannote 2.1 Approach: Sophisticated Multi-Stage Pipeline
Pyannote 2.1 represents a sophisticated approach to speaker diarization, employing a carefully designed multi-stage pipeline that integrates neural networks with classical clustering techniques:
1. Local Neural Speaker Segmentation
The first stage applies end-to-end neural speaker segmentation using a sliding window approach. A 5-second window slides across the audio with 500ms steps, with the model trained to handle up to 3 simultaneous speakers (Kmax = 3). For each 16ms frame within each window, the model outputs probabilities indicating whether each potential speaker is active.
The key innovation here is the use of much shorter, heavily overlapping windows compared to earlier approaches. This creates several advantages:
- Reduced computational complexity: Shorter sequences are easier to train and process
- Test-time augmentation effect: Overlapping windows provide multiple predictions for the same audio region
- Better speaker segmentation: Dense sampling captures more nuanced speaker transitions
A binarization threshold θ is then applied to convert probabilities into binary speaker activity decisions.
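To make the sliding-window mechanics concrete, here is a minimal NumPy sketch of the bookkeeping and binarization step, not pyannote's actual API: segmentation_model is a hypothetical stand-in for the neural network, and the window, step, frame, and threshold values simply mirror the description above.

import numpy as np

# Hypothetical stand-in for the neural segmentation model: for one 5-second
# window it returns per-frame activity probabilities for up to Kmax speakers.
def segmentation_model(window, n_frames, k_max=3):
    return np.random.rand(n_frames, k_max)  # placeholder probabilities

def sliding_window_segmentation(audio, sr=16000, window_s=5.0, step_s=0.5,
                                frame_s=0.016, theta=0.5):
    # 5-second window, 500 ms step, ~16 ms frames, binarization threshold θ
    win, step = int(window_s * sr), int(step_s * sr)
    n_frames = int(window_s / frame_s)
    decisions = []  # (window start in seconds, binary speaker-activity matrix)
    for start in range(0, max(len(audio) - win, 0) + 1, step):
        probs = segmentation_model(audio[start:start + win], n_frames)
        decisions.append((start / sr, probs > theta))
    return decisions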

2. Local Speaker Embedding Extraction
For each window containing active speakers, the system extracts one embedding per detected speaker. This process is overlap-aware: when multiple speakers are active simultaneously, each speaker’s embedding is computed only from audio segments where that specific speaker talks alone.
This approach offers significant advantages over traditional periodic embedding extraction:
- Pure speaker signals: Embeddings are extracted from audio containing only one speaker’s voice
- Variable-length contexts: Uses up to 5 seconds of speech (the full window) rather than fixed 1–2 second segments
- Overlap awareness: Automatically handles simultaneous speech by separating speakers
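A rough sketch of that overlap-aware selection, assuming the binary activity matrix from the segmentation step and a generic embedding_model function (both hypothetical here): each speaker's embedding is built only from frames where that speaker is active alone.

import numpy as np

def overlap_aware_embeddings(window_audio, activity, embedding_model,
                             sr=16000, frame_s=0.016):
    # activity: (n_frames, k_max) boolean matrix for one 5-second window
    frame_len = int(frame_s * sr)
    embeddings = {}
    for spk in range(activity.shape[1]):
        # keep frames where this speaker is the only active one (no overlap)
        solo = activity[:, spk] & (activity.sum(axis=1) == 1)
        if not solo.any():
            continue
        samples = np.concatenate([
            window_audio[f * frame_len:(f + 1) * frame_len]
            for f in np.where(solo)[0]
        ])
        embeddings[spk] = embedding_model(samples)  # one embedding per local speaker
    return embeddings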

3. Global Agglomerative Clustering
The extracted embeddings are clustered using classical agglomerative hierarchical clustering with centroid linkage. Despite the availability of more sophisticated clustering techniques like spectral clustering or variational Bayesian methods, agglomerative clustering was chosen for practical reasons:
- Simplicity: Only requires one hyperparameter (distance threshold δ)
- Flexibility: Handles the variable number of embeddings per time window
- Reliability: Doesn’t assume chronological ordering or strict periodicity
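For intuition, the clustering step can be approximated with scipy's hierarchical clustering using centroid linkage and a distance-threshold cut; the δ value and the random embeddings below are purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_embeddings(embeddings, delta=1.0):
    # Agglomerative clustering with centroid linkage, cut at distance threshold δ
    Z = linkage(embeddings, method="centroid", metric="euclidean")
    return fcluster(Z, t=delta, criterion="distance")  # global speaker label per embedding

# Example: 20 local embeddings of dimension 192 (ECAPA-TDNN output size)
labels = cluster_embeddings(np.random.randn(20, 192), delta=1.0)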

4. Final Aggregation
The final stage converts clustered local segments into a complete diarization output through:
- Estimating instantaneous speaker counts by averaging over overlapping windows
- Computing speaker activity scores by aggregating clustered segments
- Selecting the most active speakers for each time frame
- Optional gap-filling for segments shorter than a threshold Δ
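As an illustration of the speaker-selection idea only (not pyannote's exact aggregation code), given per-frame activity scores for each global speaker and an estimated speaker count per frame, the final selection could look like this:

import numpy as np

def select_active_speakers(frame_scores, frame_counts):
    # frame_scores: (n_frames, n_speakers) activity scores averaged over
    #               the overlapping windows that cover each frame
    # frame_counts: (n_frames,) estimated number of simultaneously active speakers
    output = np.zeros(frame_scores.shape, dtype=bool)
    for t in range(frame_scores.shape[0]):
        k = int(round(frame_counts[t]))
        if k <= 0:
            continue
        top = np.argsort(frame_scores[t])[::-1][:k]  # k most active speakers
        output[t, top] = True
    return output  # binary diarization matrix (who speaks in each frame)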
This sophisticated pipeline achieves impressive accuracy (DER often below 15% on benchmark datasets) but comes with significant computational costs:
- GPU Dependency: Multiple neural network stages require CUDA-capable hardware for reasonable performance
- Model Complexity: The segmentation model alone requires significant computational resources
- Processing Time: The multi-stage pipeline can be slow, running only around 40x faster than real time even with GPU acceleration
- Memory Usage: High memory requirements, especially for the sliding window approach with 500ms steps
- Setup Complexity: Multiple dependencies and model downloads required
The Simple, Fast, and Surprisingly Effective Alternative

We have adapted the above approach to run efficiently on CPUs with impressive speed improvements. A 90-minute video that previously took around 10 minutes to process using pyannote/speaker-diarization-3.1 on a T4 GPU now takes approximately 1.5 minutes on CPU. The goal is to achieve fast, approximate diarization locally, enabling the generation of context-aware transcripts rather than relying solely on raw transcription output.
Intended Purpose
This approach is designed for context-aware transcript processing of high-quality audio content, such as YouTube podcasts. It assumes that transcript generation has already been completed using tools like Whisper or its faster alternatives. The key advantages of local processing include:
- Reduced costs compared to cloud-based solutions
- Broader applicability without API dependencies
- No expensive API fees or streaming limitations
- No file size restrictions
- Elimination of chunking requirements
- Reduced error accumulation due to fewer processing steps
This method is significantly faster than ElevenLabs diarization, which limits processing to 8 minutes of audio with transcript and diarization combined. Traditional chunking approaches require additional time and post-processing to map local speakers to consistent real-world speaker identities (as described in steps 3 and 4 above).
Limitations
This approach has several constraints:
- May not perform well in noisy, real-world environments with low-quality audio
- Cannot distinguish between multiple speakers speaking simultaneously
- Provides only approximate boundary detection
Key Modifications to the Base Approach
- Window sizing: We use fixed 10-second chunks (instead of 5-second sliding windows) and assign one dominant speaker per chunk.
- Speaker embedding extraction: We employ SpeechBrain’s pretrained ECAPA-TDNN model to extract speaker embeddings. This model runs efficiently on CPU while producing high-quality speaker characteristic representations (similar to PyAnnote’s approach).
- Clustering algorithm: We use MeanShift clustering instead of UPGMC, as empirical testing showed better performance.
- Boundary correction: Speaker annotations are inserted into the transcript at the nearest sentence delimiter (period, question mark, exclamation mark, or the Hindi danda: . ? ! ।).
Here’s a stripped-down approach that trades some accuracy for simplicity and speed:
import librosa
import numpy as np
import torch
from speechbrain.pretrained import SpeakerRecognition
from sklearn.cluster import MeanShift

def chunk_audio(audio, sr, chunk_duration=10):
    # Split the waveform into fixed-length, non-overlapping chunks
    chunk_length = int(sr * chunk_duration)
    return [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]
def extract_embeddings(chunks, model):
    embeddings = []
    for chunk in chunks:
        if len(chunk) == 0:
            continue
        tensor_chunk = torch.tensor(chunk).unsqueeze(0)
        emb = model.encode_batch(tensor_chunk).squeeze().detach().cpu().numpy()
        embeddings.append(emb)
    return np.vstack(embeddings)

def diarize_meanshift(audio_path, chunk_length=10):
    # Load audio (16kHz mono)
    audio, sr = librosa.load(audio_path, sr=16000)

    # Split audio into fixed-length chunks
    chunks = chunk_audio(audio, sr, chunk_length)

    # Load pretrained speaker embedding model
    model = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        run_opts={"device": "cpu"}
    )

    # Extract embeddings per chunk
    embeddings = extract_embeddings(chunks, model)

    # Cluster using MeanShift
    clustering = MeanShift()
    labels = clustering.fit_predict(embeddings)

    # Format diarization results
    segments = []
    for i, label in enumerate(labels):
        start = i * chunk_length
        end = start + chunk_length
        segments.append((start, end, f"Speaker {label + 1}"))
    return segments
def apply_speaker_tags_to_transcript(result_words, speaker_changes, hops_second=5):
    """
    Intelligently insert speaker tags into transcribed text at natural boundaries
    """
    result_words.sort(key=lambda x: x['start'])
    speaker_idx = 0
    for idx, word in enumerate(result_words):
        if (speaker_idx < len(speaker_changes) and
                word['start'] > speaker_changes[speaker_idx][0]):
            # Search for natural punctuation boundaries
            forward_idx = backward_idx = -1

            # Forward search for punctuation
            for curr_idx in range(idx, len(result_words)):
                if result_words[curr_idx]['start'] >= word['start'] + hops_second:
                    break
                if any(delimiter in result_words[curr_idx]['text']
                       for delimiter in '.!?।'):
                    forward_idx = curr_idx
                    break

            # Backward search for punctuation
            for curr_idx in range(idx, -1, -1):
                if result_words[curr_idx]['start'] <= word['start'] - hops_second:
                    break
                if any(delimiter in result_words[curr_idx]['text']
                       for delimiter in '.!?।'):
                    backward_idx = curr_idx
                    break

            # Choose the closest punctuation mark
            if forward_idx != -1 and backward_idx != -1:
                chosen_idx = (forward_idx if forward_idx - idx < idx - backward_idx
                              else backward_idx)
            elif forward_idx != -1:
                chosen_idx = forward_idx
            elif backward_idx != -1:
                chosen_idx = backward_idx
            else:
                chosen_idx = idx  # Fallback to current position

            # Insert speaker tag
            speaker_tag = f'[{speaker_changes[speaker_idx][1]}]'
            result_words[chosen_idx]['text'] += speaker_tag
            speaker_idx += 1
    return result_words
# Complete usage example
audio_file = "./audio.mp3"

# Step 1: Get diarization results
results = diarize_meanshift(audio_file, chunk_length=10)

# Step 2: Extract speaker changes
current_speaker = ''
speaker_changes = []
for start, end, speaker in results:
    if speaker != current_speaker:
        speaker_changes.append((start, speaker))
        current_speaker = speaker

# Step 3: Apply to transcript (assuming you have transcribed words with timestamps)
# result_words = your_transcription_result['words']
# tagged_words = apply_speaker_tags_to_transcript(result_words, speaker_changes)
Conclusion
This fast CPU-based speaker diarization approach demonstrates that efficient local processing can achieve significant performance improvements over existing solutions while maintaining practical accuracy for high-quality audio content. By leveraging SpeechBrain’s ECAPA-TDNN model with MeanShift clustering and strategic boundary correction, the method processes 90-minute videos in approximately 1.5 minutes on standard CPU hardware — representing a 6.7x speed improvement over PyAnnote’s speaker-diarization-3.1. The elimination of chunking requirements and API dependencies makes this approach particularly valuable for cost-effective, scalable transcript processing applications. While limitations exist for noisy environments and overlapping speech scenarios, the method provides an excellent foundation for context-aware transcript generation in controlled audio conditions.
References
- Ravanelli, M., et al. (2021). SpeechBrain: A General-Purpose Speech Toolkit. arXiv preprint arXiv:2106.04624.
- Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proceedings of Interspeech 2020.
- Bredin, H., et al. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on information theory, 21(1), 32–40.
- Radford, A., et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.
- Sell, G., & Garcia-Romero, D. (2014). Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In 2014 IEEE Spoken Language Technology Workshop (SLT).
- Garcia-Romero, D., & Espy-Wilson, C. Y. (2011). Analysis of i-vector length normalization in speaker recognition systems. In Twelfth annual conference of the international speech communication association.