In audio signal processing, accurate and efficient diarization of conversational speech remains a
challenging task, particularly in environments with significant speaker overlap and diverse acoustic scenarios. This paper
introduces a comprehensive speaker diarization pipeline that improves performance and efficiency in processing
conversational speech. Our pipeline comprises several key components: Voice Activity Detection (VAD), Speaker
Overlap Detection (SOD), Speaker Separation models, robust speaker embedding, clustering algorithms, and
sophisticated post-processing techniques. The pipeline begins with VAD, which discriminates between speech and
non-speech segments and thereby reduces processing overhead. The SOD component then identifies segments that
contain overlapping speakers, and a speaker separation model splits the overlapping speech into distinct streams.
A pivotal enhancement in our pipeline is the
integration of robust speaker embedding and clustering techniques, which capture and utilize speaker-specific
characteristics to improve the grouping of speech segments. Finally, the post-processing stage refines these segments to
ensure temporal consistency and improve the overall diarization accuracy. We evaluated our pipeline across multiple
benchmark datasets, demonstrating reductions of up to 10% in Diarization Error Rate (DER) compared to existing
methods.
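
Because the results are reported as DER, it may help to recall how the metric is conventionally defined: the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is a minimal frame-level illustration of that definition, not the evaluation code used in this work. It assumes that each frame is represented as a set of active speaker labels (so overlapping speech counts once per speaker), that hypothesis labels have already been mapped to reference labels, and that no forgiveness collar is applied. The function name `frame_level_der` and the toy labels are hypothetical.

```python
from typing import List, Set


def frame_level_der(reference: List[Set[str]], hypothesis: List[Set[str]]) -> float:
    """Frame-level DER, assuming hypothesis labels are already mapped to reference labels."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)               # speakers attributed correctly
        missed += max(0, n_ref - n_hyp)          # reference speakers with no hypothesis
        false_alarm += max(0, n_hyp - n_ref)     # hypothesis speakers with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # speech assigned to the wrong speaker
        total += n_ref                           # reference speaker-frames

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Toy example: five frames, one of which contains overlapping speakers A and B.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
der = frame_level_der(reference, hypothesis)
```

In the toy example, the second frame contributes one confusion, the overlapped third frame contributes one miss, and the final non-speech frame contributes one false alarm, against five reference speaker-frames. A system without overlap handling can never recover the missed speaker in the third frame, which is the kind of error the SOD and separation stages are intended to reduce.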