Quick Start

Command line

The execution of the jar file directly invokes the speaker diarization method developed for broadcast news recordings. Suppose we need to compute the diarization “./showName.seg” of the audio file “./showName.wav”. The command line to accomplish this would be:

/usr/bin/java -Xmx2048m -jar ./LIUM_SpkDiarization.jar --fInputMask=./showName.wav --sOutputMask=./showName.seg --doCEClustering showName

corresponding to:

  • java is the name of the Java virtual machine (JVM).
  • option -Xmx2048m sets the memory of the JVM to 2048MB, which is sufficient to process a one-hour show.
  • option -jar ./LIUM_SpkDiarization.jar specifies the jar to use.
  • option --fInputMask=./showName.wav gives the name of the audio file. It can be in Sphere or Wave format (16kHz / 16bit PCM mono); the type is auto-detected according to the extension.
  • option --sOutputMask=./showName.seg gives the output file containing the segmentation.
  • if the option --doCEClustering is set, the program computes the NCLR/CE clustering at the end, which minimizes the diarization error rate. If this option is not set, the program stops right after gender detection, and the resulting segmentation is sufficient for a transcription system.
  •  showName is the name of the show.

The other possible options are:

  •  --trace to display information during processing.
  •  --help to display a brief usage guide of the tools.
  •  --system=current selects the diarization system (currently unused).
  • --saveAllStep saves every step of the diarization in the following files:
    • show.i.seg : initial segmentation (split into segments of 2 sec)
    • show.pms.seg : Speech/Music/Silence segmentation
    • show.s.seg : GLR-based segmentation, producing small segments
    • show.l.seg : linear clustering (merges only adjacent segments)
    • show.h.seg : hierarchical clustering
    • show.d.seg : Viterbi decoding
    • show.adj.seg : adjusted boundaries
    • show.flt.seg : speaker segmentation filtered according to the pms segmentation
    • show.spl.seg : segments longer than 20 sec are split
    • show.g.seg : gender and bandwidth are detected (this is the segmentation used for transcription)
    • show.c.seg : final segmentation with NCLR/CE clustering (if the option --doCEClustering is set)
  •  --loadInputSegmentation loads the initial segmentation (UEM) from the file specified by the option --sInputMask. By default, the initial segmentation is composed of one segment ranging from the start to the end of the show.
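
For instance, to restrict the diarization to the regions listed in a UEM file (here ./showName.uem.seg, a hypothetical name), the two options are combined:

/usr/bin/java -Xmx2048m -jar ./LIUM_SpkDiarization.jar --fInputMask=./showName.wav --sOutputMask=./showName.seg --loadInputSegmentation --sInputMask=./showName.uem.seg --doCEClustering showName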

Caution: there is a problem not yet solved under Windows: loading resources (such as GMMs) does not work.

System description

The Sphinx 4 tools are used for the computation of features from the signal. For the first three steps described below, the features are composed of 13 MFCCs with coefficient C0 as energy, and are not normalized (no CMS or warping). Different sets of features are used for further steps, similarly computed using the Sphinx tools.

Before segmenting the signal into homogeneous regions, a safety check is performed over the features. They are checked to ensure that there is no sequence of several identical features (usually resulting from a problem during the recording of the sound), for such sequences would disturb the segmentation process.
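
A minimal sketch of such a check is given below. It assumes the features are already available as an array of per-frame vectors; the run-length threshold is an arbitrary illustration, not the toolkit's actual value.

import java.util.Arrays;

public class FeatureCheck {
    /** Warns about every run of identical consecutive feature vectors longer
     *  than maxRun; such runs usually indicate a recording problem. */
    static void checkIdenticalRuns(float[][] features, int maxRun) {
        int runStart = 0;
        for (int t = 1; t <= features.length; t++) {
            boolean same = t < features.length
                    && Arrays.equals(features[t], features[t - 1]);
            if (!same) {
                if (t - runStart > maxRun) {
                    System.err.printf("warning: %d identical frames at %d-%d%n",
                            t - runStart, runStart, t - 1);
                }
                runStart = t;
            }
        }
    }
}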

Segmentation based on BIC

A pass of distance-based segmentation detects the instantaneous change points corresponding to segment boundaries. It detects the change points through a generalized likelihood ratio (GLR), computed using Gaussians with full covariance matrices. The Gaussians are estimated over a five-second window sliding along the whole signal. A change point, i.e. a segment boundary, is present in the middle of the window when the GLR reaches a local maximum.
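
The sketch below illustrates this criterion (an illustration, not the toolkit's code): for a window covering frames [from, to) and a candidate change point t, GLR(t) = n/2 log|Σ| − n1/2 log|Σ1| − n2/2 log|Σ2|, where Σ is estimated over the whole window and Σ1, Σ2 over its two halves.

class GlrSegmentation {
    /** Log-determinant of the sample covariance of frames x[from..to).
     *  Assumes the window is long enough for a non-singular covariance. */
    static double logDetCov(double[][] x, int from, int to) {
        int n = to - from, d = x[0].length;
        double[] mean = new double[d];
        for (int t = from; t < to; t++)
            for (int k = 0; k < d; k++) mean[k] += x[t][k] / n;
        double[][] s = new double[d][d];
        for (int t = from; t < to; t++)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    s[i][j] += (x[t][i] - mean[i]) * (x[t][j] - mean[j]) / n;
        // Cholesky factorization S = L L^T, so log|S| = 2 * sum_i log L[i][i]
        double[][] L = new double[d][d];
        double logDet = 0;
        for (int i = 0; i < d; i++)
            for (int j = 0; j <= i; j++) {
                double sum = s[i][j];
                for (int k = 0; k < j; k++) sum -= L[i][k] * L[j][k];
                if (i == j) { L[i][i] = Math.sqrt(sum); logDet += 2 * Math.log(L[i][i]); }
                else L[i][j] = sum / L[j][j];
            }
        return logDet;
    }

    /** GLR for splitting window [from, to) at frame t; change points
     *  correspond to local maxima of this value as t slides. */
    static double glr(double[][] x, int from, int t, int to) {
        int n = to - from, n1 = t - from, n2 = to - t;
        return 0.5 * (n * logDetCov(x, from, to)
                    - n1 * logDetCov(x, from, t)
                    - n2 * logDetCov(x, t, to));
    }
}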

A second pass over the signal fuses consecutive segments of the same speaker, from the start to the end of the recording. The measure employs ∆BIC with full-covariance Gaussians, as defined in equation 1 below.

BIC Clustering

The algorithm is based upon hierarchical agglomerative clustering. The initial set of clusters is composed of one segment per cluster. Each cluster is modeled by a Gaussian with a full covariance matrix. The ∆BIC measure is employed to select the candidate clusters to merge as well as to stop the merging process. The two closest clusters i and j are merged at each iteration, until ∆BIC_{i,j} > 0.

∆BIC is defined in equation 1. Let |Σi|, |Σj| and |Σ| be the determinants of the Gaussians associated with the clusters i, j and i + j. λ is a tunable parameter. The penalty factor P (eq. 2) depends on d, the dimension of the features, as well as on ni and nj, referring to the total lengths of cluster i and cluster j respectively.

\Delta BIC_{i,j} = \frac{n_i+n_j}{2} \log|\Sigma| - \frac{n_i}{2} \log|\Sigma_i| - \frac{n_j}{2} \log|\Sigma_j| - \lambda P   (eq. 1)

P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right) \log(n_i+n_j)   (eq. 2)

This penalty factor only takes the length of the two candidate clusters into account whereas the standard factor uses the length of the whole data.
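
Expressed in code, equations 1 and 2 look as follows (a sketch; the three log-determinants can be obtained with a routine such as logDetCov from the segmentation sketch above, and lambda is the tunable parameter of eq. 1):

class BicMeasures {
    /** Penalty of eq. 2: uses only the lengths of the two candidate clusters. */
    static double penalty(int d, int ni, int nj) {
        return 0.5 * (d + d * (d + 1) / 2.0) * Math.log(ni + nj);
    }

    /** Delta-BIC of eq. 1, from the log-determinants of the Gaussians of
     *  clusters i, j and of the merged cluster i+j. Clusters are merged
     *  while this value stays <= 0. */
    static double deltaBIC(double logDetMerged, double logDetI, double logDetJ,
                           int ni, int nj, int d, double lambda) {
        return 0.5 * ((ni + nj) * logDetMerged - ni * logDetI - nj * logDetJ)
             - lambda * penalty(d, ni, nj);
    }
}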

Segmentation based on Viterbi decoding

A Viterbi decoding is performed to generate a new segmentation. Each cluster is modeled by an HMM with only one state, represented by a GMM with 8 components (diagonal covariance). The GMM is learned by EM-ML over the segments of the cluster. The log-penalty for switching between two HMMs is fixed experimentally.
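
A compact sketch of this decoding is shown below (illustrative only): logLik[t][c] stands for the log-likelihood of frame t under the GMM of cluster c, and switchPenalty is the experimentally fixed log-penalty mentioned above.

class ViterbiDecoder {
    /** Viterbi decoding over one-state HMMs; returns the best cluster label
     *  per frame. Contiguous runs of one label form the new segments. */
    static int[] decode(double[][] logLik, double switchPenalty) {
        int T = logLik.length, C = logLik[0].length;
        double[][] score = new double[T][C];
        int[][] back = new int[T][C];
        score[0] = logLik[0].clone();
        for (int t = 1; t < T; t++)
            for (int c = 0; c < C; c++) {
                int best = c;                          // staying in c costs nothing
                double bestScore = score[t - 1][c];
                for (int p = 0; p < C; p++) {
                    double s = score[t - 1][p] - (p == c ? 0 : switchPenalty);
                    if (s > bestScore) { bestScore = s; best = p; }
                }
                score[t][c] = bestScore + logLik[t][c];
                back[t][c] = best;
            }
        // backtrack the best path from the best final state
        int[] path = new int[T];
        int c = 0;
        for (int k = 1; k < C; k++) if (score[T - 1][k] > score[T - 1][c]) c = k;
        for (int t = T - 1; t >= 0; t--) { path[t] = c; c = back[t][c]; }
        return path;
    }
}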

The segment boundaries produced by the Viterbi decoding are not perfect: for example, some of them fall within words. In order to avoid this, the boundaries are adjusted by applying a set of rules defined experimentally. They are moved slightly in order to be located in low energy regions. Long segments are also cut recursively at their points of lowest energy in order to yield segments shorter than 20 seconds.
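
The recursive cut can be sketched as follows (an illustration; energy[t] stands for the per-frame energy, e.g. the C0 coefficient, and maxLen is the 20-second limit expressed in frames):

import java.util.List;

class SegmentSplitter {
    /** Recursively cuts segment [from, to) at its lowest-energy frame until
     *  every piece is at most maxLen frames long. */
    static void split(double[] energy, int from, int to, int maxLen, List<int[]> out) {
        if (to - from <= maxLen) { out.add(new int[]{from, to}); return; }
        int cut = from + 1;
        for (int t = from + 1; t < to - 1; t++)
            if (energy[t] < energy[cut]) cut = t;    // lowest-energy interior frame
        split(energy, from, cut, maxLen, out);
        split(energy, cut, to, maxLen, out);
    }
}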

Speech detection

In order to remove music and jingle regions, a segmentation into speech / non-speech is obtained using a Viterbi decoding with 8 one-state HMMs. The eight models consist of 2 models of silence (wide and narrow band), 3 models of wide band speech (clean, over noise or over music), 1 model of narrow band speech, 1 model of jingles, and 1 model of music.

Each state is represented by a GMM with 64 diagonal components, trained by EM-ML on ESTER 1 data. The features are 12 MFCCs completed with ∆ coefficients (coefficient C0 is removed).
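
The ∆ coefficients are time derivatives of the cepstral features; a common regression estimate (a generic formula, not necessarily the exact Sphinx configuration) is:

class DeltaFeatures {
    /** Delta coefficients by linear regression over +/- w frames (w = 2 is
     *  a typical choice); frame indices are clamped at the signal edges. */
    static double[][] deltas(double[][] c, int w) {
        int T = c.length, d = c[0].length;
        double norm = 0;
        for (int k = 1; k <= w; k++) norm += 2.0 * k * k;
        double[][] delta = new double[T][d];
        for (int t = 0; t < T; t++)
            for (int i = 0; i < d; i++) {
                double s = 0;
                for (int k = 1; k <= w; k++) {
                    int hi = Math.min(T - 1, t + k), lo = Math.max(0, t - k);
                    s += k * (c[hi][i] - c[lo][i]);
                }
                delta[t][i] = s / norm;
            }
        return delta;
    }
}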

Gender and bandwidth detection

Detection of gender and bandwidth is done using a GMM (with 128 diagonal components) for each of the 4 combinations of gender (male / female) and bandwidth (narrow / wide band). Each cluster is labeled according to the characteristics of the GMM which maximizes likelihood over the features of the cluster.
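
The labeling step therefore reduces to an argmax over the four models (a sketch; the Gmm interface is a stand-in for the actual GMM scoring code):

interface Gmm {
    double logLikelihood(double[] frame);   // per-frame GMM log-likelihood
}

class GenderBandwidthLabeler {
    /** Returns the index of the model (among the 4 gender/bandwidth GMMs)
     *  that maximizes the total log-likelihood of the cluster's features. */
    static int label(double[][] clusterFeatures, Gmm[] models) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int m = 0; m < models.length; m++) {
            double score = 0;
            for (double[] frame : clusterFeatures)
                score += models[m].logLikelihood(frame);
            if (score > bestScore) { bestScore = score; best = m; }
        }
        return best;
    }
}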

Each model is learned from about one hour of speech extracted from the ESTER training corpus. The features are composed of 12 MFCCs and ∆ coefficients (C0 is removed). The features of the whole recording are warped using a 3-second sliding window, before the features of each cluster are normalized (centered and reduced).
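
The centering and reduction step is sketched below (the preceding warping step, which maps each coefficient's rank within the 3-second window through the inverse Gaussian CDF, is omitted here):

class FeatureNormalizer {
    /** Centers and reduces a cluster's features in place: per dimension,
     *  subtract the cluster mean and divide by the standard deviation. */
    static void centerAndReduce(double[][] x) {
        int T = x.length, d = x[0].length;
        for (int i = 0; i < d; i++) {
            double mean = 0, var = 0;
            for (double[] frame : x) mean += frame[i] / T;
            for (double[] frame : x) var += (frame[i] - mean) * (frame[i] - mean) / T;
            double std = Math.sqrt(Math.max(var, 1e-12));   // guard against zero variance
            for (double[] frame : x) frame[i] = (frame[i] - mean) / std;
        }
    }
}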

The diarization resulting from this step fits the needs of automatic speech recognition: the segments are shorter than 20 seconds; they contain the voice of only one speaker; and bandwidth and gender are known for each segment.

GMM-based speaker clustering

In the segmentation and clustering steps above, the features were used unnormalized in order to preserve information on the background environment, which helps differentiate between speakers. At this point, however, each cluster contains the voice of only one speaker, but several clusters can be related to the same speaker. The contribution of the background environment to the cluster models must be removed (through feature normalization) before a hierarchical agglomerative clustering is performed over the last diarization, in order to obtain a one-to-one relationship between clusters and speakers.

Thanks to the greater length of the speaker clusters resulting from the BIC hierarchical clustering, more robust and complex speaker models can be used for this step. A Universal Background Model (UBM), resulting from the fusion of the four gender- and bandwidth-dependent GMMs used earlier, serves as a base. The means of the UBM are adapted for each cluster to obtain the model of its speaker.
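
Mean-only adaptation typically follows the classical MAP (relevance-factor) update, sketched below for a single UBM component; the relevance factor tau is an assumption of this illustration, not a documented parameter of the toolkit.

class MapAdaptation {
    /** MAP update of the mean of one UBM component. gamma[t] is the
     *  posterior of this component for frame t (computed from the UBM);
     *  tau controls how far the mean may move away from the UBM. */
    static double[] adaptMean(double[] ubmMean, double[][] x, double[] gamma, double tau) {
        int d = ubmMean.length;
        double n = 0;
        double[] sum = new double[d];
        for (int t = 0; t < x.length; t++) {
            n += gamma[t];
            for (int i = 0; i < d; i++) sum[i] += gamma[t] * x[t][i];
        }
        double alpha = n / (n + tau);      // little data => stay close to the UBM
        double[] adapted = new double[d];
        for (int i = 0; i < d; i++)
            adapted[i] = alpha * (sum[i] / Math.max(n, 1e-12)) + (1 - alpha) * ubmMean[i];
        return adapted;
    }
}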

At each iteration, the two clusters that maximize a given measure are merged. The default measure is the Cross Entropy (CE/NCLR). The clustering stops when the measure gets higher than a threshold set a priori.
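
The loop itself is the same agglomerative scheme as in the BIC clustering, with the measure and the stopping rule swapped. The sketch below treats the measure as a dissimilarity (merge the best pair, stop once it crosses the threshold); flip the comparisons for a similarity-style score. The measure and merge arguments are stand-ins for the CE/NCLR computation and for pooling two clusters and re-adapting their model.

import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.ToDoubleBiFunction;

class AgglomerativeClustering {
    /** Merges the best pair of clusters at each iteration and stops once
     *  the best measure crosses the threshold. */
    static <C> void cluster(List<C> clusters, ToDoubleBiFunction<C, C> measure,
                            BinaryOperator<C> merge, double threshold) {
        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double m = measure.applyAsDouble(clusters.get(i), clusters.get(j));
                    if (m < best) { best = m; bi = i; bj = j; }
                }
            if (best > threshold) break;        // stopping rule
            C merged = merge.apply(clusters.get(bi), clusters.get(bj));
            clusters.remove(bj);                // bj > bi, so indices stay valid
            clusters.set(bi, merged);
        }
    }
}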