Diarization for ASR

This script performs a BIC diarization (ussally for ASR decoding)

The proposed diarization system was inspired by the system~:raw-latex:cite{Barras06} which won the RT‘04 fall evaluation and the ESTER,1 evaluation. It was developed during the ESTER,2 evaluation campaign for the transcription with the goal of minimizing word error rate.

Automatic transcription requires accurate segment boundaries. Segment boundaries have to be set within non-informative zones such as filler words.

Speaker diarization needs to produce homogeneous speech segments; however, purity and coverage of the speaker clusters are the main objectives here. Errors such as having two distinct clusters (i.e., detected speakers) corresponding to the same real speaker, or conversely, merging segments of two real speakers into only one cluster, get heavier penalty in the NIST time-based diarization metric than misplaced boundaries~:raw-latex:cite{NIST2004a}.

The system is composed of acoustic BIC segmentation followed with BIC hierarchical clustering. Viterbi decoding is performed to adjust the segment boundaries.

Music and jingle regions are not removed but a speech activity diarization could be load before to segment and cluster the show.

Optionally, long segments are cut to be shorter than 20 seconds.

%matplotlib inline

__license__ = "LGPL"
__author__ = "Sylvain Meignier"
__copyright__ = "Copyright 2015-2016 Sylvain Meignier"
__license__ = "LGPL"
__maintainer__ = "Sylvain Meignier"
__email__ = "sidekit@univ-lemans.fr"
__status__ = "Production"
__docformat__ = 'reStructuredText'

import argparse
import logging
import matplotlib
import copy

from matplotlib import pyplot as plot
from s4d.utils import *
from s4d.diar import Diar
from s4d import viterbi, segmentation
from sidekit import features_server
from s4d.clustering import hac_bic
from sidekit.sidekit_io import init_logging
from s4d.gui.dendrogram import plot_dendrogram

BIC diarization

Arguments, variables and logger

Set the log level

loglevel = logging.INFO

Set the input audio or mfcc file and the speech activity detection file (here not set).

data_dir = 'data'
show='20041219_1300_1314_RTM_ELDA'
input_show = os.path.join(data_dir, 'audio',show+'.sph')
#input_sad = None
input_sad = os.path.join(data_dir, 'sns',show+'.sns.seg')

Size of left or right windows (step 2)

win_size=250

Threshold for: * linear segmentation (step 3) * BIC HAC (step 4) * Vitterbi (step 5)

thr_l = 2
thr_h = 3
thr_vit = -250

If save_all is True then all produced diarization are saved

save_all = True

Set the logger options: logge information in console and file, set the level.

init_logging( level=loglevel)

Check if we work with an audio file or a mffc in spro4 format

input_fn = path_show_ext(input_show)
ffile = 'spro4'
if input_fn[2] in ['.sph', '.wav']:
    ffile = 'audio'

logging.info('type of input: '+ffile)
2015-12-09 13:41:52,909 - root - INFO - type of input: audio

Prepare various variables

wdir = os.path.join('out', show)

if not os.path.exists(wdir):
    os.makedirs(wdir)

Step 0 : MFCC

Get a Feature server instance

logging.info('Make MFCC')
fs = features_server.FeaturesServer(input_dir=input_fn[0],
                                    input_file_extension=input_fn[2],
                                    from_file=ffile,
                                    config='diar_16k')
2015-12-09 13:41:52,916 - root - INFO - Make MFCC

Load MFCC or compute MFCC from audio

logging.info('Load show: %s', show)
tcep, vad = fs.load(show)
cep = tcep[0]
2015-12-09 13:41:52,920 - root - INFO - Load show: 20041219_1300_1314_RTM_ELDA
2015-12-09 13:41:52,921 - root - INFO - read audio
2015-12-09 13:41:53,069 - root - INFO -  size of signal: 53.790894 len 14100960 type size 4
2015-12-09 13:41:53,666 - root - INFO - process part : 0.000000 881.310000 881.310000
2015-12-09 13:41:54,989 - root - INFO - no vad
2015-12-09 13:41:55,314 - root - INFO - keep log_e
2015-12-09 13:41:55,322 - root - INFO - !! size of signal cep: 8.740822 len 88129 type size 104
2015-12-09 13:41:55,327 - root - INFO - Smooth the labels and fuse the channels if more than one
2015-12-09 13:41:55,732 - root - INFO - no norm

Save the MFCC in spro4 format

mfcc_filename = os.path.join(wdir, show + '.pmfcc')
fs.save(show, mfcc_filename, 'spro4')
2015-12-09 13:41:55,737 - root - INFO - save spro4 format: out/20041219_1300_1314_RTM_ELDA/20041219_1300_1314_RTM_ELDA.pmfcc

Step 1

The initial diarization is loaded from a speech activity detection diarization (SAD) or a segment from the fist MFCC feature to the last MFCC feature is created.

logging.info('Check MFCC')

if input_sad is not None:
    init_diar = Diar.read_seg(input_sad)
    init_diar.pack(50)
else:
    init_diar = seg.init_seg(cep, show)

if save_all:
    init_filename = os.path.join(wdir, show + '.i.seg')
    Diar.write_seg(init_filename, init_diar)
2015-12-09 13:41:55,983 - root - INFO - Check MFCC

Step 2: gaussian divergence segmentation

First segmentation: Segment each segment of init_diar using the Gaussian Divergence method

logging.info('Make segmentation:')

#process every segment
seg_diar = Diar()
for seg in init_diar:
    l = seg.length()
    logging.debug('start: ', seg['start'],'end: ', seg['stop'], 'len: ', l)
    if l > 2*win_size:
        cep_seg = seg.seg_features(cep)
        tmp = segmentation.div_gauss(cep_seg, show=show, win=win_size, shift=seg['start'])
        seg_diar.append_diar(tmp)
    else:
        seg_diar.append_seg(seg)

i=0
for seg in seg_diar:
    seg['label'] = 'S'+str(i)
    i += 1

if save_all:
    seg_filename = os.path.join(wdir, show + '.s.seg')
    Diar.write_seg(seg_filename, seg_diar)
2015-12-09 13:41:55,998 - root - INFO - Make segmentation:

Step 3: linear BIC segmentation

This segmentation over the signal fuses consecutive segments of the same speaker from the start to the end of the record. The measure employs the \Delta BIC based on Bayesian Information Criterion , using full covariance Gaussians (see class gauss.GaussFull).

bicl_diar = segmentation.bic_linear(cep, seg_diar, thr_l, sr=False)
if save_all:
    bicl_filename = os.path.join(wdir, show + '.l.seg')
    Diar.write_seg(bicl_filename, bicl_diar)
2015-12-09 13:41:56,263 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,263 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,265 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,274 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,280 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,281 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,282 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,284 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,286 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,287 - root - WARNING - there is a hole between segment
2015-12-09 13:41:56,290 - root - WARNING - there is a hole between segment
147

Step 4: BIC HAC

Perform a BIC HAC

logging.info('Make clustering alpha: %f', thr_h)
bic = hac_bic.HAC_BIC(fs, bicl_diar, alpha=thr_h, sr=False)
bich_diar = bic.perform(to_the_end=True)
if save_all:
    bichac_filename = os.path.join(wdir, show + '.h.seg')
    Diar.write_seg(bichac_filename, bich_diar)

link, data = plot_dendrogram(bic.merge, 0, size=(25,6), log=True)
2015-12-09 13:41:56,298 - root - INFO - Make clustering alpha: 3.000000
/Users/meignier/pyenv3/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
../_images/diarization_31_1.png

Step 4: re-segmentation

Viterbi decoding * HMM is trained: one GMM per speaker, GMM has 8 component with diagonal covariance matrix. Only a penalty between state is fixed. * Emission is computer: likelyhood for each feature * a Viterbi decoding is performed

logging.info('Make viterbi penalties: %f', thr_vit)
hmm = viterbi.Viterbi(cep, bich_diar, exit_penalties=[thr_vit])
hmm.train()
hmm.emission()
vit_diar = hmm.decode(init_diar)
if save_all:
    vit_filename = os.path.join(wdir, show + '.d.seg')
    Diar.write_seg(vit_filename, vit_diar)
2015-12-09 13:41:56,734 - root - INFO - Make viterbi penalties: -250.000000

Step 5

Move the segment boundaries into low energy aeras.

# adj_table = seg.adjust(cep, vit_table)
    # if save_all:
    #    Table.write_seg(adj_filename, adj_table)
    # output_seg(args.output, adj_table)

Compute the diarization error rate

from s4d.scoring import DER
from s4d.gui.viewer import PlotDiar
from s4d.gui.viewer_utils import diar_diff

ref = Diar.read_seg(os.path.join(data_dir, 'ref', show + '.seg'))
uem = Diar.read_uem(os.path.join(data_dir, 'ref', show + '.uem'))

der = DER(vit_diar, ref, uem, collar=25, no_overlap=False)
der.confusion()
der.assignment()
res = der.error()
print(res.rate_header())
print(res.time())
print(res.rate())

diff_diar = diar_diff(vit_diar, ref)

p = PlotDiar(diff_diar, size=(25, 6))
p.draw()
 ||         sns          fa        miss |     speaker          fa        miss        conf ||
 ||    1762.62s      12.49s       3.89s |     827.56s      12.49s       3.89s     254.23s ||
 ||       0.93%       0.71%       0.22% |      32.70%       1.51%       0.47%      30.72% ||
uem from ref
append collar
/Users/meignier/pyenv3/lib/python3.5/site-packages/matplotlib/figure.py:1653: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "
../_images/diarization_37_2.png

Speaker diarization

MFCC for Speaker clustering

  • get 12 MFCC + Delta and normalize them
from sidekit.frontend.features import compute_delta
from s4d.utils import cep_sliding_norm

#print('baseline cep:', cep.shape)
cep12 = cep[:,1:]
#print('12 MFCC:', cep12.shape)


delta = compute_delta(cep12)
#print('12 delta:', delta.shape)

cep24 = np.column_stack((cep12, delta))
#print('12MFCC + 12 delta:', cep24.shape)


cep_sliding_norm(cep24, win=301, center=True, reduce=True)


#hack put the new cep24 in the feature server
fs.cep[0] = cep24
#print(fs.cep.shape)

CLR clustering

  • load UBM
from sidekit.mixture import Mixture

ubm = Mixture()
ubm_fn = os.path.join(data_dir, 'model', 'ubm128.gmm')
print(ubm_fn)
ubm.read_pickle(ubm_fn)
data/model/ubm128.gmm
  • perform CLR HAC clustreing
    • initialize HAC
    • compute models
    • compute distance
    • do clustreing
from s4d.clustering.hac_clr import HAC_CLR
thr_clr = -1.4
clr_hac = HAC_CLR(fs, vit_diar, ubm)
clr_hac.initial_models()
clr_hac.initial_distances()
Warning, some arguments are not named, computation might not be parallelized
No Parallel processing with this module
dist = copy.deepcopy(clr_hac.dist)
n=len(vit_diar.unique('label'))
#print(dist)
np.fill_diagonal(dist, np.min(dist))


# Plot the density map using nearest-neighbor interpolation
plot.figure(figsize=(10,10))
plot.pcolor(dist.T, cmap='RdBu')
plot.colorbar()
plot.show()
/Users/meignier/pyenv3/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
../_images/diarization_44_1.png
clr_diar = clr_hac.perform(thr_clr, to_the_end=True)

if save_all:
    clr_filename = os.path.join(wdir, show + '.clr.seg')
    Diar.write_seg(clr_filename, clr_diar)

link, data = plot_dendrogram(clr_hac.merge, thr_clr, size=(25,6), log=True)
/Users/meignier/pyenv3/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
../_images/diarization_45_1.png
  • Compute DER
der = DER(clr_diar, ref, uem, collar=25, no_overlap=False)
der.confusion()
der.assignment()
res = der.error()
print(show)
print(res.rate_header())
print(res.time())
print(res.rate())

diff_diar = diar_diff(clr_diar, ref)

p = PlotDiar(diff_diar, size=(25, 6))
p.draw()
20041219_1300_1314_RTM_ELDA
 ||         sns          fa        miss |     speaker          fa        miss        conf ||
 ||    1762.62s      12.49s       3.89s |     827.56s      12.49s       3.89s      25.94s ||
 ||       0.93%       0.71%       0.22% |       5.11%       1.51%       0.47%       3.13% ||
uem from ref
append collar
/Users/meignier/pyenv3/lib/python3.5/site-packages/matplotlib/figure.py:1653: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "
../_images/diarization_47_2.png

I-vector PLDA clustering

#matplotlib.use('TKAgg')

Graph and HAC

Licence

This file is part of S4D.

SD4 is a python package for speaker diarization based on SIDEKIT. S4D home page: http://www-lium.univ-lemans.fr/s4d/ SIDEKIT home page: http://www-lium.univ-lemans.fr/sidekit/

S4D is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

S4D is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with SIDEKIT. If not, see http://www.gnu.org/licenses/.

Copyright 2014-2015 Sylvain Meignier