vad

Copyright 2014-2020 Anthony Larcher and Sylvain Meignier

frontend provides methods to process an audio signal in order to extract useful parameters for speaker verification.

frontend.vad.label_fusion(label, win=3)

Apply morphological filtering to the labels to remove isolated labels. If the input is a two-channel label (a 2D ndarray of booleans whose two rows have the same length), the labels of the two channels are fused to remove overlapping segments of speech.

Parameters
  • label – input labels given in a 1D or 2D ndarray

  • win – parameter of the morphological filters
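The following minimal sketch shows one way to implement the filtering described above with SciPy's binary morphology. The function name `smooth_labels`, the closing-then-opening order, and the assumption that fusing two channels means discarding frames marked as speech on both channels are illustrative choices, not necessarily what SIDEKIT does.

```python
import numpy as np
from scipy import ndimage


def smooth_labels(label, win=3):
    """Hypothetical sketch of morphological label filtering (not SIDEKIT's code).

    label: 1-D boolean array (one channel) or 2-D boolean array of shape
           (2, n_frames) for a two-channel recording.
    win:   length, in frames, of the structuring element.
    """
    label = np.asarray(label, dtype=bool)
    was_1d = label.ndim == 1
    label = np.atleast_2d(label).copy()

    if label.shape[0] == 2:
        # Assumption: frames labelled as speech on both channels are overlaps
        # and are removed from both channels.
        overlap = label[0] & label[1]
        label &= ~overlap

    structure = np.ones(win, dtype=bool)
    for idx in range(label.shape[0]):
        # Closing fills gaps shorter than `win` frames; opening then removes
        # speech runs shorter than `win` frames. SciPy's default border
        # handling treats frames beyond the array ends as non-speech.
        closed = ndimage.binary_closing(label[idx], structure=structure)
        label[idx] = ndimage.binary_opening(closed, structure=structure)

    return label[0] if was_1d else label
```

With `win=3`, a one-frame gap inside a speech run is filled and a lone speech frame far from other speech is dropped.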

frontend.vad.pre_emphasis(input_sig, pre)

Pre-emphasis of an audio signal.

Parameters
  • input_sig – the input signal vector to pre-emphasize

  • pre – value that defines the pre-emphasis filter
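Pre-emphasis is usually a first-order high-pass filter, y[n] = x[n] - pre * x[n-1]. The sketch below assumes that form; the name `pre_emphasis_sketch` is hypothetical, and passing the first sample through unchanged (that is, taking x[-1] = 0) is an assumption that may differ from SIDEKIT's handling of the first sample.

```python
import numpy as np


def pre_emphasis_sketch(input_sig, pre):
    """Hypothetical first-order pre-emphasis: y[n] = x[n] - pre * x[n - 1].

    Assumption: the first sample is kept unchanged (x[-1] is taken as 0).
    """
    x = np.asarray(input_sig, dtype=float)
    y = x.copy()
    y[1:] -= pre * x[:-1]  # subtract the scaled previous sample
    return y
```

A coefficient around 0.97 is a common choice in speech front ends; values close to 1 boost high frequencies more strongly.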

frontend.vad.segment_axis(a, length, overlap=0, axis=None, end='cut', endvalue=0)

Generate a new array that chops the given array along the given axis into overlapping frames.

This function was written by Anne Archibald as part of the talkbox toolkit. Example:

>>> segment_axis(arange(10), 4, 2)
array([[0, 1, 2, 3],
       [2, 3, 4, 5],
       [4, 5, 6, 7],
       [6, 7, 8, 9]])

Parameters
  • a – the array to segment

  • length – the length of each frame

  • overlap – the number of array elements by which the frames should overlap

  • axis – the axis to operate on; if None, act on the flattened array

  • end – what to do with the last frame if the array is not evenly divisible into pieces. Options are:
      - ‘cut’: simply discard the extra values
      - ‘wrap’: copy values from the beginning of the array
      - ‘pad’: pad with a constant value

  • endvalue – the value to use for end=‘pad’

Returns

an ndarray

The array is not copied unless necessary (either because it is unevenly strided and being flattened or because end is set to ‘pad’ or ‘wrap’).
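For a 1-D signal, the framing described above can be sketched with NumPy's `sliding_window_view` (NumPy 1.20 or later). The name `frame_signal` is hypothetical and this is not the talkbox or SIDEKIT code: it always works on the flattened array (like `axis=None`) and supports only `end='cut'` and `end='pad'`, not `'wrap'`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def frame_signal(a, length, overlap=0, end="cut", endvalue=0):
    """Hypothetical 1-D sketch of overlapping framing.

    Frames start every (length - overlap) samples. With end="cut" the trailing
    samples that do not fill a frame are dropped; with end="pad" the signal is
    extended with `endvalue` so that they fall into one extra frame.
    """
    if end not in ("cut", "pad"):
        raise ValueError("this sketch only supports end='cut' or end='pad'")
    a = np.asarray(a).ravel()  # operate on the flattened array
    step = length - overlap
    if length <= 0 or step <= 0:
        raise ValueError("need length > 0 and overlap < length")

    if a.size < length:
        missing = length - a.size              # not even one complete frame
    else:
        leftover = (a.size - length) % step    # samples after the last full frame
        missing = (step - leftover) % step     # 0 when the frames fit exactly

    if missing:
        if end == "pad":
            a = np.concatenate([a, np.full(missing, endvalue, dtype=a.dtype)])
        elif a.size < length:                  # end == "cut", nothing to keep
            return np.empty((0, length), dtype=a.dtype)

    # Every `step`-th window of `length` consecutive samples.
    return sliding_window_view(a, length)[::step]
```

`frame_signal(np.arange(10), 4, 2)` reproduces the example above. The frames are returned as a read-only view, so call `.copy()` before modifying them in place.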

frontend.vad.speech_enhancement(X, Gain, NN)

This function processes a single file that is segmented by silence sections: when a silence section is detected, a counter of the number of buffers is set and pre-processing is required.

Usage: SpeechENhance(wavefilename, Gain, Noise_floor)

Parameters
  • X – input audio signal

  • Gain – default value is 0.9, suggested range 0.6 to 1.4; a higher value means more subtraction, i.e. stronger noise reduction

  • NN

Returns

a 1-dimensional array of boolean that is True for high energy frames.

Copyright 2014 Sun Han Wu and Anthony Larcher
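The description of Gain implies that the function subtracts a noise estimate from the signal. As an illustration only, the sketch below shows generic magnitude spectral subtraction, in which a gain factor scales the subtracted noise spectrum. The function name `spectral_subtraction`, the `floor` parameter, and the assumption that speech_enhancement works this way are all hypothetical.

```python
import numpy as np


def spectral_subtraction(frame_mag, noise_mag, gain=0.9, floor=0.01):
    """Hypothetical magnitude spectral subtraction (not SIDEKIT's code).

    frame_mag: (n_frames, n_bins) magnitude spectra of the noisy signal.
    noise_mag: (n_bins,) estimate of the noise magnitude spectrum.
    gain:      over-subtraction factor; larger values remove more noise.
    floor:     fraction of the noisy magnitude kept as a lower bound, so that
               over-subtraction never produces negative magnitudes.
    """
    frame_mag = np.asarray(frame_mag, dtype=float)
    noise_mag = np.asarray(noise_mag, dtype=float)
    cleaned = frame_mag - gain * noise_mag  # broadcast over all frames
    return np.maximum(cleaned, floor * frame_mag)
```

In a complete enhancer the magnitudes would come from a short-time Fourier transform of the signal, and the cleaned magnitudes would be recombined with the noisy phase and resynthesized by overlap-add.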

frontend.vad.vad_energy(log_energy, distrib_nb=3, nb_train_it=8, flooring=0.0001, ceiling=1.0, alpha=2)

Parameters
  • log_energy

  • distrib_nb

  • nb_train_it

  • flooring

  • ceiling

  • alpha

Returns

frontend.vad.vad_percentil(log_energy, percent)

Parameters
  • log_energy

  • percent

Returns

frontend.vad.vad_snr(sig, snr, fs=16000, shift=0.01, nwin=256)

Select high-energy frames based on the signal-to-noise ratio (SNR) of the signal. The input signal is expected to be encoded on 16 bits.

Parameters
  • sig – the input audio signal

  • snr – Signal to noise ratio to consider

  • fs – sampling frequency of the input signal in Hz. Default is 16000.

  • shift – shift between two frames in seconds. Default is 0.01

  • nwin – number of samples of the sliding window. Default is 256.
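The sketch below illustrates one simple interpretation of SNR-based frame selection with the parameters listed above: frames of `nwin` samples are taken every `shift * fs` samples, each frame's level is measured in dB, and a frame is kept when its level is within `snr` dB of the loudest frame. The name `snr_vad_sketch` and this thresholding rule are assumptions, not a description of SIDEKIT's implementation.

```python
import numpy as np


def snr_vad_sketch(sig, snr, fs=16000, shift=0.01, nwin=256):
    """Hypothetical SNR-based frame selection (not SIDEKIT's implementation).

    Assumption: a frame is kept when its level (in dB) is within `snr` dB of
    the loudest frame of the signal.
    """
    x = np.asarray(sig, dtype=float) / 32768.0  # 16-bit PCM -> [-1, 1)
    hop = int(round(shift * fs))                 # frame shift in samples
    if x.size < nwin:
        return np.zeros(0, dtype=bool)
    n_frames = 1 + (x.size - nwin) // hop
    starts = hop * np.arange(n_frames)
    frames = x[starts[:, None] + np.arange(nwin)]  # shape (n_frames, nwin)
    # Frame level in dB; the small constant avoids log10(0) on digital silence.
    level_db = 20.0 * np.log10(frames.std(axis=1) + 1e-10)
    return level_db > level_db.max() - snr
```

The result has one boolean per frame, True for frames selected as high energy.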