Common Parameters – LIUM SpkDiarization

Most of the tools take:

A diarization as input, from which they generate another diarization as output. The exceptions are the model trainers, which generate a model instead of a diarization.
A file containing acoustic vectors, or an audio file.

Diarization parameters

Input

Diarization file formats are explained in the Data section.

--sInputMask=<path> sets the <path> of the input diarization file. It can be:

an absolute path: --sInputMask=/home/myseg.seg
a relative path from the current directory: --sInputMask=seg/myseg.seg
a path in which %s is replaced by the show name parameter. For example, --sInputMask=seg/%s.seg with a show name equal to myseg is equivalent to --sInputMask=seg/myseg.seg.
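
The %s substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name resolve_mask is hypothetical, and it assumes a plain textual replacement of every %s by the show name, which may differ in detail from what the toolkit does internally.

```python
def resolve_mask(mask: str, show: str) -> str:
    """Return the path obtained by replacing each %s in the mask with the show name.

    Assumption: a plain textual replacement; a mask without %s is returned unchanged.
    """
    return mask.replace("%s", show)


if __name__ == "__main__":
    print(resolve_mask("seg/%s.seg", "myseg"))       # seg/myseg.seg
    print(resolve_mask("/home/myseg.seg", "myseg"))  # /home/myseg.seg
```

An absolute or relative path without %s passes through untouched, which matches the first two cases in the list.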

The parameter --sInputFormat describes the file. It takes two values separated by a comma: the first sets the file format and the second sets the charset. The supported file formats are:

seg corresponds to the file format of the toolkit;
bck corresponds to a LIUM file format close to a NIST CTM speech file;
ctl corresponds to the Sphinx control file format, with some extensions in the speaker name;
saus.seg is an extension in which sausage graphs are stored (deprecated);
seg.xml corresponds to an XML version of seg proposed in the ANR EPAC project (experimental use only);
media.xml corresponds to an XML version of seg proposed in the ANR PORT-MEDIA project (experimental use only).

The charset is a charset name defined in the JVM; see the Charset javadoc page. The most commonly used in France are ISO-8859-1 and UTF8.

To read a ctl file encoded in UTF8, the parameter is --sInputFormat=ctl,UTF8.
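
As an illustration of the "format,charset" convention, the sketch below splits such a value and checks the format against the list given in this section. The names parse_format and SUPPORTED_FORMATS are hypothetical, and the sketch assumes both values are mandatory; it does not validate the charset, since valid names are defined by the JVM.

```python
from typing import Tuple

# File formats listed in this section.
SUPPORTED_FORMATS = {"seg", "bck", "ctl", "saus.seg", "seg.xml", "media.xml"}


def parse_format(value: str) -> Tuple[str, str]:
    """Split a 'format,charset' value into its two parts.

    Assumption: both parts are required; the charset is returned as-is.
    """
    file_format, sep, charset = value.partition(",")
    if not sep or not charset:
        raise ValueError("expected two values separated by a comma: %r" % value)
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError("unsupported file format: %r" % file_format)
    return file_format, charset


if __name__ == "__main__":
    print(parse_format("ctl,UTF8"))  # ('ctl', 'UTF8')
```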

Output

The output diarization parameters are similar to the input ones:

--sInputMask becomes --sOutputMask=<path>
--sInputFormat becomes --sOutputFormat

Feature parameters

Input

The parameter --fInputMask defines the file to load. It can be:

an absolute path: --fInputMask=/home/myfile.mfcc
a relative path from the current directory: --fInputMask=file/myfile.wav
a path where %s is substituted by the show name parameter. For example, --fInputMask=./file/%s.sph with show name equal to myshow is equivalent to --fInputMask=./file/myshow.sph.

The parameter --fInputDesc=type[:deltatype][,s:e:ds:de:dds:dde,dim,c:r:wSize:method] contains 4 blocks separated by commas:

the type of file, with an optional delta type: type[:deltatype];
the description of the feature vector: s:e:ds:de:dds:dde;
the number of static parameters, plus energy, present on disk or computed on the fly: dim;
the normalization to apply to the features: c:r:wSize:method.
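
The four-block layout can be illustrated with a small Python splitter. This is a sketch, not the toolkit's parser: the function name split_feature_desc is hypothetical, the descriptor value used in the example is illustrative, and the sketch assumes that the three bracketed blocks are either all present or all absent.

```python
from typing import Dict, Optional


def split_feature_desc(desc: str) -> Dict[str, Optional[str]]:
    """Split a feature descriptor into its comma-separated blocks.

    Assumption: the value holds either the type block alone or all four blocks.
    """
    blocks = desc.split(",")
    if len(blocks) not in (1, 4):
        raise ValueError("expected 1 or 4 comma-separated blocks: %r" % desc)

    # First block: type, optionally followed by ':' and a delta type.
    file_type, _, delta_type = blocks[0].partition(":")
    result: Dict[str, Optional[str]] = {
        "type": file_type,
        "deltatype": delta_type or None,
        "vector": None,
        "dim": None,
        "normalization": None,
    }
    if len(blocks) == 4:
        result["vector"], result["dim"], result["normalization"] = blocks[1:]
    return result


if __name__ == "__main__":
    # Illustrative value only.
    print(split_feature_desc("sphinx,1:1:0:0:0:0,13,1:1:300:4"))
```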

The feature type can be:

sphinx: a Sphinx file (MFCC, PLP, etc.);
spro4: a SPro4 file (MFCC, LFCC, filter bank, etc.);
gztxt: a gzipped text file in which each line corresponds to a vector;
htk: an HTK file;
featureSetTransformation: only used in programming, to transform a feature set into another feature set, for example when you want to apply CMS to unnormalized features;
audio8kHz2sphinx, audio16kHz2sphinx, audio22kHz2sphinx, audio44kHz2sphinx, or audio48kHz2sphinx: a Sphere or Wave audio file (recorded at 8, 16, 22, 44 or 48 kHz) that is converted to MFCC using Sphinx 4;
audio2sphinx is no longer available.

The feature vector is described by s:e:ds:de:dds:dde, where:

s corresponds to the static values. If s equals:
- 0, the static values are not present on disk,
- 1, the static values are present,
- 3, the static values are present on disk but are removed after loading;
e corresponds to the energy; the value can also be 0, 1, 2 or 3;
ds corresponds to the delta; the value can also be 0, 1, 2 or 3;
de corresponds to the delta energy; the value can also be 0, 1, 2 or 3;
dds corresponds to the delta delta; the value can also be 0, 1, 2 or 3;
dde corresponds to the delta delta energy; the value can also be 0, 1, 2 or 3.
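
To make the six flags concrete, the following sketch decodes a value such as 1:3:0:0:0:0 into named fields. It is hedged in two ways: the function name describe_vector is hypothetical, and it assumes that the meanings listed for s (0, 1 and 3) carry over to the other fields. The value 2 is accepted but left unexplained, because this section does not define it.

```python
FIELDS = ("s", "e", "ds", "de", "dds", "dde")

# Meanings stated in this section for the static values; assumed here to
# apply to the other fields as well. The value 2 is not described.
MEANINGS = {
    0: "not present on disk",
    1: "present",
    3: "present on disk, removed after loading",
}


def describe_vector(spec: str) -> dict:
    """Map each of the six fields to a (flag, meaning) pair."""
    values = spec.split(":")
    if len(values) != len(FIELDS):
        raise ValueError("expected 6 colon-separated flags: %r" % spec)
    described = {}
    for name, raw in zip(FIELDS, values):
        flag = int(raw)
        if flag not in (0, 1, 2, 3):
            raise ValueError("flag %s must be 0, 1, 2 or 3, got %r" % (name, raw))
        described[name] = (flag, MEANINGS.get(flag, "meaning not given in this section"))
    return described


if __name__ == "__main__":
    for name, (flag, meaning) in describe_vector("1:3:0:0:0:0").items():
        print(name, flag, meaning)
```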

The feature deltaType can be:

sphinx: Sphinx-style delta and delta delta;
spro4: SPro4-style delta and delta delta;
htk: HTK-style delta and delta delta.

The normalization is controlled by 4 parameters c:r:wSize:method, where:

c controls cepstral mean subtraction (CMS): 0 means CMS is not applied, 1 means it is applied;
r controls variance normalization: the accepted values are 0 or 1;
wSize is tied to the normalization method: it gives the number of frames in the sliding window over which the normalization (CMS, variance or warping) is computed;
method indicates how to apply the normalization:
- mean and/or variance are computed on segment if the value is set to 0,
- mean and/or variance are computed on cluster if the value is set to 1,
- mean and/or variance are computed on sliding window if the value is set to 2,
- feature warping if the value is set to 3,
- feature warping followed by a CMS and/or a variance normalization on segment if the value is set to 4,
- feature mapping followed by a CMS and/or a variance normalization on cluster if the value is set to 5,
- feature warping followed by a CMS and/or a variance normalization on cluster if the value is set to 6.
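
The c:r:wSize:method block can be decoded as in the sketch below, which simply restates the list above as a lookup table. The names Normalization and parse_normalization are hypothetical, the example value 1:1:300:4 is illustrative, and the sketch assumes all four fields are always given.

```python
from dataclasses import dataclass

# Method codes as listed in this section.
METHODS = {
    0: "mean and/or variance computed on segment",
    1: "mean and/or variance computed on cluster",
    2: "mean and/or variance computed on sliding window",
    3: "feature warping",
    4: "feature warping, then CMS and/or variance normalization on segment",
    5: "feature mapping, then CMS and/or variance normalization on cluster",
    6: "feature warping, then CMS and/or variance normalization on cluster",
}


@dataclass
class Normalization:
    cms: bool          # c: cepstral mean subtraction applied or not
    variance: bool     # r: variance normalization applied or not
    window_size: int   # wSize: number of frames in the sliding window
    method: int        # method code, see METHODS

    @property
    def method_description(self) -> str:
        return METHODS[self.method]


def parse_normalization(spec: str) -> Normalization:
    """Decode a 'c:r:wSize:method' block (assumes all four fields are present)."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError("expected c:r:wSize:method, got %r" % spec)
    c, r, w_size, method = (int(p) for p in parts)
    if c not in (0, 1) or r not in (0, 1):
        raise ValueError("c and r must be 0 or 1")
    if method not in METHODS:
        raise ValueError("unknown method code: %d" % method)
    return Normalization(bool(c), bool(r), w_size, method)


if __name__ == "__main__":
    norm = parse_normalization("1:1:300:4")
    print(norm, "->", norm.method_description)
```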

Feature warping^[1] and feature mapping^[2] are classical normalization methods used in speaker verification systems. For details, see the references at the end of this page.

Output

The output feature parameters are similar to the input ones:

--fInputMask becomes --fOutputMask,
--fInputDesc becomes --fOutputDesc.

1. J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” in Proc. ISCA Workshop on Speaker Recognition – 2001: A Speaker Odyssey, June 2001.

2. D. Reynolds, “Channel robust speaker verification via feature mapping,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 2003, pp. II–53–6.