Common Parameters

Most of the tools take:

  • A diarization as input and generate another diarization as output. The exceptions are model trainers, which generate a model instead of a diarization.
  • A file containing acoustic vectors, or an audio file.

Diarization parameters

Input

Diarization file formats are explained in the Data section.

--sInputMask=<path> sets the <path> of the input diarization file. It can be:

  • an absolute path: --sInputMask=/home/myseg.seg
  • a relative path from the current directory: --sInputMask=seg/myseg.seg
  • a path in which %s is replaced by the show name parameter. For example, --sInputMask=seg/%s.seg with a show name equal to myseg is equivalent to --sInputMask=seg/myseg.seg.
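
The %s substitution described above can be sketched as follows. This is only an illustration of the rule, not the toolkit's own code; the function name resolve_mask is hypothetical.

```python
def resolve_mask(mask: str, show: str) -> str:
    """Replace each %s in a mask with the show name (illustrative sketch)."""
    return mask.replace("%s", show)


# A mask with %s depends on the show name.
print(resolve_mask("seg/%s.seg", "myseg"))       # seg/myseg.seg
# A mask without %s is used unchanged.
print(resolve_mask("/home/myseg.seg", "myseg"))  # /home/myseg.seg
```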

The parameter --sInputFormat describes the file. It takes two values separated by a comma: the first sets the file format and the second sets the charset. The supported file formats are:

  • seg corresponds to the file format of the toolkit;
  • bck corresponds to a LIUM file format close to a NIST CTM speech file;
  • ctl corresponds to the sphinx control file format with some extension in the speaker name;
  • saus.seg is an extension in which sausage graphs are stored (deprecated);
  • seg.xml corresponds to an XML version of seg proposed in the ANR EPAC project (experimental use only);
  • media.xml corresponds to an XML version of seg proposed in the ANR PORT-MEDIA project (experimental use only).

The charset corresponds to a charset name defined in the JVM; see the Charset javadoc page. The most commonly used in France are ISO-8859-1 and UTF8.

To read a ctl file encoded in UTF8, the parameter is --sInputFormat=ctl,UTF8.
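
As a rough illustration of the format,charset pair, the sketch below splits such a value and checks both halves. The function name parse_format is hypothetical, and Python's codec lookup only stands in for the JVM charset lookup, so the set of accepted charset names may differ from what the toolkit accepts.

```python
import codecs

# File formats listed in this section.
FORMATS = {"seg", "bck", "ctl", "saus.seg", "seg.xml", "media.xml"}


def parse_format(value: str) -> tuple[str, str]:
    """Split 'format,charset' and validate both parts (illustrative sketch)."""
    fmt, sep, charset = value.partition(",")
    if not sep or not charset:
        raise ValueError("expected two values separated by a comma")
    if fmt not in FORMATS:
        raise ValueError(f"unknown file format: {fmt}")
    # Stand-in for the JVM charset lookup; raises LookupError if unknown.
    codecs.lookup(charset)
    return fmt, charset


print(parse_format("seg,ISO-8859-1"))  # ('seg', 'ISO-8859-1')
```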

Output

The output diarization parameters are similar to the input ones:

  • --sInputMask becomes --sOutputMask=<path>
  • --sInputFormat becomes --sOutputFormat

Feature parameters

Input

The parameter --fInputMask defines the file to load. It can be:

  • an absolute path: --fInputMask=/home/myfile.mfcc
  • a relative path from the current directory: --fInputMask=file/myfile.wav
  • a path where %s is substituted by the show name parameter. For example, --fInputMask=./file/%s.sph with show name equal to myshow is equivalent to --fInputMask=./file/myshow.sph.

The parameter --fInputDesc=type[:deltatype][,s:e:ds:de:dds:dde,dim,c:r:wSize:method] contains four blocks separated by commas:

  • the type of file: type[:deltatype];
  • the description of the feature vector: s:e:ds:de:dds:dde;
  • the number of static parameters and energy present on disk or computed on the fly: dim;
  • the normalization to apply to the features: c:r:wSize:method.
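
To make the four blocks concrete, here is a minimal sketch that splits a descriptor string into its parts. The function name parse_feature_desc and the example values are hypothetical; the toolkit's own parser may handle missing or malformed blocks differently.

```python
VECTOR_FIELDS = ("s", "e", "ds", "de", "dds", "dde")
NORM_FIELDS = ("c", "r", "wSize", "method")


def parse_feature_desc(desc: str) -> dict:
    """Split type[:deltatype][,vector,dim,norm] into a dict (illustrative)."""
    blocks = desc.split(",")
    # Block 1: file type, optionally followed by ':' and the delta type.
    ftype, _, delta_type = blocks[0].partition(":")
    result = {"type": ftype, "deltaType": delta_type or None}
    # Block 2: one integer flag per component of the feature vector.
    if len(blocks) > 1:
        flags = [int(v) for v in blocks[1].split(":")]
        result["vector"] = dict(zip(VECTOR_FIELDS, flags))
    # Block 3: number of static parameters and energy.
    if len(blocks) > 2:
        result["dim"] = int(blocks[2])
    # Block 4: normalization settings.
    if len(blocks) > 3:
        values = [int(v) for v in blocks[3].split(":")]
        result["norm"] = dict(zip(NORM_FIELDS, values))
    return result


# Hypothetical descriptor, chosen only to exercise all four blocks.
print(parse_feature_desc("sphinx,1:1:0:0:0:0,13,1:1:300:4"))
```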

The feature type can be:

  • sphinx a sphinx file (mfcc, plp, etc);
  • spro4 a spro4 file (mfcc, lfcc, filter bank, etc);
  • gztxt a gzipped text file in which each line corresponds to a vector;
  • htk a htk file;
  • featureSetTransformation only used in programming to transform a feature set into another feature set, for example when you want to apply CMS to unnormalized features;
  • audio8kHz2sphinx, audio16kHz2sphinx, audio22kHz2sphinx, audio44kHz2sphinx or audio48kHz2sphinx converts a sphere or wave audio file (recorded at 8, 16, 22, 44 or 48 kHz) to mfcc using sphinx 4;
  • audio2sphinx is no longer available.

The feature vector is described by s:e:ds:de:dds:dde, where:

  • s corresponds to the static values; if s equals:
    • 0 the static values are not present on disk,
    • 1 the static values are present,
    • 3 the static values are present on disk but are removed after loading;
  • e corresponds to the energy; the value can also be [0, 1, 2, 3];
  • ds corresponds to the delta; the value can also be [0, 1, 2, 3];
  • de corresponds to the delta energy; the value can also be [0, 1, 2, 3];
  • dds corresponds to the delta delta; the value can also be [0, 1, 2, 3];
  • dde corresponds to the delta delta energy; the value can also be [0, 1, 2, 3].
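
The sketch below decodes an s:e:ds:de:dds:dde string into readable labels. It assumes that the meanings given for s (0, 1 and 3) carry over to the other five fields, which this section implies but does not spell out; value 2 is not described here, so the sketch reports it as such. The name describe_vector is hypothetical.

```python
FIELDS = ("s", "e", "ds", "de", "dds", "dde")

# Meanings stated in this section for s, assumed to apply to every field.
MEANING = {
    0: "not present on disk",
    1: "present",
    3: "present on disk, removed after loading",
}


def describe_vector(desc: str) -> dict:
    """Map each field of s:e:ds:de:dds:dde to a readable label."""
    values = [int(v) for v in desc.split(":")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return {
        name: MEANING.get(value, "not described in this section")
        for name, value in zip(FIELDS, values)
    }


for name, label in describe_vector("1:3:1:0:0:0").items():
    print(f"{name}: {label}")
```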

The feature deltaType can be:

  • sphinx a sphinx style delta and delta delta;
  • spro4 a spro4 style delta and delta delta;
  • htk a htk style delta and delta delta.

The normalization is controlled by 4 parameters c:r:wSize:method, where:

  • c corresponds to the cepstral mean subtraction (CMS): 0 means that CMS is not applied, whereas 1 means that CMS is applied;
  • r corresponds to the variance normalization; the allowed values are 0 and 1;
  • wSize is linked to the normalization method; the value corresponds to the number of frames in the sliding window on which the normalization is computed (CMS, variance or warping);
  • method indicates how to apply the normalization:
    • mean and/or variance are computed on segment if the value is set to 0,
    • mean and/or variance are computed on cluster if the value is set to 1,
    • mean and/or variance are computed on sliding window if the value is set to 2,
    • feature warping if the value is set to 3,
    • feature warping followed by a CMS and/or a variance normalization on segment if the value is set to 4,
    • feature mapping followed by a CMS and/or a variance normalization on cluster if the value is set to 5,
    • feature warping followed by a CMS and/or a variance normalization on cluster if the value is set to 6.
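
As an illustration of the simplest case (method 0, statistics computed on a segment), the sketch below applies CMS and variance normalization to a list of feature vectors. It is not the toolkit's implementation: the function name normalize_segment, the use of the population standard deviation, and the behaviour when r is set without c are all assumptions. Sliding-window normalization, feature warping and feature mapping are not covered.

```python
from statistics import fmean, pstdev


def normalize_segment(frames, c=1, r=0):
    """Normalize the frames of one segment (illustrative sketch).

    frames: list of equal-length feature vectors (lists of floats).
    c: 1 to subtract the per-dimension mean (CMS), 0 to skip it.
    r: 1 to divide by the per-dimension standard deviation, 0 to skip it.
    """
    if not frames:
        return []
    columns = list(zip(*frames))  # one tuple per feature dimension
    means = [fmean(col) if c else 0.0 for col in columns]
    # A zero standard deviation is replaced by 1.0 to avoid dividing by zero.
    scales = [(pstdev(col) or 1.0) if r else 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(frame, means, scales)]
        for frame in frames
    ]


segment = [[1.0, 2.0], [3.0, 6.0]]
print(normalize_segment(segment, c=1, r=0))
print(normalize_segment(segment, c=1, r=1))
```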

Feature warping [1] and feature mapping [2] are classical normalization methods used in speaker verification systems.

Output

The output feature parameters are similar to the input ones:

  • --fInputMask is --fOutputMask,
  • --fInputDesc is --fOutputDesc.


1. J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” in Proc. ISCA Workshop on Speaker Recognition – 2001: A Speaker Odyssey, June 2001.
2. D. Reynolds, “Channel robust speaker verification via feature mapping,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 2003, pp. II–53–6.