Data

The three main kinds of data in LIUM_SpkDiarization are:

Diarization

The diarization file is the most important file in the toolkit. All programs are driven by a segmentation file, and most of them also generate a segmentation file (the trainers generate GMMs instead).

Segment format

The format for diarization files is close to the MDTM or STM NIST format. Each line corresponds to a segment.
Example:

19981217_0700_0800_inter_fm_dga 1 1 317 F S U spk0
  • field 1: 19981217_0700_0800_inter_fm_dga = the show name
  • field 2: 1 = the channel number
  • field 3: 1 = the start of the segment (in features)
  • field 4: 317 = the length of the segment (in features)
  • field 5: F = the speaker gender (U=unknown, F=female, M=male)
  • field 6: S = the type of band (T=telephone, S=studio)
  • field 7: U = the type of environment (music, speech only, …)
  • field 8: spk0 = the speaker label
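
The eight fields above map naturally onto a small record type. The following Python sketch is illustrative only and is not part of the toolkit: the names `Segment` and `parse_segment` are hypothetical, and the conversion to seconds assumes 100 features per second, which this section does not state.

```python
from dataclasses import dataclass

# Assumption: 100 features per second (10 ms step); not stated in this section.
FEATURES_PER_SECOND = 100


@dataclass
class Segment:
    show: str         # field 1: show name
    channel: int      # field 2: channel number
    start: int        # field 3: start of the segment, in features
    length: int       # field 4: length of the segment, in features
    gender: str       # field 5: U, F or M
    band: str         # field 6: T or S
    environment: str  # field 7: type of environment
    speaker: str      # field 8: speaker label

    @property
    def end(self) -> int:
        """Index of the first feature after the segment."""
        return self.start + self.length


def parse_segment(line: str) -> Segment:
    """Parse one whitespace-separated line of a diarization file."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    show, channel, start, length, gender, band, env, speaker = fields
    return Segment(show, int(channel), int(start), int(length),
                   gender, band, env, speaker)


seg = parse_segment("19981217_0700_0800_inter_fm_dga 1 1 317 F S U spk0")
print(seg.speaker, seg.start, seg.end, seg.length / FEATURES_PER_SECOND)
```

Splitting on arbitrary whitespace also handles files whose columns are padded with extra spaces for alignment.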

Diarization file

The next example shows the diarization of one show, 20071218_1900_1920_inter, composed of:

  • 7 clusters (i.e. speakers) named: S0, S1, S143, S3, S28, S12, S11;
  • 14 segments.
20071218_1900_1920_inter.seg
20071218_1900_1920_inter 1 0    322   M S U S0
20071218_1900_1920_inter 1 322  680   F S U S1
20071218_1900_1920_inter 1 1148 371   M S U S143
20071218_1900_1920_inter 1 1772 310   F S U S3
20071218_1900_1920_inter 1 2082 318   F S U S28
20071218_1900_1920_inter 1 2495 1570  M S U S12
20071218_1900_1920_inter 1 4065 628   M S U S12
20071218_1900_1920_inter 1 4693 841   M S U S11
20071218_1900_1920_inter 1 5610 1394  M S U S12
20071218_1900_1920_inter 1 7004 644   M S U S11
20071218_1900_1920_inter 1 7706 1342  M S U S12
20071218_1900_1920_inter 1 9048 1385  M S U S12
20071218_1900_1920_inter 1 10433 1060 M S U S12
20071218_1900_1920_inter 1 11493 1651 M S U S12
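
Because the cluster name is simply the last field of each line, the clusters of a diarization can be recovered by grouping lines on that field. The sketch below is a hedged illustration in Python, not toolkit code; `cluster_lengths` is a hypothetical helper, and the sample is a subset of the lines listed above.

```python
from collections import defaultdict


def cluster_lengths(seg_lines):
    """Sum the segment lengths (in features) of each cluster."""
    totals = defaultdict(int)
    for line in seg_lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        length, cluster = int(fields[3]), fields[7]
        totals[cluster] += length
    return dict(totals)


# A subset of the segments of 20071218_1900_1920_inter.seg
SAMPLE = """\
20071218_1900_1920_inter 1 0    322   M S U S0
20071218_1900_1920_inter 1 322  680   F S U S1
20071218_1900_1920_inter 1 2495 1570  M S U S12
20071218_1900_1920_inter 1 4065 628   M S U S12
20071218_1900_1920_inter 1 4693 841   M S U S11
"""

totals = cluster_lengths(SAMPLE.splitlines())
for cluster, length in sorted(totals.items()):
    print(cluster, length)
```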

Access to several shows

A diarization file can draw data from several shows. This is very useful in a batch mode context (training of GMMs, computing log-likelihood ratios, cross-show diarization, etc.) or to perform a diarization over a collection of audio files (this has not been thoroughly tested, but it should work). The sample shows:

  • 3 shows named L42, R45 and R50;
  • 2 clusters (i.e. female and male telephone background models) named FT and MT;
  • 12 segments.
multi_show.seg
L42 1 1267 102 M T U MT
L42 1 1375 146 M T U MT
L42 1 1597 237 F T U FT
R45 1 2639 187 F T U FT
R45 1 2834 119 M T U MT
R45 1 3152 347 M T U MT
R45 1 19082 182 F T U FT
R45 1 19961 143 M T U MT
R45 1 20170 103 M T U MT
R50 1 2563 210 F T U FT
R50 1 3179 268 M T U MT
R50 1 5003 298 M T U MT
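
When a file spans several shows, a batch program typically needs to know which segments belong to which show before it can fetch the matching features. One possible way to organise this, sketched in Python under the assumption that the first field alone identifies the show (`group_by_show` is a hypothetical name, not a toolkit function):

```python
from collections import defaultdict


def group_by_show(seg_lines):
    """Map show name -> cluster name -> list of (start, length) in features."""
    shows = defaultdict(lambda: defaultdict(list))
    for line in seg_lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        show, _channel, start, length, _gender, _band, _env, cluster = fields
        shows[show][cluster].append((int(start), int(length)))
    return shows


# A few lines taken from multi_show.seg
sample = [
    "L42 1 1267 102 M T U MT",
    "L42 1 1597 237 F T U FT",
    "R45 1 2639 187 F T U FT",
    "R50 1 3179 268 M T U MT",
]

by_show = group_by_show(sample)
for show in sorted(by_show):
    print(show, dict(by_show[show]))
```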

Other file formats

seg.xml

seg.xml was developed during the ANR EPAC project. It is an XML format containing:

  • the diarization information, given by <segment> … </segment> and <speaker />;
  • the raw list of words, given by <text> … </text>;
  • word graphs (the 1-best hypothesis is also encoded as a graph): <graph> … </graph>;
  • named entities: <entity> … </entity>.
20040108_0655_0915_CULTURE_ELDA.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<epac>
	<tools>
		<tool type="speaker diarization" name="REF" version="" date="Wed Jan 27 09:21:33 2010"/>
		<tool type="word transcription" name="REF" version="" date="Wed Jan 27 09:21:33 2010"/>
		<tool type="ne detector" name="LINA nemesis EN detector toolkit" version="X" date="Wed Jan 27 09:21:33 2010"/>
	</tools>
	<audiofile name="20040108_0655_0915_CULTURE_ELDA" >
		<speakers>
			<speaker name="Thierry_Watlé" identity="" type="generic label" gender="M" generator="auto"/>
			<speaker name="Éric_Meyer" identity="" type="generic label" gender="M" generator="auto"/>
			<speaker name="Gérard_Prunier" identity="" type="generic label" gender="M" generator="auto"/>
			<speaker name="Alain_Gérard_Slama" identity="" type="generic label" gender="M" generator="auto"/>
			<speaker name="Stéphane_Braunschweig" identity="" type="generic label" gender="M" generator="auto"/>
			<speaker name="Nicolas_Demorand" identity="" type="generic label" gender="M" generator="auto"/>
		</speakers>
		<segments>
			<segment start="26.070000" end="30.116000" bandwidth="S" speaker="Nicolas_Demorand" generator="auto">
				<text generator="auto">Nicolas Demorand bonjour à tous bienvenue sur France Culture il est six heures cinquante huit </text>
				<graph id="0" type="1-best" generator="auto">
					<link id="0" start="0" end="1" type="wtoken" probability="0">Nicolas</link>
					<link id="1" start="1" end="2" type="wtoken" probability="0">Demorand</link>
					<link id="2" start="2" end="3" type="wtoken" probability="0">bonjour</link>
					<link id="3" start="3" end="4" type="wtoken" probability="0">à</link>
					<link id="4" start="4" end="5" type="wtoken" probability="0">tous</link>
					<link id="5" start="5" end="6" type="wtoken" probability="0">bienvenue</link>
					<link id="6" start="6" end="7" type="wtoken" probability="0">sur</link>
					<link id="7" start="7" end="8" type="wtoken" probability="0">France</link>
					<link id="8" start="8" end="9" type="wtoken" probability="0">Culture</link>
					<link id="9" start="9" end="10" type="wtoken" probability="0">il</link>
					<link id="10" start="10" end="11" type="wtoken" probability="0">est</link>
					<link id="11" start="11" end="12" type="wtoken" probability="0">six</link>
					<link id="12" start="12" end="13" type="wtoken" probability="0">heures</link>
					<link id="13" start="13" end="14" type="wtoken" probability="0">cinquante</link>
					<link id="14" start="14" end="15" type="wtoken" probability="0">huit</link>
				</graph>
				<entities generator="auto">
					<entity type="pers" >
						<path graph="0" link="0" />
						<path graph="0" link="1" />
					</entity>
					<entity type="org" >
						<path graph="0" link="7" />
						<path graph="0" link="8" />
					</entity>
					<entity type="time" >
						<path graph="0" link="11" />
						<path graph="0" link="12" />
						<path graph="0" link="13" />
						<path graph="0" link="14" />
					</entity>
				</entities>
			</segment>
			<segment start="37.374000" end="48.894000" bandwidth="S" speaker="Nicolas_Demorand" generator="auto">
				<text generator="auto">le texte de loi de la loi sur la laïcité est actuellement scruté à la loupe par le conseil d' État il sera composé de trois articles très courts rédigés par Luc Ferry </text>
				<graph id="0" type="1-best" generator="auto">
					<link id="0" start="0" end="1" type="wtoken" probability="0">le</link>
					<link id="1" start="1" end="2" type="wtoken" probability="0">texte</link>
					<link id="2" start="2" end="3" type="wtoken" probability="0">de</link>
					<link id="3" start="3" end="4" type="wtoken" probability="0">loi</link>
					<link id="4" start="4" end="5" type="wtoken" probability="0">de</link>
					<link id="5" start="5" end="6" type="wtoken" probability="0">la</link>
					<link id="6" start="6" end="7" type="wtoken" probability="0">loi</link>
					<link id="7" start="7" end="8" type="wtoken" probability="0">sur</link>
					<link id="8" start="8" end="9" type="wtoken" probability="0">la</link>
					<link id="9" start="9" end="10" type="wtoken" probability="0">laïcité</link>
					<link id="10" start="10" end="11" type="wtoken" probability="0">est</link>
					<link id="11" start="11" end="12" type="wtoken" probability="0">actuellement</link>
					<link id="12" start="12" end="13" type="wtoken" probability="0">scruté</link>
					<link id="13" start="13" end="14" type="wtoken" probability="0">à</link>
					<link id="14" start="14" end="15" type="wtoken" probability="0">la</link>
					<link id="15" start="15" end="16" type="wtoken" probability="0">loupe</link>
					<link id="16" start="16" end="17" type="wtoken" probability="0">par</link>
					<link id="17" start="17" end="18" type="wtoken" probability="0">le</link>
					<link id="18" start="18" end="19" type="wtoken" probability="0">conseil</link>
					<link id="19" start="19" end="20" type="wtoken" probability="0">d'</link>
					<link id="20" start="20" end="21" type="wtoken" probability="0">État</link>
					<link id="21" start="21" end="22" type="wtoken" probability="0">il</link>
					<link id="22" start="22" end="23" type="wtoken" probability="0">sera</link>
					<link id="23" start="23" end="24" type="wtoken" probability="0">composé</link>
					<link id="24" start="24" end="25" type="wtoken" probability="0">de</link>
					<link id="25" start="25" end="26" type="wtoken" probability="0">trois</link>
					<link id="26" start="26" end="27" type="wtoken" probability="0">articles</link>
					<link id="27" start="27" end="28" type="wtoken" probability="0">très</link>
					<link id="28" start="28" end="29" type="wtoken" probability="0">courts</link>
					<link id="29" start="29" end="30" type="wtoken" probability="0">rédigés</link>
					<link id="30" start="30" end="31" type="wtoken" probability="0">par</link>
					<link id="31" start="31" end="32" type="wtoken" probability="0">Luc</link>
					<link id="32" start="32" end="33" type="wtoken" probability="0">Ferry</link>
				</graph>
				<entities generator="auto">
					<entity type="org" >
						<path graph="0" link="18" />
						<path graph="0" link="19" />
						<path graph="0" link="20" />
					</entity>
					<entity type="pers" >
						<path graph="0" link="31" />
						<path graph="0" link="32" />
					</entity>
				</entities>
			</segment>
		</segments>
	</audiofile>
</epac>
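
In the sample above, each <entity> refers to words indirectly, through <path> elements that name a graph id and a link id. The Python sketch below shows one way to resolve those references with the standard library; it is an illustration based only on the sample file, not a description of how the toolkit reads seg.xml. `read_segments` is a hypothetical helper, and the embedded document is a trimmed copy of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above (XML declaration omitted for brevity).
SAMPLE = """<epac>
  <audiofile name="20040108_0655_0915_CULTURE_ELDA">
    <segments>
      <segment start="26.070000" end="30.116000" bandwidth="S" speaker="Nicolas_Demorand" generator="auto">
        <graph id="0" type="1-best" generator="auto">
          <link id="0" start="0" end="1" type="wtoken" probability="0">Nicolas</link>
          <link id="1" start="1" end="2" type="wtoken" probability="0">Demorand</link>
          <link id="2" start="2" end="3" type="wtoken" probability="0">bonjour</link>
        </graph>
        <entities generator="auto">
          <entity type="pers">
            <path graph="0" link="0"/>
            <path graph="0" link="1"/>
          </entity>
        </entities>
      </segment>
    </segments>
  </audiofile>
</epac>"""


def read_segments(root):
    """Return one dict per <segment>, with entity paths resolved to words."""
    result = []
    for seg in root.iter("segment"):
        # Index every word by (graph id, link id) so paths can be looked up.
        words = {}
        for graph in seg.iter("graph"):
            for link in graph.iter("link"):
                words[(graph.get("id"), link.get("id"))] = link.text
        entities = []
        for entity in seg.iter("entity"):
            tokens = [words[(p.get("graph"), p.get("link"))]
                      for p in entity.iter("path")]
            entities.append((entity.get("type"), " ".join(tokens)))
        result.append({
            "speaker": seg.get("speaker"),
            "start": float(seg.get("start")),  # seconds in the sample
            "end": float(seg.get("end")),
            "entities": entities,
        })
    return result


segments = read_segments(ET.fromstring(SAMPLE))
print(segments[0]["speaker"], segments[0]["entities"])
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`; the parser then honours the encoding named in the XML declaration.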

CTL

This is a modified version of the Sphinx CTL file. The speaker name, which is the last field, is composed of 6 sub-fields separated by -. For example, 20041006_0800_0900_CULTURE-287.50-348.38-unk-M-20041006_0800_0900_CULTURE_Jean-François_Aki is composed of:

  • the name of the show: 20041006_0800_0900_CULTURE,
  • the start of the segment in seconds: 287.50,
  • the end of the segment in seconds: 348.38,
  • the bandwidth (available values are S, T and unk, standing for studio, telephone and unknown): unk,
  • the gender of the speaker (M, F or unk): M,
  • the name of the speaker prefixed by the name of the show: 20041006_0800_0900_CULTURE_Jean-François_Aki.
20041006_0800_0900_CULTURE.ctl
20041006_0800_0900_CULTURE 28750 34838 20041006_0800_0900_CULTURE-287.50-348.38-unk-M-20041006_0800_0900_CULTURE_Jean-François_Aki
20041006_0800_0900_CULTURE 92996 121972 20041006_0800_0900_CULTURE-929.96-1219.72-unk-M-Alexandre_Adler
20041006_0800_0900_CULTURE 48446 49552 20041006_0800_0900_CULTURE-484.46-495.52-unk-M-Axel_Urgin
20041006_0800_0900_CULTURE 82258 88267 20041006_0800_0900_CULTURE-822.58-882.67-unk-M-Christophe_Champin
20041006_0800_0900_CULTURE 0 4587 20041006_0800_0900_CULTURE-0.00-45.87-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 4968 10053 20041006_0800_0900_CULTURE-49.68-100.53-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 17011 23370 20041006_0800_0900_CULTURE-170.11-233.70-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 27535 28749 20041006_0800_0900_CULTURE-275.35-287.49-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 34838 38998 20041006_0800_0900_CULTURE-348.38-389.98-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 50668 52668 20041006_0800_0900_CULTURE-506.68-526.68-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 59458 61655 20041006_0800_0900_CULTURE-594.58-616.55-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 62021 67263 20041006_0800_0900_CULTURE-620.21-672.63-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 75625 82258 20041006_0800_0900_CULTURE-756.25-822.58-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 88267 90992 20041006_0800_0900_CULTURE-882.67-909.92-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 91467 91610 20041006_0800_0900_CULTURE-914.67-916.10-unk-M-Hervé_Gardette
20041006_0800_0900_CULTURE 23370 27535 20041006_0800_0900_CULTURE-233.70-275.35-unk-M-Jacques_Chirac
20041006_0800_0900_CULTURE 129372 134673 20041006_0800_0900_CULTURE-1293.72-1346.73-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 134673 134865 20041006_0800_0900_CULTURE-1346.73-1348.65-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 134865 142125 20041006_0800_0900_CULTURE-1348.65-1421.25-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 142126 142278 20041006_0800_0900_CULTURE-1421.26-1422.78-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 142279 143034 20041006_0800_0900_CULTURE-1422.79-1430.34-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 143034 143239 20041006_0800_0900_CULTURE-1430.34-1432.39-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 143240 146164 20041006_0800_0900_CULTURE-1432.40-1461.64-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 146164 146271 20041006_0800_0900_CULTURE-1461.64-1462.71-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 146455 156508 20041006_0800_0900_CULTURE-1464.55-1565.08-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 157974 158120 20041006_0800_0900_CULTURE-1579.74-1581.20-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 158590 162848 20041006_0800_0900_CULTURE-1585.90-1628.48-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 163028 163233 20041006_0800_0900_CULTURE-1630.28-1632.33-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 163328 170138 20041006_0800_0900_CULTURE-1633.28-1701.38-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 170139 170336 20041006_0800_0900_CULTURE-1701.39-1703.36-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 170336 173282 20041006_0800_0900_CULTURE-1703.36-1732.82-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 173402 173478 20041006_0800_0900_CULTURE-1734.02-1734.78-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 173479 173531 20041006_0800_0900_CULTURE-1734.79-1735.31-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 173611 183083 20041006_0800_0900_CULTURE-1736.11-1830.83-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 209020 214362 20041006_0800_0900_CULTURE-2090.20-2143.62-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 214362 214477 20041006_0800_0900_CULTURE-2143.62-2144.77-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 214478 219061 20041006_0800_0900_CULTURE-2144.78-2190.61-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 219061 219235 20041006_0800_0900_CULTURE-2190.61-2192.35-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 219236 228690 20041006_0800_0900_CULTURE-2192.36-2286.90-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 228690 228965 20041006_0800_0900_CULTURE-2286.90-2289.65-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 228965 240004 20041006_0800_0900_CULTURE-2289.65-2400.04-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 243330 243469 20041006_0800_0900_CULTURE-2433.30-2434.69-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 243469 243738 20041006_0800_0900_CULTURE-2434.69-2437.38-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 243739 255448 20041006_0800_0900_CULTURE-2437.39-2554.48-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 255619 255965 20041006_0800_0900_CULTURE-2556.19-2559.65-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 255965 260911 20041006_0800_0900_CULTURE-2559.65-2609.11-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 260997 269188 20041006_0800_0900_CULTURE-2609.97-2691.88-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 272280 282777 20041006_0800_0900_CULTURE-2722.80-2827.77-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 282778 283009 20041006_0800_0900_CULTURE-2827.78-2830.09-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 283009 288893 20041006_0800_0900_CULTURE-2830.09-2888.93-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 289018 294015 20041006_0800_0900_CULTURE-2890.18-2940.15-unk-M-Jacques_Derrida
20041006_0800_0900_CULTURE 298162 316915 20041006_0800_0900_CULTURE-2981.62-3169.15-unk-M-Jacques_Derrida
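
Splitting the last field needs some care, because a speaker name may itself contain a - (as in Jean-François_Aki). The following Python sketch splits on the first five separators only; `parse_ctl_line` is a hypothetical helper, and it assumes that the show name contains no - character, which holds for the listing above but is not guaranteed by this section.

```python
def parse_ctl_line(line):
    """Split one line of the modified CTL format into a dict."""
    show, start_frame, end_frame, label = line.split(maxsplit=3)
    # Only the first five '-' separate sub-fields: the speaker name,
    # which comes last, may contain '-' itself.
    label_show, start_s, end_s, band, gender, speaker = label.split("-", 5)
    return {
        "show": show,
        "start_frame": int(start_frame),
        "end_frame": int(end_frame),
        "label_show": label_show,
        "start_seconds": float(start_s),
        "end_seconds": float(end_s),
        "band": band,        # S, T or unk
        "gender": gender,    # M, F or unk
        "speaker": speaker,
    }


line = ("20041006_0800_0900_CULTURE 28750 34838 "
        "20041006_0800_0900_CULTURE-287.50-348.38-unk-M-"
        "20041006_0800_0900_CULTURE_Jean-François_Aki")
rec = parse_ctl_line(line)
print(rec["speaker"], rec["start_frame"], rec["start_seconds"])
```

In the listing above, the second and third columns equal the times in seconds multiplied by 100, so `round(rec["start_seconds"] * 100) == rec["start_frame"]` can serve as a consistency check on each line.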


Acoustic features

Acoustic features can be computed on the fly from the audio recording using Sphinx 4 classes, or read from a file containing vectors of parameters. The supported file formats are the Sphinx format, the SPro4 format, the HTK format, and gzipped text in which each line corresponds to a vector.

Transformations can be applied before using the acoustic features. The first and second order derivatives, as well as energy and its derivatives, can be computed or deleted after reading the raw features. The feature distribution can be centered and/or reduced (scaled to unit variance), given mean and variance vectors computed over a segment, over a cluster, over all the features of the recording, or over a sliding window. To do so, a FeatureSet obtains segment and cluster information from the ClusterSet passed as a parameter when the FeatureSet instance is created. Feature warping is only suitable for sliding windows and can be applied before or after the previous normalization.
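
As an illustration of centering and reducing over a segment, the Python sketch below normalizes each feature dimension with the mean and standard deviation computed over that segment. It is a generic mean and variance normalization written for this example, not the toolkit's FeatureSet code; `normalize_segment` and its parameters are hypothetical, and the fallback to a divisor of 1.0 for constant dimensions is a choice made here.

```python
from statistics import fmean, pstdev


def normalize_segment(features, start, length, reduce=True):
    """Center (and optionally reduce) features[start:start + length].

    `features` is a list of equal-length vectors; `start` and `length`
    are expressed in features, as in a diarization file.
    """
    window = features[start:start + length]
    if not window:
        raise ValueError("empty segment")
    columns = list(zip(*window))  # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    # A constant dimension has zero deviation: divide by 1.0 instead.
    stds = [(pstdev(col) or 1.0) if reduce else 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(vec, means, stds)]
            for vec in window]


features = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [9.0, 9.0]]
normalized = normalize_segment(features, start=0, length=3)
for vec in normalized:
    print(vec)
```

Normalizing over a cluster would gather the statistics from all segments of that cluster before applying them, and a sliding window would recompute them around each feature.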

The Common Parameters page gives information on the feature parameters.