Extract your I-Vectors

Once you have trained a Universal Background Model (GMM or DNN) and a Total Variability matrix, you are ready to extract i-vectors.

Starting from version 1.2 of SIDEKIT, the extraction process is managed with a FactorAnalyser.

The examples below assume that you have already created the models mentioned above, and that you have computed sufficient statistics on one or more sets of segments. Those statistics are stored in one or more StatServer objects, referred to as stat_server.

1. Extract i-vectors in a single process

The following code will extract i-vectors for the set of segments whose statistics are in stat_server.

fa = sidekit.FactorAnalyser()

iv, iv_uncertainty = fa.extract_ivectors_single(ubm,
                                                stat_server,
                                                uncertainty=True)
Where:
  • ubm is a Mixture

  • stat_server is an object of type StatServer

  • uncertainty is a boolean; if True, the method also returns a matrix in which each row is the diagonal of the uncertainty matrix of the corresponding i-vector.

Note

iv is a StatServer that contains i-vectors in stat1 and ones in stat0.
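
To make the two return values more concrete, here is a minimal NumPy sketch of the textbook i-vector posterior computed from sufficient statistics. It is not SIDEKIT's implementation (which may, for instance, whiten the statistics first); the function name extract_ivector, the array layouts and the toy data are assumptions made for illustration only.

```python
import numpy as np

def extract_ivector(stat0, stat1, means, variances, T):
    """Textbook i-vector posterior for a single session (illustrative only).

    stat0:     (C,)      zero-order statistics, one count per Gaussian
    stat1:     (C, D)    first-order statistics, one D-dim sum per Gaussian
    means:     (C, D)    UBM means
    variances: (C, D)    UBM diagonal covariances
    T:         (C, D, R) total variability matrix, one (D, R) block per Gaussian

    Returns the posterior mean (the i-vector, shape (R,)) and the diagonal
    of the posterior covariance (the uncertainty, shape (R,)).
    """
    R = T.shape[2]
    # Center the first-order statistics around the UBM means: F - N * m
    centered = stat1 - stat0[:, None] * means
    # Scale each block of T by the inverse UBM variances: Sigma^-1 T
    T_scaled = T / variances[:, :, None]
    # Posterior precision: I + sum_c N_c * T_c^T Sigma_c^-1 T_c
    precision = np.eye(R) + np.einsum("c,cdr,cds->rs", stat0, T_scaled, T)
    # Linear term: T^T Sigma^-1 (F - N * m)
    linear = np.einsum("cdr,cd->r", T_scaled, centered)
    covariance = np.linalg.inv(precision)
    ivector = covariance @ linear
    return ivector, np.diag(covariance)

# Toy model: 4 Gaussians, 3-dim features, 2-dim i-vectors (made-up values).
rng = np.random.default_rng(0)
C, D, R = 4, 3, 2
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
T = rng.normal(size=(C, D, R))

# A session whose statistics sit exactly on the UBM means.
stat0 = np.array([10.0, 5.0, 0.0, 2.0])
stat1 = stat0[:, None] * means
ivector, uncertainty = extract_ivector(stat0, stat1, means, variances, T)
```

In this sketch, uncertainty plays the role of one row of iv_uncertainty: the diagonal of the posterior covariance of a single i-vector. A session with no frames at all would get a zero i-vector and an uncertainty equal to the prior variance (all ones).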

2. Extract i-vectors using multiple processes on a single node

The following code will extract i-vectors for the set of segments whose statistics are in stat_server, using multiple processes on a single machine.

Due to the limitations of the Multiprocessing module (related to the pickling of objects), we advise keeping batch_size to a few hundred sessions.

fa = sidekit.FactorAnalyser()

iv, iv_uncertainty = fa.extract_ivectors(ubm,
                                         stat_server_filename,
                                         prefix='',
                                         batch_size=300,
                                         uncertainty=False,
                                         num_thread=1)
Where:
  • ubm is a Mixture

  • stat_server_filename is the name of an HDF5 file containing a StatServer

  • prefix is the prefix of the statistic data set within the HDF5 file

  • batch_size is the number of sessions handled by each process at a time

  • uncertainty is a boolean; if True, the method also returns a matrix in which each row is the diagonal of the uncertainty matrix of the corresponding i-vector.

  • num_thread is the number of processes to run in parallel
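
To illustrate what batch_size controls, the sketch below splits a set of sessions into consecutive batches that could each be handed to a worker process. The helper make_batches is hypothetical and is not part of SIDEKIT; the session count is made up.

```python
def make_batches(num_sessions, batch_size):
    """Return (start, stop) index pairs that cover range(num_sessions).

    Hypothetical helper: each pair delimits the sessions that one worker
    would process in a single call; the last batch may be smaller.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(start, min(start + batch_size, num_sessions))
            for start in range(0, num_sessions, batch_size)]

# 1000 sessions with batch_size=300 give three full batches and one partial.
batches = make_batches(num_sessions=1000, batch_size=300)
```

Smaller batches mean smaller objects to pickle when they are exchanged between processes, which is the reason for the advice above to stay at a few hundred sessions.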

3. Extract i-vectors on multiple nodes

SIDEKIT also provides a function to extract i-vectors on several nodes (machines), which is especially appropriate for large models (> 4000 distributions).

Refer to the "Parallel computation in SIDEKIT" page to see how to launch your computation on several nodes.

The code to execute should look like this:

fa = sidekit.FactorAnalyser()

sidekit.sidekit_mpi.extract_ivector(stat_server_file_name,
                                    ubm,
                                    output_file_name,
                                    uncertainty=False,
                                    prefix='')
Where:
  • stat_server_file_name is the name of the file containing the StatServer whose sufficient statistics will be used to generate i-vectors

  • ubm is a Mixture object

  • output_file_name is the name of the HDF5 file where i-vectors will be stored

  • uncertainty is a boolean; if True, the method also produces a matrix in which each row is the diagonal of the uncertainty matrix of the corresponding i-vector. This matrix is stored on disk in an HDF5 file.

  • prefix is the prefix of the sufficient statistics in the HDF5 file
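
As a rough picture of what distributing the extraction over several nodes involves, the sketch below assigns each node a contiguous, balanced slice of the sessions. This is only an assumed partitioning scheme for illustration; the helper sessions_for_rank is hypothetical and does not describe how sidekit_mpi actually divides the work.

```python
def sessions_for_rank(num_sessions, rank, num_ranks):
    """Return the (start, stop) slice of sessions assigned to one node.

    Hypothetical balanced partition: slice sizes differ by at most one,
    and the first (num_sessions % num_ranks) nodes get the extra session.
    """
    base, extra = divmod(num_sessions, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# 10 sessions over 4 nodes: the first two nodes get 3 sessions, the others 2.
slices = [sessions_for_rank(10, rank, 4) for rank in range(4)]
```

Each node would then extract i-vectors only for its own slice, and the results would be gathered into the single output file.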