.. _iv_extraction:

Extract your I-Vectors
======================

Once you have trained a Universal Background Model (GMM or DNN) and a Total Variability matrix,
you are ready to extract i-vectors.
Starting from version 1.2 of **SIDEKIT**, the extraction process is managed by a `FactorAnalyser`.

This page assumes that you have already created:

- a :ref:`Mixture`, `ubm`, to use as a UBM
- a :ref:`FeaturesServer`, `features_server`, to load acoustic features
- a :ref:`FactorAnalyser`

and that you have computed sufficient statistics on one or more sets of segments,
stored in one or more :ref:`StatServer` objects, `stat_server`.

1. Extract i-vectors in a single process
----------------------------------------

The following code extracts i-vectors for the set of segments whose statistics are in `stat_server`.

.. code-block:: python

    import sidekit

    fa = sidekit.FactorAnalyser()
    # Returns the i-vectors and, since uncertainty=True, their uncertainties
    iv, iv_uncertainty = fa.extract_ivectors_single(ubm, stat_server, uncertainty=True)

Where:

- **ubm** is a ``Mixture``
- **stat_server** is an object of type ``StatServer``
- **uncertainty** is a boolean; if True, the method also returns a matrix in which each line
  is the diagonal of the uncertainty matrix of the corresponding i-vector.

.. note:: ``iv`` is a :ref:`StatServer` that contains i-vectors in `stat1` and ones in `stat0`.

2. Extract i-vectors on multiple processes on a single node
-----------------------------------------------------------

The following code extracts i-vectors for the set of segments whose statistics are in `stat_server`,
using multiple processes on a single machine.
Due to the limitations of the Multiprocessing module (related to the pickling of objects),
we advise keeping *batch_size* to a few hundred sessions.

.. code-block:: python

    import sidekit

    fa = sidekit.FactorAnalyser()
    iv, iv_uncertainty = fa.extract_ivectors(ubm,
                                             stat_server_filename,
                                             prefix='',
                                             batch_size=300,
                                             uncertainty=False,
                                             num_thread=1)

Where:

- **ubm** is a ``Mixture``
- **stat_server_filename** is the name of an HDF5 file containing a ``StatServer``
- **prefix** is the prefix of the statistics data set within the HDF5 file
- **batch_size** is the number of sessions handled by each process at a time
- **uncertainty** is a boolean; if True, the method also returns a matrix in which each line
  is the diagonal of the uncertainty matrix of the corresponding i-vector.
- **num_thread** is the number of processes to run in parallel

3. Extract i-vectors on multiple nodes
--------------------------------------

**SIDEKIT** also provides a function to extract i-vectors on several nodes (machines),
which is especially appropriate for large models (> 4000 distributions).

Refer to the :ref:`MPI` page to see how to launch your computation on several nodes.
The code to execute should look like this:

.. code-block:: python

    import sidekit

    fa = sidekit.FactorAnalyser()
    sidekit.sidekit_mpi.extract_ivector(stat_server_file_name,
                                        ubm,
                                        output_file_name,
                                        uncertainty=False,
                                        prefix='')

Where:

- **stat_server_file_name** is the name of the file holding the **StatServer** whose sufficient
  statistics are used to generate the i-vectors
- **ubm** is a ``Mixture`` object
- **output_file_name** is the name of the HDF5 file where the i-vectors will be stored
- **uncertainty** is a boolean; if True, the method also produces a matrix in which each line
  is the diagonal of the uncertainty matrix of the corresponding i-vector.
  This matrix is stored on disk in an HDF5 file.
- **prefix** is the prefix of the sufficient statistics in the HDF5 file
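
The role of ``batch_size`` in the multi-process extraction of section 2 can be pictured with
the minimal sketch below. It is not SIDEKIT's implementation: the helper ``make_batches`` and
the session counts are hypothetical, chosen only to show how a set of sessions could be cut
into batches of a few hundred sessions so that each piece handed to a worker process stays small.

.. code-block:: python

    def make_batches(session_count, batch_size):
        """Return (start, stop) index pairs covering range(session_count).

        Hypothetical helper for illustration only, not part of SIDEKIT.
        """
        if batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        return [(start, min(start + batch_size, session_count))
                for start in range(0, session_count, batch_size)]


    # Example: 1000 sessions cut into batches of at most 300 sessions
    batches = make_batches(1000, 300)
    print(batches)  # [(0, 300), (300, 600), (600, 900), (900, 1000)]

With ``batch_size=300``, 1000 sessions give four batches, the last one holding the remaining
100 sessions.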