.. _iv_extraction:

Extract your I-Vectors
======================

Once you have trained a Universal Background Model (GMM or DNN) and a Total Variability matrix,
you are ready to extract i-vectors.

Starting from version 1.2 of **SIDEKIT**, the extraction process is managed with a `FactorAnalyser`.

The examples below assume that you have already created:

   - a :ref:`Mixture`, `ubm`, to use as a UBM
   - a :ref:`FeaturesServer`, `features_server`, to load acoustic features
   - a :ref:`FactorAnalyser`

and that you have computed sufficient statistics on one or more sets of segments, stored in
one or more :ref:`StatServer` objects (`stat_server` in the examples).
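
As a reminder of what these sufficient statistics contain, here is a minimal NumPy sketch of
the zero-order and first-order statistics of one segment for a diagonal-covariance GMM.
It does not use the **SIDEKIT** API: the function ``sufficient_statistics`` and the toy data
are hypothetical and only illustrate the standard definitions.

.. code-block:: python

   import numpy as np

   def sufficient_statistics(features, weights, means, variances):
       """Zero- and first-order statistics for a diagonal-covariance GMM.

       features: (T, D) frames; weights: (C,); means, variances: (C, D).
       Returns stat0 of shape (C,) and stat1 of shape (C, D).
       """
       # Log-density of every frame under every Gaussian, shape (T, C).
       diff = features[:, None, :] - means[None, :, :]
       log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                           + np.sum(np.log(2.0 * np.pi * variances), axis=1))
       log_joint = np.log(weights) + log_gauss
       # Posterior probability of each component for each frame.
       log_joint -= log_joint.max(axis=1, keepdims=True)
       posteriors = np.exp(log_joint)
       posteriors /= posteriors.sum(axis=1, keepdims=True)
       stat0 = posteriors.sum(axis=0)    # soft frame counts per component
       stat1 = posteriors.T @ features   # posterior-weighted sums of frames
       return stat0, stat1

   # Toy example: 200 frames of dimension 3, GMM with 2 components.
   rng = np.random.default_rng(0)
   features = rng.normal(size=(200, 3))
   weights = np.array([0.5, 0.5])
   means = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
   variances = np.ones((2, 3))
   stat0, stat1 = sufficient_statistics(features, weights, means, variances)

The zero-order statistics sum to the number of frames, and the first-order statistics sum
(over components) to the sum of all frames.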

1. Extract i-vectors in a single process
----------------------------------------

The following code extracts i-vectors for the set of segments whose statistics are in `stat_server`.

.. code-block:: python

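   import sidekit

   # `fa` stands for the FactorAnalyser that holds your trained Total Variability matrix.
   fa = sidekit.FactorAnalyser()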

   iv, iv_uncertainty = fa.extract_ivectors_single(ubm,
                                                   stat_server,
                                                   uncertainty=True)

Where:
   - **ubm** is a ``Mixture``
   - **stat_server** is an object of type ``StatServer``
   - **uncertainty** is a boolean; if True, the method also returns a matrix in which each row
     is the diagonal of the uncertainty matrix of the corresponding i-vector.

.. note::
   ``iv`` is a :ref:`StatServer` that stores the i-vectors in `stat1`; its `stat0` is filled with ones.
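
To make the returned values more concrete, the following NumPy sketch computes an i-vector and
the diagonal of its uncertainty (posterior covariance) matrix with the standard i-vector
formulation, then stacks the results row by row. This is an illustration under stated
assumptions, not the **SIDEKIT** implementation: the function ``ivector_posterior``, the
variable names, the random toy data and the ``(n_sessions, 1)`` shape used for the ones are
all hypothetical.

.. code-block:: python

   import numpy as np

   def ivector_posterior(stat0, stat1_centered, tv_matrix, variances):
       """Posterior mean (i-vector) and diagonal of the posterior covariance.

       stat0: (C,) zero-order statistics
       stat1_centered: (C, D) first-order statistics centered on the UBM means
       tv_matrix: (C * D, R) Total Variability matrix
       variances: (C, D) diagonal UBM covariances
       """
       dim = stat1_centered.shape[1]
       rank = tv_matrix.shape[1]
       # Sigma^-1 T: scale each row of T by the inverse UBM variance.
       t_scaled = tv_matrix / variances.reshape(-1, 1)
       # Repeat each occupation count over the D dimensions of its component.
       counts = np.repeat(stat0, dim)
       # Posterior precision: I + T' Sigma^-1 N T.
       precision = np.eye(rank) + tv_matrix.T @ (counts[:, None] * t_scaled)
       covariance = np.linalg.inv(precision)
       ivector = covariance @ (t_scaled.T @ stat1_centered.reshape(-1))
       return ivector, np.diag(covariance)

   # Toy data: 4 sessions, 8 components, 3-dimensional features, rank 5.
   rng = np.random.default_rng(0)
   n_sessions, n_comp, dim, rank = 4, 8, 3, 5
   tv_matrix = 0.1 * rng.normal(size=(n_comp * dim, rank))
   variances = np.ones((n_comp, dim))
   results = [ivector_posterior(rng.uniform(1.0, 50.0, size=n_comp),
                                rng.normal(size=(n_comp, dim)),
                                tv_matrix, variances)
              for _ in range(n_sessions)]

   iv_stat1 = np.vstack([w for w, _ in results])        # one i-vector per row
   iv_stat0 = np.ones((n_sessions, 1))                  # filled with ones
   iv_uncertainty = np.vstack([u for _, u in results])  # one diagonal per row

Each row of ``iv_uncertainty`` has the same length as the corresponding i-vector, and its
values shrink as the session provides more frames.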

2. Extract i-vectors with multiple processes on a single node
-------------------------------------------------------------

The following code extracts i-vectors for the set of segments whose statistics are stored in the
file `stat_server_filename`, using multiple processes on a single machine.

Due to limitations of the ``multiprocessing`` module (related to the pickling of objects),
we advise keeping *batch_size* to a few hundred sessions.

.. code-block:: python

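   # `fa` stands for the FactorAnalyser that holds your trained Total Variability matrix.
   fa = sidekit.FactorAnalyser()

   # uncertainty=True so that the uncertainty matrix is returned along with the i-vectors.
   iv, iv_uncertainty = fa.extract_ivectors(ubm,
                                            stat_server_filename,
                                            prefix='',
                                            batch_size=300,
                                            uncertainty=True,
                                            num_thread=1)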

Where:
   - **ubm** is a ``Mixture``
   - **stat_server_filename** is the name of an HDF5 file containing a ``StatServer``
   - **prefix** is the prefix of the statistic data set within the HDF5 file
   - **batch_size** is the number of sessions handled by each process at a time
   - **uncertainty** is a boolean; if True, the method also returns a matrix in which each row
     is the diagonal of the uncertainty matrix of the corresponding i-vector.
   - **num_thread** is the number of processes to run in parallel
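
To give an idea of what **batch_size** controls, here is a small pure-Python sketch that cuts
a list of sessions into consecutive batches. The helper ``make_batches`` is hypothetical and
is not part of **SIDEKIT**; how the library actually schedules batches across processes is
not shown here.

.. code-block:: python

   def make_batches(n_sessions, batch_size):
       """Return (start, stop) index pairs that cover range(n_sessions)."""
       return [(start, min(start + batch_size, n_sessions))
               for start in range(0, n_sessions, batch_size)]

   # 1000 sessions with batches of 300: three full batches and a last one of 100.
   batches = make_batches(n_sessions=1000, batch_size=300)
   print(batches)  # [(0, 300), (300, 600), (600, 900), (900, 1000)]

A smaller batch means smaller objects to pickle and send to each process, at the cost of more
round trips.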

3. Extract i-vectors on multiple nodes
--------------------------------------

**SIDEKIT** also provides a function to extract i-vectors on several nodes (machines),
which is especially appropriate for large models (> 4000 distributions).

Refer to the :ref:`MPI` page to see how to launch your computation on several nodes.

The code to execute should look like this:

.. code-block:: python

   fa = sidekit.FactorAnalyser()

   sidekit.sidekit_mpi.extract_ivector(stat_server_file_name,
                                       ubm,
                                       output_file_name,
                                       uncertainty=False,
                                       prefix='')

Where:
   - **stat_server_file_name** is the name of a file containing a ``StatServer`` with the sufficient
     statistics used to generate the i-vectors
   - **ubm** is a ``Mixture`` object
   - **output_file_name** is the name of the HDF5 file where the i-vectors will be stored
   - **uncertainty** is a boolean; if True, the method also computes a matrix in which each row
     is the diagonal of the uncertainty matrix of the corresponding i-vector.
     This matrix is stored on disk in an HDF5 file.
   - **prefix** is the prefix of the sufficient statistics in the HDF5 file