.. _tv_estimation:

Train an i-vector extractor
===========================

Total Variability (TV) models are trained via an EM algorithm using the
FactorAnalyser class from **SIDEKIT**. TV models are trained from sufficient
statistics that are accumulated in a ``StatServer`` object (or by a neural
network). The training also requires a UBM of type Mixture.

**SIDEKIT** provides four implementations of the Total Variability EM
estimation. Three are methods of the FactorAnalyser class, while the fourth
one is available in the ``sidekit_mpi`` module and requires an MPI library to
be installed.

1. **total_variability_raw** is provided for didactic purposes: the code is
   written using the plain (raw) mathematical formulas, without any
   optimization.

2. **total_variability_single** provides a single-process implementation of
   the EM algorithm. This version runs as a single process on a single
   machine but has been optimized.

3. **total_variability** is the parallelised and optimised implementation.
   This method makes use of the multiprocessing module to parallelise the
   computation on a single machine.

1. Get to know the algorithm with total_variability_raw
--------------------------------------------------------

We strongly **encourage** you to **READ** the code of this method to
understand how the EM algorithm works for the total variability model.
We strongly **discourage** you from **USING** this method, as it is
absolutely not optimized. For a usable version of the same method, refer to
section 3 (or 2) below.

2. Using a single process on one machine
------------------------------------------

Training of a TV model on a single machine, single process.

Before running:

- train a GMM-UBM of type Mixture
- accumulate sufficient statistics using a StatServer object

(An illustrative sketch of these two preparation steps is given at the end of
this section.) You can then train the TV model by running:

.. code:: python

    fa = sidekit.FactorAnalyser()
    fa.total_variability_single(stat_server_filename,
                                ubm,
                                tv_rank,
                                nb_iter=20,
                                min_div=True,
                                tv_init=None,
                                batch_size=300,
                                save_init=False,
                                output_file_name=None)

In this example:

- **stat_server_filename** is a list of file names of StatServers containing the sufficient statistics of all sessions used to train the TV model
- **ubm** is the Mixture object for which the sufficient statistics have been computed
- **tv_rank** is an integer, the rank of the resulting Total Variability matrix (the size of the i-vectors)
- **nb_iter** is the number of iterations of the EM algorithm
- **min_div** is a boolean; if True, every iteration includes a minimum divergence re-estimation step
- **tv_init** is a matrix used to initialize the training; if None, the matrix is initialized randomly
- **batch_size** is the number of sessions processed at once, to reduce the memory footprint
- **save_init** is a boolean; if True, the initial model is saved
- **output_file_name** is the name of the file the model will be saved to
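For reference, these two preparation steps could look like the sketch below.
This is only an illustration, not the canonical recipe: the feature-server
configuration, the file names, ``ubm_list`` and ``tv_idmap`` are assumptions
to adapt to your own corpus, and some constructor arguments differ between
SIDEKIT versions.

.. code:: python

    import sidekit

    nb_distrib = 512    # UBM size, example value
    num_thread = 4

    # Feature server reading pre-extracted features; the path pattern and
    # dataset_list are assumptions that must match your feature extraction.
    fs = sidekit.FeaturesServer(feature_filename_structure="./features/{}.h5",
                                dataset_list=["energy", "cep", "vad"],
                                keep_all_features=False)

    # Shows used to train the UBM and IdMap describing the TV training
    # sessions (both file names are assumptions).
    ubm_list = [line.rstrip() for line in open("ubm_train_list.txt")]
    tv_idmap = sidekit.IdMap("tv_idmap.h5")

    # 1) Train the GMM-UBM with the splitting EM algorithm.
    ubm = sidekit.Mixture()
    ubm.EM_split(fs, ubm_list, nb_distrib, num_thread=num_thread)
    ubm.write("ubm_{}.h5".format(nb_distrib))

    # 2) Accumulate zero- and first-order sufficient statistics for every
    #    session of the IdMap (in some SIDEKIT versions the StatServer
    #    constructor takes distrib_nb and feature_size instead of ubm).
    tv_stat = sidekit.StatServer(tv_idmap, ubm=ubm)
    tv_stat.accumulate_stat(ubm=ubm,
                            feature_server=fs,
                            seg_indices=range(tv_stat.segset.shape[0]),
                            num_thread=num_thread)
    tv_stat.write("stat_tv_{}.h5".format(nb_distrib))

The file written by ``tv_stat.write`` is the kind of file whose name is then
passed (inside a list) as **stat_server_filename** to the training methods
described on this page.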
3. Using multiple processes on one machine with Python multiprocessing
------------------------------------------------------------------------

Training of a TV model on a single machine, multiple processes.

Before running:

- train a GMM-UBM of type Mixture
- accumulate sufficient statistics using a StatServer object

You can then train the TV model by running:

.. code:: python

    fa = sidekit.FactorAnalyser()
    fa.total_variability(stat_server_filename,
                         ubm,
                         tv_rank,
                         nb_iter=20,
                         min_div=True,
                         tv_init=None,
                         batch_size=300,
                         save_init=False,
                         output_file_name=None,
                         num_thread=1)

In this example:

- **stat_server_filename** is a list of file names of StatServers containing the sufficient statistics of all sessions used to train the TV model
- **ubm** is the Mixture object for which the sufficient statistics have been computed
- **tv_rank** is an integer, the rank of the resulting Total Variability matrix (the size of the i-vectors)
- **nb_iter** is the number of iterations of the EM algorithm
- **min_div** is a boolean; if True, every iteration includes a minimum divergence re-estimation step
- **tv_init** is a matrix used to initialize the training; if None, the matrix is initialized randomly
- **batch_size** is the number of sessions processed at once, to reduce the memory footprint
- **save_init** is a boolean; if True, the initial model is saved
- **output_file_name** is the name of the file the model will be saved to
- **num_thread** is the number of processes to run on the machine

.. warning::

    The batch_size parameter might cause trouble due to a limitation of the
    Pickle module: objects and data are exchanged between processes via
    pickling, which does not accept overly large objects. Note also that
    Numpy and Scipy are linked to a low-level BLAS library that might itself
    parallelise the computation over multiple cores (the number of BLAS
    threads can usually be capped with the ``OMP_NUM_THREADS`` or
    ``MKL_NUM_THREADS`` environment variables). Thus, do not set the number
    of processes too high; we recommend between 5 and 10 parallel processes,
    depending on your machine.

4. Using multiple processes on multiple nodes with MPI
--------------------------------------------------------

See :ref:`MPI` for details about MPI installation and use.

Training of a TV model across multiple nodes, multiple processes.

Before running:

- train a GMM-UBM of type Mixture
- accumulate sufficient statistics using a StatServer object

You can then train the TV model by running:

.. code:: python

    fa = sidekit.FactorAnalyser()
    # fa_init is a FactorAnalyser whose F matrix is used to initialize the
    # training; use tv_init=None to initialize the matrix randomly.
    fa = sidekit.sidekit_mpi.total_variability(stat_server_filename,
                                               ubm,
                                               tv_rank=10,
                                               nb_iter=10,
                                               min_div=True,
                                               tv_init=fa_init.F,
                                               save_init=False,
                                               output_file_name="tv_mpi")

In this example:

- **stat_server_filename** is a list of file names of StatServers containing the sufficient statistics of all sessions used to train the TV model
- **ubm** is the Mixture object for which the sufficient statistics have been computed
- **tv_rank** is an integer, the rank of the resulting Total Variability matrix (the size of the i-vectors)
- **nb_iter** is the number of iterations of the EM algorithm
- **min_div** is a boolean; if True, every iteration includes a minimum divergence re-estimation step
- **tv_init** is a matrix used to initialize the training; if None, the matrix is initialized randomly
- **save_init** is a boolean; if True, the initial model is saved
- **output_file_name** is the name of the file the model will be saved to
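As an illustration, the call above can be embedded in a small script and
started with a standard MPI launcher. Everything below (script name, file
names, rank, launcher options) is an assumption to adapt to your setup; see
:ref:`MPI` for the recommended installation and launch procedure.

.. code:: python

    # train_tv_mpi.py -- hypothetical script name
    import sidekit
    import sidekit.sidekit_mpi

    # UBM and sufficient statistics produced beforehand
    # (file names are assumptions).
    ubm = sidekit.Mixture()
    ubm.read("ubm_512.h5")
    stat_server_filename = ["stat_tv_512.h5"]

    fa = sidekit.sidekit_mpi.total_variability(stat_server_filename,
                                               ubm,
                                               tv_rank=400,
                                               nb_iter=10,
                                               min_div=True,
                                               tv_init=None,
                                               save_init=False,
                                               output_file_name="tv_mpi")

The script is then launched with your MPI launcher, for instance
``mpirun --hostfile hosts python train_tv_mpi.py``; the exact command and
options depend on your MPI installation and cluster.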