Train and eval your x-vector extractor
=======================================

| This tutorial is about training and running an X-vector
| extractor on the VoxCeleb dataset and evaluating using the
| standard protocol.
|
| We assume here that you've downloaded the VoxCeleb1 data
| from the official website.

The different steps described in this tutorial are as follows:

- Create a PyTorch DataSet
- Create and train an X-vector extractor
- Extract x-vectors
- Train a PLDA model and evaluate on VoxCeleb1

Create DataSets for training
----------------------------

Training an X-vector extractor (`Xtractor`) requires the creation of two objects of type SideSet.
SideSet inherits from the PyTorch Dataset class.
In SIDEKIT, each training corpus is cut into two parts, one for training and one for validation of the network.
For this reason, each corpus is associated with a single YAML file that can be used to initialize two SideSet objects (one for training and the other one for validation).
DataSets are initialized using a `YAML` file that must include the parameters described below.

:Note: A SideSet can be used to feed the network with acoustic features (MFCC) but can also be used to provide raw waveforms.

Miscellaneous parameters
::::::::::::::::::::::::

.. code-block:: yaml

    seed: 1234
    log_interval: 10
    dataset_description: voxceleb1_dev.csv
    data_root_directory: /lium/corpus/base/voxceleb1/dev/wav/
    data_file_extension: .wav
    sample_rate: 16000
    validation_ratio: 0.1
    batch_size: 64

:seed: seed of the random initialization
:log_interval: frequency, in number of batches, at which the training loss is displayed
:dataset_description: name of the file containing the dataset description in CSV format
:data_root_directory: path where the wave files are stored
:data_file_extension: extension of the audio files
:validation_ratio: percentage of the data used for validation
:batch_size: size of the batches

Training options
::::::::::::::::

The "train" section of the YAML file defines the training set options.

.. code-block:: yaml

    train:
        duration: 4
        chunk_per_segment: 1
        overlap: 0.0
        transformation:
            pipeline: MFCC,CMVN,FrequencyMask(12-30),TemporalMask(70)
        augmentation:
            spec_aug: 0.5
            temp_aug: 0.5

:duration: duration of the speech chunks, given in seconds
:chunk_per_segment: maximum number of chunks to select from every speech segment; -1 means selecting all possible chunks
:overlap: overlap, in percentage, between two successive chunks of audio data
:transformation: section that describes the transformations applied to the audio chunks
:pipeline: string that gives the sequence of transformations to apply
:augmentation: data augmentation can be applied on the fly; the chosen augmentation processes are described in this section. Some parameters refer to the transformations applied for spectral and temporal augmentation (see the sketch below)
:spec_aug: apply spectral augmentation (a band of frequency coefficients chosen randomly is masked); the given parameter is the percentage of chunks that are modified
:temp_aug: apply temporal augmentation (a band of frames chosen randomly is masked); the given parameter is the percentage of chunks that are modified
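To make the effect of these two augmentations more concrete, the sketch below applies a frequency mask and a temporal mask to a single chunk of features.
It is only an illustration of the masking principle, not the SIDEKIT implementation; the function name and the default widths are arbitrary.

.. code-block:: python

    import torch

    def mask_chunk(features, max_freq_width=12, max_time_width=70):
        """Illustrative masking of a (num_frames, num_coefficients) chunk."""
        num_frames, num_coeff = features.shape

        # Spectral masking: zero a randomly chosen band of coefficients
        width = torch.randint(1, max_freq_width + 1, (1,)).item()
        start = torch.randint(0, num_coeff - width + 1, (1,)).item()
        features[:, start:start + width] = 0.0

        # Temporal masking: zero a randomly chosen span of frames
        width = torch.randint(1, max_time_width + 1, (1,)).item()
        start = torch.randint(0, num_frames - width + 1, (1,)).item()
        features[start:start + width, :] = 0.0
        return features

    # 4 seconds of 30-dimensional features at 100 frames per second
    chunk = torch.randn(400, 30)
    chunk = mask_chunk(chunk)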
Validation options
::::::::::::::::::

This section is similar to the training one. In this example we see that data augmentation is not applied during validation.
The duration of the chunks can be left empty to process the entire speech segments.

.. code-block:: yaml

    eval:
        duration: 4
        transformation:
            pipeline: MFCC,CMVN
            spec_aug: 0.5
            temp_aug: 0.5
        augmentation:
            spec_aug: 0.0
            temp_aug: 0.0

Instantiate a SideSet
:::::::::::::::::::::

Given a CSV file describing the corpus and a YAML file similar to the one described above, one can instantiate a SideSet as follows:

.. code-block:: python

    training_set = SideSet(data_set_yaml=dataset_yaml,
                           set_type="train",
                           dataset_df=training_df,
                           chunk_per_segment=dataset_params['chunk_per_segment'],
                           overlap=dataset_params['overlap'])

:data_set_yaml: the YAML file that describes the DataSet to create
:set_type: "train" or "validation", to apply the chosen configuration
:dataset_df: optional; a pandas.DataFrame that describes the corpus, or the portion of the corpus, to use (see the rest of the tutorial for examples)
:chunk_per_segment: number of audio chunks to select from each audio segment (integer); this value can be set to -1 to select all possible chunks
:overlap: allowed overlap between consecutive chunks

Download standard corpora descriptions
::::::::::::::::::::::::::::::::::::::

- VoxCeleb1 development data
- VoxCeleb2 development data
- ALLIES development data

Create and train the Xtractor
-----------------------------

X-vectors in SIDEKIT are extracted using an Xtractor object.
An Xtractor inherits from the torch.nn.Module class and is a stack of 3 or 4 torch.nn.Sequential blocks.
All parts of the Xtractor are described by a section of the loaded YAML file, as follows:

Miscellaneous parameters
::::::::::::::::::::::::

.. code-block:: yaml

    feature_size: 30
    activation: LeakyReLU

:feature_size: size of the input acoustic features; note that this size is not used when the Xtractor includes a preprocessor block
:activation: all activations are the same in the current implementation; can be "LeakyReLU", "PReLU" or "ReLU6", and defaults to "ReLU"

Process feature sequences
:::::::::::::::::::::::::

The first block of the network is made of convolutional layers that process a sequence of features.
The architecture is given as a succession of layers of type "Conv1d", "activation" or "BatchNorm".
The type of each layer is determined by the beginning of the layer's name:

:conv*: will be a Conv1d layer
:norm*: will be a BatchNorm layer
:activation*: will add an activation function

A weight decay can be used for regularization in order to prevent overfitting.

.. code-block:: yaml

    segmental:
        weight_decay: 0.0002
        conv1:
            output_channels: 512
            kernel_size: 5
            dilation: 1
        activation1: True
        norm1: 512
        conv2:
            output_channels: 512
            kernel_size: 3
            dilation: 2
        activation2: True
        norm2: 512
        conv3:
            output_channels: 512
            kernel_size: 3
            dilation: 3
        activation3: True
        norm3: 512
        conv4:
            output_channels: 512
            kernel_size: 1
            dilation: 1
        activation4: True
        norm4: 512
        conv5:
            output_channels: 1536
            kernel_size: 1
            dilation: 1
        activation5: True
        norm5: 1536
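To make the naming convention explicit, the "segmental" section above corresponds roughly to the following stack of PyTorch layers.
This is a hand-written sketch, assuming 30-dimensional input features as in the miscellaneous parameters above; it is not the code actually generated by the Xtractor.

.. code-block:: python

    import torch.nn as nn

    # Hand-written equivalent of the "segmental" YAML section above
    segmental = nn.Sequential(
        nn.Conv1d(30, 512, kernel_size=5, dilation=1),
        nn.LeakyReLU(),
        nn.BatchNorm1d(512),
        nn.Conv1d(512, 512, kernel_size=3, dilation=2),
        nn.LeakyReLU(),
        nn.BatchNorm1d(512),
        nn.Conv1d(512, 512, kernel_size=3, dilation=3),
        nn.LeakyReLU(),
        nn.BatchNorm1d(512),
        nn.Conv1d(512, 512, kernel_size=1, dilation=1),
        nn.LeakyReLU(),
        nn.BatchNorm1d(512),
        nn.Conv1d(512, 1536, kernel_size=1, dilation=1),
        nn.LeakyReLU(),
        nn.BatchNorm1d(1536),
    )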
Processing before embedding
:::::::::::::::::::::::::::

This section defines the block that comes after the pooling.
In this version, the only available pooling computes a mean and a standard deviation, so the input size of this block is twice the size of the output of the previous one.
The output of this block is the so-called x-vector.
This block can include Linear layers, activations, dropout and BatchNorm.
Again, the type of each layer is defined by the layer's name:

:lin*: will be a Linear layer
:norm*: will be a BatchNorm layer
:activation*: will add an activation function
:dropout*: will add a dropout layer

.. code-block:: yaml

    before_embedding:
        weight_decay: 0.0002
        linear6:
            output: 512

Processing after embedding
::::::::::::::::::::::::::

This section defines the block that comes after the extraction of the x-vector.
This block can include Linear layers, activations, dropout and BatchNorm.
Again, the type of each layer is defined by the layer's name:

:lin*: will be a Linear layer
:norm*: will be a BatchNorm layer
:activation*: will add an activation function
:dropout*: will add a dropout layer

.. code-block:: yaml

    after_embedding:
        weight_decay: 0.0002
        activation6: True
        norm6: 512
        dropout6: 0.05
        linear7:
            output: 512
        activation7: True
        norm7: 512
        linear8:
            output: speaker_number

Optional: add a preprocessor
::::::::::::::::::::::::::::

Instead of processing acoustic features, the Xtractor can be fed with raw waveforms when it includes a preprocessor block.
The current version of the Xtractor only allows a standard SincNet network.

.. code-block:: yaml

    preprocessor:
        type: sincnet
        waveform_normalize: True
        sample_rate: 16000
        min_low_hz: 50
        min_band_hz: 50
        out_channels: [80, 60, 60]
        kernel_size: [251, 5, 5]
        stride: [1, 1, 1]
        max_pool: [3, 3, 3]
        instance_normalize: True
        activation: leaky_relu
        dropout: 0.0

Instantiate an Xtractor
:::::::::::::::::::::::

Creating an Xtractor object is very simple and only requires the number of output classes, 'speaker_number', and possibly a customized architecture defined by a YAML file as described above.
If no YAML file is given at initialization, the Xtractor uses the default architecture described above.

.. code-block:: python

    # Create a default architecture with a custom number of classes
    model = Xtractor(speaker_number)

    # Create an Xtractor according to the architecture described in a YAML file
    model = Xtractor(speaker_number, model_yaml)

:speaker_number: the number of classes (speakers)
:model_yaml: the name of the YAML file describing the architecture

Training the Xtractor
---------------------

The simplest way to create, train and save an Xtractor is to call the xtrain function from the sidekit.nnet.xvector module.

.. code-block:: python

    sidekit.nnet.xtrain(speaker_number=args.class_number,
                        dataset_yaml=args.dataset,
                        epochs=args.epochs,
                        lr=args.lr,
                        model_yaml=args.architecture,
                        tmp_model_name=args.outputname,
                        best_model_name=args.outputbestname,
                        multi_gpu=args.multi_gpu == "true",
                        clipping=False,
                        num_thread=args.num_processes)

:speaker_number: the number of output classes (speakers)
:dataset_yaml: the YAML file used to describe the SideSets (training and validation)
:epochs: number of epochs to run; default is 100
:lr: learning rate; default is 0.01
:model_yaml: YAML file describing the model architecture (can be None for the default architecture)
:model_name: optional; name of a previous checkpoint file to start from
:tmp_model_name: name of the checkpoint file used to save the model after each iteration. Note that Xtractors, even when trained with torch.nn.DataParallel, are saved in single-GPU mode
:best_model_name: name of the checkpoint file used to save the current best model; updated each time the validation loss is lower than the best one
:multi_gpu: Boolean; if False, forces the use of a single GPU. Default is True, which makes use of torch.nn.DataParallel
:clipping: Boolean; if True, the gradient is clipped to 1
:num_thread: number of processes used by the DataLoaders; default is 1

Extract x-vectors
-----------------

Once your Xtractor has been trained, you can extract x-vectors.
The process is fully managed with sidekit.bosaris.IdMap to be compliant with the SIDEKIT philosophy.
An IdMap is an object containing the following information (a hand-made example is sketched after this list):

:leftids: typically the name of the class the audio chunk belongs to
:rightids: typically the name of the audio file to load the signal from
:start: the start time of the audio chunk, given as a frame number (or number of samples)
:stop: the end time of the audio chunk, given as a frame number (or number of samples)
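As an illustration, a small IdMap can be built by hand as follows.
The speaker and file identifiers below are made up, and the start/stop values are left to None so that whole files are used; for the standard evaluation you would rather use the IdMaps listed at the end of this section.

.. code-block:: python

    import numpy
    from sidekit.bosaris import IdMap

    # Hypothetical IdMap built by hand; all identifiers are placeholders
    idmap = IdMap()
    idmap.leftids = numpy.array(["id10001", "id10002"])
    idmap.rightids = numpy.array(["id10001/video1/00001", "id10002/video2/00001"])
    # None means that the whole file is used
    idmap.start = numpy.array([None, None])
    idmap.stop = numpy.array([None, None])
    assert idmap.validate()
    idmap.write("my_enrol_idmap.h5")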
Given such an IdMap, x-vectors are extracted and stored in a StatServer as follows:

.. code-block:: python

    import sidekit
    import torch

    from sidekit.nnet import extract_embeddings

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    idmap_name = "VoxCeleb1_enrol_idmap.h5"

    # Process the data and return a StatServer
    xv_stat = extract_embeddings(idmap_name=idmap_name,
                                 speaker_number=1211,
                                 model_filename="best_model_newXV.pt",
                                 model_yaml="archi.yaml",
                                 data_root_name="/lium/corpus/base/voxceleb1/test/wav/",
                                 device=device,
                                 transform_pipeline="MFCC,CMVN")

    # Save the StatServer in HDF5 format
    xv_stat.write("VoxCeleb1_enrol_xvectors.h5")

IdMaps in HDF5 format for the standard VoxCeleb1 evaluation can be downloaded from here:

- training_idmap
- enrolment_idmap
- test_idmap

Evaluate using a PLDA model
---------------------------
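The last step consists in training a PLDA model on the training x-vectors and scoring the VoxCeleb1 trials.
The sketch below shows one possible way to do this with SIDEKIT's FactorAnalyser and PLDA_scoring; the file names, the rank of the speaker subspace and the exact keyword arguments are assumptions that may have to be adapted to your data and SIDEKIT version.

.. code-block:: python

    import sidekit

    # X-vectors extracted with extract_embeddings for the three IdMaps above
    # (file names are placeholders)
    train_xv = sidekit.StatServer("VoxCeleb1_training_xvectors.h5")
    enrol_xv = sidekit.StatServer("VoxCeleb1_enrol_xvectors.h5")
    test_xv = sidekit.StatServer("VoxCeleb1_test_xvectors.h5")

    # Trial list of the standard VoxCeleb1 evaluation
    ndx = sidekit.Ndx("VoxCeleb1_test_ndx.h5")

    # Train the PLDA model (the rank of the speaker subspace is arbitrary here)
    fa = sidekit.FactorAnalyser()
    fa.plda(train_xv, rank_f=150)

    # Score the trials and save the scores
    scores = sidekit.iv_scoring.PLDA_scoring(enrol_xv, test_xv, ndx,
                                             mu=fa.mean, F=fa.F, Sigma=fa.Sigma)
    scores.write("VoxCeleb1_plda_scores.h5")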