Automatic Speaker Verification is a challenging task for several reasons. Firstly, we need to learn speaker representations from recordings of arbitrary duration. Secondly, we need to generalize to new speakers, languages, and recording conditions. Over the last few years, deep neural networks have established themselves as the state-of-the-art technique for this task. However, these methods require vast amounts of training data and depend heavily on additional classifiers to perform speaker verification. Our goal in this research is to develop techniques that overcome the core challenges of speaker verification while also addressing the limitations of the current generation of neural-network-based solutions.
To achieve this goal, we make several contributions to neural-network-based speaker verification. Firstly, we propose network architectures that are computationally efficient and capable of processing recordings of arbitrary duration. Secondly, we show that robust verification performance can be achieved with a simple non-parametric scoring method such as cosine distance. This not only simplifies the verification pipeline, but also yields representations that inherently capture more speaker-discriminative information. Thirdly, we show that knowledge can be transferred between different verification tasks, leading to robust performance even on small datasets. Finally, we propose a novel approach for adapting speaker representations to new languages and recording conditions.
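To make the non-parametric scoring concrete, the sketch below shows how a verification trial can be decided with cosine distance alone: an enrollment embedding and a test embedding are compared, and the trial is accepted if their cosine similarity clears a threshold. The random embeddings and the threshold value of 0.5 are illustrative stand-ins, not values from this work; in practice the embeddings would come from the trained network and the threshold would be tuned on a development set.

```python
# Minimal sketch of cosine-distance speaker verification.
# Embeddings here are random stand-ins for network outputs;
# the 0.5 threshold is an illustrative assumption.
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (1.0 = same direction)."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def verify(enroll: np.ndarray, test: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the trial as a same-speaker pair if the score clears the threshold."""
    return cosine_score(enroll, test) >= threshold

rng = np.random.default_rng(0)
enroll = rng.normal(size=256)                # enrollment embedding
same = enroll + 0.1 * rng.normal(size=256)   # small perturbation: same speaker
other = rng.normal(size=256)                 # independent draw: different speaker

print(verify(enroll, same))   # high similarity -> accepted
print(verify(enroll, other))  # near-orthogonal in high dimensions -> rejected
```

Because scoring reduces to a dot product between length-normalized vectors, no trained back-end classifier is needed at test time, which is what makes this style of verification attractive when enrollment data is scarce.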