Andrew Mackenzie (McGill University)
Title: Tensor Programs and µP
Abstract: We will discuss the limiting behaviour of large neural networks as the layer width goes to infinity. One of the factors that most affects this limiting behaviour is the specific parametrization used: beyond determining training stability, the parametrization determines whether or not the network can learn features in the limit. We present a technique for mechanically deriving the "best" parametrization, known as µP. As an additional empirical benefit, we demonstrate that under µP, optimal hyperparameters transfer directly across model sizes, so hyperparameter search can be carried out entirely on small models.
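As a rough illustration of the kind of width-dependent rules µP prescribes, the sketch below computes per-layer initialization and learning-rate scales as a function of width. The specific constants, the base width, and the function name are illustrative assumptions, not the talk's exact recipe; the qualitative point is that each quantity scales with a definite power of the width multiplier, which is what makes hyperparameters transfer across sizes.

```python
import math

def mup_scaling(width, base_width=256, base_lr=1e-3):
    """Illustrative muP-style scaling rules (a sketch under assumed
    conventions, roughly following the Adam case; not the speaker's
    exact prescription).

    m = width / base_width is the width multiplier. The hypothesis:
      - hidden-weight init std shrinks like 1/sqrt(fan_in),
      - hidden-layer learning rate shrinks like 1/m,
      - the output layer is scaled down by an extra factor of 1/m.
    """
    m = width / base_width
    return {
        "hidden_init_std": 1.0 / math.sqrt(width),  # variance ~ 1/fan_in
        "hidden_lr": base_lr / m,                   # LR ~ 1/width (Adam-style)
        "output_mult": 1.0 / m,                     # extra 1/width on the readout
    }

# Doubling the width halves the hidden learning rate; the base-width
# model keeps the base hyperparameters, which is why tuning can be
# done at small scale and reused at large scale.
small = mup_scaling(256)
large = mup_scaling(512)
```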