Transformer Activation Functions and Their Details
Here are a few observations: GPT-2, developed by OpenAI, opts for the GELU (Gaussian Error Linear Unit) activation function. On the other hand, LLaMA, a creation of Facebook Research, embraces the SwiGLU activation function. Meanwhile, Gemma, a PyTorch implementation by Google, adopts the GeGLU activation function. So what are these new activation functions? How should one go about implementing them in PyTorch? In this blog post I try to understand the definitions of these activation functions and how they could be implemented in PyTorch.
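Before diving into the definitions, here is a minimal sketch of what these three activations can look like in PyTorch. Note that GELU ships with PyTorch as `nn.GELU`, while the gated variants (GeGLU, SwiGLU) are written here following the gated-linear-unit formulation (one projection gates the other); the module names, layer sizes, and the single fused projection split with `chunk` are illustrative choices, not the exact implementations used by the models above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """GELU-gated linear unit: GELU(x W) * (x V), with W and V fused into one Linear."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.proj = nn.Linear(d_in, 2 * d_out)  # one matmul, split into gate and value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.proj(x).chunk(2, dim=-1)
        return F.gelu(a) * b

class SwiGLU(nn.Module):
    """Swish(SiLU)-gated linear unit: SiLU(x W) * (x V)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.proj = nn.Linear(d_in, 2 * d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.proj(x).chunk(2, dim=-1)
        return F.silu(a) * b

x = torch.randn(4, 16)
print(nn.GELU()(x).shape)        # plain GELU, as in GPT-2
print(GeGLU(16, 32)(x).shape)    # GeGLU, as in Gemma
print(SwiGLU(16, 32)(x).shape)   # SwiGLU, as in LLaMA
```

The rest of this post unpacks where these formulas come from and what each piece means.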