Transformer Activation Functions and their Details

Here are a few observations:

GPT-2, developed by OpenAI, opts for the GELU (Gaussian Error Linear Unit) activation function.

On the other hand, LLaMA, a creation of Facebook Research, uses the SwiGLU activation function.

Meanwhile, Gemma, whose PyTorch implementation is published by Google, adopts the GeGLU activation function.

So what are these new activation functions? How should one go about implementing them in PyTorch?

In this blog post I try to understand the definitions of these activation functions and how they could be implemented in PyTorch. At the end I will show the definitions from the actual implementations in GPT-2, Gemma and LLaMA.

Imports

import torch
import torch.nn as nn

Create a new tensor with random values to use it as input for the activation functions

tensor = torch.randn(10)
print(tensor)
print(tensor.shape)

tensor([-0.8281,  1.0340, -0.4363, -0.4764,  0.6419, -0.1156,  1.4339,  1.5654,
         0.7124, -0.5667])
torch.Size([10])

ReLU Variants

ReLU

Rectified Linear Unit

$$ f(x) = \max(0, x) $$

relu = nn.ReLU()
output = relu(tensor)
print(output)

tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124,
        0.0000])

CReLU

Concatenated ReLU

$$ f(x) = [\text{ReLU}(x), \text{ReLU}(-x)] $$

Observe that the output tensor has twice as many elements as the input tensor.

crelu_output = torch.cat((relu(tensor), relu(-tensor)))
print(crelu_output)

tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124,
        0.0000, 0.8281, 0.0000, 0.4363, 0.4764, 0.0000, 0.1156, 0.0000, 0.0000,
        0.0000, 0.5667])
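For batched inputs, the concatenation would usually go along the feature (last) dimension rather than the batch dimension. Here is a minimal sketch of that idea; the helper name `crelu` and the batch shape are my own choices for illustration.

```python
import torch
import torch.nn.functional as F


def crelu(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Concatenate ReLU(x) and ReLU(-x) along `dim` (hypothetical helper)."""
    return torch.cat((F.relu(x), F.relu(-x)), dim=dim)


batch = torch.randn(4, 10)  # 4 samples with 10 features each
out = crelu(batch)
print(out.shape)  # the feature dimension doubles: torch.Size([4, 20])
```

Since ReLU(x) - ReLU(-x) = x, no information about the input is lost: the first half keeps the positive part and the second half keeps the magnitude of the negative part.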

Leaky ReLU

$$ f(x) = \begin{cases} x, & \text{if } x > 0\\ \text{negative\_slope} \times x, & \text{otherwise} \end{cases} $$

leaky_relu = nn.LeakyReLU(negative_slope=0.1)
output = leaky_relu(tensor)
print(output)

tensor([-0.0828,  1.0340, -0.0436, -0.0476,  0.6419, -0.0116,  1.4339,  1.5654,
         0.7124, -0.0567])

ReLU6

$$ f(x) = \begin{cases} 0, & \text{if } x \leq 0\\ 6, & \text{if } x \geq 6\\ x, & \text{otherwise} \end{cases} $$

relu6 = nn.ReLU6()
output = relu6(tensor)
print(output)

tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124,
        0.0000])
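Every value in the random input above is below 6, so ReLU6 behaves exactly like ReLU on it. To actually see the upper clamp, here is a small sketch with a hand-picked input (the values in `wide` are my own, chosen only for illustration).

```python
import torch
import torch.nn as nn

# Hand-picked values that straddle both thresholds (0 and 6)
wide = torch.tensor([-2.0, 0.5, 3.0, 6.0, 7.5, 10.0])

relu6 = nn.ReLU6()
print(relu6(wide))                      # values: 0, 0.5, 3, 6, 6, 6
print(torch.clamp(wide, min=0, max=6))  # the same thing written as a clamp
```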

(Left) Leaky ReLU, (Middle) ReLU, (Right) ReLU6; taken from the PyTorch documentation

Other Linear Unit Variants

GELU

Gaussian Error Linear Unit

$$ \text{GELU}(x) = x \times \Phi(x) $$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution (mean = 0, standard deviation = 1).

There are two approximate versions of GELU that are faster to compute, at the cost of exactness. The first uses tanh:

$$ \text{GELU}(x) \approx 0.5 \times x \times (1 + \tanh(\sqrt{2/\pi} \times (x + 0.044715 \times x^3))) $$

The second uses the sigmoid function:

$$ \text{GELU}(x) \approx x \times \sigma(1.702 \times x) $$
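To get a feel for how close the two approximations are, here is a rough sketch that evaluates them on a grid and compares against the exact GELU. The grid range is arbitrary, and the `approximate="tanh"` argument assumes a reasonably recent PyTorch release (1.12 or later, as far as I know).

```python
import math

import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=81)

exact = F.gelu(x)  # x * Phi(x)
tanh_approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
sigmoid_approx = x * torch.sigmoid(1.702 * x)

tanh_err = (exact - tanh_approx).abs().max()
sigmoid_err = (exact - sigmoid_approx).abs().max()
print(tanh_err)     # small
print(sigmoid_err)  # noticeably larger than the tanh error

# The hand-written tanh formula should agree with PyTorch's built-in variant
builtin_tanh = F.gelu(x, approximate="tanh")
print(torch.allclose(tanh_approx, builtin_tanh, atol=1e-5))
```

On this grid the tanh version tracks the exact function more closely than the sigmoid version, which is the cheaper of the two.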

Motivation for GELU 💡

The motivation mentioned in the paper is based on the following observations:

ReLU deterministically multiplies the input by 0 or 1.
Dropout (a regularization technique) also multiplies by 0 or 1, but stochastically.
It is possible to multiply by a 0-1 mask stochastically while also depending on the input (this is similar to Adaptive Dropout and Zoneout): the mask is given by $m \sim \text{Bernoulli}(\Phi(x))$.

Since $x \sim N(0, 1)$ holds roughly after batch normalization anyway, this means that inputs become more likely to be dropped as $x$ decreases.

They say that we often want a deterministic decision from the output, so they proposed GELU as the expected transformation of the stochastic regularizer.

$$ E[x \cdot m] = 1 \cdot x \cdot \Phi(x) + 0 \cdot x \cdot (1 - \Phi(x)) = x \cdot \Phi(x) $$

I don't fully understand the motivation myself, but I guess having a partial idea is better than having no idea. They somehow want to take ideas from dropout and from activation functions and combine them to get a better activation function.
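One way to convince yourself of the expectation above is to simulate it: sample many Bernoulli masks with probability $\Phi(x)$, average $x \cdot m$, and compare with $x \cdot \Phi(x)$. This is only a sanity-check sketch; the sample count, the seed and the test points are arbitrary choices of mine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.tensor([-1.0, 0.0, 0.5, 2.0])
phi = torch.distributions.Normal(0.0, 1.0).cdf(x)  # Phi(x): probability of keeping x

n_samples = 200_000
# One row of masks per sample, m ~ Bernoulli(Phi(x)) for each entry of x
mask = torch.bernoulli(phi.repeat(n_samples, 1))
estimate = (x * mask).mean(dim=0)  # Monte Carlo estimate of E[x * m]

print(estimate)   # noisy estimate
print(x * phi)    # closed form x * Phi(x)
print(F.gelu(x))  # PyTorch's exact GELU
```

The three printed tensors should agree up to sampling noise, which is the sense in which GELU is the "expected" version of the stochastic mask.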

gelu = nn.GELU(approximate='none') # 'none' selects the exact version of GELU; 'tanh' selects the approximation
output = gelu(tensor)
print(output)

tensor([-0.1688,  0.8783, -0.1445, -0.1510,  0.4747, -0.0525,  1.3252,  1.4735,
         0.5428, -0.1618])

SiLU / Swish

Sigmoid Linear Unit

$$ f(x) = x \times \sigma(x) $$

where $\sigma(x)$ is the sigmoid function.

This is very similar to GELU, but $\sigma(x)$ is used instead of $\Phi(x)$.

There is also a version of Swish with a learnable parameter:

$$ f(x) = x \times \sigma(\beta x) $$

where $\beta$ is a learnable parameter. With $\beta = 1$ this reduces to SiLU.

silu = nn.SiLU()
output = silu(tensor)
print(output)

tensor([-0.2518,  0.7628, -0.1713, -0.1825,  0.4206, -0.0544,  1.1579,  1.2948,
         0.4780, -0.2051])
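As far as I know, `torch.nn` does not ship the learnable-$\beta$ version of Swish, but it is easy to sketch. The module below is a hypothetical illustration (the class name and the default value of `beta` are my own choices), with $\beta$ stored as a single learnable scalar.

```python
import torch
import torch.nn as nn


class Swish(nn.Module):
    """Swish with a learnable scalar beta (illustrative, not a torch.nn module)."""

    def __init__(self, beta: float = 1.0):
        super().__init__()
        # Registering beta as a Parameter lets the optimizer update it
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)


swish = Swish()
x = torch.randn(10)
# With beta = 1 the output should match nn.SiLU
print(torch.allclose(swish(x), nn.SiLU()(x), atol=1e-6))
```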

HardSwish

Introduced in the paper "Searching for MobileNetV3".

$$ f(x) = \frac{x \times \text{ReLU6}(x + 3)}{6} $$

It is actually implemented piecewise as follows:

$$ f(x) = \begin{cases} 0, & \text{if } x \leq -3\\ x, & \text{if } x \geq +3\\ \frac{x \times (x+3)}{6}, & \text{otherwise} \end{cases} $$
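The ReLU6-based formula and PyTorch's built-in `nn.Hardswish` should give the same values. A quick sketch to check this on a small grid (the grid itself is an arbitrary choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-5, 5, steps=11)  # -5, -4, ..., 5

hardswish = nn.Hardswish()
manual = x * F.relu6(x + 3) / 6  # x * ReLU6(x + 3) / 6

print(hardswish(x))
print(torch.allclose(hardswish(x), manual, atol=1e-6))
```

Outside the interval (-3, 3) the function is exactly 0 or exactly x, so only the middle piece needs the quadratic.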

(Left) GELU, SiLU graphs from here; (Right) Hard Swish from the PyTorch documentation

GLU and variants

This section is heavily based on the paper "GLU Variants Improve Transformer".

GLU

Gated Linear Unit (GLU): a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is passed through a sigmoid.

$$ \text{GLU}(x, W, V, b, c) = \sigma (Wx + b) \odot (Vx + c) $$

They also suggest omitting the activation, which they call a bilinear layer:

$$ \text{Bilinear}(x, W, V, b, c) = (Wx + b) \odot (Vx + c) $$

Note: The bias terms are often omitted.
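Here is a minimal sketch of the GLU equation as a PyTorch module, using two `nn.Linear` layers for $W$ and $V$. The class name and the dimensions are placeholders of mine. PyTorch also provides `F.glu`, which takes a single tensor, splits it in half along a dimension and computes `first_half * sigmoid(second_half)`; the last lines check that both routes agree.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLU(nn.Module):
    """sigma(Wx + b) * (Vx + c), written with two separate linear layers."""

    def __init__(self, input_dim: int = 10, hidden_dim: int = 20, bias: bool = True):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim, bias=bias)  # gate projection
        self.V = nn.Linear(input_dim, hidden_dim, bias=bias)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.W(x)) * self.V(x)


glu = GLU()
x = torch.randn(10)
out = glu(x)
print(out.shape)  # torch.Size([20])

# F.glu expects [value, gate] stacked along `dim` and applies the sigmoid to the gate
stacked = torch.cat((glu.V(x), glu.W(x)), dim=-1)
print(torch.allclose(out, F.glu(stacked, dim=-1), atol=1e-6))
```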

GLU Variants

Any activation function could be used in place of $\sigma$ in the GLU equation, giving rise to a family of GLU variants.

$$ \text{ReGLU}(x, W, V, b, c) = \text{ReLU}(Wx + b) \odot (Vx + c) $$

$$ \text{SwiGLU}(x, W, V, b, c) = \text{SiLU}(Wx + b) \odot (Vx + c) $$

$$ \text{GeGLU}(x, W, V, b, c) = \text{GELU}(Wx + b) \odot (Vx + c) $$

Example Practical Implementation

# ReGLU could be implemented like this in PyTorch

class ReGLU(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=20):
        super().__init__()
        # Two independent linear projections of the same input
        self.W = nn.Linear(input_dim, hidden_dim)
        self.V = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        # ReLU(Wx + b) * (Vx + c), matching the ReGLU definition above
        return self.relu(self.W(x)) * self.V(x)

# Create a tensor with 10 random numbers
tensor = torch.randn(10)

# Create an instance of the ReGLU class
reglu = ReGLU()

# Apply the ReGLU function to the tensor
output = reglu(tensor)
print(output)

tensor([ 0.0000,  0.0074,  0.0000, -0.0000, -0.0753, -0.0598, -0.0000, -0.0000,
        -0.0000,  0.1110, -0.0109,  0.2933, -0.0185,  0.0000, -0.0016,  0.0250,
         0.0000,  0.3512,  0.0000,  0.0000], grad_fn=<MulBackward0>)
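Since the variants differ only in the activation applied to the gate, one module can cover the whole family by taking the activation as an argument. The sketch below is a hypothetical generalization of the ReGLU example (the name `GLUVariant` and the default sizes are my own), with biases omitted.

```python
import torch
import torch.nn as nn


class GLUVariant(nn.Module):
    """activation(Wx) * (Vx) for an arbitrary gate activation (illustrative)."""

    def __init__(self, activation: nn.Module, input_dim: int = 10, hidden_dim: int = 20):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim, bias=False)
        self.V = nn.Linear(input_dim, hidden_dim, bias=False)
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(self.W(x)) * self.V(x)


variants = {
    "GLU": nn.Sigmoid(),
    "Bilinear": nn.Identity(),  # no activation on the gate
    "ReGLU": nn.ReLU(),
    "GeGLU": nn.GELU(),
    "SwiGLU": nn.SiLU(),        # Swish with beta = 1
}

x = torch.randn(10)
shapes = {}
for name, activation in variants.items():
    layer = GLUVariant(activation)
    shapes[name] = layer(x).shape
    print(name, shapes[name])
```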

Implementations from Projects

Here are a few practical implementations from LLMs.

GPT-2

GPT-2 uses the tanh-based approximate version of GELU.
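To show where the activation sits, here is a rough PyTorch sketch of a GPT-2 style MLP block: expand to four times the model width, apply approximate GELU, project back. This is my own reconstruction, not the original code; the layer names `c_fc` and `c_proj`, the use of `nn.Linear` and the width of 768 are assumptions for illustration.

```python
import torch
import torch.nn as nn


class GPT2StyleMLP(nn.Module):
    """Sketch of a GPT-2 style feed-forward block (names and sizes are assumptions)."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.c_fc = nn.Linear(d_model, 4 * d_model)    # expand
        self.act = nn.GELU(approximate="tanh")         # tanh approximation of GELU
        self.c_proj = nn.Linear(4 * d_model, d_model)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(self.act(self.c_fc(x)))


mlp = GPT2StyleMLP()
x = torch.randn(2, 5, 768)  # (batch, sequence, d_model)
out = mlp(x)
print(out.shape)  # torch.Size([2, 5, 768])
```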

LLaMA

The Python code

F.silu(self.w1(x)) * self.w3(x)

is the SwiGLU implementation; the function as a whole is the FFN (MLP layer) of the transformer.
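For context, here is a self-contained sketch of how that expression could sit inside a full feed-forward module. It swaps in plain `nn.Linear` layers without bias and uses toy dimensions; the output projection `w2` and the sizes are assumptions of mine rather than a copy of the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Sketch of a LLaMA-style SwiGLU feed-forward block (toy sizes, plain nn.Linear)."""

    def __init__(self, dim: int = 64, hidden_dim: int = 172):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(w1 x) * (w3 x), then project back to the model width
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


ffn = FeedForward()
x = torch.randn(2, 5, 64)  # (batch, sequence, dim)
out = ffn(x)
print(out.shape)  # torch.Size([2, 5, 64])
```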

Gemma

The GeGLU is a little more hidden in this function. The screenshot shows the whole FFN (MLP layer) function, but if you look carefully you can make out the GeGLU implementation. (Hint: look at the fuse variable.)
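Since the screenshot may not be visible, here is a sketch of what such an MLP with a `fuse` variable could look like. Apart from `fuse`, which the hint above mentions, the layer names, the absence of biases, the sizes and the choice of exact GELU are assumptions of mine, so treat this as an illustration of GeGLU and not as a copy of Gemma's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GemmaStyleMLP(nn.Module):
    """Sketch of a GeGLU feed-forward block (layer names and sizes are assumptions)."""

    def __init__(self, hidden_size: int = 64, intermediate_size: int = 256):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.gelu(self.gate_proj(x))  # GELU(Wx): the gate
        up = self.up_proj(x)              # Vx: the value
        fuse = gate * up                  # GeGLU: GELU(Wx) * (Vx)
        return self.down_proj(fuse)


mlp = GemmaStyleMLP()
x = torch.randn(2, 5, 64)  # (batch, sequence, hidden_size)
out = mlp(x)
print(out.shape)  # torch.Size([2, 5, 64])
```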
