Here are a few observations:
- GPT-2, developed by OpenAI, opts for the GELU (Gaussian Error Linear Unit) activation function.
- LLaMA, a creation of Facebook Research, embraces the SwiGLU activation function.
- Gemma, whose reference PyTorch implementation is published by Google, adopts the GeGLU activation function.
So what are these new activation functions? How should one go about implementing them in PyTorch?
In this blog post I try to understand the definitions of these activation functions and how they can be implemented in PyTorch. At the end I will show the definitions from the actual implementations in GPT-2, Gemma and LLaMA.
Imports
import torch
import torch.nn as nn
Create a new tensor with random values to use as input for the activation functions.
tensor = torch.randn(10)
print(tensor)
print(tensor.shape)
tensor([-0.8281, 1.0340, -0.4363, -0.4764, 0.6419, -0.1156, 1.4339, 1.5654,
0.7124, -0.5667])
torch.Size([10])
ReLU Variants
ReLU
Rectified Linear Unit
$$ f(x) = \max(0, x) $$
relu = nn.ReLU()
output = relu(tensor)
print(output)
tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124,
0.0000])
CReLU
Concatenated ReLU
$$ f(x) = [\text{ReLU}(x), \text{ReLU}(-x)] $$
Observe that the dimension of the output tensor is twice that of the input tensor.
crelu_output = torch.cat((relu(tensor), relu(-tensor)))
print(crelu_output)
tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124,
0.0000, 0.8281, 0.0000, 0.4363, 0.4764, 0.0000, 0.1156, 0.0000, 0.0000,
0.0000, 0.5667])
Leaky ReLU
$$ f(x) = \begin{cases} x, & \text{if } x > 0\\ \text{negative\_slope} \times x, & \text{otherwise} \end{cases} $$
leaky_relu = nn.LeakyReLU(negative_slope=0.1)
output = leaky_relu(tensor)
print(output)
tensor([-0.0828, 1.0340, -0.0436, -0.0476, 0.6419, -0.0116, 1.4339, 1.5654,
0.7124, -0.0567])
ReLU6
$$ f(x) = \begin{cases} 0, & \text{if } x \leq 0\\ 6, & \text{if } x \geq 6\\ x, & \text{otherwise} \end{cases} $$
relu6 = nn.ReLU6()
output = relu6(tensor)
print(output)
tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124,
0.0000])
Other Linear Unit Variants
GELU
Gaussian Error Linear Unit
$$ GELU(x) = x \times \Phi(x) $$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution (mean = 0, standard deviation = 1)
There are approximate versions of GELU that are faster to compute, at the cost of some exactness:
$$ GELU(x) = 0.5 \times x \times (1 + \tanh(\sqrt{2/\pi} \times (x + 0.044715 \times x^3))) $$
or
$$ GELU(x) = x \times \sigma(1.702 \times x) $$
Motivation for GELU 💡
The motivation mentioned in the paper is based on the following observations:
- ReLU deterministically multiplies the input by 0 or 1
- Dropout (a regularization technique) also multiplies by 0 or 1, but stochastically
- It is possible to apply the 0-1 mask stochastically while also depending on the input (this is similar to Adaptive Dropout and Zoneout); the mask is given by $m \sim \text{Bernoulli}(\Phi(x))$
Since $x \sim N(0, 1)$ after batch normalization anyway, inputs have a higher probability of being dropped as $x$ decreases.
They note that we often want a deterministic decision from the output, so they propose GELU as the expected transformation of this stochastic regularizer.
$$ E[m \cdot x] = \Phi(x) \cdot 1x + (1 - \Phi(x)) \cdot 0x = x \cdot \Phi(x) $$
I don't fully understand the motivation myself, but I guess having a partial idea is better than having no idea. They want to combine ideas from dropout and activation functions to get a better activation function.
gelu = nn.GELU(approximate='none') # the exact (non-approximate) version of GELU
output = gelu(tensor)
print(output)
tensor([-0.1688, 0.8783, -0.1445, -0.1510, 0.4747, -0.0525, 1.3252, 1.4735,
0.5428, -0.1618])
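For comparison, here is a minimal sketch of the tanh approximation; its outputs should be very close to the exact version above.
# tanh approximation of GELU
gelu_tanh = nn.GELU(approximate='tanh')
approx_output = gelu_tanh(tensor)
print(approx_output)
# largest difference from the exact version should be small
print((approx_output - output).abs().max())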
SiLU / Swish
Sigmoid Linear Unit
$$ f(x) = x \times \sigma(x) $$
where $\sigma(x)$ is the sigmoid function.
This is very similar to GELU, but $\sigma(x)$ is used instead of $\Phi(x)$.
There is also a version of Swish with a learnable parameter:
$$ f(x) = x \times \sigma(\beta x) $$
where $\beta$ is a learnable parameter
silu = nn.SiLU()
output = silu(tensor)
print(output)
tensor([-0.2518, 0.7628, -0.1713, -0.1825, 0.4206, -0.0544, 1.1579, 1.2948,
0.4780, -0.2051])
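PyTorch's nn.SiLU fixes $\beta = 1$. A minimal sketch of the learnable-$\beta$ Swish could look like this (the module name and the choice to initialise $\beta$ to 1 are my own):
class Swish(nn.Module):
    def __init__(self):
        super().__init__()
        # beta is a learnable scalar, initialised to 1 so we start from SiLU
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

swish = Swish()
print(swish(tensor))  # with beta = 1 this should match nn.SiLU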
HardSwish
Introduced in Searching for MobileNetV3
$$ f(x) = \frac{x \times \text{ReLU6}(x + 3)}{6} $$
It is actually implemented piecewise as follows:
$$ f(x) = \begin{cases} 0, & \text{if } x \leq -3\\ x, & \text{if } x \geq +3\\ \frac{x(x+3)}{6}, & \text{otherwise} \end{cases} $$
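PyTorch provides this directly as nn.Hardswish:
hardswish = nn.Hardswish()
output = hardswish(tensor)
print(output)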
GLU and variants
This section is heavily based on the paper GLU Variants Improve Transformer.
GLU
Gated Linear Unit (GLU): a neural network layer defined by the component-wise product of two linear transformations of the input
$$ \text{GLU}(x, W, V, b, c) = \sigma (Wx + b) \odot (Vx + c) $$
They also suggest omitting the activation, which they call a bilinear layer:
$$ \text{Bilinear}(x, W, V, b, c) = (Wx + b) \odot (Vx + c) $$
Note: The bias terms are often omitted.
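PyTorch also ships nn.GLU, which expects the two linear projections to be fused into a single tensor: it splits its input in half along a dimension and multiplies the first half by the sigmoid of the second half. A minimal sketch (the 10 and 20 dimensions here are arbitrary):
# fused projection producing both halves at once
proj = nn.Linear(10, 2 * 20)
glu = nn.GLU(dim=-1)
output = glu(proj(torch.randn(10)))
print(output.shape)  # torch.Size([20])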
GLU Variants
Any activation function could be used in place of $\sigma$ in the GLU equation, giving rise to a family of GLU variants.
$$ \text{ReGLU}(x, W, V, b, c) = \text{ReLU}(Wx + b) \odot (Vx + c) $$
$$ \text{SwiGLU}(x, W, V, b, c) = \text{SiLU}(Wx + b) \odot (Vx + c) $$
$$ \text{GeGLU}(x, W, V, b, c) = \text{GELU}(Wx + b) \odot (Vx + c) $$
Example Practical Implementation
# ReGLU could be implemented like this in PyTorch
class ReGLU(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=20):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim)
        self.V = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        # ReGLU(x) = ReLU(Wx + b) ⊙ (Vx + c)
        return self.relu(self.W(x)) * self.V(x)
# Create a tensor with 10 random numbers
tensor = torch.randn(10)
# Create an instance of the ReGLU class
reglu = ReGLU()
# Apply the ReGLU function to the tensor
output = reglu(tensor)
print(output)
tensor([ 0.0000, 0.0074, 0.0000, -0.0000, -0.0753, -0.0598, -0.0000, -0.0000,
-0.0000, 0.1110, -0.0109, 0.2933, -0.0185, 0.0000, -0.0016, 0.0250,
0.0000, 0.3512, 0.0000, 0.0000], grad_fn=<MulBackward0>)
Implementations from Projects
Here are a few practical implementations from popular LLMs.
GPT-2
GPT-2 uses an approximate version of GELU.
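Concretely, it is the tanh approximation shown earlier. A minimal PyTorch sketch of that formula (the original GPT-2 code is in TensorFlow, so this is a translation rather than a verbatim copy):
import math

def gpt2_gelu(x):
    # tanh approximation of GELU, matching the formula above
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

# should agree with PyTorch's built-in tanh approximation
print(torch.allclose(gpt2_gelu(tensor), nn.GELU(approximate='tanh')(tensor)))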
LLaMA
The Python code
F.silu(self.w1(x)) * self.w3(x)
is the SwiGLU part; the whole function implements the FFN (MLP layer) of the transformer block.
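A simplified sketch of that FFN: the real LLaMA code uses model-parallel linear layers and computes the hidden dimension differently, so the plain nn.Linear layers and the dimensions below are stand-ins.
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, dim=10, hidden_dim=20):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        # SwiGLU: SiLU(w1 x) ⊙ (w3 x), projected back down by w2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

print(FeedForward()(torch.randn(10)).shape)  # torch.Size([10])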
Gemma
The GeGLU is a little more hidden in this function. The screenshot shows the whole FFN (MLP layer) function, but if you look carefully you can make out the GeGLU implementation. (Hint: look at the fuse variable.)
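From memory, the Gemma PyTorch MLP looks roughly like the sketch below; treat the layer names and the tanh-approximate GELU as my recollection of the official code rather than an exact copy.
class GemmaStyleMLP(nn.Module):
    def __init__(self, dim=10, hidden_dim=20):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim)
        self.up_proj = nn.Linear(dim, hidden_dim)
        self.down_proj = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        up = self.up_proj(x)
        fuse = gate * up  # the GeGLU: GELU(gate) ⊙ up
        return self.down_proj(fuse)

print(GemmaStyleMLP()(torch.randn(10)).shape)  # torch.Size([10])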