
Transformer Activation Functions and their Details

Revisiting Activation Functions in PyTorch and Their Details



Here are a few observations:

GPT-2, developed by OpenAI, uses the GELU (Gaussian Error Linear Unit) activation function.

On the other hand, LLaMA, from Facebook Research, uses the SwiGLU activation function.

Meanwhile, Gemma, in Google's PyTorch implementation, uses the GeGLU activation function.

So what are these new activation functions? How should one go about implementing them in PyTorch?

In this blog post I try to understand the definitions of these activation functions and how they can be implemented in PyTorch. At the end I show the definitions from the actual implementations in GPT-2, LLaMA and Gemma.

Imports

import torch
import torch.nn as nn

Create a new tensor with random values to use as input for the activation functions:

tensor = torch.randn(10)
print(tensor)
print(tensor.shape)

tensor([-0.8281, 1.0340, -0.4363, -0.4764, 0.6419, -0.1156, 1.4339, 1.5654, 0.7124, -0.5667])
torch.Size([10])

ReLU Variants

ReLU

Rectified Linear Unit

$$f(x) = \max(0, x)$$

relu = nn.ReLU()
output = relu(tensor)
print(output)

tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124, 0.0000])

CReLU

Concatenated ReLU

$$f(x) = [\text{ReLU}(x), \text{ReLU}(-x)]$$

Note that the output tensor has twice as many elements as the input tensor.

crelu_output = torch.cat((relu(tensor), relu(-tensor)))
print(crelu_output)

tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124, 0.0000, 0.8281, 0.0000, 0.4363, 0.4764, 0.0000, 0.1156, 0.0000, 0.0000, 0.0000, 0.5667])
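For batched inputs, the concatenation needs an explicit dimension. As far as I know PyTorch has no built-in CReLU module, so here is a minimal sketch of one; the class name `CReLU` and its `dim` argument are my own choices, not a PyTorch API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CReLU(nn.Module):
    """Concatenated ReLU: [ReLU(x), ReLU(-x)] joined along `dim` (hypothetical helper)."""

    def __init__(self, dim: int = -1):
        super().__init__()
        self.dim = dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat((F.relu(x), F.relu(-x)), dim=self.dim)


batch = torch.randn(4, 10)   # 4 samples with 10 features each
crelu_out = CReLU()(batch)
print(crelu_out.shape)       # the feature dimension doubles: 10 -> 20
```

A handy sanity check: the first half minus the second half gives back the input, since ReLU(x) - ReLU(-x) = x.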

Leaky ReLU

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ \text{negative\_slope} \times x, & \text{otherwise} \end{cases}$$

leaky_relu = nn.LeakyReLU( negative_slope=0.1)
output = leaky_relu(tensor)
print(output)

tensor([-0.0828, 1.0340, -0.0436, -0.0476, 0.6419, -0.0116, 1.4339, 1.5654, 0.7124, -0.0567])

ReLU6

$$f(x) = \begin{cases} 0, & \text{if } x \leq 0 \\ 6, & \text{if } x \geq 6 \\ x, & \text{otherwise} \end{cases}$$

relu6 = nn.ReLU6()
output = relu6(tensor)
print(output)

tensor([0.0000, 1.0340, 0.0000, 0.0000, 0.6419, 0.0000, 1.4339, 1.5654, 0.7124, 0.0000])
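Random normal inputs almost never reach 6, so the upper cap of ReLU6 is hard to see with them. The sketch below uses a few hand-picked values (my own, not the tensor above) and also shows that ReLU6 can be written as a clamp to the range [0, 6].

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.5, 6.0, 8.0])

relu6_builtin = nn.ReLU6()(x)
relu6_clamped = torch.clamp(x, min=0.0, max=6.0)  # same function, written as a clamp

# Both give the values [0.0, 0.0, 3.5, 6.0, 6.0]
print(relu6_builtin)
print(relu6_clamped)
```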

(Left) Leaky ReLU, (Middle) ReLU, (Right) ReLU6; taken from the PyTorch documentation

Other Linear Unit Variants

GELU

Gaussian Error Linear Unit

$$\text{GELU}(x) = x \times \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution (mean = 0, standard deviation = 1).

There are approximate versions of GELU that are faster to compute, at the cost of exactness:

$$\text{GELU}(x) \approx 0.5 \times x \times (1 + \tanh(\sqrt{2/\pi} \times (x + 0.044715 \times x^3)))$$

or

$$\text{GELU}(x) \approx x \times \sigma(1.702 \times x)$$

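To get a feel for how much exactness the two approximations give up, here is a small sketch that measures their largest deviation from exact GELU on a grid of inputs. It assumes a PyTorch version where `F.gelu` accepts the string-valued `approximate` argument (`"none"` or `"tanh"`); the grid range and the variable names are arbitrary choices of mine.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=801)

exact = F.gelu(x, approximate="none")          # x * Phi(x)
tanh_approx = F.gelu(x, approximate="tanh")    # tanh-based approximation
sigmoid_approx = x * torch.sigmoid(1.702 * x)  # sigmoid-based approximation

tanh_err = (exact - tanh_approx).abs().max().item()
sigmoid_err = (exact - sigmoid_approx).abs().max().item()

print(f"max |exact - tanh approx|    = {tanh_err:.5f}")
print(f"max |exact - sigmoid approx| = {sigmoid_err:.5f}")
```

On this grid the tanh version tracks the exact function much more closely than the sigmoid version, which is the cheaper of the two.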
Motivation for GELU 💡

The motivation mentioned in the paper is based on the following observations:

  1. ReLU deterministically multiplies the input by 0 or 1.
  2. Dropout (a regularization technique) also multiplies by 0 or 1, but stochastically.
  3. It is possible to multiply by a 0-1 mask stochastically while also depending on the input (this is similar to Adaptive Dropout and Zoneout). The mask is given by $m \sim \text{Bernoulli}(\Phi(x))$.

Since $x \sim N(0, 1)$ after batch normalization anyway, inputs have a higher probability of getting dropped as $x$ decreases.

The authors say that we often want a deterministic decision from the output, so they propose GELU as the expected transformation of the stochastic regularizer.

$$E[x \cdot m] = 1 \cdot x \cdot \Phi(x) + 0 \cdot x \cdot (1 - \Phi(x)) = x \cdot \Phi(x)$$
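The expectation above can be checked numerically. The sketch below samples many Bernoulli masks with keep probability Φ(x), averages x · m, and compares the result with x · Φ(x) and with PyTorch's exact GELU. The input values, the sample count and the seed are arbitrary choices of mine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.tensor([-1.5, -0.5, 0.0, 0.5, 1.5])
phi = torch.distributions.Normal(0.0, 1.0).cdf(x)  # Phi(x), the keep probability

n_samples = 200_000
# Each mask entry is 1 with probability Phi(x) and 0 otherwise
masks = (torch.rand(n_samples, x.numel()) < phi).float()
monte_carlo = (x * masks).mean(dim=0)              # empirical E[x * m]

print(monte_carlo)   # close to the two lines below, up to sampling noise
print(x * phi)       # analytic x * Phi(x)
print(F.gelu(x))     # exact GELU
```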

I don't fully understand the motivation myself, but I guess having a partial idea is better than having no idea. They somehow want to take ideas from dropout and from activation functions and combine them to get a better activation function.

gelu = nn.GELU(approximate='none') # using the exact version of GELU; `approximate` takes a string ('none' or 'tanh')
output = gelu(tensor)
print(output)

tensor([-0.1688, 0.8783, -0.1445, -0.1510, 0.4747, -0.0525, 1.3252, 1.4735, 0.5428, -0.1618])

SiLU / Swish

Sigmoid Linear Unit

$$f(x) = x \times \sigma(x)$$

where $\sigma(x)$ is the sigmoid function.

This is very similar to GELU, but $\sigma(x)$ is used instead of $\Phi(x)$.

There is also a version of Swish with a learnable parameter:

$$f(x) = x \times \sigma(\beta x)$$

where $\beta$ is a learnable parameter.

silu = nn.SiLU()
output = silu(tensor)
print(output)

tensor([-0.2518, 0.7628, -0.1713, -0.1825, 0.4206, -0.0544, 1.1579, 1.2948, 0.4780, -0.2051])
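`nn.SiLU` covers the fixed case β = 1. For the learnable-β version of Swish, a minimal sketch is to wrap β in an `nn.Parameter`; the class name `Swish` and the `beta_init` argument are my own and not a PyTorch API.

```python
import torch
import torch.nn as nn


class Swish(nn.Module):
    """Swish with a learnable scalar beta: f(x) = x * sigmoid(beta * x) (hypothetical helper)."""

    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)


swish = Swish()
x = torch.randn(10)
swish_out = swish(x)

# With beta = 1 the result matches SiLU
print(torch.allclose(swish_out, nn.SiLU()(x), atol=1e-6))

# beta receives a gradient, so an optimizer can update it during training
swish_out.sum().backward()
print(swish.beta.grad)
```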

HardSwish

Introduced in Searching for MobileNetV3

$$f(x) = \frac{x \times \text{ReLU6}(x + 3)}{6}$$

It is actually implemented piecewise as follows:

$$f(x) = \begin{cases} 0, & \text{if } x \leq -3 \\ x, & \text{if } x \geq +3 \\ \frac{x (x + 3)}{6}, & \text{otherwise} \end{cases}$$
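The two forms of HardSwish describe the same function. Here is a short sketch that evaluates the ReLU6 formula, the piecewise definition and PyTorch's `nn.Hardswish` on the same grid and checks that they agree; the grid and the variable names are my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-5.0, 5.0, steps=101)

# Compact form: x * ReLU6(x + 3) / 6
formula = x * F.relu6(x + 3.0) / 6.0

# Piecewise form: 0 below -3, x above +3, x * (x + 3) / 6 in between
piecewise = torch.where(
    x <= -3.0,
    torch.zeros_like(x),
    torch.where(x >= 3.0, x, x * (x + 3.0) / 6.0),
)

builtin = nn.Hardswish()(x)

print(torch.allclose(formula, piecewise, atol=1e-6))
print(torch.allclose(formula, builtin, atol=1e-6))
```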

(Left) GELU, SiLU graphs from here; (Right) Hard Swish from the PyTorch documentation

GLU and variants

This section is heavily based on the paper GLU Variants Improve Transformer

GLU

Gated Linear Unit (GLU): a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is passed through a sigmoid.

$$\text{GLU}(x, W, V, b, c) = \sigma(Wx + b) \odot (Vx + c)$$

They also suggest omitting the activation, which they call a bilinear layer:

$$\text{Bilinear}(x, W, V, b, c) = (Wx + b) \odot (Vx + c)$$

Note: the bias terms are often omitted.
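Here is a minimal sketch of GLU and the bilinear layer with two `nn.Linear` layers, plus a check against PyTorch's built-in `F.glu`. `F.glu` splits its input in half along `dim` and multiplies the first half by the sigmoid of the second half, so the ungated branch has to come first in the concatenation. The layer sizes are arbitrary choices of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

input_dim, hidden_dim = 10, 20        # illustrative sizes

W = nn.Linear(input_dim, hidden_dim)  # gate branch: sigma(Wx + b)
V = nn.Linear(input_dim, hidden_dim)  # linear branch: Vx + c

x = torch.randn(input_dim)

glu_manual = torch.sigmoid(W(x)) * V(x)

# F.glu computes first_half * sigmoid(second_half),
# so the linear branch goes first and the gate branch second.
glu_builtin = F.glu(torch.cat((V(x), W(x)), dim=-1), dim=-1)

bilinear = W(x) * V(x)                # the same layer with the sigmoid removed

print(torch.allclose(glu_manual, glu_builtin, atol=1e-6))
print(glu_manual.shape, bilinear.shape)
```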

GLU Variants

Any activation function can be used in place of $\sigma$ in the GLU equation, giving rise to a family of GLU variants.

$$\text{ReGLU}(x, W, V, b, c) = \text{ReLU}(Wx + b) \odot (Vx + c)$$
$$\text{SwiGLU}(x, W, V, b, c) = \text{SiLU}(Wx + b) \odot (Vx + c)$$
$$\text{GeGLU}(x, W, V, b, c) = \text{GELU}(Wx + b) \odot (Vx + c)$$

Example Practical Implementation

# ReGLU could be implemented like this in PyTorch
class ReGLU(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=20):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim)  # gated branch: ReLU(Wx + b)
        self.V = nn.Linear(input_dim, hidden_dim)  # linear branch: Vx + c
        self.relu = nn.ReLU()

    def forward(self, x):
        # ReGLU(x) = ReLU(Wx + b) * (Vx + c): the activation goes on the W branch
        return self.relu(self.W(x)) * self.V(x)

# Create a tensor with 10 random numbers
tensor = torch.randn(10)
# Create an instance of the ReGLU class
reglu = ReGLU()
# Apply the ReGLU function to the tensor
output = reglu(tensor)
print(output)

tensor([ 0.0000, 0.0074, 0.0000, -0.0000, -0.0753, -0.0598, -0.0000, -0.0000,
        -0.0000, 0.1110, -0.0109, 0.2933, -0.0185, 0.0000, -0.0016, 0.0250,
        0.0000, 0.3512, 0.0000, 0.0000], grad_fn=<MulBackward0>)
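Since the variants only differ in the activation applied to the gate branch, the same layer can be written once and parameterized by the activation. This is a sketch under that assumption; the class name `GLUVariant` and its constructor arguments are hypothetical, not taken from the paper or from PyTorch.

```python
import torch
import torch.nn as nn


class GLUVariant(nn.Module):
    """activation(Wx + b) * (Vx + c) for any elementwise activation (hypothetical helper)."""

    def __init__(self, input_dim: int, hidden_dim: int, activation: nn.Module):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim)
        self.V = nn.Linear(input_dim, hidden_dim)
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(self.W(x)) * self.V(x)


x = torch.randn(3, 10)  # batch of 3 vectors with 10 features each
variants = {
    "GLU": GLUVariant(10, 20, nn.Sigmoid()),
    "ReGLU": GLUVariant(10, 20, nn.ReLU()),
    "SwiGLU": GLUVariant(10, 20, nn.SiLU()),
    "GeGLU": GLUVariant(10, 20, nn.GELU()),
    "Bilinear": GLUVariant(10, 20, nn.Identity()),
}

shapes = {name: tuple(layer(x).shape) for name, layer in variants.items()}
print(shapes)  # every variant maps 10 features to 20
```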

Implementations from Projects

Here are a few practical implementations from LLMs.

GPT-2

GELU implementation from the GPT-2 model definition.

GPT-2 uses an approximate version of GELU.
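The GPT-2 source itself is not reproduced here. As a rough sketch of what such an approximate GELU looks like, the function below writes out the tanh approximation from the GELU section term by term and compares it with PyTorch's built-in tanh mode. The function name `gelu_tanh` is my own, and the comparison assumes a PyTorch version where `F.gelu` accepts `approximate="tanh"`.

```python
import math

import torch
import torch.nn.functional as F


def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    """Tanh approximation of GELU, written out explicitly (hypothetical helper)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
    return 0.5 * x * (1.0 + torch.tanh(inner))


x = torch.randn(1000)
max_diff = (gelu_tanh(x) - F.gelu(x, approximate="tanh")).abs().max().item()
print(max_diff)  # only floating-point rounding noise is expected here
```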

LLaMA

SwiGLU implementation from the LLaMA model definition.

The Python code

F.silu(self.w1(x)) * self.w3(x)

is the SwiGLU implementation; the whole function is the FFN (MLP layer) of the transformer.
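To see that expression in context, here is a sketch of a SwiGLU feed-forward block built around it. Only the `F.silu(self.w1(x)) * self.w3(x)` expression comes from the text above; the class name, the output projection `w2`, the use of plain bias-free `nn.Linear` layers and the sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate (illustrative sketch, not the LLaMA source)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch, goes through SiLU
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # linear branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # assumed projection back to `dim`

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(W1 x) * (W3 x), followed by the output projection
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


ffn = SwiGLUFeedForward(dim=16, hidden_dim=32)
tokens = torch.randn(2, 5, 16)  # (batch, sequence, features)
ffn_out = ffn(tokens)
print(ffn_out.shape)            # same shape as the input
```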

Gemma

GeGLU implementation from the Gemma model definition.

The GeGLU is a little more hidden in this function. The screenshot shows the whole FFN (MLP layer) function, but if you look carefully you can make out the GeGLU implementation. (Hint: look at the fuse variable.)
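Since the screenshot is not reproduced here, the sketch below shows the general shape of a GeGLU feed-forward block with the product stored in a `fuse` variable. Only the `fuse` name comes from the hint above; the class name, the layer names `gate_proj`, `up_proj` and `down_proj`, the use of exact GELU and the sizes are assumptions of mine and may differ from the actual Gemma code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLUFeedForward(nn.Module):
    """Feed-forward block with a GeGLU gate (illustrative sketch, not the Gemma source)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # assumed name
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # assumed name
        self.down_proj = nn.Linear(intermediate_size, hidden_size)  # assumed name

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.gelu(self.gate_proj(x))  # GELU(Wx + b)
        up = self.up_proj(x)              # Vx + c
        fuse = gate * up                  # the GeGLU product
        return self.down_proj(fuse)


mlp = GeGLUFeedForward(hidden_size=16, intermediate_size=64)
hidden_states = torch.randn(2, 5, 16)
mlp_out = mlp(hidden_states)
print(mlp_out.shape)  # same shape as the input
```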
