Transformer Activation Functions and their Details

Here are a few observations: GPT-2, developed by OpenAI, opts for the GELU (Gaussian Error Linear Unit) activation function. LLaMA, a creation of Facebook Research, on the other hand embraces the SwiGLU activation function. Meanwhile Gemma, Google's model with a reference implementation in PyTorch, adopts the GeGLU activation function. So what are these new activation functions? How should one go about implementing them in PyTorch? In this blog post I try to understand the definitions of these activation functions and how they could be implemented in PyTorch....

March 5, 2024 · 6 min · Sathvik Joel
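As a quick taste of what the post covers, here is a minimal sketch of how the gated variants could be written in PyTorch; the class names, projection layout, and dimensions below are illustrative assumptions, not the exact modules used by GPT-2, LLaMA, or Gemma.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU(x) = Swish(x W) * (x V) -- the gated activation used in LLaMA-style FFNs.
    Hidden dimensions here are illustrative, not taken from any released model."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.v = nn.Linear(dim, hidden_dim, bias=False)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.silu(self.w(x)) * self.v(x)  # silu == Swish with beta = 1


class GeGLU(nn.Module):
    """GeGLU(x) = GELU(x W) * (x V) -- the same gating idea, with GELU as the gate."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)
        self.v = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.gelu(self.w(x)) * self.v(x)


x = torch.randn(2, 8, 512)           # (batch, seq, dim)
print(F.gelu(x).shape)               # plain GELU, as in GPT-2
print(SwiGLU(512, 1376)(x).shape)    # LLaMA-style gated activation
print(GeGLU(512, 1376)(x).shape)     # Gemma-style gated activation
```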

Understanding Andrej's Tokenizer Video

This blog is based on Andrej's video on the Tokenizer 📽️. In an interview with Lex Fridman, Andrej said he takes 10 hours to make 1 hour of content; this is me trying to decode the other 9 hours. I won't be concentrating on the Python implementation of the BPE algorithm itself. The idea is to uncover the details and subtle points around the video and look closely at this very dense lecture by Andrej....

March 4, 2024 · 17 min · Sathvik Joel

FlashAttention: Before flash

In this blog I delve into the intricate mathematical subtleties of Flash Attention, a paper I recently came across. The primary aim is to unravel these mathematical complexities, offering readers a key to a deeper comprehension of the complete paper. The focus remains exclusively on the mathematical nuances, tailored for readers with a keen mathematical acumen. Flash Attention: Transformers rely on a core operation called attention calculation....

August 30, 2023 · 7 min · Sathvik Joel
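For context, the attention calculation referred to above is standard scaled dot-product attention. A minimal PyTorch sketch is given below; the function name and tensor shapes are illustrative assumptions, and FlashAttention computes the same result while tiling the softmax so the full score matrix is never materialized in HBM.

```python
import math
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    A naive reference implementation that materializes the full (seq x seq)
    score matrix -- exactly the memory cost FlashAttention avoids."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (..., seq, seq)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                               # (..., seq, head_dim)

q = k = v = torch.randn(1, 4, 128, 64)  # (batch, heads, seq, head_dim)
out = attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 128, 64])
```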