# GPT from scratch

## Introduction

Nous allons dans ce cours construire "à la main" un réseau de neurones correpondant à un [GPT](https://en.wikipedia.org/wiki/Generative_pre-trained_transformer) (generative pre-trained transformer).

- [V+2017] Vaswany, A. _et al._ ["_Attention Is All You Need_"](https://arxiv.org/abs/1706.03762), 2017 (Transformer)
- [R+2018] Radford, A. _et al._ ["_Improving Language Understanding by Generative Pre-Training_"](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), 2018 (GPT-1)
- [R+2019] Radford, A. _et_al._ ["_Language Models are Unsupervised Multitask Learners_"](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), 2019 (GPT-2)
- [B+2020] Brown T. B. _et al._ ["_Language Models are Few-Shot Learners_"](https://arxiv.org/abs/2005.14165), 2020, (GPT-3)
- [O+2020] [OpenAI ChatGPT blog post](https://openai.com/blog/chatgpt)

- Une vidéo très bien faite de [3Blue1Brown](https://www.3blue1brown.com/) sur le mécanisme d'attention: [_Attention in transformers, visually explained_](https://www.youtube.com/watch?v=eMlx5fFNoYc)

## Reprise

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
device = 'cpu'

In [2]:
class TextData(object):

    def __init__(self, filename, device=device):
        self.filename = filename
        with open(self.filename, 'r', encoding="utf-8") as f:
            self.text = f.read()
        self.size = len(self.text)
        self.chars = sorted(list(set(self.text)))
        self.vocab_size = len(self.chars)
        self.ctoi = {c:i for i,c in enumerate(self.chars)}
        self.itoc = {i:c for c,i in self.ctoi.items()}
        print(self.itoc)
        self.encode = lambda s: [self.ctoi[c] for c in s]
        self.decode = lambda l: ''.join([self.itoc[i] for i in l])
        self.data = torch.tensor(self.encode(self.text), dtype=torch.long, device=device)
        n = int(0.9*len(self.data))
        self.train_data = self.data[:n]
        self.val_data = self.data[n:]

    def __repr__(self):
        l = []
        chars_str = ''.join(self.chars)
        l.append("<TextData")
        l.append(f'  filename="{self.filename}"')
        l.append(f'  size="{self.size}"')
        l.append(f'  vocab_size="{self.vocab_size}">{chars_str.encode('utf-8')}</TextData>')
        return '\n'.join(l)

In [3]:
text = TextData('civil.md')
print(text)

{0: '\n', 1: ' ', 2: '"', 3: '#', 4: '%', 5: "'", 6: '(', 7: ')', 8: '*', 9: ',', 10: '-', 11: '.', 12: '0', 13: '1', 14: '2', 15: '3', 16: '4', 17: '5', 18: '6', 19: '7', 20: '8', 21: '9', 22: ':', 23: ';', 24: 'A', 25: 'B', 26: 'C', 27: 'D', 28: 'E', 29: 'F', 30: 'G', 31: 'H', 32: 'I', 33: 'J', 34: 'L', 35: 'M', 36: 'N', 37: 'O', 38: 'P', 39: 'Q', 40: 'R', 41: 'S', 42: 'T', 43: 'U', 44: 'V', 45: 'W', 46: 'X', 47: 'Y', 48: 'a', 49: 'b', 50: 'c', 51: 'd', 52: 'e', 53: 'f', 54: 'g', 55: 'h', 56: 'i', 57: 'j', 58: 'l', 59: 'm', 60: 'n', 61: 'o', 62: 'p', 63: 'q', 64: 'r', 65: 's', 66: 't', 67: 'u', 68: 'v', 69: 'x', 70: 'y', 71: 'z', 72: '\xa0', 73: '°', 74: 'É', 75: 'à', 76: 'â', 77: 'ç', 78: 'è', 79: 'é', 80: 'ê', 81: 'ë', 82: 'î', 83: 'ï', 84: 'ô', 85: 'ù', 86: 'û', 87: 'œ', 88: '–', 89: '—', 90: '€'}
<TextData
  filename="civil.md"
  size="1182082"
  vocab_size="91">b'\n "#%\'()*,-.0123456789:;ABCDEFGHIJLMNOPQRSTUVWXYabcdefghijlmnopqrstuvxyz\xc2\xa0\xc2\xb0\xc3\x89\xc3\xa0\xc3\xa2\xc

In [4]:
print(text.chars)
#text = TextData('tinyshakespeare.txt')
#print(text)

['\n', ' ', '"', '#', '%', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'x', 'y', 'z', '\xa0', '°', 'É', 'à', 'â', 'ç', 'è', 'é', 'ê', 'ë', 'î', 'ï', 'ô', 'ù', 'û', 'œ', '–', '—', '€']


In [5]:
print(text.encode("La loi ne dispose que pour l'avenir ; elle n'a point d'effet rétroactif."))

[34, 48, 1, 58, 61, 56, 1, 60, 52, 1, 51, 56, 65, 62, 61, 65, 52, 1, 63, 67, 52, 1, 62, 61, 67, 64, 1, 58, 5, 48, 68, 52, 60, 56, 64, 1, 23, 1, 52, 58, 58, 52, 1, 60, 5, 48, 1, 62, 61, 56, 60, 66, 1, 51, 5, 52, 53, 53, 52, 66, 1, 64, 79, 66, 64, 61, 48, 50, 66, 56, 53, 11]


In [6]:
print(text.decode(text.encode("La loi ne dispose que pour l'avenir ; elle n'a point d'effet rétroactif.")))

La loi ne dispose que pour l'avenir ; elle n'a point d'effet rétroactif.


In [7]:
# !pip install tiktoken
import tiktoken
enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)
print(enc.encode("La loi ne dispose que pour l'avenir ; elle n'a point d'effet rétroactif."))
print(enc.decode([14772]))
print(enc.decode([14772, 2376]))
print(enc.decode([14772, 2376, 72]))
print(enc.decode([14772, 2376, 72, 497]))
print(enc.decode([14772, 2376, 72, 497, 34291]))
#print(enc.decode([14772, 2376, 72, 497, 34291, 8358, 12797, 300, 6, 4005, 343, 2162, 1288, 293, 299, 6, 64, 966, 288, 6, 14822, 316, 40560, 23528, 529, 361, 13]))

50257
[14772, 2376, 72, 497, 34291, 8358, 12797, 300, 6, 4005, 343, 2162, 1288, 293, 299, 6, 64, 966, 288, 6, 14822, 316, 40560, 23528, 529, 361, 13]
La
La lo
La loi
La loi ne
La loi ne dispose


In [8]:
print(text.data.shape)
print(text.data[:100])

torch.Size([1182082])
tensor([10, 10, 10,  0, 66, 56, 66, 58, 52, 22,  1, 26, 61, 51, 52,  1, 50, 56,
        68, 56, 58,  0, 51, 48, 66, 52, 22,  1, 14, 12, 14, 16, 10, 12, 13, 10,
        13, 17,  0, 10, 10, 10,  0,  0,  3,  3,  1, 42, 56, 66, 64, 52,  1, 62,
        64, 79, 58, 56, 59, 56, 60, 48, 56, 64, 52,  1, 22,  1, 27, 52,  1, 58,
        48,  1, 62, 67, 49, 58, 56, 50, 48, 66, 56, 61, 60,  9,  1, 51, 52, 65,
         1, 52, 53, 53, 52, 66, 65,  1, 52, 66])


## Contexte et exemples

In [9]:
context_size = 8
offset = 100
text.train_data[offset:offset+context_size+1] # 9 caractères, 8 exemples

tensor([ 1, 51, 52,  1, 58,  5, 48, 62, 62])

In [10]:
x = text.train_data[offset:offset+context_size]
y = text.train_data[offset+1:offset+context_size+1]
for t in range(context_size):
    context = x[:t+1]
    target = y[t]
    print(f"{context} -> {target}")


tensor([1]) -> 51
tensor([ 1, 51]) -> 52
tensor([ 1, 51, 52]) -> 1
tensor([ 1, 51, 52,  1]) -> 58
tensor([ 1, 51, 52,  1, 58]) -> 5
tensor([ 1, 51, 52,  1, 58,  5]) -> 48
tensor([ 1, 51, 52,  1, 58,  5, 48]) -> 62
tensor([ 1, 51, 52,  1, 58,  5, 48, 62]) -> 62


In [11]:
torch.manual_seed(1337)

def get_batch(text, split, batch_size, context_size, device=device):
    # generate a small batch of data of inputs x and targets y
    data = text.train_data if split == 'train' else text.val_data
    ix = torch.randint(len(data) - context_size, (batch_size,))
    x = torch.stack([data[i:i+context_size] for i in ix])
    y = torch.stack([data[i+1:i+context_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [12]:
batch_size = 4 # how many independent sequences will we process in parallel?
context_size = 8 # what is the maximum context length for predictions?

xb, yb = get_batch(text, 'train', batch_size, context_size)
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(context_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[66, 61, 67, 66, 52,  1, 50, 52],
        [52, 65,  1, 51, 64, 61, 56, 66],
        [ 0, 32, 58,  1, 65, 52, 64, 48],
        [60,  1, 50, 61, 60, 57, 61, 56]])
targets:
torch.Size([4, 8])
tensor([[61, 67, 66, 52,  1, 50, 52, 65],
        [65,  1, 51, 64, 61, 56, 66, 65],
        [32, 58,  1, 65, 52, 64, 48,  1],
        [ 1, 50, 61, 60, 57, 61, 56, 60]])
----
when input is [66] the target: 61
when input is [66, 61] the target: 67
when input is [66, 61, 67] the target: 66
when input is [66, 61, 67, 66] the target: 52
when input is [66, 61, 67, 66, 52] the target: 1
when input is [66, 61, 67, 66, 52, 1] the target: 50
when input is [66, 61, 67, 66, 52, 1, 50] the target: 52
when input is [66, 61, 67, 66, 52, 1, 50, 52] the target: 65
when input is [52] the target: 65
when input is [52, 65] the target: 1
when input is [52, 65, 1] the target: 51
when input is [52, 65, 1, 51] the target: 64
when input is [52, 65, 1, 51, 64] the target: 61
when input is [5

## Back to bigram

In [13]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):  # torch.nn.Module

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward
        
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [14]:
m = BigramLanguageModel(text.vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)
import math
print(-math.log(1/text.vocab_size)) # + entropy

torch.Size([32, 91])
tensor(5.1769, grad_fn=<NllLossBackward0>)
4.51085950651685


In [15]:
idx = torch.zeros((1, 1), dtype=torch.long)
print(text.decode(m.generate(idx, max_new_tokens=100)[0].tolist()))


p-;%—ÉâEôHL:SôP7tùi)tù—0i;TDRX(ySâVP5L,im6ià12Q(o€pRgUQL.CœuoCÉN°Rq—É–nQtœUe PW6SxmçVN.CëbùEFâ;y°ttù


Notre modèle n'est pour l'instant pas entraîné, donc le modèle renvoie n'importe quoi.

La partie générative de la classe `BigramLanguageModel` est pour l'instant trop "puissante", puisque l'on a un bigram qui n'a besoin que du caractère précédent: elle va nous permettre de bâtir un modèle plus puissant.

## Optimisation du modèle

In [16]:
def train(text, model, max_iters, batch_size, context_size, device):
    optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
    for steps in range(max_iters): # increase number of steps for good results...
        # sample a batch of data
        xb, yb = get_batch(text, 'train', batch_size, context_size, device)
    
        # evaluate the loss
        logits, loss = m(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    return loss

In [17]:
batch_size = 4
context_size = 8
max_iters = 20000
loss = train(text, m, max_iters, batch_size, context_size, device)
print(loss.item())


2.1984429359436035


In [18]:
idx = torch.zeros((1, 1), dtype=torch.long)
print(text.decode(m.generate(idx, max_new_tokens=300)[0].tolist()))



De êt pefoiden de ;x de nans fa de delaite lundérs doncl'uququrtiésit nitte ptis até, ce en'iora l'éconê, Lant.:5718Ré a s.
SïDêts 8Yél'uiou pe cou anat laiesusux l denté veu.
gais l palorontie aitené pe ese-1VNLes r lu equnasinde  fffr su det cesa mponté qule,


*


Lontranfoj18*


Laréeainn appèm


## Complexification du contexte: "Attention"

Transformer: faire en sorte que les tokens se "parlent".

In [19]:
@torch.no_grad()
def estimate_loss(text, model, eval_iters, batch_size, context_size, device='cpu'):
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(text, split, batch_size, context_size, device=device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [20]:
def train(text, model, max_iters, learning_rate, eval_iters, batch_size, context_size, device):
    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    
    for iter in range(max_iters):
    
        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0:
            losses = estimate_loss(text, model, eval_iters, batch_size, context_size, device=device)
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    
        # sample a batch of data
        xb, yb = get_batch(text, 'train', batch_size, context_size, device=device)
    
        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

In [21]:
model = BigramLanguageModel(text.vocab_size)
m = model.to(device)

In [22]:
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
context_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'
eval_iters = 200
# ------------
torch.manual_seed(1337);

In [23]:
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)

step 0: train loss 4.9269, val loss 4.9156
step 300: train loss 2.6870, val loss 2.7256
step 600: train loss 2.3753, val loss 2.4194
step 900: train loss 2.3190, val loss 2.3554
step 1200: train loss 2.2928, val loss 2.3295
step 1500: train loss 2.2888, val loss 2.3191
step 1800: train loss 2.2828, val loss 2.3131
step 2100: train loss 2.2786, val loss 2.3189
step 2400: train loss 2.2662, val loss 2.3109
step 2700: train loss 2.2724, val loss 2.3063


In [24]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(text.decode(m.generate(context, max_new_tokens=500)[0].tolist()))


Ile daios Les donnd'e à lau eutsusouerde. attre pu mis f sin atur heden serve es e lu u pr e 5**Ariontés.
Lomatêt dé dendone fabur pe à es Dans die cophés uniée à à dée lit décuêt dur dégantin'à ul crmpajurtitode dé ppprtsfocesuven lirorérs cut tratiesutintén-46° ll'agroncirt des nd'e lopacoun, d'eu d'aleuevaie sèsait tau Are Dunsin sureté le lontrior ortrs lus pra ne Fromes d'ut, chemémanu s de me, de aiass lufa ceimaies ppl'ionté estrgDW3***


Din i laus es qurionssat qu4*
29.I. utilidugienena


### Outils pour l'auto-attention

In [25]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2  # Batch, Time, Channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [26]:
# On voudrait que les 8 tokens se "parlent", mais uniquement en arrière
# Par exemple, avec la moyenne
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t, C)
        xbow[b,t] = torch.mean(xprev, 0)
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [27]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [28]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

True

In [29]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

In [30]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


## Vers GPT avec auto-attention

### Préparation: position et LM head

Ajoutons des couches pour nous permettre d'ajouter un block d'auto-attention.

In [52]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self, vocab_size, context_size, n_embd, device):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.device = device
        
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=self.device)) # (T,C)
        x = tok_emb + pos_emb
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [53]:
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
context_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'  # Mac M1,2,3 only
eval_iters = 200
n_embd = 32  # Dimension of embeddings !!

In [54]:
model = GPTModel(text.vocab_size, context_size, n_embd, device)
m = model.to(device)

In [55]:
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)

step 0: train loss 4.9023, val loss 4.8894
step 300: train loss 2.3329, val loss 2.3867
step 600: train loss 2.3138, val loss 2.3614
step 900: train loss 2.3109, val loss 2.3473
step 1200: train loss 2.2962, val loss 2.3424
step 1500: train loss 2.3085, val loss 2.3475
step 1800: train loss 2.2982, val loss 2.3407
step 2100: train loss 2.2952, val loss 2.3299
step 2400: train loss 2.2910, val loss 2.3381
step 2700: train loss 2.2837, val loss 2.3290


In [56]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(text.decode(m.generate(context, max_new_tokens=200)[0].tolist()))


*

Lande, va enesreivure epervifintiluide dont 116-262ullsiviun heres laurubioivomane cexpançare caurêmêmbs da aionép coiverempra ha cévendalaut mivourgemes. s. le 3 cor Daréqun'in chende 15-167647917


### Self-attention

In [57]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # Batch, Time, Channels (Channels=n_embd)
x = torch.randn(B,T,C)

# Simple average
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x
out.shape

torch.Size([4, 8, 32])

In [58]:
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [59]:
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

À ce stade, on ne voudrait pas que ceci soit uniforme:

```python
wei = torch.zeros((T,T))
```

car on veut donner plus d'importance à certains tokens. La manière dont l'auto-attention implémente ceci est que chaque token émette deux vecteurs: une clef (_key_) et une requête (_query_). Le vecteur requête représente ce que le token regarde. Le vecteur clef représente ce qu'il contient. L'affinité entre les tokens est calculé par le produit scalaire de ces vecteurs, qui devient `wei`.

In [60]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # Batch, Time, Channels (Channels=n_embd)
x = torch.randn(B,T,C)

#
# Self-Attention Mechanism
#
# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
v = value(x) # (B, T, 16)

# Communication !
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# (T,T) est la matrice "d'affinité"

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v

out.shape

torch.Size([4, 8, 16])

In [61]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- L'**Attention** est un **mécanisme de communication**: il peut être vu comme des noeuds (contenant des tokens) d'un graphe orienté qui se regardent les uns les autres, agrégeant de l'information en tant que sommes pondérées depuis tous les noeuds qui pointent vers eux, avec des poids dépendants des données
- Il n'y pas de notion d'espace, l'Attention agit simplement sur un ensemble de vecteurs. C'est pour cela qu'il faut encoder la position des tokens
- Chaque exemple d'un batch est traité indépendamment et ils ne se parlent pas
- Dans un Encodeur, on laisse les tokens communiquer aussi vers le futur. Le bloc conçu ici est un Décodeur.
- "Auto-attention" signifie que les clefs et valeurs sont produites à partir de la même source que les requêtes. Dans l'"attention-croisée" (_cross-attention_), les requêtes sont toujours produites par `x`, mais les clefs et les valeurs viennent d'une autre source, comme un module Encodeur.
- Normalisation: diviser `wei` par $1/\sqrt{\mathrm{head\_size}}$ (cf [V+2017]). Voir ci-dessous.


In [62]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [63]:
k.var()

tensor(1.0449)

In [64]:
q.var()

tensor(1.0700)

In [65]:
wei.var()

tensor(1.0918)

In [66]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [67]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

### Implémentation progressive

#### Single-head attention

In [68]:
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
context_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'  # Mac M1,2,3 only
eval_iters = 200
n_embd = 32
vocab_size = text.vocab_size
decode = text.decode

In [69]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(context_size, context_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [70]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.sa_head = Head(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.device = device
        self.context_size = context_size
        
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=self.device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_head(x) # one head of self-attention (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [71]:
model = GPTModel()
m = model.to(device)

In [72]:
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)

step 0: train loss 4.5783, val loss 4.5837
step 500: train loss 2.4978, val loss 2.5595
step 1000: train loss 2.3002, val loss 2.3516
step 1500: train loss 2.2241, val loss 2.2844
step 2000: train loss 2.1965, val loss 2.2500
step 2500: train loss 2.1885, val loss 2.2489
step 3000: train loss 2.1742, val loss 2.2525
step 3500: train loss 2.1616, val loss 2.2084
step 4000: train loss 2.1513, val loss 2.2243
step 4500: train loss 2.1454, val loss 2.2073


In [73]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))



Sannt ls éfa l'aiciéction prales ue 'u chocisatiprets, deu de les ceur son esseturitoiloun an enttu de de estrocioncgéndéser, laicte ut à concert dénisons mémponan nds sen se dil-ceu Sountérontérve d


#### Multi-head attention

In [75]:
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
context_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'  # Mac M1,2,3 only
eval_iters = 200
n_embd = 32
vocab_size = text.vocab_size
decode = text.decode

In [76]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out

In [77]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4)  # 4 heads of 8-dimension
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.device = device
        self.context_size = context_size
        
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=self.device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [78]:
model = GPTModel()
m = model.to(device)
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))

step 0: train loss 4.5015, val loss 4.5036
step 500: train loss 2.5075, val loss 2.5622
step 1000: train loss 2.3196, val loss 2.3909
step 1500: train loss 2.2145, val loss 2.2808
step 2000: train loss 2.1447, val loss 2.2172
step 2500: train loss 2.1093, val loss 2.1741
step 3000: train loss 2.0700, val loss 2.1399
step 3500: train loss 2.0437, val loss 2.1043
step 4000: train loss 2.0091, val loss 2.0973
step 4500: train loss 1.9969, val loss 2.0769

**Art. 16376 10
**Art.

**

Lée lauren oux de sons prés renffan dons plité. 99 Pons oux ouvirt doitanit esnde posue de avemine ;
M, doits rorde éponce jétareclod'eciartants chamansonvenvent, pour un g


#### Feed Forward

In [79]:
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

In [80]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4)  # 4 heads of 8-dimension
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.ffwd = FeedForward(n_embd)
        self.device = device
        self.context_size = context_size
        
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=self.device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        x = self.ffwd(x)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [81]:
model = GPTModel()
m = model.to(device)
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))

step 0: train loss 4.5644, val loss 4.5624
step 500: train loss 2.3969, val loss 2.4607
step 1000: train loss 2.2057, val loss 2.2809
step 1500: train loss 2.1100, val loss 2.1800
step 2000: train loss 2.0622, val loss 2.1328
step 2500: train loss 2.0287, val loss 2.0816
step 3000: train loss 1.9886, val loss 2.0675
step 3500: train loss 1.9664, val loss 2.0458
step 4000: train loss 1.9351, val loss 2.0276
step 4500: train loss 1.9268, val loss 2.0255

Dieu nation, enment de jugéancidi du tent ne le ;

L'acharicaire démanguer lêtres paut unon l'oun ches qu'en prempous dominon, pougemprésonner, à judisives diqui doifilie. 8848**

L'ectice la dusia no


#### Blocks

In [82]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

In [83]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.blocks = nn.Sequential(*[
             Block(n_embd, n_head=4),
             Block(n_embd, n_head=4),
             Block(n_embd, n_head=4),            
        ])
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [84]:
model = GPTModel()
m = model.to(device)
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))

step 0: train loss 4.4855, val loss 4.4854
step 500: train loss 2.9375, val loss 2.9788
step 1000: train loss 2.5330, val loss 2.5884
step 1500: train loss 2.3371, val loss 2.4013
step 2000: train loss 2.2316, val loss 2.2936
step 2500: train loss 2.1454, val loss 2.2204
step 3000: train loss 2.1100, val loss 2.1788
step 3500: train loss 2.0905, val loss 2.1477
step 4000: train loss 2.0412, val loss 2.1034
step 4500: train loss 2.0173, val loss 2.0829

**Art. 45**

**Art. 514**

2*Art qu'oux 2*

**Art. 542-**

Le mondacsresre dess prout camonment mente l'obiel don danve la less de ne loi la sommiuon, ant bie sesutiion ense soin, avonenlioniemenves t


#### Skip-connections

[_Deep Residual Learning for Image Recognition_](https://arxiv.org/abs/1512.03385)

In [85]:
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

In [86]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

In [87]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x

In [88]:
model = GPTModel()
m = model.to(device)
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))

step 0: train loss 4.9069, val loss 4.9024
step 500: train loss 2.1360, val loss 2.2217
step 1000: train loss 1.9586, val loss 2.0465
step 1500: train loss 1.8847, val loss 1.9680
step 2000: train loss 1.7959, val loss 1.8737
step 2500: train loss 1.7720, val loss 1.8521
step 3000: train loss 1.7342, val loss 1.8007
step 3500: train loss 1.7068, val loss 1.7813
step 4000: train loss 1.6620, val loss 1.7315
step 4500: train loss 1.6416, val loss 1.7391

Est du disponsente, les délabliques.

**Art. 255-3**

Les dévents.

**Art. 19187**

Le Lors menéd de la covoue.

**Art. 1831-3**

Le, appotioir d'un mesureque sociciqus édénéant suopils, nation seus d


#### LayerNorm

[LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)

"Pre-norm formulation": la normalisation est avant l'attention et avant le MLP (FeedForward) plutôt qu'après.


In [90]:
# Note: juste ici pour illustrer le fait que cela ressemble au BatchNorm que nous avions codé précédemment
class LayerNorm1d:

  def __init__(self, dim, eps=1e-5):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # batch mean
    xvar = x.var(1, keepdim=True) # batch variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

In [91]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [92]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.blocks = nn.Sequential(*[
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd), 
        ])
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [93]:
model = GPTModel()
m = model.to(device)
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size, device)
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))

step 0: train loss 4.7216, val loss 4.7150
step 500: train loss 2.1604, val loss 2.2379
step 1000: train loss 1.9662, val loss 2.0444
step 1500: train loss 1.8591, val loss 1.9343
step 2000: train loss 1.8022, val loss 1.8765
step 2500: train loss 1.7583, val loss 1.8362
step 3000: train loss 1.7159, val loss 1.7974
step 3500: train loss 1.7010, val loss 1.7782
step 4000: train loss 1.6622, val loss 1.7507
step 4500: train loss 1.6422, val loss 1.7089

43° 497-8, et à au qui et leu haGrientue ceulages.

Les bil un de caux prévients quie deministes réentre pour la des du batrimes de signepteir de la condait. Des ans les du moinient avivien'édampte es


#### Version finale: augmentation des hyperparamètres, Dropout (!! GPU nécessaire !!)

[_Dropout: a simple way to prevent neural network from overfitting_](https://dl.acm.org/doi/pdf/10.5555/2627435.2670313)

In [104]:
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
context_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'mps'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
torch.manual_seed(1337);

In [105]:
def get_batch(text, split, batch_size, context_size):
    # generate a small batch of data of inputs x and targets y
    data = text.train_data if split == 'train' else text.val_data
    ix = torch.randint(len(data) - context_size, (batch_size,))
    x = torch.stack([data[i:i+context_size] for i in ix])
    y = torch.stack([data[i+1:i+context_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [106]:
@torch.no_grad()
def estimate_loss(text, model, eval_iters, batch_size, context_size):
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(text, split, batch_size, context_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [115]:
def train(text, model, max_iters, learning_rate, eval_iters, batch_size, context_size):
    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    
    for iter in range(max_iters):
    
        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0:
            losses = estimate_loss(text, model, eval_iters, batch_size, context_size)
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    
        # sample a batch of data
        xb, yb = get_batch(text, 'train', batch_size, context_size)
    
        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

In [108]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(context_size, context_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [99]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [109]:
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


In [110]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [111]:
class GPTModel(nn.Module):  # torch.nn.Module

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward

        B, T = idx.shape
        
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [113]:
model = GPTModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

10.808923 M parameters


In [116]:
loss = train(text, m, max_iters, learning_rate, eval_iters, batch_size, context_size)

step 0: train loss 4.6449, val loss 4.6463
step 500: train loss 1.4024, val loss 1.4860
step 1000: train loss 0.9239, val loss 1.0439
step 1500: train loss 0.8030, val loss 0.9661
step 2000: train loss 0.7291, val loss 0.9369
step 2500: train loss 0.6661, val loss 0.9382
step 3000: train loss 0.6086, val loss 0.9333
step 3500: train loss 0.5604, val loss 0.9432
step 4000: train loss 0.5177, val loss 0.9492
step 4500: train loss 0.4753, val loss 0.9694


In [119]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=5000)[0].tolist()))


#*Art. 302-1**

Les actes de procédure par le consentement sous stipulant une personne périeure en jusqu'à concurrence de l'action, on est nulle.

**Art. 376**

Avant le terme convenu lorsque le bâtiment a été fait à celui qui le présente.

Le conjoint subsistant ou le champ participe du président de la écolonte un majeure ne peut être hors les cas exécutoires à concurrence de sa vue à son éducation.

###### Section 3 : La nouveillance des parts jusqu'au vivant des deuxième degrés à l'adoption de la communauté.

**Art. 1495**

L'erreur par deux ou les héritiers de la part de la succession, les mineurs ne peut compromettre du pas.

##### Sous-section 2 : De la preuve par tous les actes accomplis pour lesquels l'adoption ne peut être opposable que par ce mode et ne plus prescription.

**Art. 373-2**

En cas de décès de l'enfant est ordonnée par le divorce : soit entre l'autre parent ne faisant pas le droit et l'appauvrien du pronouvel, sans ce dernier n'est pas non plus placé pendant sa

## Remarques finales

### Traduction

Traduction du français à l'anglais:

```
# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>
```

Voir papier [Transformer](https://arxiv.org/abs/1706.03762).


### Size matters

Voir Table 2.1 page 8 du papier [GPT3](https://arxiv.org/abs/2005.14165)

### Fine-tuning (towards chatGPT)

Voir le fine tuning sur le blog [chatGPT](https://openai.com/blog/chatgpt)