Bigram

1 Load Raw Text

words = open('names.txt').read().splitlines()

The file holds one name per line. Read the whole file and split it on newlines.

raw file: "emma\nolivia\nava\n..."

↓ .splitlines()

list of 32,032 words: ["emma", "olivia", "ava", ...]
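A minimal sketch of this step, using an in-memory string in place of names.txt so that it runs without the file. The sample contents are an assumption for illustration only.

```python
# Stand-in for open('names.txt').read(); the sample contents are illustrative.
raw = "emma\nolivia\nava\n"

# splitlines() splits on line boundaries and drops the newline characters,
# so the trailing newline does not produce an empty final entry.
words = raw.splitlines()

print(words)       # ['emma', 'olivia', 'ava']
print(len(words))  # 3
```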


2 Build the Alphabet & Index Map

chars = sorted(set(''.join(words)))
stoi  = {ch: i for i, ch in enumerate(chars)}
stoi['.'] = 26

Extract the unique characters, sort them, and assign each an integer. The '.' character is the boundary token (index 26).

all text: "emmaoliv..."

↓ set → sort

26 unique: a b c ... z

↓ + "."

27 total: a..z + .

Index mapping (character ↔ integer): a ↔ 0, b ↔ 1, ..., z ↔ 25, . ↔ 26
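The same construction can be tried on a small sample. In this sketch the word list is illustrative, and itos is a hypothetical inverse map that is not part of the original code. Because the sample contains fewer than 26 distinct letters, the boundary index is written as len(chars), which equals 26 on the full dataset.

```python
words = ["emma", "olivia", "ava"]  # illustrative sample, not the full dataset

# Unique characters in sorted order.
chars = sorted(set(''.join(words)))
stoi = {ch: i for i, ch in enumerate(chars)}

# On the full dataset len(chars) == 26, so this matches stoi['.'] = 26.
stoi['.'] = len(chars)

# Hypothetical inverse map (integer -> character), useful for decoding.
itos = {i: ch for ch, i in stoi.items()}

print(chars)      # ['a', 'e', 'i', 'l', 'm', 'o', 'v']
print(stoi['.'])  # 7
```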

3 Add Boundary Tokens

chars = ['.'] + list(word) + ['.']

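Wrap each word with '.' on both sides so the model learns which letters start and end names. Note that chars here is the token list for a single word, not the alphabet from Step 2.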

Example word: emma

Original: e m m a

↓ prepend "." + append "."

With boundaries: . e m m a .

4 Extract Bigram Pairs

xs, ys = [], []  # input and target indices
for x, y in zip(chars, chars[1:]):  # adjacent character pairs
    xs.append(stoi[x])
    ys.append(stoi[y])

Slide a 2-character window across the list. Each position yields one (input, target) pair.

Sliding window over "emma" (with boundaries: . e m m a .):

All extracted pairs: (., e), (e, m), (m, m), (m, a), (a, .)
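The sliding window can be sketched with zip alone. The variable name tokens is used here (an assumption, to avoid clashing with the alphabet list from Step 2).

```python
word = "emma"

# Wrap the word in boundary tokens, as in Step 3.
tokens = ['.'] + list(word) + ['.']

# zip stops at the shorter sequence, so pairing the list with itself
# shifted by one yields every adjacent (input, target) pair.
pairs = list(zip(tokens, tokens[1:]))

print(pairs)
# [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```

A word of n letters always yields n + 1 pairs, because both boundary tokens take part in one pair each.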

5 Convert Characters → Integers

Look up each character in the index map from Step 2.

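For "emma":

#   Input → Index   Target → Index
1   .     → 26      e      → 4
2   e     → 4       m      → 12
3   m     → 12      m      → 12
4   m     → 12      a      → 0
5   a     → 0       .      → 26

xs (input indices): [26, 4, 12, 12, 0]

ys (target indices): [4, 12, 12, 0, 26]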
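Putting Steps 3 to 5 together over several words gives the full training set. This is a sketch that assumes the alphabet is exactly a to z (the 26 unique characters reported in Step 2) and uses an illustrative two-word list.

```python
import string

# Assumes the alphabet is exactly a-z, as reported in Step 2.
stoi = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
stoi['.'] = 26

words = ["emma", "ava"]  # illustrative sample

xs, ys = [], []
for word in words:
    tokens = ['.'] + list(word) + ['.']
    for x, y in zip(tokens, tokens[1:]):
        xs.append(stoi[x])  # input character index
        ys.append(stoi[y])  # target character index

print(xs)  # [26, 4, 12, 12, 0, 26, 0, 21, 0]
print(ys)  # [4, 12, 12, 0, 26, 0, 21, 0, 26]
```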

6 One-Hot Encode the Inputs

x_encoded = F.one_hot(xs, num_classes=27).float()

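Each integer becomes a 27-element vector: all zeros except a 1 at that character's index. F is torch.nn.functional, and xs must be an integer tensor (for example torch.tensor(xs)), not a Python list.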

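To make the encoding concrete, here is a plain-Python sketch of what one-hot encoding produces. The one_hot helper below is a hypothetical stand-in written for illustration; it is not the PyTorch function.

```python
def one_hot(indices, num_classes):
    """Return one row per index: all 0.0 except a 1.0 at that index."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

xs = [26, 4, 12, 12, 0]  # input indices for "emma"
x_encoded = one_hot(xs, num_classes=27)

print(len(x_encoded), len(x_encoded[0]))  # 5 27
print(x_encoded[1].index(1.0))            # 4
```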

7 Forward Pass

Trace the math for one pair through the full forward pass.


7a Logits = one_hot × W

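Multiplying a one-hot vector by W selects a single row of W: the row at the input character's index. That row holds the 27 logits, one per possible next character.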
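The row-selection effect can be checked directly. This sketch assumes random Gaussian initial weights and an arbitrary seed, since the original initialization is not shown.

```python
import random

random.seed(0)  # arbitrary seed, only to make the sketch reproducible
n = 27

# Hypothetical random initial weights; the original initialization is not shown.
W = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

ix = 4  # index of 'e'
x = [0.0] * n
x[ix] = 1.0

# Vector-matrix product: logits[j] = sum over i of x[i] * W[i][j].
logits = [sum(x[i] * W[i][j] for i in range(n)) for j in range(n)]

# Every term with x[i] == 0 vanishes, leaving exactly row ix of W.
print(logits == W[ix])  # True
```

In practice this means the multiplication can be replaced by a plain row lookup, W[ix], with the same result.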

7b Counts = exp(logits)

Exponentiate every logit so that all values become positive. A negative logit gives a count below 1, a positive logit gives a count above 1, and a logit of 0 gives exactly 1.

7c Probs = counts ÷ sum(counts)

Divide each count by the total so the values sum to 1. Exponentiating and then normalizing is the softmax operation.

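Steps 7b and 7c can be sketched in a few lines of plain Python. The three logit values are illustrative and do not come from a trained model.

```python
import math

logits = [2.0, -1.0, 0.0]  # illustrative values, not from a trained model

counts = [math.exp(l) for l in logits]  # all strictly positive
total = sum(counts)
probs = [c / total for c in counts]     # softmax(logits)

print([round(p, 3) for p in probs])  # [0.844, 0.042, 0.114]
```

The ordering of the logits is preserved: the largest logit receives the largest probability.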

7d Loss = −log(prob of correct answer)

How surprised is the model? Lower probability → higher loss.
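A short sketch of the loss for a single example. The probability values and the nll helper name are illustrative assumptions.

```python
import math

probs = [0.7, 0.2, 0.1]  # illustrative distribution over three classes

def nll(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

print(round(nll(probs, 0), 3))  # 0.357
print(round(nll(probs, 2), 3))  # 2.303
```

As a reference point, a model that assigns probability 1/27 to every character has a loss of log(27) ≈ 3.30 on every example.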

8 Backward Pass & Weight Update

W.grad = None
loss.backward()
W.data += -0.001 * W.grad

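Compute the gradient of the loss with respect to each of the 729 (27 × 27) weights, then nudge each weight in the direction that reduces the loss.

grad > 0: increasing the weight raises the loss → decrease the weight · grad < 0: increasing the weight lowers the loss → increase the weight · lr = 0.001: tiny steps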
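For this model the gradient has a simple closed form, which makes the update easy to trace by hand. The sketch below handles a single training pair in plain Python, assuming zero-initialized weights (a choice made for the sketch, not taken from the original). It computes the gradient manually rather than through automatic differentiation.

```python
import math

def softmax(logits):
    counts = [math.exp(l) for l in logits]
    total = sum(counts)
    return [c / total for c in counts]

n, lr = 27, 0.001
W = [[0.0] * n for _ in range(n)]  # hypothetical zero init for the sketch
ix, iy = 4, 12                     # one training pair: 'e' -> 'm'

probs = softmax(W[ix])
loss_before = -math.log(probs[iy])

# Gradient of the loss w.r.t. row ix: probs minus one-hot(target).
# All other rows of W have zero gradient for this single example.
grad_row = [p - (1.0 if j == iy else 0.0) for j, p in enumerate(probs)]

# Gradient descent step, mirroring W.data += -lr * W.grad.
W[ix] = [w - lr * g for w, g in zip(W[ix], grad_row)]

loss_after = -math.log(softmax(W[ix])[iy])
print(loss_before, loss_after)
```

Only the target's weight has a negative gradient, so it is the only one that increases; every other weight in the row decreases slightly.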

9 Training Loop

Repeat steps 7–8 for 150 epochs over all examples.

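The whole loop can be sketched end to end in plain Python on a tiny dataset. Several choices here are assumptions made for the sketch: the three-word sample, zero-initialized weights, a loss averaged over all examples, and a learning rate of 1.0 (larger than the 0.001 above, so that progress is visible on so little data).

```python
import math
import string

def softmax(logits):
    counts = [math.exp(l) for l in logits]
    total = sum(counts)
    return [c / total for c in counts]

stoi = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
stoi['.'] = 26

words = ["emma", "olivia", "ava"]  # illustrative sample
xs, ys = [], []
for word in words:
    tokens = ['.'] + list(word) + ['.']
    for x, y in zip(tokens, tokens[1:]):
        xs.append(stoi[x])
        ys.append(stoi[y])

n, lr, epochs = 27, 1.0, 150       # lr chosen for this toy, not 0.001
W = [[0.0] * n for _ in range(n)]  # hypothetical zero init
losses = []

for epoch in range(epochs):
    grad = [[0.0] * n for _ in range(n)]
    loss = 0.0
    # Forward pass and gradient accumulation over all examples.
    for ix, iy in zip(xs, ys):
        probs = softmax(W[ix])
        loss += -math.log(probs[iy]) / len(xs)
        for j in range(n):
            grad[ix][j] += (probs[j] - (1.0 if j == iy else 0.0)) / len(xs)
    losses.append(loss)
    # Move every weight against its gradient.
    for i in range(n):
        for j in range(n):
            W[i][j] -= lr * grad[i][j]

print(losses[0], losses[-1])
```

The first recorded loss equals log(27), the loss of a uniform guess, and later epochs move below it as the weights adapt to the observed pairs.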
