Chapters: 🎯 Use Case · 📄 Data · ✂️ Tokenize · ⚙️ Configure · ⚡ Train · ✨ Generate
Choose Your Use Case
Your LLM learns from text in this domain. The corpus shapes everything — vocabulary, grammar, and what the model "knows". Pick what excites you.
Training Data
This is the corpus your LLM will learn from. Every word, every pattern — the model internalises all of it. Garbage in, garbage out.
💡 Real LLMs train on trillions of tokens from books, websites, and code. Our mini-LLM trains on a few hundred words — enough to demonstrate the same concepts at human-readable scale.
Tokenization
Text must be broken into tokens before training. The tokenizer converts raw text into the sequence of units the model processes. Choose your strategy:
🔤 Word-Level: Each word = one token. Larger vocabulary, coherent readable output.
🔡 Character-Level: Each character = one token. Tiny vocabulary, experimental output.
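To make the two strategies concrete, here is a minimal Python sketch of both. The `tokenize` function and its lowercasing and punctuation rules are assumptions made for illustration; the tokenizer behind this page may split text differently.

```python
import re

def tokenize(text, level="word"):
    """Split text into tokens at word level or character level (illustrative rules)."""
    if level == "word":
        # Assumed rule: lowercase, keep runs of letters/digits/apostrophes as words,
        # and treat every other non-space character as its own token.
        return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())
    if level == "char":
        # Every character, including spaces and punctuation, is one token.
        return list(text)
    raise ValueError(f"unknown level: {level!r}")

sample = "The cat sat on the mat."
word_tokens = tokenize(sample, "word")
char_tokens = tokenize(sample, "char")

print(word_tokens)
print("word tokens:", len(word_tokens), "unique:", len(set(word_tokens)))
print("char tokens:", len(char_tokens), "unique:", len(set(char_tokens)))
```

On a larger corpus the word-level vocabulary keeps growing with every new word, while the character-level vocabulary stays capped at the number of distinct characters.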
Live Tokenization Preview
Preview Tokens: 9
Unique in Preview: 9
Corpus Tokens: —
💡 Many real LLMs use Byte Pair Encoding (BPE), which repeatedly merges the most frequent adjacent pair of symbols (starting from single characters or bytes) into sub-word units. GPT-4's tokenizer has ~100,000 tokens. Our n-gram model uses word or character tokens to keep the math readable.
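The merge loop behind BPE can be sketched in a few lines of Python. This is a toy, character-based version written for illustration: the function names, the tiny word list, and the tie-breaking rule are all assumptions, and production tokenizers work on bytes with extra pre-processing steps.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there are no pairs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Assumed tie-break: highest count first, then the lexicographically largest pair.
    return max(pairs, key=lambda p: (pairs[p], p))

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_words, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = dict(Counter(tuple(w) for w in corpus_words))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, vocab = learn_bpe(["low", "low", "lower", "lowest", "newer", "newest"], num_merges=3)
print(merges)
print(vocab)
```

Each merge adds one new symbol to the vocabulary, so the number of merges is the knob that sets vocabulary size.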
Configure Your Model
These hyperparameters control how your LLM learns. Experiment to see how they affect training and output quality.
Context Window (n-gram order): 3-gram (trigram)
Looks at the 2 previous tokens to predict the next. Good balance of coherence and variety.
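As a rough illustration of what "looking at the 2 previous tokens" means, the sketch below counts which token follows each two-token context and turns the counts into probabilities. The helper names `train_ngram` and `next_token_probs` and the example sentence are hypothetical; the page's own model may store its counts differently.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Map each (n-1)-token context to a Counter of the tokens that followed it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_token_probs(counts, context):
    """Relative frequency of each follower of `context` (empty dict if unseen)."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

tokens = "the cat sat on the mat and the cat ran".split()
model = train_ngram(tokens, n=3)
probs = next_token_probs(model, ["the", "cat"])

print(len(model), "distinct contexts")
print(probs)
```

A higher order gives each context more information but makes unseen contexts more common, which is the coherence versus variety trade-off described above.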
Training Epochs: 40
More passes through the training data give a smoother loss curve. For n-gram models, the model converges in the first pass, so the additional epochs mainly let you see the loss decay clearly.
Learning Rate: Balanced (0.001)
Balanced gives stable convergence. In transformer LLMs, AdamW with cosine learning-rate decay is standard, with peak rates of ~3×10⁻⁴ over training runs of 300B+ tokens.
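For readers curious what a cosine schedule looks like, here is a small sketch. Only the cosine shape and the ~3×10⁻⁴ peak come from the text above; the linear warmup, the `min_lr` floor, and the step counts are assumptions added for illustration.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=100):
    """Learning rate at `step`: linear warmup (assumed), then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 99, 100, 550, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.6f}")
```

The rate rises quickly, peaks once warmup ends, and then eases down so that late updates make smaller adjustments.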
💡 In production LLMs, these settings are vastly more complex: dynamic LR schedules, gradient clipping, mixed precision training, and distributed training across thousands of GPUs. Our n-gram model exposes the same concepts in milliseconds.
Model Summary
Training Your LLM
Ready to train…
Vocab Size: —
Corpus Tokens: —
N-gram Contexts: —
Perplexity: —
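Perplexity is the exponential of the average negative log-probability the model assigns to each token it predicts; lower is better, and 1.0 would mean every prediction was certain. The sketch below computes it for an unsmoothed n-gram model scored on its own training tokens. That setup, and the `perplexity` helper itself, are assumptions for illustration and may not match how the page computes its figure.

```python
import math
from collections import Counter, defaultdict

def perplexity(tokens, n=3):
    """Perplexity of an unsmoothed n-gram model evaluated on its own training tokens."""
    if len(tokens) < n:
        raise ValueError("need at least n tokens to make one prediction")
    # Count followers for every (n-1)-token context.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
    # Sum the log-probability of each observed next token.
    log_prob_sum = 0.0
    num_predictions = len(tokens) - n + 1
    for i in range(num_predictions):
        followers = counts[tuple(tokens[i:i + n - 1])]
        p = followers[tokens[i + n - 1]] / sum(followers.values())
        log_prob_sum += math.log(p)
    return math.exp(-log_prob_sum / num_predictions)

tokens = "the cat sat on the mat and the cat ran".split()
ppl = perplexity(tokens, n=3)
print(round(ppl, 4))
```

Scoring on the training text flatters the model; on unseen text an unsmoothed model would hit zero-probability contexts, which is why practical n-gram models add smoothing.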
Generate Text
Your LLM is trained. Give it a prompt and watch it predict the next token, one at a time, using what it learned from the training corpus.
Temperature: 0.8
🧊 Conservative · ⚖️ Balanced · 🔥 Creative
Max Tokens: 40
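The generation loop and the effect of temperature can be sketched as follows. This is one plausible way to apply temperature to n-gram counts (raising each count to the power 1/T before sampling); the function names, the fixed seed, and the stop-on-unseen-context rule are assumptions, not a description of this page's code.

```python
import math
import random
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Map each (n-1)-token context to a Counter of the tokens that followed it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
    return counts

def sample_with_temperature(followers, temperature, rng):
    """Sample a follower with probability proportional to count ** (1 / temperature)."""
    candidates = list(followers)
    logits = [math.log(followers[t]) / temperature for t in candidates]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(counts, prompt, max_tokens=40, temperature=0.8, n=3, seed=0):
    """Extend `prompt` one token at a time until max_tokens or an unseen context."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(max_tokens):
        context = tuple(out[len(out) - (n - 1):])
        followers = counts.get(context)
        if not followers:
            break  # assumed behaviour: stop when the context was never seen
        out.append(sample_with_temperature(followers, temperature, rng))
    return out

tokens = "the cat sat on the mat and the cat ran".split()
model = train_ngram(tokens, n=3)
result = generate(model, ["the", "cat"], max_tokens=40, temperature=0.8)
print(" ".join(result))
```

Low temperatures sharpen the distribution toward the most frequent follower (🧊 Conservative), while high temperatures flatten it so rarer followers are picked more often (🔥 Creative).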
Output