Build Your LLM — Interactive AI Learning Lab

Choose Your Use Case

Your LLM learns from text in this domain. The corpus shapes everything — vocabulary, grammar, and what the model "knows". Pick what excites you.

Training Data

This is the corpus your LLM will learn from. Every word, every pattern — the model internalises all of it. Garbage in, garbage out.

Use my own training text

💡 Real LLMs train on trillions of tokens from books, websites, and code. Our mini-LLM trains on a few hundred words — enough to demonstrate the same concepts at human-readable scale.

Tokenization

Text must be broken into tokens before training. The tokenizer converts raw text into the sequence of units the model processes. Choose your strategy:

🔤 Word-Level

Each word = one token. Larger vocabulary, coherent readable output.

🔡 Character-Level

Each character = one token. Tiny vocabulary, experimental output.
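
To make the two strategies concrete, here is a minimal sketch of each. The function names and the regular expression are illustrative assumptions, not the lab's actual tokenizer; lowercasing in the word-level version is also an assumption.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Word-level (assumed rules): lowercase, then split into words and standalone punctuation."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

def char_tokenize(text: str) -> list[str]:
    """Character-level: every character, including spaces, is one token."""
    return list(text)

sample = "The cat sat. The cat ran."
words = word_tokenize(sample)
chars = char_tokenize(sample)

# Token count vs. vocabulary size (unique tokens) for each strategy.
print("word-level:", len(words), "tokens,", len(set(words)), "unique")
print("char-level:", len(chars), "tokens,", len(set(chars)), "unique")
```

On a real corpus the word-level vocabulary grows with every new word, while the character-level vocabulary stays bounded by the alphabet and punctuation in use. That is the trade-off the two cards describe.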

Live Tokenization Preview

Preview Tokens: 9

Unique in Preview: 9

Corpus Tokens: not yet computed

💡 Real LLMs typically use sub-word tokenizers such as Byte Pair Encoding (BPE), which starts from characters or bytes and repeatedly merges the most frequent adjacent pair of symbols into a new sub-word unit. GPT-4's tokenizer has a vocabulary of roughly 100,000 tokens. Our n-gram model uses word or character tokens to keep the math readable.
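
The BPE merge loop can be sketched in a few lines. This is a toy illustration of the idea, not the tokenizer any production model ships; the word counts in `corpus` are hypothetical, and the helper names are made up for this example.

```python
from collections import Counter

def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Count every adjacent symbol pair, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]) -> dict[tuple[str, ...], int]:
    """Replace each occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word counts; each word starts as a tuple of single characters.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12, tuple("bun"): 4, tuple("hugs"): 5}

merges = []
for _ in range(3):  # three merge steps are enough to see sub-words form
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(list(corpus))
```

Each merge adds one entry to the vocabulary, so the number of merge steps directly sets the final vocabulary size.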

Configure Your Model

These hyperparameters control how your LLM learns. Experiment to see how they affect training and output quality.

Context Window (n-gram order): 3-gram (trigram)

Looks at the 2 previous tokens to predict the next. Good balance of coherence and variety.
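
Here is a minimal sketch of what "looking at the 2 previous tokens" means in practice, assuming a plain count-based trigram model with no smoothing. The function names and the sample sentence are illustrative, not taken from the lab's code.

```python
from collections import Counter, defaultdict

def train_ngram(tokens: list[str], n: int = 3) -> dict:
    """Map each (n-1)-token context to a Counter of the tokens that followed it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_token_probs(counts: dict, context: list[str]) -> dict[str, float]:
    """Turn the follower counts for one context into probabilities."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}  # unseen context: this sketch has no smoothing or back-off
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

tokens = "the cat sat on the mat and the cat ran".split()
model = train_ngram(tokens, n=3)
probs = next_token_probs(model, ["the", "cat"])
print(probs)
```

A higher order gives more specific contexts and more coherent output, but each context is seen less often, so the model repeats the corpus more closely.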

Training Epochs: 40

More passes through the training data give a smoother loss curve. An n-gram model effectively converges in the first pass; the extra epochs are there so you can watch the loss decay clearly.

Learning Rate: Balanced (0.001)

Balanced gives stable convergence. In transformer LLMs, AdamW with cosine learning-rate decay is a common choice, with peak rates around 3×10⁻⁴ and training runs of 300B+ tokens.
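
For a feel of what cosine decay looks like, here is a small sketch of such a schedule. The linear warmup, the `warmup_steps` value, and the `min_lr` floor are assumptions added for illustration; only the cosine shape and the 3×10⁻⁴ peak come from the note above.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
              warmup_steps: int = 100, min_lr: float = 3e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 99, 100, 550, 1000):
    print(step, f"{cosine_lr(step, total):.2e}")
```

The rate rises quickly, peaks once warmup ends, and then falls smoothly, so the largest updates happen early and the final steps make only small adjustments.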

💡 In production LLMs, these settings are vastly more complex: dynamic LR schedules, gradient clipping, mixed precision training, and distributed training across thousands of GPUs. Our n-gram model exposes the same concepts in milliseconds.

Model Summary

Training Your LLM

Ready to train…

Reported after training: Vocab Size, Corpus Tokens, N-gram Contexts, Perplexity
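
Perplexity is conventionally the exponential of the average negative log-probability the model assigns to each actual next token. The sketch below assumes that standard definition; the lab's exact calculation (for example, how it handles unseen contexts) is not shown here, and the function name is illustrative.

```python
import math

def perplexity(probabilities: list[float]) -> float:
    """exp(mean negative log-probability) over the probabilities given to the true next tokens."""
    nll = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(nll)

# A model that always spreads its guess evenly over 4 options has perplexity about 4.
print(perplexity([0.25] * 8))

# A model that is always certain and always right has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))
```

Lower is better: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.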

Generate Text

Your LLM is trained. Give it a prompt and watch it predict the next token, one at a time, using what it learned from the training corpus.

Temperature: 0.8

🧊 Conservative · ⚖️ Balanced · 🔥 Creative

Max Tokens: 40
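
Here is a hedged sketch of how temperature and the token limit might shape generation. It assumes the model is a table from 2-token contexts to next-token probabilities; `table`, `apply_temperature`, and `generate` are hypothetical names, and the probabilities are made up for the example.

```python
import math
import random

def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale a distribution as p^(1/T), then renormalise. T < 1 sharpens, T > 1 flattens."""
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

def generate(model_probs: dict, prompt: list[str], max_tokens: int = 40,
             temperature: float = 0.8, order: int = 3, seed: int = 0) -> list[str]:
    """Sample one token at a time until max_tokens is reached or the context is unseen."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(max_tokens):
        context = tuple(out[-(order - 1):])  # assumes order >= 2
        probs = model_probs.get(context)
        if not probs:
            break  # no known continuation for this context
        adjusted = apply_temperature(probs, temperature)
        out.append(rng.choices(list(adjusted), weights=list(adjusted.values()))[0])
    return out

# Hypothetical trigram table: context -> next-token probabilities.
table = {
    ("the", "cat"): {"sat": 0.75, "ran": 0.25},
    ("cat", "sat"): {"down": 1.0},
}

cold = apply_temperature(table[("the", "cat")], 0.5)  # conservative: favours "sat" more strongly
hot = apply_temperature(table[("the", "cat")], 2.0)   # creative: closer to an even split
print(cold, hot)
print(generate(table, ["the", "cat"], max_tokens=5))
```

At low temperature the most likely token wins almost every time; at high temperature rarer tokens get picked more often, which adds variety at the cost of coherence.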