Build Your LLM — Interactive AI Learning Lab

Free browser-based tool to build your own Large Language Model from scratch — no API keys required. Learn tokenization, n-gram probability models, temperature sampling, perplexity, self-attention, and text generation through 10 interactive chapters and a 6-step guided LLM wizard.


About Build Your LLM

This free browser-based tool teaches you how to build a Large Language Model from scratch, with no API keys, no cloud compute, and no accounts needed. All computation runs in your browser. You work through 10 interactive chapters covering the main layers of LLM architecture, then use the 6-step wizard to train your own model on one of six domains: Fairy Tales, Movies, Cooking, Sports, Tech Docs, or Poetry. Under the hood, the wizard implements the statistical ideas that modern LLMs such as GPT, Claude, and LLaMA build on (tokenization, vocabulary construction, next-token probability distributions estimated here with n-grams and Laplace smoothing, temperature-scaled sampling, and perplexity measurement), so you can see what your model is doing at every step.

What you'll learn

• Tokenization: split raw text into tokens using word-level, character-level, or simulated Byte Pair Encoding (BPE). See live token counts and vocabulary size.
• N-gram models: build bigram, trigram, and 4-gram language models. Count co-occurrences, apply Laplace smoothing to handle unseen n-grams, and compute conditional probability tables.
• Training: animate the loss curve as your model converges over 10–100 epochs. Watch perplexity drop in real time from ~vocabulary-size toward convergence.
• Temperature sampling: control creativity vs predictability. Divide logits by T before softmax: T < 1 makes the model conservative, T > 1 makes it creative and unpredictable.
• Self-attention: visualize Q/K/V attention weights as a heatmap. See which words attend to which other words in a sentence.
• Word embeddings: see a 2D scatter plot of token positions derived from co-occurrence vectors. Semantically similar words cluster together.
• Text generation: stream text token by token using autoregressive sampling, with a live chart of top-k next-token probabilities.
• Fine-tuning: compare a base model against a domain-fine-tuned version. See how training on domain text shifts the generated style.
• Architecture: explore the full transformer pipeline: embedding layer → positional encoding → multi-head attention → feed-forward → layer norm → output softmax.
• Prompting: build prompt templates for zero-shot, few-shot, role, and chain-of-thought styles.

10 interactive chapters

Chapter 1. How LLMs Predict: Click word suggestions to see how probability drives next-token prediction. "Once upon a" → time / day / night / while / bright.
Chapter 2. Training Corpus: Explore your full training text. See word frequency counts, most common bigrams, and corpus statistics.
Chapter 3. Tokenization: Toggle between word, character, and BPE tokenizers. See each token highlighted and vocabulary size change live.
Chapter 4. Word Embeddings: Watch a 2D scatter plot build from co-occurrence counts. Similar words cluster in embedding space.
Chapter 5. Self-Attention: Interact with an attention heatmap on canvas. Q·Kᵀ / √dₖ weights show how tokens relate to each other.
Chapter 6. Transformer Architecture: Step through the full pipeline: embedding → positional encoding → multi-head attention → feed-forward → layer norm → softmax.
Chapter 7. Training Loss: Animate the loss curve with requestAnimationFrame. Exponential decay models the convergence shape of real neural network training.
Chapter 8. Temperature: Drag the temperature slider and watch the next-token probability distribution reshape in real time.
Chapter 9. Fine-Tuning: Train a base model, then fine-tune on a new domain. Side-by-side comparison shows the shift in style and vocabulary.
Chapter 10. Prompting: Build zero-shot, few-shot, role, and chain-of-thought prompt templates. See how context window affects output.

6-step build wizard

Step 1. Use Case: Pick from Fairy Tales, Movies, Cooking, Sports, Tech Docs, or Poetry. Each comes with ~200 words of training text and a unique accent color.
Step 2. Training Data: Preview your full corpus. See token count, unique token count, and average word frequency before training.
Step 3. Tokenize: Choose word-level or character-level tokenization. See the vocabulary list and token count side by side.
Step 4. Configure: Set n-gram size (bigram/trigram/4-gram), epochs (10–100), and temperature (0.1–2.0). Larger n-grams capture longer context but need more data.
Step 5. Train: Watch the animated loss curve drop as your model trains. The JavaScript engine builds the full n-gram probability table with Laplace smoothing, then calculates perplexity at each epoch checkpoint.
Step 6. Generate: Type a prompt, hit Generate, and watch your LLM stream text token by token. Adjust temperature on the fly. See the top-5 next-token candidates with probability bars.

The LLM engine — how it works

All computation runs in pure JavaScript — no backend, no API, no WebAssembly. Here's what the engine does:

1. Tokenize: Split corpus into tokens (words or characters). Build vocabulary = unique sorted tokens.
2. Count n-grams: Slide a window of size n across the token list and count every (context → next_token) pair.
3. Laplace smoothing: Add 1 to every count, including pairs never seen in training, so that no token ends up with zero probability. This is Add-1 (Laplace) smoothing, the simplest form of probability smoothing.
4. Probability table: Divide each smoothed count by its smoothed context total to get P(next_token | context) = (count + 1) / (context_total + V), where V is the vocabulary size. Store as a lookup map.
5. Perplexity: For each token in the corpus, look up its probability given the preceding context. Average the negative log probabilities, then exponentiate: PP = exp(−(1/N) × Σ log P(wᵢ | context)).
6. Temperature sampling: Given a context, retrieve candidate probabilities. Take their logs as logits, divide by T, run softmax to renormalize, then sample from the resulting distribution.
7. Backoff: If the n-gram context has never been seen, back off to the (n-1)-gram context, and so on down to bigram and finally unigram frequency. This is a simplified form of the backoff idea used by schemes such as Stupid Backoff and Katz backoff in classical n-gram systems.
8. Autoregressive generation: Call nextToken() repeatedly, appending each predicted token to the context window, then sampling the next one from that updated context.
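
The first four steps above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the function names (tokenize, countNgrams, smoothedProb) and the toy corpus are assumptions made for the example.

```javascript
// Hypothetical helper names; the tool's real implementation is not shown in this document.

// Step 1: word-level tokenization (lowercase, split on whitespace).
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Step 2: slide a window of size n and count (context -> next token) pairs.
function countNgrams(tokens, n) {
  const counts = new Map(); // context string -> Map(next token -> count)
  for (let i = 0; i + n <= tokens.length; i++) {
    const context = tokens.slice(i, i + n - 1).join(" ");
    const next = tokens[i + n - 1];
    if (!counts.has(context)) counts.set(context, new Map());
    const row = counts.get(context);
    row.set(next, (row.get(next) || 0) + 1);
  }
  return counts;
}

// Steps 3-4: add-one smoothed P(next | context) = (count + 1) / (contextTotal + V).
function smoothedProb(counts, vocabSize, context, next) {
  const row = counts.get(context) || new Map();
  let total = 0;
  for (const c of row.values()) total += c;
  return ((row.get(next) || 0) + 1) / (total + vocabSize);
}

const tokens = tokenize("the cat sat on the mat the cat ran");
const vocab = [...new Set(tokens)].sort(); // 6 unique tokens
const bigrams = countNgrams(tokens, 2);

// "the" is followed by "cat" twice and "mat" once, so this is (2 + 1) / (3 + 6).
console.log(smoothedProb(bigrams, vocab.length, "the", "cat"));
// A pair never seen in training still gets a small nonzero probability: 1 / (3 + 6).
console.log(smoothedProb(bigrams, vocab.length, "the", "ran"));
```

Because every token in the vocabulary receives the same pseudocount, the smoothed probabilities for a given context still sum to 1.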

Key AI concepts explained

N-gram model: A statistical model that assigns probability to the next token based on the previous n-1 tokens. Bigram = 2 tokens, Trigram = 3, 4-gram = 4. The Markov assumption says only the last n-1 tokens matter — longer n-grams capture more context but require exponentially more training data.

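Laplace smoothing: Adding a pseudocount of 1 to every (context, token) pair before computing probabilities. Without smoothing, any unseen n-gram gets P = 0, which makes log P undefined (infinite perplexity) and forces the model to rely on backoff. Laplace is the simplest form; production n-gram systems typically use stronger methods such as Kneser-Ney smoothing, while neural LLMs avoid exact zeros because softmax assigns every token a nonzero probability.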

Temperature (T): Divides logits (unnormalized log-probabilities) by T before softmax. T < 1 sharpens the distribution, approaching greedy decoding as T approaches 0. T = 1 leaves it unchanged. T > 1 flattens it (more random). Temperature is a standard sampling control in production LLMs, including GPT, Claude, and Gemini.
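
As a rough sketch of the idea, the snippet below rescales a small probability distribution by temperature and then samples from it. The function names (applyTemperature, sampleIndex) and the three-token distribution are assumptions for illustration, and the code expects every input probability to be greater than zero, which holds for a Laplace-smoothed table.

```javascript
// Hypothetical helpers illustrating temperature scaling; not the tool's actual code.

// Treat log-probabilities as logits, divide by T, then renormalize with softmax.
function applyTemperature(probs, T) {
  const logits = probs.map((p) => Math.log(p) / T);
  const maxLogit = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Draw an index from a distribution. `rand` is injectable so results can be reproduced.
function sampleIndex(probs, rand = Math.random) {
  const r = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}

const base = [0.6, 0.3, 0.1];
const sharp = applyTemperature(base, 0.5); // top token gains probability
const flat = applyTemperature(base, 2.0);  // distribution moves toward uniform
console.log(sharp, flat, sampleIndex(flat));
```

Dividing log-probabilities by T is equivalent to raising each probability to the power 1/T and renormalizing, which is why T = 1 returns the original distribution.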

Perplexity: The geometric mean of the inverse probability the model assigns to each token of the test set. Lower = better. Perplexity = 1 means the model perfectly predicts every token. Perplexity = V (vocabulary size) is what a model scores when it guesses uniformly at random over the vocabulary. Perplexity values are only comparable between models that share the same tokenizer and test set.
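
The two boundary cases can be checked with a short sketch. The function name perplexity and the sample values are assumptions for this example; the input is the list of probabilities the model assigned to each correct token.

```javascript
// Hypothetical helper: perplexity = exp(-(1/N) * sum(log p_i)).
function perplexity(probabilities) {
  const n = probabilities.length;
  const sumLog = probabilities.reduce((acc, p) => acc + Math.log(p), 0);
  return Math.exp(-sumLog / n);
}

const vocabSize = 50;

// A model that is certain and correct on every token scores 1.
const perfect = perplexity([1, 1, 1, 1]);

// A model that spreads probability uniformly over the vocabulary scores about V.
const uniform = perplexity(new Array(20).fill(1 / vocabSize));

console.log(perfect, uniform);
```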

Self-attention: For each token, compute Q = Wq·x, K = Wk·x, V = Wv·x. Attention weight = softmax(Q·Kᵀ / √dₖ). Output = Σ (weight × V). This lets every token attend to every other token in the sequence, with relevance determined by the dot product of their Query and Key vectors.
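
A minimal sketch of the weight and output computation follows. To keep it short it assumes the Q, K, and V matrices are already given (the learned projections Wq, Wk, Wv are skipped), and the function names and toy values are illustrative rather than taken from the tool.

```javascript
// Hypothetical single-head attention over plain arrays (rows = tokens).

function softmax(row) {
  const m = Math.max(...row); // subtract the max for numerical stability
  const exps = row.map((x) => Math.exp(x - m));
  const s = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / s);
}

function dot(a, b) {
  return a.reduce((acc, x, i) => acc + x * b[i], 0);
}

function attention(Q, K, V) {
  const dk = K[0].length;
  // weights[i][j] = how strongly token i attends to token j
  const weights = Q.map((q) => softmax(K.map((k) => dot(q, k) / Math.sqrt(dk))));
  // output[i] = weighted sum of the value vectors
  const output = weights.map((w) =>
    V[0].map((_, col) => w.reduce((acc, wi, i) => acc + wi * V[i][col], 0))
  );
  return { weights, output };
}

// Three tokens with 2-dimensional queries, keys, and values (toy numbers).
const Q = [[1, 0], [0, 1], [1, 1]];
const K = [[1, 0], [0, 1], [1, 1]];
const V = [[1, 0], [0, 1], [0.5, 0.5]];

const { weights, output } = attention(Q, K, V);
console.log(weights); // each row is a probability distribution over the 3 tokens
console.log(output);
```

Each row of weights is what an attention heatmap displays: one row per query token, one column per key token.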

Word embeddings: Dense vector representations of tokens. In a transformer, the embedding table E has shape (vocab_size × d_model). Semantically similar tokens end up with similar vectors because they appear in similar contexts during training — this is the distributional hypothesis.
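
The distributional hypothesis can be illustrated without any neural network by comparing raw co-occurrence vectors. The sketch below is an assumption-laden toy (the function names, window size, and corpus are invented for the example) and it stops at cosine similarity rather than projecting to 2D.

```javascript
// Hypothetical helpers; a learned embedding table would replace these count vectors.

// For every token, count how often each vocabulary word appears within
// `windowSize` positions of it. Each row is that token's co-occurrence vector.
function cooccurrenceVectors(tokens, windowSize) {
  const vocab = [...new Set(tokens)].sort();
  const index = new Map(vocab.map((word, i) => [word, i]));
  const vectors = new Map(vocab.map((word) => [word, new Array(vocab.length).fill(0)]));
  for (let i = 0; i < tokens.length; i++) {
    const start = Math.max(0, i - windowSize);
    const end = Math.min(tokens.length - 1, i + windowSize);
    for (let j = start; j <= end; j++) {
      if (j !== i) vectors.get(tokens[i])[index.get(tokens[j])] += 1;
    }
  }
  return vectors;
}

function cosine(a, b) {
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dotProduct / Math.sqrt(normA * normB);
}

const corpus = "the king rules the land the queen rules the land the cat eats the fish";
const vectors = cooccurrenceVectors(corpus.split(" "), 1);

// "king" and "queen" share the same neighbors; "cat" shares only "the".
const kingQueen = cosine(vectors.get("king"), vectors.get("queen"));
const kingCat = cosine(vectors.get("king"), vectors.get("cat"));
console.log(kingQueen, kingCat);
```

Words that appear in the same contexts end up with vectors pointing in similar directions, so kingQueen comes out higher than kingCat.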

Frequently asked questions

Do I need an API key, account, or internet connection to use this tool?

No — everything runs in your browser. The tool is a single self-contained HTML file with all logic written in JavaScript. There are no API calls, no server requests, no database, and no authentication. Once the page loads, you can use it completely offline. The n-gram model training, tokenization, text generation, and all visualizations run entirely on your device.

What is the difference between bigram, trigram, and 4-gram models?

The n in n-gram is the total sequence length (context + prediction). A bigram model (n=2) predicts the next token using only the 1 preceding token: P(word | previous_word). A trigram model (n=3) uses the 2 preceding tokens: P(word | word-2, word-1). A 4-gram model uses the 3 preceding tokens. Longer n-grams capture more context and produce more coherent text, but they need exponentially more training data — a trigram model needs to have seen every 2-word phrase, while a 4-gram needs every 3-word phrase. On a short 200-word training corpus, bigrams generalize better because there are more data points per context. Trigrams start to overfit on small corpora.
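
The sparsity effect is easy to measure. The following sketch, with a hypothetical contextStats helper and a toy sentence, counts how many distinct contexts each model size produces and how many training examples each context gets on average.

```javascript
// Hypothetical helper: how thinly is the training data spread across contexts?
function contextStats(tokens, n) {
  const contexts = new Map(); // context string -> number of training windows
  for (let i = 0; i + n <= tokens.length; i++) {
    const ctx = tokens.slice(i, i + n - 1).join(" ");
    contexts.set(ctx, (contexts.get(ctx) || 0) + 1);
  }
  const windows = Math.max(0, tokens.length - n + 1);
  return {
    distinctContexts: contexts.size,
    examplesPerContext: contexts.size ? windows / contexts.size : 0,
  };
}

const tokens = "the cat sat on the mat and the cat sat on the rug".split(" ");
const bigramStats = contextStats(tokens, 2);
const trigramStats = contextStats(tokens, 3);
const fourgramStats = contextStats(tokens, 4);
console.log(bigramStats, trigramStats, fourgramStats);
```

On this 13-token sentence the average number of examples per context shrinks as n grows, which is the same pressure a 200-word corpus puts on trigram and 4-gram models.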

Why does my model generate repetitive or incoherent text?

This is expected behavior from n-gram models on small corpora. N-gram models have no semantic understanding — they only capture statistical co-occurrence patterns in the training text. With a 200-word corpus, the model has seen few examples per context, so it often falls back to the most frequent unigrams, producing repetitive phrases. Try: (1) reducing temperature below 1.0 for more coherent (if repetitive) output; (2) using a bigram instead of trigram model to avoid sparse contexts; or (3) switching to a longer training text. This limitation is exactly why transformer LLMs use neural networks — learned embeddings and attention allow generalization far beyond what was literally seen in training.
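
The fallback behavior described above can be sketched as a lookup that shortens the context until it finds one seen in training. This is an illustrative assumption about how such an engine might work, not the tool's code: buildTables, lookupWithBackoff, and generate are hypothetical names, and sampling here uses raw counts with no smoothing or temperature.

```javascript
// Build count tables for every order from 1 (unigram) up to n.
function buildTables(tokens, n) {
  const tables = [];
  for (let order = 1; order <= n; order++) {
    const table = new Map(); // context string -> Map(next token -> count)
    for (let i = 0; i + order <= tokens.length; i++) {
      const ctx = tokens.slice(i, i + order - 1).join(" ");
      const next = tokens[i + order - 1];
      if (!table.has(ctx)) table.set(ctx, new Map());
      const row = table.get(ctx);
      row.set(next, (row.get(next) || 0) + 1);
    }
    tables.push(table); // tables[k] holds the (k + 1)-gram counts
  }
  return tables;
}

// Try the longest context first, then back off one token at a time.
function lookupWithBackoff(tables, history) {
  for (let order = tables.length; order >= 1; order--) {
    if (history.length < order - 1) continue; // not enough history for this order
    const ctx = order === 1 ? "" : history.slice(-(order - 1)).join(" ");
    const row = tables[order - 1].get(ctx);
    if (row) return { order, row };
  }
  return null; // only reached when the training corpus was empty
}

// Autoregressive loop: sample a token, append it, repeat.
function generate(tables, prompt, maxTokens, rand = Math.random) {
  const out = [...prompt];
  for (let step = 0; step < maxTokens; step++) {
    const hit = lookupWithBackoff(tables, out);
    if (!hit) break;
    const entries = [...hit.row.entries()];
    const total = entries.reduce((acc, [, count]) => acc + count, 0);
    let r = rand() * total;
    let chosen = entries[entries.length - 1][0];
    for (const [token, count] of entries) {
      r -= count;
      if (r < 0) { chosen = token; break; }
    }
    out.push(chosen);
  }
  return out;
}

const tables = buildTables("once upon a time there was a king".split(" "), 3);
console.log(lookupWithBackoff(tables, ["dragon", "a"]).order); // unseen trigram context
console.log(generate(tables, ["once", "upon"], 4).join(" "));
```

When the order returned by lookupWithBackoff drops to 1, the model is choosing purely by word frequency, which is where the repetitive output comes from.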

How is this related to GPT, Claude, or LLaMA?

GPT, Claude, and LLaMA are all autoregressive language models — they predict the next token given context, then append that token and predict the next one, just like the generator in this tool. The difference is in how they represent context: this tool uses an exact n-gram lookup table (a dictionary of counts), while transformer LLMs use deep neural networks with billions of parameters that learn continuous representations of meaning. The training objective is similar (maximize log probability of the next token), the evaluation metric is the same (perplexity), and the sampling techniques are identical (temperature, top-k, top-p). This tool is pedagogically the right starting point: master the concept here, then extend to understanding how transformers learn the same thing but with vastly more expressive representations.

What is Laplace smoothing and why does the model need it?

Without smoothing, any n-gram context the model has never seen during training gets a probability of 0/0, which is undefined, and any unseen token after a seen context gets exactly 0. When either case appears during generation or perplexity evaluation, the model crashes or produces infinite perplexity. Laplace (Add-1) smoothing adds a pseudocount of 1 to every possible (context, token) pair before dividing, so the minimum probability of any token given any context is 1/(count_total + vocab_size) rather than 0. This avoids the zero-probability problem at the cost of overestimating unseen events, since probability mass is taken from observed n-grams and spread across the whole vocabulary. Production n-gram systems use more sophisticated techniques like Kneser-Ney smoothing (which redistributes probability mass based on the diversity of contexts a word appears in, not just its frequency), while neural LLMs rely on the softmax output layer and the network's generalization, which give every token a nonzero probability.
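
A short worked example makes the trade-off concrete. The numbers are assumptions chosen for illustration: a vocabulary of V = 100 tokens, a context c seen 3 times in training, and a token w that followed it 2 of those times.

```latex
% Add-one (Laplace) estimate
\[
P_{\mathrm{Laplace}}(w \mid c) = \frac{\mathrm{count}(c, w) + 1}{\mathrm{count}(c) + V}
\]

% Seen pair: count(c, w) = 2, count(c) = 3, V = 100
\[
P_{\mathrm{Laplace}}(w \mid c) = \frac{2 + 1}{3 + 100} = \frac{3}{103} \approx 0.029
\qquad \text{(unsmoothed: } \tfrac{2}{3} \approx 0.667\text{)}
\]

% Unseen token after the same context: count(c, w') = 0
\[
P_{\mathrm{Laplace}}(w' \mid c) = \frac{0 + 1}{3 + 100} = \frac{1}{103} \approx 0.0097
\]

% Unseen context: count(c') = 0, so every token gets the uniform value
\[
P_{\mathrm{Laplace}}(w \mid c') = \frac{0 + 1}{0 + 100} = 0.01
\]
```

With a small count and a comparatively large vocabulary, most of the probability mass moves to unseen tokens, which is why stronger smoothing methods are preferred when more data is available.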

How do I interpret the training loss curve?

The loss curve shows cross-entropy loss = −log P(correct_token | context), averaged over all tokens in the corpus, plotted against training epoch. Lower loss = the model assigns higher probability to the correct next tokens. The exponential decay shape (steep initial drop, then a flattening tail toward an asymptote) is characteristic of real neural network training. The final perplexity value = exp(loss). If your loss converges to ~2.5 nats, perplexity = exp(2.5) ≈ 12. For an n-gram model trained on 200 words with a 100-word vocabulary, a final perplexity of 15–40 is typical. Note: a count-based n-gram table is fully determined by a single pass over the corpus, so repeated epochs cannot change its probabilities; the per-epoch curve is best read as an illustration of how iterative neural network training converges, not as a property of n-gram counting itself.
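
The relationship between loss and perplexity, and the general shape of the curve, can be sketched as follows. The decay formula, its rate parameter, and the function names are assumptions made for this illustration; the document does not specify how the tool computes its animated curve.

```javascript
// Cross-entropy loss in nats converts to perplexity by exponentiation.
function perplexityFromLoss(lossNats) {
  return Math.exp(lossNats);
}

// Hypothetical curve: loss(e) = final + (initial - final) * exp(-rate * e).
function decayCurve(initialLoss, finalLoss, rate, epochs) {
  return Array.from(
    { length: epochs + 1 },
    (_, epoch) => finalLoss + (initialLoss - finalLoss) * Math.exp(-rate * epoch)
  );
}

const vocabSize = 100;
// Start at ln(V), the loss of uniform guessing, and decay toward 2.5 nats.
const curve = decayCurve(Math.log(vocabSize), 2.5, 0.1, 50);

console.log(perplexityFromLoss(curve[0]));  // starts near the vocabulary size
console.log(perplexityFromLoss(curve[50])); // approaches exp(2.5)
```

Starting the curve at ln(V) matches the statement that perplexity begins near the vocabulary size, because exp(ln V) = V.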

Related tools & guides