Question 1

Can I build an LLM without GPU or cloud compute?

Accepted Answer

Yes. This tool runs entirely in your browser with no server-side computation. It implements an n-gram language model, the statistical predecessor of modern LLMs. An n-gram model counts how often each sequence of n tokens (bigram = 2, trigram = 3, 4-gram = 4) appears in your training text, then normalizes those counts into a probability distribution over the next token given the preceding n−1 tokens. Transformer LLMs like GPT use neural networks instead of count tables, but they are trained on the same task: given a context, assign a probability to every possible next token. Running this in the browser teaches you tokenization, vocabulary construction, probability distributions, temperature sampling, and perplexity, which are the foundational concepts, without needing a GPU or API key.
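The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names `train_ngram` and `next_token_probs` and the toy sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=2):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_token_probs(counts, context):
    """Normalize the raw counts for one context into probabilities."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    # An unseen context has no followers, so this returns an empty dict.
    return {tok: c / total for tok, c in followers.items()}

# Hypothetical toy corpus, already split into word tokens.
tokens = "the cat sat on the mat the cat ran".split()
model = train_ngram(tokens, n=2)
probs = next_token_probs(model, ["the"])
print(probs)  # "cat" follows "the" twice, "mat" once
```

Here "the" is followed by "cat" two times out of three, so the model gives "cat" a probability of 2/3 and "mat" 1/3. A context the model never saw gets no distribution at all, which is the gap that smoothing fills.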

Question 2

What is tokenization and why does it matter for LLMs?

Accepted Answer

Tokenization is the process of splitting raw text into discrete units (tokens) that the model can process. Word tokenization splits on whitespace and punctuation, usually lowercasing as well (e.g., "Hello world" → ["hello", "world"]). Character tokenization splits every character (e.g., "cat" → ["c", "a", "t"]). Byte Pair Encoding (BPE), variants of which are used by GPT, Claude, and LLaMA, starts at the character or byte level and iteratively merges the most frequent adjacent pair into a new token (e.g., "un" + "able" → "unable"), trading vocabulary size against sequence length. The choice of tokenizer directly controls what the model can learn: word-level tokenizers cannot represent words they never saw and handle morphology poorly; character-level models handle any word but produce long sequences; BPE sits between the two, keeping common words whole and splitting rare words into reusable pieces. In this tool you can switch between all three and see the tokenized output live.
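The three strategies can be compared with a short Python sketch. The function names and the three-word example are assumptions for illustration, and `bpe_merge_once` performs only a single merge step; a real BPE tokenizer repeats this until it reaches a target vocabulary size.

```python
import re
from collections import Counter

def word_tokenize(text):
    """Lowercase, then keep runs of word characters (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def char_tokenize(text):
    """Every character becomes its own token."""
    return list(text)

def bpe_merge_once(sequences):
    """Find the most frequent adjacent pair and merge it in every sequence."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # fuse the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

print(word_tokenize("Hello world"))  # ['hello', 'world']
print(char_tokenize("cat"))          # ['c', 'a', 't']

words = [char_tokenize(w) for w in ["low", "lower", "lowest"]]
merged, best = bpe_merge_once(words)
print(best, merged[1])  # ('l', 'o') ['lo', 'w', 'e', 'r']
```

After one merge, "lo" is a single token in all three words. Each further merge shortens the sequences and adds one entry to the vocabulary, which is the trade-off the answer describes.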

Question 3

What is temperature in text generation and how does it work?

Accepted Answer

Temperature (T) controls the randomness of token selection during generation. The model produces a score (logit) for every token in the vocabulary, and temperature scaling divides each logit by T before softmax turns the scores into probabilities. When T < 1 (e.g., 0.3), the distribution becomes sharper: the highest-probability token dominates and generation is predictable and often repetitive. When T = 1, the distribution is unchanged, so tokens are sampled at the probabilities the model learned. When T > 1 (e.g., 1.5), the distribution flattens: lower-probability tokens get a bigger share, producing more varied and surprising outputs at the cost of coherence. T = 0 cannot be computed directly because it divides by zero; as T approaches 0 the result converges to greedy (argmax) decoding, so implementations treat T = 0 as a special case that always picks the top token. Real LLMs like GPT and Claude use temperature alongside top-p (nucleus) sampling to balance coherence with diversity.
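A minimal Python sketch of temperature scaling follows. The logits and token list are made-up values, and the function names are assumptions; the only library calls are `math.exp` and `random.choices` from the standard library.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle T = 0 as argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng=random):
    """Pick one token; T = 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return tokens[max(range(len(logits)), key=logits.__getitem__)]
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["cat", "dog", "fish"]   # hypothetical vocabulary
logits = [2.0, 1.0, 0.0]          # hypothetical model scores

for t in (0.3, 1.0, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
print(sample(tokens, logits, 0))  # always the top-scoring token
```

Printing the three distributions side by side shows the top token's share shrinking as T rises, which is the sharpening and flattening effect in numbers.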

Question 4

What is perplexity and how do I know if my LLM is trained well?

Accepted Answer

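Perplexity measures how "surprised" the model is by the test text: lower perplexity means the model assigns higher probability to the actual tokens, indicating better predictive quality. Formally, perplexity = exp(−(1/N) × Σ log P(wᵢ | context)), where N is the number of tokens, log is the natural logarithm, and P is the model's predicted probability. A model that guesses uniformly over a vocabulary of V tokens has perplexity exactly V. A model that assigns probability 1 to every actual token reaches the minimum, perplexity = 1, which is attainable only when the text is fully predictable from its context, as with memorized training data. Reported perplexities for well-trained models on benchmark text often fall in the 10–50 range, but the figure depends on the domain, the dataset, and the tokenizer, so it is comparable only between models scored on the same tokenized text. To judge whether a model is trained well, measure perplexity on held-out text; a low value on the training text alone may only reflect memorization. In this tool, the training loss curve animates as perplexity drops epoch by epoch. This is the same visual pattern you see in real transformer training dashboards, where the cross-entropy loss (the natural log of perplexity) curves down and flattens as the model converges.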
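The formula translates directly into Python. In this sketch, `token_probs` is assumed to be the list of probabilities the model assigned to each token that actually occurred; the example values are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

# Uniform guessing over a 50-token vocabulary: perplexity equals V = 50.
print(perplexity([1 / 50] * 10))

# Probability 1 on every actual token: the minimum, perplexity = 1.
print(perplexity([1.0] * 5))

# A mixed case; avg_neg_log is the cross-entropy loss, so
# log(perplexity) recovers the loss shown on a training curve.
probs = [0.5, 0.25, 0.125]
print(perplexity(probs), math.log(perplexity(probs)))
```

A probability of exactly 0 for an observed token makes `math.log` fail, which corresponds to infinite perplexity. That is one practical reason count-based models use smoothing.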

Question 5

How does self-attention work in transformer LLMs?

Accepted Answer

Self-attention lets each token in a sequence attend to the other tokens, weighted by relevance. In GPT-style decoder models a causal mask restricts each token to itself and earlier positions, so the model cannot look ahead at the tokens it is trying to predict. For each token, three vectors are computed: Query (Q, what this token is looking for), Key (K, what this token offers), and Value (V, what information to pass forward). Attention weights are computed as softmax(Q·Kᵀ / √dₖ), where dₖ is the key dimension; the scaling keeps the dot products from growing with dₖ and saturating the softmax, which would leave very small gradients. Each token's output is the sum of the Value vectors weighted by its row of attention weights. The result is a context-aware representation of each token that reflects which other tokens in the sequence are relevant to it. The interactive attention heatmap in Chapter 5 of this tool visualizes these weights as a matrix, so you can see which word pairs a model attends to most strongly in a given sentence.
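The attention computation can be written out in plain Python for a single head. This is a sketch under simplifying assumptions: Q, K, and V are passed in directly as lists of vectors (a real transformer derives them from token embeddings with learned projection matrices), and the optional `causal` flag is an illustrative parameter that masks later positions.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may not attend to positions after i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Two tokens with 2-dimensional vectors (made-up values).
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out, w = attention(Q, K, V)
print(w)         # each row sums to 1; each token weights itself most
_, w_causal = attention(Q, K, V, causal=True)
print(w_causal)  # the first token can only attend to itself
```

The `weights` matrix returned here is the kind of quantity an attention heatmap displays: row i shows how much token i draws from every token in the sequence.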

Question 6

What is the difference between an n-gram model and a transformer LLM?

Accepted Answer

Both are language models that predict the next token given context, but they differ in how they represent and generalize from training data. An n-gram model stores exact counts of token sequences (bigrams, trigrams, etc.) and uses frequency ratios as probabilities. It cannot generalize beyond sequences it has seen: if a combination never appeared in training, the model either backs off to shorter n-grams or relies on smoothing. Laplace (add-one) smoothing adds 1 to every count so that unseen combinations get a small non-zero probability, and a context that was never seen at all ends up with a uniform distribution. A transformer LLM like GPT or Claude learns dense vector representations (embeddings) for every token, then uses layers of multi-head self-attention and feed-forward networks to model dependencies across its whole context window, which is far longer than the few tokens an n-gram model sees. The neural network generalizes: words with similar meanings end up with similar embeddings, and the attention mechanism can relate any two tokens in the window regardless of the distance between them. The same fundamental concepts apply to both: tokenization, context windows, probability distributions, temperature sampling, and training loss. That makes the n-gram model a natural pedagogical stepping-stone to understanding transformers.
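How Laplace smoothing handles unseen combinations can be shown with a small bigram example in Python. The function names, the `alpha` parameter (1.0 gives classic add-one smoothing), and the toy sentence are assumptions for illustration, not the tool's implementation.

```python
from collections import Counter, defaultdict

def bigram_counts(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def laplace_prob(counts, vocab, prev, nxt, alpha=1.0):
    """P(nxt | prev) with alpha added to every possible follower's count."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return (followers[nxt] + alpha) / (total + alpha * len(vocab))

tokens = "the cat sat on the mat".split()
vocab = set(tokens)                     # 5 distinct tokens
counts = bigram_counts(tokens)

print(laplace_prob(counts, vocab, "the", "cat"))  # seen once:  (1 + 1) / (2 + 5)
print(laplace_prob(counts, vocab, "the", "sat"))  # never seen: (0 + 1) / (2 + 5)
print(laplace_prob(counts, vocab, "dog", "cat"))  # unseen context: 1 / 5, uniform
```

The smoothed probabilities for a given context still sum to 1 over the vocabulary. What the count table cannot do is notice that "dog" behaves like "cat"; learned embeddings in a neural model are what provide that kind of generalization.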

Question 7

What is fine-tuning an LLM and when should I use it?

Accepted Answer

Fine-tuning is the process of continuing to train a pre-trained model on a smaller, domain-specific dataset to adapt its behavior. A base LLM, pre-trained on a very large corpus of web text and other sources, knows general language patterns but may not know your specific domain's terminology, style, or format. Fine-tuning adjusts the model's weights on your data, shifting its probability distributions toward your domain while retaining most of the broad language understanding from pre-training. Use it when you have a representative set of domain examples and the base model's output is consistently off in vocabulary, tone, or format. Common uses: making a medical LLM that uses clinical terminology; making a code model that follows your coding standards; making a customer service bot that responds in your company's voice. In this tool, Chapter 9 shows a direct comparison between a base n-gram model and a fine-tuned version, so you can see the shift in generated text vocabulary and style after domain-specific training.
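For a count-based model, one simple analogue of fine-tuning is to keep the base counts and add counts from the domain text on top, optionally with extra weight. The Python sketch below illustrates that idea only; it is an assumption about how such a comparison could be built, not a description of how Chapter 9 is implemented, and real LLM fine-tuning updates neural network weights with gradient descent rather than mixing counts. The `weight` parameter and both toy corpora are hypothetical.

```python
import copy
from collections import Counter, defaultdict

def count_bigrams(tokens, counts=None, weight=1.0):
    """Add bigram counts from tokens to an existing table (or a new one)."""
    counts = counts if counts is not None else defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += weight
    return counts

def prob(counts, prev, nxt):
    """Unsmoothed P(nxt | prev); 0.0 if the context was never seen."""
    followers = counts.get(prev, {})
    total = sum(followers.values())
    return followers.get(nxt, 0) / total if total else 0.0

# "Pre-training" on general text.
base = count_bigrams("the cat is happy the dog is happy".split())

# "Fine-tuning": copy the base table, then add weighted domain counts.
tuned = count_bigrams("the patient is stable".split(),
                      counts=copy.deepcopy(base), weight=3.0)

print(prob(base, "is", "stable"))   # 0.0: the base model never saw it
print(prob(tuned, "is", "stable"))  # 3 / (2 + 3) = 0.6
print(prob(tuned, "is", "happy"))   # 0.4: general knowledge is kept
```

The domain continuation becomes the most likely one, yet the base model's option keeps a share of the probability, which mirrors the shift toward the domain while retaining general language patterns.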

Build Your LLM — Interactive AI Learning Lab

About Build Your LLM

What you'll learn

10 interactive chapters

6-step build wizard

The LLM engine — how it works

Key AI concepts explained

Frequently asked questions

Do I need an API key, account, or internet connection to use this tool?

What is the difference between bigram, trigram, and 4-gram models?

Why does my model generate repetitive or incoherent text?

How is this related to GPT, Claude, or LLaMA?

What is Laplace smoothing and why does the model need it?

How do I interpret the training loss curve?

Related tools & guides