The mental model that changes everything
An LLM is not a search engine. It is not a database. It is a probabilistic text completion machine: a system that, given a sequence of tokens, predicts a probability distribution over the next token, picks one, appends it, and repeats until a stop condition is reached. That single insight changes how you design, evaluate, and govern every AI system you build on top of one.
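Reduced to pseudocode, the whole generation process is one loop. This is a minimal sketch, assuming a hypothetical model() callable that returns one score per vocabulary entry; real decoders add the sampling controls covered under Temperature below.

```python
def generate(model, token_ids: list[int], stop_id: int, max_new: int = 256) -> list[int]:
    # Autoregressive loop: score every candidate next token, pick one,
    # append it, and feed the longer sequence back into the model.
    for _ in range(max_new):
        scores = model(token_ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if next_id == stop_id:     # stop condition reached
            break
        token_ids.append(next_id)
    return token_ids
```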
Understanding this isn't academic. It directly informs why RAG outperforms fine-tuning for most enterprise use cases, why prompt engineering is a legitimate engineering discipline, and why hallucination cannot be patched out — only engineered around.
Tokenization: the LLM doesn't read words
The first thing that happens when you send a prompt is tokenization. The text is split into tokens — subword units that are roughly 4 characters or 0.75 words on average. "tokenization" becomes ["token", "ization"]. "ChatGPT" might be a single token or two, depending on the tokenizer.
This matters for three reasons. First, pricing is per token, not per word: a 1,000-word English document is roughly 1,300 tokens. Second, the context window limit is measured in tokens: GPT-4 Turbo has a 128K token window; Claude 3 has 200K. Third, unusual words, code, and non-English text tend to tokenize less efficiently, so the same content costs more and consumes more of the window.
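You can inspect this directly with OpenAI's tiktoken library. A minimal sketch; the exact splits depend on the tokenizer, here cl100k_base:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["tokenization", "ChatGPT", "quarterly earnings report"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```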
Embeddings and the meaning space
After tokenization, each token is mapped to an embedding, a high-dimensional vector that encodes its meaning; the surrounding context gets folded in by the attention layers described below. Semantically similar words have vectors that are geometrically close: "king" and "queen" are near each other; "quarterly earnings" is far from both.
This vector representation is the foundation of Retrieval-Augmented Generation (RAG). When you build a RAG system, you embed your documents with an embedding model and store the vectors in a vector database; at query time you embed the user's query with the same model and retrieve the documents whose vectors are closest, typically by cosine similarity.
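The retrieval step itself is just vector arithmetic. A minimal sketch, assuming a hypothetical embed() function that wraps whatever embedding model you use; a production system would precompute and index the document vectors rather than embed them on every query:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (semantically close), near 0 = unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, docs: list[str], embed, top_k: int = 3) -> list[str]:
    # embed() must return same-dimensional vectors for queries and documents.
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in docs]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]
```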
The transformer: attention is all you need
The core of every modern LLM is the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." The key mechanism is self-attention: for each token, the model calculates how much "attention" to pay to every other token in the context window when predicting what comes next.
In the sentence "The bank by the river was muddy," when the model processes "muddy," it attends heavily to "bank" and "river" to understand that "bank" means a riverbank, not a financial institution. This disambiguation happens for every token, across many attention heads and many stacked layers (GPT-3 used 96 of each; GPT-4's architecture has not been disclosed), and it is what gives LLMs their remarkable language understanding.
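The computation at the heart of each head is compact. A minimal single-head sketch in numpy; real models add learned Q/K/V projection matrices, multiple heads, causal masking, and stacked layers:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays, produced in a real transformer
    # by learned linear projections of the token representations.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # attention-weighted mix of values
```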
Temperature: controlling randomness
At the output stage, the model produces a probability distribution over all possible next tokens, and temperature rescales that distribution before sampling. At temperature 0, sampling collapses to greedy decoding: the model always picks the highest-probability token, so outputs are repeatable in practice (floating-point and serving nondeterminism can still cause occasional variation). At temperature 1.0, you sample in proportion to the model's own probabilities. Above roughly 1.5, low-probability tokens get picked often enough that outputs become unpredictable.
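A minimal sketch of the sampling step; logits here stand in for the raw per-token scores a real model emits:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))      # greedy: always the top token
    scaled = logits / temperature          # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```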
For production enterprise systems — classification, extraction, structured data generation — you almost always want temperature 0 or close to it. For creative applications, 0.7–1.0 is more appropriate.
Hallucination is not a bug
Hallucination — when the model generates confident, fluent text that is factually wrong — is not a defect that will be patched. It is a property of probabilistic generation. The model does not "know" what is true; it predicts what sounds true based on patterns in its training data.
Production systems manage hallucination through three mechanisms: grounding (giving the model verified context via RAG), output validation (a second model checks the output), and human-in-the-loop review for high-stakes decisions.
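The grounding pattern, reduced to its essentials: retrieve relevant context (the retrieve() sketch from earlier works here) and constrain the model to it. call_llm() is a hypothetical stand-in for any chat-completion API, and the prompt wording is illustrative, not canonical:

```python
GROUNDED_PROMPT = """Answer using ONLY the context below.
If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {question}"""

def grounded_answer(question: str, docs: list[str], embed, call_llm) -> str:
    context = "\n\n".join(retrieve(question, docs, embed))
    # Temperature 0: this is an extraction-style task, so we want
    # repeatable output, per the Temperature section above.
    prompt = GROUNDED_PROMPT.format(context=context, question=question)
    return call_llm(prompt, temperature=0)
```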