The mental model that changes everything
An LLM is not a search engine. It is not a database. It is a probabilistic text completion machine: a system that, given a sequence of tokens, predicts a probability distribution over the next token, picks one, appends it, and repeats until a stop condition is reached. That single insight changes how you design, evaluate, and govern every AI system you build on top of one.
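Reduced to pseudocode, the whole generation process is one loop. This is a minimal sketch, assuming a hypothetical model() callable that returns one score per vocabulary entry; real decoders add the sampling controls covered under Temperature below.

```python
def generate(model, token_ids: list[int], stop_id: int, max_new: int = 256) -> list[int]:
    # Autoregressive loop: score every candidate next token, pick one,
    # append it, and feed the longer sequence back into the model.
    for _ in range(max_new):
        scores = model(token_ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if next_id == stop_id:     # stop condition reached
            break
        token_ids.append(next_id)
    return token_ids
```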
Understanding this isn't academic. It directly informs why RAG outperforms fine-tuning for most enterprise use cases, why prompt engineering is a legitimate engineering discipline, and why hallucination cannot be patched out — only engineered around.
Tokenization: the LLM doesn't read words
The first thing that happens when you send a prompt is tokenization. The text is split into tokens — subword units that are roughly 4 characters or 0.75 words on average. "tokenization" becomes ["token", "ization"]. "ChatGPT" might be a single token or two, depending on the tokenizer.
This matters for three reasons. First, pricing is per token, not per word: a 1,000-word English document is roughly 1,300 tokens. Second, the context window limit is measured in tokens: GPT-4 Turbo has a 128K token window; Claude 3 has 200K. Third, unusual words, code, and non-English text tend to tokenize less efficiently, so the same content costs more and consumes more of the window.
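You can inspect this directly with OpenAI's tiktoken library. A minimal sketch; the exact splits depend on the tokenizer, here cl100k_base:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["tokenization", "ChatGPT", "quarterly earnings report"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```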
Embeddings and the meaning space
After tokenization, each token is mapped to an embedding, a high-dimensional vector that encodes its meaning; the surrounding context gets folded in by the attention layers described below. Semantically similar words have vectors that are geometrically close: "king" and "queen" are near each other; "quarterly earnings" is far from both.
This vector representation is the foundation of Retrieval-Augmented Generation (RAG). When you build a RAG system, you embed your documents with an embedding model and store the vectors in a vector database; at query time you embed the user's query with the same model and retrieve the documents whose vectors are closest, typically by cosine similarity.
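The retrieval step itself is just vector arithmetic. A minimal sketch, assuming a hypothetical embed() function that wraps whatever embedding model you use; a production system would precompute and index the document vectors rather than embed them on every query:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (semantically close), near 0 = unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, docs: list[str], embed, top_k: int = 3) -> list[str]:
    # embed() must return same-dimensional vectors for queries and documents.
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in docs]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]
```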
The transformer: attention is all you need
The core of every modern LLM is the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." The key mechanism is self-attention: for each token, the model calculates how much "attention" to pay to every other token in the context window when predicting what comes next.
In the sentence "The bank by the river was muddy," when the model processes "muddy," it attends heavily to "bank" and "river" to understand that "bank" means a riverbank, not a financial institution. This disambiguation happens for every token, across many attention heads and many stacked layers (GPT-3 used 96 of each; GPT-4's architecture has not been disclosed), and it is what gives LLMs their remarkable language understanding.
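The computation at the heart of each head is compact. A minimal single-head sketch in numpy; real models add learned Q/K/V projection matrices, multiple heads, causal masking, and stacked layers:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays, produced in a real transformer
    # by learned linear projections of the token representations.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # attention-weighted mix of values
```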
Temperature: controlling randomness
At the output stage, the model produces a probability distribution over all possible next tokens, and temperature rescales that distribution before sampling. At temperature 0, sampling collapses to greedy decoding: the model always picks the highest-probability token, so outputs are repeatable in practice (floating-point and serving nondeterminism can still cause occasional variation). At temperature 1.0, you sample in proportion to the model's own probabilities. Above roughly 1.5, low-probability tokens get picked often enough that outputs become unpredictable.
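A minimal sketch of the sampling step; logits here stand in for the raw per-token scores a real model emits:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))      # greedy: always the top token
    scaled = logits / temperature          # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```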
For production enterprise systems — classification, extraction, structured data generation — you almost always want temperature 0 or close to it. For creative applications, 0.7–1.0 is more appropriate.
Hallucination is not a bug
Hallucination — when the model generates confident, fluent text that is factually wrong — is not a defect that will be patched. It is a property of probabilistic generation. The model does not "know" what is true; it predicts what sounds true based on patterns in its training data.
Production systems manage hallucination through three mechanisms: grounding (giving the model verified context via RAG), output validation (a second model checks the output), and human-in-the-loop review for high-stakes decisions.
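The grounding pattern, reduced to its essentials: retrieve relevant context (the retrieve() sketch from earlier works here) and constrain the model to it. call_llm() is a hypothetical stand-in for any chat-completion API, and the prompt wording is illustrative, not canonical:

```python
GROUNDED_PROMPT = """Answer using ONLY the context below.
If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {question}"""

def grounded_answer(question: str, docs: list[str], embed, call_llm) -> str:
    context = "\n\n".join(retrieve(question, docs, embed))
    # Temperature 0: this is an extraction-style task, so we want
    # repeatable output, per the Temperature section above.
    prompt = GROUNDED_PROMPT.format(context=context, question=question)
    return call_llm(prompt, temperature=0)
```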