What makes a system agentic — and why it matters

An agentic system uses an LLM not just to generate text, but as a reasoning engine that plans and executes multi-step tasks autonomously. The agent perceives its context (the user’s request, available tools, memory of prior steps), plans what to do next, takes an action (calls a tool, writes a record, sends a message), observes the result, and repeats until the task is complete or a stop condition is reached.
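
A minimal sketch of that loop in Python (framework-agnostic; llm_plan and execute_tool are hypothetical stand-ins for your model call and tool dispatcher):

```python
MAX_STEPS = 10  # hard stop condition to prevent runaway loops

def run_agent(request: str, tools: dict, llm_plan, execute_tool) -> str:
    memory = []  # working memory of completed steps
    for step in range(MAX_STEPS):
        # Perceive: assemble the full context for this reasoning step
        context = {"request": request, "tools": list(tools), "memory": memory}
        # Plan: the model decides the next action, or decides it is done
        action = llm_plan(context)
        if action["type"] == "finish":
            return action["answer"]
        # Act: execute the selected tool with the chosen parameters
        result = execute_tool(tools, action["tool"], action["params"])
        # Observe: fold the result back into working memory
        memory.append({"action": action, "result": result})
    return "Stopped: step limit reached without completion."
```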

The distinction from a chatbot is not cosmetic. A chatbot generates a response. An agent generates an outcome — a Salesforce record created, an email sent, a database query run, a contract analyzed. The same capability that makes agents powerful is what makes them risky: they act in the real world, and real-world actions are often irreversible.

The four-step agent loop in practice

Perceive: The agent receives its full context at the start of each reasoning step — the user’s original request, the list of available tools (with descriptions and schemas), its working memory of completed steps, and any retrieved context from your knowledge base. The quality of this context package determines the quality of everything that follows. Poorly described tools lead to wrong tool selection. Missing context leads to repeated work.
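
A sketch of what that context package might look like; the field names and the retriever signature are illustrative, not any specific framework's schema:

```python
def build_context(request, tools, memory, retriever):
    return {
        "request": request,
        # Tool descriptions and parameter schemas: the model can only
        # select tools correctly if these are precise.
        "tools": [
            {"name": t.name, "description": t.description, "schema": t.schema}
            for t in tools
        ],
        # Working memory of completed steps prevents repeated work.
        "completed_steps": memory,
        # Retrieved context from the knowledge base, scoped to the request.
        "retrieved": retriever(request, k=5),
    }
```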

Plan: The agent produces a “thought” — a reasoning step that determines what action to take next. This is where chain-of-thought prompting matters: instructing the model to reason step-by-step before selecting an action significantly reduces wrong tool selection and improves error recovery. The plan should be explicit and logged.
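
An illustrative planning step; the prompt wording and the llm call are assumptions, but the pattern (reason step-by-step first, then emit a structured action, then log it) is the standard one:

```python
import json
import logging

PLANNING_PROMPT = """\
You are an agent completing the user's task step by step.
Before selecting a tool, think through:
1. What has been done so far (see completed_steps)?
2. What is the next smallest action that advances the task?
3. Which tool, with which parameters, performs that action?
Then respond with JSON: {"thought": "...", "tool": "...", "params": {...}}
or {"thought": "...", "type": "finish", "answer": "..."}.
"""

def plan(llm, context):
    raw = llm(PLANNING_PROMPT, context)   # hypothetical model call
    action = json.loads(raw)
    logging.info("agent thought: %s", action["thought"])  # log the plan
    return action
```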

Act: The agent executes the selected action — calling the chosen tool with the chosen parameters. Every action should be logged: which tool, what parameters, at what timestamp, in what task context. This is your audit trail.
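
One way to wire that audit trail in, with a hypothetical tools registry; the log entry is written before execution so that failed calls are captured too:

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("agent.audit")

def act(tools: dict, tool_name: str, params: dict, task_id: str):
    """Execute one tool call and record the full audit trail entry."""
    entry = {
        "task_id": task_id,    # in what task context
        "tool": tool_name,     # which tool
        "params": params,      # with what parameters
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.info(json.dumps(entry))
    return tools[tool_name](**params)
```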

Observe: The agent receives the tool’s response — a success result, an error, a partial result — and incorporates it into its next reasoning step. Good error handling here is critical: the agent should recognize failure types (rate limit vs. invalid parameter vs. system unavailability) and respond differently to each.
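
A sketch of failure-type handling in the observe step; the exception classes are hypothetical placeholders for whatever your tool clients actually raise:

```python
import time

class RateLimitError(Exception): ...
class InvalidParameterError(Exception): ...
class SystemUnavailableError(Exception): ...

def observe(call, *, max_retries=3):
    for attempt in range(max_retries):
        try:
            return {"status": "ok", "result": call()}
        except RateLimitError:
            time.sleep(2 ** attempt)  # transient: back off and retry
        except InvalidParameterError as e:
            # Not transient: feed the error back so the agent can re-plan
            # with corrected parameters instead of retrying blindly.
            return {"status": "error", "kind": "invalid_parameter",
                    "detail": str(e)}
        except SystemUnavailableError:
            return {"status": "error", "kind": "system_unavailable",
                    "detail": "tool backend is down; consider an alternative"}
    return {"status": "error", "kind": "rate_limited",
            "detail": "retries exhausted"}
```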

The most dangerous assumption in agentic system design: that the agent will always know when it’s wrong. It won’t. LLMs can confidently execute a sequence of wrong actions, producing outputs that look correct until a downstream process fails. This is why you need external validation — not just the agent’s own self-assessment — before irreversible actions are executed.
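
A sketch of such an external gate; the tool names and rules are placeholders, and the point is that the checks are deterministic code the agent cannot talk its way past:

```python
IRREVERSIBLE_TOOLS = {"send_email", "delete_record", "issue_refund"}

def validate_action(action: dict) -> tuple[bool, str]:
    if action["tool"] not in IRREVERSIBLE_TOOLS:
        return True, "reversible action, no gate"
    params = action["params"]
    # Deterministic checks, independent of the agent's self-assessment:
    if action["tool"] == "send_email" and not params.get("approved_by"):
        return False, "email requires recorded human approval"
    if action["tool"] == "issue_refund" and params.get("amount", 0) > 500:
        return False, "refund exceeds autonomous limit"
    return True, "validated"
```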

Classifying tool risk before you build

Before writing a single line of agent code, classify every tool the agent will have access to by two dimensions: reversibility (can the action be undone?) and impact scope (how many records, users, or systems does it affect?).

Read-only tools (CRM queries, web searches, document retrieval) are low-risk and can run autonomously. Write tools that affect a single record (create a task, update a contact field) are moderate-risk — they should be logged and ideally reviewable, but don’t require blocking approval in most cases. Tools that send external communications (email, Slack, SMS), modify multiple records at once, execute code, or make financial transactions are high-risk and should require human approval before execution in production.
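
One possible encoding of those two dimensions; the tool names and the approval rule are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"

@dataclass
class ToolRisk:
    name: str
    reversibility: Reversibility
    impact_scope: int  # rough count of records/users/systems affected

    @property
    def requires_approval(self) -> bool:
        # High risk: irreversible, or touching many records at once.
        return (self.reversibility is Reversibility.IRREVERSIBLE
                or self.impact_scope > 1)

TOOLS = [
    ToolRisk("query_crm", Reversibility.REVERSIBLE, 0),       # read-only: autonomous
    ToolRisk("update_contact", Reversibility.REVERSIBLE, 1),  # single-record write: logged
    ToolRisk("send_email", Reversibility.IRREVERSIBLE, 1),    # external comms: gated
]
```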

This classification should be encoded in the agent’s system prompt and in the tool definitions themselves. An agent that is explicitly told “do not send emails without user confirmation” and has a “requires_approval: true” flag on the send_email tool is architecturally safer than one where the safety constraint lives only in the prompt.
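
A sketch of that architectural enforcement; the dispatcher blocks the call in code, regardless of what the model decided:

```python
TOOL_DEFS = {
    "send_email": {"description": "Send an email to an external recipient.",
                   "requires_approval": True},
    "query_crm":  {"description": "Read-only CRM lookup.",
                   "requires_approval": False},
}

def dispatch(tool_name: str, params: dict, approval_granted: bool, tools: dict):
    # The safety constraint lives in the tool definition, not only the prompt.
    if TOOL_DEFS[tool_name]["requires_approval"] and not approval_granted:
        raise PermissionError(f"{tool_name} requires human approval")
    return tools[tool_name](**params)
```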

Designing human-in-the-loop without killing the value

The most common mistake in enterprise agentic deployments is over-indexing on human review — creating approval gates so frequent that the agent provides no real time savings. The second most common mistake is under-indexing — deploying agents that can take consequential actions without any human visibility.

The right design specifies, before deployment, exactly which conditions trigger human review: action type (all delete operations), confidence threshold (agent confidence below X%), impact scope (actions affecting more than N records), or data sensitivity (any action touching PII). Everything else runs autonomously. See our production case studies for how we’ve balanced this in real deployments.
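
A sketch of those triggers as a single pre-execution check; the thresholds and field names are placeholders to be set per deployment:

```python
PII_FIELDS = {"ssn", "email", "phone", "dob"}

def needs_human_review(action: dict) -> bool:
    if action["tool"].startswith("delete_"):        # action type
        return True
    if action.get("confidence", 1.0) < 0.8:         # confidence threshold
        return True
    if action.get("records_affected", 0) > 25:      # impact scope
        return True
    if PII_FIELDS & set(action.get("fields", [])):  # data sensitivity
        return True
    return False                                    # everything else: autonomous
```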

For Salesforce-native agentic deployments, Agentforce provides built-in human-in-the-loop (HITL) mechanisms through Omni-Channel routing that are worth evaluating before building custom approval workflows.

The observability requirement

An agentic system running in production without full observability is a system you cannot safely operate. You need to know, for every agent run: how many steps it took, which tools it called, what it decided at each step, where it failed, and what the final outcome was. This is not just for debugging — it’s for trust. Stakeholders who can see a complete trace of every agent decision are stakeholders who will approve expanding the agent’s capabilities. Stakeholders who see a black box will not.
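
A minimal trace structure that answers those questions; the field names are illustrative, not any particular observability tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    thought: str        # what the agent decided at this step
    tool: str           # which tool it called
    params: dict
    result_status: str  # "ok" or an error kind (where it failed)

@dataclass
class RunTrace:
    run_id: str
    steps: list[StepTrace] = field(default_factory=list)
    outcome: str = "in_progress"  # the final outcome

    @property
    def step_count(self) -> int:  # how many steps the run took
        return len(self.steps)
```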

LangSmith for LangChain-based agents and Salesforce’s built-in audit logging for Agentforce are the two most mature options in the enterprise space today.