Most developers interact with LLMs through a friendly abstraction layer: a chat UI, a Python SDK, a POST to /v1/chat/completions. That's fine for getting things done. But if you're building agents, optimizing inference costs, or debugging why your tool call returned garbage, you need to understand what's actually happening underneath. This article goes one layer deeper: how tokens flow through a transformer, why an LLM "calling a function" is a polite fiction, and how the agent loop pattern converts a single-shot model into something that can reason across time.

Section 1: How an LLM Communicates - The Token Machine

Before we talk about agents or tool calls, we need to establish the atom of LLM communication: the token. Everything else is built on top of this.

Tokens Are Not Words

A token is the smallest unit of text that a language model processes. Depending on the tokenizer, a token might be a full word, a subword fragment, a single character, or even a punctuation mark. The model never sees raw text; it sees a sequence of integer IDs, where each ID maps to a token in the vocabulary.

For example, the phrase "agent loop" might tokenize as [31303, 6471] in one model and [1781, 9, 6506] in another. Token count drives everything: API cost, latency, context window limits, and KV cache memory.
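To make the text-to-IDs mapping concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. The vocabulary and its IDs are invented for illustration; real tokenizers learn tens of thousands of pieces from data and use different matching rules, so treat this as a toy and not as any model's actual tokenizer.

```python
# Toy vocabulary: invented pieces and IDs, not taken from any real model.
TOY_VOCAB = {"agent": 0, " loop": 1, " ": 2, "lo": 3, "op": 4, "a": 5, "g": 6,
             "e": 7, "n": 8, "t": 9, "l": 10, "o": 11, "p": 12}


def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match: at each position take the longest piece in vocab."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocab piece matches {text[pos]!r}")
    return ids


whole_phrase = tokenize("agent loop", TOY_VOCAB)   # two pieces
single_word = tokenize("loop", TOY_VOCAB)          # splits into "lo" + "op"
print(whole_phrase, single_word)
```

The same four letters cost one token after "agent" and two tokens on their own, which is why token counts are hard to guess from word counts.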

Different models use different tokenizers (BPE, WordPiece, SentencePiece), and different models have different chat templates: the specific format of special tokens wrapping your conversation (<|system|>, <|user|>, <|assistant|>, etc.). A Llama 3 system prompt looks nothing like a Mistral one at the token level. Using the wrong chat template for a model noticeably degrades output quality, because the model was trained expecting a specific format and gets confused when it doesn't match.

When you call Ollama or any inference server, the chat template is applied automatically before the input ever hits the transformer. It's invisible to you, but it's real and it matters.
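As a rough sketch of what that automatic step does, the function below renders a message list into one prompt string. The template format is made up for illustration, using the generic <|role|> markers mentioned above; every real model family defines its own special tokens and layout.

```python
def apply_chat_template(messages: list[dict]) -> str:
    """Render chat messages into one prompt string using a made-up template."""
    parts = []
    for msg in messages:
        # Each turn is wrapped in a role marker (hypothetical format).
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    # A trailing assistant marker cues the model to start generating its reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = apply_chat_template([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Explain KV cache in one sentence."},
])
print(prompt)
```

The string returned here is what gets tokenized; the model never sees the JSON message list itself.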

Two Phases: Prefill and Decode

LLM inference has two distinct computational phases with very different performance characteristics. Knowing this helps you understand where your latency is coming from.

Prefill phase: All input tokens are processed in parallel through the attention mechanism. The transformer computes attention over the full prompt in one forward pass. This phase is compute-bound: it's doing massive parallel matrix multiplications. The output is the KV cache: key-value tensors for every transformer layer, for every input token. This cache is the model's working memory of the prompt.

Decode phase: The model generates one token at a time, autoregressively. Each new token is sampled from a probability distribution over the full vocabulary (typically 32k to 128k tokens). That sampled token is appended to the sequence, and the model reads the KV cache plus the new token to generate the next one. This phase is memory-bound: the bottleneck is reading the KV cache from GPU VRAM, not compute. This is why long-context models are expensive: the KV cache grows with sequence length and must be read on every decode step.
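A back-of-the-envelope calculation shows how quickly the cache grows. The sketch below uses the standard accounting of two tensors (keys and values) per layer; the model dimensions are hypothetical round numbers, not the configuration of any specific model.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence.

    Two tensors (K and V) per layer, each holding n_kv_heads * head_dim
    values per token; bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical config: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
for seq_len in (1_024, 8_192, 32_768):
    gib = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.3f} GiB")
```

With these assumed dimensions each token adds 128 KiB, so an 8,192-token context occupies 1 GiB that has to be read on every decode step.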

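The decode loop terminates when one of three things happens: the model emits its end-of-sequence token, the output reaches the configured maximum token count, or the generated text matches a stop sequence supplied with the request.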
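A toy version of that loop, assuming the three stop conditions inference servers commonly use (an end-of-sequence token, a token budget, and a stop string). The `next_token` callable and the `scripted` helper are stand-ins for a real model, and the `<eos>` marker is a placeholder name.

```python
def decode(next_token, eos_token="<eos>", max_tokens=16, stop=None):
    """Run a decode loop and return (text, reason_it_stopped)."""
    pieces = []
    for _ in range(max_tokens):
        token = next_token(pieces)          # the "model" sees everything so far
        if token == eos_token:
            return "".join(pieces), "eos"
        pieces.append(token)
        text = "".join(pieces)
        if stop and stop in text:
            # Cut the output just before the stop sequence.
            return text[: text.index(stop)], "stop_sequence"
    return "".join(pieces), "max_tokens"


def scripted(tokens):
    """Stand-in for a model: replays a fixed list of tokens."""
    it = iter(tokens)
    return lambda _history: next(it)


print(decode(scripted(["The", " KV", " cache", "<eos>"])))
print(decode(scripted(["a", "\n", "b", "<eos>"]), stop="\n"))
print(decode(lambda _history: "a", max_tokens=3))
```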

Here's a streaming request to Ollama where you can watch decode happen token by token:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain KV cache in one sentence." }
  ],
  "stream": true
}'

# Each chunk arrives as one decoded token:
# {"message":{"role":"assistant","content":"The"},"done":false}
# {"message":{"role":"assistant","content":" KV"},"done":false}
# {"message":{"role":"assistant","content":" cache"},"done":false}
# ... (one token per line until done:true)

Prefill vs. Decode at a Glance: The first token takes longer (prefill is processing your entire prompt in parallel). Subsequent tokens arrive at a steady cadence (decode: one at a time from the KV cache). If only your first token is slow, a long prompt is the likely cause. If all tokens are slow, you're probably memory-bandwidth constrained.
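One way to tell the two cases apart is to timestamp each streamed chunk and separate time-to-first-token from the gap between later tokens. This is a minimal sketch; the function name and the sample timestamps are illustrative, not measurements.

```python
def latency_profile(request_sent: float, chunk_times: list[float]) -> dict:
    """Split streaming latency into a prefill-dominated and a decode-dominated part."""
    ttft = chunk_times[0] - request_sent                     # time to first token
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0        # steady decode cadence
    return {"ttft_s": ttft, "mean_inter_token_s": mean_gap}


# Illustrative timestamps in seconds, e.g. collected with time.perf_counter().
profile = latency_profile(0.0, [0.50, 0.55, 0.60, 0.65])
print(profile)
```

A large `ttft_s` with small gaps points at prefill; uniformly large gaps point at decode.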

Sampling: Temperature and the Probability Distribution

Each decode step produces a logit vector: one raw score per vocabulary token. A softmax function converts those scores into a probability distribution. The model then samples from that distribution. This is not deterministic (unless temperature is 0, which collapses to argmax: always pick the top token).

Temperature scales the logits before softmax: each logit is divided by the temperature. High temperature (e.g., 1.5) flattens the distribution: more creative, more variable, higher hallucination risk. Low temperature (e.g., 0.2) sharpens it: more deterministic, more repetitive. Temperature 0 is greedy decoding.
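The effect is easy to see numerically. The sketch below divides a small, made-up logit vector by the temperature before applying softmax; the three logits stand in for a full vocabulary.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature 0 means greedy decoding; use argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]                     # made-up scores for three tokens
for t in (0.2, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At low temperature nearly all of the probability mass sits on the top token; at high temperature the lower-ranked tokens get a real chance of being sampled.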

Top-p (nucleus sampling) restricts the sample to the smallest set of tokens whose cumulative probability exceeds p. Top-k restricts to the top k tokens by probability. These parameters interact with temperature, and all three are tunable per API request.
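A minimal sketch of both filters over an already-computed probability list. Applying top-k first and top-p second is an assumption made here for illustration; inference engines differ in ordering and in how they treat the boundary token.

```python
def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Return {token_index: renormalized probability} after top-k, then top-p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]              # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for index, p in ranked:
            kept.append((index, p))
            cumulative += p
            if cumulative >= top_p:          # smallest prefix reaching the threshold
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {index: p / total for index, p in ranked}


probs = [0.5, 0.3, 0.15, 0.05]               # made-up distribution over four tokens
print(top_k_top_p_filter(probs, top_p=0.7))  # keeps tokens 0 and 1
print(top_k_top_p_filter(probs, top_k=1))    # equivalent to greedy decoding
```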

The implication for tool calling (next section) is critical: argument values in tool calls are sampled just like any other tokens. A model at high temperature may hallucinate a plausible-sounding but incorrect argument. This is not a bug; it's the fundamental nature of probabilistic sampling. Code defensively.

The Thinking Field: Internal Monologue Made Visible

Recent reasoning models (Qwen3, DeepSeek R1, and OpenAI's o-series) expose something that used to be invisible: the model's internal reasoning trace, separate from its final response.

When you call Ollama with "think": true, you see a message.thinking field alongside message.content. The thinking field arrives first: it's the model's chain-of-thought scratchpad. Content follows with the final answer. This is what you see when running Ollama in raw mode and the model "sends a lot of messages": that verbose internal dialogue is the thinking trace.

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "think": true,
  "stream": false,
  "messages": [
    { "role": "user", "content": "What is 17 × 23?" }
  ]
}'

# Response:
{
  "message": {
    "role": "assistant",
    "thinking": "The user wants 17 × 23. I can compute this: 17 × 20 = 340, 17 × 3 = 51, total = 391.",
    "content": "17 × 23 = 391."
  }
}

For streaming, thinking chunks arrive before content chunks:

from ollama import chat

stream = chat(model='qwen3', messages=[...], think=True, stream=True)
for chunk in stream:
    if chunk.message.thinking:
        print("THINKING:", chunk.message.thinking, end='')
    elif chunk.message.content:
        print("ANSWER:", chunk.message.content, end='')

Thinking tokens are real tokens: they consume context window space and are decoded the same way as content tokens. The difference is fine-tuning: the model is trained to use this space for reasoning before committing to a final answer. GPT-OSS (OpenAI's open-weight model) uses "think": "low", "medium", or "high" to control trace depth.

Why This Matters for Agents: Thinking traces are your best debugging tool. They tell you why the model made a decision, which tool it was considering calling, and where its reasoning went wrong. If an agent is producing bad outputs, read the thinking trace; it often pinpoints the exact failure point.

Section 2: Tool Calling - The LLM That Can't Actually Call Anything

Here is the most important thing to understand about LLM tool calling, and it's worth stating plainly:

An LLM cannot call a function. It cannot make an HTTP request. It cannot query a database. It cannot execute code. All it can do is generate tokens.

What we call "tool calling" is a structured text generation pattern. The model outputs a specific JSON format describing which function to call and what arguments to pass. Your code (the calling client, the agent framework, the middleware) reads that JSON, executes the actual function, and feeds the result back as a new message. The model never touched the network. It just predicted tokens that look like a function call.

How Tool Schemas Work

To enable tool calling, you pass a list of tool definitions alongside your messages. Each definition is a JSON schema: function name, a natural-language description, and parameter types.

POST /v1/chat/completions
{
  "model": "gpt-4o",
  "messages": [
    { "role": "user", "content": "What's the weather in Miami right now?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City and state, e.g. 'Miami, FL'"
            },
            "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["location"]
        }
      }
    }
  ]
}

That tool schema is converted to tokens and injected into the model's context during prefill, just like any other part of the prompt. Models fine-tuned for tool calling have learned to recognize this structure and respond with structured JSON when they decide a tool should be invoked.

The description field is doing heavy lifting. The model decides whether to call a tool based primarily on the function description and parameter descriptions, not the name. Write descriptions that explain when to call the function, not just what it does. "Get the current weather for a city. Use this when the user asks about current weather conditions" is better than "Returns weather data."

The Complete Tool Call Sequence

When the model decides to use a tool, it returns finish_reason: "tool_calls" instead of "stop". The response looks like this:

{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "arguments": "{\"location\": \"Miami, FL\", \"unit\": \"celsius\"}"
        }
      }]
    }
  }]
}

Your client code now takes over completely. Here's the full cycle:

import json

# `llm`, `messages`, `tools`, and `dispatch_function` are placeholders for your
# own client, conversation history, tool schemas, and tool router.
response = llm.chat(messages, tools=tools)

while response.finish_reason == "tool_calls":
    # 1. Append the assistant's tool call message to history
    messages.append(response.message)

    for tool_call in response.tool_calls:
        # 2. Parse function name and arguments (arguments arrive as a JSON string)
        fn_name = tool_call.function.name
        fn_args = json.loads(tool_call.function.arguments)

        # 3. YOUR CODE executes the actual function, not the LLM
        result = dispatch_function(fn_name, fn_args)

        # 4. Append the tool result as a new message
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

    # 5. Send everything back; the model now has the tool result in context
    response = llm.chat(messages, tools=tools)

# Model produces final user-facing response
print(response.message.content)

The entire conversation history (user message, model's tool call, tool result) is now in context. The model processes all of it in the next prefill and either calls another tool or produces a final response. This cycle is the foundation of every agent loop.
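The piece that actually runs the function can be as small as a dictionary lookup. Below is one possible sketch of a `dispatch_function` router; the registry, the stubbed weather function, and its fixed return value are all hypothetical.

```python
import json


def get_current_weather(location: str, unit: str = "celsius") -> dict:
    # Placeholder: a real implementation would call a weather API here.
    return {"location": location, "unit": unit, "temperature": 30}


# Maps the tool names advertised to the model onto real Python callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def dispatch_function(name: str, args: dict) -> dict:
    """Look up a tool by name and call it; failures are returned as data."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return tool(**args)
    except TypeError as exc:                 # wrong or missing argument names
        return {"error": str(exc)}


raw_arguments = '{"location": "Miami, FL", "unit": "celsius"}'
result = dispatch_function("get_current_weather", json.loads(raw_arguments))
print(result)
```

Returning errors as ordinary results, rather than raising, lets the model see what went wrong on the next call and try again with different arguments.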

The Hallucination Problem in Tool Arguments

Arguments are sampled tokens. A model running at temperature 0.7 might generate "location": "Miami, FL" or "location": "miami florida usa" or "location": "Miami" depending on what its probability distribution peaked on. All three are plausible, and none may match your API's expected format.

Some inference servers address this with constrained decoding: a state machine tracks which tokens are valid at each position given the JSON schema, then applies logit biasing to drive the probability of impossible tokens to zero before sampling. This guarantees the output is structurally valid JSON matching the schema, but it can't prevent semantically wrong arguments (the model can still produce "unit": "celsius" when the user said Fahrenheit).
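The masking step on its own is simple. The sketch below shows a single decode position with made-up logits, treating each enum value as one token for readability; a real implementation tracks grammar state across many sub-word tokens.

```python
import math


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the logit of every disallowed token to -inf so it can never be picked."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}


def greedy_pick(logits: dict[str, float]) -> str:
    """Choose the highest-scoring token (temperature 0)."""
    return max(logits, key=logits.get)


# Position right after `"unit": ` where the schema's enum allows two values.
logits = {'"kelvin"': 3.1, '"celsius"': 2.4, '"fahrenheit"': 1.9, "42": 0.5}
allowed = {'"celsius"', '"fahrenheit"'}

print(greedy_pick(logits))                        # unconstrained choice
print(greedy_pick(mask_logits(logits, allowed)))  # constrained choice
```

The mask rules out the invalid "kelvin", but the winner among the allowed tokens is still whatever the model scored highest, which is why structural validity does not imply the right value.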

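Production tool calling therefore requires defensive handling in your own code: validate the parsed arguments against the schema before executing anything.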
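One defensive step is to check the parsed arguments against the tool's parameter schema before dispatching. This hand-rolled validator is a sketch covering only required fields, string types, and enums; a full JSON Schema validator would cover far more.

```python
def validate_arguments(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments look usable."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors


# Parameter schema copied from the get_current_weather definition above.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(validate_arguments({"location": "Miami, FL", "unit": "celsius"}, WEATHER_SCHEMA))
print(validate_arguments({"unit": "kelvin"}, WEATHER_SCHEMA))
```

Feeding the error list back to the model as the tool result gives it a chance to correct itself on the next turn.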

Key Insight: "There is no magic. This is good old-fashioned software development with some LLM sprinkled in." The LLM produces a structured string. Your code parses it, runs the function, and hands the result back. The LLM is a very sophisticated text-in, text-out function. Everything else is orchestration.

Section 3: Agent Loops - The Engine That Makes Agents "Think Longer"

Now that we understand tokens and tool calls, we can explain what an agent actually is. The answer is almost anticlimactically simple:

Agent = LLM + Loop + Tools

The loop is what transforms a single-shot question-answering model into something that can plan, gather information, course-correct, and complete multi-step tasks. Without the loop, you have a very smart text generator. With the loop, you have something that behaves like it's thinking.

The ReAct Pattern: Reasoning + Acting

The dominant agent loop design comes from ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022). ReAct interleaves chain-of-thought reasoning with tool execution in a tight iterative loop. Instead of generating a complete plan upfront or making blind tool calls, the model thinks, acts, observes the result, and updates its thinking, much as a human approaches a complex task.

The ReAct trajectory is formalized as a sequence of triples:

τ = (t₁, a₁, o₁, t₂, a₂, o₂, …)

Where:
  tᵢ = Thought: what the model reasons at step i
  aᵢ = Action: the tool call made at step i
  oᵢ = Observation: the result returned by the tool

Each triple feeds into the next. The model's thought at step i+1 has access to everything: the original goal, all previous thoughts, all previous actions, and all observations to date. The context window is the agent's working memory.
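The trajectory maps naturally onto a small data structure. In this sketch the `Step` class and the text layout produced by `render_context` are illustrative choices, not a format prescribed by the ReAct paper.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thought: str       # t_i: what the model reasoned
    action: str        # a_i: the tool call it made
    observation: str   # o_i: what the tool returned


def render_context(goal: str, trajectory: list[Step]) -> str:
    """Flatten the goal and all previous triples into the text the model sees next."""
    lines = [f"Goal: {goal}"]
    for i, step in enumerate(trajectory, start=1):
        lines.append(f"Thought {i}: {step.thought}")
        lines.append(f"Action {i}: {step.action}")
        lines.append(f"Observation {i}: {step.observation}")
    return "\n".join(lines)


trajectory = [Step("I need the BTC price.", 'get_price("BTC")', "84320.50 USD")]
context = render_context("Report the Bitcoin price", trajectory)
print(context)
```

Every new triple makes this string longer, which is the concrete sense in which the context window acts as working memory.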

Anatomy of One Loop Iteration

Here's what happens during a single ReAct iteration, in concrete terms:

# State: messages = [system_prompt, user_goal, ...previous_triples...]

# Step 1: THINK - model generates a reasoning trace
# (With thinking models, this happens in message.thinking)
# "I need to find the current price of Bitcoin. I should use the get_price tool."

# Step 2: ACT - model outputs a tool call
# finish_reason: "tool_calls"
# → { "name": "get_price", "arguments": { "symbol": "BTC" } }

# Step 3: OBSERVE - your code executes the tool and captures the result
# → { "symbol": "BTC", "price": 84320.50, "currency": "USD" }

# Step 4: APPEND - add thought, action, and observation to messages
messages.append({"role": "assistant", "content": "I need to get the BTC price...", "tool_calls": [...]})
messages.append({"role": "tool", "content": '{"price": 84320.50}'})

# Step 5: REPEAT - call the model again with updated context
# The model now knows the price and can either use another tool or answer

Each observation narrows the problem. By iteration 3 or 4, the model typically has enough grounded information to produce a final answer. Each iteration adds tokens to the context, which is why agent cost scales with complexity, not just output length.
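The arithmetic behind that cost growth can be sketched in a few lines. The numbers are illustrative, and the model assumes the full history is re-sent on every call with no prompt-caching discount.

```python
def total_prompt_tokens(base_prompt: int, added_per_iteration: int, iterations: int) -> int:
    """Sum the prompt size of every LLM call in the loop.

    Call i re-sends the base prompt plus everything appended by the i earlier
    iterations, so the total grows quadratically with the iteration count.
    """
    return sum(base_prompt + i * added_per_iteration for i in range(iterations))


# Illustrative figures: 1,000-token base prompt, 500 tokens appended per iteration.
four = total_prompt_tokens(1_000, 500, iterations=4)
eight = total_prompt_tokens(1_000, 500, iterations=8)
print(four, eight)
```

With these figures, doubling the iterations from 4 to 8 roughly triples the prompt tokens processed (7,000 versus 22,000).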

Termination: How the Loop Knows When to Stop

The agent loop exits in one of three ways:

Exit condition     | Triggered by                          | Implication
end_turn           | Model returns finish_reason: "stop"   | Task is done; model chose to answer
max_iterations     | Your loop counter hits the limit      | Forced stop; model may not have finished
Error / exception  | Tool execution fails unrecoverably    | Depends on error handling strategy

Max iterations is not optional; it's a hard requirement. Without it, a model caught in a confusion loop (repeatedly calling the same tool, never getting a satisfying result) will drain your budget and never return. IBM's guidance on ReAct agents puts it plainly: "Establishing a maximum number of loop iterations is a simple way to limit latency, costs and token usage, and avoid the possibility of an endless loop."

A minimal production loop looks like this:

MAX_ITERATIONS = 10


class AgentError(Exception):
    """Raised when the model returns a finish_reason the loop cannot handle."""


def run_agent(llm, messages, tools):
    # llm, execute_tool, build_tool_result_message, and summarize_partial_result
    # are placeholders for your own client and helpers.
    iteration = 0

    while iteration < MAX_ITERATIONS:
        response = llm.chat(messages, tools=tools)
        iteration += 1

        if response.finish_reason == "stop":
            # Model decided it's done
            return response.message.content

        elif response.finish_reason == "tool_calls":
            messages.append(response.message)
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call)
                messages.append(build_tool_result_message(tool_call.id, result))

        else:
            raise AgentError(f"Unexpected finish_reason: {response.finish_reason}")

    # If we exit the loop without returning, the agent hit max iterations
    return summarize_partial_result(messages)

Real agent frameworks (LangChain, LlamaIndex, OpenAI Agents SDK, Google ADK) wrap this pattern with observability, retry logic, parallel tool execution, and multi-agent coordination. But they're all variants of the same core loop.
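To see the core loop run end to end without a model server, the sketch below swaps the LLM for a scripted stand-in. `ScriptedLLM`, `make_response`, and the flattened tool-call shape (`call.name` rather than `call.function.name`) are simplifications invented for this example.

```python
import json
from types import SimpleNamespace


def make_response(finish_reason, content=None, tool_calls=()):
    """Build a response object shaped like the ones used in this article."""
    message = SimpleNamespace(role="assistant", content=content,
                              tool_calls=list(tool_calls))
    return SimpleNamespace(finish_reason=finish_reason, message=message)


class ScriptedLLM:
    """Stand-in for a real client: replays canned responses in order."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def chat(self, messages, tools=None):
        return next(self._responses)


def run_agent(llm, messages, tools, execute_tool, max_iterations=10):
    for _ in range(max_iterations):
        response = llm.chat(messages, tools=tools)
        if response.finish_reason == "stop":
            return response.message.content
        messages.append(response.message)
        for call in response.message.tool_calls:
            result = execute_tool(call.name, json.loads(call.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return None  # hit the iteration cap without a final answer


price_call = SimpleNamespace(id="call_1", name="get_price",
                             arguments='{"symbol": "BTC"}')
llm = ScriptedLLM([
    make_response("tool_calls", tool_calls=[price_call]),
    make_response("stop", content="BTC is trading at 84320.5 USD."),
])
history = [{"role": "user", "content": "What is the BTC price?"}]
answer = run_agent(llm, history, tools=[],
                   execute_tool=lambda name, args: {"price": 84320.50})
print(answer)
```

Because the model is just an object with a `chat` method, the same scripted stand-in works as a unit-test fixture for loop logic such as the iteration cap.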

Multi-Agent Loops

The loop pattern composes into multi-agent architectures. Google's Agent Development Kit formalizes this with LoopAgent:

# Writer drafts → Critic reviews → loop until quality threshold or max_iterations
LoopAgent(
    sub_agents=[WriterAgent, CriticAgent],
    max_iterations=5
)

Each sub-agent is itself running an internal loop. The outer LoopAgent orchestrates their outputs. This is how you get agents that write, review, and revise: the same token-generation primitive, composed recursively.
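The control flow of a writer/critic loop can be sketched in plain Python, independent of any framework. This is not the ADK API; `refine`, `toy_writer`, and `toy_critic` are hypothetical stand-ins, and in practice each callable would wrap its own LLM-backed agent loop.

```python
def refine(write, critique, max_iterations=5):
    """Alternate writer and critic until the critic is satisfied or the cap is hit."""
    draft, feedback = None, None
    for iteration in range(1, max_iterations + 1):
        draft = write(draft, feedback)
        feedback = critique(draft)
        if feedback is None:                 # critic has no further objections
            return draft, iteration
    return draft, max_iterations             # forced stop at the cap


# Stand-in agents with fixed behavior, for illustration only.
def toy_writer(draft, feedback):
    return "Agents loop." if draft is None else draft + " They call tools."


def toy_critic(draft):
    return None if "tools" in draft else "Mention tools."


final_draft, rounds = refine(toy_writer, toy_critic)
print(final_draft, rounds)
```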

Key Insight: The "hard stuff" in agent loops isn't the loop itself; it's everything around it: error handling when tools fail, retries with backoff, context window management as the message history grows, cost controls when the loop runs long, and deciding what counts as "done." The loop is three lines of code. The production engineering is months of work.
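As one small example of that surrounding engineering, here is a sketch of retrying a flaky tool with exponential backoff. The delays, the attempt limit, and the blanket `except Exception` are illustrative defaults; production code would usually retry only on errors known to be transient and add jitter.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a tool that fails twice, with a recording "sleep" so nothing actually waits.
attempts = {"count": 0}


def flaky_tool():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
outcome = call_with_backoff(flaky_tool, sleep=delays.append)
print(outcome, delays)
```

Injecting the `sleep` function keeps the helper testable without real waiting.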

Putting It All Together

Now the full picture: what actually happens when you ask an agent to "research the current Bitcoin price and summarize the market sentiment"?

  1. Your prompt is tokenized using the model's tokenizer and chat template. The tool definitions are injected. All of this becomes a sequence of token IDs.
  2. Prefill runs: the entire token sequence is processed in parallel through the transformer. The KV cache is populated. The model "understands" your request and the available tools.
  3. Decode begins: the model samples tokens one at a time. Because it was fine-tuned for tool calling, it produces a tool_calls message: a JSON string telling your code to call get_price("BTC").
  4. Your code executes the tool. The Bitcoin price is returned. This result is appended to the message history.
  5. Another LLM call: now the context includes the price. The model reasons (possibly via a thinking trace) and decides to also call search_news("bitcoin market sentiment").
  6. Your code executes the second tool. Headlines are returned and appended.
  7. Final LLM call: the model has enough grounded information. It produces finish_reason: "stop" with a coherent summary. The loop exits.

Three LLM calls. Two tool executions. One agent loop. The model generated approximately 300 to 600 tokens total. The "intelligence" came from the model's ability to recognize when it lacked information, select the right tool, and synthesize the results into a coherent answer.

Strip away the framework abstractions and every AI agent, no matter how sophisticated, reduces to this: a language model predicting tokens, occasionally predicting tokens that look like function calls, with a program around it that executes those calls and feeds the results back. The rest is engineering.

The Three Primitives:
  • Token generation: the fundamental unit; autoregressive, probabilistic, one at a time
  • Tool calling: structured text that your code interprets as a function invocation
  • The agent loop: a while loop with an LLM call inside; exits when the model says stop or you force it

Everything else in the AI agent ecosystem (memory, planning, multi-agent coordination, RAG) is built on these three primitives. Know them well.


References

  1. Ollama: Thinking Models. Official documentation on the think parameter and message.thinking field for reasoning models (Qwen3, DeepSeek R1) in the Ollama API. docs.ollama.com
  2. BentoML: How Does LLM Inference Work? Clear breakdown of prefill vs. decode phases, KV cache mechanics, and memory-bound vs. compute-bound inference bottlenecks. bentoml.com
  3. Chris Toolivier: LLMs and Function/Tool Calling. Practical walkthrough of the tool calling request/response cycle, including the "no magic" framing and client-side execution pattern. blog.christoolivier.com
  4. PromptingGuide.ai: Function Calling. Overview of tool/function calling across providers (OpenAI, Anthropic, Gemini), including JSON schema structure and multi-turn patterns. promptingguide.ai
  5. Yao et al. (2022): ReAct: Synergizing Reasoning and Acting in Language Models. The original paper introducing the ReAct pattern (Thought → Action → Observation loop) for LLM-based agents. arxiv.org/abs/2210.03629
  6. IBM: ReAct Agents. Production guidance on the ReAct agent loop, including max iterations as a required safety control and the Thought-Action-Observation trajectory formalization. ibm.com
  7. Google Agent Development Kit: LoopAgent. Multi-agent loop pattern documentation, including LoopAgent(sub_agents=[...], max_iterations=N) for Writer/Critic compositional agents. google.github.io/adk-docs
  8. OpenAI: Function Calling Reference. Official API reference for the tools parameter, finish_reason: "tool_calls", and the multi-turn tool call message format. platform.openai.com