The Skill That Makes or Breaks Your Agent

⚠️ Heads up! AI moves ridiculously fast. By the time you're reading this, some of what I've written may already be outdated. Treat this as a snapshot of my journey, not a definitive guide.

Introduction

Series: My Journey to Building AI Agents
Audience: Senior Full Stack Developers · Solution Architects · Tech Leads
I'm exploring the prompt techniques that actually matter for building AI agents — zero-shot, few-shot, chain-of-thought, and structured output patterns. Turns out, how you talk to an LLM determines whether your agent is reliable or a coin flip.

Here's something that stuck with me from Anthropic's docs: specificity is the single most important lever in effective prompting. Being vague about format, length, or intent forces the model to guess — and it usually guesses wrong.

"Anthropic's documentation emphasizes that specificity is one of the most important factors in effective prompting."

That single line reframed how I think about prompting. It's not about clever tricks, it's about clear communication with a system that takes everything literally.

After digging into how LLMs work in the last article, I realized that understanding the engine is only half the story. The other half is knowing how to steer it. Prompt Engineering is where theory meets practice — it's the interface between what I want and what the model produces. And for agents that run autonomously, getting this right isn't optional.
In this article, I'm working through five core techniques: zero-shot, few-shot, chain-of-thought, system prompts, and structured outputs. For each one, I want to understand when it works, when it doesn't, and why it matters for agents. Let's figure this out together.

Zero-Shot Prompting: Start Here
Zero-shot means giving the model an instruction with no examples — just the task description. It sounds simple, but honestly? With modern models like Claude and GPT-4, it works better than I expected.
Anthropic's documentation emphasizes that "Claude responds well to clear, explicit instructions."— specificity in the instruction itself matters more than adding examples.

OpenAI's prompt engineering guide similarly recommends starting with zero-shot, then adding examples only when results are insufficient.

Few-Shot Prompting: Show, Don't Tell
When zero-shot falls short, examples fill the gap. The original GPT-3 paper by Brown et al. (2020) demonstrated that few-shot prompting — providing examples in context — could match or exceed fine-tuned models on several benchmarks.

Anthropic calls this "multishot prompting" and describes it as "one of the most effective ways to improve Claude's performance." Their key advice: provide diverse, relevant examples — and include edge cases.

Here's a comparison that helped me understand the tradeoff

When does zero-shot work?

  • Classification
  • Summarization
  • Translation
  • Simple Q&A tasks (model has seen extensively during training)

Where it struggles:

  • highly specialized formats
  • domain-specific reasoning
  • when the desired output structure is ambiguous

For agents, zero-shot has a practical advantage — it keeps token budgets low. In an agentic loop where the prompt gets re-sent every turn, saving tokens on examples compounds quickly.

So how many examples do I actually need?

ExamplesEffectResearch Finding
0Zero-shot baselineRelies entirely on training knowledge
1-2Format anchoringEstablishes output structure
3-5Sweet spotCaptures most performance gains (practitioner consensus; gains taper beyond this range
6-8Diminishing returnsMarginal improvement, higher token cost
10+Rarely justifiedConsumes context window for little gain

The research consensus: 3-5 diverse examples capture most of the benefit. Quality and diversity matter more than quantity.

Here's what I've found works for agent tool-calling — showing 2-3 examples of correct tool invocation eliminates most format errors:

# Few-shot examples for agent tool selection
examples = [
    {
        "user": "What's the weather in Tokyo?",
        "reasoning": "User needs current weather data — requires API call",
        "tool": "get_weather",
        "params": {"city": "Tokyo"}
    },
    {
        "user": "Summarize this PDF for me",
        "reasoning": "User has a document to analyze — no external API needed",
        "tool": "analyze_document",
        "params": {"action": "summarize"}
    }
]
💡
One thing that stood out: OpenAI notes that few-shot is particularly powerful when describing a style or format that is "easier to show than to tell." This makes intuitive sense — sometimes an example communicates what paragraphs of instructions can't.

Chain-of-Thought: Let the Model Think
This is the technique that genuinely surprised me with its effectiveness. Chain-of-thought (CoT) prompting asks the model to show its reasoning before giving an answer — and the results are dramatic.

Researchers discovered that AI models get dramatically better at solving problems just by being asked to show their work — like a student who scores way higher on a test when they write out their steps instead of just guessing the answer.

In one study (Wei et al), math problem accuracy on GSM8K jumped from 17% to 56% with chain-of-thought prompting alone — reaching 58% only when also paired with an external calculator. Even wilder (Kojima et al), another study found that adding the phrase "Let's think step by step" to a prompt bumped accuracy from 17.7% to 78.7%. That's five words turning a failing grade into an A.

CHAIN-OF-THOUGHT IN ACTION
════════════════════════════════════════════════════════════════

WITHOUT CoT:                          WITH CoT:

  Q: "A store has 15 apples.           Q: "A store has 15 apples.
   8 are sold, then 12 arrive.          8 are sold, then 12 arrive.
   How many?"                           How many? Think step by step."

  A: "20"  ✗                           A: "Starting: 15 apples
                                           After sales: 15 - 8 = 7
                                           After delivery: 7 + 12 = 19
                                           Answer: 19"  ✓

But here's the nuance that matters: CoT doesn't always help. Wang et al. (2022) Wei et al. (2022) showed it can actually hurt performance on simple tasks — for the easiest single-step problems, CoT improvements were either negative or very small. My mental model:

Task ComplexityCoT EffectExample
Simple recallNeutral or harmful"What's the capital of France?"
ClassificationMinimal benefitClear-cut sentiment analysis
Multi-step reasoningMajor improvementMath, logic, planning
Complex analysisEssentialCode debugging, multi-document synthesis

Anthropic's documentation recommends using chain of thought specifically for "complex tasks that require analysis" and suggests using XML tags like <thinking> to separate reasoning from final answers.

For agent design, this is directly relevant. The ReAct pattern (Yao et al., 2022) — which interleaves "Thought," "Act," and "Observe" steps — is essentially chain-of-thought applied to agents. The agent reasons about which tool to call, executes it, observes the result, then reasons again. Most modern agent frameworks use this as their default pattern.

System Prompts: The Agent's Constitution

If chain-of-thought is how agents reason, system prompts are how they're governed. Anthropic describes the system prompt as defining "Claude's role, personality, and rules", and for agents, it's essentially a constitution.

One technique that stood out: role prompting. Anthropic notes it works best when the role is specific — "a pediatric nurse" rather than just "a nurse." Specificity matters because it activates more relevant training patterns.

Here's how I think about system prompt architecture for agents:

SYSTEM PROMPT ARCHITECTURE
════════════════════════════════════════════════════════════════

  ┌─────────────────────────────────────────────────────────┐
  │  IDENTITY          "You are a senior DevOps engineer    │
  │                     specializing in Kubernetes..."      │
  ├─────────────────────────────────────────────────────────┤
  │  CAPABILITIES      "You have access to these tools:     │
  │                     - kubectl_exec: Run kubectl cmds    │
  │                     - read_logs: Fetch pod logs         │
  │                     - alert_team: Send Slack alerts"    │
  ├─────────────────────────────────────────────────────────┤
  │  CONSTRAINTS        "NEVER delete resources without     │
  │                     explicit user confirmation.         │
  │                     ALWAYS check pod status before      │
  │                     restarting."                        │
  ├─────────────────────────────────────────────────────────┤
  │  OUTPUT FORMAT      "Respond with a JSON action:        │
  │                     {tool, params, reasoning}"          │
  └─────────────────────────────────────────────────────────┘

The four layers I keep seeing in effective agent system prompts:

  1. Identity: Who is the agent? What domain expertise does it have?
  2. Capabilities: What tools/actions are available?
  3. Constraints: What must the agent never do? What requires confirmation?
  4. Output format: How should the agent structure its responses?

OpenAI's guidance aligns — they recommend system messages define both what the model should and should not do. For agents running autonomously, the "should not" is arguably more important.


Structured Outputs: Making Agents Parseable

Agents don't just generate text — they generate actions. And actions need to be machine-parseable. This is where structured output techniques become essential.


*This is Article 3 of 12 in my AI Agents learning journey.*


See you there.

Resources I'm Using