Most AI agent demos fail in production not because the model is wrong, but because the prompt is built for a single-turn conversation instead of a multi-step autonomous execution. These templates are designed for real agent runs: clear goals, scoped tool access, explicit termination conditions, and fallback handling baked in.
A conversational prompt assumes you are in the loop at every step. An agent prompt assumes you are not. That difference changes everything about how you structure the instruction. You are not just telling the model what to do; you are writing a specification for an autonomous process that will make decisions, use tools, encounter errors, and produce output without checking in with you.
The three structural problems that kill most agent runs: no exit condition (the agent loops or drifts indefinitely), no fallback handling (the agent halts on the first tool error instead of continuing), and no format specification (the output is correct but unparseable by whatever system needs it). Fix those three things and you eliminate 80% of agent execution failures before you touch the underlying model.
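Those three fixes can be sketched as a minimal driver loop. This is an illustrative Python sketch, not any framework's API; `flaky_search` is a made-up stand-in for a real tool call:

```python
import json

MAX_STEPS = 6  # explicit exit condition: the agent never runs past this


def flaky_search(query):
    """Stand-in for a real tool call; raises to simulate a tool error."""
    if "paywalled" in query:
        raise RuntimeError("403: login required")
    return {"url": f"https://example.com/{query}", "text": f"notes on {query}"}


def run_agent(queries):
    findings, skipped = [], []
    for step, query in enumerate(queries, start=1):
        if step > MAX_STEPS:               # 1. exit condition: hard step limit
            break
        try:
            findings.append(flaky_search(query))
        except RuntimeError as err:        # 2. fallback: log and continue, no retry
            skipped.append({"query": query, "error": str(err)})
    report = {"findings": findings, "skipped": skipped}
    json.dumps(report)                     # 3. format check: output must serialise
    return report


report = run_agent(["agent frameworks", "paywalled report", "step limits"])
```

The same three guards apply regardless of model: the loop stops on its own, a failed tool call never halts the run, and the output is validated against the expected format before it is returned.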
In 2026, the major agent frameworks (Claude Computer Use, GPT-4o with tool calls, Devin for engineering, Manus for general tasks) all execute prompts differently but share the same failure pattern when the prompt is underspecified. The templates below are built to work across frameworks.
| Agent type | Primary tool | Critical prompt element | Most common failure |
|---|---|---|---|
| Research agent | Web search + synthesis | Termination condition + citation format | Over-retrieving, never stopping |
| Code agent | Code execution + file I/O | Scope (which files to touch) | Modifying unrelated files |
| Browser agent | Real web navigation | Fallback for login walls | Infinite CAPTCHA retry |
| Workflow agent | API calls + CRM/email | Write permission scope | Accidental record deletion |
| Multi-agent orchestrator | Spawns sub-agents | Handoff format spec | Data lost between agents |
| Data extraction agent | File reading + parsing | Output schema definition | Inconsistent field names |
Use this template for any agent tasked with gathering information from the web, documents, or databases and producing a structured summary.
You are a research agent. Your task: [specific research question].
SCOPE: Search for information about [topic]. Focus on [specific angle].
SOURCES: Use web search. Prioritize sources from [publication types, e.g., industry reports, official documentation, peer-reviewed research].
STEP LIMIT: Complete this task in no more than 6 search calls. Stop at 6 regardless of completeness.
FALLBACK: If a source requires login, paywall, or returns an error, note the URL and move to the next source. Do not retry.
TERMINATION: The task is complete when you have found at minimum [3] distinct sources addressing the research question.
OUTPUT FORMAT:
Summary: [2-3 sentence answer to the research question]
Sources: [list with URL, publication, and one-sentence description of relevance]
Confidence: [High/Medium/Low based on source quality]
Open questions: [what this research did not answer]
Use this when you need an orchestrator agent to break a complex goal into subtasks before executing them in sequence or in parallel.
You are a planning agent. Your goal: [high-level goal].
STEP 1 - DECOMPOSE: Break this goal into no more than 7 discrete subtasks. For each subtask, define:
- Task name
- Required input
- Expected output
- Tool or method needed
- Dependencies (which subtasks must complete first)
STEP 2 - SEQUENCE: Order the subtasks by dependency. Flag any that can run in parallel.
STEP 3 - CONFIRM: Output the task plan before executing anything.
EXECUTION: After outputting the plan, execute each subtask in order. For each: output the subtask name, the action taken, and the result before proceeding to the next.
STOP CONDITION: Stop after all subtasks complete OR if any subtask fails 2 consecutive attempts.
For agents with write access to real systems (CRM, email, calendar). The confirmation gate is not optional: include it every time.
You are a workflow automation agent with access to [tool list].
TASK: [Specific workflow description]
PERMISSION SCOPE:
- READ access: [list systems]
- WRITE access: [list systems; be specific]
- PROHIBITED: [list actions the agent must never take]
CONFIRMATION GATE: Before taking any action that creates, modifies, or deletes data, output:
"PROPOSED ACTION: [describe what you are about to do]"
Then wait for human confirmation before proceeding.
ERROR HANDLING: If a tool call returns an error, log the error and skip that item. Do not retry more than once. Continue to the next item.
OUTPUT: After completing all actions, provide a summary: Actions taken, Items skipped with reasons, Total records affected.
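The skip-on-error rule and the closing summary above can be sketched in Python; `flaky_update` is a hypothetical stand-in for a real CRM write:

```python
def process_items(items, tool):
    """Apply `tool` to each item: at most one retry, then skip and log."""
    taken, skipped = [], []
    for item in items:
        for attempt in (1, 2):             # first try plus at most one retry
            try:
                tool(item)
                taken.append(item)
                break
            except RuntimeError as err:
                if attempt == 2:           # second failure: record reason, move on
                    skipped.append({"item": item, "reason": str(err)})
    return {
        "actions_taken": taken,
        "items_skipped": skipped,
        "total_records_affected": len(taken),
    }


def flaky_update(record):
    """Stand-in for a CRM write; always fails for one bad record."""
    if record == "bad-record":
        raise RuntimeError("422: validation failed")


summary = process_items(["a", "bad-record", "c"], flaky_update)
```

The summary dictionary maps directly onto the OUTPUT section of the template: actions taken, items skipped with reasons, and total records affected.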
For agents navigating real websites. Works with Claude Computer Use, Operator, and open-source browser-use frameworks.
You are a browser agent. Navigate to [URL] and complete the following task: [task description].
NAVIGATION RULES:
- If you encounter a login screen: stop and report "login required at [URL]"
- If you encounter a CAPTCHA: stop and report "CAPTCHA at [URL]"
- If a page does not load within [3] attempts: skip it and report failure
- If you see a cookie consent popup: accept all cookies and continue
DATA TO EXTRACT: [Specific data fields to find on the page]
OUTPUT FORMAT: Return a JSON object with these fields: [field list with types]
VALIDATION: Before returning output, check that all required fields are present. Note any absent fields rather than omitting them silently.
DO NOT: Click any links outside the target domain. Do not submit any forms unless explicitly instructed.
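The VALIDATION step can be made concrete on the consuming side. A sketch, assuming an illustrative field set of title/url/date/summary; absent fields are kept as null and listed, never dropped silently:

```python
# Required fields and their expected types (illustrative; match your own schema)
REQUIRED = {"title": str, "url": str, "date": str, "summary": str}


def validate_extraction(raw):
    """Check all required fields are present; note absences explicitly."""
    record, missing = {}, []
    for field, ftype in REQUIRED.items():
        value = raw.get(field)
        if isinstance(value, ftype):
            record[field] = value
        else:
            record[field] = None           # keep the key, never omit it silently
            missing.append(field)
    record["missing_fields"] = missing
    return record
```

Keeping the `missing_fields` list in the output means downstream systems can distinguish "the page had no date" from "the agent forgot to look".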
The handoff is the most commonly broken part of a multi-agent pipeline. This template makes the handoff contract explicit for both the sending and receiving agent.
[SENDING AGENT INSTRUCTIONS]
You are completing [Task A]. When finished, output a handoff package:
HANDOFF TO: [Next agent name]
TASK COMPLETED: [Summary of what you did]
OUTPUT: [Your deliverable]
CONTEXT FOR NEXT AGENT: [What they need to know that is not in the output]
OPEN ISSUES: [Anything unresolved]
STATUS: [Complete / Partial - specify what is missing if partial]
[RECEIVING AGENT INSTRUCTIONS]
You are [Agent Name], receiving a handoff from [Previous agent].
Read the handoff package above. Your task: [Task B].
Use the OUTPUT from the previous agent as your input.
If STATUS is Partial, note the gaps in your own output.
Your deliverable format: [specify]
Define the goal in output terms
Do not describe the process; describe the deliverable. Agents that receive process instructions drift; agents that receive output specs converge.
Set the tool permission boundary
List exactly which tools the agent can use and with what access level. Everything not on the list is implicitly prohibited.
Write the termination condition first
Before any other instruction, decide: what does done look like? How many steps maximum? What happens at the step limit?
Build in a fallback for every tool call type
For every class of tool, write one sentence describing what to do when it fails. This prevents halt-on-error behavior.
Specify output format as a schema
A JSON schema or a filled-in template example produces consistent results. The more downstream automation depends on agent output, the more rigidly you should specify the format.
Test in sandbox before production
Run the full agent prompt on a 10% sample of the real task before giving it access to your actual systems. Evaluate completeness, step budget, and output format.
The agent landscape consolidated significantly in 2025 and early 2026. Claude (Anthropic) with extended thinking is the most reliable general agent for tasks requiring careful reasoning before action: it pauses and surfaces uncertainty rather than guessing. GPT-4o (OpenAI) with Operator handles ambiguity more aggressively, making it better for high-volume, lower-stakes tasks. Devin (Cognition) remains strongest for software engineering agent tasks: multi-file code changes, PR creation, and debugging. Manus handles generalist computer-use and document-heavy workflows well.
For workflow automation connecting multiple SaaS tools, Make.com and Zapier AI agents remain the most practical option because they handle API authentication, error retry, and data mapping at the infrastructure level, tasks that model-level agents handle inconsistently. The best architecture for most business use cases in 2026 is a language model for decision-making combined with an established automation platform for execution.
A regular prompt assumes one turn: you give context, the model responds, done. An agent prompt has to account for multiple execution steps, tool calls, decision branches, and failure states all without you in the loop. The biggest structural difference is that an agent prompt must define a clear termination condition. Without it, agents loop, retry endlessly, or produce output no one asked for. The second critical difference is scope: agents need explicit permission boundaries. Instead of 'research this topic,' an agent prompt says 'search the web for X, summarize your findings in 3 bullet points, then stop; do not draft emails or take follow-up actions.' Specificity about what to do AND what not to do is what separates productive agent runs from expensive runaway processes.
The answer depends on the task type. For code-heavy autonomous work, Devin and GitHub Copilot Workspace handle multi-file edits with architectural awareness that single-model completions miss. For research and synthesis tasks, Perplexity's research mode and Claude's project-based memory produce reliable sourced outputs. For business workflow automation (CRM updates, email triage, document generation), Make and Zapier's AI agent layers handle API chaining better than pure language models. For general agentic tasks where you are still experimenting, Claude Sonnet with extended thinking and GPT-4o with tool calls both offer predictable behavior. Claude tends to be more conservative and stops when uncertain; GPT-4o tends to push through ambiguity. Neither trait is universally better; it depends on whether you want the agent to ask or act when stuck.
Three structural controls reduce loop risk significantly. First, set explicit step limits in the prompt: 'Complete this task in no more than 5 tool calls. If you have not reached a satisfactory result by step 5, stop and report what you found.' Second, define what done looks like before the agent starts, not after: 'The task is complete when you have produced a markdown file with at minimum 3 sourced findings and a clear recommendation.' Third, give the agent a fallback instruction for ambiguity: 'If you encounter a page that requires login or returns an error, note it in your report and move to the next source; do not retry the same URL more than once.' Agents that loop almost always do so because the completion criterion was implicit rather than explicit in the original prompt.
Yes, with important caveats. Browser-use agents like Operator (OpenAI), Claude Computer Use, and Manus can navigate real websites, fill forms, and extract structured data. The prompting challenge is that real websites are messy: login walls, CAPTCHAs, cookie consent popups, and dynamic JavaScript content all create failure points. Good agent prompts for web tasks include: a primary action (go to X and extract Y), an explicit fallback (if the page requires login, use a web search for the same information instead), and a format specification for the output (return results as a JSON array with these fields: title, URL, date, and one-sentence summary). Prompts that omit the fallback and format spec produce inconsistent results that require heavy post-processing.
A multi-agent workflow assigns different sub-tasks to specialized agents that hand off to each other: an orchestrator decomposes the goal, a researcher gathers data, an analyst interprets it, a writer produces the final output. You should use multi-agent architecture when a single long-context run becomes unreliable (context windows fill, models start hallucinating details from earlier in the thread) or when tasks benefit from specialization. The prompting challenge is writing clear handoff instructions: each agent needs to know what it received, what it is supposed to add, and what format to pass forward. The most common failure is the handoff prompt being too vague. 'Take the research and write a report' produces worse results than 'take the JSON research object and produce a 500-word narrative in second-person voice, citing at least 3 of the provided sources by name'.
The principle of least privilege applies to AI agents exactly as it does to software systems. Give the agent access to the tools it actually needs for this specific task, not all available tools. Instead of 'you have access to email, calendar, CRM, and file system,' scope it to 'you have access to the read-only CRM API for contact lookup only; do not send emails or create records.' A more constrained tool set produces more predictable execution because the agent has fewer decision branches. It also limits the blast radius if the agent misinterprets an instruction. For any agent with write access to real systems, include a confirmation gate: 'Before taking any action that modifies data, output what you are about to do and wait for human confirmation.'
Run the prompt in a sandboxed environment first with a representative but low-stakes version of the real task. If the agent is supposed to process invoices, give it 3 test invoices before pointing it at your actual accounts payable queue. Evaluate three things: does it complete the task, does it do so within the step budget you set, and does the output match the format you specified. The format check is often skipped and always matters: an agent that produces the right information in the wrong structure creates downstream errors in whatever system consumes its output. For complex agent workflows, run the first step of the pipeline manually for a week to validate output before chaining it to automated downstream steps.
Most agent frameworks handle persistent memory through vector stores, structured state files, or external databases. The prompting layer needs to explicitly instruct the agent to use whichever mechanism your setup provides. A practical pattern: 'At the start of each session, read the state file at [path/to/state.json]. At the end of each session, update the state file with: new contacts found, tasks completed, and open questions. Never overwrite the file β append to the existing records.' Without this instruction, agents either ignore memory tools or write over previous state rather than extending it.
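A minimal sketch of the read-then-append state-file pattern. The `state.json` path and the three record keys are assumptions; match them to whatever mechanism your agent setup provides:

```python
import json
import os

STATE_PATH = "state.json"  # hypothetical path; point this at your agent's state file


def load_state(path=STATE_PATH):
    """Read the state file at session start; start empty if it does not exist yet."""
    if not os.path.exists(path):
        return {"contacts": [], "tasks_completed": [], "open_questions": []}
    with open(path) as f:
        return json.load(f)


def save_state(updates, path=STATE_PATH):
    """Append new records to the existing state; never overwrite wholesale."""
    state = load_state(path)
    for key, new_items in updates.items():
        state.setdefault(key, []).extend(new_items)
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
    return state
```

Usage mirrors the prompt pattern: call `load_state()` at the start of each session, and `save_state({"contacts": [...], "tasks_completed": [...]})` at the end, so each run extends prior state instead of replacing it.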
Judgment-heavy tasks are where agent reliability degrades fastest. An agent can reliably execute rules; it cannot reliably substitute for domain expertise. The prompt solution is to make implicit judgment criteria explicit. Instead of 'identify the best candidate from these resumes,' write 'score each resume on these 5 criteria: [list criteria with 1-5 scale and scoring description for each]. Flag any resume that scores below 3 on criteria 1 or 2 for human review before proceeding.' This converts a judgment task into a rubric-following task, which agents execute reliably. Design your agent workflow with the assumption that 10-15% of tasks will need a human review gate, and build that gate into the prompt architecture from the start.
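A sketch of that judgment-to-rubric conversion. The five criterion names and the gate on criteria 1 and 2 are illustrative assumptions, not a standard rubric:

```python
# Illustrative rubric: criterion name -> scoring description (each scored 1-5)
RUBRIC = {
    "relevant_experience": "years and seniority in the target domain",
    "core_skills_match": "overlap with the must-have skills list",
    "communication": "clarity of the written materials",
    "trajectory": "growth across roles",
    "domain_signal": "publications, projects, or references",
}

# Criteria 1 and 2 gate the decision: a low score routes to human review
GATING = ("relevant_experience", "core_skills_match")


def score_candidate(scores):
    """scores: criterion -> integer 1-5, one entry per RUBRIC key."""
    flagged = any(scores[c] < 3 for c in GATING)
    return {"total": sum(scores.values()), "needs_human_review": flagged}
```

The agent follows the rubric deterministically; the `needs_human_review` flag implements the 10-15% human gate the paragraph recommends.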
The five failure modes that come up consistently: First, no termination condition: the agent runs until it hits a token or cost limit rather than stopping on task completion. Second, ambiguous output format: the agent produces correct information in a format no downstream system can parse. Third, over-permissive tool scope: the agent has access to tools it does not need and occasionally uses them in unintended ways. Fourth, missing the fallback: no instruction for what to do when a tool returns an error, so the agent retries or halts instead of gracefully continuing. Fifth, underspecified output voice: for agents that produce content, no instruction on tone or style, so output varies wildly across runs. Fixing these five issues eliminates roughly 80% of agent execution problems.
Expert prompts for designing, building, and deploying autonomous AI agents, from single-agent task runners to multi-agent collaborative systems.
Single-agent task system design
Design an AI agent to accomplish: [describe the goal, e.g., "research a company and produce an investment memo"]. For this agent, specify: 1. System prompt: role, constraints, output format, and what to do when stuck 2. Tool set: exactly which tools it needs and with what permissions (read-only vs write) 3. Memory strategy: what context it needs to retain across steps 4. Termination criteria: how does it know it's done? 5. Guard rails: what actions should require human approval before executing? 6. Failure modes: what are the top 3 ways this agent could go wrong, and how to mitigate each?
Agent system prompt template
Write a system prompt for an AI agent with this role: [describe role, e.g., "customer support escalation agent"]. The system prompt must include: - Role definition and primary objective - The step-by-step process the agent should follow - What tools are available and when to use each - Output format for each type of task - What to do when the agent is uncertain or lacks information - Explicit prohibitions (what the agent must never do) - How to escalate or ask for clarification Keep the system prompt under 500 words. Test it by asking: would a capable human follow these instructions and produce the right result?
Custom tool specification
I'm building a custom tool for an AI agent called [tool name]. What it does: [description] Inputs available: [list data sources] Output: [what the tool should return] Write: 1. The tool function signature with typed parameters and return type 2. The tool description string (this is what the AI reads to decide when to use it; make it precise about when to call vs. not call this tool) 3. Input validation and error handling 4. A test case showing correct usage 5. Common misuse patterns and how the description prevents them Framework: [LangChain / OpenAI function calling / Anthropic tool use]
Agent evaluation framework
Build an evaluation framework for an AI agent that [describe task]. The eval should test: 1. Task completion rate: define what "complete" means for this task 2. Output quality: what makes an output good vs acceptable vs failure? 3. Efficiency: what's the acceptable range of steps/tokens to complete the task? 4. Safety: what outputs or actions would constitute a safety failure? 5. Edge cases: list 5 inputs that would stress-test the agent Write 10 evaluation test cases with: input, expected output behaviour, and pass/fail criteria. Include a scoring rubric for human raters to assess borderline outputs.
Multi-agent crew for content production
Design a CrewAI multi-agent system for [content production task, e.g., "weekly competitive intelligence report"]. Define each agent with: - Role name and backstory (2-3 sentences) - Goal (what this agent is responsible for) - Tools available - Expected output Suggested crew: Researcher → Analyst → Writer → Editor For each handoff, specify: what information passes between agents, and what the receiving agent does with it. Include the crew goal, task definitions, and the process (sequential vs hierarchical).
AutoGen conversation pattern
Design an AutoGen multi-agent conversation for: [task, e.g., "review a pull request and write test cases for it"]. Define: - Agent 1: [name, system message, model] - Agent 2: [name, system message, model] - (Additional agents if needed) - The initiating message that starts the conversation - Termination condition (what message or state ends the conversation) - Human proxy: when should a human be able to intervene? - Max conversation rounds Show the Python code to configure and run this conversation.
Stateful workflow graph
Design a LangGraph workflow for [task, e.g., "customer complaint resolution"]. The graph should have these nodes: [list 3-5 nodes, e.g., classify_complaint, look_up_order, draft_response, escalate, send_response] And these edges: - Normal flow: [node] → [node] - Conditional routing: after [node], route to [A] if [condition], else [B] - Looping: [node] can return to [earlier node] when [condition] For each node, specify: - What it does - Its inputs (from state) - What it adds to state - Possible next nodes Include the Python code structure for the StateGraph.
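The routing logic in a graph like this can be prototyped framework-free before committing to LangGraph's `StateGraph`. A plain-Python sketch using the example node names above, with stand-in node bodies; each node mutates a shared state dict and returns the name of the next node:

```python
def classify_complaint(state):
    # Stand-in classifier: a real node would call a model here
    state["category"] = "refund" if "refund" in state["text"] else "other"
    return "look_up_order" if state["category"] == "refund" else "escalate"


def look_up_order(state):
    state["order"] = {"id": 42}           # stand-in for a real order lookup
    return "draft_response"


def draft_response(state):
    state["draft"] = f"Re: your {state['category']} request"
    return "END"


def escalate(state):
    state["escalated"] = True
    return "END"


NODES = {
    "classify_complaint": classify_complaint,
    "look_up_order": look_up_order,
    "draft_response": draft_response,
    "escalate": escalate,
}


def run_graph(state, entry="classify_complaint", max_hops=10):
    node = entry
    for _ in range(max_hops):             # hop cap guards against routing loops
        node = NODES[node](state)
        if node == "END":
            break
    return state
```

The hop cap plays the same role as the termination conditions discussed earlier: even a mis-routed graph cannot loop forever.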
Agent orchestration and human oversight
Design an orchestration system for agents running [process, e.g., "automated invoice processing pipeline"] that handles: 1. Task queue: how incoming tasks are queued, prioritised, and assigned 2. State tracking: how to track which agent is working on what, with what status 3. Human-in-the-loop gates: which steps require human approval and how that's signalled 4. Error recovery: if an agent fails partway through, how does the system resume? 5. Audit log: what events to log and in what format for compliance 6. Monitoring: what alerts should fire and when (stuck agent, high error rate, SLA breach) Suggest the technology stack and sketch the data model.
Deep research agent prompt
You are a research agent. Your task: produce a comprehensive research brief on [topic]. Process: 1. Identify 5-7 key sub-questions that together answer the main topic 2. For each sub-question: search for relevant sources, extract key information, note source credibility 3. Synthesise findings across sources, noting where sources agree or conflict 4. Identify what is well-established vs uncertain or contested 5. Produce a structured brief: executive summary, key findings by theme, evidence quality assessment, gaps in current knowledge, and recommended further reading Cite your sources. Flag any claims you cannot verify. Aim for depth over breadth.
Competitive intelligence agent
Act as a competitive intelligence agent. Research [competitor company] and produce an intelligence brief. Search for and synthesise: 1. Recent product launches, updates, or announcements (last 6 months) 2. Pricing changes or new pricing tiers 3. Key hires or leadership changes 4. Funding, revenue, or growth signals 5. Customer sentiment: reviews, support complaints, community mentions 6. Strategic direction: blog posts, conference talks, job postings that signal roadmap Format: one-page brief with date-stamped findings. Note confidence level for each finding (confirmed / likely / rumour). Flag the 2-3 most strategically significant findings.
Company due diligence agent
You are a due diligence research agent. Research [company name] for a potential [investment / partnership / acquisition]. Investigate: 1. Business model and revenue sources 2. Market position and competitive landscape 3. Leadership team background and track record 4. Financial signals (funding history, revenue estimates, burn rate if available) 5. Technology stack and IP (patents, open source contributions) 6. Customer base: key clients, concentration risk, churn signals 7. Red flags: legal issues, employee reviews, regulatory actions, negative press Produce a structured report with confidence levels and source citations for each finding.
Industry news monitoring agent
Set up an industry monitoring agent for [industry/topic]. The agent should: 1. Track news about: [list 5-7 specific topics, companies, or trends to monitor] 2. For each item found: summarise in 2-3 sentences, assess significance (High/Medium/Low), tag by category 3. Filter out: [specify what to exclude, e.g., press releases, opinion pieces without data, duplicate coverage] 4. Output format: daily/weekly digest with items ranked by significance 5. Highlight: any item that represents a major competitive threat, market shift, or regulatory change Run this as a scheduled workflow. What sources should it monitor and how should it handle conflicting reports?
Automated data processing agent
Design an agent that processes [type of data, e.g., "incoming customer feedback from email and Typeform"] automatically. The agent should: 1. Ingest data from [sources] 2. Classify each item by [categories, e.g., feature request, bug report, compliment, complaint] 3. Extract structured fields: sentiment, urgency, product area affected, customer tier 4. Route to the appropriate team or system based on classification 5. Generate a weekly summary with trends and volume by category Specify: tool set needed, processing logic for each step, output format, and how errors or ambiguous cases are handled.
Email triage and response agent
Design an email triage agent for [use case, e.g., "sales inquiry inbox"]. The agent should: 1. Read and classify incoming emails by type: [list categories] 2. For standard enquiries: draft a personalised response using templates + context from CRM 3. For complex or high-value enquiries: flag for human review with a suggested response draft 4. For spam or irrelevant mail: archive without response 5. Log all actions to [CRM / spreadsheet / database] Define: the classification rules, draft response quality bar, escalation criteria, and how the agent should handle ambiguous emails it's unsure how to classify.
Automated reporting agent
Build an agent that generates a [weekly / monthly] [report type, e.g., "sales performance report"] automatically. Data sources: [list systems: CRM, analytics, spreadsheet, database] Report structure: [list sections, e.g., "executive summary, KPIs vs target, top performers, risks, recommended actions"] For each section, specify: - What data to pull and from where - How to calculate / aggregate it - What narrative or interpretation the agent should add (not just numbers) - What anomalies or thresholds should trigger a special callout Output: [format: PDF, HTML email, Slack message, Google Doc] Schedule: [timing and recipients]
Quality assurance and review agent
Design a QA agent that reviews [type of output, e.g., "blog posts before publication" / "code pull requests" / "customer proposals"]. The agent should check each item against: 1. [Quality criterion 1, e.g., "factual accuracy: are all claims verifiable?"] 2. [Quality criterion 2, e.g., "brand voice: does the tone match our guidelines?"] 3. [Quality criterion 3, e.g., "completeness: are all required sections present?"] 4. [Quality criterion 4, e.g., "formatting: does it follow the template?"] Output: a structured review with: pass/fail per criterion, specific issues with location (paragraph, line, section), and suggested fixes for each issue. Escalation: if [X criteria] fail, block publication and notify [person/channel].
Agent safety checklist
Review my AI agent design for safety risks and suggest mitigations. Agent purpose: [describe what the agent does] Tools it has access to: [list all tools and their permissions] Actions it can take autonomously: [list all actions without human approval] Actions that require human approval: [list gated actions] Please assess: 1. Worst-case failure mode: what's the most harmful thing this agent could do if it malfunctions? 2. Permission minimisation: are any tool permissions broader than strictly necessary? 3. Reversibility: which actions are irreversible and do they have appropriate gates? 4. Prompt injection risk: how could a malicious input manipulate the agent? 5. Audit trail: is there sufficient logging to reconstruct what happened if something goes wrong?
Agent governance framework
Design a governance framework for deploying AI agents in [organisation type, e.g., "a regulated financial services company"]. The framework should cover: 1. Agent inventory: how to document and register all deployed agents 2. Risk classification: how to categorise agents by risk level (Low / Medium / High) with criteria for each 3. Approval process: what review is required before deploying each risk class? 4. Ongoing monitoring: what metrics and alerts to maintain per agent 5. Incident response: what to do when an agent takes an unexpected or harmful action 6. Compliance: how to document agent behaviour for regulatory audit Keep it practical: this should be implementable by a team of 5 people, not require a compliance army.
Agent adversarial testing
Generate adversarial test cases for an AI agent that [describe agent purpose]. The agent has these tools: [list tools] And these constraints in its system prompt: [paste key constraints] Create test inputs designed to: 1. Jailbreak the agent into ignoring its constraints 2. Prompt injection: embed instructions in tool outputs or external data that try to redirect the agent 3. Resource abuse: inputs that could cause the agent to loop, make excessive API calls, or use extreme amounts of tokens 4. Social engineering: inputs that claim special authority or permissions 5. Boundary testing: inputs at the edges of what the agent is designed to handle For each test: input, expected safe response, and what a failure would look like.
Agent decision audit trail design
Design an audit logging system for an AI agent that [describe agent]. The audit log should capture: 1. For every agent run: start time, end time, initiating user/system, goal statement 2. For every tool call: tool name, inputs (sanitised, no credentials), outputs (truncated if large), timestamp, latency 3. For every decision point: the agent's reasoning, options considered, and path taken 4. For every action: what was done, what was changed, and whether human approval was obtained 5. Errors and retries: every failure with error details and recovery action 6. Final output: the agent's conclusion and confidence level Specify the storage format, retention policy, and how to query the audit trail for a specific run or date range.
ReAct agent system prompt
Write a ReAct (Reason + Act) system prompt for an agent that [describe task]. The prompt must instruct the agent to: 1. Think: reason out loud about what to do before taking an action 2. Act: call a specific tool with specific inputs 3. Observe: interpret the tool result before deciding next steps 4. Repeat: loop until the goal is achieved or it determines it cannot proceed Available tools: [list tools] Termination: [what signals the task is done] Constraints: [list any "never do" rules] Output format: [what the final answer should look like] Include example reasoning traces showing the Thought → Action → Observation → Thought pattern.
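The loop that a ReAct prompt induces can be sketched in plain Python. Here `demo_policy` is a toy stand-in for the model's reasoning step and `tools` holds fake tool functions; a real run would replace both:

```python
def react_loop(goal, tools, policy, max_turns=5):
    """Thought → Action → Observation loop; `policy` stands in for the model."""
    trace = []
    observation = None
    for _ in range(max_turns):             # turn cap doubles as a termination guard
        thought, action, args = policy(goal, observation)
        trace.append(("Thought", thought))
        if action == "finish":             # the policy signals completion explicitly
            trace.append(("Answer", args))
            return args, trace
        observation = tools[action](args)  # Act, then Observe the result
        trace.append(("Action", f"{action}({args})"))
        trace.append(("Observation", observation))
    return None, trace                     # ran out of turns without finishing


def demo_policy(goal, observation):
    """Toy policy: search once, then finish with whatever came back."""
    if observation is None:
        return ("I need to look this up", "search", goal)
    return ("The search result answers the goal", "finish", observation)


tools = {"search": lambda q: f"result for {q}"}
answer, trace = react_loop("capital of France", tools, demo_policy)
```

The `trace` list is exactly the reasoning trace the template asks the prompt to demonstrate: alternating Thought, Action, and Observation entries ending in an Answer.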
Long-term memory system prompt
Write a system prompt for an agent that has access to a memory tool with these operations: - memory.save(key, value, description): save a fact for later - memory.search(query): retrieve relevant memories by semantic search - memory.forget(key): remove a specific memory The prompt should instruct the agent to: 1. Proactively save information that will be useful in future interactions (user preferences, past decisions, important context) 2. Search memory at the start of each task to retrieve relevant prior context 3. Update memories when new information supersedes old 4. Use memory efficiently: save facts, not full conversation transcripts Include examples of what to save and what not to save.
Robust tool calling instructions
Write the tool-calling instructions section of a system prompt for an agent with these tools: [list tools with one-line descriptions] The instructions should cover: 1. When to call a tool vs reason from existing knowledge 2. How to handle tool errors (retry, try alternative tool, ask for help) 3. How to avoid unnecessary tool calls (don't call search for things you already know) 4. How to sequence tool calls when multiple are needed 5. What to do when a tool returns unexpected or empty results 6. How to cite tool results in the final response Include a decision tree: "Before calling a tool, ask yourself: [questions]"
Force consistent JSON output
Write a prompt addition that reliably makes an agent output structured JSON in this schema: [paste your desired JSON schema] The addition should: 1. Clearly specify the exact JSON structure expected 2. Include field descriptions and types for each key 3. Give an example of a correctly formatted output 4. Handle edge cases: what to put when a field is unknown or not applicable 5. Include an instruction to output ONLY the JSON object with no surrounding prose 6. Specify how to handle arrays: empty [] vs null vs omit entirely Test it by showing a sample input and the expected JSON output.
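Even with a strict prompt addition, models occasionally wrap the JSON in prose. A defensive parser on the consuming side covers that case: this sketch strips surrounding text, then fills any unknown required keys with null rather than failing:

```python
import json


def parse_agent_json(raw, required_keys):
    """Extract the JSON object from agent output and normalise required keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in agent output")
    obj = json.loads(raw[start:end + 1])   # tolerate prose before/after the object
    for key in required_keys:
        obj.setdefault(key, None)          # unknown fields become null, not absent
    return obj
```

Pairing the "output ONLY the JSON object" instruction with a parser like this keeps the pipeline running on the runs where the model does not comply perfectly.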
Level 1: Assisted
AI helps a human complete a task. Human stays in the loop for every decision. Example: Copilot suggesting code, ChatGPT drafting an email.
Level 2: Semi-Autonomous
AI completes multi-step tasks independently but with human checkpoints at key decisions. Example: Research agent that flags findings for review.
Level 3: Fully Autonomous
AI executes entire workflows end-to-end, only escalating for explicit exceptions. Requires robust safety guardrails and extensive testing before deploying.
Start at Level 1 or 2 for any new agent. Move to Level 3 only after validating behaviour across hundreds of real tasks.
Before going live with any AI agent, run through this checklist to catch the most common failure modes.
LangGraph
Graph-based orchestration with stateful cycles. Best for complex multi-step workflows with loops and conditional branching.
CrewAI
Role-based multi-agent framework. Each agent has a role, goal, and backstory. Best for collaborative agent teams with defined responsibilities.
AutoGen
Conversation-driven agents that collaborate via messages. Best for code generation tasks and human-in-the-loop workflows.
OpenAI Assistants API
Managed agent with persistent threads, file search, and code interpreter. Best for production use cases that need OpenAI-native tool access.
LangChain Agents
Flexible agent with a wide library of tools and integrations. Best for rapid prototyping and RAG-augmented agents.
Anthropic Tool Use
Structured tool calling via Claude API. Best for precise, controllable agents where output reliability is critical.