Prompt Caching
Reduce AI API costs by up to 90% and cut latency in half. 12 production templates for prefix caching, semantic caching, and the Double Caching Framework.
The Caching Hierarchy
Not all caching is equal. Three tiers exist in production AI systems, each with different hit rates, complexity, and savings potential. Implement them in order: each layer compounds the savings of the previous one.
Exact Match Cache
Same prompt → cached response. Simplest to implement. Low hit rates unless users repeat identical queries. Best for FAQs and templated reports (a minimal sketch follows this list).
Prefix Cache
Cache the static beginning of prompts (system instructions, docs). Variable user content appended after. Hit rate scales with request volume. Supported natively by OpenAI and Anthropic.
Semantic Cache
Vectorize queries and match semantically similar ones. Cache the response for 'what is your refund policy?' and serve it for 'how do I get a refund?'. Requires a vector store.
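A Tier 1 exact-match cache is just a hash map keyed on the normalized prompt. A minimal Python sketch; call_model stands in for your actual API call:

import hashlib

class ExactMatchCache:
    """Tier 1: identical prompt -> identical cached response."""
    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Normalize whitespace and case so trivial variations still hit.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str):
        self._store[self._key(prompt)] = response

def answer(prompt: str, cache: ExactMatchCache, call_model) -> str:
    cached = cache.get(prompt)
    if cached is not None:
        return cached              # zero API cost on a hit
    response = call_model(prompt)  # your actual API call
    cache.put(prompt, response)
    return response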
The Double Caching Framework
For SaaS applications at scale, combine Tier 2 and Tier 3 into the Double Caching Framework: prefix caching handles the static system context (handled by your AI provider), while a semantic cache layer in your application intercepts similar queries before they even reach the API.
Layer 1: Prefix cache active on all calls. System prompt, knowledge base, and few-shot examples are cached; on hits you pay only 10-50% of the normal input-token price for these tokens, depending on provider.
Layer 2: Semantic similarity check before each API call. Matching queries are served from your cache at zero API cost. Hit rate: 60-80% for typical SaaS user bases.
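In code, the two layers compose into a single request path. A minimal sketch, assuming the OpenAI Python SDK and a semantic_cache object exposing lookup/store methods (one possible implementation appears in the checklist below):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def handle_request(user_query: str, semantic_cache, static_system: str) -> str:
    # Layer 2: the application-level semantic cache intercepts similar
    # queries before they reach the API.
    hit = semantic_cache.lookup(user_query)
    if hit is not None:
        return hit  # served at zero API cost

    # Layer 1: static content goes first so the provider's prefix cache
    # applies; only the user query varies between requests.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": static_system},  # cacheable prefix
            {"role": "user", "content": user_query},       # dynamic suffix
        ],
    )
    answer = response.choices[0].message.content
    semantic_cache.store(user_query, answer)
    return answer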
Production Caching Templates
Copy-paste these templates into your system prompts and application code. Tags indicate which caching strategy each template implements.
System Prompt Prefix for Cached Context
Tag: Prefix Cache

[SYSTEM - CACHE THIS BLOCK]
You are an expert assistant for [COMPANY_NAME]. The following context is static and should be cached across all requests in this session.
Company context: [PASTE_COMPANY_CONTEXT_HERE - product docs, FAQs, policies, knowledge base]
Tone: [professional/casual/technical]
Constraints: [PASTE_ANY_RULES]
[END CACHED BLOCK]

User query: [DYNAMIC_USER_INPUT]
Document Q&A with Cached Document Body
Tag: Prefix Cache

[CACHED DOCUMENT - DO NOT REGENERATE]
The following is the full text of [DOCUMENT_TITLE]. You will answer questions about it. Do not summarize or modify this document.
[PASTE_FULL_DOCUMENT_TEXT]
[END CACHED DOCUMENT]

Based solely on the document above, answer the following question. If the answer is not in the document, say so explicitly.
Question: [USER_QUESTION]
Multi-Turn Cache Refresh Trigger
Tag: Session Cache

[SESSION_ID: {SESSION_ID}]
[CACHE_VERSION: v1.2]
[EXPIRES: {EXPIRY_TIMESTAMP}]
This session has the following persistent context. Do not re-process or re-read this on subsequent turns - treat it as already loaded:
User profile: [USER_PROFILE_DATA]
Conversation history summary: [SUMMARIZED_HISTORY]
Active document: [DOCUMENT_REFERENCE]
Current task: [ACTIVE_TASK_DESCRIPTION]

Instruction: Resume from the previous state. The user's next message continues this session.
User: [NEXT_MESSAGE]
Code Review with Cached Codebase Context
Tag: Prefix Cache

[CACHED CODEBASE CONTEXT]
Language: [LANGUAGE]
Framework: [FRAMEWORK + VERSION]
Coding standards: [PASTE_STYLE_GUIDE]
Common patterns used in this codebase: [PASTE_REPRESENTATIVE_CODE_SNIPPETS]
Known constraints: [PERFORMANCE_REQUIREMENTS, SECURITY_RULES, etc.]
[END CACHED CONTEXT]

Review the following new code against the cached codebase context. Flag:
1. Style violations
2. Performance issues
3. Security concerns
4. Inconsistencies with existing patterns

New code to review: [PASTE_NEW_CODE]
RAG with Semantic Cache Layer
Tag: Semantic Cache

[SEMANTIC CACHE CHECK]
Before processing this query, check if a semantically equivalent query has been answered in this session. If yes, return the cached answer with a note: "Cached response - similar query answered [X turns ago]."
Query: [USER_QUERY]

[VECTOR CONTEXT - Retrieved chunks]
Similarity threshold used: 0.85
Retrieved documents:
- [CHUNK_1 with source and score]
- [CHUNK_2 with source and score]
- [CHUNK_3 with source and score]
[END VECTOR CONTEXT]

Answer the query using the retrieved context. Cite sources inline.
Cost Calculator Prompt for Caching ROI
Tag: Utility

Calculate the cost savings from implementing prompt caching for the following system. Return a structured analysis.

System parameters:
- Monthly API requests: [NUMBER]
- Average prompt tokens per request: [NUMBER]
- Average completion tokens per request: [NUMBER]
- Percentage of requests that share a common prefix: [0-100%]
- Current model: [MODEL_NAME]
- Current cost per 1K input tokens: $[PRICE]
- Current cost per 1K output tokens: $[PRICE]
- Cache hit rate expected: [0-100%]
- Cache cost (if applicable, e.g. Anthropic charges 0.1x for cache reads): $[PRICE]/1K tokens

Calculate:
1. Current monthly cost
2. Monthly cost with caching at expected hit rate
3. Monthly savings in dollars
4. Break-even cache hit rate (where caching becomes worth it)
5. Annual savings projection
6. Recommendation: implement caching? Y/N with reasoning
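For reference, here is the same core arithmetic as a minimal Python sketch. It assumes a flat cache-read discount and ignores any one-time cache-write surcharge (Anthropic, for example, bills cache writes at 1.25x the base input rate):

def caching_roi(requests_per_month, prompt_tokens, completion_tokens,
                input_price_per_1k, output_price_per_1k,
                cacheable_fraction, hit_rate, cache_read_multiplier=0.1):
    """Monthly cost with and without prompt caching."""
    base_input = requests_per_month * prompt_tokens / 1000 * input_price_per_1k
    output = requests_per_month * completion_tokens / 1000 * output_price_per_1k
    current = base_input + output

    # On a hit, the cacheable prefix is billed at the discounted read rate
    # instead of the full input rate.
    saved = base_input * cacheable_fraction * hit_rate * (1 - cache_read_multiplier)
    return {"current": current, "with_cache": current - saved,
            "monthly_savings": saved, "annual_savings": 12 * saved}

# Example: 1M requests/month, 4K-token prompts that are 80% static prefix,
# 70% expected hit rate, cache reads at 0.1x the input price.
print(caching_roi(1_000_000, 4000, 500, 0.003, 0.015,
                  cacheable_fraction=0.8, hit_rate=0.7))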
Long Document Analysis with Prefix Caching
Tag: Prefix Cache

[IMMUTABLE DOCUMENT - CACHE IMMEDIATELY]
The following 100,000-word document is the source of truth for all subsequent analysis tasks. It will not change during this session. Cache this prefix.
[PASTE_LONG_DOCUMENT_HERE]
[END IMMUTABLE DOCUMENT]

Task [1 of N]: [FIRST_ANALYSIS_TASK]

Note: Subsequent tasks will reference this same cached document. Prefix caching should be active from this point forward.
Chatbot System Prompt with Dynamic User Slot
Tag: Prefix Cache

[STATIC SYSTEM CONTEXT - CACHE THIS BLOCK]
You are [BOT_NAME], a [ROLE] for [BRAND_NAME].
Your personality: [PERSONALITY_DESCRIPTION]
Your expertise areas: [LIST_TOPICS]
Things you never do: [RESTRICTIONS]
Escalation trigger phrases: [LIST_PHRASES_THAT_ROUTE_TO_HUMAN]
Knowledge base (cache this): [PASTE_KNOWLEDGE_BASE_CONTENT]
Response format rules:
- Keep responses under [MAX_WORDS] words
- Always end with a follow-up question or next step
- Tone: [TONE]
[END STATIC CONTEXT]

[DYNAMIC - NOT CACHED]
Current user ID: {user_id}
Session start: {timestamp}
User's message: {user_message}
Incremental Summary with Cache Checkpoint
Tag: Session Cache

[CACHE CHECKPOINT v{VERSION}]
Previous summary (do not re-summarize this portion):
{CACHED_SUMMARY_FROM_PREVIOUS_CHUNKS}
Processing status: Chunks 1-{N} complete.
[END CACHE CHECKPOINT]

New content to add to the running summary (chunk {N+1}):
[PASTE_NEW_CHUNK]

Instructions:
1. Read the cached summary for context only - do not re-summarize it.
2. Summarize ONLY the new chunk.
3. Merge the new summary with the cached summary into an updated comprehensive summary.
4. Output the new full summary (this becomes the next cache checkpoint).
Few-Shot Examples Block (Cacheable)
Tag: Prefix Cache

[CACHEABLE FEW-SHOT EXAMPLES]
The following examples define the exact output format and quality level required. Cache these examples - they apply to all requests in this session.

Example 1:
Input: [EXAMPLE_INPUT_1]
Output: [EXAMPLE_OUTPUT_1]

Example 2:
Input: [EXAMPLE_INPUT_2]
Output: [EXAMPLE_OUTPUT_2]

Example 3:
Input: [EXAMPLE_INPUT_3]
Output: [EXAMPLE_OUTPUT_3]

Pattern: [DESCRIBE_THE_PATTERN_THESE_EXAMPLES_DEMONSTRATE]
[END CACHEABLE EXAMPLES]

Now apply the same pattern to this new input: [NEW_INPUT]
Multi-Agent Shared Context Cache
Tag: Prefix Cache

[SHARED AGENT CONTEXT - STATIC - CACHE ACROSS ALL AGENTS]
Project: [PROJECT_NAME]
Objective: [PROJECT_GOAL]
Current state: [WHAT_HAS_BEEN_DONE_SO_FAR]
Constraints: [HARD_CONSTRAINTS]
Available tools: [LIST_TOOLS]
Prior agent outputs:
- Agent 1 (Research): [SUMMARY]
- Agent 2 (Analysis): [SUMMARY]
[END SHARED CONTEXT]

Agent role: [THIS_AGENT_ROLE]
Your specific task: [AGENT_TASK]
Output format: [EXPECTED_OUTPUT]

Do not re-read or re-summarize the shared context. Use it as background knowledge only.
Cache Invalidation Trigger Prompt
Tag: Cache Management

[CACHE INVALIDATION REQUEST]
The following cached context is now stale and must be replaced:
Cache key: [CACHE_KEY_OR_DESCRIPTION]
Reason for invalidation: [WHY_IT_CHANGED]
Old version: [OLD_CACHED_CONTENT_SUMMARY]
New content to cache (replace the above): [NEW_CONTENT_TO_CACHE]

After updating, confirm: "Cache updated. New version active as of [timestamp]."
Then proceed with: [FIRST_TASK_USING_NEW_CACHE]
Implementation Checklist
Follow this sequence to implement caching in production:
Audit your prompt structure
Identify what's static vs dynamic in each of your prompts. Static content that appears in every request is your caching target. Aim for 60%+ of tokens to be in the static prefix.
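To measure the split, count tokens in each portion. A quick sketch using the tiktoken tokenizer; the two strings are placeholders for your real prompt parts:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Placeholders: substitute your actual prompt content.
static_part = "system instructions + knowledge base + few-shot examples"
dynamic_part = "one representative user query"

static_tokens = len(enc.encode(static_part))
dynamic_tokens = len(enc.encode(dynamic_part))
share = static_tokens / (static_tokens + dynamic_tokens)
print(f"static prefix share: {share:.0%}")  # aim for 60%+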
Move static content to the top
Restructure prompts: system instructions, reference docs, and few-shot examples all go before the dynamic user content. Even a single character change invalidates the cache from that point.
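A common mistake is interpolating per-request values at the top of the system prompt. A minimal illustration (STATIC_INSTRUCTIONS is a placeholder):

import time

STATIC_INSTRUCTIONS = "You are a support assistant. ..."  # placeholder

# BAD: a per-request value at the top changes the first bytes of the
# prompt, so the provider's prefix cache misses on every call.
system_bad = f"Current time: {time.time()}\n{STATIC_INSTRUCTIONS}"

# GOOD: the static block comes first; request-specific values go after
# it (or into the user message), so the long prefix still matches.
system_good = f"{STATIC_INSTRUCTIONS}\nCurrent time: {time.time()}"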
Enable provider-level caching
OpenAI: automatic for prompts 1,024+ tokens (no code change needed). Anthropic: add cache_control breakpoints to your system message. Google Gemini: use the context caching API with explicit TTL settings.
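For Anthropic, the breakpoint is set per content block. A minimal sketch using the anthropic Python SDK; the model name and placeholder variables are illustrative:

import anthropic

LONG_STATIC_CONTEXT = "system instructions + docs + examples"  # placeholder
user_query = "How do I get a refund?"

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # use whatever model you run in production
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_CONTEXT,
            # Everything up to and including this block is cached and
            # billed at the discounted read rate on hits. Note the prefix
            # must exceed the model's minimum cacheable length (1,024
            # tokens on most models) to actually be cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)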
Monitor cache hit rates
Track cached-token counts in API responses: usage.cache_read_input_tokens (Anthropic), usage.prompt_tokens_details.cached_tokens (OpenAI), or usage_metadata.cached_content_token_count (Gemini). Target 70%+ hit rate on high-volume endpoints. Low hit rates indicate your prompts are varying too much in the 'static' portion.
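Continuing the Anthropic example above, the hit rate for a single response can be derived from the usage object (field names differ per provider):

usage = response.usage  # from the anthropic call above
cached = usage.cache_read_input_tokens or 0
written = usage.cache_creation_input_tokens or 0
total_input = usage.input_tokens + cached + written
print(f"cache hit rate this request: {cached / total_input:.0%}")
# OpenAI equivalent: response.usage.prompt_tokens_details.cached_tokens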
Add semantic cache layer (optional but high ROI)
Implement a Redis or Pinecone vector cache. Embed incoming queries, check similarity against cached queries (threshold: 0.85). Serve cached responses at zero API cost on hits.
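A minimal in-memory sketch of that layer, using OpenAI embeddings and a linear scan; a production version would swap the list for a Redis or Pinecone vector index:

import numpy as np
from openai import OpenAI

client = OpenAI()
THRESHOLD = 0.85  # tune per workload

class SemanticCache:
    """In-memory sketch of a semantic cache; interface matches the
    handle_request example above (lookup/store)."""
    def __init__(self):
        self.entries = []  # list of (embedding, response) pairs

    def _embed(self, text: str) -> np.ndarray:
        result = client.embeddings.create(
            model="text-embedding-3-small", input=text)
        return np.array(result.data[0].embedding)

    def lookup(self, query: str):
        q = self._embed(query)
        for emb, resp in self.entries:
            # Cosine similarity between the query and a cached query.
            sim = float(q @ emb / (np.linalg.norm(q) * np.linalg.norm(emb)))
            if sim >= THRESHOLD:
                return resp
        return None

    def store(self, query: str, response: str):
        self.entries.append((self._embed(query), response))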
Set up cache invalidation triggers
When your knowledge base or system prompt changes, explicitly invalidate affected cache entries. For time-sensitive content, set appropriate TTLs rather than relying on default expiry.
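For the application-level cache, a stdlib-only sketch of TTL expiry plus an explicit invalidation hook; the namespace scheme is an assumption, so adapt it to your key layout:

import time

class TTLCache:
    """Stdlib-only sketch; Redis SET with EX gives the same TTL semantics."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazy expiry on read
            return None
        return value

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def invalidate_namespace(self, prefix: str):
        # Explicit invalidation when the knowledge base or system prompt
        # changes: drop every entry under the stale namespace.
        for key in [k for k in self._store if k.startswith(prefix)]:
            del self._store[key]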
Frequently Asked Questions
Everything developers need to know about AI prompt caching in 2026.