Prompt Checker.
Free six-criteria rubric for grading AI prompts. Specificity, structure, clarity, output format, role, and verification. Works with any LLM.
Score before you ship. Catch weak prompts before they waste tokens in production.
3 prompt variations
You are a prompt engineer reviewing a prompt.

# PROMPT TO REVIEW
"""
[paste the prompt to check]
"""
Severity: balanced (fail on issues below 3/5).

# TASK
Score the prompt 1 to 5 on each of these six criteria. One sentence of rationale per score.
1. Specificity: does the prompt define exact outputs, lengths, and formats?
2. Structure: are sections (role, task, context, output) clearly separated?
3. Clarity: is the language unambiguous and free of contradiction?
4. Output format: is the expected format explicit (JSON, prose, list, schema)?
5. Role and context: does the prompt frame who the model is and what the user needs?
6. Verification and safety: is there a self-check, fallback, or robustness consideration?

# OUTPUT
A table with columns: criterion, score (1-5), one-sentence rationale. Add a final "Overall" row with the average and a verdict (pass / needs work / fail).
# ROLE
Senior prompt engineer auditing a prompt for production use.

# PROMPT
"""
[paste the prompt to check]
"""
Severity: balanced (fail on issues below 3/5).

# RUBRIC
1. Specificity: does the prompt define exact outputs, lengths, and formats?
2. Structure: are sections (role, task, context, output) clearly separated?
3. Clarity: is the language unambiguous and free of contradiction?
4. Output format: is the expected format explicit (JSON, prose, list, schema)?
5. Role and context: does the prompt frame who the model is and what the user needs?
6. Verification and safety: is there a self-check, fallback, or robustness consideration?

# TASK
For each criterion:
1. Score 1 to 5.
2. Quote the specific sentence or missing element that drove the score.
3. Suggest the smallest change that would raise the score by one point.

# OUTPUT
Structured report: criterion, score, evidence (quoted), improvement. End with an overall verdict and the top three improvements ranked by expected impact.
# TASK
You want to compare prompts but only one prompt was provided.

# PROMPT
"""
[paste the prompt to check]
"""

# INSTRUCTION
Write a second version of the prompt that fixes its weakest criterion. Then score A (original) and B (your rewrite) on all six criteria in the rubric. End with the specific delta and why B is stronger.

# RUBRIC
1. Specificity: does the prompt define exact outputs, lengths, and formats?
2. Structure: are sections (role, task, context, output) clearly separated?
3. Clarity: is the language unambiguous and free of contradiction?
4. Output format: is the expected format explicit (JSON, prose, list, schema)?
5. Role and context: does the prompt frame who the model is and what the user needs?
6. Verification and safety: is there a self-check, fallback, or robustness consideration?
Under the hood
Why most prompts fail the same way.
Specificity, structure, clarity, output format, role and context, verification. Most production prompt failures trace back to one of these six. The rubric is not exhaustive but it is remarkably predictive.
The checker produces a grading meta-prompt. You run it in any LLM and that model does the scoring. Grading with a different model than the one you deploy on catches blind spots.
Prompts jump in score when you fix the lowest-rated criterion, not when you tweak across all six. Run the checker, find the 2/5, fix it, re-run. Three rounds beats twenty tiny edits.
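The fix-the-lowest-criterion loop can be sketched in plain Python. This is an illustration, not part of the checker; the criterion names and example scores are hypothetical.

```python
def next_fix(scores: dict[str, int]) -> tuple[str, int]:
    """Return the lowest-scoring criterion: the one fix worth making this round."""
    criterion = min(scores, key=scores.get)
    return criterion, scores[criterion]

round_one = {
    "specificity": 4, "structure": 3, "clarity": 4,
    "output format": 2, "role and context": 4, "verification": 3,
}
# Fix only the 2/5, then re-run the checker.
print(next_fix(round_one))  # → ('output format', 2)
```

After the fix, re-grade and call `next_fix` again on the new scores; stop when the minimum clears your severity threshold.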
Related free tools
Specialized generators for specific tasks.
FAQ
Questions about prompt scoring.
What does the prompt checker actually do?
It produces a scoring meta-prompt. You paste your current prompt, pick criteria and severity, and the checker returns a prompt you run in any LLM. That LLM scores your original prompt against six quality criteria with evidence and fix suggestions. The model does the scoring, so you keep control over which model grades and what it costs.
What are the six criteria?
Specificity (does the prompt define exact outputs and formats), structure (are sections clearly separated), clarity (is the language unambiguous), output format (is the expected format explicit), role and context (is the model's role framed), and verification (is there a self-check or robustness consideration). Together they cover the most common failure modes in production prompts.
Why should a different LLM grade my prompt?
Running the grader on the same model you deploy to produces biased scores. Using a different model as the grader gives you a second perspective. For example, if you ship on GPT-5 but grade with Claude, Claude spots weak points GPT would gloss over. For the strictest audits, grade with two different models and look at where they disagree.
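One way to operationalize the two-grader audit is to diff the two score sets and flag large gaps. A minimal sketch; the grader names, scores, and one-point cut-off below are illustrative assumptions.

```python
def disagreements(grader_a: dict[str, int],
                  grader_b: dict[str, int],
                  gap: int = 1) -> list[str]:
    """Criteria where the two graders differ by more than `gap` points —
    the spots worth a human look."""
    return [c for c in grader_a if abs(grader_a[c] - grader_b[c]) > gap]

# Hypothetical scores from two different grading models.
grader_1 = {"specificity": 4, "clarity": 2, "verification": 3}
grader_2 = {"specificity": 4, "clarity": 5, "verification": 3}
print(disagreements(grader_1, grader_2))  # → ['clarity']
```

Agreement on a low score is a reliable signal to fix; disagreement usually means the criterion is ambiguous in your prompt and needs a closer read.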
What is the difference between quick score, detailed, and compare?
Quick score returns a six-row table with 1-5 scores and one-sentence rationale. Detailed adds quoted evidence from your prompt and a specific improvement for each criterion. Compare runs two prompts head to head and declares a winner. Start with quick, move to detailed when you want to fix issues, use compare when A/B testing variants.
How do I pick severity?
Strict fails anything below 4/5 on any criterion, good for production prompts that hit customers. Balanced fails below 3/5, reasonable for most internal or ops-facing prompts. Lenient only flags issues below 2/5, useful for exploratory or prototype prompts where you want direction without blocking iteration. Match severity to how much the prompt's quality matters in deployment.
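The three severity levels reduce to simple fail thresholds. A sketch under the cut-offs stated above; the pass / needs-work split on the average (4.0) is an assumption, not something the checker prescribes.

```python
SEVERITY_THRESHOLDS = {"strict": 4, "balanced": 3, "lenient": 2}

def verdict(scores: dict[str, int], severity: str = "balanced") -> str:
    """Fail if any criterion falls below the severity threshold;
    otherwise pass or flag as needing work based on the average."""
    threshold = SEVERITY_THRESHOLDS[severity]
    failing = [c for c, s in scores.items() if s < threshold]
    if failing:
        return f"fail: {', '.join(failing)}"
    avg = sum(scores.values()) / len(scores)
    return "pass" if avg >= 4 else "needs work"  # 4.0 cut-off is illustrative

scores = {"specificity": 2, "structure": 4, "clarity": 4,
          "output format": 4, "role and context": 4, "verification": 4}
print(verdict(scores))  # → fail: specificity (2 falls below balanced's 3)
```

The same scores that fail under strict can pass under lenient, which is why severity should track deployment stakes rather than personal taste.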
Can I grade a system prompt and user prompt separately?
Yes. Paste the system prompt first, grade it. Then paste the user prompt template, grade it. Then paste the combination, grade the whole. Each gets its own six-criteria report and the issues tend to cluster differently. System prompts usually fail on role/context and verification. User prompts fail on specificity and output format.
Does a high score guarantee the prompt will work?
No. The rubric catches most design flaws, but the only real test is output quality on real inputs. Use the checker to catch obvious issues before you burn tokens testing. A prompt that scores 5/5 and still fails on real data is telling you the problem is not in the prompt's structure; it is in the task itself or the model's capability.
What should I do after I get a bad score?
Use the detailed rubric version, look at the improvement column, and apply the top one or two fixes. Do not try to fix all six at once. Re-run the checker. Scores usually jump 1 to 2 points per round. Two or three rounds take most prompts from 3/5 to 4.5/5. For the best results, pair this with the AI Prompt Optimiser to handle the actual rewrite.