Prompt Checker.
Free six-criteria rubric for grading AI prompts. Specificity, structure, clarity, output format, role, and verification. Works with any LLM.
Score before you ship. Catch weak prompts before they waste tokens in production.
3 prompt variations
You are a prompt engineer reviewing a prompt.

# PROMPT TO REVIEW
"""
[paste the prompt to check]
"""
Severity: balanced (fail on issues below 3/5).

# TASK
Score the prompt 1 to 5 on each of these six criteria. One sentence of rationale per score.
1. Specificity: does the prompt define exact outputs, lengths, and formats?
2. Structure: are sections (role, task, context, output) clearly separated?
3. Clarity: is the language unambiguous and free of contradiction?
4. Output format: is the expected format explicit (JSON, prose, list, schema)?
5. Role and context: does the prompt frame who the model is and what the user needs?
6. Verification and safety: is there a self-check, fallback, or robustness consideration?

# OUTPUT
A table with columns: criterion, score (1-5), one-sentence rationale. Add a final "Overall" row with the average and a verdict (pass / needs work / fail).
# ROLE
Senior prompt engineer auditing a prompt for production use.

# PROMPT
"""
[paste the prompt to check]
"""
Severity: balanced (fail on issues below 3/5).

# RUBRIC
1. Specificity: does the prompt define exact outputs, lengths, and formats?
2. Structure: are sections (role, task, context, output) clearly separated?
3. Clarity: is the language unambiguous and free of contradiction?
4. Output format: is the expected format explicit (JSON, prose, list, schema)?
5. Role and context: does the prompt frame who the model is and what the user needs?
6. Verification and safety: is there a self-check, fallback, or robustness consideration?

# TASK
For each criterion:
1. Score 1 to 5.
2. Quote the specific sentence or missing element that drove the score.
3. Suggest the smallest change that would raise the score by one point.

# OUTPUT
Structured report: criterion, score, evidence (quoted), improvement. End with an overall verdict and the top three improvements ranked by expected impact.
# TASK
You want to compare prompts but only one prompt was provided.

# PROMPT
"""
[paste the prompt to check]
"""

# INSTRUCTION
Write a second version of the prompt that fixes its weakest criterion. Then score A (original) and B (your rewrite) on all six criteria in the rubric. End with the specific delta and why B is stronger.

# RUBRIC
1. Specificity: does the prompt define exact outputs, lengths, and formats?
2. Structure: are sections (role, task, context, output) clearly separated?
3. Clarity: is the language unambiguous and free of contradiction?
4. Output format: is the expected format explicit (JSON, prose, list, schema)?
5. Role and context: does the prompt frame who the model is and what the user needs?
6. Verification and safety: is there a self-check, fallback, or robustness consideration?
Under the hood
Why most prompts fail the same way.
Specificity, structure, clarity, output format, role and context, verification. Most production prompt failures trace back to one of these six. The rubric is not exhaustive but it is remarkably predictive.
The checker produces a grading meta-prompt. You run it in any LLM and that model does the scoring. Grading with a different model than the one you deploy on catches blind spots.
Prompts jump in score when you fix the lowest-rated criterion, not when you tweak across all six. Run the checker, find the 2/5, fix it, re-run. Three rounds beats twenty tiny edits.
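The fix-the-lowest-criterion loop can be sketched in plain Python. This is an illustration, not part of the checker; the criterion names and example scores are hypothetical.

```python
def next_fix(scores: dict[str, int]) -> tuple[str, int]:
    """Return the lowest-scoring criterion: the one fix worth making this round."""
    criterion = min(scores, key=scores.get)
    return criterion, scores[criterion]

round_one = {
    "specificity": 4, "structure": 3, "clarity": 4,
    "output format": 2, "role and context": 4, "verification": 3,
}
# Fix only the 2/5, then re-run the checker.
print(next_fix(round_one))  # → ('output format', 2)
```

After the fix, re-grade and call `next_fix` again on the new scores; stop when the minimum clears your severity threshold.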
Related free tools
Specialized generators for specific tasks.
FAQ
Questions about prompt scoring.
What does the prompt checker actually do?
It produces a scoring meta-prompt. You paste your current prompt, pick criteria and severity, and the checker returns a prompt you run in any LLM. That LLM scores your original prompt against six quality criteria with evidence and fix suggestions. The model does the scoring, so you keep control over which model grades and what it costs.
What are the six criteria?
Specificity (does the prompt define exact outputs and formats), structure (are sections clearly separated), clarity (is the language unambiguous), output format (is the expected format explicit), role and context (is the model's role framed), and verification (is there a self-check or robustness consideration). Together they cover the most common failure modes in production prompts.
Why should a different LLM grade my prompt?
Running the grader on the same model you deploy to produces biased scores. Using a different model as the grader gives you a second perspective. For example, if you ship on GPT-5 but grade with Claude, Claude spots weak points GPT would gloss over. For the strictest audits, grade with two different models and look at where they disagree.
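One way to operationalize the two-grader audit is to diff the two score sets and flag large gaps. A minimal sketch; the grader names, scores, and one-point cut-off below are illustrative assumptions.

```python
def disagreements(grader_a: dict[str, int],
                  grader_b: dict[str, int],
                  gap: int = 1) -> list[str]:
    """Criteria where the two graders differ by more than `gap` points —
    the spots worth a human look."""
    return [c for c in grader_a if abs(grader_a[c] - grader_b[c]) > gap]

# Hypothetical scores from two different grading models.
grader_1 = {"specificity": 4, "clarity": 2, "verification": 3}
grader_2 = {"specificity": 4, "clarity": 5, "verification": 3}
print(disagreements(grader_1, grader_2))  # → ['clarity']
```

Agreement on a low score is a reliable signal to fix; disagreement usually means the criterion is ambiguous in your prompt and needs a closer read.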
What is the difference between quick score, detailed, and compare?
Quick score returns a six-row table with 1-5 scores and one-sentence rationale. Detailed adds quoted evidence from your prompt and a specific improvement for each criterion. Compare runs two prompts head to head and declares a winner. Start with quick, move to detailed when you want to fix issues, use compare when A/B testing variants.
How do I pick severity?
Strict fails anything below 4/5 on any criterion, good for production prompts that hit customers. Balanced fails below 3/5, reasonable for most internal or ops-facing prompts. Lenient only flags issues below 2/5, useful for exploratory or prototype prompts where you want direction without blocking iteration. Match severity to how much the prompt's quality matters in deployment.
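The three severity levels reduce to simple fail thresholds. A sketch under the cut-offs stated above; the pass / needs-work split on the average (4.0) is an assumption, not something the checker prescribes.

```python
SEVERITY_THRESHOLDS = {"strict": 4, "balanced": 3, "lenient": 2}

def verdict(scores: dict[str, int], severity: str = "balanced") -> str:
    """Fail if any criterion falls below the severity threshold;
    otherwise pass or flag as needing work based on the average."""
    threshold = SEVERITY_THRESHOLDS[severity]
    failing = [c for c, s in scores.items() if s < threshold]
    if failing:
        return f"fail: {', '.join(failing)}"
    avg = sum(scores.values()) / len(scores)
    return "pass" if avg >= 4 else "needs work"  # 4.0 cut-off is illustrative

scores = {"specificity": 2, "structure": 4, "clarity": 4,
          "output format": 4, "role and context": 4, "verification": 4}
print(verdict(scores))  # → fail: specificity (2 falls below balanced's 3)
```

The same scores that fail under strict can pass under lenient, which is why severity should track deployment stakes rather than personal taste.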
Can I grade a system prompt and user prompt separately?
Yes. Paste the system prompt first, grade it. Then paste the user prompt template, grade it. Then paste the combination, grade the whole. Each gets its own six-criteria report and the issues tend to cluster differently. System prompts usually fail on role/context and verification. User prompts fail on specificity and output format.
Does a high score guarantee the prompt will work?
No. The rubric catches most design flaws, but the only real test is output quality on real inputs. Use the checker to catch obvious issues before you burn tokens testing. A prompt that scores 5/5 and still fails on real data is telling you the problem is not in the prompt's structure; it is in the task itself or the model's capability.
What should I do after I get a bad score?
Use the detailed rubric version, look at the improvement column, and apply the top one or two fixes. Do not try to fix all six at once. Re-run the checker. Scores usually jump 1 to 2 points per round. Two or three rounds take most prompts from 3/5 to 4.5/5. For the best results, pair this with the AI Prompt Optimiser to handle the actual rewrite.