What are the 12 failure classes this linter checks?

(1) Missing role definition (no "You are" / "Your role" / "Act as" in first 200 chars); (2) too short (under 50 tokens / 200 chars — likely under-specified); (3) too long (over 2000 tokens / 8000 chars — cacheable, flag for prompt caching opportunity); (4) ambiguous action verbs ("help" / "assist" / "handle" without specific task noun); (5) missing output-format specification (no JSON / markdown / list / schema mention); (6) missing examples for complex prompts (over 800 chars with zero example markers); (7) too many examples (over 10 — context bloat); (8) negative-only direction (only "do not" / "never" / "avoid" without positive steering); (9) conflicting instructions (heuristic pair detection — concise/thorough, formal/casual); (10) missing scope boundary (no topic/domain constraint, critical for agents); (11) token bloat (redundant politeness phrases — over 5 "please" / "kindly" / "thank you"); (12) missing fallback / uncertainty handling (no "if uncertain" / "if cannot" branch — highest-ROI addition).

Why 3-5 examples and not more or fewer?

Empirically 3-5 examples cover the common case + 2-3 edge cases, yielding 15-40% quality improvement over zero-shot on classification / extraction / generation tasks. Above 5 examples gives diminishing returns — each additional example adds context tokens without proportional quality gain. Above 10 examples hurts: context bloat, diluted signal, models start copying example structure too literally rather than generalizing. Curate for DIVERSITY not quantity: common case + 1-2 edge cases (ambiguous inputs, unusual formats) + 1 refusal case (inputs where correct response is decline-or-clarify). Consistent labeling reinforces pattern. Avoid cherry-picking easy examples; hard ones teach most.

My prompt has 3000 tokens — should I compress or cache?

Both, in that order. FIRST compress: remove politeness phrases (please / kindly — token bloat without benefit), consolidate repeated instructions (models understand once), trim examples to most-diverse 3-5, eliminate conflicting instructions (pick one direction not both). Target 30-50% reduction; most verbose prompts carry that much redundancy. SECOND cache: remaining 1500-2000 token prompt is prime caching candidate. Anthropic (5-min TTL) + OpenAI (1-hour TTL) both offer prompt caching that cuts cached-prefix cost to 10% of rate + does NOT count cached tokens against TPM. Break-even: 2 reads per TTL window. For production apps with >10 req/TTL sharing preamble: always enable caching.

How does the task-type rubric differ?

5 task types adjust expectations. CONVERSATION: role + tone matter most; format can be open; short prompts OK. CLASSIFICATION: format MUST be specified (schema / enum); examples strongly recommended; fallback critical (what to do if input does not match any class). EXTRACTION: format MUST be JSON or structured; examples required; schema must be explicit. GENERATION: format often free but style should be specified; examples useful for style matching; constraints matter (length, tone, content). AGENT: scope MUST be defined; tool usage boundaries required; multi-step reasoning instructions useful; fallback for unsupported requests critical. Linter applies task-specific thresholds — an agent prompt without scope gets WARN severity; a conversation prompt without scope gets INFO.

Prompt Template Linter — 12 structural failure checks, per-dimension score, ranked recommendations

Name: Prompt Template Linter
Author: Kenny Tan

Most prompt failures are structural, not semantic. Missing role definition leaves the model picking generic-assistant voice; no output format specification produces inconsistent parseable JSON; negative-only instructions (do not hallucinate) leave the positive action space undefined; no fallback instruction means the model guesses when uncertain — the primary cause of production hallucination. This linter runs 12 pattern checks that don't require an LLM call — paste your system prompt, get structural assessment in 100ms. Replaces the 'read your prompt three times' heuristic with a systematic check that catches the patterns you don't notice when you wrote it.

Free Private Linter

What this prompt linter delivers

12 structural failure checks — missing role, too short, too long-cacheable, ambiguous verbs, missing format, missing examples, too many examples, negative-only direction, conflicting instructions, missing scope, token bloat, repeated phrases, missing fallback. Each flagged with fix guidance.
Per-dimension scoring breakdown — role clarity / task clarity / output format / examples / constraints / fallback — each scored 0-100 with color-coded progress bar. Reveals where the prompt is strong vs structurally weak.
Task-type-aware rubric — 5 task types (conversation / classification / extraction / generation / agent) each with different expectations. Extraction prompts MUST specify format; agent prompts MUST define scope; conversation can be more open.
Impact-ranked recommendations — 5 highest-ROI improvements ordered by measured empirical impact. Fallback instruction is #1 (reduces hallucination 30-70%); explicit role is #2; format specification is #3.
Bigram-repetition detection — catches "please help" / "do not" repeated 4+ times — common token-bloat pattern that models do NOT benefit from. Models understand instructions once; repetition burns tokens.
Positive-vs-negative direction ratio — counts instructional verbs. Over 3x negative ("do not", "never", "avoid") vs positive ("do", "generate", "return") flags negative-only direction. Models respond better to "do X" than "do not do Y".
Conflict-pair heuristic — detects patterns like "be concise" + "be thorough" coexisting — forces the model to pick one semi-randomly per call, causing output variance.
3 paste-ready demo prompts — strong (senior engineer triage classifier with examples + format + fallback + scope), thin (one-line "help the user"), bloated (verbose politeness + contradictions + redundancy). Shows what each end of the spectrum looks like.
100% client-side — prompts often contain proprietary instructions, brand voice, tool definitions, customer-specific context. Paste field NOT persisted to URL-state (opsec: no leak via referrer / history / shared links).

Failure pattern catalog derived from Anthropic prompt-engineering documentation (docs.anthropic.com/en/docs/build-with-claude/prompt-engineering), OpenAI prompt-engineering guide (platform.openai.com/docs/guides/prompt-engineering), Google Gemini prompting guide, Schulhoff S. et al. (2024) "The Prompt Report" open-source survey (arXiv:2406.06608), and empirical failure-mode analysis from production LLM deployments 2024-2026. Bigram repetition + politeness-count thresholds calibrated from inspection of publicly-shared system prompts. Linter is structural — semantic correctness + task-specific performance require A/B testing against a held-out eval set (Promptfoo, LangSmith, Humanloop).

Paste a system prompt — linter scores 12 structural failure classes (missing role, ambiguous verbs, missing output format, missing fallback, conflicting instructions, scope drift, token bloat) plus per-dimension breakdown (role / task-clarity / format / examples / constraints / fallback). Works for conversation, classification, extraction, generation, and agent prompts.

Privacy: prompts commonly contain proprietary instructions, brand voice, tool definitions, customer-specific context. Tool does NOT persist paste to URL-state. Client-side only; nothing transmitted.

Prompt task type Different task types have different prompt quality expectations. Linter adjusts rubric per type. System prompt Paste just the system message, not the full conversation. If you have a role + instructions in separate files, concatenate before pasting.

Common prompt anti-patterns — reference

Prompt anti-patterns scored by frequency in production LLM failures with fix pattern
Anti-pattern	Example	Failure mode	Fix pattern
Vague role	"You are a helpful assistant"	Model defaults to generic tone + scope	Specify domain + style + constraints explicitly
No fallback	"Answer the user's question"	Model hallucinates when it cannot answer	"If unsure, reply exactly: I do not have that information"
Negative-only	"Do not make things up"	Model cannot use what-not-to-do as a goal	Pair with positive steer: "Respond only with verifiable facts"
No format spec	"Summarize the document"	Format drifts across calls — sometimes bullets, sometimes prose	"Respond in a 3-sentence paragraph, no markdown"
0 or 10+ examples	No few-shot, or 20 few-shot	0 = weak steering; 10+ = token waste + example-overfit	3-5 diverse examples covering edge cases
Ambiguous scope	"Handle customer inquiries"	Model answers off-topic questions	Enumerate allowed scope; specify out-of-scope handling
Buried instruction	Key constraint in paragraph 3	Model under-weights late-prompt content	Lead with constraints; repeat critical ones at end
Contradictory rules	"Be concise" + "Include all context"	Model arbitrarily picks one; output inconsistent	Resolve priorities: "Be concise; prefer clarity over completeness"

Paste your system prompt into the linter above to get a specific anti-pattern breakdown with per-issue fix suggestions.

Why structural prompt issues matter more than most prompt engineers think

Quick answer: the failure modes that cause production prompts to perform inconsistently are usually structural, not semantic. The writer understands the intent perfectly — the model receives ambiguous direction. Common patterns: "You are a helpful assistant" (no specific role), "Help the user with their question" (no task specification, no format, no scope), "Do not hallucinate" (negative-only with no positive steering). These pass a casual read-through but produce inconsistent output under real traffic because the model is guessing at what format, scope, and fallback behavior you want.

This linter catches the patterns you miss when reading your own prompt. Fresh eyes + pattern-match rules surface issues the original author walked past. It is not a replacement for A/B testing against real eval sets — structural linting is a floor check; semantic correctness requires actual task-performance evaluation via eval harnesses like Promptfoo or LangSmith.

The 6 dimensions of prompt quality

Quick answer: structural prompt quality breaks down into 6 measurable dimensions. (1) ROLE CLARITY — does the prompt open with a specific role definition? "You are a [specific role]" beats generic "You are an assistant". (2) TASK CLARITY — are action verbs specific? "classify", "extract", "generate", "summarize" beat "help", "assist", "handle". (3) OUTPUT FORMAT — is the format explicitly specified? JSON schema, markdown structure, numbered list, line limits. (4) EXAMPLES — does a complex prompt include 1-5 few-shot examples covering common + edge cases? (5) CONSTRAINTS — are boundaries positive not negative? "Stay on topic X" beats "Do not discuss Y"; "Be concise (max 3 sentences)" beats "Be brief". (6) FALLBACK — does the prompt handle uncertainty? "If uncertain, say so" reduces hallucination more than any other single intervention.

Tool scores each dimension 0-100 with weighted average for overall score. Weights reflect empirical impact: task-clarity (25%), output-format (20%), examples + role + constraints (15% each), fallback (10%). Low scores on role, format, or fallback are common + high-impact to fix.

Fallback instruction — the single highest-ROI prompt improvement

Quick answer: adding "If uncertain, say so clearly. Do not guess. State what data would be needed to answer correctly." to a system prompt reduces hallucination rate 30-70% empirically across production LLM deployments. This single addition is the highest-ROI change most prompts can make. Anthropic + OpenAI both document this in official prompt-engineering guides. Yet surveys of production prompts show 40-60% lack any fallback instruction.

Mechanism: LLMs default to "helpful assistant mode" where they try to answer every question, confabulating details if the training data lacks grounding. Explicit fallback gives the model permission to say "I don't know" — which is usually the correct answer when the model genuinely doesn't know. Without permission, the model invents. Add the instruction + hallucination drops. Tool flags missing fallback as a WARN-severity issue on any prompt over 300 characters + recommends it as #1 highest-impact fix.

Negative-only direction — why "do not hallucinate" doesn't work

Quick answer: prompts that rely exclusively on negative instructions ("do not X", "never Y", "avoid Z") leave a vast undefined positive action space. The model understands what NOT to do but doesn't know what to do INSTEAD. Result: random behavior in the undefined space. Example: "Do not hallucinate" tells the model not to make things up, but doesn't tell it what to do when uncertain — so it hallucinates anyway because it thinks helpfulness is the positive direction. Pair with "Instead, say I do not have enough information to answer this precisely" + hallucination drops.

Rule of thumb: every "do not X" should be paired with a "do Y instead". Count negative instructions; if they exceed 3x positive instructions, rebalance. Tool surfaces this as a ratio (negative / positive verb count) + flags imbalance.

Examples — where 3-5 beats both 1 and 10

Quick answer: few-shot examples dramatically improve consistency on classification + extraction + generation tasks. Empirically, 3-5 examples cover the common case + 2-3 edge cases, yielding 15-40% quality improvement over zero-shot. Above 5 examples gives diminishing returns — each additional example adds context tokens without proportional quality gain. Above 10 examples hurts: context bloat, diluted signal, often models start copying example structure too literally rather than generalizing.

Curate examples for DIVERSITY, not quantity. Include: the common case, 1-2 edge cases (ambiguous inputs, unusual formats), 1 refusal case (inputs where the correct response is decline-or-clarify). Format: "Input: [...] Output: [...]". Consistent labeling across examples reinforces the pattern. Avoid cherry-picking easy examples; the hard ones are where examples help most.

Scope boundary — critical for agent prompts

Quick answer: agent prompts (tool-using, multi-step, persistent conversation) drift in undefined-scope settings. Without "Stay focused on [domain]; if asked about unrelated topics, redirect to [scope] or decline politely", agents answer off-topic questions, use tools for unrelated tasks, or lose focus across turns. Explicit scope boundary anchors agent behavior + improves consistency across long conversations.

Common anti-pattern: agent prompt defines capabilities but not scope. Agent can search the web, send email, and query databases — but nothing says which topics are in-scope. Result: user asks off-topic question, agent answers (wasting tool budget + confusing subsequent turns). Fix: add explicit scope ("You handle only X, Y, Z. For other requests, acknowledge + redirect").

Token bloat — politeness that doesn't help

Quick answer: politeness phrases (please, kindly, would you, could you, thank you) are a common form of token bloat in system prompts — they add tokens without changing model behavior. Models do not need politeness; instructions alone are understood. A 2000-token politeness-heavy prompt can often be reduced to 1700-1800 tokens without quality loss, saving 10-15% on every request.

Caveat: some evidence suggests politeness marginally helps Claude on edge cases (Anthropic engineering has noted this publicly). Effect is small + inconsistent; the token savings generally outweigh any marginal benefit. Exception: user-facing response tone should remain polite; system prompt instructional tone can be terse. Tool flags over-5 politeness phrases as token-bloat INFO.

Repeated instructions — models understand once

Quick answer: repeating the same instruction 3-5 times in a system prompt does not improve compliance. Models understand first mention; subsequent mentions add tokens without changing behavior. Common patterns: "Be helpful. Please be helpful. Remember to be helpful." or "Use JSON format. Output must be JSON. Return JSON only." Consolidate to a single emphatic statement.

Tool detects bigram repetition (4+ occurrences of the same two-word phrase, excluding common words like "the", "to", "of"). Usually reveals unintentional repetition that happens during prompt iteration — writer adds a reinforcement, forgets to remove the original. Consolidate.

What this linter does NOT catch

Four classes require different tools. (1) SEMANTIC CORRECTNESS — linter catches structural issues, not whether the prompt actually produces correct output for your task. A structurally-strong prompt can still be wrong for the task. Test via A/B evaluation against a held-out eval set. (2) MODEL-SPECIFIC OPTIMIZATION — some patterns work better on Claude vs GPT vs Gemini (e.g. XML tags for Claude, markdown for GPT). Linter is model-agnostic; per-model optimization requires vendor-specific prompt-engineering docs. (3) PROMPT INJECTION RESILIENCE — linter does not test whether the prompt is vulnerable to user-input injection attacks. Use dedicated injection-testing like PortSwigger's prompt-injection labs or Lakera red-team tool. (4) TASK-SPECIFIC CALIBRATION — linter uses general prompt-engineering heuristics; domain-specific prompts (medical, legal, code) have specialized requirements beyond general-purpose checks.

Sources + further reading

Anthropic Prompt Engineering Guide (docs.anthropic.com/en/docs/build-with-claude/prompt-engineering) — canonical source for Claude-specific prompt patterns including role definition, chain-of-thought, XML tags, few-shot examples. OpenAI Prompt Engineering Guide (platform.openai.com/docs/guides/prompt-engineering) — GPT-specific patterns covering structured outputs, function calling, temperature. Google Gemini Prompting Guide (ai.google.dev/gemini-api/docs/prompting-strategies) — complementary perspective. Schulhoff S. et al. (2024) "The Prompt Report: A Systematic Survey of Prompting Techniques," arXiv:2406.06608 — comprehensive open-source survey covering 58 prompting techniques across 6 categories. Anthropic Prompt Library (anthropic.com/prompts) + OpenAI Cookbook (cookbook.openai.com) — production-tested prompt examples. White J. et al. (2023) "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT," arXiv:2302.11382 — early academic framework for prompt patterns. For prompt evaluation: Promptfoo (promptfoo.dev), LangSmith (smith.langchain.com), Humanloop (humanloop.com) — production eval harnesses. For production hallucination reduction: Anthropic constitutional AI papers + OpenAI safety + alignment research.

Kenny Tan Co-founder & Technical Lead

AI Model Evaluation Systems · LLM Benchmark Infrastructure · Prompt Engineering Research

Cross-domain expertise in software engineering, content systems, and infrastructure architecture.

Reviewed by Shanire — Co-founder & Creative Lead

21 April 2026 Also at: kennytan.net

Prompt Template Linter Tool v1 · canonical sources cited inline above · runs entirely client-side, no data transmitted

Tell us what could be better

Two questions. Takes 30 seconds. We read every reply.

More tools

AI Token Counter Counter · Count tokens

AI Context Window Planner Planner · Plan a context budget

RAG Chunking Planner Planner · Plan a RAG chunk + index budget

Informational & educational tool. Outputs are for educational purposes and do not constitute medical, legal, financial, tax, or professional advice, and are not a substitute for consultation with a qualified physician, attorney, accountant, or licensed professional. If a result indicates immediate danger — fire, carbon monoxide, acute exposure, or a life-threatening scenario — call emergency services (911 in the US, 995 in Singapore, or your local equivalent), evacuate if air quality is compromised, and ventilate affected areas immediately. For poisoning, contact your regional poison control center. Tool outputs should be verified against authoritative sources before relying on them for decisions with health, safety, legal, or financial consequences.