Prompt Template Linter — 12 structural failure checks, per-dimension score, ranked recommendations

Most prompt failures are structural, not semantic. Missing role definition leaves the model picking generic-assistant voice; no output format specification produces inconsistent parseable JSON; negative-only instructions (do not hallucinate) leave the positive action space undefined; no fallback instruction means the model guesses when uncertain — the primary cause of production hallucination. This linter runs 12 pattern checks that don't require an LLM call — paste your system prompt, get structural assessment in 100ms. Replaces the 'read your prompt three times' heuristic with a systematic check that catches the patterns you don't notice when you wrote it.

Free Private Linter

Paste a system prompt — linter scores 12 structural failure classes (missing role, ambiguous verbs, missing output format, missing fallback, conflicting instructions, scope drift, token bloat) plus per-dimension breakdown (role / task-clarity / format / examples / constraints / fallback). Works for conversation, classification, extraction, generation, and agent prompts.

Privacy: prompts commonly contain proprietary instructions, brand voice, tool definitions, customer-specific context. Tool does NOT persist paste to URL-state. Client-side only; nothing transmitted.

Common prompt anti-patterns — reference

Prompt anti-patterns scored by frequency in production LLM failures with fix pattern
Anti-patternExampleFailure modeFix pattern
Vague role"You are a helpful assistant"Model defaults to generic tone + scopeSpecify domain + style + constraints explicitly
No fallback"Answer the user's question"Model hallucinates when it cannot answer"If unsure, reply exactly: I do not have that information"
Negative-only"Do not make things up"Model cannot use what-not-to-do as a goalPair with positive steer: "Respond only with verifiable facts"
No format spec"Summarize the document"Format drifts across calls — sometimes bullets, sometimes prose"Respond in a 3-sentence paragraph, no markdown"
0 or 10+ examplesNo few-shot, or 20 few-shot0 = weak steering; 10+ = token waste + example-overfit3-5 diverse examples covering edge cases
Ambiguous scope"Handle customer inquiries"Model answers off-topic questionsEnumerate allowed scope; specify out-of-scope handling
Buried instructionKey constraint in paragraph 3Model under-weights late-prompt contentLead with constraints; repeat critical ones at end
Contradictory rules"Be concise" + "Include all context"Model arbitrarily picks one; output inconsistentResolve priorities: "Be concise; prefer clarity over completeness"

Paste your system prompt into the linter above to get a specific anti-pattern breakdown with per-issue fix suggestions.

Why structural prompt issues matter more than most prompt engineers think

Quick answer: the failure modes that cause production prompts to perform inconsistently are usually structural, not semantic. The writer understands the intent perfectly — the model receives ambiguous direction. Common patterns: "You are a helpful assistant" (no specific role), "Help the user with their question" (no task specification, no format, no scope), "Do not hallucinate" (negative-only with no positive steering). These pass a casual read-through but produce inconsistent output under real traffic because the model is guessing at what format, scope, and fallback behavior you want.

This linter catches the patterns you miss when reading your own prompt. Fresh eyes + pattern-match rules surface issues the original author walked past. It is not a replacement for A/B testing against real eval sets — structural linting is a floor check; semantic correctness requires actual task-performance evaluation via eval harnesses like Promptfoo or LangSmith.

The 6 dimensions of prompt quality

Quick answer: structural prompt quality breaks down into 6 measurable dimensions. (1) ROLE CLARITY — does the prompt open with a specific role definition? "You are a [specific role]" beats generic "You are an assistant". (2) TASK CLARITY — are action verbs specific? "classify", "extract", "generate", "summarize" beat "help", "assist", "handle". (3) OUTPUT FORMAT — is the format explicitly specified? JSON schema, markdown structure, numbered list, line limits. (4) EXAMPLES — does a complex prompt include 1-5 few-shot examples covering common + edge cases? (5) CONSTRAINTS — are boundaries positive not negative? "Stay on topic X" beats "Do not discuss Y"; "Be concise (max 3 sentences)" beats "Be brief". (6) FALLBACK — does the prompt handle uncertainty? "If uncertain, say so" reduces hallucination more than any other single intervention.

Tool scores each dimension 0-100 with weighted average for overall score. Weights reflect empirical impact: task-clarity (25%), output-format (20%), examples + role + constraints (15% each), fallback (10%). Low scores on role, format, or fallback are common + high-impact to fix.

Fallback instruction — the single highest-ROI prompt improvement

Quick answer: adding "If uncertain, say so clearly. Do not guess. State what data would be needed to answer correctly." to a system prompt reduces hallucination rate 30-70% empirically across production LLM deployments. This single addition is the highest-ROI change most prompts can make. Anthropic + OpenAI both document this in official prompt-engineering guides. Yet surveys of production prompts show 40-60% lack any fallback instruction.

Mechanism: LLMs default to "helpful assistant mode" where they try to answer every question, confabulating details if the training data lacks grounding. Explicit fallback gives the model permission to say "I don't know" — which is usually the correct answer when the model genuinely doesn't know. Without permission, the model invents. Add the instruction + hallucination drops. Tool flags missing fallback as a WARN-severity issue on any prompt over 300 characters + recommends it as #1 highest-impact fix.

Negative-only direction — why "do not hallucinate" doesn't work

Quick answer: prompts that rely exclusively on negative instructions ("do not X", "never Y", "avoid Z") leave a vast undefined positive action space. The model understands what NOT to do but doesn't know what to do INSTEAD. Result: random behavior in the undefined space. Example: "Do not hallucinate" tells the model not to make things up, but doesn't tell it what to do when uncertain — so it hallucinates anyway because it thinks helpfulness is the positive direction. Pair with "Instead, say I do not have enough information to answer this precisely" + hallucination drops.

Rule of thumb: every "do not X" should be paired with a "do Y instead". Count negative instructions; if they exceed 3x positive instructions, rebalance. Tool surfaces this as a ratio (negative / positive verb count) + flags imbalance.

Examples — where 3-5 beats both 1 and 10

Quick answer: few-shot examples dramatically improve consistency on classification + extraction + generation tasks. Empirically, 3-5 examples cover the common case + 2-3 edge cases, yielding 15-40% quality improvement over zero-shot. Above 5 examples gives diminishing returns — each additional example adds context tokens without proportional quality gain. Above 10 examples hurts: context bloat, diluted signal, often models start copying example structure too literally rather than generalizing.

Curate examples for DIVERSITY, not quantity. Include: the common case, 1-2 edge cases (ambiguous inputs, unusual formats), 1 refusal case (inputs where the correct response is decline-or-clarify). Format: "Input: [...] Output: [...]". Consistent labeling across examples reinforces the pattern. Avoid cherry-picking easy examples; the hard ones are where examples help most.

Scope boundary — critical for agent prompts

Quick answer: agent prompts (tool-using, multi-step, persistent conversation) drift in undefined-scope settings. Without "Stay focused on [domain]; if asked about unrelated topics, redirect to [scope] or decline politely", agents answer off-topic questions, use tools for unrelated tasks, or lose focus across turns. Explicit scope boundary anchors agent behavior + improves consistency across long conversations.

Common anti-pattern: agent prompt defines capabilities but not scope. Agent can search the web, send email, and query databases — but nothing says which topics are in-scope. Result: user asks off-topic question, agent answers (wasting tool budget + confusing subsequent turns). Fix: add explicit scope ("You handle only X, Y, Z. For other requests, acknowledge + redirect").

Token bloat — politeness that doesn't help

Quick answer: politeness phrases (please, kindly, would you, could you, thank you) are a common form of token bloat in system prompts — they add tokens without changing model behavior. Models do not need politeness; instructions alone are understood. A 2000-token politeness-heavy prompt can often be reduced to 1700-1800 tokens without quality loss, saving 10-15% on every request.

Caveat: some evidence suggests politeness marginally helps Claude on edge cases (Anthropic engineering has noted this publicly). Effect is small + inconsistent; the token savings generally outweigh any marginal benefit. Exception: user-facing response tone should remain polite; system prompt instructional tone can be terse. Tool flags over-5 politeness phrases as token-bloat INFO.

Repeated instructions — models understand once

Quick answer: repeating the same instruction 3-5 times in a system prompt does not improve compliance. Models understand first mention; subsequent mentions add tokens without changing behavior. Common patterns: "Be helpful. Please be helpful. Remember to be helpful." or "Use JSON format. Output must be JSON. Return JSON only." Consolidate to a single emphatic statement.

Tool detects bigram repetition (4+ occurrences of the same two-word phrase, excluding common words like "the", "to", "of"). Usually reveals unintentional repetition that happens during prompt iteration — writer adds a reinforcement, forgets to remove the original. Consolidate.

What this linter does NOT catch

Four classes require different tools. (1) SEMANTIC CORRECTNESS — linter catches structural issues, not whether the prompt actually produces correct output for your task. A structurally-strong prompt can still be wrong for the task. Test via A/B evaluation against a held-out eval set. (2) MODEL-SPECIFIC OPTIMIZATION — some patterns work better on Claude vs GPT vs Gemini (e.g. XML tags for Claude, markdown for GPT). Linter is model-agnostic; per-model optimization requires vendor-specific prompt-engineering docs. (3) PROMPT INJECTION RESILIENCE — linter does not test whether the prompt is vulnerable to user-input injection attacks. Use dedicated injection-testing like PortSwigger's prompt-injection labs or Lakera red-team tool. (4) TASK-SPECIFIC CALIBRATION — linter uses general prompt-engineering heuristics; domain-specific prompts (medical, legal, code) have specialized requirements beyond general-purpose checks.

Sources + further reading

Anthropic Prompt Engineering Guide (docs.anthropic.com/en/docs/build-with-claude/prompt-engineering) — canonical source for Claude-specific prompt patterns including role definition, chain-of-thought, XML tags, few-shot examples. OpenAI Prompt Engineering Guide (platform.openai.com/docs/guides/prompt-engineering) — GPT-specific patterns covering structured outputs, function calling, temperature. Google Gemini Prompting Guide (ai.google.dev/gemini-api/docs/prompting-strategies) — complementary perspective. Schulhoff S. et al. (2024) "The Prompt Report: A Systematic Survey of Prompting Techniques," arXiv:2406.06608 — comprehensive open-source survey covering 58 prompting techniques across 6 categories. Anthropic Prompt Library (anthropic.com/prompts) + OpenAI Cookbook (cookbook.openai.com) — production-tested prompt examples. White J. et al. (2023) "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT," arXiv:2302.11382 — early academic framework for prompt patterns. For prompt evaluation: Promptfoo (promptfoo.dev), LangSmith (smith.langchain.com), Humanloop (humanloop.com) — production eval harnesses. For production hallucination reduction: Anthropic constitutional AI papers + OpenAI safety + alignment research.

Prompt Template Linter Tool v1 · canonical sources cited inline above · runs entirely client-side, no data transmitted

Informational & educational tool. Outputs are for educational purposes and do not constitute medical, legal, financial, tax, or professional advice, and are not a substitute for consultation with a qualified physician, attorney, accountant, or licensed professional. If a result indicates immediate danger — fire, carbon monoxide, acute exposure, or a life-threatening scenario — call emergency services (911 in the US, 995 in Singapore, or your local equivalent), evacuate if air quality is compromised, and ventilate affected areas immediately. For poisoning, contact your regional poison control center. Tool outputs should be verified against authoritative sources before relying on them for decisions with health, safety, legal, or financial consequences.