LLM Sampling Parameter Planner — Temperature / top_p / top_k by task × provider
Sampling parameters matter more than most practitioners tune for. Temperature 0.7 on OpenAI produces noticeably different output than Temperature 0.7 on Claude because the providers scale the parameter differently. Temperature 0 does not guarantee deterministic output, due to GPU batching and tokenization ties. top_p and top_k overlap but address different distribution shapes, and frequency and presence penalties stack with different effects. This planner encodes the Holtzman et al. 2019 nucleus-sampling research, per-provider parameter availability (Anthropic exposes no penalties, OpenAI no top_k, Google defaults top_k to 40), and task-specific temperature ranges with tight/default/wide options for each of 8 task types across 5 providers: 40 configured cells with literature citations. Reference code in the OpenAI + Anthropic + Google + Mistral + Llama SDKs.
Pick a task type + LLM provider; tool recommends Temperature / top_p / top_k / frequency_penalty / presence_penalty / max_tokens with per-provider ranges + literature-backed rationale. Hover each parameter for tradeoff explanation + citations.
Try this: Anthropic does not expose frequency_penalty / presence_penalty; OpenAI gpt-4o does not expose top_k. Tool surfaces the available subset.
Sampling parameter quick-reference matrix
| Task category | Temperature | top_p | freq/pres penalty | Rationale |
|---|---|---|---|---|
| Classification / extract | 0 | 1.0 | 0 / 0 | Determinism primary; any variance breaks audit replay |
| Factual Q&A / RAG | 0.2 | 0.9 | 0 / 0 | Low variance to keep answers anchored to retrieved context |
| Code generation | 0.2-0.4 | 0.95 | 0 / 0 | Precision over creativity; repetition is legitimate (identifiers) |
| Structured output (JSON) | 0 | 1.0 | 0 / 0 | Schema adherence requires determinism |
| Summarization | 0.3-0.5 | 0.9 | 0.2 / 0 | Some variance for phrasing; freq-penalty discourages repetition |
| Creative writing | 0.8-1.0 | 0.92 | 0.5 / 0.5 | Both penalties avoid word + topic repetition in long output |
| Brainstorming / ideation | 1.0-1.2 | 0.95 | 0 / 0.4 | High temp for diversity; pres-penalty spreads topics |
| Chatbot / assistant | 0.5-0.7 | 0.9 | 0.3 / 0.1 | Balance between consistency and natural variation turn-to-turn |
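The matrix above can be encoded as a small lookup table. The sketch below is illustrative only — the key names and the `recommend` helper are assumptions for this example, not part of any SDK; the values are transcribed from the table.

```python
# Illustrative encoding of the quick-reference matrix above.
# Temperature is stored as a (low, high) range; a zero-width range means
# "use exactly this value".
SAMPLING_MATRIX = {
    "classification":  {"temperature": (0.0, 0.0), "top_p": 1.0,  "freq": 0.0, "pres": 0.0},
    "factual_qa":      {"temperature": (0.2, 0.2), "top_p": 0.9,  "freq": 0.0, "pres": 0.0},
    "code_gen":        {"temperature": (0.2, 0.4), "top_p": 0.95, "freq": 0.0, "pres": 0.0},
    "structured_json": {"temperature": (0.0, 0.0), "top_p": 1.0,  "freq": 0.0, "pres": 0.0},
    "summarization":   {"temperature": (0.3, 0.5), "top_p": 0.9,  "freq": 0.2, "pres": 0.0},
    "creative":        {"temperature": (0.8, 1.0), "top_p": 0.92, "freq": 0.5, "pres": 0.5},
    "brainstorm":      {"temperature": (1.0, 1.2), "top_p": 0.95, "freq": 0.0, "pres": 0.4},
    "chatbot":         {"temperature": (0.5, 0.7), "top_p": 0.9,  "freq": 0.3, "pres": 0.1},
}

def recommend(task: str, tight: bool = True) -> dict:
    """Return a concrete parameter set: tight -> low end of the range."""
    row = SAMPLING_MATRIX[task]
    lo, hi = row["temperature"]
    return {
        "temperature": lo if tight else hi,
        "top_p": row["top_p"],
        "frequency_penalty": row["freq"],
        "presence_penalty": row["pres"],
    }
```

A caller would then pass the returned dict to whatever SDK it targets, dropping keys the provider does not expose (e.g. the penalty keys for Anthropic).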
The planner above computes provider-specific T-normalized values because raw Temperature does not mean the same thing across OpenAI (range 0-2), Anthropic (range 0-1), Google Gemini (range 0-2), Mistral (range 0-1), and Llama (range 0-2). Paste your task category + target provider; the planner emits the parameter tuple (Temperature, top_p, frequency_penalty, presence_penalty) pre-normalized for direct SDK use.
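A minimal sketch of range-proportional normalization, assuming the provider maxima listed above. This is a heuristic (same fraction-of-range), not a provider-documented equivalence; the function name is hypothetical.

```python
# Provider temperature maxima as listed above.
T_MAX = {"openai": 2.0, "anthropic": 1.0, "google": 2.0, "mistral": 1.0, "llama": 2.0}

def normalize_temperature(t_reference: float, provider: str,
                          reference_max: float = 2.0) -> float:
    """Map a temperature on a reference 0-2 scale to the provider's scale,
    preserving the fraction of the valid range; clamp to that range."""
    fraction = t_reference / reference_max
    t = fraction * T_MAX[provider]
    return round(min(max(t, 0.0), T_MAX[provider]), 3)
```

For example, a reference temperature of 0.7 on the 0-2 scale maps to 0.35 on Anthropic's 0-1 scale and stays 0.7 on OpenAI's 0-2 scale.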
Why sampling parameters are under-tuned in production
Quick answer: most teams set Temperature once at deployment (often 0.7 or the provider default) and never revisit it. This is a mistake. Different task types have radically different optimal settings — a classification endpoint at Temperature 0.7 will show output variance across identical calls (unacceptable for audit); a creative-writing endpoint at Temperature 0.2 produces template-y repetitive output (unacceptable for engagement). Tuning Temperature + top_p per endpoint can yield on the order of 10-30% quality improvement on production benchmarks at zero compute cost.
Why Temperature 0 is not actually deterministic
Quick answer: in theory Temperature 0 picks the single highest-probability token at each step. In practice: (1) tokenization ties — when two tokens have exactly equal probability, the tie-breaker is floating-point ordering, which varies with GPU batch layout. (2) Provider batching — your request may be batched with others, and tensor-parallel shards produce slightly different floating-point reductions. (3) Temperature floor — some providers clamp T to 0.001 internally to avoid NaN in the log-sum-exp. For reproducible evaluation, set a seed (where available) + Temperature 0 + pin the model version + accept that providers may change model weights silently. OpenAI + Google + Mistral + Llama expose seed; Anthropic does not.
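The reproducibility checklist above can be sketched as a small request-builder. The function and the seed-support table are illustrative assumptions (parameter names follow common SDK conventions; verify against your SDK version), not verified API surface.

```python
# Which providers expose a seed parameter, per the text above.
SUPPORTS_SEED = {"openai": True, "google": True, "mistral": True,
                 "llama": True, "anthropic": False}

def reproducibility_params(provider: str, model_version: str,
                           seed: int = 42) -> dict:
    """Best-effort reproducible-eval settings: pin the model version,
    greedy-ish decoding, and a seed where the provider supports one."""
    params = {"model": model_version, "temperature": 0}
    if SUPPORTS_SEED[provider]:
        params["seed"] = seed  # Anthropic has no seed parameter to set
    return params
```

Even with all three set, treat reproducibility as best-effort: silent weight updates by the provider can still change outputs between runs.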
top_p vs top_k — which is right for your task
Quick answer: top_p (nucleus sampling, Holtzman 2019) truncates to the smallest set of tokens whose cumulative probability exceeds p, then samples. Adapts to distribution shape: sharp distributions get a small nucleus, flat distributions get a larger nucleus. top_k (Fan 2018) truncates to a fixed number of highest-probability tokens regardless of shape. For most tasks top_p is better — it adapts. Exception: hallucination-reduction in narrow-domain factual tasks where you want a hard cap on token selection; top_k 10-20 with Temperature 0 prevents rare-token excursions. Provider note: OpenAI gpt-4o does not expose top_k; Anthropic Claude + Google Gemini + Mistral + Llama do.
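The adaptive-vs-fixed distinction is easiest to see in code. A minimal pure-Python sketch of both truncation schemes (helper names are my own, not an SDK's):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(probs, k):
    """top_k (Fan et al. 2018): keep the k highest-probability tokens
    regardless of distribution shape, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Nucleus sampling (Holtzman et al. 2019): smallest set of tokens
    whose cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}
```

Running both on a sharp distribution (one dominant logit) versus a flat one shows the difference: top_p keeps one token in the sharp case and all four in the flat case, while top_k always keeps exactly k.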
Frequency vs presence penalty — they solve different problems
Quick answer: frequency_penalty subtracts n × penalty from the logit of each token, where n is occurrence count in context. Discourages repetition proportional to how often the token appeared. presence_penalty subtracts a flat penalty for any token that appeared at least once, regardless of count. Discourages topic-repetition + encourages topic-novelty. For creative writing use both (0.5 each) to avoid both word-repetition and topic-repetition. For code gen use 0 for both — legitimate code repeats identifiers and patterns. For factual Q&A use 0 — repeating key terms is correct content. Values above 1.5 cause tortured phrasing: the model avoids the best word because it was already used. Anthropic does not expose these; OpenAI + Mistral + Llama do.
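The two formulas above compose into one logit adjustment. A sketch of the OpenAI-style penalty application (token-string keys and the function name are illustrative; real implementations operate on token IDs over the full vocabulary):

```python
from collections import Counter

def apply_penalties(logits: dict, context_tokens: list,
                    frequency_penalty: float, presence_penalty: float) -> dict:
    """logit'[t] = logit[t] - count[t] * frequency_penalty
                            - (presence_penalty if t appeared at all else 0)"""
    counts = Counter(context_tokens)
    return {
        tok: logit
             - counts[tok] * frequency_penalty
             - (presence_penalty if counts[tok] > 0 else 0.0)
        for tok, logit in logits.items()
    }
```

Note how a token that appeared twice is pushed down by both penalties, while an unseen token is untouched — which is exactly why penalties at 0 are correct for code and factual Q&A, where re-using the same identifier or key term is the right behavior.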
Provider-specific temperature-range normalization
Quick answer: Temperature 0.7 on OpenAI (range 0-2) is at 35% of max range. Temperature 0.7 on Anthropic (range 0-1) is at 70% of max range. These produce noticeably different output diversity. When cross-provider A/B testing, do NOT use the same T value — normalize. A reasonable heuristic: OpenAI T=1.0 ≈ Anthropic T=0.7 ≈ Google T=1.0 ≈ Mistral T=0.6 ≈ Llama T=0.9. Exact calibration depends on task; run a diverse-output test (e.g. "write 5 alternative taglines") and tune each provider T to produce similar output diversity before comparing quality.
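The diverse-output test suggested above needs a diversity score to tune against. A minimal sketch using a distinct-unigram ratio (a common diversity proxy; the function is hypothetical and the generation step is not shown):

```python
def distinct_1(outputs: list) -> float:
    """Fraction of unique lowercase words across all generated outputs.
    Higher = more lexical diversity. 1.0 means no word repeats."""
    words = [w.lower() for text in outputs for w in text.split()]
    return len(set(words)) / len(words) if words else 0.0
```

Procedure: generate the same "write 5 alternative taglines" prompt N times per provider, then adjust each provider's T until the distinct-1 scores roughly match before running the quality comparison.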
What this planner does NOT do
Quick answer: it does not run your prompts through the providers. It does not benchmark quality. It does not handle streaming-specific parameters (stop sequences, stream-mode). It does not cover prompt-engineering techniques (few-shot, chain-of-thought, tree-of-thought) — those are a separate concern from sampling. It does not cover fine-tuned model parameters (LoRA rank, RLHF reward shaping). It does not estimate cost (use the model-cost-calculator for that). It does not advise on model choice — that is a separate workflow involving quality + cost + latency + availability tradeoffs. For production eval + A/B testing, use Promptfoo + LangSmith + Humanloop; this tool configures the sampling layer that eval tooling then tests.
Tool v1 · canonical sources cited inline above · runs entirely client-side, no data transmitted