Prompt Engineering for Production: Moving Beyond Demos
Prompts that work in demos fail in production because demo inputs have no variance.
There's a gap between prompts that impress in a demo and prompts that work reliably across thousands of real user inputs. Most teams cross from demo to production without realizing they're solving a different problem.
Demo prompt engineering: "make this work for my examples."
Production prompt engineering: "make this work for the distribution of inputs I haven't seen yet, including weird ones."
Here's what changes when you make that shift.
The distribution problem
Your demo inputs are curated. They're the inputs you thought of, formatted in ways that made sense to you, at the length you naturally gravitate toward.
Real user inputs are:
- Shorter or longer than expected
- Ambiguous in ways you didn't anticipate
- Multilingual or code-switched
- Riddled with typos, truncated context, and implicit references
- Aimed at tasks outside the intended scope
A production prompt needs to handle all of these gracefully. Not perfectly — but with defined, predictable behavior on edge cases rather than silent failures or confusing outputs.
The first step to production-ready prompt engineering is building a test set from real inputs as early as possible. Even 30-50 real examples will surface distribution issues that no amount of synthetic testing will catch.
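A sketch of what that can look like in practice, assuming your application already logs requests as JSONL; the file and field names here are placeholders, not a standard:

import json
import random

# Hypothetical sketch: sample real inputs from a request log into a small test set.
# "request_log.jsonl" and the field names are assumptions about your own logging.
with open("request_log.jsonl") as f:
    logged = [json.loads(line) for line in f]

sample = random.sample(logged, k=min(50, len(logged)))

with open("test_set.jsonl", "w") as f:
    for entry in sample:
        f.write(json.dumps({
            "input": entry["user_message"],  # assumed log field
            "expected": "TODO: fill in the acceptance criterion for this case",
        }) + "\n")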
Constraint-first design
Most prompt engineering advice focuses on instruction quality: be clear, be specific, give examples. That's all correct. But there's a layer below instructions that matters more in production: constraints.
Constraints define what the model should NOT do:
You are a customer support assistant for Acme Inc.
CONSTRAINTS:
- Do not discuss competitor products
- Do not make pricing commitments not listed in the product catalog
- Do not provide legal or medical advice; redirect to appropriate resources
- If asked about topics outside your scope, say: "I can help with [X]. For [other topic], you'll want to contact [resource]."
Instructions tell the model what to do on the happy path. Constraints tell it how to behave on the edges. In production, the edges are where you live.
A useful design exercise: for every prompt you write, write five things the model should NEVER do. Then add explicit constraints for each one.
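One way to make constraints enforceable rather than aspirational is a rule-based check on the output before it reaches the user. A minimal sketch, assuming simple keyword matching; the phrase lists are illustrative placeholders, not a real blocklist:

# Minimal rule-based constraint check on model output.
CONSTRAINT_PHRASES = {
    "mentions_competitor": ["competitorco", "rivalsoft"],
    "unlisted_pricing": ["special discount", "i can offer you"],
    "legal_or_medical_advice": ["legal advice", "you should sue", "diagnosis"],
}

def violated_constraints(output: str) -> list[str]:
    text = output.lower()
    return [
        name
        for name, phrases in CONSTRAINT_PHRASES.items()
        if any(phrase in text for phrase in phrases)
    ]

# If violated_constraints(output) is non-empty, regenerate, fall back to a
# canned response, or escalate instead of sending the output to the user.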
Format as a reliability tool
Output format consistency is one of the most underappreciated reliability levers. Inconsistent output format means:
- Downstream parsing breaks unpredictably
- UI components need conditional logic
- Users see inconsistent experiences
Two patterns that enforce format:
1. JSON mode with explicit schema
Respond with valid JSON matching this schema:
{
  "intent": "<string: one of [search, navigate, create, edit, delete]>",
  "target": "<string: the object the user wants to act on>",
  "parameters": "<object: key-value pairs extracted from the request>",
  "confidence": "<string: one of [high, medium, low]>"
}
2. Output template
Structure your response exactly as follows:
SUMMARY: [one sentence, max 20 words]
DETAIL: [2-3 sentences]
NEXT_STEP: [one actionable sentence or "None"]
The second approach is good for human-readable output where full JSON is unnecessary. Either approach is dramatically more reliable than free-form text output.
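The payoff shows up in the code that consumes the output. A rough sketch of parsers for both patterns, mirroring the field names in the examples above; whether to retry or fall back on a ValueError is left to the caller:

import json
import re

VALID_INTENTS = {"search", "navigate", "create", "edit", "delete"}

def parse_json_output(raw: str) -> dict:
    # Reject anything that doesn't match the schema so the caller can retry or fall back.
    data = json.loads(raw)
    if data.get("intent") not in VALID_INTENTS:
        raise ValueError(f"unexpected intent: {data.get('intent')!r}")
    if not isinstance(data.get("parameters"), dict):
        raise ValueError("parameters must be a JSON object")
    return data

TEMPLATE = re.compile(
    r"SUMMARY:\s*(?P<summary>.+?)\s*DETAIL:\s*(?P<detail>.+?)\s*NEXT_STEP:\s*(?P<next_step>.+)",
    re.DOTALL,
)

def parse_template_output(raw: str) -> dict:
    match = TEMPLATE.search(raw)
    if match is None:
        raise ValueError("output did not follow the SUMMARY/DETAIL/NEXT_STEP template")
    return {key: value.strip() for key, value in match.groupdict().items()}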
Decomposition over complexity
As prompts grow, reliability degrades. A prompt that does five things produces output that's inconsistently good across all five. The model is trying to optimize a complex objective, and it will trade off different goals against each other depending on the specific input.
The fix is decomposition: split a complex prompt into a chain of simple prompts, each with one clear task.
Bad (one complex prompt):
Extract the user's intent, classify it, format the response according to their detected expertise level, and add a follow-up question if the query seems unresolved.
Better (four simple prompts chained):
Step 1: Extract intent → structured output
Step 2: Classify expertise level → enum output
Step 3: Format response for that expertise level → formatted text
Step 4: Assess resolution → bool, generate follow-up if needed
Each step is testable independently. Failures are isolated and easy to fix. The tradeoff is latency — 4 calls vs. 1. For many use cases, the reliability gain is worth it.
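A sketch of that chain as code, where complete() stands in for whichever model client you use; the prompt names and string plumbing are illustrative assumptions:

from typing import Callable

# Each step has one job and can be tested in isolation. complete(prompt_name, input)
# is a placeholder for your model call; the prompt names are assumptions.
def handle_query(user_message: str, complete: Callable[[str, str], str]) -> str:
    intent = complete("extract_intent", user_message)          # step 1: structured output
    expertise = complete("classify_expertise", user_message)   # step 2: enum output
    answer = complete(                                          # step 3: formatted text
        "format_response",
        f"intent: {intent}\nexpertise: {expertise}\nquery: {user_message}",
    )
    follow_up = complete(                                       # step 4: empty if resolved
        "assess_resolution",
        f"query: {user_message}\nanswer: {answer}",
    )
    return answer if not follow_up.strip() else f"{answer}\n\n{follow_up}"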
Few-shot examples: what they actually do
Few-shot examples work by establishing the distribution of expected inputs and outputs. They tell the model: "this is the space we're operating in, these are the kinds of inputs I care about, these are the kinds of outputs I expect."
Common mistakes with few-shot examples:
- Too similar: examples from the same narrow input type won't generalize
- Not covering edge cases: the examples only show happy path inputs
- Misaligned output style: the example outputs don't match what you actually want
A well-designed few-shot example set covers:
- The core use case (2-3 examples)
- Edge cases you know about (1-2 examples)
- A "scope limit" example showing how to handle out-of-scope queries
Don't add more than 5-6 examples. Beyond that, the model starts pattern-matching to examples rather than following instructions — which produces brittle, input-sensitive behavior.
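As an illustration of that composition for a hypothetical intent extractor; the contents are placeholders, and what matters is the mix of core, edge, and scope-limit cases:

# Illustrative few-shot set: core cases, one known edge case, one scope-limit case.
FEW_SHOT_EXAMPLES = [
    {"input": "find last month's invoices",
     "output": '{"intent": "search", "target": "invoices"}'},            # core
    {"input": "open the billing settings",
     "output": '{"intent": "navigate", "target": "billing settings"}'},  # core
    {"input": "the one from yesterday",
     "output": '{"intent": "search", "target": "unknown"}'},             # edge: ambiguous reference
    {"input": "write me a poem about invoices",
     "output": '{"intent": "out_of_scope", "target": null}'},            # scope limit
]

def build_prompt(instructions: str, user_input: str) -> str:
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in FEW_SHOT_EXAMPLES
    )
    return f"{instructions}\n\n{shots}\n\nInput: {user_input}\nOutput:"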
Eval-driven iteration
The only way to know if a prompt change actually improved things is to measure it. This requires:
- A representative test set (pull from real logs when possible)
- Defined pass/fail criteria for each test case
- Automated evaluation (LLM-as-judge, rule-based checks, or both)
- Version tracking so you can compare old vs. new
Without this, you're iterating on intuition in the dark. "I feel like this is better" is not a production engineering standard.
A minimal eval workflow:
- 50 test cases across core use case, edge cases, and failure modes
- Each case has an expected output or acceptance criterion
- Run evals before AND after any prompt change
- Don't deploy if eval score drops, even on metrics you weren't trying to improve
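A sketch of that workflow as a small harness, where run_prompt and passes are placeholders for the model call and the per-case acceptance check (rule-based, LLM-as-judge, or both); the file layout is an assumption:

import json
from typing import Callable

def run_evals(
    test_set_path: str,
    run_prompt: Callable[[str], str],
    passes: Callable[[str, dict], bool],
    results_path: str = "eval_results.json",
) -> float:
    results = []
    with open(test_set_path) as f:
        for line in f:
            case = json.loads(line)
            output = run_prompt(case["input"])
            results.append({
                "input": case["input"],
                "output": output,
                "passed": passes(output, case),
            })
    score = sum(r["passed"] for r in results) / len(results)
    with open(results_path, "w") as f:
        json.dump({"score": score, "results": results}, f, indent=2)
    return score  # compare against the previous version's score before deploying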
Prompt versioning
Production prompts change. New requirements, discovered edge cases, and model updates all require prompt changes. Without version control, you lose:
- The ability to roll back a bad change
- Context on why a change was made
- The eval results that validated a version
Treat prompts as first-class code artifacts:
- Store in version control alongside code
- Commit messages that explain what changed and why
- Tag each version with the eval results that approved it
- A clear rollback process for production incidents
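A sketch of what that can look like on disk, assuming one directory per prompt version holding the prompt text and the metadata that approved it; the layout and field names are assumptions, not a standard:

import json
from pathlib import Path

# Assumed layout: prompts/<name>/<version>/{prompt.txt, meta.json}, kept under version control.
# meta.json records why the version exists and the eval score that approved it.
def load_prompt(name: str, version: str) -> tuple[str, dict]:
    base = Path("prompts") / name / version
    prompt = (base / "prompt.txt").read_text()
    meta = json.loads((base / "meta.json").read_text())
    return prompt, meta

# Rolling back after an incident is then just pointing the service at the previous version.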
Production prompt engineering is mostly about controlling variance: building prompts that produce predictable, consistent behavior across the full distribution of inputs you'll encounter — not just the ones you designed for.
The techniques above aren't exotic. They're the boring fundamentals that the best AI engineering teams apply consistently. Start with your test set, add constraints, enforce format, decompose complexity, and measure everything.
For a comprehensive checklist covering prompt design, evals, and production reliability, see the AI Production Reliability Checklist.