LLM Agnostic Architecture: Build with REST Prompts
Last quarter, a fintech team swapped their classification model from GPT-4o to Claude Sonnet for cost reasons. The migration took two weeks. Not because the APIs were incompatible. Because the system prompts were copy-pasted across four microservices, each with slightly different wording, and nobody was sure which string was canonical.
They had built provider-agnostic inference (OpenAI SDK here, Anthropic SDK there) but not provider-agnostic prompts. The instructional text was still married to each service's deploy cycle.
That gap is common. Teams treat model selection as configuration and prompt text as code. For LLM-agnostic architecture, both need to live outside the application binary.
What LLM-agnostic actually means
"LLM agnostic" gets used loosely. Sometimes it means "we can call multiple APIs." Sometimes it means "we use LangChain." Neither is sufficient.
A genuinely provider-agnostic AI application has three separable layers:
- Prompt content: the natural language that defines persona, rules, and output format
- Inference routing: which model, which endpoint, which API key
- Application logic: orchestration, tool calls, business rules, persistence
Layers 2 and 3 belong in code. Layer 1 should not. When prompts live in source files, every provider swap, every tone tweak, and every compliance edit drags the full stack along.
The 2025 State of AI Engineering Survey found that teams commonly run 2–3 LLM providers simultaneously. Cost, capability fit, and failover all push toward multi-provider setups. If your prompts are hardcoded per service, that flexibility becomes expensive fast.
REST-delivered prompts fix the coupling. Your application fetches plain text from a prompt registry, then passes that string to whichever SDK you call. The registry does not know or care whether the next hop is OpenAI, Anthropic, a local Ollama instance, or AWS Bedrock.
The REST prompt pattern
The pattern is deliberately boring. That is the point.
Application startup or request handler
→ HTTP call to prompt registry (GET or POST with variables)
→ Receive { content: "You are a ..." }
→ Pass content to provider SDK as system instruction
→ Run inference as usual
PromptForge exposes this at POST /api/v1/prompts/{id} with optional variables and version parameters. Any equivalent registry works the same way. What matters is the contract: plain text in, plain text out.
Here is a minimal TypeScript example that works across providers:
async function fetchPrompt(
promptId: string,
variables: Record<string, string>,
version: "latest" | "stable" | number = "stable",
): Promise<string> {
const res = await fetch(
`https://www.promptforge-app.com/api/v1/prompts/${promptId}`,
{
method: "POST",
headers: {
Authorization: `Bearer ${process.env.PROMPTFORGE_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({ version, variables }),
},
);
const { content } = await res.json();
return content as string;
}
// OpenAI
const system = await fetchPrompt("support-agent", { tone: "formal" });
await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "system", content }, { role: "user", content: userMessage }],
});
// Anthropic: same string, different parameter name
await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
system,
messages: [{ role: "user", content: userMessage }],
});
// Ollama: same string, local endpoint
await fetch("http://localhost:11434/v1/chat/completions", {
method: "POST",
body: JSON.stringify({
model: "llama3.3",
messages: [{ role: "system", content }, { role: "user", content: userMessage }],
}),
});
The prompt fetch is identical. Only the inference call changes. That is the architecture.
For provider-specific parameter names and pitfalls, see our LLM-specific prompt management hub and the integration guides for OpenAI, Anthropic, DeepSeek, and Ollama.
Four design decisions that keep you agnostic
1. Store templates, not finished strings
Hardcoding "You are a helpful assistant." in a fetch response handler is still hardcoding. Store templates with {{variable}} placeholders so one prompt serves multiple contexts:
You are a {{role}} for {{product}}. Respond in {{language}}.
Always follow {{compliance_framework}} guidelines.
Interpolation happens at fetch time. Version history tracks template changes, not every possible variable combination.
2. Pin versions in production, float in staging
Use _version=latest in development so every save is visible immediately. Use _version=stable in production so prompts only change when someone promotes a version. We covered the channel model in Stable Is the New Production.
Provider swaps become safer when you can say: "production is running prompt v14 on Claude; staging is testing v17 on GPT-4o." The version number travels with the text, independent of model ID.
3. Keep tool schemas in code, instructions in the registry
Function calling, JSON schemas, and API tool definitions are structural. They belong in code with type checking and tests. The natural language that tells the model when and how to use those tools belongs in the prompt registry.
Mixing both into one giant string in a .ts file is how teams end up with undeployable prompt edits because someone touched a Zod schema in the same commit.
4. Abstract the fetch, not the SDK
Do not wrap OpenAI and Anthropic behind a single generate() function unless you have a strong reason. SDKs diverge in streaming, tool use, and error shapes. Abstract prompt retrieval only:
interface PromptClient {
get(id: string, vars: Record<string, string>, version?: string): Promise<string>;
}
Your services inject PromptClient. Each service still calls its provider SDK directly. You gain testability (mock the prompt layer) without fighting the lowest common denominator.
What this buys you in production
Faster provider migration. Swap model: "gpt-4o" to model: "claude-sonnet-4-20250514" and keep the same prompt ID. Retune wording in the registry if the new model needs different instructions. No string hunting across repos.
Independent iteration. Product can adjust tone and compliance language without a deploy. Engineering owns inference reliability. The Amplify survey reports 70% of teams update prompts monthly; decoupling prevents that cadence from bottlenecking on CI/CD.
Consistent audit trail. One version history per prompt, regardless of which model consumed it. When a user complaint arrives, you check which prompt version was active, not which git commit shipped Tuesday.
Failover without forked prompts. If OpenAI returns 503, route to a backup provider using the same fetched system string. You are not maintaining duplicate prompt copies per vendor.
Honest limitations
REST prompts add a network hop. PromptForge averages under 50 ms per fetch; still non-zero. For latency-critical paths, cache the response in-process with a 30 to 120 second TTL. You trade prompt freshness for speed. Pick the TTL based on how often your team actually edits prompts.
Air-gapped environments cannot call a cloud registry per request. Pre-fetch at deploy time or run a sync job that writes prompts to local storage. The architecture still works; the delivery mechanism changes.
Not every string is a prompt. User messages, retrieved documents, and conversation history stay in application logic. Only the reusable instructional layer moves to the registry. Teams that try to version entire conversation transcripts in a prompt tool usually regret it.
Where PromptForge fits
PromptForge is a prompt registry built for this pattern: {{variable}} templates, immutable versions, latest and stable channels, and a REST API that returns plain text your application passes to any provider.
It does not replace your inference stack. It sits beside it. Your code picks the model; PromptForge supplies the words.
If you are starting from hardcoded strings today, the migration path is incremental:
- Extract one high-churn system prompt into the registry
- Point one service at the API with
_version=stable - Prove rollback works (promote the previous version, confirm behaviour changes)
- Move the next prompt
You do not need to boil the ocean. You need one prompt out of the codebase to feel the difference.
Next steps
Read Why Hardcoding System Prompts Is an Anti-Pattern for the failure modes this architecture avoids. For the deployment workflow, see Decoupling Prompt Updates from CI/CD.
Then pick your highest-churn prompt, store it as a template, and fetch it before your next inference call. Provider-agnostic architecture starts with one string that no longer lives in git.