Why "Just Call the API" Doesn't Scale
The entry point for most teams' LLM use in business processes is a direct API call wired into a script: send the prompt, receive the response, do something with the text. This works for single-step, single-model, single-context use cases. It breaks under almost every condition that a real multi-step business workflow introduces.
The conditions where it breaks:
- The LLM call is one step in a chain, and its output needs to be a structured data type that the next step can use programmatically.
- Different steps in the workflow have different sensitivity requirements and shouldn't all go to the same model or the same provider.
- The cost of calling a large model for every step is unjustifiable for low-complexity classification tasks.
- The LLM call can fail, and the workflow needs to handle retries, fallback models, and partial results without breaking the entire instance.
LLM orchestration is the layer that handles these conditions. It's not a product category — it's an architectural requirement for any production use of LLMs in multi-step business processes. This piece covers the core concepts: structured outputs, temperature and model selection, provider routing, and error handling. These apply whether you're building an orchestration layer yourself or evaluating one built into a workflow platform.
Structured Outputs: The Non-Negotiable Requirement
In a multi-step workflow, an LLM's output is an input to the next step. If that output is free-form text, the next step has to parse it — and parsing free-form text is where brittle dependencies are born. The LLM rephrases the same logical output slightly differently across instances, the downstream parser misses a field, and the workflow either throws an error or silently passes the wrong data forward.
Structured output schemas solve this. Rather than asking the LLM to "summarize the contract and flag risks," you define a schema for the expected output, written here in shorthand type notation rather than literal JSON Schema: { "summary": string, "risk_level": "low" | "medium" | "high", "flagged_clauses": [{ "clause_id": string, "reason": string }] }. You instruct the model to produce JSON conforming to this schema, and you validate the response against it before advancing the workflow.
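The validation step can be sketched in a few lines. This is a minimal, standard-library-only illustration for the contract-review shape above; the names parse_contract_review and SchemaError are hypothetical, and a real deployment would more likely use a schema library than hand-written checks.

```python
import json

RISK_LEVELS = {"low", "medium", "high"}


class SchemaError(ValueError):
    """Raised when a model response does not match the expected shape."""


def parse_contract_review(raw: str) -> dict:
    """Parse a raw model response and check it against the contract-review shape.

    Returns the parsed dict on success and raises SchemaError otherwise,
    so the workflow never advances on malformed output.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise SchemaError("top-level value must be an object")
    if not isinstance(data.get("summary"), str):
        raise SchemaError("'summary' must be a string")

    risk = data.get("risk_level")
    if not isinstance(risk, str) or risk not in RISK_LEVELS:
        raise SchemaError("'risk_level' must be one of: low, medium, high")

    clauses = data.get("flagged_clauses")
    if not isinstance(clauses, list):
        raise SchemaError("'flagged_clauses' must be a list")
    for i, clause in enumerate(clauses):
        if not isinstance(clause, dict):
            raise SchemaError(f"flagged_clauses[{i}] must be an object")
        for key in ("clause_id", "reason"):
            if not isinstance(clause.get(key), str):
                raise SchemaError(f"flagged_clauses[{i}].{key} must be a string")
    return data
```

The point of raising a dedicated error type is that the workflow engine can catch it at the step boundary and attach both the raw response and the failure reason to the instance.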
Modern LLM APIs support structured output directly (OpenAI's function calling, JSON mode, and Structured Outputs; Anthropic's tool use schema), but these are provider-specific implementations with different guarantees and different behavior around partial compliance, error cases, and field ordering. JSON mode on its own, for example, guarantees syntactically valid JSON but not conformance to a particular schema. An orchestration layer abstracts the provider-specific format and presents a single schema definition interface to the workflow author, then handles the provider-side translation and validation internally.
The practical impact: when a workflow instance fails because an AI step produced invalid output, the error is surfaced at the step level with the actual model response alongside the expected schema. The ops team can see exactly what went wrong, adjust the system prompt, and replay the instance. Without schema validation, the failure may not surface until several steps later, and the debugging path is much longer.
Temperature, Top-P, and When They Matter
Temperature and top-p are the sampling parameters that control how deterministic or creative an LLM's output is. A temperature of 0 (or close to it) produces near-deterministic outputs: the model picks the most probable next token at each step. Higher temperature flattens the token distribution and produces more varied outputs. Top-p (nucleus sampling) limits variation in a different way: at each step the model samples only from the smallest set of tokens whose cumulative probability reaches p, so a lower top-p discards more of the unlikely tail.
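To make the two parameters concrete, here is a toy sketch of how they act on a three-token distribution. It illustrates the general technique (temperature-scaled softmax followed by a nucleus cutoff) and is not a description of any particular provider's sampler; the function names and logit values are made up for the example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as plain argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: list[float], top_p: float) -> list[int]:
    """Indices of the smallest set of tokens whose cumulative probability >= top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)  # almost all mass on token 0
warm = softmax_with_temperature(logits, 1.0)  # mass spread across all three
```

With these logits, the low-temperature distribution leaves a single token inside a 0.9 nucleus, while the temperature-1.0 distribution leaves two, which is the variation a classification step wants to avoid.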
For business workflow steps, temperature is a consequential setting that most workflow authors don't think about explicitly. A classification step — "is this contract high, medium, or low risk?" — should run at very low temperature. You want the most probable, consistent answer, not creative variation. A generation step — "write a first-draft summary of this deal for the account manager" — can run at moderate temperature; some variation in phrasing is acceptable and arguably desirable.
The common failure mode for temperature in business workflows is leaving default settings (often 0.7 or 1.0) on classification and extraction steps because the default produces plausible-looking output in testing. In production, with thousands of instances, high temperature on a classification step produces inconsistent routing. The same contract might be classified as "medium risk" in 80% of runs and "high risk" in 20%, not because the contract changed but because the sampling varied. That inconsistency undermines the entire value of having an AI classification step.
Rule of thumb for workflow orchestration: extraction and classification steps default to temperature 0 or 0.1. Generation and summarization steps use 0.3–0.6 depending on how much variation is acceptable. Document this in the AI agent node configuration, not in the system prompt — it's a model parameter, not a prompt instruction.
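One way to keep temperature out of the prompt is to make it a field on the node configuration with a default per step type. The sketch below assumes a hypothetical AgentNodeConfig structure; the specific defaults are illustrative picks from the ranges above, not values prescribed by any platform.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative defaults chosen from the ranges in the rule of thumb above.
DEFAULT_TEMPERATURE = {
    "extraction": 0.0,
    "classification": 0.0,
    "summarization": 0.3,
    "generation": 0.5,
}


@dataclass(frozen=True)
class AgentNodeConfig:
    """Hypothetical configuration for one AI agent node in a workflow."""

    name: str
    step_type: str  # one of the keys in DEFAULT_TEMPERATURE
    system_prompt: str
    temperature: Optional[float] = None  # None means "use the step-type default"

    def resolved_temperature(self) -> float:
        """Explicit setting wins; otherwise fall back to the step-type default."""
        if self.temperature is not None:
            return self.temperature
        return DEFAULT_TEMPERATURE[self.step_type]


classify = AgentNodeConfig("classify_risk", "classification", "Classify contract risk.")
draft = AgentNodeConfig("draft_summary", "generation", "Draft a deal summary.", temperature=0.6)
```

Because the value lives in configuration, a reviewer can audit every node's sampling settings without reading prompts, and an unknown step type fails loudly with a KeyError rather than silently inheriting a provider default.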
Multi-Provider Routing: Which Step Gets Which Model
A common pattern in production workflow deployments is using different LLM providers for different steps based on task complexity, cost, and data sensitivity. A growing software company's contract review workflow might route the initial clause extraction step to a smaller, faster, cheaper model that handles structured extraction well — and route the final risk assessment step, which requires more nuanced reasoning, to a more capable model from a different provider.
The factors that drive routing decisions in practice:
- Complexity: High-reasoning tasks (legal risk assessment, multi-factor eligibility evaluation) warrant a more capable model. Structured extraction and classification tasks don't need the same capability and cost less at lower tiers.
- Data sensitivity: Steps that process customer PII, contract terms, or financial data may need to route to a private deployment or an on-premises model rather than a public cloud provider. This is a compliance consideration, not just a preference.
- Cost per step: At scale, the cost difference between running every step on a frontier model versus running extraction on a smaller model and reasoning on a frontier model can be substantial. Multi-provider routing gives workflow authors control over this at the step level.
- Latency: Some workflow steps are in a critical path where the human reviewer is waiting for the AI output before they can act. For those steps, a faster model with slightly lower capability may be the right tradeoff.
An orchestration layer that only supports a single LLM provider forces all workflow steps onto the same model. That's acceptable for simple workflows but becomes a constraint as workflows grow in complexity, volume, and sensitivity diversity.
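The routing factors above can be expressed as a small per-step decision function. This is a sketch under stated assumptions: the StepProfile fields and the route names ("private-deployment", "frontier-model", "small-fast-model") are placeholders, and the precedence order (sensitivity, then latency, then complexity) is one reasonable choice rather than a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProfile:
    """Hypothetical description of what a workflow step needs from a model."""

    needs_deep_reasoning: bool
    handles_sensitive_data: bool
    latency_critical: bool


def choose_route(step: StepProfile) -> str:
    """Pick a model route for a step; all route names are placeholders."""
    # Data sensitivity is a compliance constraint, so it is checked first.
    if step.handles_sensitive_data:
        return "private-deployment"
    # A waiting human reviewer favors speed over maximum capability.
    if step.latency_critical:
        return "small-fast-model"
    # Only pay for the most capable tier when the task needs it.
    if step.needs_deep_reasoning:
        return "frontier-model"
    # Default to the cheaper tier for extraction and classification.
    return "small-fast-model"


clause_extraction = StepProfile(False, False, False)
risk_assessment = StepProfile(True, False, False)
pii_enrichment = StepProfile(True, True, False)
```

In this sketch, clause extraction lands on the small model, risk assessment on the frontier model, and anything touching sensitive data on the private deployment regardless of its other properties.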
Error Handling: The Part That Actually Matters in Production
LLM API calls fail, not often, but in predictable categories: rate limit exceeded, model unavailable, response timeout, malformed output that fails schema validation, content filter rejection. In a workflow that runs hundreds of instances per day, each failure mode will eventually occur, and the workflow's behavior when it does determines whether the failure is recoverable.
The minimum error handling architecture for a production AI agent step:
- A configurable retry policy: N retries with exponential backoff for transient errors.
- A fallback model or provider: if the primary model is unavailable, route to a secondary.
- Schema validation with explicit failure routing: if the output doesn't match the expected schema, route to a human review step rather than advancing.
- A max-retry-exceeded path: if all retries fail, pause the instance and notify the workflow owner rather than silently failing.
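The four pieces above fit into one small control loop. The following is a minimal sketch, not a reference implementation: run_agent_step, TransientError, and the status strings are hypothetical, the model calls are passed in as plain callables, and the validator is assumed to raise ValueError on output that fails the schema.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Stand-in for rate limits, timeouts, and model-unavailable errors."""


def run_agent_step(
    call_primary: Callable[[], str],
    call_fallback: Callable[[], str],
    validate: Callable[[str], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run one AI step with retries, a fallback model, and explicit failure routing."""
    for model_call in (call_primary, call_fallback):
        for attempt in range(max_retries):
            try:
                raw = model_call()
            except TransientError:
                # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
                if attempt < max_retries - 1:
                    sleep(base_delay * 2 ** attempt)
                continue
            try:
                return {"status": "ok", "output": validate(raw)}
            except ValueError as exc:
                # Invalid output goes to a human instead of advancing the workflow.
                return {"status": "needs_human_review", "raw": raw, "error": str(exc)}
    # Both models exhausted their retries: pause and notify, never fail silently.
    return {"status": "paused", "reason": "max retries exceeded on primary and fallback"}
```

This version exhausts retries on the primary before trying the fallback and sends invalid output straight to review without retrying; both are design choices that a real policy might reasonably make differently. Injecting sleep as a parameter keeps the function testable without real delays.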
We're not saying every workflow needs all of these. For a low-stakes, low-frequency workflow, simpler error handling is fine. The issue is that most teams start with no error handling and add it only after a production failure. By then, they've already had instances fail silently, produce incorrect downstream data, or block without notification. The cost of adding error handling before deployment is almost always lower than the cost of diagnosing the first production failure that would have been caught by it.
LLM orchestration for business processes is an engineering discipline, not a configuration exercise. The teams that treat it as a configuration exercise — setting prompts and moving on — tend to have a good first month and a difficult third month when the edge cases accumulate. The teams that treat it as an engineering discipline — schema validation, error handling, provider routing, temperature control — run workflows that are still reliable six months later because the failure modes were designed out before the first instance ran.
