Technical Deep Dive

The Probabilistic Stack: Engineering for Non-Determinism

LLMs break the fundamental assumption of software engineering: deterministic inputs produce deterministic outputs. New patterns required.

MMNTM Research Team
10 min read
#AI Engineering #Best Practices #Streaming #Reliability


The Paradigm Shift

For decades, software engineering rested on a foundational assumption: deterministic inputs produce deterministic outputs.

A database query returns a predictable row. A REST API returns a structured JSON object. A function call returns a specific type. You could write tests that asserted exact equality. You could debug by reproducing exact inputs.

Large Language Models break this assumption entirely.

An LLM API call returns a stream of tokens that is:

  • Probabilistic — Same input can produce different outputs
  • Variable in latency — Response time fluctuates based on output length
  • Unstructured by default — Prose, not types

This shift created an immediate engineering vacuum. The tools built for deterministic systems don't work. The patterns developed over decades don't apply. We need a new stack.

The Early Chaos (2022-2023)

When GPT-3 went mainstream, developers faced a brutal integration challenge:

No Standardization:

  • OpenAI used one API schema
  • Anthropic used another
  • Each provider had unique streaming behaviors
  • Switching models meant rewriting the data access layer

Manual Stream Parsing (a sketch of what this looked like follows this list):

  • Raw Server-Sent Events (SSE) over HTTP
  • Custom TextDecoder implementations
  • Homegrown state machines for token accumulation
  • No library to handle the "typing" effect users expected
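
In practice, that meant code like the sketch below: a raw fetch call, a TextDecoder, and a hand-maintained buffer for partial SSE events. The endpoint URL and payload field names are placeholders, not any specific provider's API.

// A rough sketch of the hand-rolled SSE handling teams wrote in this era.
// The endpoint and payload shape are illustrative, not a real provider API.
async function streamCompletion(prompt: string, onToken: (token: string) => void) {
  const res = await fetch('https://api.example.com/v1/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, stream: true }),
  })

  const reader = res.body!.getReader()
  const decoder = new TextDecoder()
  let buffer = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })

    // SSE events are separated by blank lines; keep any trailing partial event.
    const events = buffer.split('\n\n')
    buffer = events.pop() ?? ''

    for (const event of events) {
      const data = event.replace(/^data: /, '').trim()
      if (data === '[DONE]') return
      onToken(JSON.parse(data).token) // payload field name is an assumption
    }
  }
}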

No Type Safety:

  • Responses were arbitrary strings
  • Parsing into structured data was error-prone
  • Validation happened after the fact (if at all)
  • Runtime crashes from unexpected output formats

The result: every team built bespoke solutions. Code that worked with GPT-3.5 broke with GPT-4. Migrations were nightmares. Technical debt accumulated faster than features shipped.

The Integration Gap

The core problem was a mismatch between developer training and the new reality.

Developers trained for:

  • Request → Response (atomic)
  • Predictable types
  • Exact reproducibility
  • Test with assertions

LLMs provide:

  • Request → Stream (incremental)
  • Probabilistic output
  • Approximate reproducibility at best
  • Test with... fuzzy matching? Vibe checks?

This is the "integration gap"—the distance between how developers think about systems and how LLMs actually behave.

Frameworks like Vercel AI SDK emerged specifically to bridge this gap, treating LLMs like databases: providing ORMs for AI that create consistent, type-safe abstraction layers over a fragmented ecosystem.
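
As a rough illustration of that abstraction layer, here is what a provider swap can look like with the Vercel AI SDK. The package names, model IDs, and exact call shape are assumptions that vary across SDK versions; the point is that the call site does not change when the provider does.

// A minimal sketch of the unified-provider idea; details differ by SDK version.
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'

const useAnthropic = process.env.AI_PROVIDER === 'anthropic'

const result = await streamText({
  // Swapping providers is a one-line change; nothing else in the call site moves.
  model: useAnthropic ? anthropic('claude-3-5-sonnet-latest') : openai('gpt-4o'),
  prompt: 'Summarize this review in one sentence: ...',
})

for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}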

Streaming: Not Optional

In deterministic systems, you can wait for the complete response. With LLMs, waiting is death.

Model Tier                 | Typical Response Time (Complete) | User Tolerance
Frontier reasoning models  | 15-45 seconds                    | ~5 seconds
Balanced models            | 10-30 seconds                    | ~5 seconds
Fast models                | 3-8 seconds                      | ~5 seconds

The math doesn't work. Even fast models often exceed user tolerance, and frontier models are 3-9x slower than users will wait.

Streaming solves this: Begin displaying the response as tokens generate. First-byte latency drops from 15 seconds to ~200ms. Users see progress and wait.

But streaming introduces complexity (a client-side sketch follows these lists):

State Management:

  • Track partial response as it builds
  • Handle mid-stream errors gracefully
  • Support cancellation (abort controller)
  • Manage multiple concurrent streams

UI Challenges:

  • Render incrementally without layout thrashing
  • Handle code blocks that arrive across multiple chunks
  • Manage scroll position as content grows
  • Show loading states appropriately

Protocol Complexity:

  • Parse Server-Sent Events
  • Handle reconnection on connection drops
  • Multiplex different data types (text, tool calls, errors)
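
A minimal sketch of that client-side bookkeeping, independent of any particular SDK. Here startStream is a stand-in for whatever hands you an async iterable of text chunks (an SDK stream, a parsed SSE reader, and so on):

// A sketch of streaming state management: partial text, cancellation, mid-stream errors.
interface StreamState {
  text: string
  status: 'streaming' | 'done' | 'cancelled' | 'error'
}

function consumeStream(
  startStream: (signal: AbortSignal) => AsyncIterable<string>,
  onUpdate: (state: StreamState) => void,
) {
  const controller = new AbortController()
  const state: StreamState = { text: '', status: 'streaming' }

  const finished = (async () => {
    try {
      for await (const chunk of startStream(controller.signal)) {
        state.text += chunk        // accumulate the partial response
        onUpdate({ ...state })     // re-render incrementally
      }
      state.status = 'done'
    } catch {
      // Distinguish user cancellation from a genuine mid-stream failure.
      state.status = controller.signal.aborted ? 'cancelled' : 'error'
    }
    onUpdate({ ...state })
  })()

  return { cancel: () => controller.abort(), finished }
}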

The move to streaming is mandatory for production LLM applications. It's also significantly harder to implement than traditional request/response.

Structured Outputs: Forcing Reliability

The second major pattern: forcing LLMs to output structured data.

Raw LLM output is prose. Prose is great for humans, terrible for systems. You can't reliably extract a boolean from "Well, I think the answer is probably yes, but it depends on..."

Structured Output Pattern:

  1. Define a schema (typically Zod or JSON Schema)
  2. Instruct the model to output JSON matching the schema
  3. Validate the output
  4. Retry or repair if validation fails
// Instead of: "Analyze this review and tell me if it's positive"
// Which returns: "This review appears to be generally positive..."

// Use structured output (imports shown for the Vercel AI SDK):
import { z } from 'zod'
import { generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'

const schema = z.object({
  sentiment: z.enum(['positive', 'negative', 'neutral']),
  confidence: z.number().min(0).max(1),
  keyPhrases: z.array(z.string()).max(5)
})
 
const result = await generateObject({
  model: openai('gpt-4'),
  schema,
  prompt: 'Analyze this review...'
})
 
// result.object is typed and validated
// { sentiment: 'positive', confidence: 0.85, keyPhrases: [...] }

Key Insight: Modern SDKs don't just pass the schema to the model. They:

  • Translate schema to model-specific format (JSON Schema for OpenAI, Input Schema for Anthropic)
  • Validate streaming output in real-time
  • Auto-retry on validation failure
  • Optionally repair malformed JSON before giving up

This transforms unreliable prose into reliable types. It's the bridge between probabilistic generation and deterministic systems.
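
Conceptually, the retry loop an SDK runs for you looks something like this hand-rolled sketch, where callModel is a hypothetical helper that returns raw model text:

// A hand-rolled validate-and-retry loop; SDKs implement a more capable version internally.
import { z } from 'zod'

async function generateValidated<T extends z.ZodTypeAny>(
  callModel: (prompt: string) => Promise<string>, // hypothetical raw-text model call
  schema: T,
  prompt: string,
  maxAttempts = 3,
): Promise<z.infer<T>> {
  let lastError = ''
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await callModel(
      attempt === 0
        ? prompt
        : `${prompt}\n\nYour previous output failed validation: ${lastError}\nReturn only valid JSON.`,
    )
    try {
      const parsed = schema.safeParse(JSON.parse(raw))
      if (parsed.success) return parsed.data
      lastError = parsed.error.message
    } catch {
      lastError = 'Output was not valid JSON' // malformed JSON: retry with feedback
    }
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${lastError}`)
}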

The Middleware Pattern

Complex AI applications need more than raw model calls. They need:

  • Context from databases (RAG)
  • Guardrails for safety
  • Logging and observability
  • Rate limiting and caching
  • Input/output transformation

The Middleware Pattern wraps model calls with pre- and post-processing:

User Input
    ↓
[Pre-Processing Middleware]
    - Validate input
    - Check rate limits
    - Embed query for RAG
    - Retrieve relevant documents
    - Augment system prompt with context
    ↓
LLM Call
    ↓
[Post-Processing Middleware]
    - Validate output schema
    - Check for PII/safety issues
    - Log for observability
    - Cache if appropriate
    ↓
Application Response

RAG Pipeline Example:

  1. Intercept: Middleware intercepts the model call
  2. Embed: Extract last user message, send to embedding model
  3. Retrieve: Query vector database for relevant chunks
  4. Augment: Append chunks to system prompt
  5. Forward: Release modified request to LLM

This encapsulation keeps retrieval logic separate from presentation logic. The chat UI doesn't know or care that RAG is happening—it just sees better responses.
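
The five steps map almost line-for-line onto a wrapper function. A minimal sketch, assuming hypothetical embed, search, and downstream model helpers:

// RAG-as-middleware: intercept the request, augment it, forward it.
interface ChatRequest {
  system: string
  messages: { role: 'user' | 'assistant'; content: string }[]
}

type Next = (req: ChatRequest) => Promise<string>
type Middleware = (req: ChatRequest, next: Next) => Promise<string>

function createRagMiddleware(deps: {
  embed: (text: string) => Promise<number[]>              // embedding model client
  search: (vector: number[], topK: number) => Promise<string[]> // vector store query
}): Middleware {
  return async (req, next) => {
    // 1. Intercept: grab the latest user message.
    const query = req.messages.filter((m) => m.role === 'user').at(-1)?.content ?? ''
    // 2-3. Embed the query and retrieve relevant chunks.
    const chunks = await deps.search(await deps.embed(query), 5)
    // 4. Augment: append the retrieved context to the system prompt.
    const augmented = {
      ...req,
      system: `${req.system}\n\nRelevant context:\n${chunks.join('\n---\n')}`,
    }
    // 5. Forward the modified request to the model.
    return next(augmented)
  }
}

// Usage: const answer = await createRagMiddleware({ embed, search })(request, callModel)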

The Abstraction Trade-Off

Provider abstraction (swap openai('gpt-4') for anthropic('claude-3')) is powerful. It's also dangerous.

What You Gain

  • Portability: Change models without rewriting code
  • Experimentation: A/B test different providers easily
  • Future-proofing: New models work with existing code
  • Cost optimization: Route to cheaper models for simple queries

What You Lose

  • Visibility: Abstraction hides what's actually happening
  • Provider Features: Advanced capabilities require escape hatches
  • Debugging: Harder to isolate where problems originate
  • Performance Tuning: Provider-specific optimizations become difficult

Silent Failure Patterns

Abstractions can fail in non-obvious ways:

Configuration Mismatches: If client configuration (e.g., maxSteps: 5) doesn't match server configuration (e.g., maxSteps: 3), tool execution loops fail silently. The model tries to call a tool, but the server terminates early. No error—just broken behavior.

Error Swallowing: Stream protocols embed errors as data (2: {"error":...}). If error handling isn't implemented correctly, errors appear as the generation simply stopping. No console logs, no exceptions—just silence.

Leaky Abstractions: Provider-specific features (Anthropic's cache control, OpenAI's logprobs) require bypassing the abstraction layer. Once you're accessing raw provider APIs, you've lost the portability the abstraction promised.

Debugging Non-Determinism

Traditional debugging doesn't work when outputs vary. New approaches required:

Observability Over Reproducibility

You can't reproduce exact outputs. Instead, instrument everything (a minimal logging wrapper is sketched after this list):

  • Full request/response logging (with token-level timing)
  • Prompt versioning (track which prompt produced which outputs)
  • Cost tracking (per-request token usage)
  • Latency percentiles (P50 is vanity, P95 is reality)
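
A minimal logging wrapper along these lines captures most of it. Here callModel and the log sink are hypothetical stand-ins for your model client and observability backend:

// Per-call instrumentation: prompt version, token usage, and latency on every request.
interface CallLog {
  promptVersion: string
  prompt: string
  output: string
  inputTokens?: number
  outputTokens?: number
  latencyMs: number
  timestamp: string
}

async function instrumentedCall(
  callModel: (prompt: string) => Promise<{ text: string; usage?: { inputTokens: number; outputTokens: number } }>,
  prompt: string,
  promptVersion: string,
  log: (entry: CallLog) => Promise<void>,
): Promise<string> {
  const start = performance.now()
  const result = await callModel(prompt)
  await log({
    promptVersion,                              // which prompt produced this output
    prompt,
    output: result.text,
    inputTokens: result.usage?.inputTokens,
    outputTokens: result.usage?.outputTokens,   // drives per-request cost tracking
    latencyMs: Math.round(performance.now() - start),
    timestamp: new Date().toISOString(),
  })
  return result.text
}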

Evaluation Over Assertion

You can't assert response === "Expected output". Instead (one option is sketched after this list):

  • LLM-as-judge: Have another model evaluate quality
  • Semantic similarity: Embedding-based comparison
  • Rule-based checks: Does output contain required fields? Forbidden content?
  • Human evaluation: Regular sampling and scoring
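
For example, a semantic-similarity check replaces exact string assertions with a distance threshold. The embed helper and the 0.85 threshold below are assumptions to tune per task:

// Embedding-based comparison: assert the response is "close enough" to a reference answer.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0)
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0))
  return dot / (norm(a) * norm(b))
}

async function assertSemanticallyClose(
  embed: (text: string) => Promise<number[]>, // hypothetical embedding-model wrapper
  actual: string,
  expected: string,
  threshold = 0.85, // an assumption, not a universal constant
) {
  const [a, b] = await Promise.all([embed(actual), embed(expected)])
  const similarity = cosineSimilarity(a, b)
  if (similarity < threshold) {
    throw new Error(`Response drifted from reference (similarity ${similarity.toFixed(2)} < ${threshold})`)
  }
}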

See State of Evals for comprehensive evaluation strategies.

Prompt Engineering as Code

Prompts are code now. Treat them accordingly:

  • Version control prompts alongside code
  • Test prompts against evaluation sets
  • Review prompt changes like code changes
  • Document prompt decisions and rationale

Fallback Chains

Non-determinism means failures happen. Plan for them:

// generateWithGPT4 / generateWithClaude are placeholders for your actual provider calls.
async function reliableGeneration(input) {
  try {
    return await generateWithGPT4(input)
  } catch (e) {
    // Log the primary failure, then try the fallback provider.
    console.warn('GPT-4 failed, falling back to Claude', e)
    return await generateWithClaude(input)
  }
}

The Reliability Hierarchy

Different applications need different reliability guarantees:

Use Case             | Reliability Need | Approach
Chat/Creative        | Low              | Raw streaming, basic validation
Data Extraction      | Medium           | Structured output with retry
Workflow Automation  | High             | Schema validation + human review
Financial/Legal      | Critical         | Full HITL approval + audit trail

See HITL Firewall for patterns that match reliability requirements to oversight levels.

The Testing Problem

How do you test probabilistic systems?

Property-Based Testing

Don't test exact outputs. Test properties:

// Instead of: expect(result).toBe("The capital is Paris")
// Test properties (standard Jest/Vitest matchers):
expect(['positive', 'negative', 'neutral']).toContain(result.sentiment)
expect(result.confidence).toBeGreaterThanOrEqual(0)
expect(result.confidence).toBeLessThanOrEqual(1)
expect(result.keyPhrases.length).toBeLessThanOrEqual(5)

Golden Set Testing

Maintain a set of input/output pairs that represent "good enough" responses. Test that new model versions or prompt changes don't degrade below this baseline.
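
A sketch of what that check can look like in CI, with generate and score standing in for your model call and whatever quality metric you trust (similarity, LLM-as-judge, rules):

// Golden-set regression check: fail the build if any case drops below its recorded baseline.
interface GoldenCase {
  input: string
  reference: string
  baselineScore: number // score recorded when this case was last approved
}

async function runGoldenSet(
  cases: GoldenCase[],
  generate: (input: string) => Promise<string>,
  score: (output: string, reference: string) => Promise<number>,
  tolerance = 0.05, // allowed slack below baseline; an assumption to tune
) {
  const regressions: string[] = []
  for (const c of cases) {
    const output = await generate(c.input)
    const s = await score(output, c.reference)
    if (s < c.baselineScore - tolerance) {
      regressions.push(`"${c.input.slice(0, 40)}..." scored ${s.toFixed(2)} vs baseline ${c.baselineScore}`)
    }
  }
  if (regressions.length > 0) {
    throw new Error(`Golden set regressions:\n${regressions.join('\n')}`)
  }
}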

Regression Detection

Track metrics over time. Alert when:

  • Average response quality drops
  • Error rates increase
  • Latency percentiles shift
  • Cost per query spikes

Chaos Engineering for AI

Intentionally inject failures:

  • Provider timeouts
  • Malformed responses
  • Rate limit errors
  • Network interruptions

Verify your system degrades gracefully.
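
A small fault-injection wrapper is enough to start. The failure rates and error messages below are illustrative, and this belongs in staging, not production:

// Wrap the real model call and randomly surface the failure modes listed above.
type ModelCall = (prompt: string) => Promise<string>

function withChaos(call: ModelCall, failureRate = 0.1): ModelCall {
  return async (prompt) => {
    if (Math.random() < failureRate) {
      const roll = Math.random()
      if (roll < 0.4) throw new Error('Simulated provider timeout')
      if (roll < 0.8) throw new Error('Simulated 429: rate limit exceeded')
      return '{"broken": json' // malformed output instead of an exception
    }
    return call(prompt)
  }
}

// Usage in staging: const chaoticCall = withChaos(realModelCall, 0.2)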

The New Mental Model

Building on the probabilistic stack requires a mental shift:

Deterministic               | Probabilistic
Request → Response          | Request → Stream
Assert exact equality       | Evaluate approximate quality
Debug by reproduction       | Debug by observation
Test with fixtures          | Test with evaluation sets
Errors are exceptional      | Errors are expected
Type safety at compile time | Schema validation at runtime

This isn't worse—it's different. And once you internalize the patterns, you can build systems that deterministic approaches simply couldn't support.

The Bottom Line

The probabilistic stack is a new discipline. It requires:

  1. Streaming by default — Wait time is abandonment
  2. Structured outputs — Force reliability through schemas
  3. Middleware architecture — Separate concerns cleanly
  4. Observability investment — You can't debug what you can't see
  5. Evaluation frameworks — Replace assertions with quality measurement
  6. Graceful degradation — Expect failures, plan fallbacks

The teams that master these patterns ship reliable AI products. The teams that fight the non-determinism—trying to force LLMs into deterministic boxes—struggle indefinitely.

Embrace the probability.


See also: Vercel AI SDK Guide for the leading implementation of these patterns, and Agent Failure Modes for what happens when probabilistic systems go wrong.
