Error Handling Strategies for Probabilistic Code Execution


A single LLM-generated function can return syntactically valid Python that produces a different result, or a different error, on every invocation. Without non-determinism-aware error handling, agents silently return wrong results or burn through API budgets on doomed retries.
How to Handle Errors in Probabilistic Code Execution
- Instrument every agent code execution with OpenTelemetry spans carrying prompt hashes, attempt numbers, and token estimates.
- Classify each exception dynamically as terminal, retryable-with-mutation, or retryable-without-mutation using confidence scoring.
- Mutate the correction context between retries by feeding the error message, traceback, and prior output back into the agent's next prompt.
- Enforce hard boundaries on both retry attempts and cumulative token spend to prevent runaway costs.
- Capture terminal failures in Sentry with full execution context, agent ID, and correction strategy metadata.
- Wrap agent-executed functions with a self-correction decorator that gates retries on error classification and budget.
- Store all intermediate code generations—including failed attempts—to build training signal for improving correction callbacks.
- Route alerts by error taxonomy: page on-call for terminal errors, digest retryable exhaustions, and trigger automated reviews on budget breaches.
Table of Contents
- Why Deterministic Error Handling Breaks in Probabilistic Systems
- Prerequisites
- Telemetry Setup
- The Anatomy of Failure in Probabilistic Code
- Observability-First Error Architecture
- The Self-Correction Decorator: A Practical Pattern for Self-Healing Agents
- Classifying Errors: Deciding What Deserves a Retry
- Putting It All Together: An End-to-End Pipeline
- Toward Reliable AI Systems
Why Deterministic Error Handling Breaks in Probabilistic Systems
A single LLM-generated function can return syntactically valid Python that produces a different result, or a different error, on every invocation. AI error handling cannot rely on the foundational assumption behind traditional try/catch patterns: that identical inputs yield identical failures. When self-healing agents generate and execute code at runtime, the error surface becomes probabilistic. The same prompt, the same temperature, the same model version can produce semantically divergent outputs across consecutive calls. Without non-determinism-aware error handling, agents silently return wrong results or burn through API budgets on doomed retries.
This article addresses runtime errors in agent-generated or agent-executed code paths, not model training or inference latency. The thesis is straightforward: probabilistic programming environments require error handling that is context-aware, budget-aware, and self-correcting. What follows covers the anatomy of non-deterministic failures, an observability-first architecture using OpenTelemetry and Sentry, a concrete self-correction decorator pattern, an error classification framework, and production hardening guidance.
Prerequisites
The code examples in this article assume the following dependencies. Pin versions to avoid breaking changes:
opentelemetry-api>=1.20.0,<2.0
opentelemetry-sdk>=1.20.0,<2.0
sentry-sdk>=2.0.0,<3.0
pandas>=2.0.0
Install with:
pip install "opentelemetry-api>=1.20.0,<2.0" "opentelemetry-sdk>=1.20.0,<2.0" "sentry-sdk>=2.0.0,<3.0" "pandas>=2.0.0"
Python 3.9 or later is required. The sentry-sdk 2.x line is needed for the new_scope() API used in the examples below. For accurate token counting in production, also install your provider's tokenizer (e.g., tiktoken for OpenAI models).
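The ~4 chars/token heuristic used later in this article can be upgraded transparently when a real tokenizer is installed. A minimal sketch (the function name and the gpt-4 model choice are illustrative, not part of any library API):

```python
def estimate_tokens(text: str) -> int:
    """Prefer the provider tokenizer when installed; otherwise fall back
    to the rough ~4 characters-per-token heuristic used in this article."""
    try:
        import tiktoken  # optional dependency; not in the pinned requirements
        return len(tiktoken.encoding_for_model("gpt-4").encode(text))
    except ImportError:
        return max(1, len(text) // 4)
```

Either branch returns an integer estimate, so downstream budget checks do not care which path ran.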
Telemetry Setup
Configure a single TracerProvider once at your application entry point. All other modules obtain tracers via trace.get_tracer(...) without touching the provider. This avoids the silent span loss that occurs when multiple modules each call set_tracer_provider().
# telemetry.py — import and call configure_telemetry() ONCE at application startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

_initialized = False

def configure_telemetry() -> None:
    global _initialized
    if _initialized:
        return
    provider = TracerProvider()
    # Replace ConsoleSpanExporter with OTLPSpanExporter in production
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    _initialized = True

# main.py (application entry point)
from telemetry import configure_telemetry

configure_telemetry()  # Must be called before any module that calls trace.get_tracer()

from agent_executor import execute_generated_code  # noqa: E402
from self_correct_decorator import self_correct  # noqa: E402
The Anatomy of Failure in Probabilistic Code
Categories of Non-Deterministic Errors
Failures in agent-generated code fall into distinct categories that traditional exception hierarchies were never designed to capture.
Semantic drift occurs when the generated code is syntactically correct but logically wrong: a function that should filter rows by date instead filters by ID, silently producing bad data.
Stochastic API failures include rate limits, token cap exhaustion, and model refusals. They surface unpredictably depending on concurrent load and content policies.
You will also encounter schema violations, where the output shape changes between runs. A function returns a flat dictionary on one invocation and a nested list on the next.
The most insidious category is cascading context corruption. A bad intermediate result from one agent step gets fed forward as context to the next, compounding errors through the pipeline. By the time you detect it, the root cause is several steps upstream.
Why "Just Retry" Is an Antipattern
Blind retries amplify cost without improving the probability of success. Without mutating the context between attempts, the same prompt feeds the same model state, producing the same class of failure. An exponential backoff strategy designed for transient network errors becomes a token-burning machine when applied to LLM calls. Consider that each retry against a large language model consumes tokens billed by the provider. Three retry attempts at 4,000 tokens each (prompt + completion) consume 12,000 additional tokens beyond the original call. Actual cost varies by provider; some offer prompt caching (e.g., OpenAI, Anthropic) that reduces repeat prompt costs. Verify billing details with your specific provider. A pipeline running 500 retries/day at 4,000 tokens each burns 2M tokens/day, enough to exhaust a $50 monthly budget in under a week at GPT-4 pricing. The retry must change something, or it is simply repeated gambling.
The retry must change something, or it is simply repeated gambling.
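The budget arithmetic above is easy to sanity-check in code. A throwaway helper (the function and figures are illustrative, not provider pricing):

```python
def retry_token_cost(retries: int, tokens_per_call: int, days: int = 1) -> int:
    """Tokens consumed by retries alone, beyond the original calls."""
    return retries * tokens_per_call * days

print(retry_token_cost(3, 4_000))    # 12000: three blind retries of one call
print(retry_token_cost(500, 4_000))  # 2000000: a day of pipeline retries
```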
Observability-First Error Architecture
Instrumenting Non-Deterministic Calls with OpenTelemetry
Trace every agent code execution as a discrete span. The span-per-invocation model allows operators to reconstruct the full lifecycle of a generated function: what was prompted, what was produced, and how it behaved at runtime. OpenTelemetry's Python SDK supports custom semantic attributes that carry probabilistic execution metadata alongside standard trace context.
Configure the TracerProvider once at application startup (see the Telemetry Setup section above). Individual modules obtain a tracer without calling set_tracer_provider():
import os
import hashlib
import concurrent.futures
from opentelemetry import trace
from opentelemetry.trace import StatusCode

# Obtain tracer from application-level provider; do NOT call set_tracer_provider here.
tracer = trace.get_tracer("agent.executor")

_EXEC_TIMEOUT_SECONDS = int(os.environ.get("AGENT_EXEC_TIMEOUT", "10"))
MAX_ERROR_MESSAGE_LEN = 500

def execute_generated_code(code_string: str, prompt: str, attempt: int, temperature: float):
    """
    WARNING: exec() on LLM-generated code is inherently unsafe. In production,
    run generated code in an isolated subprocess, container, or sandbox (e.g.,
    RestrictedPython, Pyodide, or a separate process with OS-level isolation).
    The snippet below is illustrative only. Restricting __builtins__ is NOT
    sufficient isolation on CPython — determined code can escape via object
    introspection chains.
    """
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:12]
    with tracer.start_as_current_span("agent.code_execution") as span:
        span.set_attribute("agent.prompt_hash", prompt_hash)
        span.set_attribute("agent.attempt_number", attempt)
        span.set_attribute("agent.temperature", str(temperature))
        span.set_attribute("agent.code_length", len(code_string))
        try:
            exec_globals: dict = {"__builtins__": {}}
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(exec, code_string, exec_globals)
            try:
                future.result(timeout=_EXEC_TIMEOUT_SECONDS)
            except concurrent.futures.TimeoutError:
                raise TimeoutError(
                    f"Generated code exceeded {_EXEC_TIMEOUT_SECONDS}s execution limit"
                )
            finally:
                # wait=False: a timed-out worker thread cannot be killed, but the
                # caller must not block on it. The runaway exec keeps running in the
                # background — one more reason to prefer subprocess isolation.
                executor.shutdown(wait=False)
            if "result" not in exec_globals:
                raise ValueError(
                    "Generated code did not assign to 'result'. "
                    "Ensure generated code sets: result = <value>"
                )
            result = exec_globals["result"]
            span.set_attribute("agent.result_type", type(result).__name__)
            span.set_status(StatusCode.OK)
            return result
        except Exception as exc:
            span.set_attribute("agent.error_class", type(exc).__name__)
            span.set_attribute("agent.error_message", str(exc)[:MAX_ERROR_MESSAGE_LEN])
            span.set_status(StatusCode.ERROR, str(exc))
            raise
The prompt_hash attribute enables correlation of failures across retries originating from the same prompt. The attempt_number attribute distinguishes first-pass failures from retry failures, which often have different root causes. Output schema fingerprinting, implemented by hashing the structure of the result, can be added as a custom attribute to detect schema violations between runs. The execution timeout (configurable via the AGENT_EXEC_TIMEOUT environment variable) prevents generated code containing infinite loops from blocking the calling thread and leaking span contexts. The explicit check for "result" in exec_globals ensures that generated code which never assigns to result raises a clear error instead of silently returning None with a success status.
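The schema fingerprinting mentioned above can be sketched as a structural hash that ignores values and captures only container shape and leaf types. A minimal sketch (the sequence handling assumes homogeneous lists; production code would also bound recursion depth):

```python
import hashlib
import json

def schema_fingerprint(value) -> str:
    """Hash the *structure* of a result: container shapes and leaf types,
    not the values. Two runs with the same shape produce the same hash;
    dict-vs-list drift between runs produces different fingerprints."""
    def describe(v):
        if isinstance(v, dict):
            return {k: describe(v[k]) for k in sorted(v)}
        if isinstance(v, (list, tuple)):
            # Describe only the first element; assumes homogeneous sequences.
            return [describe(v[0])] if v else []
        return type(v).__name__
    canonical = json.dumps(describe(value), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Attach the fingerprint as a custom span attribute (for example `agent.result_schema`) and alert when consecutive runs of the same prompt hash disagree.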
Structured Error Capture with Sentry
Sentry integration captures non-deterministic exceptions with the full execution context needed for post-hoc analysis. The key is enriching each captured exception with metadata specific to probabilistic execution: the attempt number, the correction strategy applied, the agent identifier, and a reference to the generated code.
Caution: Sending raw prompts and generated code to Sentry transmits that content to a third-party service. Redact or hash sensitive fields before capture. Review your data processing agreement with Sentry. The example below logs hashes by default, not raw content.
Note: Call sentry_sdk.init() once at your application entry point, not in library or utility modules, to avoid overriding the host application's Sentry configuration.
import os
import hashlib
import sentry_sdk

sentry_sdk.init(
    dsn=os.environ.get("SENTRY_DSN", ""),  # Gracefully degrade if env var absent
    traces_sample_rate=0.05,  # 5% sampling for production; use 1.0 only in development
)

def report_agent_failure(exc: Exception, prompt: str, generated_code: str,
                         attempt: int, agent_id: str, correction_strategy: str):
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()  # compute once
    prompt_hash_short = prompt_hash[:12]
    with sentry_sdk.new_scope() as scope:
        scope.set_context("agent_execution", {
            "prompt_hash": prompt_hash,
            "generated_code_length": len(generated_code),
            "attempt_number": attempt,
            "agent_id": agent_id,
            "correction_strategy": correction_strategy,
        })
        scope.set_tag("agent.attempt_number", str(attempt))
        scope.set_tag("agent.correction_strategy", correction_strategy)
        scope.set_tag("agent.error_class", type(exc).__name__)
        scope.add_breadcrumb(
            category="agent.prompt",
            message=f"Prompt hash: {prompt_hash_short}",
            level="info",
        )
        scope.add_breadcrumb(
            category="agent.generated_code.metadata",
            message=f"Generated code length: {len(generated_code)} chars",
            level="info",
        )
        scope.capture_exception(exc)
Tag errors with attempt_number and correction_strategy to build custom Sentry dashboards or issue grouping queries showing whether self-correction converges or diverges across retry sequences (this requires custom configuration; it is not available out of the box). Breadcrumbs log a hash of the prompt and the generated code length as an ordered trail, preserving the causal chain that led to the failure without bloating the exception payload or transmitting sensitive content. Using sentry_sdk.new_scope() ensures that tags and context from one agent call do not bleed into another in concurrent execution.
The Self-Correction Decorator: A Practical Pattern for Self-Healing Agents
Design Principles
The self-correction decorator rests on four principles, though in practice they interlock rather than standing alone.
Bounded attempts and budget guarding work together: a hard ceiling on retries prevents infinite loops, while cumulative estimated token spend tracking aborts execution if cost grows faster than progress. Without budget guarding, the decorator degenerates into an expensive retry loop with a counter.
Context mutation between retries is what separates this from blind repetition. The error message, traceback, and previous output feed back into the agent's next prompt, transforming each retry from repetition into directed correction. Classification gating prevents wasting that mutation on errors where correction callbacks have no track record of success. A KeyError from schema drift is retryable. An authentication failure is not.
The correction_callback parameter must conform to the following contract:
from typing import Callable
CorrectionCallback = Callable[[dict], Callable]
# Input: a dict containing error_type, error_message, traceback, attempt, function_name
# Output: a replacement callable with the same signature as the original function
Full Implementation
The decorator module below has no dependency on pandas. The usage example that follows requires pandas separately.
import functools
import traceback
import hashlib
from typing import Callable, Optional
from opentelemetry import trace
import sentry_sdk

# Obtain tracer from application-level provider (see Telemetry Setup section).
# Do NOT call set_tracer_provider here.
tracer = trace.get_tracer("agent.self_correct")

MAX_ERROR_MESSAGE_LEN = 500
CorrectionCallback = Callable[[dict], Callable]

class TokenBudgetExceeded(BaseException):
    """Raised when cumulative estimated token spend exceeds the retry budget.
    Extends BaseException so it is not caught by broad 'except Exception' handlers."""
    pass

class TerminalAgentError(Exception):
    """Raised when an error is classified as non-retryable.
    Callers must catch TerminalAgentError and inspect __cause__ to recover
    the original exception type if needed for downstream handling.
    """
    pass

def self_correct(max_attempts: int = 3, token_budget: int = 4000,
                 retryable_errors=None,
                 correction_callback: Optional[CorrectionCallback] = None,
                 agent_id: str = "default"):
    """
    Decorator for agent-executed functions that enables bounded, context-aware
    self-correction. Catches exceptions, classifies them, mutates correction
    context, enforces token/attempt budgets, and reports terminal failures.
    Args:
        max_attempts: Hard ceiling on total execution attempts. Must be >= 1.
        token_budget: Maximum cumulative estimated tokens across all retries.
            Token estimation uses a ~4 chars/token heuristic. For production
            budget enforcement, integrate tiktoken or your provider's tokenizer.
        retryable_errors: Exception type or tuple of exception types eligible
            for retry. A single type is automatically wrapped in a tuple.
        correction_callback: CorrectionCallback — receives an error_context dict
            and returns a corrected callable with the same signature as the
            decorated function. Must return a callable.
        agent_id: Identifier for tracing and Sentry tagging.
    """
    if max_attempts < 1:
        raise ValueError(f"max_attempts must be >= 1, got {max_attempts}")
    if retryable_errors is None:
        retryable_errors = (KeyError, ValueError, TypeError)
    if isinstance(retryable_errors, type):
        retryable_errors = (retryable_errors,)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cumulative_tokens = 0
            last_exception = None
            original_name = func.__name__
            # current_func is local to this call frame — safe for threaded use.
            # For async, replace with contextvars.ContextVar (see production
            # hardening note below).
            current_func = func
            for attempt in range(1, max_attempts + 1):
                # Derive prompt_hash from call arguments as a proxy for prompt identity.
                # For accurate correlation, pass prompt content explicitly via kwargs.
                prompt_hash = hashlib.sha256(
                    (str(args) + str(sorted(kwargs.items()))).encode()
                ).hexdigest()[:12]
                with tracer.start_as_current_span("agent.self_correct_attempt") as span:
                    span.set_attribute("agent.attempt_number", attempt)
                    span.set_attribute("agent.agent_id", agent_id)
                    span.set_attribute("agent.prompt_hash", prompt_hash)
                    span.set_attribute("agent.cumulative_tokens", cumulative_tokens)
                    try:
                        result = current_func(*args, **kwargs)
                        span.set_attribute("agent.result_status", "success")
                        return result
                    except retryable_errors as exc:
                        last_exception = exc
                        tb = traceback.format_exc()
                        span.set_attribute("agent.error_class", type(exc).__name__)
                        span.set_attribute("agent.error_message", str(exc)[:MAX_ERROR_MESSAGE_LEN])
                        error_context = {
                            "error_type": type(exc).__name__,
                            "error_message": str(exc),
                            "traceback": tb,
                            "attempt": attempt,
                            "function_name": original_name,
                        }
                        # Only charge and check budget if another attempt will occur
                        if attempt < max_attempts:
                            # ~4 chars/token heuristic; use tiktoken for accuracy
                            estimated_tokens = (len(tb) + len(str(exc))) // 4 + 200
                            cumulative_tokens += estimated_tokens
                            if cumulative_tokens > token_budget:
                                span.set_attribute("agent.abort_reason", "token_budget_exceeded")
                                raise TokenBudgetExceeded(
                                    f"Budget exceeded: {cumulative_tokens}/{token_budget} estimated tokens"
                                ) from exc
                            if correction_callback:
                                candidate = correction_callback(error_context)
                                if not callable(candidate):
                                    raise TerminalAgentError(
                                        f"correction_callback returned non-callable: {type(candidate)}"
                                    ) from exc
                                current_func = candidate
                    except (TokenBudgetExceeded, TerminalAgentError):
                        raise  # always propagate control-flow exceptions
                    except Exception as exc:
                        # Non-retryable: report to Sentry and raise immediately
                        with sentry_sdk.new_scope() as scope:
                            scope.set_tag("agent.agent_id", agent_id)
                            scope.set_tag("agent.terminal_error", "true")
                            scope.capture_exception(exc)
                        raise TerminalAgentError(
                            f"Terminal error on attempt {attempt}: {exc}"
                        ) from exc
            # All retryable attempts exhausted
            with sentry_sdk.new_scope() as scope:
                scope.set_context("agent_exhaustion", {
                    "max_attempts": max_attempts,
                    "cumulative_tokens": cumulative_tokens,
                    "final_error": str(last_exception),
                })
                scope.capture_exception(last_exception)
            raise TerminalAgentError(
                f"All {max_attempts} attempts exhausted"
            ) from last_exception
        return wrapper
    return decorator
Usage Example
import pandas as pd

def llm_correction(error_context):
    """STUB ONLY: Always returns corrected_transform regardless of the error.
    In production, replace with a real LLM API call that receives error_context
    and generates a corrected function dynamically."""
    # The stub simulates an LLM that sees the KeyError on 'date' and
    # generates a corrected version.
    def corrected_transform(df):
        # Corrected: use 'timestamp' column instead of 'date'
        df["year"] = pd.to_datetime(df["timestamp"]).dt.year
        return df[df["year"] >= 2023]
    return corrected_transform

@self_correct(
    max_attempts=3,
    token_budget=4000,
    retryable_errors=(KeyError, ValueError),
    correction_callback=llm_correction,
    agent_id="data-pipeline-agent-01"
)
def transform_data(df):
    # First attempt: LLM-generated code with wrong column name
    df["year"] = pd.to_datetime(df["date"]).dt.year
    return df[df["year"] >= 2023]

# Execution: attempt 1 raises KeyError('date'), decorator captures context,
# correction_callback returns the stub's fixed function, attempt 2 succeeds.
# With a real LLM callback, success on attempt 2 is not guaranteed.
sample_df = pd.DataFrame({
    "timestamp": ["2022-01-01", "2023-06-15", "2024-03-10"],
    "value": [1, 2, 3]
})
result = transform_data(sample_df)
Why This Is Not Just a Retry Loop
A blind retry executes the same function with the same inputs three times, producing three identical KeyError exceptions and consuming tokens on each attempt with zero diagnostic progress. The self-correction decorator behaves differently at every step.
Attempt one catches the KeyError, extracts the traceback and error message, and packages them into an error context dictionary. By attempt two, the correction callback has already ingested that context and returned a new function targeting the specific failure. A third attempt exists as a safety net but rarely fires because the directed correction has narrowed the error space. Context mutation transforms repeated gambling into directed search through the solution space.
Context mutation transforms repeated gambling into directed search through the solution space.
Classifying Errors: Deciding What Deserves a Retry
Building an Error Taxonomy for AI Agents
Not all errors are created equal in probabilistic execution.
Terminal errors include authentication failures, permission denials, and hard schema breakages where the upstream data contract has changed fundamentally. Propagate these immediately to Sentry and halt execution.
Retryable-with-mutation errors cover semantic drift, partial output, and format violations: cases where feeding the error back to the agent succeeds in more than roughly 30% of historical attempts for that error class. Below that threshold, the correction callback is unlikely to help and the tokens are better saved.
Retryable-without-mutation errors cover transient network failures and rate limits, where standard exponential backoff is the correct strategy because the issue is infrastructure, not logic.
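For that third class, the right machinery is ordinary backoff, kept deliberately separate from the self-correction path. A minimal sketch with full jitter (function names and defaults are illustrative):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: grow as base * 2^attempt, cap it,
    then jitter uniformly to avoid thundering-herd retries."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def retry_transient(fn, max_attempts: int = 5,
                    transient=(ConnectionError, TimeoutError),
                    delay_fn=backoff_delay):
    """Retry only infrastructure-class errors; no context mutation needed."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except transient:
            if attempt == max_attempts - 1:
                raise  # exhausted: propagate the transient error
            time.sleep(delay_fn(attempt))
```

Because the failure is infrastructure rather than logic, the function is re-run unchanged; contrast this with the mutation the self-correction decorator performs.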
Dynamic Classification with Confidence Scoring
Static error classification misses the reality that retry viability degrades with each attempt. Assigning a retry confidence score between 0 and 1, based on error type combined with attempt history, provides a more nuanced gate. A KeyError on attempt one might score 0.8 (illustrative; derive actual values from your own retry-success telemetry). The same KeyError on attempt three, after two correction cycles failed to resolve it, drops to 0.2. When confidence falls below a configurable threshold (0.3 is an illustrative starting point; tune based on observed retry success rates in your specific deployment), the decorator aborts early rather than exhausting remaining attempts.
The confidence scoring logic described above is conceptual and is not implemented in the decorator code shown earlier. A production implementation would add a classify_error(exc, attempt) -> float function that returns the confidence score and gates retry decisions within the decorator loop.
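Under those assumptions, the gate might look like the sketch below. The base scores and decay rate are illustrative placeholders to be replaced with values derived from your retry-success telemetry:

```python
# Illustrative base scores per error class; tune from observed retry-success rates.
BASE_CONFIDENCE = {
    "KeyError": 0.8,
    "ValueError": 0.6,
    "TypeError": 0.5,
}
DECAY_PER_ATTEMPT = 0.3  # assumption: confidence drops linearly per failed correction

def classify_error(exc: Exception, attempt: int) -> float:
    """Retry confidence in [0, 1]: base score for the error class,
    decayed by the number of correction cycles already spent on it."""
    base = BASE_CONFIDENCE.get(type(exc).__name__, 0.0)
    return max(0.0, base - DECAY_PER_ATTEMPT * (attempt - 1))

# Gate inside the retry loop (threshold is illustrative):
# if classify_error(exc, attempt) < 0.3:
#     raise TerminalAgentError("confidence below retry threshold") from exc
```

This reproduces the numbers above: a KeyError scores 0.8 on attempt one and 0.2 by attempt three, falling below the 0.3 threshold.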
Using the LLM itself to classify its own errors (meta-correction) is possible but demands caution. Each classification call consumes tokens and adds latency. Recursive self-assessment can become a secondary budget drain. Reserve meta-correction for high-value pipelines where the cost of a false-positive retry is significantly lower than the cost of premature termination.
Putting It All Together: An End-to-End Pipeline
Architecture Overview
The full pipeline operates as a closed loop. The agent generates code, and the @self_correct decorator wraps execution. OpenTelemetry traces each attempt as a discrete span with prompt hash, attempt number, estimated token count, and result classification. This is where observability pays off.
When an exception occurs, the decorator classifies it against the error taxonomy. Retryable errors trigger context mutation: the error message, traceback, and previous output feed into the correction callback, which calls the LLM to produce a corrected function. Terminal errors route directly to Sentry with full execution context. On success, the result returns with provenance metadata (attempt count, cumulative tokens, correction strategies applied) attached to the trace.
Production Hardening Checklist
After N consecutive failures across multiple functions (not just retries within a single function), a circuit breaker should open so that subsequent calls fail fast rather than consuming resources. This prevents cascading failures across an agent fleet.
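A fleet-level breaker fits in a few lines. The sketch below is a minimal, single-threaded version (class name and thresholds are illustrative; add locking before sharing across threads):

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; while open,
    allow() returns False so callers fail fast. After `reset_seconds` it
    half-opens, permitting one probe call before deciding again."""
    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_seconds:
            # Half-open: permit one probe; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
```

Callers check `allow()` before invoking the agent and report the outcome with `record(...)`; the decorator itself stays unchanged.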
Async-safe variants of the decorator are essential for concurrent agent workloads. Isolate current_func per-coroutine using contextvars.ContextVar to avoid race conditions. A full async variant of this decorator is beyond this article's scope; see the Python contextvars documentation.
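A full async variant remains out of scope, but the isolation mechanism itself is small. The sketch below (names illustrative) demonstrates that a ContextVar set inside one asyncio task does not leak into a concurrently running task:

```python
import asyncio
import contextvars

# Per-task view of the (possibly corrected) function: each asyncio task runs
# in its own copy of the context, so set() inside one task stays local to it.
_current_func = contextvars.ContextVar("current_func")

async def attempt(func, x):
    _current_func.set(func)        # this task's corrected function
    await asyncio.sleep(0)         # yield so the two tasks interleave
    return _current_func.get()(x)  # still this task's function, not the other's

async def demo():
    return await asyncio.gather(
        attempt(lambda v: v * 2, 3),  # 6
        attempt(lambda v: -v, 3),     # -3
    )
```

Replacing the decorator's `current_func` local with such a ContextVar is the core of an async-safe variant.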
Store all intermediate code generations, not just the final successful output. The failed attempts contain the richest diagnostic signal for improving prompt engineering.
Segment Sentry alert routing by error taxonomy: terminal errors page on-call, retryable exhaustions feed a daily digest, and budget exceeded alerts trigger automated prompt review workflows.
Toward Reliable AI Systems
Probabilistic code demands probabilistic error handling. The pattern is observe, classify, mutate, bound. Instrument every agent execution with OpenTelemetry spans carrying probabilistic metadata, and capture terminal failures in Sentry with full context. Classify errors dynamically rather than relying on static exception hierarchies. Mutate the correction context between retries so each attempt is a directed step, not a coin flip. Enforce hard boundaries on both attempts and token spend.
Probabilistic code demands probabilistic error handling. The pattern is observe, classify, mutate, bound.
The @self_correct decorator presented here is a starting template, not a finished product. Production deployments will need to adapt the correction callback to their specific LLM provider, tune confidence thresholds based on observed retry success rates, and extend the error taxonomy as new failure modes surface. The data collected from failed attempts (the tracebacks, the intermediate outputs, the correction strategies that worked and those that did not) is as valuable as the successful outputs. That failure data is the training signal for improving correction-callback accuracy in future iterations.