
Agentic Engineering with 'Superpowers'


What Is the Superpowers Framework for AI Agents?

The superpowers framework is an architectural pattern for LLM agent orchestration that replaces flat tool registries with structured, declarative capability definitions. Each "superpower" bundles intent, typed input/output schemas, permission constraints, execution logic, and fallback behavior into a single composable unit—improving tool-selection accuracy, enforcing runtime guardrails, and enabling multi-step agent workflows that scale reliably beyond the limits of ad-hoc function registration.

The standard approach to building AI agents has a scaling problem. You register a handful of functions, wire up JSON schemas, and let the LLM pick which tool to call. It works fine with five tools. At fifteen, reliability drops. At thirty, you're debugging hallucinated function names and fighting context window bloat.


Agentic engineering demands a better abstraction, and the "superpowers" framework offers one: a pattern that replaces ad-hoc tool registration with structured, declarative capability definitions. Think of it as the difference between handing someone a drawer full of loose screwdrivers and giving them a labeled, organized toolkit with clear instructions for each instrument. This article walks through the superpowers pattern from concept to production, with full working code, debugging strategies, and a shareable production readiness checklist for teams building AI agent capabilities at scale.

A quick note on terminology: "superpowers framework" as I use it here refers to an architectural pattern and implementation approach for LLM agent orchestration, not a single canonical open-source library. The concepts draw from capability-driven design principles that have emerged across multiple agent frameworks. The code examples build this pattern on top of OpenAI's function-calling API, making everything here immediately usable regardless of your underlying stack.

The Problem with Tool-Calling as We Know It

Function Registries Are Fragile at Scale

Here's what a typical tool registry looks like when a team has been building an agent for a few months:

{
  "model": "gpt-4.1-mini",
  "input": "Check if the payment service is healthy",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "query_logs",
        "description": "Query service logs by keyword or time range",
        "parameters": {
          "type": "object",
          "properties": {
            "q": { "type": "string" },
            "service": { "type": "string" },
            "hours": { "type": "integer" }
          },
          "required": ["q"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "search_logs",
        "description": "Search through application logs",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string" },
            "source": { "type": "string" }
          },
          "required": ["query"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "get_metrics",
        "description": "Fetch service metrics",
        "parameters": {
          "type": "object",
          "properties": {
            "service_name": { "type": "string" },
            "metric": { "type": "string" }
          },
          "required": ["service_name"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "parameters": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "body": { "type": "string" },
            "priority": { "type": "string" }
          },
          "required": ["title", "body"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "create_issue",
        "description": "Create a GitHub issue",
        "parameters": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "description": { "type": "string" },
            "labels": { "type": "array" }
          },
          "required": ["title"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "trigger_deploy",
        "description": "Trigger a deployment",
        "parameters": {
          "type": "object",
          "properties": {
            "service": { "type": "string" },
            "version": { "type": "string" },
            "env": { "type": "string" }
          },
          "required": ["service", "env"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "rollback_deploy",
        "description": "Rollback a deployment",
        "parameters": {
          "type": "object",
          "properties": {
            "service": { "type": "string" },
            "env": { "type": "string" }
          },
          "required": ["service", "env"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "restart_service",
        "description": "Restart a running service",
        "parameters": {
          "type": "object",
          "properties": {
            "service": { "type": "string" },
            "env": { "type": "string" }
          },
          "required": ["service"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "check_health",
        "description": "Check service health status",
        "parameters": {
          "type": "object",
          "properties": {
            "service": { "type": "string" }
          },
          "required": ["service"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "get_oncall",
        "description": "Get the current on-call engineer",
        "parameters": {
          "type": "object",
          "properties": {
            "team": { "type": "string" }
          },
          "required": ["team"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "send_notification",
        "description": "Send a notification to a channel",
        "parameters": {
          "type": "object",
          "properties": {
            "channel": { "type": "string" },
            "message": { "type": "string" }
          },
          "required": ["channel", "message"]
        }
      }
    }
  ]
}

Count the problems. query_logs and search_logs do nearly the same thing with different parameter names. create_ticket and create_issue overlap in purpose. Every single schema gets shipped with every request, eating tokens and adding cost. And the model has to disambiguate eleven functions from their descriptions alone. When I built a DevOps assistant for a client's platform team, the agent's tool selection accuracy dropped from 94% to 71% when we went from 6 tools to 18. The model started calling search_logs when it should have called get_metrics, and occasionally fabricated arguments that matched no schema at all.

This isn't a model intelligence problem. It's an architecture problem. More tools doesn't mean a more capable agent. Past a threshold, it means a less reliable one.

From Services to Skills: The Mental Model Shift

The fundamental issue is that tool-calling treats agent capabilities as a flat list of service endpoints. There's no hierarchy, no grouping, no notion of what the agent should do versus what it can do. The superpowers pattern reframes this: instead of registering isolated functions, you define composable capabilities that bundle intent, constraints, execution logic, and fallback behavior into a single unit.

This aligns with the broader trajectory of agentic engineering patterns like ReAct (Reason + Act) and plan-and-execute architectures. The agent doesn't just pick a function from a menu. It reasons about what capability it needs, and the framework resolves the right execution path, validates permissions, and manages state across steps.

What the Superpowers Framework Actually Is

Architecture and Core Concepts

A "superpower" is a first-class abstraction that wraps everything a capability needs into a single definition:

  • Intent: What problem does this capability solve? Written as an LLM-optimized description that minimizes ambiguity.
  • Input/Output Schemas: Strictly typed contracts for what goes in and what comes out, validated at runtime.
  • Constraints: Permission boundaries, rate limits, required approval gates.
  • Execution Logic: The actual handler that performs the work.
  • Fallback Behavior: What happens when execution fails, times out, or returns unexpected results.

The orchestration layer sits between the LLM and the superpowers. It manages which superpowers get presented to the model (based on context, permissions, and scope), resolves selection ambiguity, validates inputs before execution, and handles state propagation between chained invocations.
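That scoping step can be sketched as a simple filter over the registry. The `allowed_roles` and `scope_tags` constraint keys and the shape of `context` here are illustrative choices, not a canonical API:

```python
def scoped_superpowers(registry: dict, context: dict) -> dict:
    """Filter the registry down to the superpowers visible in the current scope.

    The `allowed_roles` / `scope_tags` constraint keys and the `context`
    shape are illustrative, not part of any canonical framework.
    """
    scoped = {}
    for name, sp in registry.items():
        constraints = sp.get("constraints", {})
        # Hide capabilities the caller's role may not use
        allowed_roles = constraints.get("allowed_roles")
        if allowed_roles and context.get("role") not in allowed_roles:
            continue
        # Hide capabilities irrelevant to the current task
        tags = constraints.get("scope_tags")
        if tags and not set(tags) & set(context.get("task_tags", [])):
            continue
        scoped[name] = sp
    return scoped
```

Presenting only the scoped subset to the model keeps the tool list short per request, which is exactly what the flat-registry approach fails to do.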

How Superpowers Are Defined and Registered

The key difference from traditional tool registration is declarative over imperative. Instead of calling register_tool(name, schema, handler) for each function one by one, you define a capability manifest that the framework ingests. Here's the contrast:

Traditional tool definition:

# Imperative: register each piece separately
tools = []
tools.append({
    "type": "function",
    "function": {
        "name": "query_logs",
        "description": "Query service logs",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "service": {"type": "string"},
                "hours": {"type": "integer", "default": 24}
            },
            "required": ["query", "service"]
        }
    }
})
# No constraints, no fallback, no permissions — all handled ad-hoc elsewhere

Superpower definition:

from pydantic import BaseModel, Field
from typing import Optional

class QueryLogsInput(BaseModel):
    query: str = Field(description="Search term or pattern")
    service: str = Field(description="Target service name")
    hours: int = Field(default=24, ge=1, le=168, description="Lookback window in hours")

class QueryLogsOutput(BaseModel):
    entries: list[dict]
    total_count: int
    truncated: bool

# NOTE: query_logs_handler must be defined before referencing it here.
# See the full handler implementation in "Defining a Capability Set" below.

query_logs_superpower = {
    "name": "QueryLogs",
    "intent": "Search and retrieve application log entries for a specific service within a time window",
    "input_schema": QueryLogsInput,
    "output_schema": QueryLogsOutput,
    "constraints": {
        "max_lookback_hours": 168,
        "requires_approval": False,
        "rate_limit": "60/minute",
        "allowed_services": ["payments", "auth", "gateway", "orders"]
    },
    "fallback": {
        "on_timeout": "return_partial_results",
        "on_error": "retry_once_then_report"
    },
    "handler": query_logs_handler  # actual execution function
}

The superpower definition bundles what would otherwise be scattered across your tool registry, your validation middleware, your permission checks, and your error handling code. The framework resolves which superpower to invoke at runtime by matching the model's stated intent against superpowers available in the current scope, using the typed schemas to validate arguments before the handler ever runs.

Building Your First Superpowers-Based Agent

Setting Up the Framework

Since the superpowers pattern sits on top of standard LLM tool-calling, the foundation uses familiar dependencies. Here's the project setup:

# requirements.txt
# openai>=1.40.0
# pydantic>=2.7.0
# structlog>=24.1.0
# tenacity>=8.2.0

# framework.py — Framework initialization
import openai
import structlog

logger = structlog.get_logger()

client = openai.OpenAI()  # Uses OPENAI_API_KEY env var
MODEL = "gpt-4.1-mini"

# Superpower registry
SUPERPOWER_REGISTRY: dict[str, dict] = {}

def register_superpower(superpower: dict):
    """Register a superpower definition into the global registry."""
    name = superpower["name"]
    SUPERPOWER_REGISTRY[name] = superpower
    logger.info("superpower_registered", name=name)

Defining a Capability Set

Let's build a DevOps agent with three interconnected superpowers: querying logs, creating incident tickets, and triggering deployments. Each one has typed schemas, explicit constraints, and a handler function.

from pydantic import BaseModel, Field
from typing import Optional, Literal
from tenacity import retry, stop_after_attempt, wait_exponential
import structlog

logger = structlog.get_logger()

# NOTE: log_api, ticketing_api, and deploy_api are placeholders for your
# actual service clients (e.g., Elasticsearch, Jira, Argo CD).
# You must provide these implementations for the handlers to work.

# ---- Superpower 1: QueryLogs ----
class QueryLogsInput(BaseModel):
    query: str = Field(description="Search term, regex pattern, or error code")
    service: str = Field(description="Service name: payments, auth, gateway, or orders")
    hours: int = Field(default=24, ge=1, le=168)

class QueryLogsOutput(BaseModel):
    entries: list[dict]
    total_count: int
    truncated: bool

@retry(stop=stop_after_attempt(2), wait=wait_exponential(min=1, max=4))
def query_logs_handler(params: QueryLogsInput) -> QueryLogsOutput:
    # Real implementation would call your log aggregation API
    logger.info("executing_query_logs", service=params.service, query=params.query)
    results = log_api.search(query=params.query, service=params.service, hours=params.hours)
    return QueryLogsOutput(
        entries=results[:100],
        total_count=len(results),
        truncated=len(results) > 100
    )

query_logs = {
    "name": "QueryLogs",
    "intent": "Search application log entries for a specific service to investigate errors or anomalies",
    "input_schema": QueryLogsInput,
    "output_schema": QueryLogsOutput,
    "constraints": {
        "requires_approval": False,
        "rate_limit": "60/min",
        "allowed_services": ["payments", "auth", "gateway", "orders"]
    },
    "fallback": {
        "on_timeout": "return_partial",
        "on_error": "retry_once_then_report"
    },
    "handler": query_logs_handler
}

# ---- Superpower 2: CreateTicket ----
class CreateTicketInput(BaseModel):
    title: str = Field(max_length=200)
    body: str = Field(max_length=5000)
    priority: Literal["low", "medium", "high", "critical"]
    service: str
    related_log_entries: Optional[list[str]] = None

class CreateTicketOutput(BaseModel):
    ticket_id: str
    url: str

def create_ticket_handler(params: CreateTicketInput) -> CreateTicketOutput:
    logger.info("executing_create_ticket", priority=params.priority, service=params.service)
    ticket = ticketing_api.create(
        title=params.title,
        body=params.body,
        priority=params.priority,
        labels=[params.service]
    )
    return CreateTicketOutput(ticket_id=ticket.id, url=ticket.url)

create_ticket = {
    "name": "CreateTicket",
    "intent": "Create an incident or bug ticket in the tracking system with priority and service context",
    "input_schema": CreateTicketInput,
    "output_schema": CreateTicketOutput,
    "constraints": {
        "requires_approval": False,
        "rate_limit": "10/min"
    },
    "fallback": {
        "on_error": "report_failure"
    },
    "handler": create_ticket_handler
}

# ---- Superpower 3: TriggerDeployment ----
class TriggerDeploymentInput(BaseModel):
    service: str
    version: str = Field(description="Semantic version or 'rollback' for previous version")
    environment: Literal["staging", "production"]

class TriggerDeploymentOutput(BaseModel):
    deployment_id: str
    status: str
    rollback: bool

def trigger_deployment_handler(params: TriggerDeploymentInput) -> TriggerDeploymentOutput:
    logger.info("executing_deployment", service=params.service,
                version=params.version, env=params.environment)
    deploy = deploy_api.trigger(
        service=params.service,
        version=params.version,
        env=params.environment
    )
    return TriggerDeploymentOutput(
        deployment_id=deploy.id,
        status=deploy.status,
        rollback=(params.version == "rollback")
    )

trigger_deployment = {
    "name": "TriggerDeployment",
    "intent": "Trigger a service deployment or rollback to a specific version in staging or production",
    "input_schema": TriggerDeploymentInput,
    "output_schema": TriggerDeploymentOutput,
    "constraints": {
        "requires_approval": True,  # Human-in-the-loop for deployments
        "rate_limit": "5/hour",
        "allowed_environments": ["staging", "production"]
    },
    "fallback": {
        "on_error": "abort_and_notify"
    },
    "handler": trigger_deployment_handler
}

# Register all superpowers
for sp in [query_logs, create_ticket, trigger_deployment]:
    register_superpower(sp)

Notice how each superpower has a distinct intent string that reads like a sentence. This matters more than you'd expect. The LLM uses these descriptions to decide which capability to invoke, and I've found that intent descriptions written as "verb + object + context" (e.g., "Search application log entries for a specific service to investigate errors") outperform terse labels by a wide margin in selection accuracy. When I tested this with our DevOps agent, switching from short descriptions ("Query logs") to intent sentences cut wrong-tool selection from 12% to under 3% across 500 test prompts.


Orchestrating Multi-Step Agent Workflows

The real power shows up when the agent chains superpowers together. Here's the orchestration layer that lets the agent detect a log anomaly, create a ticket, and conditionally trigger a rollback:

import json
from pydantic import ValidationError

def superpowers_to_openai_tools(registry: dict) -> list[dict]:
    """Convert superpower definitions to OpenAI tool format."""
    tools = []
    for name, sp in registry.items():
        schema = sp["input_schema"].model_json_schema()
        tools.append({
            "type": "function",
            "function": {
                "name": name,
                "description": sp["intent"],
                "parameters": schema
            }
        })
    return tools

def execute_superpower(name: str, arguments: dict, context: dict) -> dict:
    """Validate, authorize, and execute a superpower."""
    sp = SUPERPOWER_REGISTRY[name]

    # Permission check
    if sp["constraints"].get("requires_approval") and not context.get("approved"):
        return {
            "status": "blocked",
            "reason": "requires_human_approval",
            "superpower": name,
            "arguments": arguments
        }

    # Input validation via Pydantic
    try:
        validated_input = sp["input_schema"](**arguments)
    except ValidationError as e:
        logger.error("schema_validation_failed", superpower=name, errors=e.errors())
        return {
            "status": "error",
            "reason": "invalid_arguments",
            "details": e.errors()
        }

    # Execute handler
    try:
        result = sp["handler"](validated_input)
        output = result.model_dump()
        logger.info("superpower_executed", superpower=name, status="success")
        return {"status": "success", "data": output}
    except Exception as e:
        logger.error("superpower_execution_failed", superpower=name, error=str(e))
        fallback = sp["fallback"].get("on_error", "report_failure")
        return {
            "status": "error",
            "reason": str(e),
            "fallback_action": fallback
        }

def run_agent(user_message: str, max_steps: int = 10):
    """Run the agent loop with superpower orchestration."""
    tools = superpowers_to_openai_tools(SUPERPOWER_REGISTRY)
    messages = [
        {
            "role": "system",
            "content": (
                "You are a DevOps agent. Use your capabilities to investigate issues, "
                "create tickets, and manage deployments. Always query logs before taking action. "
                "Only trigger deployments if log evidence supports the need."
            )
        },
        {"role": "user", "content": user_message}
    ]
    context = {"approved": False, "chain_state": {}}

    for step in range(max_steps):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools
        )
        msg = response.choices[0].message

        if not msg.tool_calls:
            logger.info("agent_completed", final_response=msg.content)
            return msg.content

        # Append the assistant message with all tool calls first
        messages.append(msg)

        for tool_call in msg.tool_calls:
            sp_name = tool_call.function.name
            sp_args = json.loads(tool_call.function.arguments)
            logger.info("superpower_invoked", step=step, superpower=sp_name, args=sp_args)

            result = execute_superpower(sp_name, sp_args, context)

            # Store chain state for downstream superpowers
            context["chain_state"][sp_name] = result

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })

            # If blocked, request human approval and retry
            if result.get("status") == "blocked":
                logger.warn("approval_required", superpower=sp_name)
                # In production: await human approval via Slack/webhook
                # context["approved"] = await get_human_approval(sp_name, sp_args)

    return "Agent reached maximum steps without completing."

The context["chain_state"] dictionary is key here. It lets downstream superpowers access the output of upstream ones. When the agent queries logs and finds errors, that data sits in the context when it decides whether to create a ticket or trigger a rollback. The requires_approval constraint on TriggerDeployment acts as a hard gate: the agent can reason about deployments all it wants, but the handler won't fire without explicit human sign-off.

Advanced Patterns: Composition, Delegation, and Guardrails

Composable Superpowers

Individual superpowers are useful. Composed superpowers are where the pattern gets genuinely interesting. A composite superpower orchestrates sub-capabilities with shared context:

class IncidentResponseInput(BaseModel):
    service: str
    error_pattern: str
    auto_rollback: bool = False

class IncidentResponseOutput(BaseModel):
    log_summary: dict
    ticket: Optional[dict] = None
    deployment: Optional[dict] = None
    actions_taken: list[str]

def incident_response_handler(params: IncidentResponseInput) -> IncidentResponseOutput:
    actions = []

    # Step 1: Query logs
    log_result = execute_superpower("QueryLogs", {
        "query": params.error_pattern,
        "service": params.service,
        "hours": 4
    }, context={"approved": True})
    actions.append("queried_logs")

    # Step 2: Create ticket if errors found
    ticket_result = None
    if log_result["status"] == "success" and log_result["data"]["total_count"] > 0:
        ticket_result = execute_superpower("CreateTicket", {
            "title": f"[Auto] Errors in {params.service}: {params.error_pattern}",
            "body": f"Found {log_result['data']['total_count']} matching log entries in last 4h.",
            "priority": "high" if log_result["data"]["total_count"] > 50 else "medium",
            "service": params.service
        }, context={"approved": True})
        actions.append("created_ticket")

    # Step 3: Conditionally rollback
    deploy_result = None
    if (params.auto_rollback
            and log_result["status"] == "success"
            and log_result["data"]["total_count"] > 100):
        deploy_result = execute_superpower("TriggerDeployment", {
            "service": params.service,
            "version": "rollback",
            "environment": "production"
        }, context={"approved": True})  # Pre-approved via auto_rollback flag
        actions.append("triggered_rollback")

    return IncidentResponseOutput(
        log_summary=log_result.get("data", {}),
        ticket=ticket_result.get("data") if ticket_result else None,
        deployment=deploy_result.get("data") if deploy_result else None,
        actions_taken=actions
    )

incident_response = {
    "name": "IncidentResponse",
    "intent": "Investigate a service incident by querying logs, creating a ticket, and optionally rolling back",
    "input_schema": IncidentResponseInput,
    "output_schema": IncidentResponseOutput,
    "constraints": {
        "requires_approval": True,
        "rate_limit": "3/hour"
    },
    "fallback": {
        "on_error": "abort_and_notify"
    },
    "handler": incident_response_handler
}

IncidentResponse is a higher-order capability. The LLM doesn't need to orchestrate three separate tool calls. It invokes one capability and the composition logic handles the rest. This cuts the number of LLM round-trips, lowers the chance of the model making a wrong intermediate selection, and makes the overall workflow testable as a single unit.

When should you compose versus keep superpowers standalone? Compose when the sub-capabilities always run in a predictable sequence for a specific use case. Keep them standalone when they're genuinely independent and the LLM needs flexibility to use them in different orders depending on the situation.

Multi-Agent Delegation

When a single agent's capability set grows too large, split into specialized agents with distinct superpower sets. A triage agent might own QueryLogs and ClassifyIncident, while a deployment agent owns TriggerDeployment and RollbackDeployment. Delegation works by having the triage agent output a structured handoff message that includes the target agent, the task, and any accumulated context from the chain state.

The hard rule: no two agents in the same delegation chain should share overlapping superpowers for the same action. If both the triage agent and deployment agent can create tickets, you'll get duplicate tickets. Assign each destructive or side-effect-producing capability to exactly one agent.
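One way to make the handoff concrete is a typed message, mirroring the schema discipline used for superpower inputs. The field set below is illustrative; adapt it to whatever context your agents actually share:

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class AgentHandoff(BaseModel):
    """Structured handoff from one agent to another.

    The agent names and fields here are illustrative, not a fixed protocol.
    """
    target_agent: Literal["triage", "deployment"]
    task: str = Field(description="What the receiving agent should accomplish")
    chain_state: dict = Field(
        default_factory=dict,
        description="Accumulated results from superpowers already executed"
    )
    handoff_reason: Optional[str] = None
```

The triage agent emits an `AgentHandoff`, and the receiving agent's orchestrator seeds its own `context["chain_state"]` from it, so no step has to be re-run on the other side.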

Permission Scoping and Runtime Guardrails

Defining what an agent cannot do matters just as much as defining what it can. Negative capability constraints go into the superpower definition:

"constraints": {
    "requires_approval": True,
    "denied_environments": ["production"],  # This agent can only deploy to staging
    "max_concurrent_deployments": 1,
    "blocked_services": ["billing"]  # Never touch the billing service
}

Runtime validation hooks check these constraints before the handler runs. The execute_superpower function from the orchestration code above is the enforcement point. Every invocation gets logged with the actor (which agent), input, output, timestamp, and whether any constraint blocked execution. This audit trail is non-negotiable for production systems.
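A sketch of that enforcement point, checking the constraint keys shown above before a handler runs (the function itself is illustrative, not a canonical framework API):

```python
def check_constraints(sp: dict, arguments: dict, context: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means allowed.

    Checks the constraint keys used in the examples above. A sketch, not
    part of any canonical framework.
    """
    c = sp.get("constraints", {})
    violations = []
    if c.get("requires_approval") and not context.get("approved"):
        violations.append("requires_human_approval")
    env = arguments.get("environment")
    if env and env in c.get("denied_environments", []):
        violations.append(f"environment_denied:{env}")
    service = arguments.get("service")
    if service and service in c.get("blocked_services", []):
        violations.append(f"service_blocked:{service}")
    allowed = c.get("allowed_services")
    if allowed and service and service not in allowed:
        violations.append(f"service_not_allowed:{service}")
    return violations
```

Returning the full violation list, rather than failing on the first check, gives the audit log a complete picture of why an invocation was blocked.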

Debugging Superpowers-Based Agents

Common Failure Modes

After running superpowers-based agents in testing and staging environments, I've cataloged four failure modes that account for the vast majority of issues:

  1. Superpower selection errors: The LLM picks QueryLogs when it should pick IncidentResponse, or vice versa. This happens when intent descriptions are too similar or when the model lacks enough context to distinguish between them.
  2. Schema mismatches: The model generates arguments that look plausible but fail Pydantic validation. Classic example: sending "hours": "24" (string) instead of "hours": 24 (integer). OpenAI's function-calling generally produces valid JSON types matching the schema, but mismatches still crop up, especially with less capable models or when schemas use complex nested types.
  3. State corruption in chains: Step 2 in a chain depends on step 1's output, but the output was an error result rather than success data. If the chain doesn't check for error states between steps, downstream superpowers receive garbage input.
  4. Silent failures in composed superpowers: A sub-capability inside a composite superpower fails, but the composite handler catches the exception and returns partial results without flagging the failure. The agent carries on as if everything worked.
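
Failure mode 2 is worth demonstrating, because Pydantic's default behavior partially masks it: lax mode silently coerces numeric strings, while strict mode surfaces the mismatch. A quick demonstration (Pydantic v2):

```python
from pydantic import BaseModel, Field, ValidationError

class LaxInput(BaseModel):
    hours: int  # lax (default) mode coerces "24" -> 24

class StrictInput(BaseModel):
    hours: int = Field(strict=True)  # strict mode rejects the string

# Lax mode hides the model's type mistake
assert LaxInput(hours="24").hours == 24

# Strict mode turns it into a visible validation failure
try:
    StrictInput(hours="24")
    raised = False
except ValidationError:
    raised = True
assert raised
```

Whether you want coercion or rejection is a design choice: coercion keeps chains running; strict validation gives you a signal that the model is drifting from the schema.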

Debugging Strategies That Actually Work

import time
import uuid
from functools import wraps

def traced_superpower(func):
    """Decorator that adds full invocation tracing to a superpower handler."""
    @wraps(func)
    def wrapper(params, *args, **kwargs):
        trace_id = str(uuid.uuid4())[:8]
        start = time.time()
        logger.info("superpower_trace_start",
                     trace_id=trace_id,
                     superpower=func.__name__,
                     input=params.model_dump())
        try:
            result = func(params, *args, **kwargs)
            elapsed = time.time() - start
            logger.info("superpower_trace_end",
                         trace_id=trace_id,
                         superpower=func.__name__,
                         output=result.model_dump(),
                         elapsed_ms=round(elapsed * 1000),
                         status="success")
            return result
        except Exception as e:
            elapsed = time.time() - start
            logger.error("superpower_trace_end",
                          trace_id=trace_id,
                          superpower=func.__name__,
                          error=str(e),
                          elapsed_ms=round(elapsed * 1000),
                          status="failure")
            raise
    return wrapper

# Apply to handlers — and update the registry, since it still holds a
# reference to the untraced function registered earlier
query_logs_handler = traced_superpower(query_logs_handler)
SUPERPOWER_REGISTRY["QueryLogs"]["handler"] = query_logs_handler

For deterministic testing, mock superpowers at the registry level:

def mock_superpower(name: str, fixed_output: dict):
    """Replace a superpower's handler with a deterministic mock."""
    original = SUPERPOWER_REGISTRY[name]["handler"]
    output_schema = SUPERPOWER_REGISTRY[name]["output_schema"]

    def mock_handler(params):
        return output_schema(**fixed_output)

    SUPERPOWER_REGISTRY[name]["handler"] = mock_handler
    return original  # Return original for restoration

# Test: verify orchestration logic without real API calls
original = mock_superpower("QueryLogs", {
    "entries": [{"level": "error", "msg": "OOM killed"}],
    "total_count": 1,
    "truncated": False
})

result = run_agent("Check the payments service for memory issues")
assert "ticket" in result.lower() or "created" in result.lower()

# Restore
SUPERPOWER_REGISTRY["QueryLogs"]["handler"] = original

This approach breaks down when the model's tool selection itself is the bug, since mocking the handler doesn't change which tool the model picks. For selection-level debugging, build golden test suites: pairs of (user_message, expected_superpower_name) that you run against the model periodically to catch regressions. Store the full request payload (model version, system prompt, tool schemas) so you can replay exactly.
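A golden suite can be as simple as a list of message/expected-name pairs plus a scoring loop. The messages below are illustrative, and `select_tool` is a hypothetical wrapper around your model call that returns the name of the first tool the model picked:

```python
import json

# Golden pairs: (user_message, expected_superpower). Messages are illustrative.
GOLDEN_CASES = [
    ("Why is the payments service throwing 500s?", "QueryLogs"),
    ("File a high-priority bug for the auth timeout", "CreateTicket"),
    ("Ship v2.3.1 of gateway to staging", "TriggerDeployment"),
]

def run_golden_suite(select_tool) -> float:
    """Score tool-selection accuracy against the golden cases.

    `select_tool` is a hypothetical callable wrapping your model call; it
    takes a user message and returns the selected superpower's name.
    """
    hits = 0
    failures = []
    for message, expected in GOLDEN_CASES:
        picked = select_tool(message)
        if picked == expected:
            hits += 1
        else:
            failures.append({"message": message, "expected": expected, "picked": picked})
    if failures:
        print(json.dumps(failures, indent=2))
    return hits / len(GOLDEN_CASES)
```

Run the suite on every model upgrade, system prompt edit, and new superpower registration, and track the score over time alongside your other regression tests.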

Observability in Production

Track these metrics per superpower: invocation count, success rate, median and p99 latency, and validation failure rate. The metric I've found most telling is "capability drift": track the distribution of superpower selections over time. If QueryLogs normally accounts for 40% of invocations and suddenly drops to 15%, something changed. Maybe a system prompt edit made the intent description less attractive to the model, or a new superpower with overlapping intent started stealing selections. Set alerts on selection distribution shifts greater than 20% week-over-week.


Production Readiness Checklist

This checklist covers the critical verification points before deploying a superpowers-based agent to production. I've organized it by concern area so teams can divide and assign ownership.

Capability Design Readiness

  • [ ] Each superpower has a single, unambiguous responsibility (no two superpowers with overlapping intent)
  • [ ] Input/output schemas use Pydantic (or equivalent) with strict types, constraints, and field descriptions
  • [ ] Intent descriptions are written as "verb + object + context" sentences, tested against 50+ example prompts for selection accuracy
  • [ ] Permission boundaries are explicitly defined in the constraints block
  • [ ] Negative capabilities (what the agent must NOT do) are documented and enforced in the constraint checker
  • [ ] PII and secrets are never accepted as superpower input fields; redaction rules are defined
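To make the last two items concrete, here is one way a constraint checker might enforce negative capabilities and permission boundaries before a handler ever runs. The constraint fields (`denied_targets`, `read_only`) and the registry-entry shape are illustrative assumptions, not part of any library; the point is that the check lives in code, not in the prompt.

```python
class ConstraintViolation(Exception):
    """Raised when an invocation violates a superpower's constraints block."""

def check_constraints(superpower: dict, params: dict) -> None:
    constraints = superpower.get("constraints", {})
    # Negative capability: reject any string param touching a denied target
    for denied in constraints.get("denied_targets", []):
        for value in params.values():
            if isinstance(value, str) and denied in value:
                raise ConstraintViolation(
                    f"'{superpower['name']}' may not target '{denied}'"
                )
    # Permission boundary: a read-only capability cannot mutate anything
    if constraints.get("read_only") and params.get("mutate"):
        raise ConstraintViolation(f"'{superpower['name']}' is read-only")

query_logs = {
    "name": "QueryLogs",
    "constraints": {"read_only": True, "denied_targets": ["prod-secrets"]},
}

check_constraints(query_logs, {"service": "payments"})  # passes silently
```

Call `check_constraints` from the same execution layer that dispatches handlers, so a model that hallucinates a forbidden target fails loudly instead of succeeding quietly.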

Orchestration Readiness

  • [ ] Multi-step chains have per-step timeout policies (not just global timeouts)
  • [ ] State propagation between superpowers validates output type before passing to next step
  • [ ] Fallback behaviors are defined and tested for every superpower (timeout, error, invalid output)
  • [ ] Human-in-the-loop gates are configured for all destructive actions (deployments, deletions, data mutations)
  • [ ] Delegation protocols between agents are tested with integration tests covering handoff scenarios
  • [ ] Retry policies use exponential backoff with jitter; idempotency keys are attached to side-effecting operations
  • [ ] Per-agent and per-superpower cost controls (token budgets, invocation caps per run) are enforced
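The retry item above combines two ideas worth seeing together: exponential backoff with full jitter, and a single idempotency key generated once per logical operation and reused on every retry so a side-effecting superpower fires at most once downstream. This is a sketch, not a prescribed implementation; `call` stands in for any superpower handler invocation.

```python
import random
import time
import uuid

def call_with_retries(call, params, max_attempts=4, base_delay=0.5):
    """Invoke `call(params)` with exponential backoff and full jitter."""
    # One idempotency key per logical operation, reused across retries,
    # so the downstream system can deduplicate repeated attempts
    params = {**params, "idempotency_key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        try:
            return call(params)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface to the fallback behavior
            # Full jitter: sleep a random amount up to the exponential cap
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Note that the key is attached before the loop, not inside it; generating a fresh key per attempt would defeat the deduplication entirely.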

Debugging and Observability Readiness

  • [ ] Full invocation tracing is enabled with structured logging (trace ID, superpower name, input, output, latency, status)
  • [ ] Replay debugging is available: full request payloads (model version, prompts, tool schemas, tool outputs) are persisted for failed chains
  • [ ] Individual superpowers can be tested in isolation with mocked contexts using the mock registry pattern
  • [ ] Golden test suites exist for tool selection accuracy, run on every prompt/schema change
  • [ ] Capability hit rate and error rate dashboards are configured per superpower
  • [ ] Alerts are set for capability drift (selection distribution shifts > 20% week-over-week)

Security and Governance Readiness

  • [ ] Runtime guardrails enforce permission scoping at the execute_superpower layer, not just in the prompt
  • [ ] Audit logs capture every capability invocation with: actor (agent ID), superpower name, full input, full output, timestamp, and approval status
  • [ ] Sensitive superpowers require explicit approval workflows with timeout (auto-deny if no response within N minutes)
  • [ ] Rate limiting is applied per superpower and per agent, enforced server-side
  • [ ] Multi-tenant isolation: agents operating across customer contexts cannot access cross-tenant data through any superpower
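Server-side rate limiting per (agent, superpower) pair can be sketched as a sliding-window counter. In production you would back this with Redis or similar shared storage, but the enforcement point is what matters: it runs inside the execution layer, where the model cannot talk its way around it. The class and its parameters are illustrative.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter keyed by (agent_id, superpower name)."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls = defaultdict(deque)  # key -> recent call timestamps

    def allow(self, agent_id: str, superpower: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        window = self.calls[(agent_id, superpower)]
        # Evict timestamps that have aged out of the window
        while window and now - window[0] > self.window_s:
            window.popleft()
        if len(window) >= self.max_calls:
            return False  # deny; caller should surface a rate-limit error
        window.append(now)
        return True

limiter = RateLimiter(max_calls=3, window_s=60)
```

Checking the limiter before `check_constraints` and the handler keeps a runaway agent loop from exhausting a quota, and keying by agent ID means one misbehaving agent cannot starve the others.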

What This Means for the Future of Agent Engineering

The progression is clear: we went from prompt engineering (crafting the right words) to tool engineering (wiring up the right functions) and now we're entering capability engineering (defining what agents can and should do as structured, composable units). The superpowers pattern is one implementation of this shift, but the underlying principle is broader.

When you define a superpower, you're writing a contract between human intent and machine execution. That contract specifies not just the "what" but the "how much," the "under what conditions," and the "what if it fails." This contract-based approach opens the door to portable capability sets that agents can share, standardized interfaces that different LLM providers can target, and eventually marketplaces where teams publish and consume validated superpowers.

We're heading toward a world where writing agent software is less about implementing logic and more about defining the boundaries of machine autonomy. The engineers who thrive will be the ones who think in capabilities, not functions.
