Concurrency & Observability
Lane-based locking, trace IDs, and production debugging — the infrastructure that keeps autonomous agents from colliding.
The Collision Problem
Think of it as two people editing the same Google Doc at the same time — neither sees what the other is typing, and every few minutes the document just resets to an old version. That’s what happens when an autonomous agent is running multiple surfaces without coordination.
An autonomous agent isn’t one process. It’s multiple surfaces sharing the same database, the same skills, and the same memory:
Surface 1: HEARTBEAT (cron, every 30 min)
Surface 2: AGENT-OPERATE (admin interaction)
Surface 3: CHAT-COMPLETION (visitor chat)
Surface 4: WEBHOOKS (external triggers)
What happens when a heartbeat fires while the admin is mid-conversation with the agent? Both try to:
- Read and write to
agent_memory - Execute skills that modify state
- Update
agent_objectivesprogress - Log to
agent_activity
Without coordination, you get:
- Race conditions — heartbeat overwrites memory that operate just wrote
- Duplicate work — both surfaces execute the same pending automation
- Corrupted state — partial writes from interrupted operations
- Billing surprises — parallel API calls double your token spend
Lane-Based Locking
FlowPilot uses a simple, effective concurrency model: lane-based advisory locks stored in the agent_locks table.
The Model
Each agent surface claims a “lane” before operating. Only one process can hold a lane at a time.
CREATE TABLE agent_locks (
lane TEXT PRIMARY KEY,
locked_by TEXT NOT NULL,
locked_at TIMESTAMPTZ DEFAULT now(),
expires_at TIMESTAMPTZ DEFAULT now() + interval '5 minutes'
);
Acquiring a Lock
import { tryAcquireLock, releaseLock } from '../_shared/concurrency.ts';
// In heartbeat
const acquired = await tryAcquireLock(supabase, 'heartbeat', 'heartbeat-cron', 300);
if (!acquired) {
console.log('[heartbeat] Another instance is running, skipping');
return; // Graceful exit, no error
}
try {
// ... run heartbeat logic ...
} finally {
await releaseLock(supabase, 'heartbeat');
}
Lock Lanes
| Lane | Holder | Purpose |
|---|---|---|
heartbeat | Cron trigger | Prevents overlapping heartbeat cycles |
operate | Admin session | Prevents heartbeat from interfering with interactive use |
automation:{id} | Heartbeat | Prevents duplicate automation execution |
objective:{id} | Any surface | Prevents parallel progress on same objective |
TTL-Based Expiry
Locks auto-expire after 5 minutes (configurable). This prevents deadlocks from crashed processes:
-- The RPC function checks expiry atomically
CREATE OR REPLACE FUNCTION try_acquire_agent_lock(
p_lane TEXT, p_locked_by TEXT, p_ttl_seconds INT DEFAULT 300
) RETURNS BOOLEAN AS $$
BEGIN
-- Delete expired locks
DELETE FROM agent_locks WHERE lane = p_lane AND expires_at < now();
-- Try to insert
INSERT INTO agent_locks (lane, locked_by, expires_at)
VALUES (p_lane, p_locked_by, now() + (p_ttl_seconds || ' seconds')::interval)
ON CONFLICT (lane) DO NOTHING;
-- Check if we got it
RETURN EXISTS (
SELECT 1 FROM agent_locks
WHERE lane = p_lane AND locked_by = p_locked_by
);
END;
$$ LANGUAGE plpgsql;
Why Not Redis?
For a self-hosted system running on a single Supabase instance, PostgreSQL advisory locks are:
- Simpler — no additional infrastructure
- Sufficient — agent concurrency is low (max 4-5 concurrent surfaces)
- Persistent — lock state survives edge function cold starts
- Auditable — you can query
agent_locksto see current state
Redis would be overkill. If you’re running hundreds of agent instances, you need Redis (or something equivalent). For a single-tenant self-hosted deployment, PostgreSQL is the right tool.
Trace IDs: Following the Thread
The heartbeat runs 11 operations in 45 seconds — self-healing, planning, automating, reflecting, remembering. Without a trace ID, debugging is like trying to piece together a crime scene from scattered witnesses who don’t agree on the timeline.
With a trace ID, you get a complete story. Every operation in a single autonomous run is linked under one ID. You can see what happened — in order, from start to finish.
When a heartbeat runs, it might:
- Self-heal 2 skills
- Advance 3 objective steps
- Execute 2 automations
- Reflect on 7 days of performance
- Save 4 memories
That’s 11+ operations across multiple tables. Without a correlation ID, debugging is archaeology — piecing together timestamps and hoping they align.
The Solution
Every autonomous run generates a trace ID that flows through the entire operation chain:
import { generateTraceId } from '../_shared/trace.ts';
const traceId = generateTraceId('hb'); // hb_m3k9f2_a7x2p1
The trace ID is:
- Human-readable — prefix tells you the surface (
hb= heartbeat,op= operate) - Sortable — timestamp component enables chronological ordering
- Unique — random suffix prevents collisions
Propagation
The trace ID is passed through every function call and stored in every activity log:
// Heartbeat creates trace
const traceId = generateTraceId('hb');
// Passed to reasoning engine
const result = await reason({
...config,
metadata: { traceId },
});
// Stored in activity logs
await supabase.from('agent_activity').insert({
skill_name: 'advance_plan',
conversation_id: traceId, // Reuses conversation_id for trace correlation
status: 'success',
token_usage: usage,
});
Querying by Trace
“Show me everything that happened in the last heartbeat”:
SELECT skill_name, status, duration_ms, token_usage, created_at
FROM agent_activity
WHERE conversation_id = 'hb_m3k9f2_a7x2p1'
ORDER BY created_at;
This returns the complete story of a single autonomous run — every skill called, every result, every failure — in chronological order.
The Activity Log: Structured Observability
Every agent action is logged to agent_activity with a consistent schema:
{
id: uuid,
agent: 'flowpilot' | 'visitor_chat',
skill_name: string, // What was attempted
skill_id: uuid | null, // Link to skill definition
status: 'success' | 'error' | 'pending_approval' | 'skipped',
input: json, // What was sent (sanitized)
output: json, // What came back
error_message: string | null, // If failed, why
token_usage: { // Cost tracking
prompt_tokens: number,
completion_tokens: number,
total_tokens: number
},
duration_ms: number, // Performance tracking
conversation_id: string, // Trace ID for correlation
created_at: timestamptz
}
What This Enables
- Cost attribution — Which skills consume the most tokens?
- Performance monitoring — Which skills are slowest?
- Failure tracking — Which skills fail most often? (feeds self-healing)
- Audit trail — What did the agent do, when, and why?
- Trace reconstruction — Follow a single autonomous run end-to-end
Self-Healing: Observability as Input
The activity log isn’t just for humans. The agent reads its own logs during the self-healing phase of every heartbeat:
SELF-HEAL phase:
1. Query agent_activity for last 3 days
2. Group by skill_name
3. Find skills with 3+ consecutive failures
4. Auto-disable failing skills
5. Disable linked automations
6. Inject healing report into prompt
This closes the observability loop: the agent monitors itself and acts on what it sees. A failing skill doesn’t just generate alerts — it gets quarantined automatically.
The Engine Room Dashboard
All observability data surfaces in the admin UI through the “Engine Room” — a real-time view of agent operations:
| Panel | Data Source | Shows |
|---|---|---|
| Activity Feed | agent_activity | Recent actions with status, duration, tokens |
| Token Spend | agent_activity.token_usage | Cumulative cost by skill and time period |
| Skill Health | agent_activity aggregated | Success rates, failure streaks |
| Active Locks | agent_locks | Currently held lanes |
| Memory Usage | agent_memory count | Total memories by category |
| Objectives | agent_objectives | Progress on active goals |
The Engine Room answers the operator’s core question: “What is my agent doing right now, and is it working?”
Tool Hallucination Recovery
LLMs sometimes “hallucinate” tool calls — requesting tools that don’t exist or passing malformed parameters. Without recovery, this crashes the reasoning loop and leaves the agent in an undefined state.
FlowPilot handles this with a structured recovery pattern:
1. LLM calls non-existent tool
→ Catch the "unknown tool" error
→ Do NOT abort the session
2. Inject correction message into conversation:
"Tool 'X' doesn't exist. Available tools: [list of valid tool names]
Please try again with one of the available tools."
3. Re-enter reasoning loop (max 2 retries)
4. If still failing after retries:
→ Log error with full details (tool name attempted, parameters, context)
→ Exit gracefully with summary of what was accomplished
→ Flag the stall in agent_activity for review
Why This Happens
Hallucinated tool calls are more common than you’d expect, especially when:
- The agent has been given context about tools that no longer exist (stale skill definitions)
- The model infers a tool should exist based on domain knowledge (“surely there’s a
send_smsskill”) - The model confuses skill names across similar domains
The Recovery Loop
for (let attempt = 0; attempt < MAX_TOOL_RETRIES; attempt++) {
const result = await executeTool(toolName, params);
if (result.error === 'UNKNOWN_TOOL') {
const availableTools = skills.map(s => s.name).join(', ');
conversation.push({
role: 'system',
content: `Tool '${toolName}' does not exist. Available tools: ${availableTools}. Please use one of these.`
});
continue; // Re-enter reasoning loop
}
if (result.error === 'INVALID_PARAMS') {
conversation.push({
role: 'system',
content: `Tool call failed: ${result.error}. Schema: ${JSON.stringify(skill.tool_definition)}`
});
continue;
}
break; // Success
}
The recovery message is injected as a system role message — not a user message — so the agent understands it as infrastructure feedback, not user input.
The Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No concurrency control | Race conditions, duplicate work | Lane-based locking |
| No trace IDs | Can’t debug autonomous runs | Generate and propagate trace IDs |
| Unstructured logs | console.log everywhere, no queryable data | Structured activity log |
| Logs for humans only | Agent can’t learn from its failures | Self-healing reads activity log |
| No TTL on locks | Crashed process holds lock forever | Auto-expiry with TTL |
| Over-engineering (Redis, Kafka) | Complexity without benefit for single-tenant | PostgreSQL is sufficient |
Concurrency and observability aren’t glamorous. They’re plumbing. But plumbing is what separates a demo from a product. Without it, your agent works in the lab and fails in production. With it, you can sleep while your agent runs — and know exactly what it did when you wake up.
Next: the skills ecosystem and how capabilities are organized. Skills Ecosystem →