2026-01-08 · SEV-3 · Resolved
INC-2026-001: Runs API elevated errors (pilot)
Elevated error rates for run retrieval in one pilot tenant due to upstream cache invalidation.
Impact
Some requests returned stale run metadata or transient failures for ~17 minutes in one tenant.
Root cause
Invalidation job ran with an overly broad key prefix in pilot environment, impacting shared cache namespace.
Timeline (high level)
- 09:12 UTCAlert triggered for elevated 5xx on run retrieval.
- 09:18 UTCScoped to single tenant; cache invalidation job suspected.
- 09:25 UTCMitigation: flush affected cache keys; traffic normalizes.
- 09:40 UTCGuardrails added for invalidation prefix; additional monitoring enabled.
Remediations
- Key-space isolation tightened per tenant.
- Added safeguard to prevent broad invalidation without approval.
- Added alerting on invalidation rate and cache churn.
Next steps
- Finalize cache isolation as GA requirement (GA-001).
- Document runbook for cache incidents.
NDA note: This report is a pilot-stage summary; detailed evidence may be shared under NDA via Trust Pack.