Trust · SLA and reliability
Because every AI action in your organisation passes through Xybern before it executes, our availability is your AI availability. We treat uptime as an operational obligation, not a marketing number. This page explains our availability architecture, our fail behaviour, and what happens to your AI operations during any Xybern incident.
Section 01 · The inline availability problem
If your email platform goes down, emails queue. If your analytics platform goes down, dashboards are unavailable. These are inconveniences. If Xybern goes down and you are using it as your mandatory AI execution pathway, the question is different: do your AI systems continue operating without enforcement? The answer to that question defines your risk exposure during any Xybern incident. We have designed the system so the answer is always no: AI systems do not operate without enforcement, even during incidents. Here is how.
Section 02 · Fail behaviour
Xybern's default fail behaviour is closed: if the enforcement layer is unreachable, AI actions queue at the enforcement boundary and do not execute. This is a deliberate architectural decision, not a fallback. We explain it here, with the full consequence of each option, because you should choose your fail behaviour consciously.
Fail-open · not the Xybern default
Xybern unreachable → AI actions execute anyway
Fail-closed · Xybern default
Xybern unreachable → AI actions queue. None execute.
Default for all deployments. The queue TTL is configurable; actions not processed within the TTL are rejected with a permanent record.
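To make the queue-and-TTL behaviour concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Xybern's implementation: the class `FailClosedQueue`, its method names, and the shape of the rejection record are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedAction:
    action_id: str
    enqueued_at: float  # seconds on any consistent clock


class FailClosedQueue:
    """Hypothetical model of a fail-closed boundary with a queue TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.pending = deque()   # actions waiting for enforcement, oldest first
        self.rejections = []     # stands in for the permanent record

    def hold(self, action_id, now):
        # Fail-closed: while enforcement is unreachable the action is
        # queued and never executed.
        self.pending.append(QueuedAction(action_id, now))

    def expire(self, now):
        # Reject every action that has waited longer than the TTL.
        # The queue is ordered by arrival, so only the head needs checking.
        expired = []
        while self.pending and now - self.pending[0].enqueued_at > self.ttl_seconds:
            action = self.pending.popleft()
            self.rejections.append(
                {"action_id": action.action_id, "reason": "ttl_expired", "rejected_at": now}
            )
            expired.append(action.action_id)
        return expired
```

With a 60-second TTL, an action held at `t=0` is rejected by `expire(61)`, while one held at `t=30` stays queued for evaluation on recovery.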
Section 03 · Uptime history
This is the operational record, not the SLA promise. Every bar represents one day. Green is fully operational. The single amber bar represents the one incident in the last 90 days: a 23-minute degraded-performance window during which enforcement continued operating at reduced throughput. No enforcement gaps occurred.
There have been zero enforcement gaps (AI actions that executed without passing through Xybern) across all 90 days, including the incident window. Fail-closed architecture maintained enforcement continuity throughout.
Section 04 · Incident response
For an inline enforcement system, incident response is not just about restoring the service. It is about ensuring enforcement continuity throughout the incident and providing a complete record of what happened to every AI action during the window. This is how we handle it.
Detect
Automated detection — under 1 second.
Continuous health checks across all enforcement layer components. Anomaly detection fires within 1 second of any degradation. No manual monitoring required for initial detection.
Classify
Incident classified within 2 minutes.
Every incident is classified by impact level: Degraded (enforcement operating at reduced throughput), Partial outage (some regions affected), or Full outage (enforcement boundary closed, fail-closed active). Classification determines the response path.
Contain
Fail-closed engaged if enforcement is at risk.
If incident classification indicates enforcement may be compromised, fail-closed is engaged automatically. AI actions queue. No enforcement gap opens. The queue TTL timer starts. This happens before any human is paged.
Resolve
Service restored. Queue processed in sequence.
On resolution, the enforcement queue is processed in the order actions arrived, oldest first. Every queued action receives a full enforcement evaluation. No action is skipped. The vault records the queued status and the enforcement timestamp for each.
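The recovery step above can be sketched as a first-in, first-out drain. This is a simplified illustration only: `drain_queue`, the `evaluate` and `clock` callables, and the record fields are hypothetical stand-ins for the enforcement evaluation and the vault record.

```python
from collections import deque


def drain_queue(pending, evaluate, clock):
    """Process queued actions oldest first and return one record per action.

    pending  -- deque of action IDs in arrival order
    evaluate -- callable that runs the enforcement evaluation for one action
    clock    -- callable returning the current timestamp
    """
    records = []
    while pending:
        action_id = pending.popleft()      # oldest action first
        decision = evaluate(action_id)     # no action is skipped
        records.append(
            {
                "action_id": action_id,
                "queued": True,            # the action waited at the boundary
                "decision": decision,
                "enforced_at": clock(),    # when enforcement actually ran
            }
        )
    return records
```

For example, `drain_queue(deque(["a1", "a2"]), evaluate=my_policy, clock=time.time)` would evaluate `a1` before `a2`, where `my_policy` is whatever evaluation function you supply.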
Report
Post-incident report within 24 hours.
Every incident receives a written post-mortem within 24 hours: root cause, timeline, enforcement continuity record, and remediation. Enterprise customers receive the report directly. The enforcement continuity record shows every AI action that queued, every action that was processed on recovery, and confirms zero enforcement gaps.
| Severity | Response time | Enforcement behaviour |
|---|---|---|
| Degraded performance | 15 minutes | Enforcement continues at reduced throughput |
| Partial outage | 5 minutes | Fail-closed engaged for affected regions |
| Full outage | 2 minutes | Fail-closed engaged globally, queue active |
| Security incident | Immediate | All enforcement suspended, security team engaged |
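If you track these response targets in your own tooling, the table can be encoded as a small lookup. The sketch below copies the response times from the table; the names `Severity`, `RESPONSE_MINUTES` and `response_breached` are hypothetical, and "Immediate" is represented as 0 minutes as a simplifying assumption.

```python
from enum import Enum


class Severity(Enum):
    DEGRADED = "Degraded performance"
    PARTIAL_OUTAGE = "Partial outage"
    FULL_OUTAGE = "Full outage"
    SECURITY_INCIDENT = "Security incident"


# Response targets in minutes, taken from the table above.
# Assumption: "Immediate" is modelled as 0.
RESPONSE_MINUTES = {
    Severity.DEGRADED: 15,
    Severity.PARTIAL_OUTAGE: 5,
    Severity.FULL_OUTAGE: 2,
    Severity.SECURITY_INCIDENT: 0,
}


def response_breached(severity, minutes_since_detection):
    """Return True if the response target for this severity has been missed."""
    return minutes_since_detection > RESPONSE_MINUTES[severity]
```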
Section 05 · Health monitoring API
Programmatic access to real-time system health via the health endpoint. Monitor enforcement layer status, uptime metrics, incident history and, most importantly, enforcement gap count from your own infrastructure. Integrate with PagerDuty, Datadog, Grafana or any monitoring tool that accepts a JSON health response.
enforcement_gaps_30d
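Count of enforcement gaps (AI actions that executed without passing through Xybern) over the last 30 days. Always zero if fail-closed is active.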
fail_behaviour
Current fail mode. closed is the default. open means fail-open has been explicitly configured.
queue_depth
Actions queued at the enforcement boundary. Non-zero only during incidents.
response_time_p99
99th-percentile enforcement latency in milliseconds. SLA commits to under 100ms P99.
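As a rough illustration of how these four fields might drive alerting, the sketch below checks a parsed health response against the expectations described above. The sample payload is invented for the example, and the function name `evaluate_health` and the alert strings are hypothetical; only the field names and the 100ms P99 commitment come from this page.

```python
import json

P99_SLA_MS = 100  # the SLA commits to under 100ms P99


def evaluate_health(payload):
    """Return a list of alert strings for a parsed health response."""
    alerts = []
    if payload["enforcement_gaps_30d"] != 0:
        alerts.append("enforcement gap detected")
    if payload["fail_behaviour"] != "closed":
        alerts.append("fail-open is configured")
    if payload["queue_depth"] > 0:
        alerts.append("actions queued at the enforcement boundary")
    if payload["response_time_p99"] >= P99_SLA_MS:
        alerts.append("P99 latency at or above SLA threshold")
    return alerts


# Invented sample response; the real payload may carry more fields.
sample = json.loads(
    '{"enforcement_gaps_30d": 0, "fail_behaviour": "closed",'
    ' "queue_depth": 12, "response_time_p99": 48}'
)
alerts = evaluate_health(sample)
```

A non-empty result can be forwarded to whichever paging or dashboard tool you already use.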
Service-level health per component
Each Xybern subsystem reports health independently. A degraded verification_engine with a healthy provenance_vault means decisions are still being recorded even if throughput is reduced.
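A small sketch of how per-component health could be interpreted on the client side. It assumes, purely for illustration, that component health arrives as a mapping from component name to a status string such as `"healthy"` or `"degraded"`; the actual response shape may differ, and both helper names are hypothetical.

```python
def degraded_components(components):
    """Return the names of all components not reporting 'healthy', sorted."""
    return sorted(name for name, status in components.items() if status != "healthy")


def decisions_still_recorded(components):
    """True if the provenance vault is healthy, so decisions are being recorded."""
    return components.get("provenance_vault") == "healthy"


# Assumed shape: component name -> status string.
components = {"verification_engine": "degraded", "provenance_vault": "healthy"}
```

In this example the verification engine is flagged as degraded, yet `decisions_still_recorded(components)` is true, which matches the scenario described above.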
Region-scoped responses
The health response includes the region field, the infrastructure region this endpoint is reporting on. For multi-region deployments, query each region's endpoint independently.
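For multi-region deployments, one way to poll each region independently is sketched below. The endpoint URLs are placeholders, and the HTTP call is injected as a `fetch` callable so that no client library or URL scheme is assumed; only the `region` and `enforcement_gaps_30d` field names come from this page.

```python
def collect_region_health(endpoints, fetch):
    """Query each regional endpoint and key the results by reported region.

    endpoints -- iterable of health endpoint URLs (placeholders here)
    fetch     -- callable that takes a URL and returns the parsed JSON response
    """
    results = {}
    for url in endpoints:
        payload = fetch(url)
        results[payload["region"]] = payload
    return results


def total_enforcement_gaps(results):
    """Sum the 30-day enforcement gap count across all regions."""
    return sum(payload["enforcement_gaps_30d"] for payload in results.values())
```

In practice `fetch` would wrap your HTTP client of choice and handle authentication, timeouts and retries according to your own monitoring standards.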
Guaranteed.
99.9% uptime SLA, fail-closed architecture and zero enforcement gaps in production. SOC 2 certification is currently in progress. If you are deploying AI systems in a regulated environment, this is the reliability standard you need.