Transient errors across API endpoints due to database failover
Timeline · 1 update
- resolved Jun 18, 2026, 01:09 AM UTC
We observed a brief (< 1 minute) period of API errors due to a database failover. This incident has been resolved.
There were 39 Orb outages since July 5, 2024, totaling 64h 45m of downtime and averaging 1.6 incidents per month. Each is summarized below with incident details, duration, and resolution information.
We observed a brief (< 1 minute) period of API errors due to a database failover. This incident has been resolved.
We observed processing delays and queueing between 06/07/26 00:00 UTC and 01:00 UTC, with full recovery to established baselines at 01:30 UTC. We believe this will not recur, but will continue to monitor and increase capacity to prevent this in the future. Customers with dedicated webhook SLAs and provisioning were not impacted.
We have identified and are actively fixing an issue with usage processing for a subset of data, affecting alerting and usage-based invoice issuance. APIs and data ingestion remain operational.
The issue has been identified and a fix is being implemented.
A fix has been implemented, and we are monitoring as latency recovers in async services.
This incident has been resolved.
We're investigating elevated error rates across API endpoints.
We're seeing recovery as of 12:33 AM UTC, and are continuing to monitor for impact.
Errors have continued to stay mitigated as of 00:33 UTC (approximately 40 mins ago).
We're working to mitigate an infrastructure issue, which may lead to intermittent latency spikes (each of which should last a few seconds), resulting in a higher rate of client-side retries. We apologize in advance for the disruption, and we're working to resolve the situation.
This incident has been resolved.
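For API clients, short latency spikes and transient errors like those described above are typically absorbed by client-side retries. The following is a minimal sketch of retrying with exponential backoff and full jitter, not Orb's documented client behavior: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names introduced for illustration, and the base delay, cap, and attempt count are assumed values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. timeouts, HTTP 5xx)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Upper bound (seconds) on the sleep after failed attempt `attempt` (0-indexed)."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.5,
    cap: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run `call`, retrying on TransientError with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random fraction of the exponential bound,
            # so many clients retrying at once do not synchronize.
            sleep(rng() * backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

Capping the delay and the attempt count keeps a multi-second spike from turning into an unbounded retry storm. Only idempotent requests should be retried this way.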
We identified elevated latencies for fetching usage (and potentially some associated actions that required invoicing via manual action). The vast majority of the impact was from April 3 21:22 to 21:26 UTC. The remaining small share of errors (<1%) was fully mitigated by 21:39 UTC.
This is now resolved.
Following a deployment at 5:40 UTC, the Orb API started experiencing elevated timeouts for applying and cancelling subscription pending changes. We have rolled back, errors have fully subsided as of 12:55 UTC, and we are continuing to monitor.
We identified the root cause to be a new query that was introduced. API traffic to the affected endpoints has been healthy since the rollback at 12:55 UTC.
We're seeing some async delays on invoicing, webhooks, and alerts. APIs are not impacted at this time.
We've confirmed the source of the issue and are working on a fix.
We are continuing to work on a fix for this issue.
We have applied a fix and are monitoring recovery.
The vast majority of async workloads have caught up, and we will continue to monitor our services over the next few hours.
From 02/26/2026 02:16 to 02/26/2026 04:16 UTC, there was a delay in event ingestion caused by our scheduled maintenance. This disruption may have delayed alerting, threshold invoices, and the issuance of top-up blocks. No data loss occurred.
We’re currently experiencing some lag on the web application and are actively investigating the cause.
This incident has been resolved.
We observed an increase in page load failures in the Orb dashboard and a subset of APIs (5%) starting at 02:25 UTC for approximately 10 minutes. Services are now recovered, and we are continuing to monitor. We will provide more detailed updates to affected customers. Data ingestion was not impacted.
We have not seen any new related errors, and this incident is now resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to monitor for any further issues.
This incident has been resolved.
We've identified and are investigating an issue with invoice issuance delays. No data or API impact at this time.
We have applied a fix and will continue to monitor invoice issuance.
This incident has been resolved.
We are currently investigating intermittent webapp load failures for some customers. API, ingestion, and invoicing are unaffected.
We've implemented a fix and are working on rolling it out.
This incident has been resolved.
We're looking into elevated API times on a subset of ingestion API requests. This does not affect any ingestion via S3.
We've identified the issue, and API latency is recovering - we're continuing to monitor.
Continuing to see quick recovery on API latencies. We'll provide any further updates in 10 minutes, or close out this incident.
API latencies have returned to normal, and services are fully recovered.
We've identified some revenue reporting delays in our pipeline; billing and data ingestion are not affected. Values from the API are also not affected. We expect to catch up within 24 hours, and will continue providing updates here.
As of 22:00 UTC, our revenue reporting service has caught up and is now processing data as expected.
A recent code deploy caused a minor increase in analytics request errors. Our automated canary systems rolled this back as part of our deploy process without any operator intervention. We're continuing to monitor to ensure errors do not recur, and the offending logic is not deployed. This did not impact writes or data ingestion.
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
Orb is affected by a cloud provider incident and this is affecting our asynchronous workload. Currently this is not affecting APIs or data ingestion, but we're continuing to monitor in partnership with our service provider.
We're seeing recovery and are continuing to monitor.
We're not seeing any continued impact on our services.
We experienced an elevated level of API errors, which has now subsided. We are actively monitoring for further errors.
This incident has been resolved.
We are experiencing periods of elevated API response times specifically on read and analytics requests. Our oncall team is currently investigating. Event ingestion, invoice finalization, and write requests are unaffected.
We've identified a failure mode tied to a degenerate workload that degrades query performance for one of our usage datastores. A temporary resolution has been put in place to ensure system stability; our oncall team will continue to monitor the error rates for the API and adjust workloads as needed to ensure API availability.
After ongoing monitoring, this issue is now fully resolved. Endpoints have been recovered for the last several hours; the last period of errors occurred before 05/16/25 08:30 UTC.
Due to a large increase in query volume, analytics and read queries were degraded for a time period, resulting in a brief period of API timeouts (approx. 08:19 UTC - 08:26 UTC). We apologize for the disruption, and are working to provision more capacity as a result.
From 04/07/25 07:14 to 04/07/25 15:28 UTC, customers attempting to load usage graph data for upcoming invoices in their customer portal or on the invoice portal would have encountered an error that prevented the graph from rendering. This did not impact internal views of the invoice, calculations of invoice or usage amounts, invoice issuance, or usage ingestion. The cause of this issue was a recently modified front-end component that did not properly account for data routing behavior in external portal views. The issue was resolved at 15:28 UTC.
From approximately 12:57pm PT to 1:06pm PT on 2025-04-01, customers may have seen an elevated rate of API response errors. This issue was caused by an increase in connection timeouts to a system cache for usage data. It has since been resolved, and all connections have been restored.
From approximately 12:17pm PT to 12:22pm PT, customers may have seen elevated API response times and error rates. The root cause was elevated latency on our event data store. This issue has been resolved.