Products Up incident

Issue with delta processing

Minor · Resolved

Products Up experienced a minor incident on September 9, 2025 affecting Data Processing, lasting 4h 37m. The incident has been resolved; the full update timeline is below.

Started
Sep 09, 2025, 11:53 AM UTC
Resolved
Sep 09, 2025, 04:30 PM UTC
Duration
4h 37m
Detected by Pingoru
Sep 09, 2025, 11:53 AM UTC

Affected components

Data Processing

Update timeline

  1. investigating Sep 09, 2025, 11:53 AM UTC

    Dear customers, we are seeing an elevated number of full processing runs where only the delta of changed products should be processed. We are currently investigating the issue. Best regards, Your Productsup Tech Operations Team

  2. investigating Sep 09, 2025, 12:46 PM UTC

    We are continuing to investigate this issue.

  3. identified Sep 09, 2025, 12:48 PM UTC

    Dear customers, the issue has been identified: the storage cluster that stores Site Workspaces is currently in a failed state. As a result, you may encounter workspace errors, which can be safely ignored. Delta processing runs and parallel processing may not work at all, but full processing runs will still succeed. Our infrastructure team is working to restore the cluster to a fully working state as soon as possible, and we will keep you posted on any new developments. Thank you for your understanding, Your Productsup Tech Operations Team

  4. monitoring Sep 09, 2025, 01:17 PM UTC

    Dear customers, the cluster has been rebooted into a working state. The delta operations, parallel processing, and export2datasource features should function normally again. We are continuing to monitor the restoration process and will debrief this incident later on. Thanks again for your patience and continued support, Your Productsup Operations Team

  5. resolved Sep 09, 2025, 04:30 PM UTC

    This incident has been fully resolved. All systems are operating normally. We will continue monitoring the infrastructure closely over the coming days to ensure continued stability.

  6. postmortem Sep 09, 2025, 04:31 PM UTC

    # Service Incident Notification - Ceph Storage Cluster (Processing)

    **Date:** September 9, 2025
    **Incident Reference:** CEPH-2025-009
    **Status:** RESOLVED

    ## What Happened

    On September 9, 2025, our storage cluster experienced a service disruption while implementing a planned infrastructure upgrade to enable multi-datacenter operations. During the activation of "stretch mode" (a feature that allows data to be synchronized across multiple datacenters for improved disaster recovery), the cluster's internal authentication system encountered an unexpected failure.

    ## Customer Impact

    **Duration:** 12:30 → 15:00 (2hr30m)
    **Affected Services:** Data Processing
    **Data Safety:** All customer data remained secure and intact throughout the incident. No data was lost or corrupted.

    During the incident period, customers may have experienced:

    * Temporary inability to access stored workspaces or export2datasource
    * Intermittent connectivity issues with applications using the storage service

    ## Root Cause

    The issue occurred when our storage system automatically reconfigured all data pools during the multi-datacenter setup process. This reconfiguration inadvertently affected critical system components responsible for user authentication, preventing normal access to the cluster even though the underlying data remained safe and accessible.

    ## Resolution

    Our engineering team worked to restore service by upgrading the storage cluster software to a newer version that provided additional recovery options not available in the previous version. This upgrade allowed us to safely reverse the multi-datacenter configuration and restore normal cluster operations.

    ## What We're Doing to Prevent This

    * **Enhanced Testing:** We are improving our testing environments to better replicate production conditions and catch similar issues before they affect live services
    * **Software Version Management:** We are updating our deployment standards to use newer software versions that provide better recovery capabilities for major configuration changes
    * **Monitoring Improvements:** We are implementing additional real-time monitoring for authentication systems during major infrastructure changes
    * **Rollback Procedures:** We are developing more comprehensive rollback procedures for complex infrastructure modifications

    ## Our Commitment

    We sincerely apologize for any inconvenience this incident may have caused. Data security and service reliability are our highest priorities. We are committed to learning from this experience and implementing the necessary improvements to prevent similar incidents in the future.

    If you have any questions or concerns about this incident, please contact our support team.