Proemion incident

Partial mobile network connection problems

Proemion experienced a minor incident on August 3, 2018 affecting Mobile Network Services, lasting 1d 3h. The incident has been resolved; the full update timeline is below.

Started
Aug 03, 2018, 12:31 PM UTC
Resolved
Aug 04, 2018, 04:19 PM UTC
Duration
1d 3h
Detected by Pingoru
Aug 03, 2018, 12:31 PM UTC

Affected components

Mobile Network Services

Update timeline

  1. identified Aug 03, 2018, 12:31 PM UTC

    One of our mobile network providers is currently facing infrastructure problems. Our partner is actively working on resolving the problem. At the moment we are monitoring the impact on our customers. We currently estimate that about 5% of devices across our customers are affected. Unfortunately, for individual customers all devices might still be impacted. The affected devices might not be able to establish a proper online connection. In that case, Real-Time connections are not possible. Logging data is buffered on the devices, which mitigates the problem. We do not expect any data loss. We are sorry for the inconvenience and are in active contact with the affected mobile network provider.

  2. identified Aug 03, 2018, 12:32 PM UTC

    We are continuing to work on a fix for this issue.

  3. identified Aug 03, 2018, 12:35 PM UTC

    We are continuing to work on a fix for this issue.

  4. monitoring Aug 03, 2018, 05:25 PM UTC

    We have not received a "green light" from the affected mobile network provider yet. We are currently observing more affected customer machines, mainly in North America. For Asia and Europe the weekend low has already arrived and the effects are less visible. We will continue to monitor the problem and keep you informed.

  5. monitoring Aug 04, 2018, 01:33 PM UTC

    We still have not received a "green light" from the affected mobile network provider. Currently, we observe that 17% of machines are impacted by the mobile network provider issue. We will continue to monitor the problem and keep you informed.

  6. monitoring Aug 04, 2018, 03:30 PM UTC

    The affected mobile network provider and their signaling vendor have started to activate traffic. This process is done with caution in order to prevent a potential overload of the mobile network provider's infrastructure.

  7. monitoring Aug 04, 2018, 03:51 PM UTC

    The affected mobile network provider and their signaling vendor have started to activate traffic. This process is done slowly and with caution in order to prevent a potential overload of the mobile network provider's infrastructure. At Proemion we can see that successful connections are being made to our servers and that the first affected communication units are coming online. We will keep you updated on the progress.

  8. monitoring Aug 04, 2018, 04:14 PM UTC

    We are continuing to monitor for any further issues.

  9. resolved Aug 04, 2018, 04:19 PM UTC

    The incident has been resolved. All affected communication units are back online and no data has been lost. We are continuing to monitor the situation.

  10. postmortem Aug 06, 2018, 12:46 PM UTC

    ### Course of events

    On Friday, August 3rd, 2018, after 11:00 CEST, some of our end-to-end monitoring devices did not go online after a reset. The automatic monitoring infrastructure detected this and alerted the PROEMION on-call service. The staff investigated the root cause and found that other communication units with SIM cards from the affected provider also no longer went online after a power-on reset. Devices with SIM cards from this provider that already had an active online connection continued their sessions without problems.

    At 14:00 CEST our support contacted the SIM provider. The contact confirmed that there was a problem with the online connectivity of their SIM cards when roaming and that no action was required on PROEMION's side. At this point we also estimated the number of affected devices by comparing the number of connected communication units to the average number for the same time on Fridays. About 5% fewer communication units than usual were connected to our data platform at that time.

    At 14:30 CEST we announced the outage on our status page. During the outage we:

    * continuously monitored the situation,
    * frequently contacted the SIM provider to get updates,
    * posted updates to the status page to keep our customers informed.

    At the peak of the outage, about 17% fewer communication units were connected than in normal weekend traffic.

    On Saturday, August 4th, 2018, between 16:35 and 16:40 CEST, we saw a significant step up in the number of connected communication units. Afterwards the number of connected communication units was within the normal range of the average value for this time on Sundays. The SIM provider also informed us that the problem should be resolved and that connections should be possible again. We updated the information on our status page after confirming that connectivity was normal again and the outage had been resolved. Nevertheless, we continued to monitor the situation to get early information about a potential recurrence of the problem.

    ### Review

    No log data was lost during the connectivity outage. The data was stored in the internal non-volatile buffer of our communication units and was transferred to our system after the successful reconnection.

    In general, we see minor possibilities for improving our monitoring setup to reduce the identification time for such 3rd party outages. We lost some of our end-to-end monitoring capabilities during the outage. We are currently implementing a fall-back solution for our end-to-end monitoring for the case of a full provider outage. The core improvement will be to push our partners to provide better communication and information flow towards us.

    ### Conclusion

    We believe that it is very important to inform our customers about issues with our system as early and as comprehensively as possible. In the case of 3rd party systems, such as a telecommunications provider's systems, we do not have control over the containment and recovery activities at the 3rd party provider. Especially in these cases, we continuously monitor the situation together with the provider and regularly post updates to our status page.

    We apologize for any issues that may have occurred on our customers' side during the outage. Even though our influence on the resolution of this issue was low, we did our best to keep you informed.
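
The postmortem estimates impact by comparing the number of connected communication units to the average for the same weekday and time. A minimal Python sketch of that kind of baseline comparison follows. The function names, the sample counts, and the 5% alert threshold are hypothetical illustrations, not Proemion's actual monitoring code or data.

```python
from statistics import mean


def connection_shortfall(current: int, baseline_samples: list[int]) -> float:
    """Return the fraction by which `current` falls below the baseline average.

    `baseline_samples` are connected-unit counts observed at the same weekday
    and time in earlier weeks (hypothetical input format).
    """
    if not baseline_samples:
        raise ValueError("need at least one baseline sample")
    baseline = mean(baseline_samples)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def is_outage(current: int, baseline_samples: list[int], threshold: float = 0.05) -> bool:
    """Flag an outage when the shortfall reaches `threshold` (assumed value)."""
    return connection_shortfall(current, baseline_samples) >= threshold


# Hypothetical counts for the same time slot in three earlier weeks.
previous_weeks = [10000, 10200, 9800]

# Hypothetical current count that is 5% below the baseline average of 10000.
shortfall = connection_shortfall(9500, previous_weeks)
print(f"{shortfall:.0%} fewer connected units than usual")
```

Comparing against the same weekday and time slot, rather than a global average, keeps regular effects such as the weekend low from being mistaken for an outage.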