Issue with message processing

Incident Report for Messente

Postmortem

22.11.2024 incident post-mortem:

Issue: Network issues on the server provider's side caused a crucial component in message processing to become unresponsive. This led to most SMS messages not being forwarded to the next stage to be sent out.

Background: A contributing factor to this incident was a recent migration done ahead of the server provider's upcoming maintenance date. Connections that were previously evenly distributed across two servers were moved to a single server, because the other one was scheduled for maintenance. While this was not necessarily the cause of the issue, it increased the size of its impact.

Timeline:

00.00 UTC: Network latency issues caused a crucial component in the message processing to become unresponsive
05.25 UTC: The Messente team noticed the issue and started investigating
06.00 UTC: Team restarted the unresponsive component
06.09 UTC: New incoming messages were processed without further delays, and ‘stuck’ messages started to be sent
06.17 UTC: The queue was clear and all pending messages had been processed
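
To put the timeline in numbers, the short sketch below computes the time to detection and the time from detection to full recovery. The timestamps are copied from the timeline above; the dictionary keys and the helper name `minutes_between` are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the timeline above (22.11.2024, UTC).
# The key names are illustrative labels, not official incident terms.
TIMELINE = {
    "start": "00:00",
    "detected": "05:25",
    "restarted": "06:00",
    "processing_resumed": "06:09",
    "queue_clear": "06:17",
}


def minutes_between(earlier: str, later: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["start"], TIMELINE["detected"])
time_to_recover = minutes_between(TIMELINE["detected"], TIMELINE["queue_clear"])
total_duration = minutes_between(TIMELINE["start"], TIMELINE["queue_clear"])

print(f"Time to detect:  {time_to_detect} min")    # 325 min (5 h 25 min)
print(f"Time to recover: {time_to_recover} min")   # 52 min
print(f"Total duration:  {total_duration} min")    # 377 min (6 h 17 min)
```

Detection accounts for roughly 86% of the total duration (325 of 377 minutes), which is why the conclusions below focus on alerting and escalation rather than on the fix itself.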

Conclusions:
While the fix was found and implemented quickly, the time from the start of the incident to its detection was long because it began during night hours. This has further highlighted the need for improved alerting and internal escalation. We are now exploring our options to ensure that critical incidents like this one are escalated and fixed faster. We have also created a ticket to add health checks and improved monitoring to our components. In addition, we are looking into how to increase redundancy in certain components to reduce the chance of a similar incident in the future.
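
As an illustration of the kind of health check and escalation described above, here is a minimal sketch of a heartbeat monitor. It is not Messente's actual monitoring setup: the class name `HeartbeatMonitor`, the two thresholds, and the alert levels are all assumptions made for the example. The idea is that a component reports progress regularly, and prolonged silence first warns the team and then pages the on-call engineer, regardless of the time of day.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeartbeatMonitor:
    """Tracks when a component last reported progress and escalates if it goes quiet.

    All names and thresholds here are hypothetical, chosen for illustration.
    """
    warn_after_s: float   # silence before notifying the team channel
    page_after_s: float   # silence before paging the on-call engineer
    last_heartbeat: float = 0.0
    alerts: List[str] = field(default_factory=list)

    def beat(self, now: float) -> None:
        """Record that the component made progress at time `now` (seconds)."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return 'ok', 'warn', or 'page' depending on how long the silence has lasted."""
        silence = now - self.last_heartbeat
        if silence >= self.page_after_s:
            level = "page"
        elif silence >= self.warn_after_s:
            level = "warn"
        else:
            level = "ok"
        if level != "ok":
            # A real system would send a notification here; this sketch only records it.
            self.alerts.append(level)
        return level


# Usage: the component last made progress at t=0 and then goes silent.
monitor = HeartbeatMonitor(warn_after_s=60, page_after_s=300)
monitor.beat(now=0)
print(monitor.check(now=30))    # ok
print(monitor.check(now=120))   # warn
print(monitor.check(now=600))   # page
```

Passing the current time in explicitly keeps the sketch easy to test; a production check would read a real clock and deliver the alerts through an actual paging service.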

Thank you for your patience while we looked into this, and our apologies for any inconvenience!

Posted Nov 25, 2024 - 20:37 EET

Resolved

This incident has been resolved and message processing is back to normal.

We will update you with a detailed overview once it is available.

For any questions, please reach out to support@messente.com

Posted Nov 22, 2024 - 08:22 EET

Update

We are continuing to investigate this issue.

Posted Nov 22, 2024 - 08:20 EET

Investigating

We have discovered an issue with message processing. Our team is looking into it and we will update you as soon as possible.

Posted Nov 22, 2024 - 07:59 EET

This incident affected: Messaging (Omnichannel API).