Starting at 14:38 UTC on Friday, October 11th, we encountered SMS API processing delays. The cause was a number of erroneous messages, each consisting of 15,003 parts(!).
Since these messages could not be saved to the database due to internal field length limits, they were forwarded to a retry cycle, after which the message processing service stopped due to unexpected behaviour.
Our service monitoring framework restarted the message processing service, and it managed to process some messages before attempting the same erroneous message again, which caused the service to stop once more.
Service performance was therefore significantly affected: the service kept restarting and retrying these messages over and over, and the queue of valid messages slowly piled up until the processing delays became severe.
Once we became aware of the issue (9:50 UTC on Sunday, October 13th), we deployed a temporary workaround in the internal message processor that avoids the retry loop in this situation and skips processing those messages.
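As a rough illustration of the workaround's logic, the sketch below shows a guard that rejects messages whose part count exceeds the storable limit and diverts them to a dead-letter list instead of retrying. All names here (`MAX_MESSAGE_PARTS`, `Message`, `handle`, `dead_letter`) and the specific limit value are hypothetical, not our actual service code:

```python
from dataclasses import dataclass

# Assumed storage limit; the real database field limit may differ.
MAX_MESSAGE_PARTS = 255

@dataclass
class Message:
    id: str
    parts: int

# Messages set aside for manual inspection instead of being retried.
dead_letter: list[Message] = []

def handle(msg: Message) -> bool:
    """Process a message; return False if it was skipped as unprocessable."""
    if msg.parts > MAX_MESSAGE_PARTS:
        # An oversized message can never be saved, so retrying it would
        # only crash the processor again. Skip it and move on.
        dead_letter.append(msg)
        return False
    # ... normal processing / saving to the database would happen here ...
    return True
```

The key design point is distinguishing permanent failures (a message that can never fit the schema) from transient ones (a temporary outage): only the latter belong in a retry cycle.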
This is what we'll do to make sure it will not happen again: