Network issue at our CH provider
Updates
Date of incident: Dec 19th 2023
Duration of incident: 08:04 to ca. 09:50 (plus additional time taken to process queued mail)
Scope of incident: All customers of the CH instance of seppmail.cloud
Impact of incident: Inbound (and to a lesser degree outbound) messages delayed by varying amounts of time
Root Cause:
The major incident of Dec 19th 2023 was caused by a network health issue, which led to delays in database synchronisation. This delayed synchronisation triggered a cascade of effects within the database cluster itself and, as a consequence, on the processing of messages.
Actions taken during incident:
The delay in database synchronisation was temporarily addressed by a workaround (some processing was postponed), which re-enabled live mail processing earlier and with higher throughput. Full live processing was restored after the synchronisation on the database cluster was fixed. The postponed processing was completed throughout the day.
After the workaround was in place, the team worked to actively manage the queues to deliver messages as quickly as possible.
Learnings and areas of improvement:
- The cause of the network issue at the root of this incident is under review with our datacenter provider (Infrastructure-as-a-Service).
- The cross-site synchronisation of the database cluster will be stopped to minimise the “unit of failure” in case of an incident (planned to be implemented by the end of January 2024). Besides reducing the impact of a failure, this also frees up the resources used for synchronisation, which has benefits beyond incident situations.
- We have scheduled maintenance to improve our network before the year-end (see maintenance notifications on Statuspal). We will use this opportunity to also apply some detailed learnings from handling the incident.
- We are introducing more separation between cryptographic handling and content scanning (anti-spam etc.). Among other positive side effects, this gives us more flexibility to manage our mail flow. This change was already in the works before the present incident, together with other architectural changes to reduce latency in mail processing and increase resilience against isolated failures.
- As with any major incident, we identified improvements to our telemetry that will help us identify areas of degraded performance more quickly (e.g. avoiding repeat notifications that can obstruct the view of other relevant notifications); a minimal sketch of such de-duplication follows after this list.
- We communicate with our partners and customers during an incident (and for maintenance notifications) via https://seppmail.statuspal.eu/. This is an external service, which allows us to communicate even when our own infrastructure is affected. We strongly encourage all partners and customers to subscribe to these notifications, preferably via a channel not tied to email, in case mail flow is affected by an incident.
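To illustrate the notification de-duplication mentioned in the telemetry point above, here is a minimal sketch; the class, cooldown value, and alert key are illustrative assumptions, not part of our actual monitoring stack.

```python
# Minimal sketch: suppress repeat alerts for the same key within a cooldown
# window so they do not drown out other relevant notifications.
# All names and values here are illustrative assumptions.
import time


class AlertDeduplicator:
    def __init__(self, cooldown_seconds: float = 900.0):
        self.cooldown = cooldown_seconds
        self._last_sent: dict[str, float] = {}

    def should_notify(self, alert_key: str, now: float | None = None) -> bool:
        """Return True only if no alert with this key fired within the cooldown."""
        now = time.time() if now is None else now
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False  # suppress the repeat notification
        self._last_sent[alert_key] = now
        return True


dedup = AlertDeduplicator(cooldown_seconds=900)
for _ in range(3):
    if dedup.should_notify("db.write_latency.high"):
        print("notify on-call")  # fires only once per 15-minute window
```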
One OpenSearch database cluster member located in LS became very slow all of a sudden. As this member held the primaries for several important indices, write operations on those indices became very slow too, which in turn caused all applications writing to these indices to slow down and eventually time out.
This resulted in back-pressure on the mail flow processing.
Taking the affected cluster member out of the cluster took some time, as data from the affected node had to be transferred to the normally running nodes.
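For context, draining a degraded member from an OpenSearch cluster typically means excluding it from shard allocation and waiting for its shards to relocate to healthy nodes, which is why this step takes time. Below is a minimal sketch using the standard cluster settings and health APIs; the endpoint, credentials, and node name are hypothetical placeholders, not values from our environment.

```python
# Sketch: exclude a slow node from shard allocation, then wait until all
# shards have relocated to the remaining healthy nodes.
# Endpoint, credentials, and node name are hypothetical placeholders.
import time

import requests

OPENSEARCH = "https://opensearch.example.internal:9200"  # placeholder endpoint
AUTH = ("admin", "<password>")                           # placeholder credentials
SLOW_NODE = "data-node-ls-1"                             # placeholder node name

# Ask the cluster to move all shards away from the slow node.
requests.put(
    f"{OPENSEARCH}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": SLOW_NODE}},
    auth=AUTH,
    timeout=10,
).raise_for_status()

# Relocation copies the data held by the node (including primaries) to the
# other members, so the node can only be removed once this reaches zero.
while True:
    health = requests.get(f"{OPENSEARCH}/_cluster/health", auth=AUTH, timeout=10).json()
    print(f"status={health['status']} relocating_shards={health['relocating_shards']}")
    if health["relocating_shards"] == 0:
        break
    time.sleep(30)
```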
There is still a higher-than-normal number of messages in the queues, even though processing works as intended. This is usually caused by sending systems that deferred messages during the outage and are now retrying delivery.
We continue to monitor the situation, while we are working on further root cause analysis and planning preventive measures for the future.
We are currently actively managing the queues in order to ensure the highest possible throughput and to work around rate limits of recipient systems.
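As an illustration of this kind of queue management, the sketch below counts queued messages per recipient domain to spot destinations that are backing up, e.g. due to their rate limits. It assumes a Postfix-style MTA, where "postqueue -j" prints one JSON object per queued message (Postfix 3.1+); the commands and field names are tied to that assumption, not a statement about our platform.

```python
# Sketch, assuming a Postfix-style MTA: count queued messages per recipient
# domain to identify destinations that are backing up.
import json
import subprocess
from collections import Counter


def queued_per_domain() -> Counter:
    # "postqueue -j" (Postfix >= 3.1) prints one JSON object per queued message.
    out = subprocess.run(["postqueue", "-j"], capture_output=True, text=True, check=True)
    domains: Counter = Counter()
    for line in out.stdout.splitlines():
        msg = json.loads(line)
        for rcpt in msg.get("recipients", []):
            domains[rcpt["address"].rsplit("@", 1)[-1]] += 1
    return domains


if __name__ == "__main__":
    # Destinations with the largest backlog are candidates for rate-aware delivery.
    for domain, count in queued_per_domain().most_common(10):
        print(f"{count:6d}  {domain}")
```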
We see a lot of messages being processed; however, there is still a significant number of messages in the queues. Most newly arriving messages are delivered with small delays.
We see inbound and outbound mail delivery queues going down, and delivery rates continue to be high. We continue to monitor the situation to ensure we can intervene if any delivery problems should arise.
Our telemetry shows that the GINA web interface is experiencing somewhat degraded performance. This is due to the load caused by handling the queues, but we expect it to remain available and to return to the expected performance soon.
We continue to work on post-processing the logs so that they show up-to-date delivery information. The post-processing itself will take some time to run.
The investigation into the cause of events is still ongoing. As soon as we have more insight, we will provide an in-depth root cause analysis, including any preventive or other actions to avoid such a situation in the future.
The last remaining major obstacle caused by the problem, a database issue on one infrastructure component, has been resolved. Queued messages are now being processed at roughly the expected performance.
The log display in seppmail.cloud may remain delayed until the background processing has caught up, and the queue view in seppmail.cloud is currently experiencing some delay as well. We are working on these follow-up issues.
The rate of mails being processed in one site of our datacenter infrastructure is increasing, while the problem partially persists in the second site.
Currently there are about 10’000 messages in our queue, and we expect a similar number waiting to be delivered to our systems. We expect the queues to be processed gradually.
We are currently moving some functions to parts of the infrastructure which are less affected by the ongoing incident and are disabling some background handling in order to allow more messages to be processed.
The investigation is still ongoing.
The services are still performing below the expected level. We are working with the datacenter provider to identify the root cause and to implement a fix.
Unfortunately, no workaround is available at this time.
Network connectivity has been partially restored, and we are working on re-establishing services step by step. We are seeing some emails being processed successfully, but not yet at the expected level.
Our Swiss datacenter provider appears to have a network issue. We are investigating the extent of the outage and possible solutions. We expect delays in mail flow on the Swiss cloud and difficulties with most operations.