Database inconsistency
Updates
Post-Incident Report
- Incident duration: Feb 2nd 2026 15:20 to Feb 3rd 2026 00:30 (closed after observation period: Feb 3rd 2026 14:30)
- Incident scope: CH instance of seppmail.cloud
- Customer impact: ca. 2% of messages rejected with erroneous error messages between ca. Feb 2nd 2026 17:00 and Feb 3rd 2026 00:10
- Root cause: LDAP database corruption
- Incident status: Resolved
Root cause analysis
One LDAP database in the cluster experienced data corruption, likely before the actual start of the incident. The cause of the corruption lies within OpenLDAP in combination with the underlying operating system. At this point, “only” the redundancy was reduced; the overall functionality of both the cloud portal and the mailflow was not affected.
As per the disaster recovery guideline for seppmail.cloud, a rebuild of the database from the “healthy” cluster partner was started, which would typically finish within a few minutes and restore full redundancy. However, a few seconds into the recovery process, the previously healthy cluster partner started to synchronize data from the cluster partner being rebuilt. This resulted in inconsistent data overall, impacting the mailflow and other functionality, and some message queues started to build up.
It is unclear why the previously healthy cluster partner started to synchronize data from the database being rebuilt. This procedure is also used regularly, e.g. when adding or replacing a cluster partner, and this “back synchronization” effect has never occurred before. We currently suspect a timing issue that led to a state confusion in the internal OpenLDAP logic.
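For illustration, one way such divergence between cluster partners can be detected is by comparing the contextCSN values of the database suffix on both nodes; in a healthy syncrepl cluster these converge to the same set. The following is a minimal sketch using Python’s ldap3 library, not our production tooling; the hostnames, bind DN, password and suffix are hypothetical placeholders.

```python
# Minimal sketch: compare the contextCSN values of two OpenLDAP
# cluster partners to detect replication divergence.
# Hostnames, bind DN, password and suffix are hypothetical placeholders.
from ldap3 import Server, Connection, BASE

NODES = ["ldap-a.example.net", "ldap-b.example.net"]  # hypothetical
SUFFIX = "dc=example,dc=net"                          # hypothetical
BIND_DN = "cn=monitor,dc=example,dc=net"              # hypothetical
PASSWORD = "secret"                                   # hypothetical

def context_csns(host):
    """Read the (multi-valued) contextCSN of the suffix entry on one node."""
    conn = Connection(Server(host), user=BIND_DN, password=PASSWORD,
                      auto_bind=True)
    conn.search(SUFFIX, "(objectClass=*)", search_scope=BASE,
                attributes=["contextCSN"])
    values = frozenset(conn.entries[0]["contextCSN"].values)
    conn.unbind()
    return values

csns = {host: context_csns(host) for host in NODES}
if len(set(csns.values())) != 1:
    # Diverged (or still converging) replicas: alert for investigation.
    print(f"WARNING: contextCSN mismatch between cluster partners: {csns}")
else:
    print("contextCSN values match on all nodes.")
```

In a multi-provider setup there is one contextCSN value per server ID, hence the comparison of full value sets; a real check would also need to tolerate transient mismatches caused by normal replication lag.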
Service recovery
Following the failed rebuild, and again per the disaster recovery guideline, data was restored from the daily backup and regular snapshots. For a small number of user accounts and domains, additional data fixes had to be applied.
The next day (Feb 3rd 2026 at around 08:00) it was observed that communication to some HIN-enabled domains was still affected. After a forced refresh of the HIN domain list, this sub-issue was also resolved. The overall incident was resolved after an observation period on Feb 3rd 2026 at 14:30.
Mitigations for future incidents and learnings
- There is an ongoing project at SEPPmail to a) move the LDAP databases to a different operating system and b) move certain data out of LDAP (into a relational database). We are confident that with these projects completed, the incident would not have happened. We are thus expediting them.
- Since we cannot rule out that a future rebuild would have similar effects, we are adapting the disaster recovery guideline to add an explicit shield against the “back synchronization” during a rebuild (see the sketch after this list).
- We will add additional checks for data consistency (or rather: to detect data corruption), especially a “canary in the coal mine” type test, such as the contextCSN comparison sketched above, as an early indication of (potential) issues. We will also improve the monitoring in a few related areas for better and earlier detection.
- While the last three mitigations will be implemented in the next few days and weeks, we expect the first mitigation to take considerable (testing) time due to the wide-ranging effects of those changes.
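To illustrate the shield mentioned above: one possible approach is to temporarily remove the syncrepl consumer configuration of the healthy node via cn=config before the partner rebuild starts, and to restore it once the rebuild has completed. This is a minimal sketch only, assuming a standard OpenLDAP cn=config setup; the DNs and credentials are hypothetical placeholders, and the exact mechanism in the updated guideline may differ.

```python
# Minimal sketch: temporarily disable the syncrepl consumer on the
# healthy node so it cannot "back synchronize" from the partner
# database while that partner is being rebuilt.
# DNs and credentials are hypothetical placeholders.
from ldap3 import Server, Connection, BASE, MODIFY_ADD, MODIFY_DELETE

DB_DN = "olcDatabase={1}mdb,cn=config"  # hypothetical database DN
conn = Connection(Server("ldap-a.example.net"),  # the healthy node
                  user="cn=admin,cn=config", password="secret",
                  auto_bind=True)

# 1. Read and keep the current consumer stanza(s) for later restoration.
conn.search(DB_DN, "(objectClass=*)", search_scope=BASE,
            attributes=["olcSyncrepl"])
saved = list(conn.entries[0]["olcSyncrepl"].values)

# 2. Shield: remove the consumer configuration so this node stops
#    pulling changes for the duration of the rebuild.
conn.modify(DB_DN, {"olcSyncrepl": [(MODIFY_DELETE, [])]})

# ... rebuild of the cluster partner happens here ...

# 3. Restore the saved stanza(s) to re-enable replication.
conn.modify(DB_DN, {"olcSyncrepl": [(MODIFY_ADD, saved)]})
conn.unbind()
```

The essential point is that the consumer side of the healthy node is explicitly disabled before the rebuild and only re-enabled afterwards, so a state confusion in the node being rebuilt can no longer trigger a reverse synchronization.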
We sincerely apologise for the inconvenience caused to users, customers and partners by this incident.
After the affected individual users had been identified and their accounts fixed, and after no new cases were reported by partners and customers, we consider the immediate issue leading to this incident to be resolved.
We will continue to closely monitor the situation and are following up with further analysis into the root cause and improvements to avoid such a scenario in the future.
A post-incident report will be published here in the coming days, depending on the further investigation progress.
A small number of domains had continuing issues this morning, which could only be found once users in these domains started to send mails again. The effects were error notifications with “could not apply policy” or “rule engine reject”. These domains have now been fixed and we can confirm successful handling and delivery of messages.
We are still aware of a number of individual users who may have continuing issues. We are actively identifying and fixing such cases as and when we become aware of them.
Again, we sincerely apologise for the inconvenience caused.
Note: The incident was de-escalated from major to minor in one of yesterday’s updates. It has now been set back to major.
We noticed that for ca. 3% of users some settings have not yet been restored. These settings are currently being restored.
Mail queues are mostly back to normal, with a number of messages, mostly error notifications, still pending delivery.
The assessment of the user impact has not been finalized yet. However, we are aware that some messages have been rejected with the error message “555 User does not exist and ‘user_management’ is not set to automatic”. Unfortunately, affected senders will need to resend such messages.
We apologise for the inconvenience this incident has caused and will provide a more thorough post-incident report in the coming days.
We continue to monitor the situation and will provide the next update by tomorrow, Feb 3rd 2026, 10:00 CET.
Most of the data restore is done; some additional items should be completed in the next 15 to 20 minutes. We will then perform additional consistency checks on the data.
We see decreasing mail queues and will continue to monitor the situation.
The database inconsistency is mostly fixed; however, an update is still running to restore individual user settings to their original state.
We see that the mailflow is resuming with some speed; however, there are still quite a few error notifications in the queues which need to be cleaned out.
We see that messages are being processed, however with some delay. We are investigating the root cause and working towards a temporary or final fix.
Further, we are investigating possible negative side effects on messages which have already been processed.
Login to the cloud portal should be possible; however, some features may be missing or not shown.
We are currently investigating an issue of possible database inconsistency which partially impacts mailflow and portal login in the CH instance.