Incident window: 02:39 p.m. -> 03:03 p.m.
Cause: internal maintenance error at our hosting provider
- A configuration change made by our hosting provider to decommission an internal server used for a datacenter migration caused a routing error, rendering our site unreachable. Our monitoring detected the outage immediately, and the configuration change was reverted.
Incident window: 10:20 a.m. -> 10:27 a.m.
Cause: internal maintenance operation at our hosting provider
- Due to a network configuration change on our reverse proxies (required for a maintenance operation), the Trustelem service was unavailable for a few minutes
Incident window: 10:41 a.m. -> 02:17 p.m.
Cause: a preliminary analysis points to an issue with the nginx configuration, which is managed by our hosting provider: the number of worker connections was exhausted
- Degraded service with internal errors (HTTP error 500).
Handling the incident: after identifying the cause, we restarted the service to decrease the number of worker connections in use. We will follow up with our hosting provider as soon as possible to see how this limit can be raised.
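For context, the limit in question lives in the nginx `events` block. A minimal sketch of the relevant directive; the value shown is nginx's stock default, not our provider's actual setting:

```nginx
events {
    # Maximum simultaneous connections per worker process.
    # Once worker_processes * worker_connections is reached,
    # new connections are refused or fail with errors.
    worker_connections 1024;
}
```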
Incident window: 12:00 a.m. -> 02:30 p.m.
Cause: the iOS push certificate had expired.
- Authentication through push notifications on iOS was not working
Handling the incident: after identifying the cause, the certificate was renewed, fixing the problem.
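Expiry incidents of this kind can be caught ahead of time by alerting on the certificate's notAfter date. A minimal sketch, assuming the expiry string has already been extracted (for example with `openssl x509 -enddate -noout -in <cert>`); the date and threshold below are illustrative placeholders, not our actual values:

```python
from datetime import datetime, timezone

# Hypothetical notAfter value, in the format printed by `openssl x509 -enddate`
not_after = "Jun 1 12:00:00 2030 GMT"

expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
days_left = (expiry - datetime.now(timezone.utc)).days

# Alert well before the push certificate actually lapses
if days_left < 30:
    print(f"Push certificate expires in {days_left} days -- renew now")
```

Run on a schedule, a check like this turns a hard outage into a routine renewal task.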
Incident window: 10:17 a.m. -> 10:30 a.m.
Cause: file descriptor exhaustion.
- Partial internal error failures on login pages
Handling the incident: our primary production server ran out of file descriptors, causing partial failures on connections. Our monitoring detected these failures immediately, and a restart of the service instantly resolved the instability. Our watchdog process correctly detected the issue but was not designed to report detailed information on file descriptor usage, so we are improving our monitoring tools to be able to identify the root cause of any similar issue in the future.
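As an illustration of the kind of visibility we are adding, a minimal sketch of a file descriptor usage probe, assuming a Linux host (it reads /proc/self/fd); the names and the 80% threshold are ours for this example, not those of the production watchdog:

```python
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux only)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft_limit

used, limit = fd_usage()
if used > 0.8 * limit:
    print(f"File descriptor usage high: {used}/{limit}")
```

Exposing these two numbers to the monitoring system lets an alert fire before exhaustion, instead of after connections start failing.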
Incident window: 11:47 a.m. -> 01:00 p.m.
Cause: malfunction of the production HTTP outbound proxy, following a configuration problem at our hosting service provider during a migration. Our hosting service provider reverted the configuration change.
- MFA authentication via WALLIX Authenticator (push notification) was impossible
- MFA authentication via SMS was impossible
- Authentication with Azure AD was impossible
Handling the incident: the problem was detected within a few minutes and resolved as quickly as possible with our hosting service provider.
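A simple reachability probe for the outbound dependencies (push notification service, SMS gateway, Azure AD endpoints), run both directly and through the outbound proxy, can narrow this kind of diagnosis quickly. A minimal sketch using Python's standard library; the proxy URL and endpoint in the usage comments are placeholders:

```python
import urllib.error
import urllib.request

def probe(url, proxy_url=None, timeout=5):
    """Return True if `url` answers, optionally through an HTTP outbound proxy."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    try:
        opener.open(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # the server responded, even if with an error status
    except OSError:
        return False

# Usage (placeholder endpoint and proxy):
# probe("https://login.microsoftonline.com")
# probe("https://login.microsoftonline.com", proxy_url="http://proxy.internal:3128")
```

When the direct probe succeeds but the proxied one fails, the fault is isolated to the outbound proxy rather than the dependency itself.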