Incident Report: August 20 Slowdown/Outage in EU Region

On August 20, the commercebuild platform (V4) in our EU region experienced approximately 20 minutes of downtime, during which requests returned HTTP 500 errors.

As always, the stability of our platform is very important to us, and we review every incident that adversely impacts our customers to ensure that we are taking all possible steps to prevent such incidents from occurring in the future.

We apologize for the inconvenience that this incident caused.

What Happened

Database queries against the promotions table drove database CPU utilization to its limit. This, in turn, left the server unable to process requests.

Impact

A very limited number of web stores experienced a mixture of degraded performance (10-second page loads) and complete outage.

Timeline

06:50 UTC - We became aware of increasing traffic on one of our sites in the EU region, which degraded performance. CPU utilization reached 100 percent, and the PHP node reached 90 percent.

07:00 UTC - The traffic peaked at about 20 page loads per second.

07:10 UTC - The Systems team received monitor alerts, and the servers were rebooted.

07:30 UTC - Even after the reboots, performance remained poor. The mean page response time was ~13 seconds.

07:45 UTC - The team investigated the issue and identified the DB queries which were responsible for maxing out the resources. 

07:50 UTC - The queries were tuned and DB resources were increased, which reduced the mean response time to 0.8 seconds per page load.

07:54 UTC - The team confirmed that DB backups were in place and began monitoring the traffic.

09:30 UTC - The Systems team confirmed that traffic had tapered down to normal levels and that the incident was resolved.

Future Prevention

  • Our Systems team aims to optimize the database queries that support our Promotions module.
  • A plan is underway to increase available memory nodes specifically for the EU region.
  • More descriptive notes will be added to PagerDuty alerts so the team is notified with relevant information.
  • There is a long-term plan to move the PHP container into Kubernetes, so it can scale automatically and proactively resolve any issues that may occur.
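
As an illustration of the PagerDuty item above, the following is a minimal sketch of what a more descriptive alert might look like, assuming alerts are sent as PagerDuty Events API v2 "trigger" events. The function name build_alert_event, the routing key, the source host name, and the keys under custom_details are hypothetical and are not taken from our actual alerting setup; the figures are the ones reported in the timeline. The sketch only builds the event body and does not send it.

```python
import json

# PagerDuty Events API v2 endpoint, shown for reference only.
# This sketch builds the event body and never performs a network call.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_alert_event(routing_key, summary, source, severity, details):
    """Build a 'trigger' event whose custom_details carry diagnostic context.

    Hypothetical helper: the name and argument list are assumptions
    made for this illustration.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            # Free-form context that is displayed alongside the alert.
            "custom_details": details,
        },
    }


event = build_alert_event(
    routing_key="HYPOTHETICAL_ROUTING_KEY",  # placeholder, not a real key
    summary="DB CPU at 100 percent in EU region",
    source="eu-db-primary",  # hypothetical host name
    severity="critical",
    details={
        # Hypothetical keys; values mirror the timeline in this report.
        "region": "EU",
        "cpu_percent": 100,
        "php_node_percent": 90,
        "mean_page_response_seconds": 13,
        "suspected_table": "promotions",
    },
)

print(json.dumps(event, indent=2))
```

Carrying the region, resource figures, and suspected table in the alert itself would let the on-call engineer start from the likely cause instead of rediscovering it during the incident.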