Incident Report: July 28 Order Queueing and SSL Connection Errors in AU and EU Regions

On July 28, the commercebuild platform (V4) experienced delays in posting orders from web stores to client ERP systems in both our AU and EU regions. Additionally, web stores in the AU region experienced SSL connection issues.

As always, the stability of our platform is very important to us, and we review every incident that adversely impacts our customers to ensure that we are taking all possible steps to prevent such incidents from occurring in the future.

We apologize for the inconvenience that this incident caused.

What Happened

The containers that manage order processing services for the AU and EU regions were not running. In the EU region, the container was restarted without a rebuild. In the AU region, the container had to be rebuilt before the service could start. At this time, it is not clear why these containers were not running.

Moreover, the commercebuild Systems team had recently implemented infrastructure modifications in the AU and EU regions. In the AU region, once the container was rebuilt and deployed, these modifications prevented an SSL certificate from being reissued.

Once the infrastructure modifications were rolled back in the AU region, the SSL certificate was reissued. Given the incident in the AU region, the infrastructure changes were also rolled back in the EU region as a precaution.
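
For readers unfamiliar with this class of error: a browser reports a common-name mismatch when none of the DNS names on the served certificate covers the requested hostname. The following is a minimal Python sketch of that comparison, not the check our platform or any browser actually runs. The function name and the example hostnames are hypothetical, and the wildcard handling is a simplification of the matching rules browsers apply.

```python
def hostname_covered(hostname: str, dns_names: list[str]) -> bool:
    """Return True if any certificate DNS name covers the hostname.

    Simplified matching: exact match, or a wildcard of the form
    "*.example.com" that stands in for exactly one leftmost label.
    """
    host = hostname.lower().rstrip(".")
    for name in dns_names:
        pattern = name.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            # Drop the leftmost label of the hostname and compare the rest
            # with the part of the pattern after "*.".
            _, _, host_rest = host.partition(".")
            if host_rest and host_rest == pattern[2:]:
                return True
    return False


# Hypothetical example: a certificate issued for a different name
# does not cover the web store's hostname.
print(hostname_covered("shop.example.com", ["other.example.net"]))
```

A site whose certificate was not reissued for its own hostname would fail a comparison of this kind, which is the condition reported as NET::ERR_CERT_COMMON_NAME_INVALID.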

Impact

A very limited number of web stores experienced order processing issues. In addition, several web stores in the AU region were affected by SSL connection errors, which prevented web store users from accessing the affected sites for approximately 10 minutes. The EU region was not impacted by SSL connection issues.

Timeline

13:16 UTC - We became aware of an issue affecting the service that queues web store orders to be posted to client ERP systems. The degradation of this service impacted a very limited number of web stores in both the AU and EU regions.

15:10 UTC - After investigation, the issue was escalated to the Systems team for further review. 

22:24 UTC - The Systems team completed their investigation and determined that the containers hosting the order queue service were not running. In the EU region, the container was restarted without a rebuild, whereas in the AU region it was necessary to rebuild the container.

22:39 UTC - We were informed of SSL connection issues (NET::ERR_CERT_COMMON_NAME_INVALID) affecting several web stores in the AU region. After the container in the AU region was rebuilt, the recent infrastructure modifications prevented an SSL certificate from being reissued. The Systems team began to revert the infrastructure changes so that SSL certificates could be reissued.

22:49 UTC - The Systems team confirmed that the reversion of AU infrastructure modifications was completed. They commenced the reversion of EU infrastructure modifications as a precaution. 

22:56 UTC - The Systems team confirmed that the reversion of EU infrastructure modifications was completed.

22:56 UTC - The incident was resolved.

Future Prevention

Our Systems team aims to deploy modifications to our production environment only after thorough review and testing. In this case, however, the infrastructure modifications could only be tested during deployment to production. In the future, we will refine this process so that such changes can be tested in a staging environment that closely resembles our production environments.

Additionally, we will improve monitoring of our Docker containers to proactively resolve any issues that may occur.
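
As an illustration of the kind of container check described above, here is a minimal Python sketch. It is not our monitoring implementation. The container names are hypothetical, and the sketch assumes the Docker CLI is installed and supports the State placeholder in the docker ps --format option.

```python
import subprocess


def parse_container_states(output: str) -> dict[str, str]:
    """Parse lines of "name<TAB>state" into a {name: state} mapping."""
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, state = line.partition("\t")
        states[name.strip()] = state.strip()
    return states


def find_stopped(states: dict[str, str], expected: list[str]) -> list[str]:
    """Return the expected containers that are missing or not running."""
    return sorted(name for name in expected if states.get(name) != "running")


def list_container_states() -> dict[str, str]:
    """Ask the local Docker CLI for every container and its state."""
    result = subprocess.run(
        ["docker", "ps", "--all", "--format", "{{.Names}}\t{{.State}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_container_states(result.stdout)


if __name__ == "__main__":
    # Hypothetical container names; substitute the real ones.
    expected = ["order-queue-au", "order-queue-eu"]
    stopped = find_stopped(list_container_states(), expected)
    if stopped:
        print("Not running: " + ", ".join(stopped))
```

Run on a schedule and wired to an alert, a check like this would flag a stopped order queue container soon after it stops.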