June 13, 15:30 UTC –
The platform-wide issue, which occurred on June 8 at approximately 18:30 UTC, was brought under control by 19:30 UTC for all our web stores. Since then, our Systems team has run tests to determine the cause of the issue.
After a thorough analysis, we determined that the issue was caused by the removal of one Elasticsearch node from the US cluster. The node was removed because it was underutilized; however, once its load was redistributed, Input/Output operations per second on the remaining nodes doubled, pushing them past their maximum thresholds.
The Systems team has reverted this change and is in the process of upgrading disk performance. They have also set up improved monitoring of the Elasticsearch cluster, with advance notification when disk usage approaches its thresholds.
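For context, the sketch below shows the kind of disk-usage check that this monitoring relies on, polling the cluster's node stats API and flagging nodes above a threshold. The cluster address, threshold value, and alert action are illustrative placeholders, not our production configuration.

```python
# Minimal sketch of a disk-threshold check against an Elasticsearch cluster.
# The URL, threshold, and alert action below are assumptions for illustration.
import requests

ES_URL = "http://localhost:9200"   # assumed cluster address
DISK_USED_THRESHOLD = 0.80         # assumed warning threshold (80% of disk used)

def check_disk_usage():
    """Poll per-node filesystem stats and flag nodes above the threshold."""
    stats = requests.get(f"{ES_URL}/_nodes/stats/fs", timeout=10).json()
    for node_id, node in stats["nodes"].items():
        total = node["fs"]["total"]["total_in_bytes"]
        available = node["fs"]["total"]["available_in_bytes"]
        used_fraction = 1 - available / total
        if used_fraction >= DISK_USED_THRESHOLD:
            # In practice this would page the on-call engineer rather than print.
            print(f"WARNING: node {node['name']} disk {used_fraction:.0%} used")

if __name__ == "__main__":
    check_disk_usage()
```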
We apologize for the inconvenience this issue has caused, and we thank you for your patience while it was resolved.
19:39 UTC – Our internal monitoring tools are reporting fewer errors, and we believe the affected services have stabilized. Further investigation is required to determine the cause of this issue; we will share more information here as soon as it is available.
19:09 UTC – Our Systems team continues to investigate the issue. We appreciate your continued patience.
18:39 UTC – We are aware of issues with the platform and are investigating them. As soon as we have more information, we will update this post.