Database Outage
Incident Report for

Today we had an unexpected performance outage of our main production database that runs the envisage platform, this caused all our services to go down including the web interface and API.

The issue was traced to a database query taking up to 90s to complete. On investigation, this query previously was taking less than 75ms to complete on average.

We initially deployed some changes to the code to exclude this query from execution while we investigated the issue, unfortunately, other queries started taking 30+ seconds to execute and caused the web application to become unresponsive and start timing out.

Due to this, we took all web servers off line while we further investigated.

One thing that immediately jumped out at us was that the production database was doing a parallel index scan, while running the query on our laptops on a copy of the production data was using a sequential table scan. We thought this might be because the production database was still running on Magnetic disk instead of SSD storage. This could have been causing the query planner to incrorrectly predict the cost a sequential table scan to be higher than an index scan.

We took a backup and upgraded the production DB to SSD storage, unfortunately this did not solve the problem with the query speed improving due to it now having SSD storage, but still taking 80+s to perofrm, completely unacceptable.

After further investigation we found was that if we disabled parralell index scan workers in the system temporarily through setting SET max_parallel_workers_per_gather = 0; the query speed improved from 80s to less than 75ms.

This was the breakthrough that opened the door to a handling. We saw that our existing database was running on PostgreSQL version 10 which introduced parallel workers and that in the version 11 release notes of PostgreSQL it was mentioned that there were improvements in the query planner with regards to using parallel workers. As the site was down, we investigated upgrading the PostgreSQL version to version 11.

We had already been running our staging environments on version 11 for some months and were confident of the upgrade working as planned, so we opted to bring this forward and upgrade the database.

Once the database was upgraded to version 11, the query went back to taking 75ms on average to run. At this point we enabled the web, API and background workers for envisage and tested and found service to be restored.

We apoligise for the outage, however from what we can tell we hit some arbitrary query planning limit on the table size that caused it. We will investigate further to see how this sort of outage can be prevented in the future and have upgraded the instance power and IOPS allocation of our database server in the mean time.

Posted Mar 14, 2020 - 03:13 EDT

Confirmed database is now performing per expectation. All systems operational.
Posted Mar 14, 2020 - 02:02 EDT
We have deployed the update the database server and all seems to be working now, will monitor further.
Posted Mar 14, 2020 - 01:39 EDT
We have finished the upgrade on the database instance, have found a further reason behind performance due to parallel_workers in database query planner. Upgrading from v10 to v11 of PostgreSQL to take advantage of parallel query performance improvements.
Posted Mar 14, 2020 - 01:06 EDT
Issue has returned, we have taken the servers offline to perform an upgrade on our database server. No loss of data has occurred.
Posted Mar 13, 2020 - 18:30 EDT
We have pushed an optimisation fix to production, service restored, performance degraded and we are monitoring.
Posted Mar 13, 2020 - 16:30 EDT
We have identified the problem, and working to upgrade the database instance to fix the issue.
Posted Mar 13, 2020 - 15:30 EDT
We are experiencing an issue with our database server having exceptionally long query times.
Posted Mar 13, 2020 - 14:30 EDT
This incident affected: API, Background Workers, Web Site, and Database.