Today we had an unexpected performance outage of the main production database that runs the envisage platform. This caused all of our services to go down, including the web interface and the API.
The issue was traced to a database query taking up to 90 seconds to complete. On investigation, we found that this query had previously completed in less than 75ms on average.
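For context, this kind of regression is typically diagnosed with PostgreSQL's EXPLAIN, which shows the plan the planner chose along with actual execution timings. A minimal sketch (the table and columns here are illustrative placeholders, not our actual schema):

```sql
-- Profile the slow query: ANALYZE runs it and reports real timings,
-- BUFFERS shows how much data was read from cache vs disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT account_id, SUM(amount)
FROM events
WHERE created_at > now() - interval '30 days'
GROUP BY account_id;
```

Comparing this output between environments is what let us spot the plan difference described below.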
We initially deployed a code change to exclude this query from execution while we investigated. Unfortunately, other queries then started taking 30+ seconds to execute, which caused the web application to become unresponsive and start timing out.
Due to this, we took all web servers offline while we investigated further.
One thing that immediately jumped out at us was that the production database was doing a parallel index scan, while the same query run on our laptops against a copy of the production data used a sequential table scan. We thought this might be because the production database was still running on magnetic disk rather than SSD storage, which could have caused the query planner to incorrectly predict the cost of a sequential table scan as higher than that of an index scan.
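This hypothesis rests on the planner's cost constants: by default, PostgreSQL assumes random page reads cost roughly four times as much as sequential ones, an assumption tuned for magnetic disks. A common adjustment on SSD-backed databases, sketched here with an illustrative value rather than a setting we necessarily applied, looks like:

```sql
-- random_page_cost defaults to 4.0, reflecting magnetic-disk seek
-- costs. On SSDs a value near 1.0 is common, so the planner prices
-- index scans (random I/O) more realistically against table scans.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();  -- apply the change without a restart
```

With mismatched cost constants, two machines with identical data can legitimately produce different plans, which is consistent with what we observed.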
We took a backup and upgraded the production DB to SSD storage. Unfortunately, this did not solve the problem: the query sped up somewhat thanks to the SSD storage, but still took 80+ seconds to run, which was completely unacceptable.
After further investigation, we found that if we temporarily disabled parallel index scan workers by setting
SET max_parallel_workers_per_gather = 0; the query time improved from 80+ seconds to less than 75ms.
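This setting is session-local when issued with SET, so it was a safe way to test the theory without affecting other connections. A sketch of how the change can be confirmed:

```sql
-- Disable parallel plans for the current session only.
SET max_parallel_workers_per_gather = 0;
SHOW max_parallel_workers_per_gather;  -- confirms the new value (0)
-- Re-running EXPLAIN on the affected query should now show a
-- non-parallel plan: the "Gather" node disappears from the output.
```

Seeing the query drop back to its historical timing with parallelism off pointed squarely at the parallel plan, not the data or the hardware, as the culprit.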
This was the breakthrough that opened the door to a fix. We saw that our existing database was running PostgreSQL version 10, which introduced parallel workers, and that the PostgreSQL 11 release notes mentioned improvements to the query planner's use of parallel workers. As the site was already down, we investigated upgrading PostgreSQL to version 11.
We had already been running our staging environments on version 11 for some months and were confident the upgrade would work as planned, so we opted to bring it forward and upgrade the database.
Once the database was upgraded to version 11, the query went back to taking 75ms on average. At that point we re-enabled the web, API and background workers for envisage, tested, and found service to be restored.
We apologise for the outage. From what we can tell, the table grew past a size threshold at which the version 10 query planner began choosing a poorly performing parallel plan. We will investigate further to see how this sort of outage can be prevented in the future, and in the meantime we have upgraded the instance size and IOPS allocation of our database server.
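For readers wanting to guard against the same class of problem, PostgreSQL exposes the thresholds and cost parameters that gate parallel plans. The values below are the documented defaults, shown as a sketch of the knobs involved rather than settings we have committed to:

```sql
-- Thresholds and costs that gate parallel plans (PostgreSQL defaults).
-- Raising the scan-size thresholds makes the planner less eager to
-- parallelise scans on mid-sized tables.
SET min_parallel_table_scan_size = '8MB';    -- default
SET min_parallel_index_scan_size = '512kB';  -- default
SET parallel_setup_cost = 1000;              -- default
SET parallel_tuple_cost = 0.1;               -- default
```

Tuning these, alongside keeping planner statistics fresh with ANALYZE, is one avenue we will explore as part of the follow-up investigation.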