To keep pace with the growing amount of data and users in Lokalise, we constantly work on scaling resources for the application. After a routine operational change that had previously been executed multiple times and tested successfully on our staging environment, the Elasticsearch cluster that powers many Lokalise features suddenly became overloaded.
As more users came online, the service began struggling with the load, leading to increased latency and general slowness across the Lokalise application. We turned off filters, search, and statistics to keep the application performant in a limited mode while we continued working to resolve the issue.
The source of the issue was established quickly; however, full performance restoration took more than an hour before we could re-enable all functionality, because the Elasticsearch index that had to be relocated was very large. The root cause of the incident was an incorrect estimation of the resources required for scaling the backend service. This was unexpected because the metrics we had in place did not reveal the full extent of the service's actual load.
We apologize for the inconvenience and frustration this downtime caused our customers. Our team takes this incident seriously and is committed to taking all necessary measures to prevent similar incidents from occurring in the future. We appreciate your patience and understanding, and we will continue to work diligently to improve our system's performance and reliability for you.