Summary:

Early in the afternoon of September 30th, 2020, the web server hosting our WordPress site went down for about an hour. Due to a typo in a configuration file, the LAMP stack was broken and users received a 500 Internal Server Error. This affected every user of the website, as no content could be served during the outage. The root cause boiled down to an extra "p": “wp-locale.phpp” instead of “wp-locale.php” in a single configuration file deployed by one of the developers working on the site.

Timeline:

1:20pm: Received notification via email that users were unable to reach the website.

1:22pm: Discovered the internal server error upon attempting to curl the webpage from within the web server.

1:24pm: Ran ps -auxf to find the web server's current process: PID 380, the apache2 process running as www-data. Ran strace -p 380 2> atrace in one screen while curling localhost in another screen.

1:32pm: Discovered a file-not-found error during a stat syscall; the requested filename contained “.phpp”.

1:47pm: Grepped for “.phpp” in a variety of configuration files. Found a match in /var/www/html/wp-settings.php on line 137.

1:56pm: Wrote and applied a Puppet manifest to fix the line in the configuration file.

2:02pm: curl -sI localhost:80 returned 200 again.
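
The 1:24pm and 1:32pm steps involved reading through raw strace output by hand. As a rough sketch of how that search could be narrowed, the function below prints only the paths that failed with ENOENT in a trace file. The function name missing_paths is hypothetical, and the sketch assumes the trace uses strace's default line format, where the path appears as a quoted string and the error name appears after the return value.

```shell
# Print the distinct quoted paths from strace lines that failed with ENOENT.
# Usage: missing_paths TRACE_FILE
missing_paths() {
    # Keep only lines reporting "No such file or directory" (ENOENT),
    # pull out the quoted string arguments, strip the quotes, and de-duplicate.
    grep -F 'ENOENT' "$1" | grep -oE '"[^"]+"' | tr -d '"' | sort -u
}
```

With the trace file from the timeline, this would be called as missing_paths atrace. Any path ending in “.phpp” would then stand out in a much shorter list.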

Root cause and resolution:

Apache was unable to load a PHP file that WordPress needed: line 137 of /var/www/html/wp-settings.php referenced a filename ending in “.phpp”, which does not exist. As a result, the server could not respond successfully to content requests, and users received the 500 error. Figuring this out was harder than it sounds. Initially, I came up with a variety of possible culprits.

After some tedious perusing of strace logs, I found the failed syscall, which led to a search for the misconfigured file. Once that file was found, a Puppet manifest was quickly written and just as quickly applied to the server to fix the error.
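
For illustration, the search and the fix can be sketched as two small shell functions. This is not the Puppet manifest that was actually applied; it is an assumed shell equivalent of what such a manifest might execute, and the names find_typo and fix_typo are hypothetical. Note the -F flag: without it, grep treats the dot in “.phpp” as a wildcard matching any character.

```shell
# List every line under DIR that contains the literal text ".phpp".
# -r recurses, -n prints line numbers, -F disables regex interpretation.
find_typo() {
    grep -rnF '.phpp' "$1"
}

# Replace ".phpp" with ".php" in FILE, keeping the original as FILE.bak.
fix_typo() {
    sed -i.bak 's/\.phpp/.php/g' "$1"
}
```

For example, find_typo /var/www/html would report matches with their line numbers, and fix_typo /var/www/html/wp-settings.php would correct that file while leaving a backup next to it.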

Corrective and preventative measures:

All new configuration files should be tested in a virtual machine or a container before being pushed to an operational server. Furthermore, the software developer who pushed the new configuration file should have taken responsibility for testing the website after deploying the change. It should have been this developer who notified me, long before users started sending emails. Basically, developers should not touch live servers without first testing and ensuring their work will… work!
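
As one example of a test that could run before deployment, the sketch below checks that the files named in require and include statements actually exist. It is a rough heuristic rather than a full PHP parser: it assumes each target appears as a single-quoted literal on the same line as the statement, contains no spaces, and is resolved relative to one directory. The function name check_requires and both of its parameters are hypothetical.

```shell
# Report require/include targets in PHP_FILE that are missing from INCLUDE_DIR.
# Usage: check_requires PHP_FILE INCLUDE_DIR
# Returns 0 when every target exists, 1 otherwise.
check_requires() {
    php_file=$1
    include_dir=$2
    status=0
    # Take lines starting with require/include (optionally _once),
    # extract the single-quoted names, and strip the quotes.
    for name in $(grep -E '^[[:space:]]*(require|include)(_once)?' "$php_file" \
                  | grep -oE "'[^']+'" | tr -d "'"); do
        # Drop a leading slash so the name resolves under INCLUDE_DIR.
        if [ ! -f "$include_dir/${name#/}" ]; then
            echo "missing: $name"
            status=1
        fi
    done
    return $status
}
```

A deployment script could refuse to continue when this returns non-zero. A typo like the one in this incident would be reported as a missing file before it ever reached the live server.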

All users are now able to view the site once again without any problems. The team has implemented a new monitoring service on the server to alert staff of any outages or metrics beyond certain thresholds. Response and recovery should be much quicker in the future thanks to this new monitor. Furthermore, policy has changed to require testing before deployment, so we won’t be hearing anything more about internal server errors.
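
The monitoring service itself is not described here, so the following is only a minimal sketch of the kind of check such a monitor performs, not the tool the team installed. It reuses the curl probe from the timeline; the function names classify_status and check_site and the 10-second timeout are assumptions.

```shell
# Map an HTTP status code to a coarse health state.
classify_status() {
    case "$1" in
        200) echo "ok" ;;
        000) echo "unreachable" ;;   # curl prints 000 when no response arrived
        *)   echo "error" ;;
    esac
}

# Probe URL once; print an alert line and return non-zero unless it answers 200.
check_site() {
    url=$1
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    state=$(classify_status "$code")
    if [ "$state" != "ok" ]; then
        echo "ALERT: $url is $state (HTTP $code)"
        return 1
    fi
}
```

Run on a schedule, for example from cron with check_site http://localhost:80, the alert line could be forwarded to whatever notification channel staff already watch.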

http://ianculp.tech

http://github.com/icculp

http://twitter.com/IanCSU