Case Study
Taking the keys to a struggling legacy system requires a methodical approach. First, understand the environment. It was a traditional LNMP (Linux, Nginx, MySQL, PHP) stack. Nothing inherently wrong with it, but fragile under pressure.
Running a simple top command immediately revealed the culprit.
The server wasn't just working; it was thrashing. Dozens of php-fpm processes were maxing out the CPU, locking it at 100% for extended periods. This continuous high load was rapidly depleting the EC2 instance's CPU credits. When the credits vanished, the server entered a state of suspended animation—a "fake death" that eventually spiraled into a hard crash.
The access logs confirmed my suspicion. We were dealing with aggressive malicious crawlers and bad bots firing concurrent requests at an unsustainable rate.
The immediate task was clear: stop the bleeding. But the method of intervention would determine whether this was a temporary patch or a permanent cure.
What's the call?