How our webserver performed under a really heavy load
As folks noticed on Thursday morning when we opened hotels, the website displayed errors for 5-10 minutes starting at 9 AM EST under the rush of incoming traffic. A few factors contributed to this, and I'd like to go into them in technical detail below.
First, some raw stats:
Peak bandwidth: approx 3.6 Megabits/sec
Peak connections: 1,060 concurrent connections
Number of users logged into the site: 100+
Here are some graphs that show just how big those numbers are, compared to normal traffic levels:
For those who saw last year's post about the load on the webserver when we opened up hotels, this year's traffic was about double last year's. I didn't see that coming.
As is plainly visible in the traffic graph, the machine that this website runs on is capable of much higher bandwidth throughput. So, what happened?
In a word: caching. Or rather, the lack thereof in certain cases.
Normally, Drupal has a pretty good caching system. But I went beyond that and installed the cacherouter module so that we could cache pages on the filesystem, thus bypassing the database entirely. Normally, this works great. And it worked great last year.
What was different this year? The fact that 100+ users were logged into the site when the hotel pages were published, something I didn't really expect. This became an issue because logged-in users don't see cached pages. They see pages pulled from the database, with the latest and greatest information loaded. This makes sense: if you write a comment on a forum post while logged in, you want to see that comment appear right away, not a stale cached copy of the page.
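In rough Python terms (a minimal sketch of the general mechanism, not Drupal's or cacherouter's actual code; `serve_page`, `CACHE_DIR`, and `render_from_database` are all hypothetical names), the logic looks something like this:

```python
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # stand-in for cacherouter's filesystem cache directory

def render_from_database(path):
    # Stand-in for a full Drupal bootstrap + database-driven page build.
    return f"<html>fresh copy of {path}</html>"

def serve_page(path, user_logged_in):
    """Serve a cached copy to anonymous visitors; logged-in users
    always get a fresh render pulled from the database."""
    cache_file = os.path.join(CACHE_DIR, path.strip("/").replace("/", "_") or "front")
    if not user_logged_in and os.path.exists(cache_file):
        with open(cache_file) as f:
            return f.read()              # fast path: no database touched
    html = render_from_database(path)    # slow path: full page build
    if not user_logged_in:
        with open(cache_file, "w") as f:
            f.write(html)                # repopulate the filesystem cache
    return html
```

The key point is the `user_logged_in` check: every logged-in request takes the slow path, no matter how fresh the cached copy is.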
Taking all that into account, 100+ logged-in users hitting the refresh button on the hotel page right when I clicked "publish" on that page at 9 AM caused about 100 separate attempts to regenerate the cache for that page. Definitely not good. The webserver itself (a piece of software called "nginx") stayed up and running without any issues, and the individual PHP processes didn't have any issues either. It's just that they took so long to complete (tens of seconds in some cases) that the webserver said, "Oh, this process must have crashed, died, or otherwise timed out," at which point it returned the dreaded HTTP 504 error.
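Many simultaneous clients all rebuilding the same expensive page at once is often called a cache stampede, or the "dog-pile" effect. Here is a minimal Python sketch of one common mitigation: serialize the rebuild behind a lock, so 100 simultaneous requests trigger one regeneration instead of 100. All names here are hypothetical; this is not the code our site runs.

```python
import threading

regen_count = 0
regen_lock = threading.Lock()
cache = {}  # in-memory stand-in for the page cache

def expensive_rebuild(path):
    global regen_count
    regen_count += 1            # count how many full rebuilds actually happen
    return f"<html>{path}</html>"

def get_page(path):
    """Stampede-safe lookup: only one thread rebuilds a missing
    entry; the rest wait on the lock and reuse its result."""
    if path in cache:
        return cache[path]
    with regen_lock:            # serialize the rebuild
        if path not in cache:   # re-check: another thread may have won the race
            cache[path] = expensive_rebuild(path)
    return cache[path]

# Simulate 100 users hammering refresh at 9 AM sharp.
threads = [threading.Thread(target=get_page, args=("/hotels",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(regen_count)  # prints 1: one rebuild, not 100
```

Without that lock and re-check, every thread that misses the cache kicks off its own rebuild, which is essentially what happened to us.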
Sorry about that.
So, what did we learn from Thursday morning?
1) Posting links for the hotel reservation pages to Twitter was a great idea. Over 100 of you filled out the survey I put up about the hotel booking experience. And many of you told me that you found the links posted to Twitter to be quite helpful.
2) Generating content dynamically, even for a small subset of users, is just a bad idea. Sometimes it can't be avoided, but I feel that in this case it can. It would not be difficult for me to switch the entire website over to a version that just serves up a static hotel page for the 2-3 hour window when we open hotel reservations next year. Creating a separate DNS record, say http://hotel.anthrocon.org/, to serve the hotel information from a static page on an Amazon EC2 instance for the morning is also a possibility. That would keep the traffic off our primary webserver in the first place.
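For the curious, a static setup like that could be as simple as an nginx server block along these lines (a rough sketch only; the hostname comes from above, but the document root and filename are assumptions):

```nginx
server {
    listen 80;
    server_name hotel.anthrocon.org;   # the separate DNS record described above

    root /var/www/static;              # assumed path to the pre-generated page

    location / {
        # Answer every request with the static hotel page.
        # No PHP, no database, nothing to stampede.
        try_files /hotels.html =404;
    }
}
```

Since every request is a file read, a tiny instance can absorb far more than the traffic levels in the graphs above.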
3) We learned additional details about issues with specific venues through the survey I put up. One venue's reservation system had problems with a specific web browser. Another venue's front desk had a communication issue. We've documented all of the reported problems and brought them up with each of our venues.
We hope that getting hotel rooms on Thursday morning didn't cause anyone too much distress, and we trust that we'll see each and every one of you at Anthrocon 2012.