How our webserver performed under a really heavy load
As folks noticed on Thursday morning when we opened hotels, the website displayed errors for 5-10 minutes starting at 9 AM EST due to the rush of traffic that came in. This was due to a few factors, which I'd like to go into in technical detail below.
First, some raw stats:
Peak bandwidth: approx 3.6 Megabits/sec
Peak connections: 1,060 concurrent connections
Number of users logged into the site: 100+
Here are some graphs that show just how big those numbers are, compared to normal traffic levels:
For those who saw last year's post about the load on the webserver when we opened up hotels, this year's traffic was about twice what last year's traffic was. I didn't see that coming.
As is plainly visible in the traffic graph, the machine that this website runs on is capable of much higher bandwidth throughput. So, what happened?
In a word: caching. Or rather, the lack thereof in certain cases.
Normally, Drupal has a pretty good caching system. But I went beyond that and installed the cacherouter module so that we could cache pages on the filesystem, thus bypassing the database entirely. Normally, this works great. And it worked great last year.
What was different this year? The fact that 100+ users were logged into the site when the hotel pages were published, something I didn't really expect. This became an issue because logged-in users don't see cached pages. They see pages pulled from the database, with the latest and greatest information loaded. This makes sense: if you write a comment on a forum post while logged in, for example, you want to see that comment appear right away rather than the cached page.
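To make the difference between anonymous and logged-in traffic concrete, here is a minimal sketch in Python. It is not Drupal's or cacherouter's actual code: `serve_page`, `render_from_database`, and the dictionary standing in for the filesystem cache are all made-up names for illustration.

```python
# Illustrative only: a dict stands in for the filesystem page cache.
page_cache = {}
render_count = 0


def render_from_database(path):
    """Pretend to build a page from the database (the expensive part)."""
    global render_count
    render_count += 1
    return f"<html>fresh copy of {path}</html>"


def serve_page(path, logged_in):
    """Anonymous visitors get the cached copy; logged-in users get a fresh render."""
    if logged_in:
        # Logged-in users bypass the cache entirely.
        return render_from_database(path)
    if path not in page_cache:
        # The first anonymous hit fills the cache.
        page_cache[path] = render_from_database(path)
    return page_cache[path]


# Compare the cost of 1,000 anonymous hits with 100 logged-in hits.
for _ in range(1000):
    serve_page("/hotels", logged_in=False)
anonymous_renders = render_count

for _ in range(100):
    serve_page("/hotels", logged_in=True)
logged_in_renders = render_count - anonymous_renders

print(anonymous_renders, logged_in_renders)
```

In this toy model the anonymous crowd costs a single render no matter how large it gets, while every logged-in request pays the full price.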
Taking all that into account, 100+ logged-in users hitting the refresh button on the hotel page right when I clicked "publish" on that page at 9 AM caused about 100 separate attempts to regenerate the cache for that page. Definitely not good. The webserver itself (a piece of software called "nginx") stayed up and running without any issues, and the individual PHP processes didn't have any issues. It's just that they took so long to complete (tens of seconds in some cases) that the webserver said, "Oh, this process must have crashed, died, or otherwise timed out", at which point it returned the dreaded HTTP 504 error.
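This kind of pile-up is often called a cache stampede. One general way to avoid it is to let a single request rebuild the page while everyone else waits for the result. The sketch below shows that technique in Python with threads. It is not a description of how Drupal or cacherouter behave, and `get_page` and `slow_render` are hypothetical names.

```python
import threading
import time

cache = {}
cache_lock = threading.Lock()
render_calls = 0


def slow_render(path):
    """Stand-in for an expensive PHP/database page build."""
    global render_calls
    render_calls += 1
    time.sleep(0.05)  # pretend the build takes a while
    return f"<html>{path}</html>"


def get_page(path):
    """Return the cached page, letting only one caller rebuild it at a time."""
    page = cache.get(path)
    if page is not None:
        return page
    with cache_lock:
        # Re-check: another thread may have filled the cache while we waited.
        page = cache.get(path)
        if page is None:
            page = slow_render(path)
            cache[path] = page
        return page


# Simulate 100 users hitting refresh at the same moment.
threads = [threading.Thread(target=get_page, args=("/hotels",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(render_calls)
```

An in-process lock like this only works inside one process. With separate PHP processes, the same idea would need a lock that they can all see, such as a lock file or a key in a shared cache.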
Sorry about that.
So, what did we learn from Thursday morning?
1) Posting links for the hotel reservation pages to Twitter was a great idea. Over 100 of you filled out the survey I put up about the hotel booking experience. And many of you told me that you found the links posted to Twitter to be quite helpful.
2) Generating content dynamically, even if it is for a small subset of users, is just a plain bad idea. Sometimes it can't be avoided, but I feel in this case it can. It would not be difficult for me to switch the entire website over to a version that just serves up the static hotel page for a 2-3 hour window when we open up hotel reservations next year. Creating a separate DNS record for, say, http://hotel.anthrocon.org/ to serve up the hotel information from a static page, and putting that up on an Amazon EC2 instance for the morning, is also a possibility. That would help keep traffic off our primary webserver in the first place.
3) We learned additional details about issues with specific venues through the survey I put up. One venue's reservation system had problems with a specific web browser. Another venue's front desk had a communication issue. We've documented all of the reported problems and brought them up with each of our venues.
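The static-page idea from point 2 boils down to rendering the hotel page once and handing the same file to every visitor, logged in or not. Here is a rough sketch in Python. The function names and the temporary directory are assumptions for illustration; in practice the webserver would serve the file directly, with no script involved at all.

```python
import tempfile
from pathlib import Path


def render_hotel_page():
    """Stand-in for the one expensive dynamic render."""
    return "<html><body>Hotel reservation links go here</body></html>"


def publish_snapshot(directory):
    """Render once and write the result to a plain file."""
    target = Path(directory) / "index.html"
    target.write_text(render_hotel_page(), encoding="utf-8")
    return target


def serve_snapshot(snapshot_path):
    """Every request, logged in or not, just reads the file back."""
    return snapshot_path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    snapshot = publish_snapshot(tmp)
    # 100 simulated requests, none of which touch the renderer again.
    responses = [serve_snapshot(snapshot) for _ in range(100)]

print(len(set(responses)))
```

The trade-off is that nothing on the page can change per user during the window, which is fine for a page whose only job is to list reservation links.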
We hope that no one was caused too much distress while getting hotel rooms on Thursday morning, and trust that we'll see each and every one of you at Anthrocon 2012.
You should put up last year's results too, so we can compare the two.
EC2... maybe. As you said, the cause is logged-in people making Drupal regenerate pages. So, try this.
When load gets too great (past a certain threshold), kill all logged-in sessions and disable logins for, say, 30 minutes to an hour. Read-only mode, maybe?
If the load gets too great again while in the disabled-logins period, switch to the static page.
i'm no programmer or website dev, but just on a PR type guess on things, i can see that causing more fuss than an error for a few minutes here or there =/
Load average never really went above 4 (we have 4 cores) and the spike was so short-lived that this wouldn't have worked for us.
Also, if nobody can log in to the site, makes it kinda difficult for me to change pages, doesn't it?
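For anyone curious, the threshold idea from this thread could be sketched roughly as follows in Python. The function name and the 1.5-per-core threshold are made up for illustration, and nothing like this runs on the site.

```python
def choose_mode(load_average, cores, logins_disabled, threshold_per_core=1.5):
    """Pick a site mode from the 1-minute load average.

    threshold_per_core is a made-up number for illustration.
    """
    overloaded = load_average > cores * threshold_per_core
    if not overloaded:
        return "normal"
    if not logins_disabled:
        # First step of the proposal: kick sessions and disable logins.
        return "disable-logins"
    # Still overloaded with logins off: fall back to the static page.
    return "static-page"


# On Unix, os.getloadavg()[0] would supply the 1-minute load average.
thursday = choose_mode(4.0, 4, logins_disabled=False)   # load ~4 on 4 cores
heavier = choose_mode(12.0, 4, logins_disabled=False)
heaviest = choose_mode(12.0, 4, logins_disabled=True)
print(thursday, heavier, heaviest)
```

With a load average of about 4 on 4 cores, a check like this stays in "normal" mode, which matches the point above that the load average never climbed high enough to act as a trigger.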
It was a mess, but I'm glad you tweeted the hotel URLs, otherwise I wouldn't have reserved a room at the Westin. Please continue to do that for next year's reservations too.
But I was very disappointed there were only 50 rooms out of 172 rooms available at the Courtyard.
The Courtyard has always allocated only 50 rooms for us, and is reluctant to allocate more. Presumably they can meet their revenue targets for that weekend by having the other rooms available to rent at full rate to people who might otherwise be staying at the Westin if we weren't there.
That's what I thought. I wasn't trying to knock the AC staff for lack of trying to get more rooms.
Another suggestion to prevent a server overload: How about giving each hotel a different reservation start time? The Westin opens for reservations at 9 am, Courtyard at 9:30 or 10 am, Omni at 11 am and so on.
I don't think that would help much. I bet 95% of those registering at 9am attempted the Westin first, anyway.
Was "dealers get to reserve a day early" new this year? I'm not begrudging their being able to do so, as I can appreciate the desire for a quiet room close to the DLCC, but if I'd known that was the situation, I wouldn't've spent 20 minutes trying to reserve at the CM before I gave up and moved to the Westin...
No, it was not, and no, none of the hotels were full before February 2. All hotels had plenty of availability. What they did not have was enough availability to meet the demand, which was roughly 1000 people all trying to get rooms there within the period of a few minutes.
The problem is, there would not be anyone else there. The Pirates are playing out of the city that weekend; there is not another reason for keeping the room block so small. I don't understand the business reason for holding back hotel rooms (and running the risk of not filling them at all, losing money) when you can have a guaranteed full hotel and make *some* money.
I agree. I doubt they are selling out during AC. I stayed at the Courtyard in 2009 and 2010 and it was a very quiet hotel. I saw many furries in the lobby and halls, but not that many business "suits" travelers. I don't remember seeing any families in the hotel either.
Look, folks: it boils down to a corporate decision by whoever runs the Marriott Courtyard in Pittsburgh. Their decision-makers have decided that they will give us a maximum of fifty rooms (approximately) at the discounted price. I'm certain that Kage has tried to get more rooms there, the same way I'm sure he's been trying for years to get the Hampton Inn to allocate us some rooms. For whatever reason, the Courtyard management has decided that this is the limit of rooms they'll offer Anthrocon. Without knowing why they limit us to a small number of rooms, we may not be able to do anything about it.
A point to remember: the Courtyard is small. If they have contracts like the Westin that state that there will always be rooms available for certain organizations under that contract, this may literally be the maximum number of rooms they can offer us... if all the contractees were to show up at the same time.
As I understood it, the Marriott's block size decision is based on staffing limitations, correct me if I am of wrongliness.
I've done the honors and added a line that's equal to the stats from LAST year. http://i.imgur.com/Naf5o.png
that obviously shows that the site had a crapload more people trying to rush to get a room this year.
Noting that there were 1060 connections, I have to wonder if nginx has a limit of 1024 connections. If that were increased, would we have had even more?
Since 1060 > 1024, not likely.
Also, the bottleneck here was CPU consumed by PHP processes. (A quick peek at "top" showed me that.) Nginx performance was pretty much a non-issue.
Right, but 1060 is the entire server, not just nginx. Maybe you have another process consuming connections under a different user id? Possibly an email daemon sending out emails to subscribers saying that the hotel info has been posted. Maybe there are two bottlenecks in this event.
No, email is sent out periodically in batches via crontabs. We do not have inbound email service on the machine, either.
you should add to this that people should not use chrome when trying to buy the hotel. it prevents you from putting in your info so you can't buy the room, saying that it is not a secure site. so use firefox or something else. ^^
I do hope that they fix the site by next year, though. I'd much rather they make the hotel site use web standards that work with all browsers than tell people which browsers they can and cannot use.
as would i, but i can't change that. so we have to use what we can.
Hi, Are you well? It is a great post and a great on How our webserver performed under a really heavy load idea. Always I wanted to jazz up the icons on my websites a little. It has some important idea. This is very nice post! I will bookmark this blog. Find an array of Roach Control Killer Products at ******. Get rid of your roach problems professionally. Rely on us for all your roach killer, roach control & cockroach control needs. Thank You Very Much For a Nice & Cool Article.
Yes, this is spam. I am leaving it up here for the irony value (having removed links and the name of the sponsor) because it gave me a snicker.
Kage tried sexy outfits to get more rooms? I wonder why THAT didn't work out? *sarcasm* LOL =)
How can you tell if a cockroach is squinting at you? They don't have eyelids, do they? Teehee! =P
You can tell: they are starting to draw their katana from its sheath...
*runs away with tail tucked, leaving behind a bottle of wine to appease the Shadow Bug* =)