Surviving Pac Man

On Friday, Google showed off a fun new doodle in honor of the 30th anniversary of Pac Man: a Pac Man clone, complete with sounds.

Unfortunately, in the initial release, those sounds started playing automatically—an oversight or an homage to <bgsound>, I guess. Even if Google was open in a background tab or window, or in a hidden iframe created by an add-on, the Pac Man music and sound effects would start.

And that confused some people.

Many people came to SUMO looking for an explanation, and many of them, not finding anything in the knowledge base, started posting to our forum. So many, in fact, that our database server started running out of connections.

The pounding we took on the forums also caused replication on our slave databases to fall behind by as much as 1.25 hours, so even when we wrote an article about the noises [article has been removed], it didn’t show up for most people.

As Sean put it: “We just got DDOSed by Pac Man.”

To shore up the site and bring it back from the brink of toppling over, we worked with IT (thanks, Dave!) to implement a number of temporary solutions. We…

  • …disabled a particular kind of slow, frequent, and useless query.*
  • …blocked Google’s crawler from indexing the site.
  • …disabled our own sumobot’s forum-crawling features.
  • …rotated DB slaves out of the production pool to allow them to catch up.

Google has already removed the Pac Man doodle from their home page, and we can revert most of the emergency measures here on Monday. But the event does remind us to look at what we’re doing in Kitsune, our rewrite, to weather storms like this in the future.

One idea, suggested by Dave Dash, is a read-only mode where all pages that can trigger database writes are temporarily disabled. We’ll be looking pretty seriously at this over the next couple of days.
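
To make the idea concrete, here is a minimal sketch of what a read-only switch could look like. It is not Kitsune code: the `SITE_STATE` flag, the `requires_write` decorator, and the `post_reply` view are all hypothetical names, and a real site would probably keep the flag in settings or a feature-flag store and return a friendly "try again later" page instead of raising.

```python
from functools import wraps

# Hypothetical site-wide switch (assumption: a real site would keep this
# in settings or a feature-flag store, not a module-level dict).
SITE_STATE = {"read_only": False}


class ReadOnlyError(Exception):
    """Raised when a write is attempted while the site is read-only."""


def requires_write(view):
    """Mark a view as one that can trigger database writes."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        # Refuse before the view runs, so no write query is ever issued.
        if SITE_STATE["read_only"]:
            raise ReadOnlyError("The site is temporarily read-only.")
        return view(*args, **kwargs)
    return wrapper


@requires_write
def post_reply(thread_id, text):
    # Stand-in for a view that would insert a row into the database.
    return f"saved reply to thread {thread_id}"
```

Flipping `SITE_STATE["read_only"]` to `True` would turn every decorated view into a cheap refusal while plain read views keep working from cache and the replicas.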

Another important take-away is to make damn sure pages only trigger database writes if they really need to. Writes can never bounce off a cache, so they are very expensive.

Finally, we should be more proactive in how we interact with our Zeus cache. We’ll also think about whether it makes sense to start using Wil Clouser’s Zeus interface, Hera, sooner rather than later.

“Too much traffic” is the best problem a web development team can have. Hopefully, the first time this happens to Kitsune, we’ll be ready.

* The queries that increment the number of views a forum thread has gotten are particularly slow for some reason. They’re also wildly inaccurate, since most people see a cached version of those pages and never trigger the query. The worst part: they occur on every (non-cached) page view, even while just reading.
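
One way a counter like this could avoid writing on every page view is to buffer the increments and flush them in batches. The sketch below is only an illustration under stated assumptions: the `ViewCounter` class, the `threads` table, and its `views` column are hypothetical, an in-memory SQLite database stands in for the real one, and the buffer lives in process memory where a real site would more likely use a shared cache. It does nothing about views served from a cache, so it would not fix the accuracy problem.

```python
import sqlite3
from collections import Counter


class ViewCounter:
    """Buffer view counts in memory and write them out in batches."""

    def __init__(self, conn, flush_every=100):
        self.conn = conn
        self.flush_every = flush_every
        self.pending = Counter()  # thread_id -> views not yet written
        self.buffered = 0

    def record_view(self, thread_id):
        # No database work here unless the buffer is full.
        self.pending[thread_id] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        # One UPDATE per thread, all inside a single transaction.
        rows = [(count, thread_id) for thread_id, count in self.pending.items()]
        with self.conn:
            self.conn.executemany(
                "UPDATE threads SET views = views + ? WHERE id = ?", rows
            )
        self.pending.clear()
        self.buffered = 0


# Hypothetical schema, using an in-memory database for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE threads (id INTEGER PRIMARY KEY, views INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO threads (id) VALUES (1), (2)")
conn.commit()

counter = ViewCounter(conn, flush_every=3)
for thread_id in (1, 1, 2):
    counter.record_view(thread_id)  # the third call triggers a flush
```

The trade-off is that any buffered counts are lost if the process dies before a flush, which seems acceptable for a number that was never precise to begin with.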

(This post was translated into Belarusian. Isn’t that cool?)