
Posts Tagged ‘sumo’

An End and a Beginning

03 Nov

2010 is coming to a close, and, with it, the end of our year-long project to create a new platform for support.mozilla.com (SUMO) is in sight.

For the past year, developing the new platform has been our focus and has shaped our roadmap. When 2011 starts, we’ll begin a new chapter for SUMO. It’s a very exciting time!

For the developers, this is the end of the investment phase and the beginning of the payoff for Kitsune, the code-name for our new platform.

  • The entire site will be faster.
  • We’ll be done rebuilding existing features, and can work on brand new features.
  • We’ll be free of our legacy code base, which will simplify some important sections of Kitsune.
  • We’ll be working on smaller, faster cycles.
  • We’ll be able to take time to circle back and fix things we’ve been unhappy about but were willing to live with during the migration to Kitsune.
  • We’ll be more effective at making the site even faster.
  • We’ll apply our new theme to the entire site, making the experience more consistent and seamless.
  • We’ll be more agile, able to respond to issues faster.
  • We’ll be able to parallelize more.
  • We’ll be able to push updates to the site far more frequently—and we’ve averaged releases every two weeks since August!
  • We’ll be able to take the time we need for large features and disruptive changes without blocking work on, or release of, smaller features and fixes.
  • Nagging issues with sessions will go away.
  • It will be easier to keep our entire platform up to date.
  • We’ll be free of an entire class of security issues.

We’re just beginning to work on our roadmap for Q1, 2011. A lot of it is still up in the air, but there are some fun things on there. And we’ll be taking time to improve performance even more.

In the meantime, we’re getting very close to feature complete on SUMO 2.3, which will move the Knowledge Base over to the new platform.

After 2.3, there will be only one major release left in 2010, SUMO 2.4. SUMO 2.4 will be much smaller than 2.3 (maybe a tenth the number of bugs) and should come out in a matter of weeks instead of months. But 2.4 will also be huge, in that it will move the final piece over to the new platform.

 
1 Comment

Posted in Articles

 

Developing at Scale: Database Replication

17 Jun

When a website is small (like this one, for example), the entire thing, from the web server to the database, can usually live on a single server, even a single virtual server. One of the first things that happens when a website gets bigger is that this stops being true.

One reason is load. A popular website will simply require more than a single server, virtual or otherwise, can give, and the only way to keep scaling is to add more servers. For example, the server may run out of available Apache connections, and the limit cannot be raised without hurting performance.

Another reason is downtime. If a website is served from a single server, and that server goes down for any reason, planned or otherwise, then the website is down. At some point, downtime is essentially unacceptable—just ask Twitter—and redundancy is required.

Enter Replication

A common response is to set up database replication, where one database server operates as a “master,” and one or more other servers operate as “slaves.” In this setup, all of your writes to the database will go to the master, then “replicate” to the slaves, and all or most of the reads will come from the slaves. (Note that the slaves handle all of the replicated writes as well as the reads, so they are not a good place to recycle sub-par hardware.)
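
The routing rule described above can be sketched in a few lines of Python. This is only an illustration: the connection names (`master`, `slave1`, `slave2`) and the keyword check are assumptions made for the example, and nothing here talks to a real database.

```python
import random

# Hypothetical connection names, used only for illustration.
MASTER = "master"
SLAVES = ["slave1", "slave2"]

# Statements whose first keyword modifies data must go to the master.
WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}


def is_write(sql):
    """Return True if the statement's first keyword modifies data."""
    words = sql.split()
    return bool(words) and words[0].upper() in WRITE_KEYWORDS


def choose_server(sql, rng=random):
    """Send writes to the master and spread reads across the slaves."""
    if is_write(sql):
        return MASTER
    return rng.choice(SLAVES)
```

Calling `choose_server("INSERT INTO posts ...")` returns the master, while a `SELECT` lands on one of the slaves. In a real application this decision would more likely live in the framework's database layer than in string matching.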

Replication introduces a new type of problem: if you naively send all reads to the slaves, then data you just wrote may not be there yet.

La…wait for it…g

Even if the master and slave are sitting next to each other with a cable connecting them, replication will probably take more time than your code does to reach the next step. At a minimum, you need to assume that replication lag will be hundreds of milliseconds, an eternity when the time from one line in your web app to the next is measured in micro- or nanoseconds. In the real world, replication may well take seconds, especially if your master and slaves are not physically next to each other.

The result is that ACIDity is essentially broken across the cluster. The write is durable on the master, but it is not yet visible on the slaves, so you cannot simply write data and immediately rely on reading it back.

For example, say you have a large discussion forum. If you naively send all reads to the slaves, then someone’s post may take seconds to appear on the site. This is a problem if you’re trying to show a user their post immediately after posting it.
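
To make the forum example concrete, here is a toy, in-memory simulation of one master and one lagging slave. The class name, the half-second lag, and the hand-advanced clock are all assumptions for illustration. Real replication is asynchronous and far less predictable.

```python
class ToyReplicatedStore:
    """In-memory stand-in for one master and one lagging slave."""

    def __init__(self, lag_seconds):
        self.lag_seconds = lag_seconds
        self.now = 0.0        # simulated clock, advanced by hand
        self.master = {}
        self.slave = {}
        self._pending = []    # (time the slave applies it, key, value)

    def write(self, key, value):
        """Apply the write on the master and queue it for the slave."""
        self.master[key] = value
        self._pending.append((self.now + self.lag_seconds, key, value))

    def advance(self, seconds):
        """Move the clock forward and apply any writes that are now due."""
        self.now += seconds
        still_pending = []
        for apply_at, key, value in self._pending:
            if apply_at <= self.now:
                self.slave[key] = value
            else:
                still_pending.append((apply_at, key, value))
        self._pending = still_pending

    def read(self, key, from_master=False):
        source = self.master if from_master else self.slave
        return source.get(key)


store = ToyReplicatedStore(lag_seconds=0.5)
store.write("post:1", "Hello, forum!")
stale = store.read("post:1")                     # slave has not caught up yet
fresh = store.read("post:1", from_master=True)   # master already has the post
store.advance(1.0)
caught_up = store.read("post:1")                 # slave has now applied the write
```

Here `stale` is `None` because the slave has not yet applied the write, while `fresh` and `caught_up` both hold the post.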

Smarter Reading

The solution is to occasionally read from the master. When you need to access data that was just written, it is probably only available on the master, so that’s where you’ll read it. Within a single HTTP request, this is fairly simple: just force any queries that rely on recently-written data to the master.
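
One simple way to do this within a request is sketched below, assuming a hypothetical `RequestRouter` object that is created fresh for each HTTP request. It is coarser than strictly necessary: once the request has written anything, every later read in that request goes to the master, not just the reads that depend on the new data.

```python
import random


class RequestRouter:
    """Tracks, for one HTTP request, whether reads must go to the master."""

    def __init__(self, slaves, rng=random):
        self.slaves = list(slaves)
        self.rng = rng
        self.wrote = False

    def for_write(self):
        """Writes always go to the master; remember that one happened."""
        self.wrote = True
        return "master"

    def for_read(self):
        """After a write in this request, read from the master too."""
        if self.wrote:
            return "master"
        return self.rng.choice(self.slaves)
```

The trade-off is a little extra load on the master in exchange for never having to reason about which individual queries depend on fresh data.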

Outside of a single HTTP request, this is slightly more complex. If you’re following the practice of redirecting after a POST request to a GET request (which you should), then creating a new forum post and viewing it will happen in two different HTTP requests.

One way around this is to set a very short-lived cookie that tells your web app to continue reading from the master. If any write occurs in a request, the response should include this cookie. The exact time-to-live will depend on how long your replication lag usually is—cover at least 4 or 5 standard deviations. Any request that has this cookie should honor it by reading only from the master.
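
A rough sketch of the cookie approach follows. The cookie name `read_from_master` is made up for the example, and reading “4 or 5 standard deviations” as “mean lag plus that many standard deviations” is an assumption about how the time-to-live would be chosen from measured lag samples.

```python
import math
import statistics

PIN_COOKIE = "read_from_master"   # hypothetical cookie name


def pin_ttl_seconds(lag_samples, deviations=5):
    """Mean lag plus `deviations` standard deviations, rounded up to whole seconds."""
    mean = statistics.mean(lag_samples)
    spread = statistics.pstdev(lag_samples)
    return max(1, math.ceil(mean + deviations * spread))


def use_master(request_cookies):
    """Honor the cookie: any request carrying it reads only from the master."""
    return PIN_COOKIE in request_cookies


def response_cookies(wrote_this_request, ttl_seconds):
    """Cookies to add to the response, as {name: (value, max_age_seconds)}."""
    if wrote_this_request:
        return {PIN_COOKIE: ("1", ttl_seconds)}
    return {}
```

The web framework would turn the returned mapping into a `Set-Cookie` header with the matching `Max-Age`. Once the cookie expires, the browser stops sending it and reads fall back to the slaves.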

A Pitch

One of the hardest things for new web developers is developing large-scale applications: first, you need a large-scale application! Setting up database replication is a huge pain, and if your site isn’t getting enough traffic, it’s not worth it.

Mozilla is one way aspiring web developers can get some experience working with large-scale web apps. All of our web apps are open source and open to contributions from community members. To get involved, stop by #webdev in IRC!

 
3 Comments

Posted in Articles

 

Weekly Update for 06/14/2010

14 Jun

Last week could have gone better. We tried to push SUMO 2.1 twice only to realize we had some issues with respect to replication that need to get ironed out.

We think we have a fix for these issues and are rounding out the tests for that fix, but we won’t really know unless we can test in a replicated environment. There are bugs open for IT to help us with that and to get replication set up for our staging server.

As Morgamic said, we’ll gather info, document, learn and innovate, then repeat next time.

And, as “unsuccessful” pushes go, these went really well. Both times we gave it an hour, then were able to back everything out and reset in another half-hour, coming in well under the downtime window.

Last week

  • Tried to push 2.1, twice. It didn’t take.
  • Filed IT bugs re: replication in staging.
  • Started thinking about Q3 goals.
  • Got the 2.2 (“questions”) branch rolling on Hudson.
  • Helped get people on the same page w/r/t 2.3 deliverables and timeline. (At least we’ll say I helped.)
  • Got all the people working on chat together.
  • Worked out a potential solution to our replication issues with Jeff and Erik.
  • Triaged 2.2—only about 5 bugs got moved out.
  • Reviewed 2.2 UI work, and a number of subsequent patches.

This week (me)

  • Have a timeline in place for 2.1 and 2.2.
  • Figure out replication in staging with IT.
  • Work out roughly what Q3 will look like.
  • 1.5.5.1.
  • Get enough sleep.

This week (team)

  • Fix our replication issues.
  • Continue on 2.2 and accelerate.
  • Work with Cheng and Howse to get the AAQ done.
 
Comments Off

Posted in Articles

 

Weekly Update for 06/07/2010

08 Jun

I missed last week. I blame the holiday on Monday. Also Erik started, which is very exciting!

Tomorrow afternoon is our planned push for SUMO 2.1, which brings our new discussion forum component and migrates the old data into it. This is huge, since it’s the first new component that handles content creation. (We’ve been running search results on Kitsune for a while now.)

Everything on our staging server feels faster on the Kitsune pages than the old pages. I can’t quite count to 2 loading a Kitsune page. I can usually get to 3 on a Tiki page. That in itself is a huge win, to me. On top of the speed, we’ve made some big leaps in our infrastructure and have done a lot of work that will directly enable 2.2, our support questions milestone.

Last two weeks

  • Closed out and reviewed a number of 2.1 bugs. Really, I lost count. It’s been fantastic to see the work spread out across the team.
  • Got 2.1 ready to go out tomorrow!(!!)
  • Fixed a number of small and last-minute 2.1 bugs.
  • Helped Paul finish out the data migration work so he could focus on his last final and graduate! (Congrats, Paul!)
  • Built avatars for users without them.
  • Got email notifications to go out with some help from Jeremy.
  • Welcomed Erik. Helped him get all the development environment stuff worked out.

This week (me)

  • Get everyone introduced to Rypple.
  • Navigate a smooth 2.1 launch.
  • Poll the team about a SUMOdev on-site.
  • Finish reviewing Ricky’s 2.2 UI work.
  • Triage 2.2 and focus the bugs.
  • Help everyone figure out deliverables and timelines for 2.3 mockups as best I can. (Mostly I’ve done what I can here, I think.)

This week (team)

  • Launch 2.1. Smoothly.
  • Go go go on 2.2. On staging the day (evening?) after 2.1 launches.
  • Start spreading around reviews more.
  • Help triage 2.2 and focus it.
 
Comments Off

Posted in Articles

 

Surviving Pac Man

24 May

On Friday, Google showed off a fun new doodle in honor of the 30th anniversary of Pac Man: a Pac Man clone, complete with sounds.

Unfortunately, in the initial release, those sounds started playing automatically—an oversight or an homage to <bgsound>, I guess. Even if Google was open in a background tab or window, or in a hidden iframe created by an add-on, the Pac Man music and sound effects would start.

And that confused some people.

Many people came to SUMO looking for an explanation, and many of them, not finding anything in the knowledge base, started posting to our forum. So many, in fact, that our database server started running out of connections.

The pounding we took on the forums also caused replication on our slave databases to fall behind by as much as 1.25 hours, so even when we wrote an article about the noises [article has been removed], it didn’t show up for most people.

As Sean put it: “We just got DDOSed by Pac Man.”

To shore up the site and bring it back from the brink of toppling over, we worked with IT (thanks, Dave!) to implement a number of temporary solutions. We…

  • …disabled a particular kind of slow, frequent, and useless query.*
  • …blocked Google’s crawler from indexing the site.
  • …disabled our own sumobot’s forum-crawling features.
  • …rotated DB slaves out of the production pool to allow them to catch up.

Google has already removed the Pac Man doodle from their home page, and we can revert most of the emergency measures here on Monday. But the event does remind us to look at what we’re doing in Kitsune, our rewrite, to weather storms like this in the future.

One idea, suggested by Dave Dash, is a read-only mode where all pages that can trigger database writes are temporarily disabled. We’ll be looking pretty seriously at this over the next couple of days.
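
What such a read-only mode might look like is sketched below. This is an assumption-laden illustration, not how Kitsune implements it: the `SETTINGS` switch, the `writes_to_db` decorator, and the two example views are all hypothetical, and views return plain `(status, body)` tuples to keep the example self-contained.

```python
import functools

# Hypothetical site-wide switch, e.g. flipped from configuration during an incident.
SETTINGS = {"read_only": False}


def writes_to_db(view):
    """Mark a view as one that can trigger database writes."""
    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        if SETTINGS["read_only"]:
            # Refuse before the view runs, so no write ever reaches the database.
            return 503, "The site is temporarily read-only. Please try again soon."
        return view(*args, **kwargs)
    return wrapper


@writes_to_db
def post_reply(thread_id, text):
    return 200, "Reply saved to thread %d" % thread_id


def view_thread(thread_id):
    # Read-only pages are left unmarked and keep working.
    return 200, "Thread %d" % thread_id
```

With the switch on, `post_reply` answers with a 503 while `view_thread` is unaffected, which takes write pressure off the master and leaves reading intact.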

Another important take-away is to make damn sure pages only trigger database writes if they really need to. Writes can never bounce off a cache, so they are very expensive.

Finally, we should be more proactive in how we interact with our Zeus cache. We’ll also think about whether it makes sense to start using Wil Clouser’s Zeus interface, Hera, sooner rather than later.

“Too much traffic” is the best problem a web development team can have. Hopefully, the first time this happens to Kitsune, we’ll be ready.

* The queries that increment the number of views a forum thread has gotten are particularly slow for some reason. They’re also wildly inaccurate, since most people see a cached version of those pages and never trigger the query. The worst part: they occur on every (non-cached) page view, even while just reading.

(This post was translated into Belorussian, isn’t that cool?)

 
3 Comments

Posted in Articles