How TodaysMeet Works

I want to write about TodaysMeet’s 2015 State of the Union site, but I realized I’d spent half the post on the existing architecture. So this is part 1, and here is part 2!

A little over two years ago, I set out to completely replace TodaysMeet’s platform. Then, over the past year, I’ve taken the new platform and built it into a distributed, service-oriented—buzzword-compliant—application.

So here it is: how TodaysMeet actually works today. In part 2, I’ll give a very concrete example of how great this can be.

a diagram with most of the moving parts

The two big components are the TodaysMeet Django app (let’s call it tm for now) and the websocket connection service, called ekg.

tm is written in Python, using Django. It currently serves as both the web app and the API, including authentication for both, though I’ve been trying to separate those as much as possible. It is solely responsible for talking to the database (and uses memcached where appropriate). Almost all non-static traffic goes through tm.

ekg is written in JavaScript, using NodeJS (for now) and Primus for websocket+fallback connections. The only bi-directional communication is connecting and joining a room; otherwise, everything comes in via the API and goes out via streams. It talks to tm via internal API endpoints when it needs information from the database and relies on in-process caching where appropriate.

Let’s look at the most common case, posting a new comment to a room. Lots of actions (closing or pausing a room, deleting a comment) follow a similar flow.

The POST request goes to tm. It starts a transaction, writes the comment to the database, then posts the comment to ekg before finally committing the transaction. ekg pushes the message to all the clients in that room.
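Sketched in Python with illustrative names (the real view is a Django endpoint, and the ekg call goes over HTTP), the ordering of the flow looks roughly like:

```python
# A rough sketch of the flow; every name here is illustrative, not the real code.
from contextlib import contextmanager

log = []  # records each step so the ordering is visible

@contextmanager
def transaction():
    # Stand-in for Django's transaction.atomic(): commits on clean exit.
    log.append("begin")
    yield
    log.append("commit")

def save_comment(room, text):
    log.append("insert")           # write the comment row to the database
    return {"room": room, "text": text}

def notify_ekg(comment):
    log.append("push")             # in production, an HTTP POST toward ekg

def post_comment(room, text):
    with transaction():
        comment = save_comment(room, text)
        notify_ekg(comment)        # posted *before* the transaction commits
    return comment

post_comment("room1", "hello")
# log is now ["begin", "insert", "push", "commit"]
```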

Except it doesn’t post directly to ekg. During deploys, ekg gets restarted, which can take over a second as it stops accepting new connections, tells the current clients to reconnect in a few seconds, shuts down, and starts back up. Some posts would fail, causing users to see an error. tm also needs to scale to multiple ekg hosts without making the API slower, which means it shouldn’t post to each ekg host directly.

To solve these problems, there is an intermediate service called reflektor. It runs on the same boxes as tm. reflektor accepts any HTTP request and responds immediately, then replays it against a list of downstream servers.

Because it responds immediately, the median time for tm to post the comment is just over 3ms, and the 90th-percentile time is 4ms. No matter how many downstream systems reflektor will eventually talk to, the tm “new message” API endpoint stays extremely fast.
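The accept-then-replay pattern is simple to sketch. Here is a minimal Python version (a thread and a queue stand in for the Node event loop; the class and method names are made up):

```python
# Illustrative sketch of reflektor's accept-then-replay pattern.
import queue
import threading

class Reflektor:
    def __init__(self, downstreams):
        self.downstreams = downstreams   # callables standing in for downstream hosts
        self.q = queue.Queue()
        threading.Thread(target=self._replay, daemon=True).start()

    def handle(self, request):
        self.q.put(request)   # queue the request...
        return 202            # ...and respond immediately

    def _replay(self):
        while True:
            request = self.q.get()
            for send in self.downstreams:   # replay against every downstream
                send(request)               # (real failures would be retried)
            self.q.task_done()

received = []
r = Reflektor([received.append])
status = r.handle({"path": "/rooms/abc/messages", "body": "hi"})
r.q.join()   # only for this demo: wait for the background replay to finish
```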

reflektor queues these requests in-process using a library I call dq for “dumb queue”—that I swear I will rename and open source or replace at some point. dq:

  • stores objects in memory with no persistence; if it crashes, or is hard-stopped, they are lost;
  • is a FIFO queue, objects get processed in order;
  • has a configurable task to run on each object that can be synchronous or asynchronous and decides what counts as success or failure;
  • uses setImmediate or setTimeout to avoid blocking the event loop;
  • automatically retries with backoff on errors;
  • can be paused or put in “drain” mode;
  • goes to “sleep” when empty to limit CPU time and automatically wakes up on push;
  • and emits events like 'drained' so I can do graceful restarts.
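The core of that behavior is easy to sketch. Here is a toy Python rendering (the real dq is Node and rides the event loop with setImmediate/setTimeout; everything below is illustrative, not dq’s actual API):

```python
# Toy version of dq's core: in-memory FIFO, retry with backoff, simple events.
import collections
import time

class DumbQueue:
    def __init__(self, task, max_retries=3, backoff=0.01):
        self.task = task            # run on each object; raising counts as failure
        self.items = collections.deque()   # in memory only: lost on a crash
        self.max_retries = max_retries
        self.backoff = backoff
        self.handlers = {}          # event name -> callback, e.g. 'drained'

    def on(self, event, callback):
        self.handlers[event] = callback

    def push(self, obj):
        self.items.append(obj)

    def process(self):
        # Drain in FIFO order, retrying each failure with exponential backoff.
        while self.items:
            obj = self.items.popleft()
            for attempt in range(self.max_retries):
                try:
                    self.task(obj)
                    break
                except Exception:
                    time.sleep(self.backoff * 2 ** attempt)
        if "drained" in self.handlers:
            self.handlers["drained"]()

done, events = [], []
attempts = {"n": 0}

def flaky(obj):
    attempts["n"] += 1
    if attempts["n"] == 1:          # fail once, then succeed on retry
        raise RuntimeError("downstream hiccup")
    done.append(obj)

q = DumbQueue(flaky)
q.on("drained", lambda: events.append("drained"))
q.push("a"); q.push("b")
q.process()
# done == ["a", "b"], events == ["drained"]
```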

(There are better, more robust, COTS and FLOSS tools. I built dq because I didn’t want the operational overhead of running something like nsq on small VPSes—I really wanted the queue to run on localhost for each tm server to limit network time—I wanted to stick to HTTP when possible, and I did not really need its guarantees. But the idea is similar.)

tm knows about its local reflektor and about those on its neighbors, so if the local reflektor is restarting—which happens in serial—tm will try a neighbor. Since adding reflektor and neighbor awareness, deploys are error-free and unnoticeable.
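In code, that fallback is just a loop over hosts, local reflektor first. A sketch (the host names and the send function are invented for illustration):

```python
# Illustrative neighbor-fallback: try the local reflektor, then neighbors.
def post_with_fallback(hosts, send, payload):
    """hosts lists the local reflektor first; send(host, payload) raises on failure."""
    last_error = None
    for host in hosts:
        try:
            return send(host, payload)   # first host that accepts wins
        except ConnectionError as exc:
            last_error = exc             # e.g. local reflektor mid-restart
    raise last_error                     # every reflektor was down

def send(host, payload):
    if host == "localhost":              # simulate a local restart in progress
        raise ConnectionError("connection refused")
    return (host, payload)

result = post_with_fallback(["localhost", "neighbor-1"], send, {"text": "hi"})
# result == ("neighbor-1", {"text": "hi"})
```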

The biggest requirement for me has been speed—TodaysMeet is a real-time communication tool. So is it fast?

Yes. Because tm posts the message to reflektor before it commits the transaction and responds to the request, I actually had to work around a problem where the new message arrives in the browser before the POST completes!

I said I’d talk about how great this is. Sean has talked about streams at Bitly a bunch, and nearly everything he said applies here. I’ll get into it more in part two, but this architecture makes it incredibly easy to build up new systems or features without interfering with what’s already running.