A little over two years ago, I set out to completely replace TodaysMeet’s platform. Then, over the past year, I’ve taken the new platform and built it into a distributed, service-oriented—buzzword-compliant—application.
So here it is: how TodaysMeet actually works today. In part 2, I’ll give a very concrete example of how great this can be.
The two big components are the TodaysMeet Django app, let’s call it tm for now, and the websocket connection service, called ekg.
tm is written in Python, using Django. It currently serves as both the web app and API, including authenticating for both, though I’ve been trying to separate those as much as possible. It is solely responsible for talking to the database (and uses memcached where appropriate). Almost all non-static traffic goes through tm.

ekg holds the open websocket connections and pushes new messages and other events out to the clients in each room. It talks to tm via internal API endpoints when it needs information from the database and relies on in-process caching where appropriate.
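To make that split concrete, an internal endpoint in tm might look roughly like the sketch below. The app, model, and URL names here are hypothetical, not TodaysMeet’s actual code; the point is that a service like ekg only ever reads through tm, and tm puts memcached in front of the database.

```python
# Hypothetical internal endpoint in tm that a service like ekg could call
# to look up a room without ever touching the database itself.
from django.core.cache import cache
from django.http import Http404, JsonResponse

from rooms.models import Room  # assumed app/model names


def internal_room_detail(request, room_name):
    # Serve from memcached when possible; fall back to the database.
    data = cache.get(f"room:{room_name}")
    if data is None:
        try:
            room = Room.objects.get(name=room_name)
        except Room.DoesNotExist:
            raise Http404
        data = {"name": room.name, "open": room.open}
        cache.set(f"room:{room_name}", data, 60)
    return JsonResponse(data)
```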
Let’s look at the most common case, posting a new comment to a room. Lots of actions (closing or pausing a room, deleting a comment) follow a similar flow.
The POST request goes to tm. tm starts a transaction, writes the comment to the database, then posts the comment to ekg before finally committing the transaction. ekg pushes the message to all the clients in that room.
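In tm, that flow looks roughly like the sketch below. The model, field, and URL names are hypothetical, and as you’ll see next, the post doesn’t really go straight to ekg, but the shape is the same: write inside a transaction, notify the push side, then let the transaction commit.

```python
# A minimal sketch of the "new comment" flow, with made-up names; not
# TodaysMeet's actual code.
import requests
from django.db import transaction
from django.http import JsonResponse

from rooms.models import Comment  # assumed model

PUSH_URL = "http://localhost:8090/rooms/{room}/messages"  # assumed push endpoint


def post_comment(request, room_name):
    with transaction.atomic():
        # 1. Write the comment to the database (still uncommitted).
        comment = Comment.objects.create(
            room_name=room_name, text=request.POST["text"]
        )

        # 2. Tell the push side about it so connected clients see it
        #    in real time.
        requests.post(
            PUSH_URL.format(room=room_name),
            json={"id": comment.id, "text": comment.text},
            timeout=1,
        )

        # 3. The transaction commits when the atomic block exits.
    return JsonResponse({"id": comment.id})
```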
Except, it doesn’t post directly to ekg. During deploys, ekg gets restarted, which can take over a second as it stops accepting new connections, tells the current clients to reconnect in a few seconds, shuts down, and gets running again. Some posts would fail, causing users to see an error. It’s also important to scale to multiple ekg hosts without making the API slower, meaning it shouldn’t post to each ekg host directly.
To solve these problems, there is an intermediate service called reflektor, which runs on the same boxes as tm. reflektor accepts any HTTP request and responds immediately, then replays it against a list of downstream servers.
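Stripped of the retry logic, the pattern is roughly the following (a Python sketch of the idea with made-up hosts and ports, not reflektor’s actual code): respond to every request immediately, queue it, and replay it against each downstream server from a worker.

```python
# A rough sketch of the accept-then-replay pattern, not reflektor itself.
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

DOWNSTREAM = ["http://ekg1.internal:9000", "http://ekg2.internal:9000"]  # hypothetical hosts
pending = queue.Queue()


class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        pending.put((self.path, self.rfile.read(length)))
        # Respond before doing any downstream work at all.
        self.send_response(202)
        self.end_headers()


def replay():
    while True:
        path, body = pending.get()
        for base in DOWNSTREAM:
            try:
                requests.post(base + path, data=body, timeout=5)
            except requests.RequestException:
                pass  # the real thing retries with backoff; see dq below


if __name__ == "__main__":
    threading.Thread(target=replay, daemon=True).start()
    HTTPServer(("localhost", 8090), RelayHandler).serve_forever()
```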
Because it responds immediately, the median time for tm to post the comment is just over 3ms. The 90%ile time is 4ms. It doesn’t matter how many downstream systems reflektor will eventually talk to; the tm “new message” API endpoint is extremely fast.
reflektor queues these requests in-process using a library I call dq, for “dumb queue”, that I swear I will rename and open source or replace at some point. dq:
- stores objects in memory with no persistence; if it crashes, or is hard-stopped, they are lost;
- is a FIFO queue: objects get processed in order;
- has a configurable task to run on each object that can be synchronous or asynchronous and decides what counts as success or failure;
- uses setTimeout to avoid blocking the event loop;
- automatically retries with backoff on errors;
- can be paused or put in “drain” mode;
- goes to “sleep” when empty to limit CPU time and automatically wakes up on push;
- and emits events like 'drained' so I can do graceful restarts.
(There are better, more robust, COTS and FLOSS tools. I built dq because I didn’t want the operational overhead of running something like nsq on small VPSes; I really wanted the queue to run on localhost for each tm server to limit network time, I wanted to stick to HTTP when possible, and I did not really need its guarantees. But the idea is similar.)
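To make those semantics concrete, here is a stripped-down sketch of the same idea in Python (dq itself is JavaScript, hence setTimeout, and does more than this): an in-memory FIFO that runs a task on each object, retries with backoff, sleeps while empty, and signals when it has drained.

```python
# An illustration of a dq-style queue, not dq itself: in-memory FIFO,
# a configurable task per object, retries with backoff, and a callback
# when the queue drains.
import queue
import threading
import time


class DumbQueue:
    def __init__(self, task, max_retries=5, base_delay=0.5, on_drained=None):
        self.task = task                # callable(obj) -> truthy on success
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.on_drained = on_drained
        self.paused = False
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def push(self, obj):
        self._q.put(obj)                # wakes the worker if it was sleeping

    def _run(self):
        while True:
            obj = self._q.get()         # blocks ("sleeps") while the queue is empty
            while self.paused:          # crude pause/drain support
                time.sleep(0.1)
            for attempt in range(self.max_retries):
                try:
                    if self.task(obj):
                        break           # the task decided this object succeeded
                except Exception:
                    pass
                time.sleep(self.base_delay * 2 ** attempt)  # backoff before retrying
            if self._q.empty() and self.on_drained:
                self.on_drained()       # e.g. finish a graceful restart
```

In reflektor’s case the task is roughly “replay this request against a downstream server,” and the drained signal is what makes graceful restarts possible.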
tm knows about its local reflektor and about those on its neighbors, so if the local reflektor is restarting (which happens in serial), tm will try a neighbor. Since adding reflektor and neighbor awareness, deploys are error-free and unnoticeable.
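The failover itself is small. Roughly, with made-up hostnames and a helper that isn’t TodaysMeet’s real code, tm does something like this:

```python
# A sketch of "try the local reflektor, then a neighbor"; hostnames and
# the endpoint are hypothetical.
import requests

REFLEKTOR_HOSTS = [
    "http://localhost:8090",       # the local reflektor, tried first
    "http://tm2.internal:8090",    # neighbors, for when the local one is restarting
    "http://tm3.internal:8090",
]


def post_event(path, payload):
    for base in REFLEKTOR_HOSTS:
        try:
            resp = requests.post(base + path, json=payload, timeout=0.25)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            continue               # mid-deploy restart; try the next host
    raise RuntimeError("no reflektor host accepted the event")
```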
The biggest requirement for me has been speed: TodaysMeet is a real-time communication tool. So is it fast? Fast enough that, because tm posts the message to reflektor before it commits the transaction and responds to the request, I actually had to work around a problem where the new message arrives in the browser before the response to the original POST.
I said I’d talk about how great this is. Sean has talked about streams at Bitly a bunch, and nearly everything he said applies here. I’ll get into it more in part two, but this architecture makes it incredibly easy to build up new systems or features without interfering with what’s already running.