Starring our Monolith in the Role of...

Top-down photograph of a large, thin rock at the seashore casting a long shadow.
Photo by Dan Meyers / Unsplash

Stop me if you've heard this one before.

As a small, early-stage start-up, we built an application in [Rails / Django / Express / Gin] because we were able to lean on the framework and ecosystem for a lot of the boring stuff, and focus on our unique value. Now, as our business has matured and our team has grown, we're struggling. The application has grown unevenly; test coverage in the oldest, and most critical, parts is pretty thin; we are encountering scaling challenges; any minor issue can take down our entire service. Engineers have started using the word "monolith" more and more.

At this point in the story, many teams will decide that they need to move toward "microservices" or a "service-oriented architecture." They'll often start with some new piece of functionality, and build it in the "right" way. If this goes well, they'll have a new, isolated service, with no dependencies on the monolith. If it goes less well, some requests might go from the monolith to the new service and back to the monolith before they can start generating responses.

(I promise I will stop using scare quotes now.)

(Also, I'm not commenting here on whether microservices is the best choice. That's something each team will need to make a call about. There are trade-offs either way.)

At some point, either way, the team will begin breaking down the monolith. Building new functionality outside of the monolith might slow down the rate at which the original problems get worse, but it doesn't improve the situation. Those problems are still present, and it's usually impossible to fully stop changing code in the monolith.

Breaking up a monolith is a huge project. Years of work went into building this, and it will likely take years to unmake—especially because the unmaking is sustaining engineering work, and we have to keep shipping at the same time. Given that, we would prefer to avoid costly missteps. That is particularly hard because, in the ever-changing world of software, our idea of good is a moving target.

I have found three helpful lenses for making this kind of refactor effective:

  1. Make big decisions for 3–5 years at a time.
  2. Think of the monolith "in the role of" its different responsibilities.
  3. Work from the outside in.

Make big decisions for 3–5 years at a time

The tools, practices, and skills available to us as engineers are always changing. New frameworks, protocols, libraries, languages, and versions come out constantly. New teammates with new experiences join. If we're not careful, this can lead to a proliferation of inconsistent technology internally. Always choosing the "right tool for the job" can lead to over-fitting a tool that our team can't support.

Before committing to a multiyear project like breaking up a monolith, we need to make some decisions about where we're going. Do we want to use the same framework? The same language? How will our services communicate? What is the future for us?

These questions can be daunting! So instead of trying to answer them forever, try to answer them for a while. In start-up environments, I've found that assuming you won't revisit a decision for three to five years is a good target. If you build a React-based front end, decide that Python is your back end language of choice, or that internal APIs should use REST, assume that you'll live with that decision for now—but not forever.

Living with a technology decision means a few things. It's worth investing in tooling around that technology to make developers more effective, and in training and possibly even hiring to round out the team's level of expertise. The cost of accepting constraints and working within them is typically lower over time than trying to work around them.

Three to five years is also, I think, a pretty good life span for a technical vision. As a vision ages, the business, team, and technical hurdles will change. It's worth scanning the horizon from where you are now—the new normal, hopefully inspired by the last vision—and setting a fresh course.

Make these big decisions as you encounter them—doing what you have to to feel confident that they won't result in disasters—and they should last you until the next large-scale project.

Think of the monolith "in the role of" its different responsibilities

By definition, a monolith has a lot of different responsibilities. In the web application world, it might provide database access for everything, execute database migrations, manage user and session authentication, render HTML and other dynamic content, build static assets, serve HTTP APIs, and contain all of the business logic. When Django says "batteries included," it means all of those and more.

We can start looking for seams along which to carve up the monolith by saying "the monolith in the role of the user data store" or "the monolith as the HTTP API."

The exact responsibilities will depend on your domain. For example, a website for reviewing books may have the concept of an "Author," distinct from a "User," meaning the monolith would have roles as the canonical data store for Authors and as the canonical data store for Users. A fan fiction website, on the other hand, may not have the concept of Author as a distinct entity; since any User could be the author of a piece on the site, the idea of "Book Author" may be just a property of a "Book" or "Series".

This can be complicated by certain implementation details. Code is often used by multiple roles—e.g. the currency formatter may be used by both the "Sends Receipts" role and the "Displays HTML pages (like Billing History)" role. Database enum tables—like a "US States" table—might be shared, which can make them easy to mistake for important entities.

Identifying the different roles your monolith plays is the hard part. Look for patterns in how people talk about it, and in the code. Lean on your domain knowledge. Different roles are potential candidates to move out of the monolith into their own services, and code shared between roles is a good candidate to become a shared library.

Work from the outside in

You've identified some of the major roles your monolith plays. Now, where do you start?

We can imagine the different roles as if they were different services already and look at the dependencies between them. The API role depends on the Session Store role, the User database role, and almost certainly more. The HTML rendering role may depend on the Session Store and the User database, or maybe nothing.

Note that while this may be similar to the relationships between domain entities, we can use a stricter definition of "depends on" here: does this role internally use data or results from another role? For example, in our book reviews website, we might have several interrelated entities: Users, Books, Authors, Reviews. However, a "Reviews data store" role doesn't need data from (i.e. depend on) a "Books data store" role.

In most web frameworks, there may be a foreign key relationship from a row in a reviews table to a row in a books table. While this looks like a dependency, it is only one physical way of representing the "is a review of" relationship. If we removed the foreign key constraint, we could still look up a list of Reviews by Book or use the book_id from a Review to find the relevant Book. In a distributed system, maybe we'd replace the integer ID with a UUID or URN like urn:isbn:978-1732102200, but the idea is the same: we only need to represent the relationship somehow, to have some consistent identifier we can use across systems.

A "Top Rated Books" role that gets a list of the highest rated books, however, probably does depend on the "Reviews data store," and possibly on the "Books data store" as well.
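As a rough illustration of representing the "is a review of" relationship with a shared identifier instead of a foreign key, here is a minimal sketch. The Book and Review classes, the field names, and the placeholder title are all assumptions made for the example; only the URN format comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    urn: str
    title: str


@dataclass(frozen=True)
class Review:
    id: int
    book_urn: str  # a plain identifier, not a database foreign key
    stars: int


# Two independent "stores" that could live in separate services.
# The title is a placeholder, not the real title for this ISBN.
BOOKS = {
    "urn:isbn:978-1732102200": Book("urn:isbn:978-1732102200", "Placeholder Title"),
}
REVIEWS = [
    Review(id=1, book_urn="urn:isbn:978-1732102200", stars=5),
    Review(id=2, book_urn="urn:isbn:978-1732102200", stars=3),
    Review(id=3, book_urn="urn:isbn:000-0000000000", stars=4),
]


def reviews_for_book(reviews, book_urn):
    """Look up Reviews by Book using only the shared identifier."""
    return [review for review in reviews if review.book_urn == book_urn]


def book_for_review(books, review):
    """Find the Book for a Review, or None if the Books store doesn't know it."""
    return books.get(review.book_urn)
```

Neither lookup needs the two stores to share a database. The cost is that nothing enforces the link, so a Review can point at a Book the other store has never heard of.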

Look for the roles with the fewest dependencies on them, i.e. the fewest "depends on" arrows pointing into them. In web-ish applications, this will often be an API or UI layer. The fewer dependencies on a role, the easier it will be to remove. But it's not without work.
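To make "fewest arrows pointing in" concrete, here is a small sketch that counts inbound dependencies for a hand-written role graph. The graph is illustrative: it only encodes the example dependencies mentioned above, and a real monolith will have more roles and more arrows.

```python
from collections import Counter

# Each role maps to the roles it depends on (illustrative, not exhaustive).
DEPENDS_ON = {
    "HTTP API": ["Session store", "Users data store",
                 "Reviews data store", "Books data store"],
    "HTML rendering": ["Session store", "Users data store"],
    "Top Rated Books": ["Reviews data store", "Books data store"],
    "Session store": [],
    "Users data store": [],
    "Reviews data store": [],
    "Books data store": [],
}


def inbound_counts(depends_on):
    """Count how many other roles depend on each role."""
    counts = Counter({role: 0 for role in depends_on})
    for dependencies in depends_on.values():
        for dependency in dependencies:
            counts[dependency] += 1
    return counts


def extraction_candidates(depends_on):
    """Return roles ordered by fewest inbound dependencies, then by name."""
    counts = inbound_counts(depends_on)
    return sorted(counts, key=lambda role: (counts[role], role))
```

With this graph, the API, HTML rendering, and Top Rated Books roles come first because nothing else depends on them, which matches the intuition that the outer layers are the easiest to peel off.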

An example

Let's imagine a Django monolith, with a Single Page App (SPA) front end that uses a REST-like API, served by the monolith. What does it look like to remove the API?

Success looks like the front end making requests to a new API service, which would be responsible for handling the very first set of concerns like request authentication, and then... Wait, it depends on the other roles. The API service needs to be able to look up a Book or a User, to get lists of Reviews, and do all the other things that the front end needs. That means we need to be able to make internal API requests, as well. We will likely need to build "internal" versions of the original API, but these will have fewer responsibilities—and thus will be more reusable.

This kind of change might take the form of a strangler fig, where we build a new service that does nothing but forward requests from the front end to the monolith, and then slowly move certain responsibilities into the new service. Maybe our steps would be:

  1. Build a new API service as a transparent proxy.
  2. Add session management to the new API service. We'll use JWTs for simplicity, but server-side session storage works, too.
  3. Create a new mechanism for sending the authenticated user context to the internal APIs.
  4. Create more general-purpose internal APIs for the main entities that use the new mechanism.
  5. Compose the more general-purpose API requests in the new API service to serve the front end.
  6. As each higher-level external API endpoint is moved to the new API service, using the general-purpose internal APIs, remove the external API code from the monolith.

This approach allows us to gradually move responsibilities in small chunks of work that fit in our sustaining engineering time, while letting us remove code from the monolith at the same time.
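One way to picture the gradual handover is as a routing decision inside the new API service: a path is either served locally because its endpoint has been moved, or forwarded to the monolith unchanged. The sketch below assumes a hypothetical tuple of migrated path prefixes and hypothetical "api-service" and "monolith" labels; a real service might make this decision through its route table instead.

```python
# Hypothetical: path prefixes whose endpoints have already been moved.
MIGRATED_PREFIXES = ("/api/review/",)


def handled_by(path, migrated_prefixes=MIGRATED_PREFIXES):
    """Return which system should serve this path during the migration."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    if path.startswith(tuple(migrated_prefixes)):
        return "api-service"
    return "monolith"
```

Each time an endpoint moves, its prefix joins the tuple and the matching view can be deleted from the monolith. When nothing falls through to the monolith any more, the proxy route itself can go.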

More concretely, let's say we have a Django monolith and we've decided a few things:

  • Internal APIs will use gRPC.
  • We will send the authenticated user in an auth_user metadata field.
  • We use JWTs as session cookies.
  • We're sticking with Python but will try to build new services in Flask instead of Django, to encourage smaller services.

The first step is to stand up a new api-service written in Flask that proxies the requests transparently—or at least the body and any relevant headers.

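import requests
from flask import Flask, request

app = Flask(__name__)

# requests decodes the body, so these upstream headers no longer describe it
EXCLUDED_HEADERS = {"content-encoding", "content-length", "transfer-encoding", "connection"}

@app.route("/api/<path:subpath>")
def proxy_route(subpath):
    headers = {
        # the request header is "Cookie" (singular)
        "Cookie": request.headers.get("Cookie", ""),
        # there may be other headers we need to copy
    }
    res = requests.get("http://monolith:8000/api/" + subpath, headers=headers)
    # pass the raw body through so non-JSON responses (e.g. error pages) still work
    out_headers = [(k, v) for k, v in res.headers.items() if k.lower() not in EXCLUDED_HEADERS]
    return res.content, res.status_code, out_headers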

In our Django monolith, we'll want to add another process—similar to running a Celery worker—that serves gRPC requests. Those gRPC handlers can use any of the existing Django code they need to.
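On the monolith side, each gRPC handler needs to read the authenticated user back out of the request metadata. Here is a minimal sketch of that step, assuming metadata arrives as a sequence of (key, value) pairs, which is the shape grpc's ServicerContext.invocation_metadata() returns. The helper name auth_user_from_metadata is made up for this example.

```python
def auth_user_from_metadata(metadata, key="auth_user"):
    """Return the value of the auth_user metadata entry, or None if absent."""
    for entry_key, value in metadata:
        if entry_key == key:
            # Values are normally str; decode defensively if bytes arrive.
            return value.decode("utf-8") if isinstance(value, bytes) else value
    return None
```

Inside a servicer method this would be called as auth_user_from_metadata(context.invocation_metadata()). A handler that gets None back can then reject the call, for example with an UNAUTHENTICATED status.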

As a Django view, a "read a single review" route might have looked like:

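from django.contrib.auth.decorators import login_required
from django.http import Http404, JsonResponse
from django.urls import reverse

# Review is the app's existing model; its import path depends on the project

@login_required
def read_review(request, review_id):
    try:
        review = Review.objects.select_related("reviewer", "book").get(pk=review_id)
    except Review.DoesNotExist:
        raise Http404

    return JsonResponse({
        "content": review.content,
        "summary": review.summary or review.content[0:50] + "...",
        "stars": review.stars,
        "book": {
            "title": review.book.title,
            "author": review.book.author,
            # URL parameters go in args; the second positional argument is the urlconf
            "path": reverse("reviews-by-book", args=[review.book.isbn]),
        },
        "reviewer": {
            "display_name": review.reviewer.get_display_name(),
            "path": reverse("reviews-by-reviewer", args=[review.reviewer.id]),
        },
    })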

The new "read a single review" route in Flask might end up like:

import grpc
from flask import g, request

# GetReviewRequest, GetBookRequest, and GetUserRequest come from the generated gRPC modules

@app.route("/api/review/<review_id>", methods=["GET"])
def get_review(review_id):
    auth_user = get_and_validate_user_from_jwt(request)
    # gRPC metadata values must be strings
    metadata = (("auth_user", auth_user),)

    try:
        # calling the stub method directly returns the response message;
        # with_call() would return a (response, call) tuple instead
        review = g.reviewsStub.GetReview(GetReviewRequest(id=review_id), metadata=metadata)
    except grpc.RpcError as error:
        # translate (or obscure) errors to HTTP status codes
        if error.code() == grpc.StatusCode.NOT_FOUND:
            return {"error": "review not found"}, 404
        raise  # don't fall through with review undefined

    # further error handling omitted for brevity

    book = g.booksStub.GetBook(GetBookRequest(id=review.book_id), metadata=metadata)
    reviewer = g.usersStub.GetUser(GetUserRequest(id=review.user_id), metadata=metadata)

    return {
        "content": review.content,
        "summary": review.summary or review.content[0:50] + "...",
        "stars": review.stars,
        "book": {
            "title": book.title,
            "author": book.author,
            # if we have already moved this endpoint, we could use url_for
            "path": f"/api/book/{book.isbn}/reviews/",
        },
        "reviewer": {
            "display_name": reviewer.display_name or reviewer.username,
            # an f-string also works when the ID is an integer
            "path": f"/api/reviewer/{reviewer.id}/reviews/",
        },
    }

Here we can see that logic has migrated from our Django view to our new Flask application: we gather the data we need, handle any errors, calculate fallback values, and generate the correct response shape. The Django monolith's responsibilities got a little bit smaller, and that view code can now be deleted, so the source code gets a little smaller, too.
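The Flask route relies on a get_and_validate_user_from_jwt helper that isn't shown. As a rough idea of what the validation part involves, here is a standard-library sketch for HS256-signed tokens. It makes several assumptions: a shared secret, the user ID stored in the "sub" claim, and a mandatory "exp" claim. The names sign_jwt and validate_jwt are invented for this example, and it works on the raw token string rather than on the Flask request.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _signature(signing_input, secret):
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_jwt(claims, secret):
    """Build an HS256 token; included so the example can be exercised."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = _signature(f"{header}.{payload}", secret)
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def validate_jwt(token, secret, now=None):
    """Return the "sub" claim of a valid, unexpired HS256 token, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
        expected = _signature(f"{header_b64}.{payload_b64}", secret)
    except ValueError:
        return None  # wrong segment count, bad base64, or bad JSON
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":
        return None  # never trust the token to pick a weaker algorithm
    if not hmac.compare_digest(signature, expected):
        return None
    now = time.time() if now is None else now
    exp = claims.get("exp")
    # This sketch requires an expiry, since the token doubles as a session.
    if not isinstance(exp, (int, float)) or exp <= now:
        return None
    return claims.get("sub")
```

In the Flask app, the real helper would pull the token out of the session cookie before doing a check like this. For production use, a maintained library such as PyJWT is a safer choice than hand-rolled verification.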

We do lose the ability to use Django's select_related() to reduce the number of separate database queries, but we gain the ability to move Reviews, Books, and Users into entirely separate applications and databases. (Each query is a single-row indexed lookup so hopefully the cost isn't too high. And technically we could move these to separate databases inside a single Django application, but we'd make the same optimization trade-off.)
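If the extra round trips do become a problem, one possible mitigation is to issue the Book and User lookups concurrently, since neither depends on the other once the Review is loaded. The following is a generic sketch using a thread pool. The fetch_concurrently helper is hypothetical, and whether it helps in practice depends on your gRPC client setup and latency profile.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_concurrently(**lookups):
    """Run zero-argument callables in parallel and return their results by name."""
    if not lookups:
        return {}
    with ThreadPoolExecutor(max_workers=len(lookups)) as pool:
        futures = {name: pool.submit(fn) for name, fn in lookups.items()}
        # result() re-raises any exception raised inside the callable
        return {name: future.result() for name, future in futures.items()}
```

In the Flask route, the two calls could be wrapped in lambdas, e.g. fetch_concurrently(book=lambda: ..., reviewer=lambda: ...), with the response built from the returned dictionary.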