Constrain Your Data Access for Fun and Scale

ActiveRecord-style ORMs can be huge time-savers, but they also blur boundaries within a system, leading to bad assumptions and habits

Photo of railway tracks: "rails" by Tnarik Innael on Flickr.

ActiveRecord-style ORMs, like those found in Django and Rails, can be huge time-savers, but they also blur boundaries within a system, which can lead to bad assumptions and habits, as well as performance, scaling, and testing issues. Introducing constraints on how data is accessed can steer a codebase away from these pitfalls with only a small amount of upfront investment, and improve the maintainability of the software over time.

Logical Data Models, Physical Data Models, and Data Access Layers

ActiveRecord-style ORMs blur the boundary between a logical (or conceptual) data model, the physical implementation of that data model, and the underlying storage mechanism. In these ORMs, the "model" serves as both the data access layer (DAL) and the physical model itself. Most ORMs like this are built for SQL databases (that is, a "relational database management system" or RDBMS, like Postgres, SQLite, MySQL/MariaDB, or MS SQL), and so they make relational concepts, like joins between tables, and relational constraints, like the inability to store list-like attributes, into foundational parts of their APIs.

Leaking SQL Through an Application

ActiveRecord APIs for RDBMSes tend to expose SQL-specific concepts—and often SQL-specific language—as the primary way you interact with the DAL aspect of the models. Let's look at a common example: an Article with Comments, both of which have Authors:

from django.db import models

class Author(models.Model):
    # CharField requires a max_length; 100 is an arbitrary choice
    name = models.CharField(max_length=100)
    email = models.EmailField()

class Article(models.Model):
    title = models.CharField(max_length=200)
    slug = models.SlugField()
    content = models.TextField()
    author = models.ForeignKey(
        Author,
        null=True, blank=True,
        on_delete=models.SET_NULL)

class Comment(models.Model):
    article = models.ForeignKey(Article, on_delete=models.CASCADE)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    content = models.TextField()

If we want to display a list of Articles, including their Author's name and the number of comments, we might do something like:

from django.db.models import Count

# "comment" is the default reverse name for Comment's FK to Article
articles = Article.objects.all() \
    .select_related("author") \
    .annotate(num_comments=Count("comment"))

To accomplish our goal, we have to directly interact with the SQL concepts of the JOIN (which Django calls select_related, and Rails would call #includes or #joins) and aggregation with GROUP BY (via Django's annotate). The DAL is not able to encapsulate these details, because it works by exposing nearly all the general functionality of the underlying SQL database.

Hidden and Inefficient Queries

ActiveRecord ORMs typically expose the ability to fetch more data as-needed, which makes quick iteration and prototyping easier, but can lead to unexpected numbers of queries. This situation is common:

articles = Article.objects.all()

for article in articles:
    print(article.author.name)  # causes a query for each article

This is the "n+1 queries" problem: when iterating over a list of results, we make an additional query for each result. In this contrived example using print(), we can see that the author attribute is accessed and adjust. It's more likely that the additional queries will appear unexpectedly, in distant code like a template:

{% for article in articles %}
  <h2>{{ article.title }}</h2>
  <span>by {{ article.author.name }}</span>
{% endfor %}
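
Once the access is spotted, the usual fix is to pre-fetch the relation when the queryset is built:

# select_related("author") adds a JOIN so each article arrives with its
# author already loaded, turning n+1 queries into one
articles = Article.objects.all().select_related("author")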

Similarly, it is easy to inadvertently introduce inefficient queries. If we add a new "Authors" page to our site, we might do:

authors = Author.objects.all().order_by('name')

Sorting or searching by a non-indexed field typically causes a full table scan, which can have disastrous performance implications for even moderately large tables.
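
One mitigation, if we expect name to be a common sort key, is to index that column (the max_length is an assumption, as before):

class Author(models.Model):
    # db_index=True asks Django to create a database index on this column,
    # letting the sort read from the index instead of scanning the table
    name = models.CharField(max_length=100, db_index=True)
    email = models.EmailField()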

JOINing at Scale

Scaling challenges are a champagne problem—we should all be so lucky as to encounter them. One issue I've seen repeatedly with scaling monolithic Django or Rails-style applications comes directly from the ActiveRecord-style ORM: a heavy reliance on JOIN operations makes it a challenge to spread the load horizontally across multiple databases.

There are a number of approaches to horizontally scaling databases. The big three are:

  1. Use read replicas to take load off of primary DB servers. This is the most friendly to using JOINs (a minimal router sketch follows this list).
  2. Shard tables across multiple DB servers. This is the least friendly to JOINs since the entities may not be able to share shard keys.
  3. Spread different entities across different DB servers. This can support some JOINs, depending on how entities are divided. This often looks like "breaking up" a monolith into microservices.

(These are not mutually exclusive. I've worked on systems that used all three.)
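
To give a flavor of approach 1, Django supports database routers. Here's a minimal sketch, assuming DATABASES defines entries named "default", "replica1", and "replica2":

import random

class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        # any read may be served by either replica
        return random.choice(["replica1", "replica2"])

    def db_for_write(self, model, **hints):
        # all writes go to the primary ("default") database
        return "default"

Registering the router via the DATABASE_ROUTERS setting makes the read/write split transparent to most application code.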

Since ActiveRecord ORMs expose SQL semantics and cheap, even lazy, access to related objects, it's not uncommon to see chained access like:

article = Article.objects.get(id=12)
for comment in article.comment_set.all():
    print("commentor authored {0} articles".format(
        comment.author.article_set.count()))

Lookups like these rely on foreign keys and referential integrity. Moving Articles, Comments, or Authors into their own database (or service) means breaking those links and replacing the lookups everywhere they occur (especially if we want to continue avoiding the n+1 problem).

Which Models Even Matter?

Let's add support for multiple email addresses per Author (a change to the logical data model):

class Author(models.Model):
    name = models.CharField(max_length=100)

class AuthorEmail(models.Model):
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    email = models.EmailField()

Since SQL doesn't do a good job of supporting list attributes, we need to add another database model to the physical data model. But this new database model isn't nested under the Author, or clearly related except through naming. AuthorEmail appears to be the same "level" of entity as Author or Article. Over time, there can be more of these implementation-detail models (list attributes, enum tables, many-to-many tables, etc.) than actual entities! Maintaining a correct mental data model becomes more and more difficult, especially for new team members.

(Yes, there are alternatives. They may be better in some situations.)
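
One such alternative, if you're on Postgres, is to keep the list on the Author itself with an array column, which Django exposes as ArrayField:

from django.contrib.postgres.fields import ArrayField
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)
    # stores all addresses directly on the Author row, at the cost of
    # tying the schema to Postgres
    emails = ArrayField(models.EmailField(), default=list)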

Instead, Constrain Data Access Patterns

Another path is to constrain data access. ActiveRecord-style ORMs work by exposing their full, general set of functionality, but we can introduce constraints that help break—or preferably avoid—the reliance on it.

Instead of views or other application logic directly accessing the SQL semantics through the ORM query methods, we can create a separate data access layer (DAL):

from typing import List

def list_articles(page=1, per_page=20) -> List[Article]:
    start = (page - 1) * per_page
    end = start + per_page
    articles = Article.objects.all() \
        .select_related('author') \
        .order_by('-publish_date')
    return list(articles[start:end])

By introducing this path, we are limiting the types of questions we can easily answer about the set of Articles, while ensuring that the answers are consistent and efficient. For other operations, we have the option to introduce things like caching:

from django.core.cache import cache

def get_article(slug:str) -> Article:
    key = "article-{0}".format(slug)
    obj = cache.get(key)
    if obj is not None:
        return obj
    obj = Article.objects.get(slug=slug)
    cache.set(key, obj)  # populate the cache for subsequent lookups
    return obj

We've restricted article lookups to using only the slug—i.e. we don't have the flexibility to do Article.objects.get(title="some title"). In exchange, we've gained a natural key (and a non-enumerable one) to use for caching and other lookups.

In order to avoid the n+1 problem, we'll need to approach the idea of "joins" differently. Let's assume there's a method like list_authors(page:int, per_page:int, authors:Optional[List[int]]) -> List[Author]:

articles = list_articles(page=3, per_page=10)
author_ids = list(set([art.author_id for art in articles]))
authors = {aut.id: aut for aut in list_authors(authors=author_ids)}

for article in articles:
    print("{0} by {1}".format(
        article.title,
        authors[article.author_id].name,
    ))
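
A minimal sketch of list_authors itself, following the same pattern as list_articles (the defaults here are assumptions):

from typing import List, Optional

def list_authors(page=1, per_page=20,
                 authors: Optional[List[int]] = None) -> List[Author]:
    qs = Author.objects.all().order_by("name")
    if authors is not None:
        # constrain the listing to a specific set of author ids
        qs = qs.filter(id__in=authors)
    start = (page - 1) * per_page
    return list(qs[start:start + per_page])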

This requires two queries (or service calls) instead of n+1, so while it may (or may not) be slightly slower than the JOIN, its cost is consistent and doesn't grow with the number of results. If some articles share authors, this approach actually loads less data into our application, since we can deduplicate the authors.

There is a subtle bug in that example: what if we don't find the author? This approach can help us front-load questions about failure cases. We may choose to ignore that case for now, or we may provide a fallback:

print("{0} by {1}".format(
    article.title,
    authors.get(article.author_id, Author(name="Unknown")).name,
))

Since these restricted APIs don't rely on foreign keys or referential integrity, it also becomes easier to move an entity to another database or service. For example, we can put Comments into their own system, using the same "natural key" for Articles:

import requests

def list_comments(slug:str, page=1, per_page=15) -> List[Comment]:
    fetch_url = "http://comments-service/article/{}?page={}&per_page={}"
    response = requests.get(fetch_url.format(slug, page, per_page))
    raw_comments = response.json()["comments"]
    return [Comment(**c) for c in raw_comments]

Build for Tradeoffs

This approach, as with everything, comes with both benefits and costs. It requires some upfront work to model the domain, and to decide which kinds of questions you want to be easier to answer than others.

Some design questions this will raise include:

  • Is there a single key or identifier (ideally one that isn't an autoincrementing integer) we can use to reference this entity?
  • How would we represent the relationship between these two entities if we couldn't use foreign keys?
  • What are the patterns of data access we want to support? (read heavy? write heavy? list with a single order, or with multiple orders? act on a single entity or collections of them?)

In my experience, these tend to be straightforward questions to answer, and there isn't often a wide range of access patterns or lookups we actually need to support. When there are edge cases, we can usually get away with less-optimized implementations. (And if those edge cases become common, we can optimize them.)

# get the author's latest article
articles = list_articles(
    order_by="-publish_date",
    authors=[author.id],
    per_page=1,
)
latest = articles[0]

As an aside, adding support for filtering and sorting to "list" methods has always been worth it, in my experience. Usually you can get away with supporting only a few attributes for each, like publish_date for sorting Articles and author for filtering them. And if you need to add new parameters later, you can.

def list_articles(
    order_by: Optional[str] = None,
    authors: Optional[List[int]] = None,
    categories: Optional[List[str]] = None,
    year: Optional[int] = None,
    page: int = 1,
    per_page: int = 20,
) -> List[Article]:
    ...