Database - Coffee on the Keyboard

Constrain Your Data Access for Fun and Scale

James Socol — Wed, 23 Aug 2023 17:21:32 GMT

ActiveRecord-style ORMs, like those found in Django and Rails, can be huge time-savers, but also blur boundaries within a system, which can lead to bad assumptions and habits, as well as performance, scaling, and testing issues. Introducing constraints on how data is accessed can steer away from potential pitfalls with only a small amount of upfront investment, and improve the maintainability of software over time.

Logical Data Models, Physical Data Models, and Data Access Layers

ActiveRecord-style ORMs blur the boundary between a logical (or conceptual) data model, the physical implementation of that data model, and the underlying storage mechanism. In these ORMs, the "model" serves as both the data access layer (DAL) and the physical model itself. Most ORMs like this are built for SQL databases, that is, for a "relational database management system" or RDBMS, like Postgres, SQLite, MySQL/MariaDB, or MS SQL. As a result, they make relational concepts, like joins between tables, and relational constraints, like not allowing list-like attributes, into foundational parts of their APIs.

Leaking SQL Through an Application

ActiveRecord APIs for RDBMSes tend to expose SQL-specific concepts, and often SQL-specific language, as the primary way you interact with the DAL aspect of the models. Let's look at a common example: an Article with Comments, both of which have Authors:

from django.db import models

class Author(models.Model):
    name = models.CharField()
    email = models.EmailField()

class Article(models.Model):
    title = models.CharField()
    slug = models.SlugField()
    content = models.TextField()
    author = models.ForeignKey(
        Author,
        null=True, blank=True,
        on_delete=models.SET_NULL)

class Comment(models.Model):
    article = models.ForeignKey(Article, on_delete=models.CASCADE)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    content = models.TextField()

If we want to display a list of Articles, including their Author's name and the number of comments, we might do something like:

from django.db.models import Count

# "comment" is the default reverse name for Comment.article
articles = Article.objects.all() \
    .select_related("author") \
    .annotate(num_comments=Count("comment"))

To accomplish our goal, we have to directly interact with the SQL concepts of the JOIN (which Django calls select_related, and Rails would call #includes or #joins) and aggregation (via Django's annotate, which here adds a second JOIN and a GROUP BY). The DAL is not able to encapsulate these details, because it works by exposing nearly all the general functionality of the underlying SQL database.

Hidden and Inefficient Queries

ActiveRecord ORMs typically expose the ability to fetch more data as-needed, which makes quick iteration and prototyping easier, but can lead to unexpected numbers of queries. This situation is common:

articles = Article.objects.all()

for article in articles:
    print(article.author.name)  # causes a query for each article

This is the "n+1 queries" problem: when iterating over a list of results, we make an additional query for each result. In this contrived example using print(), we can see that the author attribute is accessed and adjust. It's more likely that the additional queries will appear unexpectedly, in distant code like a template:

{% for article in articles %}
  {{ article.title }}
  by {{ article.author.name }}
{% endfor %}

Similarly, it is easy to inadvertently introduce inefficient queries. If we add a new "Authors" page to our site, we might do:

authors = Author.objects.all().order_by('name')

Sorting or searching by a non-indexed field typically causes a full table scan, which can have disastrous performance implications for even moderately large tables.
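To make the scan visible, here is a minimal sketch that asks a database for its query plan before and after adding an index. It uses an in-memory SQLite database as a stand-in (an assumption; other engines report plans differently), and the table and index names are made up for illustration. In Django, the equivalent fix would be adding db_index=True to the field.

```python
import sqlite3

# In-memory SQLite stands in for the production database (an assumption
# for illustration; other engines report their plans differently).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")


def query_plan(sql, params=()):
    """Return SQLite's plan descriptions (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


lookup = "SELECT id, name FROM author WHERE name = ?"

plan_without_index = query_plan(lookup, ("Ada",))
conn.execute("CREATE INDEX author_name_idx ON author (name)")
plan_with_index = query_plan(lookup, ("Ada",))

print(plan_without_index)  # a SCAN step: every row is examined
print(plan_with_index)     # a SEARCH step that uses author_name_idx
```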

`JOIN`ing at Scale

Scaling challenges are a champagne problem—we should all be so lucky as to encounter them. One issue I've seen repeatedly with scaling monolithic Django or Rails-style applications comes directly from the ActiveRecord-style ORM: a heavy reliance on JOIN operations makes it a challenge to spread the load horizontally across multiple databases.

There are a number of approaches to horizontally scaling databases. The big three are:

Use read replicas to take load off of primary DB servers. This is the most friendly to using JOINs.
Shard tables across multiple DB servers. This is the least friendly to JOINs since the entities may not be able to share shard keys.
Spread different entities across different DB servers. This can support some JOINs, depending on how entities are divided. This often looks like "breaking up" a monolith into microservices.

(These are not mutually exclusive. I've worked on systems that used all three.)

Since ActiveRecord ORMs expose SQL semantics and cheap, even lazy, access to related objects, it's not uncommon to see chained access like:

article = Article.objects.get(id=12)
for comment in article.comment_set.all():
    print("commentor authored {0} articles".format(
        comment.author.article_set.count()))

Lookups like these rely on foreign keys and referential integrity. Moving Articles, Comments, or Authors into their own database (or service) means breaking those links and replacing the lookups everywhere they occur (especially if we want to continue avoiding the n+1 problem).

Which Models Even Matter?

Let's add support for multiple email addresses per Author (a change to the logical data model):

class Author(models.Model):
    name = models.TextField()

class AuthorEmail(models.Model):
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    email = models.EmailField()

Since SQL doesn't do a good job of supporting list attributes, we need to add another database model to the physical data model. But this new database model isn't nested under the Author, or clearly related except through naming. AuthorEmail seems to be the same "level" of entity as Author or Article. Over time, there can be more of these implementation detail models (list attributes, enum tables, many-to-many tables, etc) than the actual entities! Maintaining a correct mental data model becomes more and more difficult, especially for new team members.

(Yes, there are alternatives. They may be better in some situations.)

Instead, Constrain Data Access Patterns

Another path is to constrain data access. ActiveRecord-style ORMs work by exposing their full, general set of functionality, but we can introduce constraints that help break—or preferably avoid—the reliance on it.

Instead of views or other application logic directly accessing the SQL semantics through the ORM query methods, we can create a separate data access layer (DAL):

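from typing import List

def list_articles(page=1, per_page=20) -> List[Article]:
    start = (page - 1) * per_page
    end = start + per_page
    # assumes Article also has a publish_date field to sort on
    articles = Article.objects.all() \
        .select_related('author') \
        .order_by('-publish_date')
    # slicing the queryset applies LIMIT/OFFSET in the database
    return list(articles[start:end])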

By introducing this path, we are limiting the types of questions we can easily answer about the set of Articles, while ensuring that the answers are consistent and efficient. For other operations, we have the option to introduce things like caching:

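from django.core.cache import cache

def get_article(slug: str) -> Article:
    key = "article-{0}".format(slug)
    obj = cache.get(key)
    if obj is None:
        # cache miss: load from the database and populate the cache
        obj = Article.objects.get(slug=slug)
        cache.set(key, obj)
    return obj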

We've restricted article lookups to using only the slug—i.e. we don't have the flexibility to do Article.objects.get(title="some title"). In exchange, we've gained a natural key (and a non-enumerable one) to use for caching and other lookups.

In order to avoid the n+1 problem, we'll need to approach the idea of "joins" differently. Let's assume there's a method like list_authors(page:int, per_page:int, authors:Optional[List[int]]) -> List[Author]:

articles = list_articles(page=3, per_page=10)
author_ids = list(set([art.author_id for art in articles]))
authors = {aut.id: aut for aut in list_authors(authors=author_ids)}

for article in articles:
    print("{0} by {1}".format(
        article.title,
        authors[article.author_id].name,
    ))

This requires two queries (or service calls) instead of n+1. It may (or may not) be slightly slower than the JOIN, but the number of round trips is constant rather than O(n), so performance is generally consistent. If some articles share authors, this will actually result in less data to load into our application, since we can deduplicate the authors.

There is a subtle bug in that example: what if we don't find the author? This approach can help us front-load questions about failure cases. We may choose to ignore that case for now, or we may provide a fallback:

print("{0} by {1}".format(
    article.title,
    authors.get(article.author_id, Author(name="Unknown")).name,
))
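The pieces above can be put together in a self-contained sketch. Everything here is a stand-in made up for illustration: the dataclasses replace the Django models, the module-level lists replace the database, and the calls counter exists only to show that the number of DAL calls stays at two however many articles are listed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Author:
    id: int
    name: str


@dataclass
class Article:
    title: str
    author_id: Optional[int]


# Hypothetical in-memory stand-ins for the database tables.
_AUTHORS = {1: Author(1, "Ada"), 2: Author(2, "Grace")}
_ARTICLES = [
    Article("On Engines", 1),
    Article("On Compilers", 2),
    Article("More Engines", 1),
    Article("Orphaned Post", 99),  # the author record is missing
]

calls = {"count": 0}  # counts DAL calls, i.e. queries or service requests


def list_articles() -> List[Article]:
    calls["count"] += 1
    return list(_ARTICLES)


def list_authors(authors: List[int]) -> List[Author]:
    calls["count"] += 1
    return [_AUTHORS[i] for i in authors if i in _AUTHORS]


def articles_with_bylines() -> List[str]:
    articles = list_articles()
    # Deduplicate IDs so each author is fetched once; skip articles
    # that have no author at all.
    author_ids = sorted({a.author_id for a in articles if a.author_id is not None})
    by_id: Dict[int, Author] = {a.id: a for a in list_authors(author_ids)}
    unknown = Author(id=0, name="Unknown")  # fallback for missing authors
    return [
        "{0} by {1}".format(a.title, by_id.get(a.author_id, unknown).name)
        for a in articles
    ]


bylines = articles_with_bylines()
print(bylines)
print(calls["count"])  # 2, however many articles there are
```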

Since these restricted APIs don't rely on foreign keys or referential integrity, it also becomes easier to move an entity to another database or service. For example, we can put Comments into their own system, using the same "natural key" for Articles:

import requests

def list_comments(slug: str, page=1, per_page=15) -> List[Comment]:
    fetch_url = "http://comments-service/article/{}?page={}&per_page={}"
    response = requests.get(fetch_url.format(slug, page, per_page))
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    raw_comments = response.json()["comments"]
    return [Comment(**c) for c in raw_comments]

Build for Tradeoffs

This approach, as with everything, comes with both benefits and costs. It requires some upfront work to model the domain and to decide which kinds of questions you want to make easier to answer than others.

Some design questions this will raise include:

Is there a single key or identifier (ideally one that isn't an autoincrementing integer) we can use to reference this entity?
How would we represent the relationship between these two entities if we couldn't use foreign keys?
What are the patterns of data access we want to support? (read heavy? write heavy? list with a single order, or with multiple orders? act on a single entity or collections of them?)

In my experience, these tend to be straightforward questions to answer, and there isn't often a wide range of access patterns or lookups we actually need to support. When there are edge cases, we can usually get away with less-optimized implementations. (And if those edge cases become common, we can optimize them.)

# get the author's latest article
articles = list_articles(
    order_by="-publish_date",
    authors=[author.id],
    per_page=1,
)
latest = articles[0] if articles else None  # the author may have no articles

As an aside, adding support for things like filtering and sorting to "list" methods has always been worth it, to me. Usually you can get away with only supporting a few attributes in each, like publish_date for sorting Articles and Author for filtering them. And if you need to add new parameters later, you can.

from typing import List, Optional

def list_articles(
    order_by: Optional[str] = None,
    authors: Optional[List[int]] = None,
    categories: Optional[List[str]] = None,
    year: Optional[int] = None,
    page: int = 1,
    per_page: int = 2,
) -> List[Article]:
    ...  # build and return the filtered, sorted page of Articles
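One way such a "list" method might support only a few sort keys and filters is sketched below. This is an illustration under stated assumptions, not a definitive implementation: the dataclass and the in-memory _ARTICLES list stand in for the model and the database, and the _ALLOWED_ORDERINGS whitelist is a hypothetical name.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Article:
    slug: str
    author_id: int
    publish_date: date


# Hypothetical in-memory stand-in for the articles table.
_ARTICLES = [
    Article("first", 1, date(2023, 1, 5)),
    Article("second", 2, date(2023, 3, 9)),
    Article("third", 1, date(2023, 6, 1)),
]

# Only these sort keys are supported; anything else is rejected.
_ALLOWED_ORDERINGS = {"publish_date", "-publish_date"}


def list_articles(
    order_by: Optional[str] = None,
    authors: Optional[List[int]] = None,
    page: int = 1,
    per_page: int = 20,
) -> List[Article]:
    if order_by is not None and order_by not in _ALLOWED_ORDERINGS:
        raise ValueError("unsupported ordering: {0}".format(order_by))
    results = [a for a in _ARTICLES if authors is None or a.author_id in authors]
    if order_by is not None:
        field = order_by.lstrip("-")
        descending = order_by.startswith("-")
        results.sort(key=lambda a: getattr(a, field), reverse=descending)
    start = (page - 1) * per_page
    return results[start:start + per_page]


latest = list_articles(order_by="-publish_date", authors=[1], per_page=1)
print(latest[0].slug)
```

Rejecting unknown orderings keeps the set of supported access patterns explicit, so a new sort key becomes a deliberate decision (and a prompt to think about an index) rather than something a caller can slip in.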

Conceptual, Logical and Physical Data Models

James Socol — Mon, 21 Aug 2023 16:39:29 GMT

The vocabulary of conceptual, logical, and physical data models has proven to be one of the most valuable things I've learned in the past few years. Understanding and using the distinction between these layers or steps in data modeling helps design better—more scalable, more comprehensible, more maintainable—applications.

Conceptual Data Models

"Conceptual" data models are the highest layer, and the best place to start. These describe the major entities in a system and their relationships to each other, without worrying about the details of each one. Think of a "boxes and arrows" diagram—that's the level of detail at this layer.

Let's look at an example of something similar to GitHub or Trello: a tool that helps a user manage several "projects," and optionally create a public profile page that may feature various projects. A first pass at our conceptual data model may end up looking like this:

A conceptual diagram showing relationships between Users, Projects, and Profiles.

There are a few things to note in this diagram:

There is no detail about any kind of database tables or foreign keys.
There isn't even detail about what is included in a Project or a Profile.
The relationships are described with verbs other than "to have" or "belong to." As a user, I don't merely possess a project, I manage it. I don't merely have a profile, I create it.
The cardinalities of the various entities are established: many projects per user—represented by lines ending in dots rather than arrows—but only one profile per user.

Creating or updating a conceptual data model is a cross-functional task—product, engineering, design, data, and even subject-matter experts should all contribute at this level. This is an opportunity to employ ubiquitous language and ensure that everyone has a shared understanding of what these entities represent. Changes at the conceptual level—e.g. a change to the cardinality of a relationship, or creating a new entity—are possible but should not be undertaken lightly, since they are likely to have significant cascading effects on the rest of the software system.

Logical Data Models

"Logical" data models are the next layer down, and the place where we start to flesh out what our models will actually look like. We are still, at this layer, not writing code yet—and we will see why in the next section.

Logical data models tell us what attributes actually exist for a given entity. For example, we can look more closely at the "Project" model. What makes up a project?

The project name
A description
A relationship to a managing user
- Note that this is not a "foreign key." Only SQL databases have those.
Possibly a due date
Percent complete
Tasks

We can also look at what makes up a user:

A unique username
One or more email addresses
One or more authentication methods, like a password or OAuth connection

Logical data models provide an important point of agreement between different functional groups, like a product engineering team and a data team. Changes at this level are likely to have at least some cascading effects. As long as stakeholders agree on this "shape," the risk of those effects causing problems is low.

This is also where we can start asking more nuanced questions. Do we need more than one email address? Are "Tasks" really their own entity? If we want to make changes, for example to add premium features, is billing information part of the user, or the project? Do some projects need to be designated as "premium" or does the user become premium? Decisions like this will potentially be difficult to reverse, and may constrain product options in the future, so they should be made cross-functionally and with care. Most importantly, though, they should be made consistently. If one part of the system builds billing into the user while another part of the system—possibly managed by another team—assumes that billing should be per-project, there may be irreconcilable design choices.

Physical Data Models

Physical data models are the actual implementation. At this layer, we have actual code, actual database tables, actual protobuf messages. There are almost always multiple physical data models for any logical model: a database model and a CRUD API are different implementations of the same logical model.

Looking at our logical model of a user, we can imagine how it might look in a JSON API response:

{
  "username": "jsocol",
  "emails": ["jsocol@example.com"],
  "authMethods": [{ "kind": "password", "hash": "$2b$12$BRrjrm2AJof..." }]
}

Or we can implement it in a Django model:

from django.db import models

class User(models.Model):
    username = models.CharField(max_length=127, unique=True)
    password = models.TextField(null=True, blank=True, default=None)

class UserEmail(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    email = models.EmailField(unique=True)

class OAuthConnection(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    # ... etc ...

Or in protobuf for gRPC:

message User {
  string name = 1;
  repeated string emails = 2;
  repeated AuthMethod auth_methods = 3;
}

message AuthMethod {
  enum AuthKind {
    AUTH_KIND_UNSPECIFIED = 0;
    PASSWORD = 1;
    OAUTH2 = 2;
  }

  AuthKind kind = 1;
  // ... etc ...
}

We can see here how different physical representations, with different constraints, can vary. JSON and protobuf allow repetition, through arrays and repeated fields, respectively. SQL databases, however, typically do not allow an arbitrary number of values for one column, so instead we create a new database model, UserEmail. On the other hand, Django has an EmailField type that is more specific than using text or a string, while JSON and protobuf do not. In the database, the various authentication methods are implemented differently: password exists as a nullable database column on the User model, while OAuth connections are stored in their own table. Protobuf can use enum or oneof types to distinguish different methods, while JSON cannot.

Different affordances, constraints, and needs or priorities will lead to different physical models, often within the same parts of the system. And that's OK! As long as they are all representing the same logical data model, you'll always be able to map between the different physical models when you need to.
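As a rough illustration of that mapping, the sketch below builds one logical User from rows shaped like the Django tables above and renders it in the shape of the JSON API response. The function names user_from_rows and user_to_api are hypothetical, and plain dicts stand in for database rows.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class User:
    """The logical model: a username, emails, and authentication methods."""
    username: str
    emails: List[str] = field(default_factory=list)
    auth_methods: List[Dict[str, str]] = field(default_factory=list)


def user_from_rows(user_row: dict, email_rows: List[dict]) -> User:
    """Build a User from rows shaped like the User and UserEmail tables."""
    methods = []
    if user_row.get("password"):
        # The nullable password column becomes one entry in the list.
        methods.append({"kind": "password", "hash": user_row["password"]})
    return User(
        username=user_row["username"],
        emails=[row["email"] for row in email_rows],
        auth_methods=methods,
    )


def user_to_api(user: User) -> dict:
    """Render a User in the shape of the JSON API response."""
    return {
        "username": user.username,
        "emails": list(user.emails),
        "authMethods": list(user.auth_methods),
    }


user = user_from_rows(
    {"id": 1, "username": "jsocol", "password": "$2b$12$BRrjrm2AJof..."},
    [{"user_id": 1, "email": "jsocol@example.com"}],
)
api_payload = user_to_api(user)
print(api_payload)
```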

When am I Going to Use This?

Engineers, in my experience, tend to start with the physical data model. We use diagramming tools specifically designed for SQL databases. And if you're working on a solo project, that's probably fine.

Teams, however—and especially teams with different disciplines—will benefit from proactive data modeling with these layers in several ways:

Consistent vocabulary for different conceptual entities will avoid miscommunications across teams or functions.
Non-technical team members have an easier time engaging with the less detailed conceptual and logical models.
Early agreement on conceptual and logical data models makes it easier for teams or engineers to work independently—they provide the "alignment" part of "aligned autonomy."
Different physical models of the same logical model are much more likely to be interoperable in the future.
Changes to conceptual or logical data models are easier to contextualize, making them easier to implement and their consequences more obvious.
Major entities (e.g. "User") are easier to distinguish from implementation details (e.g. UserEmail or AuthMethod).

Conceptual and logical data modeling need not be a huge effort. Conceptual data modeling can involve a meeting to create the model together with subject-matter experts, product, etc, or it can be a quick sketch of the major entities and an asynchronous gut check. Logical data modeling can involve enumerating all the attributes of every entity up-front, or it can happen just-in-time, which is a much smaller chunk of work, and something we often do tacitly anyway. And physical data modeling is the work of writing software that we're already doing.

An Object Caching Pattern for Django

James Socol — Thu, 07 May 2015 08:51:39 GMT

Increasingly I’ve been treating even RDBMSes like structured key-value stores. There are still foreign keys and relationships in there, but the access patterns are most commonly by some kind of “primary” key (not always the primary key on the table, but a natural one).

Normally when I do something in more than two projects I’ll put it into a library, but for once this honestly feels too small, so instead, here’s a blog post and a gist.

This makes object caching quick to implement and very effective. Here’s a pattern I’ve been using in Django models:
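The sketch below is a minimal, framework-free illustration of this kind of pattern, not the original gist. The class name MyModel comes from the lookup example; the dict standing in for Django's cache, the in-memory _rows table, the db_reads counter, and the "__miss__" sentinel are all assumptions made so the example can run on its own.

```python
from typing import Dict, Optional

_cache: Dict[str, object] = {}  # stand-in for django.core.cache.cache
_MISS = "__miss__"              # sentinel cached for keys that do not exist


class MyModel:
    """Plain-Python stand-in for a Django model with a natural key."""

    _rows: Dict[str, "MyModel"] = {}  # stand-in for the database table
    db_reads = 0                      # counts simulated database queries

    def __init__(self, key: str, title: str) -> None:
        self.key = key
        self.title = title

    @staticmethod
    def _cache_key(key: str) -> str:
        return "mymodel:{0}".format(key)

    @classmethod
    def get(cls, key: str) -> Optional["MyModel"]:
        """Look up by natural key, caching both hits and misses."""
        cached = _cache.get(cls._cache_key(key))
        if cached == _MISS:
            return None
        if cached is not None:
            return cached
        cls.db_reads += 1
        obj = cls._rows.get(key)
        _cache[cls._cache_key(key)] = _MISS if obj is None else obj
        return obj

    def flush(self) -> None:
        """Invalidate the cached entry so the next get() reloads it."""
        _cache.pop(self._cache_key(self.key), None)

    def save(self) -> None:
        MyModel._rows[self.key] = self
        self.flush()

    def delete(self) -> None:
        MyModel._rows.pop(self.key, None)
        self.flush()


MyModel("hello", "Hello").save()
MyModel.get("hello"); MyModel.get("hello")  # one database read, one cache hit
MyModel.get("nope"); MyModel.get("nope")    # the miss is cached as well
reads = MyModel.db_reads
print(reads)
```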

Looking up an object looks like:

obj = MyModel.get(some_key)

Advantages of this pattern:

Straightforward to implement, can be factored into a mixin without much work.
Caches non-existent entries (“misses”).
Very high hit rate in many common cases.
Low risk of caching stale data.
No signals or other spooky action at a distance.
Easy to mock get() in tests.

Disadvantages:

Subject to thundering herd when the read rate is too high or there are hot-spots. This can be partially alleviated by updating save() and delete() to write to the cache, too, but that increases the probability of caching stale data.
No support for querysets or lists (intentional, as these are notoriously difficult to cache and invalidate correctly).
Can’t use queryset update() or delete() methods.

This works well when most read access is by the same natural key. You could extend it to support multiple keys—e.g. a name and an integer ID, by defining methods get_by_name(cls, name) and get_by_id(cls, pk), or similar, and then in flush, generating all the keys and using cache.delete_many. It works badly when most access is via related managers, e.g. my_obj.something_set.all().

The same pattern absolutely works outside of the Django ORM, but the specifics depend on how you’re accessing your DB. Personally, I like accessor functions that return dictionaries (e.g. get_some_object(key)).

Update, 7 May

Jannis pointed out, correctly, that this does introduce another call that can fail and thus it has implications for database transactions.

When I use this pattern, I typically enable atomic requests. Writes often cause side effects that need to propagate through various systems, so there’s usually more than one call that can fail. For the use cases I have today, atomic requests are enough. For others, more fine-grained transaction management is necessary.

Django Fixtures with Circular Foreign Keys

James Socol — Wed, 29 Sep 2010 16:50:00 GMT

If you create a nice, perfectly normalized database, you (probably) won’t ever run into circular foreign keys (when a row in table A references a row in table B that references the same row in table A).

In the real world, this happens pretty regularly. The most common situation is a “current” or “last” denormalization. You don’t really want to do a subquery with a sort every time you want to know the latest post in a forum thread, or current revision of a wiki page.

The problem—one we’ve been dealing with since we decided to rebuild SUMO—is that trying to load data with circular foreign keys produces a “chicken and the egg” situation: since each row depends on the other, neither can be loaded first.

(This is part of a bigger problem with MySQL, which is that it lacks deferred foreign key checks.)

The solution to this is to temporarily disable foreign key checks while you load in data. It’s not hard, but Django is so far unwilling to do it.

Well, now we get the chance to see if their concerns are realistic: with the latest commit to Jeff Balogh’s test-utils package for Django, we’re disabling foreign key checks during fixture loading.
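The idea can be sketched with SQLite from Python's standard library. This is a stand-in, not the test-utils patch itself: the thread and post tables are made up for illustration, and SQLite's PRAGMA foreign_keys plays the role that SET FOREIGN_KEY_CHECKS plays in MySQL.

```python
import sqlite3

# SQLite stands in for MySQL here; the MySQL counterpart of the PRAGMA
# below is SET FOREIGN_KEY_CHECKS = 0 (and = 1 to re-enable).
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE thread ("
    " id INTEGER PRIMARY KEY,"
    " last_post_id INTEGER NOT NULL REFERENCES post (id))"
)
conn.execute(
    "CREATE TABLE post ("
    " id INTEGER PRIMARY KEY,"
    " thread_id INTEGER NOT NULL REFERENCES thread (id))"
)

# With checks on, neither row can be loaded first.
try:
    conn.execute("INSERT INTO thread (id, last_post_id) VALUES (1, 1)")
    naive_load_failed = False
except sqlite3.IntegrityError:
    naive_load_failed = True

# Disable checks, load the "fixture", then turn checks back on.
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("INSERT INTO thread (id, last_post_id) VALUES (1, 1)")
conn.execute("INSERT INTO post (id, thread_id) VALUES (1, 1)")
conn.execute("PRAGMA foreign_keys = ON")

# foreign_key_check lists rows that violate a constraint; empty means OK.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(naive_load_failed, violations)
```

Running a consistency check after re-enabling the constraints is a cheap way to confirm that the loaded data is valid as a whole, even though no single insertion order would have been accepted.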

Both SUMO and AMO have had to do some acrobatic hackery to get around the limit. This solution is definitely a filthy hack, but it’s contained in a single, small place, rather than spread throughout test cases in multiple projects.

Suggestions for improving this hideous monkey patch are welcome, but in the meantime I’ll be removing the gross parts from Kitsune that we needed to work around this.

Developing at Scale: Database Replication

James Socol — Thu, 17 Jun 2010 11:10:15 GMT

When a website is small (like this one, for example), usually the entire thing, from the web server to the database, can live on a single server. Even a single virtual server. One of the first things that happens when a website gets bigger is that this is no longer true.

One reason is load. A popular website will simply require more than a single server, virtual or otherwise, can give, and the only way to keep scaling is to add more servers. For example, the server may run out of available Apache connections, and the number cannot be raised without negatively impacting performance.

Another reason is downtime. If a website is served from a single server, and that server goes down for any reason, planned or otherwise, then the website is down. At some point, downtime is essentially unacceptable—just ask Twitter—and redundancy is required.

Enter Replication

A common response is to set up database replication, where one database server operates as a “master,” and one or more other servers operate as “slaves.” In this setup, all of your writes to the database will go to the master, then “replicate” to the slaves, and all or most of the reads will come from the slaves. (Note that the slaves are doing all of the writes as well as all of the reads: slaves are not a good place to recycle sub-par hardware.)

Replication introduces a new type of problem: if you naively send all reads to the slaves then data you just wrote will not be there.

La…wait for it…g

Even if the master and slave are sitting next to each other with a cable connecting them, replication will probably take more time than your code does to reach the next step. At a minimum, you need to assume that replication lag will be hundreds of milliseconds—an eternity when the time from one line in your web app to the next is measured in micro- or nanoseconds. In reality, replication in the real world may well take seconds, especially if your master and slaves are not physically next to each other.

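The result is that the guarantees you expect from an ACID database are essentially broken once reads go to a different server. The write itself is durable on the master, but you cannot simply write data and immediately rely on reading it back from a slave.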

For example, say you have a large discussion forum. If you naively send all reads to the slaves, then someone’s post may take seconds to appear on the site. This is a problem if you’re trying to show a user their post immediately after posting it.

Smarter Reading

The solution is to occasionally read from the master. When you need to access data that was just written, it is probably only available on the master, so that’s where you’ll read it. Within a single HTTP request, this is fairly simple: just force any queries that rely on recently-written data to the master.

Outside of a single HTTP request, this is slightly more complex. If you’re following the practice of redirecting after a POST request to a GET request (which you should) then creating a new forum post and viewing it will be on two different HTTP requests.

One way around this is to set a very short-lived cookie that tells your web app to continue reading from the master. If any write occurs in a request, the response should include this cookie. The exact time-to-live will depend on how long your replication lag usually is—cover at least 4 or 5 standard deviations. Any request that has this cookie should honor it by reading only from the master.
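A minimal sketch of that routing logic follows. The cookie name, the function names, and the rule that any non-GET/HEAD/OPTIONS request counts as a write are all assumptions for illustration; a real application would hook this into its framework's request handling and database routing.

```python
import math
import statistics
from typing import Dict, List

PIN_COOKIE = "read_from_master"  # hypothetical cookie name


def pin_ttl_seconds(lag_samples: List[float], deviations: float = 5.0) -> int:
    """Cover the mean replication lag plus several standard deviations."""
    lag = statistics.mean(lag_samples) + deviations * statistics.stdev(lag_samples)
    return math.ceil(lag)  # cookie lifetimes are whole seconds


def choose_database(method: str, cookies: Dict[str, str]) -> str:
    """Send writes and pinned requests to the master, the rest to a slave."""
    if method not in ("GET", "HEAD", "OPTIONS"):
        return "master"  # assumed: any other method may write
    if PIN_COOKIE in cookies:
        return "master"  # a recent write happened; slaves may be stale
    return "slave"


def response_cookies(wrote: bool, ttl: int) -> Dict[str, Dict[str, object]]:
    """After any write, ask the browser to keep reading from the master."""
    if not wrote:
        return {}
    return {PIN_COOKIE: {"value": "1", "max_age": ttl}}


ttl = pin_ttl_seconds([0.2, 0.4, 0.3, 0.5, 0.6])  # measured lag, in seconds
post = choose_database("POST", {})
cookies = response_cookies(wrote=True, ttl=ttl)
redirected_get = choose_database("GET", {PIN_COOKIE: "1"})  # cookie sent back
later_get = choose_database("GET", {})                      # cookie has expired
print(ttl, post, redirected_get, later_get)
```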

A Pitch

One of the hardest things for new web developers is developing large-scale applications: first, you need a large-scale application! Setting up database replication is a huge pain, and if your site isn’t getting enough traffic, it’s not worth it.

Mozilla is one way aspiring web developers can get some experience working with large-scale web apps. All of our web apps are open source and open to contributions from community members. To get involved, stop by #webdev in IRC!

Surviving Pac Man

James Socol — Mon, 24 May 2010 00:09:36 GMT

On Friday, Google showed off a fun new doodle in honor of the 30th anniversary of Pac Man: a Pac Man clone, complete with sounds.

Unfortunately, in the initial release, those sounds started playing automatically (an oversight or an homage, I guess). Even if Google was open in a background tab or window, or in a hidden iframe created by an add-on, the Pac Man music and sound effects would start.

And that confused some people.

Many people came to SUMO looking for an explanation, and many of them, not finding anything in the knowledge base, started posting to our forum. So many, in fact, that our database server started running out of connections.

The pounding we took on the forums also caused replication on our slave databases to fall behind by as much as 1.25 hours, so even when we wrote an article about the noises [article has been removed], it didn’t show up for most people.

As Sean put it: “We just got DDOSed by Pac Man.”

To shore up the site and bring it back from the brink of toppling over, we worked with IT (thanks, Dave!) to implement a number of temporary solutions. We…

…disabled a particular kind of slow, frequent, and useless query.*
…blocked Google’s crawler from indexing the site.
…disabled our own sumobot’s forum-crawling features.
…rotated DB slaves out of the production pool to allow them to catch up.

Google has already removed the Pac Man doodle from their home page, and we can revert most of the emergency measures here on Monday. But the event does remind us to look at what we’re doing in Kitsune, our rewrite, to weather storms like this in the future.

One idea, suggested by Dave Dash, is a read-only mode where all pages that can trigger database writes are temporarily disabled. We’ll be looking pretty seriously at this over the next couple of days.

Another important take-away is to make damn sure pages only trigger database writes if they really need to. Writes can never bounce off a cache, so they are very expensive.

Finally, we should be more proactive in how we interact with our Zeus cache. We’ll also think about whether it makes sense to start using Wil Clouser’s Zeus interface, Hera, sooner than later.

“Too much traffic” is the best problem a web development team can have. Hopefully, the first time this happens to Kitsune, we’ll be ready.

* The queries that increment the number of views a forum thread has gotten are particularly slow for some reason. They’re also wildly inaccurate, since most people see a cached version of those pages and never trigger the query. The worst part: they occur on every (non-cached) page view, even while just reading.

(This post was translated into Belorussian, isn’t that cool?)

Responsible SQL: How to Authenticate Users

James Socol — Sun, 09 Nov 2008 12:16:58 GMT

Most SQL-injection articles set a horrible example for young programmers.

Here is a very typical “bad example” of why you need to escape user data before it goes into SQL queries:

(ed. The symbol « is a line break that’s not in the real code.)

$username = $_POST['username']; // username=admin
$password = $_POST['password']; // password=' OR 1=1; -- '
$user = $db->query("SELECT * FROM users WHERE «
username='$username' AND «
password='$password' LIMIT 1;");

The point, of course, is that you must sanitize your user input, or else this person would run this query:

$user = $db->query("SELECT * FROM users WHERE «
username='admin' AND «
password = '' OR 1=1; -- ' LIMIT 1;");

Which grants the sneaky user all your admin privileges. Other versions have nefarious users dropping your users or articles tables.

The problem is: this is the wrong way to authenticate users. These examples are written for beginners to understand the importance of sanitizing input, but they also provide a model to those beginners for how user authentication works. And it’s a very bad model.


The only upside to authenticating this way is that you don’t expose any information on failure, that is, if I’m trying to hijack someone’s account, I can’t tell the difference between an invalid user name and a valid user name with a bad password. That’s good, but there are good reasons not to do this at the database level.

The “correct” way is not much more complex. Basically:

Look up the record with the username only.
Get the (hashed) password out of the database.
Hash the submitted password.
Compare the two hashes.

This is really not very hard to implement. In PHP:

/**
 * Check a password against the database
 * @param string $username The username to check
 * @param string $password The (supposed) password
 * @return int 0=success, 1=bad username, 2=bad password
 */
function check_password ($username, $password){
    $db = new mysqli(); // we need to talk to the DB

    // the real_escape_string() function is much better
    // than addslashes() for escaping MySQL database input
    $_username = $db->real_escape_string($username);

    // I try to make my SQL queries as easy to read
    // as possible. (Not always very easy.)
    $result = $db->query("SELECT password "
        ."FROM users "
        ."WHERE username = '{$_username}' "
        ."LIMIT 1;");

    // we're assuming the query ran correctly

    // if we can't return a row, then there's no user with
    // that name
    if( !$user = $result->fetch_assoc()){
        return 1; // return code for bad username
    }

    // now, assuming the password was hashed with crypt()
    if($user['password'] != «
        crypt($password, $user['password'])){
        return 2; // return code for bad password
    }

    return 0; // return code for success
}

What’s going on here? Basically, we’re looking up the user by the username. If we don’t find a user, we throw out an error. If we do find a user, we re-encrypt the password they supplied, and check it against the encrypted password we already have. If they don’t match, we throw out an error. If they do, the user is allowed to log in.
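For comparison, here is the same flow as a minimal Python sketch. Several things in it are assumptions rather than part of the PHP above: it uses an in-memory SQLite database instead of MySQL, a hypothetical users table with a separate salt column, PBKDF2 from the standard library as a stand-in for crypt(), and query placeholders instead of real_escape_string(). The shape is the same: look up by username only, then compare hashes in the application.

```python
import hashlib
import hmac
import os
import sqlite3

ITERATIONS = 100_000  # assumed work factor, chosen only for this sketch


def hash_password(password, salt):
    """Derive a hash with PBKDF2-HMAC-SHA256 (stand-in for PHP's crypt())."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def check_password(conn, username, password):
    """Return 0=success, 1=bad username, 2=bad password."""
    # Look up the record by username only; the placeholder keeps it as data.
    row = conn.execute(
        "SELECT salt, password FROM users WHERE username = ? LIMIT 1",
        (username,),
    ).fetchone()
    if row is None:
        return 1  # no user with that name
    salt, stored_hash = row
    # Hash the submitted password and compare the two hashes.
    if not hmac.compare_digest(stored_hash, hash_password(password, salt)):
        return 2  # user exists, password is wrong
    return 0


# Hypothetical table layout and sample user, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, salt BLOB, password BLOB)")
salt = os.urandom(16)
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?)",
    ("admin", salt, hash_password("s3cret", salt)),
)
```

Calling check_password(conn, "admin", "' OR 1=1; -- ") returns 2 here: the injection attempt is just a wrong password.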

There are two key differences between this method and the method so often espoused by tutorial writers:

This method stores an encrypted password instead of plain text.
This method differentiates between bad usernames and bad passwords.

#1 should be obvious. Never store an unencrypted password. It’s extremely dangerous: if someone ever gets a look at the table, they can just read the users’ passwords, which may well be the same as their bank password (no it shouldn’t be, but it probably is). And it’s unnecessary. Every server-side language implements the MD5 hash, which is weak but works. Better options (like PHP’s crypt()) can use algorithms like Triple-DES, SHA1, Blowfish, or at least MD5 with a random salt.

But wait, #2, I said it was better not to distinguish between a bad username and a bad password, right? Well… yes, to the end user. In either case, I should display a message like “Bad username or password” to the person who tried to log in.

Internally, however, I want to know what happened. Is someone targeting known users, or just trying random combinations? How did they find real usernames? Where should I be improving security?

You’re also minimizing the number of user-submitted strings that get sent to the database. There are fewer opportunities for you to accidentally allow an injection attack. If you have a policy on username syntax, you can keep yourself even safer by not talking to the database if the username is bad:

(I’ve omitted logging or real error-handling here. In a live version, I would probably wrap most of this in a try block, throw one of three types of exceptions, and do some logging in the catch block.)

// Usernames must start with a letter, and contain
// only letters, numbers, underscores and dots, but
// must not end with a dot or underscore.
// The ^ and $ anchors make the pattern cover the whole string.
$user_regex = '/^[a-zA-Z]([a-zA-Z0-9_.]*[a-zA-Z0-9])?$/';

if(preg_match($user_regex,$username)){
    // the username matches our allowed syntax
    $auth = check_password($username, $password);

    if($auth === 0){
        // the do_login() function is an exercise
        // to the reader (assumed here to redirect and exit)
        do_login($username);
    }
}

// the username was bad, or the username/password
// was wrong
// die() is an overly simplistic choice, here.
die("Bad username or password.");

Obviously we still escape the username, to make damn sure, but this gives us another place to get information. Did someone actually enter `'; DROP TABLE users; --` into our login form, or did they just mistype their password?
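The username policy is easy to get subtly wrong, so here is a small Python sketch of the same check. The function name is made up for illustration, and allowing single-letter usernames is my reading of the stated policy. The important detail is that the pattern has to cover the whole string; an unanchored pattern would accept any input that merely contains a valid-looking fragment.

```python
import re

# Starts with a letter; letters, digits, underscores and dots in between;
# must not end with a dot or underscore. A single letter is allowed.
USERNAME_RE = re.compile(r"[a-zA-Z](?:[a-zA-Z0-9_.]*[a-zA-Z0-9])?")


def is_valid_username(username):
    """Return True only if the entire string matches the policy."""
    return USERNAME_RE.fullmatch(username) is not None
```

With this check in front of the database call, a string like `'; DROP TABLE users; --` is rejected before any query runs, and that rejection is something you can log.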

I’m going to end with a request: if you’re about to write a tutorial for beginners, please be aware of what you’re modeling in your examples. If you’re doing something you would never do, for the sake of simplicity or because it’s not the focus of the tutorial, point that out. Link to another tutorial or at least mention that it’s a bad way to do something.

Don’t send a quiet message that wrong is OK.

Connecting PHP, IIS 6, and SQL Server 2005

James Socol — Thu, 23 Oct 2008 10:33:20 GMT

I know I will be accosted for this, but at work we needed to run PHP on IIS 6 (fairly simple) and connect it to a remote database server running SQL Server 2005 (not terrible, once I gave up the Microsoft way).

Yeah yeah, do it in ASP.NET, I know. While I like C# as a language, I kind of hate ASP.NET as a framework, so what are you gonna do? Java was an option but the start-up time was too long for this project.

My first Google search for “PHP SQL Server 2005” turned up the Microsoft SQL Server 2005 Driver for PHP. “Well great!” I thought. It’s just a PHP extension, very easy to install on Windows. But I didn’t know the horrid depths into which I was about to sink.

The Microsoft driver comes with an example application and database. The application assumes you are connecting to a local database. There is scant information about remote databases.

The driver defines this function:

sqlsrv_connect($host[, $connectionOptions[, ...]]);

The example application tells you to set $host to (local). Supposedly this works. However, after scouring the internet for several days, and trying every permutation of hostname, Windows networking name, port, IP address, white space, and several other variables that shouldn’t have been in there, I’ve decided it doesn’t talk to remote servers nicely.

PDO’s ODBC driver, on the other hand, and a quick visit to www.connectionstrings.com, worked wonderfully.

Here is how I needed to create the PDO object. I hope this is useful for someone else:

(ed. The symbol « is a line break that’s not in the real code.)

$host = '1.2.3.4';
$port = '1433';
$database = 'MyDatabase';
$user = 'MyDatabaseUser';
$password = 'MyDatabasePassword';
$dsn = "odbc:DRIVER={SQL Server};«
    SERVER=$host,$port;DATABASE=$database";
try {
    // connect
    $conn = new PDO($dsn,$user,$password);
} catch (PDOException $e) {
    // fancy error handling
}

Help Me Scale

James Socol — Fri, 06 Jun 2008 08:58:38 GMT

I’ve been reading Eran Hammer-Lahav’s intelligent posts on microblog scalability, and now I’m concerned about my own “microblog” site, Picofiction.

Similar to social networks, social updates, social messaging, social… Like many social web sites—amongst our weaponry…—Picofiction lets you “follow” your favorite authors, displaying all their posts along with yours.

I handle this very naïvely: everything is offloaded to the database. There are three tables involved here, one of users, one of posts, and one of follower/followee bindings.

Here’s the basic structure of this query:

SELECT post_id, post_body, post_date, post_type,
    user_name AS author_name, user_id AS author_id
FROM posts
LEFT JOIN users ON posts.author_id = users.user_id
WHERE author_id = 'CURRENT_USER'
    OR author_id IN (
        (SELECT followed_id FROM followers
         WHERE following_id = 'CURRENT_USER')
    )
ORDER BY post_date DESC
LIMIT PAGE_START,20;
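To make the shape of that query concrete, here is a small self-contained Python sketch against an in-memory SQLite database. The table layouts and sample rows are assumptions that only mirror the column names used in the query, and SQLite stands in for MySQL, so treat it as an illustration of the single-database approach, not of Picofiction's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, author_id INTEGER,
                    post_body TEXT, post_date TEXT);
CREATE TABLE followers (following_id INTEGER, followed_id INTEGER);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO posts VALUES
  (1, 1, 'by alice', '2008-06-01'),
  (2, 2, 'by bob',   '2008-06-02'),
  (3, 3, 'by carol', '2008-06-03');
INSERT INTO followers VALUES (1, 2);  -- alice follows bob
""")


def timeline(conn, current_user, page_start=0, page_size=20):
    """The user's own posts plus posts by everyone they follow, newest first."""
    return conn.execute(
        """
        SELECT post_id, post_body, post_date, user_name AS author_name
        FROM posts
        LEFT JOIN users ON posts.author_id = users.user_id
        WHERE posts.author_id = :me
           OR posts.author_id IN (
                SELECT followed_id FROM followers WHERE following_id = :me)
        ORDER BY post_date DESC
        LIMIT :size OFFSET :start
        """,
        {"me": current_user, "size": page_size, "start": page_start},
    ).fetchall()
```

Everything, including the follower lookup, happens inside one database, which is exactly why it stops working once posts and followers live on different servers.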

Here’s where I need help: this works great on a single database, but it does not scale horizontally.

Since this horizontal scalability is such a hot topic right now, I’m asking for ideas. I’d like to put in the infrastructure before there is a need for it.

Eran points out that caching is not as simple a solution as we’d like to think. What do you cache? How do you keep caches in sync?

Does anyone have experience with MySQL Cluster Servers? It seems like the best way of scaling is to make the process as parallelizable as possible. The database then handles the parallelization, so the less I can do in the program the better, right?

MySQL Subqueries

James Socol — Sun, 05 Aug 2007 19:31:00 GMT

I often find it difficult to find tips and advice for doing relatively simple things in tools like MySQL, Ruby, Python, etc. So, starting with this post, I will help fill that niche. Today’s topic is Using Subqueries to Simplify your SQL Queries.

For this article, I’m using PHP and MySQL for examples. There are slightly different implementations of SQL in the various database engines, but this is one thing they all have in common.

SQL is called “structured query language” because it allows subqueries to make complex queries easier and faster. The idea of a subquery is simple: have the database perform one query and insert its result into another.

There are dozens of useful ways of using subqueries, but I will concentrate on two: subqueries in the select expression and subqueries in the where clause.

Security Concerns

In most web programming languages, the interface between the script and the database only allows one query per access for security reasons: an injection attack could input something like '; DELETE FROM users; and do some serious damage to a website. Imagine your SQL query to log in looked something like:

SELECT * FROM users WHERE user_name = '$username' AND password = '$password';

If you are not checking and cleaning the input appropriately, someone could type the snippet above into your login form and, if multiple queries were allowed, MySQL would execute the following:

SELECT * FROM users WHERE user_name = ''; DELETE FROM users; AND password='';

Since the empty string wouldn’t match any rows (hopefully), the first query would be discarded. The second query, the DELETE statement, would run, terminating at the second semicolon. Since the third piece of code is nonsense, MySQL would throw it out with an error.

To solve this problem, languages like PHP cause MySQL to issue an error any time there is more text (except comments) after the statement terminator, usually the semicolon. The downside is that situations arise where you need to run multiple queries. The result is often either a godawfully complicated statement with multiple JOINs, or several separate queries, each of which requires communication with your database server and can slow down your application.
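The same one-statement-per-call rule is easy to see in Python's built-in sqlite3 module, which I'm using here only as a stand-in for the PHP/MySQL interface described above. The table and function names are made up for the sketch. The unsafe version builds the SQL by interpolation and gets refused when the input smuggles in a second statement; the safe version uses placeholders, so the input never becomes SQL at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'not-a-real-hash')")
conn.commit()


def unsafe_login(conn, username, password):
    """Interpolates input straight into the SQL, like the bad example."""
    sql = ("SELECT * FROM users WHERE user_name = '%s' AND password = '%s';"
           % (username, password))
    try:
        return conn.execute(sql).fetchall()
    except (sqlite3.Warning, sqlite3.ProgrammingError):
        # execute() refuses to run more than one statement per call;
        # older Pythons raise Warning, newer ones ProgrammingError.
        return None


def safe_login(conn, username, password):
    """Placeholders keep the input as data, never as SQL."""
    sql = "SELECT * FROM users WHERE user_name = ? AND password = ?;"
    return conn.execute(sql, (username, password)).fetchall()


def user_count(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


attack = "'; DELETE FROM users; --"
```

Refusing the second statement protects the table, but it is a backstop: the interpolated query is still broken, and a single-statement injection like ' OR 1=1 would still get through it.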

In the examples below, I’ll pretend we’re building a forum that has four tables:

users with primary key user_id
forums, a list of all the boards, with primary key forum_id
threads which links each thread to a forum with forum_id and has primary key thread_id
posts which links each post to a thread with thread_id and has primary key post_id

Subqueries in Select Expressions

One way to speed up your queries again is to use subqueries. Subqueries are full SQL queries nested within another query. For example:

SELECT (SELECT * FROM t1);

Obviously it’s a pretty simple example. Notice the parentheses. Subqueries must always be in parentheses, even if they are inside a function, like:

SELECT UPPER((SELECT user_name FROM users WHERE user_id='1'));

Let’s get to work on our forum. Say that while reading all the threads of a forum you’d like to have both the number of threads and the number of posts in the forum. One way is to run two separate queries:

SELECT COUNT(*) AS threads FROM threads WHERE forum_id='1';

SELECT COUNT(*) AS posts FROM posts LEFT JOIN threads USING(thread_id) WHERE forum_id='1';

That might not be so bad if your SQL server is localhost, but more and more hosts are running dedicated SQL servers, meaning that every query has to run across the internet, be processed, and run back, slowing down your application. But we can run this in one query with two subqueries:

SELECT
    (SELECT COUNT(*) FROM threads WHERE forum_id='1') AS threads,
    (SELECT COUNT(*) FROM posts LEFT JOIN threads USING(thread_id) WHERE forum_id='1') AS posts;

We can add the above to our query to get the name of the forum and its description, so we can further decrease the number of trips to the database:

SELECT
    (SELECT COUNT(*) FROM threads WHERE threads.forum_id=forums.forum_id) AS threads,
    (SELECT COUNT(*) FROM posts LEFT JOIN threads USING(thread_id) WHERE threads.forum_id=forums.forum_id) AS posts,
    forum_name,
    forum_description
FROM forums WHERE forum_id='1';

Notice that we also changed the WHERE clauses to match whatever forum ID we put into the “outer query”.
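Here is a runnable version of that last query as a Python sketch, using an in-memory SQLite database in place of MySQL. The sample rows are invented, and I've renamed the count columns to thread_count and post_count (my choice, to keep them visually distinct from the table names). Each subquery refers back to forums.forum_id from the outer query, so one round trip returns both counts along with the forum's name and description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forums (forum_id INTEGER PRIMARY KEY,
                     forum_name TEXT, forum_description TEXT);
CREATE TABLE threads (thread_id INTEGER PRIMARY KEY, forum_id INTEGER);
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, thread_id INTEGER);
INSERT INTO forums VALUES (1, 'General', 'Anything goes'), (2, 'Help', 'Ask here');
INSERT INTO threads VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO posts VALUES (100, 10), (101, 10), (102, 11), (200, 20);
""")


def forum_summary(conn, forum_id):
    """Thread count, post count, name and description in a single query."""
    return conn.execute(
        """
        SELECT
          (SELECT COUNT(*) FROM threads
            WHERE threads.forum_id = forums.forum_id) AS thread_count,
          (SELECT COUNT(*) FROM posts
             LEFT JOIN threads USING (thread_id)
            WHERE threads.forum_id = forums.forum_id) AS post_count,
          forum_name,
          forum_description
        FROM forums
        WHERE forum_id = ?
        """,
        (forum_id,),
    ).fetchone()
```

Because the subqueries depend on the outer row, dropping the final WHERE clause would return the same summary for every forum at once.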

Subqueries in Where Clauses

Another simple and useful way to use a subquery is in a WHERE clause. Here you must be careful to match the WHERE syntax and the type of data returned by the subquery. For example, in WHERE user_name = (...), the subquery ((...)) must return a single value, while in WHERE post_date IN (...), the subquery can return a list.

In our forum, we might want to search for all posts by a specific user, but we don’t want our visitors to need to know the user ID—or perhaps we want a more descriptive URL, like search.php?user=USER_NAME instead of search.php?user=#ID#. But in our forum, to be efficient, we link posts to their author by the user_id column.

One way to do this is to run a query to find the ID then run another query to find the posts. Another way in this particular case is to use a JOIN statement. But yet another way is to do this:

SELECT * FROM posts WHERE user_id = (SELECT user_id FROM users WHERE user_name = 'foo');

In the case above, a JOIN would also get us the information we want, but in some cases this isn’t true, for example:

SELECT column1 FROM t1
WHERE column1 = (SELECT MAX(column2) FROM t2);

When you need to COUNT or otherwise aggregate one column, you’ll need to use a subquery instead of a JOIN, as well.
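The two WHERE forms can be tried side by side with a short Python sketch. As before, the in-memory SQLite database, the sample rows, and the function names are assumptions for illustration. One caveat: SQLite is more forgiving than MySQL here, quietly using the first row if an = subquery returns several, so the unique user_name column matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE);
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
INSERT INTO users VALUES (1, 'foo'), (2, 'bar'), (3, 'baz');
INSERT INTO posts VALUES (1, 1, 'first'), (2, 2, 'second'), (3, 1, 'third');
""")


def posts_by_name(conn, user_name):
    """'=' form: the subquery must yield a single user_id."""
    return conn.execute(
        "SELECT post_id, body FROM posts "
        "WHERE user_id = (SELECT user_id FROM users WHERE user_name = ?) "
        "ORDER BY post_id",
        (user_name,),
    ).fetchall()


def posts_by_names(conn, names):
    """'IN' form: the subquery may return a whole list of user_ids."""
    marks = ", ".join("?" for _ in names)  # one placeholder per name
    return conn.execute(
        "SELECT post_id, body FROM posts "
        "WHERE user_id IN (SELECT user_id FROM users WHERE user_name IN (%s)) "
        "ORDER BY post_id" % marks,
        list(names),
    ).fetchall()
```

A URL like search.php?user=foo then maps directly onto posts_by_name(conn, "foo"), with no separate lookup for the ID.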

Summary

This article only scratched the surface of subqueries. Subqueries can be nested, they can appear in other places and do other things, and they can make your SQL more readable, among other things. I don’t claim that the SQL statements above are the world’s most efficient or best way to do things; if you know a better way, let me know! I just want to give an introduction to subqueries, a very basic part of SQL that few people I’ve met seem to understand.