XSS: Cross-Site Scripting - Basic Security Part 2

NB: This is the second post in a series of posts on web application security.

XSS covers a variety of attacks, but the common thread is that someone gets to execute code in the context of your web page and domain. Once they can do that, they can do all sorts of things, most commonly collecting data from logged-in users, like session IDs, or worse.

There are a number of things you can do to help minimize the impact of an XSS vector against your site, and you should do them, because the more complicated the app, the more subtle holes you’re likely to end up with. We’ll cover those later, but first, let’s go over the first level of due diligence. You need to:

  • …properly escape any user content that ends up in your pages’ HTML.
  • …not use tag or attribute blacklists.
  • …audit the source of data that ends up rendered.

Any template language worth using, like Django’s or Jinja2, will automatically HTML-escape any variable you output, unless you take special steps not to. Don’t disable automatic escaping. Anything that needs to let HTML through should be opt-in, so you minimize the surface area of attack.

There is an interesting case with localization. For example, you might need to localize the string Hello, <span>Username</span>. Localization (L10n for short) 101 says you can’t do the simple thing:

_('Hello,') + ' <span>' + escape(username) + '</span>'

Because in some locales, the name will need to come first. And you can’t do the other simple thing, because it’s insecure:

_('Hello, <span>%s</span>') % username

In fact, in Jinja2, at least, interpolation marks the translated string as unsafe and will cause everything, even the <span> tags, to get escaped.
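To make the risk concrete, here is a minimal sketch in plain Python, outside any template engine. The template literal stands in for the translated string, and the hostile username is a made-up value; html.escape from the standard library stands in for whatever escaping function your stack provides.

```python
from html import escape

# Stands in for the translated string returned by _().
template = 'Hello, <span>%s</span>'

# Hypothetical hostile input supplied by a user.
username = '<script>alert(1)</script>'

# Raw interpolation: the script tag reaches the page intact.
unsafe = template % username

# Escaping the value first keeps the <span> and neutralizes the input.
safe = template % escape(username)

print(unsafe)
print(safe)
```

The first line printed still contains a live script tag; the second contains only the trusted <span> plus escaped text.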

For cases like this, we built a filter in Jingo (our Jinja-for-Django template adapter) that escapes input and then interpolates it, marking the final output as safe. An example from our code base, in Jinja2:

{{ _('<em>Editing</em> {title}')|fe(title=page.title) }}

Even if page.title has HTML in it, that will be escaped, but the <em> tags will not. This is getting into more complicated use cases, but even with L10n and other concerns in play, it is both possible and critical to default to escaping content.
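The idea behind an escape-then-interpolate filter can be sketched in a few lines of plain Python. This is not Jingo’s actual implementation: the name format_escaped is hypothetical, and it uses the standard library’s html.escape rather than anything from Jinja2. In a real Jinja2 setup the result would also need to be marked safe (for example by wrapping it in markupsafe.Markup), which is left out here to keep the sketch dependency-free.

```python
from html import escape


def format_escaped(template, **kwargs):
    """Escape every keyword value, then interpolate into the trusted template.

    Hypothetical helper for illustration; the template itself is assumed to
    come from a trusted source such as a translation catalog.
    """
    escaped = {key: escape(str(value)) for key, value in kwargs.items()}
    return template.format(**escaped)


result = format_escaped('<em>Editing</em> {title}',
                        title='<script>alert(1)</script>')
print(result)
```

Because the placeholder is named, a translator is free to move {title} anywhere in the string, and the value is escaped wherever it lands.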

Sometimes you need to let some HTML through. For example, a blog with comments may want to let commenters use some HTML tags. To do this safely, use a whitelist approach, and a tool like Bleach.

comment = bleach.clean(comment, tags=['em', 'strong', 'br'])

Bleach will not only escape any tags not in the whitelist, it will also close unbalanced tags so you won’t have a stray <b> or <div> ruining your whole page’s layout.

On JavaScript

Because I just fixed this bug, I’d like to talk about JS a bit.

It’s easy, even with JS frameworks, to simply put together a string of HTML and add it to the DOM via innerHTML. If user data ends up in one of these strings, you can have a problem. For a very simple example:

// Unsafe: username is inserted into the markup without HTML-escaping.
var stuff = '<h2>' + username + '</h2>';
document.getElementById('subhead').innerHTML = stuff;

Now username doesn’t need to be escaped as JavaScript, but as HTML. Even if it came from the server packaged safely as a JS string, you’re still creating an XSS vector.

The route I chose, because it was expedient, was essentially to implement Python’s markupsafe.escape or PHP’s htmlspecialchars() in JS.
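The replacement table such a port needs is small. Here is a rough sketch, written in Python to match the earlier examples rather than in JS; the function name escape_html is hypothetical, and the exact entities chosen for the quote characters are an assumption, not a copy of either library’s output. The one detail that matters in any language is replacing the ampersand first, so the entities produced by later replacements are not escaped a second time.

```python
def escape_html(text):
    """Replace the five HTML-significant characters with entities.

    Hypothetical helper for illustration. The ampersand must be handled
    first, otherwise the '&' in '&lt;' would itself be escaped.
    """
    replacements = (
        ('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('"', '&quot;'),
        ("'", '&#39;'),
    )
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text


print(escape_html('<b>"Tom" & \'Jerry\'</b>'))
```

A JS version would be the same sequence of replacements applied with global regular expressions.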

But if you’re starting from scratch or from a place where you already have a DOM node, you can generally use a JS framework to help you. For example, in jQuery, the .text() method uses the DOM’s document.createTextNode(), so any special characters are inserted as literal text rather than parsed as markup:

$('#subhead h2').text(username);

At least in the latest jQuery, that’s safe.