• Bleach, HTML sanitizer and auto-linker

    by  • 25 February 2010 • Articles

    Bleach is a whitelist-based HTML sanitizer and auto-linker written in Python. It is built on html5lib, was written for AMO (addons.mozilla.org) and SUMO (support.mozilla.com), and is released under the BSD license.

    Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.

    For more information on using Bleach, see the README included in the source. For more on how Bleach works, read on.

    Sanitizing HTML

    Bleach’s clean() function uses a lightly customized version of html5lib’s HTMLSanitizer tokenizer that adds support for per-tag attribute whitelists. Anything that is not a whitelisted tag or a valid entity is escaped; whitelisted tags and legitimate entities pass through unchanged. The default whitelist is set up for AMO.
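    To make the idea of a per-tag attribute whitelist concrete, here is a minimal sketch that uses only Python's standard html.parser module. It is not Bleach's code and does not use html5lib; the ALLOWED table, the WhitelistSanitizer class and clean_sketch() are hypothetical names invented for this illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration: tag -> attributes allowed on it.
ALLOWED = {"a": {"href", "title"}, "em": set(), "strong": set()}


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags and attributes, escape everything else."""

    def __init__(self, allowed):
        super().__init__()  # character references are decoded by the parser
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            # Not whitelisted: emit the raw tag as escaped text.
            self.out.append(escape(self.get_starttag_text(), quote=False))
            return
        # Keep only the attributes whitelisted for this particular tag.
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in self.allowed[tag]
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        text = "</{}>".format(tag)
        self.out.append(text if tag in self.allowed else escape(text, quote=False))

    def handle_data(self, data):
        # Text was decoded on the way in, so re-escape it on the way out.
        # Valid entities survive and bare ampersands get encoded.
        self.out.append(escape(data, quote=False))


def clean_sketch(html, allowed=ALLOWED):
    parser = WhitelistSanitizer(allowed)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

    Calling clean_sketch('<em onclick="x()">hi</em>') drops the onclick attribute because the em entry in the table allows no attributes. A real sanitizer also has to deal with comments, self-closing tags, URL schemes and malformed markup, which this sketch ignores.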

    Linkifying Text

    The linkify() function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:

    • <em>http://example.com</em> (should be linkified)
    • <a href="http://example.com">test</a> (already linked, no need to linkify)
    • <a href="http://example.com">http://example.com</a> (really don’t need to linkify)
    • <em>http://xx.com <a href="http://example.com">http://example.com</a></em> (regular expression freak-out)

    So linkify() actually uses html5lib to build a document fragment and walks it, only applying the naïve regular expression in safe locations. In pseudocode:

    tree = parseFragment(input)

    linkify_nodes(tree):
        for node in tree:
            if node is a text node:
                replace node with text nodes and links
            else if node is a link:
                if nofollow:
                    set rel="nofollow" on node
            else:
                linkify_nodes(node.childNodes)

    linkify_nodes(tree)
    return string(tree)
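    The same rule, "only run the regular expression on text that is not already inside a link", can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in, not Bleach's implementation: it uses html.parser events and a depth counter instead of an html5lib tree, and URL_RE, LinkifySketch and linkify_sketch() are hypothetical names. It drops comments and does not treat script or style content specially.

```python
import re
from html import escape
from html.parser import HTMLParser

# Deliberately naive URL pattern; it is only ever applied to safe text.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class LinkifySketch(HTMLParser):
    def __init__(self, nofollow=True):
        super().__init__()
        self.nofollow = nofollow
        self.link_depth = 0  # greater than 0 while inside an <a> element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            if self.nofollow:
                # Existing link: replace any rel value with "nofollow".
                attrs = [(k, v) for k, v in attrs if k != "rel"]
                attrs.append(("rel", "nofollow"))
        rendered = "".join(
            ' {}="{}"'.format(k, escape(v or "", quote=True)) for k, v in attrs
        )
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.link_depth:
            # Already inside a link: never linkify again.
            self.out.append(escape(data, quote=False))
            return
        pos = 0
        rel = ' rel="nofollow"' if self.nofollow else ""
        for match in URL_RE.finditer(data):
            url = match.group(0)
            self.out.append(escape(data[pos:match.start()], quote=False))
            self.out.append('<a href="{}"{}>{}</a>'.format(
                escape(url, quote=True), rel, escape(url, quote=False)))
            pos = match.end()
        self.out.append(escape(data[pos:], quote=False))


def linkify_sketch(html, nofollow=True):
    parser = LinkifySketch(nofollow)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

    Because attribute values never reach handle_data(), the href of an existing link is not touched by the regular expression, and the text inside an existing link is skipped by the depth check.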

    This avoids applying the regular expression to tag attributes, the contents of <a> tags, and other places where it does not belong. It also lets us set the rel attribute on links already in the text and pass their href attribute through the same filter it would go through if we had created the link ourselves. We use this filter to send links through an outbound redirect, so people know they’re leaving a Mozilla site. You could do other things with it, like rickroll your visitors. That’s up to you.
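    As an illustration of what such an href filter might look like, here is a small sketch. Everything in it is an assumption made for the example: the function name filter_url(), the INTERNAL_HOSTS set and the REDIRECT_PREFIX address are hypothetical, and this is not the hook Bleach itself exposes.

```python
from urllib.parse import quote, urlsplit

# Hypothetical configuration, for illustration only.
INTERNAL_HOSTS = {"addons.mozilla.org"}
REDIRECT_PREFIX = "http://outgoing.example.com/?u="


def filter_url(url):
    """Return the href to emit for a link pointing at `url`."""
    host = urlsplit(url).hostname
    if host is None or host in INTERNAL_HOSTS:
        # No host (relative URL, mailto:, ...) or one of our own sites:
        # leave the link alone.
        return url
    # External link: route it through the outbound redirect, with the
    # original URL percent-encoded into the query string.
    return REDIRECT_PREFIX + quote(url, safe="")
```

    The same function would be applied both to links the linkifier creates and to href values already present in the input, which is what keeps the two kinds of link consistent.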

    Bad HTML

    Because both clean() and linkify() use html5lib and construct document trees, using either will fix up markup mistakes, like unclosed tags, and escape bare entities. linkify() allows basically every tag and attribute, so if you need to limit the legal HTML to a subset, use clean() (or the shortcut bleach() to clean and then linkify).

    Getting Bleach

    Bleach is available on GitHub, or can be installed via pip or easy_install. Improvements and test cases are very welcome! In fact, one test is currently disabled because the case it covers is not supported yet. If you can make it work, that would be pretty great!