Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.
For more information on using Bleach, see the README included in the source. For more on how Bleach works, read on.
The clean() function uses a slightly customized version of html5lib’s HTMLSanitizer tokenizer that adds support for per-tag attribute whitelists. Anything that is not a whitelisted tag or a valid entity is escaped; legitimate entities and tags pass through unchanged. The default whitelist is set up for AMO (addons.mozilla.org).
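To make the per-tag attribute whitelist idea concrete, here is a minimal sketch using only Python's standard library. It is not Bleach's code and does not use html5lib: the whitelist contents and the names WhitelistSanitizer and sanitize are assumptions made up for illustration. It also drops comments and declarations and does not check URL schemes in href values, which a real sanitizer would need to do.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; these are not Bleach's defaults.
ALLOWED_TAGS = {"a", "em", "strong"}
ALLOWED_ATTRS = {"a": {"href", "title"}}  # attributes allowed per tag


class WhitelistSanitizer(HTMLParser):
    """Re-emit whitelisted tags and attributes; escape everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Disallowed tag: show it as escaped text instead of markup.
            self.out.append(escape(self.get_starttag_text()))
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        text = "</{}>".format(tag)
        self.out.append(text if tag in ALLOWED_TAGS else escape(text))

    def handle_data(self, data):
        # Text arrives with valid entities decoded, so re-escape it here.
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

With this whitelist, sanitize('<a href="http://example.com" onclick="evil()">x</a>') keeps the href and drops the onclick, because only href and title are allowed on a tags.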
The linkify() function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:
<em>http://example.com</em> (should be linkified)
<a href="http://example.com">test</a> (already linked, no need to linkify)
<a href="http://example.com">http://example.com</a> (really don’t need to linkify)
<em>http://xx.com <a href="http://example.com">http://example.com</a></em> (regular expression freak-out)
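To see the problem in action, here is a small sketch of the naïve approach. The pattern and the name naive_linkify are assumptions for illustration, not anything taken from Bleach.

```python
import re

# A deliberately naive URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def naive_linkify(text):
    """Wrap every URL-looking string in an <a> tag, wherever it appears."""
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)),
        text,
    )


# Works for a bare URL inside ordinary markup.
print(naive_linkify("<em>http://example.com</em>"))

# Breaks on an existing link: the URL inside the href attribute
# gets wrapped in a second <a> tag, producing invalid markup.
print(naive_linkify('<a href="http://example.com">test</a>'))
```

The regular expression has no idea whether a match sits in text, in an attribute, or inside an existing link, so the second call mangles markup that was already fine.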
linkify() actually uses html5lib to build a document fragment and walks it, only applying the naïve regular expression in safe locations. In pseudocode:
tree = parseFragment(input)
for node in tree:
    if node is a text node:
        replace node with text nodes and links
    else if node is a link:
        set rel="nofollow" on node
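For something you can actually run, here is a rough standard-library approximation of the same idea. It is a sketch, not Bleach's implementation: it reacts to html.parser events instead of walking an html5lib tree, and the Linkifier class, the link_depth counter, and the URL pattern are all assumptions made for illustration.

```python
import re
from html import escape
from html.parser import HTMLParser

URL_RE = re.compile(r"https?://[^\s<>\"]+")  # deliberately naive


class Linkifier(HTMLParser):
    """Linkify URLs in text, but never inside an existing <a> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            # Existing link: force rel="nofollow", keep other attributes.
            attrs = [(k, v) for k, v in attrs if k != "rel"]
            attrs.append(("rel", "nofollow"))
        rendered = "".join(
            ' {}="{}"'.format(k, escape(v or "", quote=True))
            for k, v in attrs
        )
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        text = escape(data, quote=False)
        if self.link_depth == 0:
            # Safe location: plain text that is not inside a link.
            text = URL_RE.sub(
                lambda m: '<a href="{0}" rel="nofollow">{0}</a>'.format(
                    m.group(0)
                ),
                text,
            )
        self.out.append(text)


def linkify(html):
    parser = Linkifier()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```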
This avoids applying the regular expression to things like tag attributes, the inside of <a> tags, and other places where it should not run. It also lets us do things like set the rel attribute on links already in the text and pass the href attribute through the same filter it would go through if we created the link. This filter lets us send links through an outbound redirect, so people know they’re leaving a Mozilla site. You could do other things with it, like rickroll your visitors. That’s up to you.
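As an illustration of what such an href filter might look like, here is a minimal sketch. The function name filter_href, the redirect URL, and the list of internal hosts are all hypothetical; none of them come from Bleach or from Mozilla's actual configuration.

```python
from urllib.parse import quote, urlsplit

# Hypothetical values for illustration only.
REDIRECT_BASE = "https://outgoing.example.org/?url="
INTERNAL_HOSTS = {"example.org", "www.example.org"}


def filter_href(url):
    """Route external links through a redirect page; leave others alone."""
    host = urlsplit(url).hostname
    if host is None or host in INTERNAL_HOSTS:
        return url  # relative or internal link, no redirect needed
    # Percent-encode the whole target so it survives as a query value.
    return REDIRECT_BASE + quote(url, safe="")


print(filter_href("http://example.com/a?b=c"))
print(filter_href("/about"))
```

Because the same function would run on both newly created links and links that were already in the text, every outbound href ends up going through one place.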
Because both functions use html5lib and construct document trees, using either will fix up markup mistakes, like unclosed tags, and escape bare entities.
linkify() allows basically every tag and attribute, so if you need to limit the legal HTML to a subset, use clean() (or the shortcut bleach() to clean then linkify).
Bleach is available on GitHub, or can be installed via easy_install. Improvements and test cases are very welcome! Actually, there’s one test disabled right now because it covers a case that isn’t supported. If you can make it work, that would be pretty great!