Bleach, HTML sanitizer and auto-linker
Bleach is a whitelist-based HTML sanitizer and auto-linker written in Python and built on html5lib. It was developed for AMO and SUMO and is released under the BSD license.
Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.
For more information on using Bleach, see the README included in the source. For more on how Bleach works, read on.
The clean() function uses a slightly customized version of html5lib’s HTMLSanitizer tokenizer that adds support for per-tag attribute whitelists. Anything that is not part of a whitelisted tag or a valid entity is encoded, while legitimate entities and tags pass through unchanged. The default whitelist is set up for AMO.
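To make the idea of a per-tag attribute whitelist concrete, here is a minimal sketch. It is not Bleach's code: it uses the standard library's html.parser as a stand-in for html5lib, and the names ALLOWED_TAGS, ALLOWED_ATTRS, WhitelistSanitizer and clean_sketch are made up for illustration. It also does not check URL schemes, which a real sanitizer should.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; these are not Bleach's defaults.
ALLOWED_TAGS = {"a", "em", "strong"}
ALLOWED_ATTRS = {"a": {"href", "title"}}  # attributes allowed per tag


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags and attributes; escape everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Disallowed tag: emit the raw tag as escaped text, not markup.
            self.out.append(escape(self.get_starttag_text()))
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)
        else:
            self.out.append(escape("</%s>" % tag))

    def handle_data(self, data):
        # Text arrives with entities decoded; re-escape it on the way out,
        # which also encodes bare ampersands.
        self.out.append(escape(data, quote=False))


def clean_sketch(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(clean_sketch('<em onclick="x()">hi</em> <script>alert(1)</script>'))
```

The onclick attribute is dropped because em has no entry in the per-tag attribute map, and the script tag is encoded rather than removed, which mirrors the behavior described above.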
The linkify() function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:
<em>http://example.com</em> (should be linkified)
<a href="http://example.com">test</a> (already linked, no need to linkify)
<a href="http://example.com">http://example.com</a> (really don’t need to linkify)
<em>http://xx.com <a href="http://example.com">http://example.com</a></em> (regular expression freak-out)
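To see why a bare regular expression breaks on these cases, consider the following sketch. The pattern URL_RE and the function naive_linkify are hypothetical and deliberately simplistic; they are not taken from Bleach.

```python
import re

# A deliberately naive URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def naive_linkify(html):
    """Wrap every URL-like string in a link, with no regard for context."""
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), html
    )


# Works for a URL in plain text inside a tag...
print(naive_linkify("<em>http://example.com</em>"))

# ...but rewrites the URL inside an existing href and breaks the markup.
print(naive_linkify('<a href="http://example.com">test</a>'))
```

The second call nests a new link inside the href attribute of the existing one, because the regular expression has no notion of where in the document it is matching.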
linkify() instead uses html5lib to build a document fragment and walks it, applying the naïve regular expression only in safe locations, such as text that is not already inside a link.
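A rough Python sketch of that walk might look like the following. It makes several assumptions: the standard library's html.parser event stream stands in for html5lib's document fragment, "safe location" is taken to mean text outside an existing a element, and the names URL_RE, LinkifySketch and linkify_sketch are hypothetical. It ignores comments and treats script and style content like ordinary text, so it is an illustration of the approach rather than a usable linkifier.

```python
import re
from html import escape
from html.parser import HTMLParser

# Naive URL pattern, applied only to text found outside existing links.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


class LinkifySketch(HTMLParser):
    """Re-emit the markup, linkifying URLs only in text outside <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.anchor_depth = 0  # greater than 0 while inside an existing link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())  # pass the tag through

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.anchor_depth:
            # Already linked: re-escape the text and leave it alone.
            self.out.append(escape(data, quote=False))
            return
        pos = 0
        for match in URL_RE.finditer(data):
            url = match.group(0)
            self.out.append(escape(data[pos:match.start()], quote=False))
            self.out.append(
                '<a href="%s">%s</a>'
                % (escape(url, quote=True), escape(url, quote=False))
            )
            pos = match.end()
        self.out.append(escape(data[pos:], quote=False))


def linkify_sketch(html):
    parser = LinkifySketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(linkify_sketch(
    '<em>http://xx.com <a href="http://example.com">http://example.com</a></em>'
))
```

In this sketch the bare URL gets linked while the existing link is left untouched, since the regular expression never sees attribute values or text inside an a element.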
Because both functions use html5lib and construct document trees, using either will fix up markup mistakes, like unclosed tags, and escape bare entities.
linkify() allows basically every tag and attribute, so if you need to limit the legal HTML to a subset, use clean() (or the shortcut bleach() to clean and then linkify).
Bleach is available on GitHub, or can be installed via easy_install. Improvements and test cases are very welcome! Actually, there’s one disabled test right now that covers a case that is not yet supported. If you can make it work, that would be pretty great!