<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Coffee on the Keyboard &#187; user</title>
	<atom:link href="http://coffeeonthekeyboard.com/tag/user/feed/" rel="self" type="application/rss+xml" />
	<link>http://coffeeonthekeyboard.com</link>
	<description>by James Socol</description>
	<lastBuildDate>Fri, 20 Apr 2012 22:17:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/>		<item>
		<title>Bleach, HTML sanitizer and auto-linker</title>
		<link>http://coffeeonthekeyboard.com/bleach-html-sanitizer-and-auto-linker-for-django-344/</link>
		<comments>http://coffeeonthekeyboard.com/bleach-html-sanitizer-and-auto-linker-for-django-344/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 19:22:00 +0000</pubDate>
		<dc:creator>James</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[mozilla]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sanitize]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[user]]></category>

		<guid isPermaLink="false">http://coffeeonthekeyboard.com/?p=344</guid>
		<description><![CDATA[Bleach is a whitelist-based HTML sanitizer and auto-linker in Python, built on html5lib, for AMO and SUMO and released under the BSD license. Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both. For more information on using Bleach, see [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://github.com/jsocol/bleach">Bleach</a> is a whitelist-based HTML sanitizer and auto-linker in Python, built on <a href="http://code.google.com/p/html5lib/">html5lib</a>, for <a href="https://addons.mozilla.org/">AMO</a> and <a href="http://support.mozilla.com/">SUMO</a> and released under the BSD license.</p>
<p>Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.</p>
<p>For more information on using Bleach, see the <a href="http://github.com/jsocol/bleach/blob/master/README.rst">README</a> included in the source. For more info on how Bleach works, follow below the jump.<span id="more-344"></span></p>
<h3>Sanitizing HTML</h3>
<p>Bleach&#8217;s <code>clean()</code> function uses a slightly customized version of html5lib&#8217;s <code>HTMLSanitizer</code> tokenizer that adds support for per-tag attribute whitelists. Any markup that is not a whitelisted tag or a valid entity is escaped; whitelisted tags and legitimate entities pass through unchanged. The default whitelist is set up for AMO.</p>
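<p>To make the idea of a per-tag attribute whitelist concrete, here is a minimal sketch using only the Python standard library. It is not Bleach&#8217;s implementation (Bleach builds on html5lib), and the names <code>ALLOWED_TAGS</code>, <code>ALLOWED_ATTRS</code>, <code>WhitelistSanitizer</code> and <code>sanitize()</code> are assumptions made up for this illustration.</p>

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration; these are not Bleach's defaults.
ALLOWED_TAGS = {"a", "em", "strong", "code"}
ALLOWED_ATTRS = {"a": {"href", "title"}}  # per-tag attribute whitelist


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags and attributes; escape everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Disallowed markup is escaped so it displays as plain text.
            self.out.append(escape(self.get_starttag_text()))
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        text = "</{}>".format(tag)
        self.out.append(text if tag in ALLOWED_TAGS else escape(text))

    def handle_data(self, data):
        # Text arrives with entities decoded, so re-escape it on the way out.
        self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

<p>With these whitelists, an <code>onclick</code> attribute on an <code>&lt;em&gt;</code> tag is dropped, an <code>&lt;img&gt;</code> tag is escaped into visible text, and a bare ampersand in the text is encoded.</p>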
<h3>Linkifying Text</h3>
<p>The <code>linkify()</code> function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:</p>
<ul>
<li><code>&lt;em&gt;http://example.com&lt;/em&gt;</code> (should be linkified)</li>
<li><code>&lt;a href="http://example.com"&gt;test&lt;/a&gt;</code> (already linked, no need to linkify)</li>
<li><code>&lt;a href="http://example.com"&gt;http://example.com&lt;/a&gt;</code> (really don&#8217;t need to linkify)</li>
<li><code>&lt;em&gt;http://xx.com &lt;a href="http://example.com"&gt;http://example.com&lt;/a&gt;&lt;/em&gt;</code> (regular expression freak-out)</li>
</ul>
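<p>A quick way to see the problem is to run a regex-only linkifier over one of the cases above. The pattern and the <code>naive_linkify()</code> helper below are hypothetical, written only to show the failure; they are not taken from Bleach.</p>

```python
import re

# A deliberately simple URL pattern, typical of naive linkifiers.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def naive_linkify(text):
    """Wrap every URL-like string in a link, with no awareness of markup."""
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )


# Plain text works as expected.
plain = naive_linkify("go to http://example.com now")

# An existing link is mangled: the URL inside the href attribute is
# matched too, so a second link is injected into the attribute value.
broken = naive_linkify('<a href="http://example.com">test</a>')
```

<p>Here <code>broken</code> ends up with an <code>&lt;a&gt;</code> tag nested inside the original <code>href</code> attribute, which is exactly the kind of output a markup-aware approach has to prevent.</p>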
<p>So <code>linkify()</code> actually uses html5lib to build a document fragment and walks it, only applying the naïve regular expression in safe locations. In pseudocode:</p>
<div class="dean_ch" style="white-space: wrap;">
<ol>
<li class="li1">
<div class="de1">tree = parseFragment<span class="br0">&#40;</span><span class="kw2">input</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">linkify_nodes <span class="br0">&#40;</span>tree<span class="br0">&#41;</span>:</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">for</span> node <span class="kw1">in</span> tree:</div>
</li>
<li class="li2">
<div class="de2">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> node <span class="kw1">is</span> a text node:</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; replace node with text nodes <span class="kw1">and</span> links</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">else</span> <span class="kw1">if</span> node <span class="kw1">is</span> a link:</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> nofollow:</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">set</span> rel=<span class="st0">&quot;nofollow&quot;</span> on node</div>
</li>
<li class="li2">
<div class="de2">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">else</span>:</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; linkify_nodes<span class="br0">&#40;</span>node.<span class="me1">childNodes</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw1">return</span> <span class="kw3">string</span><span class="br0">&#40;</span>linkify_nodes<span class="br0">&#40;</span>tree<span class="br0">&#41;</span><span class="br0">&#41;</span></div>
</li>
</ol>
</div>
<p>This avoids applying the regular expression to tag attributes, the contents of <code>&lt;a&gt;</code> tags, and other places where it would corrupt the markup. It also lets us set the <code>rel</code> attribute on links already in the text and pass their <code>href</code> attributes through the same filter they would go through if we had created the link. This filter lets us route links through an outbound redirect, so people know they&#8217;re leaving a Mozilla site. You could do other things with it, like rickroll your visitors. That&#8217;s up to you.</p>
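<p>The pseudocode can be approximated in runnable form with the standard library. This sketch follows <code>html.parser</code>&#8217;s event stream rather than walking an html5lib document fragment, so it is an illustration of the idea and not Bleach&#8217;s code. The <code>Linkifier</code> class, the <code>render_tag()</code> and <code>outbound()</code> helpers, and the <code>outgoing.example</code> redirector are all assumptions invented for the example.</p>

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote

# Deliberately naive URL pattern; it is only ever applied to text nodes.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def outbound(url):
    """Hypothetical href filter: route a link through a made-up redirector."""
    return "http://outgoing.example/?u=" + quote(url, safe="")


def render_tag(tag, attrs):
    """Serialize a start tag from (name, value) pairs."""
    rendered = "".join(
        ' {}="{}"'.format(name, escape(value or "", quote=True))
        for name, value in attrs
    )
    return "<{}{}>".format(tag, rendered)


class Linkifier(HTMLParser):
    def __init__(self, nofollow=True, filter_url=None):
        super().__init__(convert_charrefs=True)
        self.nofollow = nofollow
        self.filter_url = filter_url or (lambda url: url)
        self.link_depth = 0  # greater than zero while inside an <a> element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            # Existing links get the same href filter and rel as new ones.
            attrs = [
                (k, self.filter_url(v) if k == "href" and v else v)
                for k, v in attrs
                if not (k == "rel" and self.nofollow)
            ]
            if self.nofollow:
                attrs.append(("rel", "nofollow"))
        self.out.append(render_tag(tag, attrs))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.link_depth:
            # Text inside a link is never linkified again.
            self.out.append(escape(data, quote=False))
            return
        pos = 0
        for match in URL_RE.finditer(data):
            url = match.group(0)
            self.out.append(escape(data[pos:match.start()], quote=False))
            attrs = [("href", self.filter_url(url))]
            if self.nofollow:
                attrs.append(("rel", "nofollow"))
            self.out.append(render_tag("a", attrs))
            self.out.append(escape(url, quote=False) + "</a>")
            pos = match.end()
        self.out.append(escape(data[pos:], quote=False))


def linkify(text, nofollow=True, filter_url=None):
    parser = Linkifier(nofollow, filter_url)
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

<p>Calling <code>linkify(text, filter_url=outbound)</code> sends both new and existing links through the hypothetical redirector, while <code>nofollow=False</code> leaves <code>rel</code> untouched. The sketch ignores details a real implementation has to handle, such as trailing punctuation after a URL and self-closing tags.</p>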
<h3>Bad HTML</h3>
<p>Because both <code>clean()</code> and <code>linkify()</code> use <code>html5lib</code> and construct document trees, using either will fix up markup mistakes, like unclosed tags, and escape bare entities. <code>linkify()</code> allows basically every tag and attribute, so if you need to limit the legal HTML to a subset, use <code>clean()</code> (or the shortcut <code>bleach()</code> to clean then linkify).</p>
<h3>Getting Bleach</h3>
<p>Bleach is <a href="http://github.com/jsocol/bleach">available on GitHub</a>, or can be installed via <code>pip</code> or <code>easy_install</code>. Improvements and test cases are very welcome! Actually, one test is currently disabled because the case it covers is not supported yet. If you can make it work, that would be pretty great!</p>
]]></content:encoded>
			<wfw:commentRss>http://coffeeonthekeyboard.com/bleach-html-sanitizer-and-auto-linker-for-django-344/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

