html - Coffee on the Keyboard

Bleach, HTML sanitizer and auto-linker

James Socol — Thu, 25 Feb 2010 14:22:00 GMT

Bleach is a whitelist-based HTML sanitizer and auto-linker in Python, built on html5lib for AMO (addons.mozilla.org) and SUMO (support.mozilla.org) and released under the BSD license.

Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.

For more information on using Bleach, see the README included in the source. For more info on how Bleach works, follow below the jump.

Sanitizing HTML

Bleach’s clean() function uses a slightly customized version of html5lib’s HTMLSanitizer tokenizer that adds support for per-tag attribute whitelists. Any markup that is not part of a whitelisted tag or a valid entity is escaped; legitimate entities and whitelisted tags pass through untouched. The default whitelist is set up for AMO.
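To make the whitelist idea concrete, here is a minimal sketch using only the Python standard library. It is not Bleach's implementation and does not use html5lib: the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `WhitelistSanitizer` and `clean_sketch` are made up for illustration, and the whitelists are not Bleach's defaults. It also does not check URL schemes in `href`, which a real sanitizer must do.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration; these are NOT Bleach's defaults.
ALLOWED_TAGS = {"a", "b", "em", "strong"}
ALLOWED_ATTRS = {"a": {"href", "title"}}  # per-tag attribute whitelist


class WhitelistSanitizer(HTMLParser):
    """Re-emit whitelisted tags and attributes; escape everything else."""

    def __init__(self):
        super().__init__()  # character references are decoded into text
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Disallowed tag: keep it visible as text instead of markup.
            self.out.append(escape(self.get_starttag_text()))
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or ""))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        closing = "</{}>".format(tag)
        self.out.append(closing if tag in ALLOWED_TAGS else escape(closing))

    def handle_data(self, data):
        # Text arrives decoded, so re-escaping also fixes bare "&" and "<".
        self.out.append(escape(data, quote=False))


def clean_sketch(text):
    parser = WhitelistSanitizer()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(clean_sketch('<b onclick="evil()">hi</b> <script>alert(1)</script>'))
```

The per-tag dictionary is the part that mirrors the customization described above: `href` is allowed on `<a>` without being allowed on every other tag.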

Linkifying Text

The linkify() function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:

http://example.com (should be linkified)
test (already linked, no need to linkify)
http://example.com (really don’t need to linkify)
http://xx.com http://example.com (regular expression freak-out)

So linkify() actually uses html5lib to build a document fragment and walks it, only applying the naïve regular expression in safe locations. In pseudocode:

    tree = parseFragment(input)

    linkify_nodes(tree):
        for node in tree:
            if node is a text node:
                replace node with text nodes and links
            elseif node is a link:
                if nofollow:
                    set rel="nofollow" on node
            else:
                linkify_nodes(node.childNodes)

    return string(linkify_nodes(tree))

This avoids applying the regular expression to things like tag attributes, the inside of `<a>` tags, and other places where it should generally be avoided. It also lets us do things like set the `rel` attribute on links already in the text and pass the `href` attribute through the same filter it would go through if we created the link. This filter lets us redirect links through an outbound redirect, so people know they’re leaving a Mozilla site. You could do other things with it, like rickroll your visitors. That’s up to you.
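The same idea can be sketched with nothing but the standard library. This is an approximation, not Bleach's code: it swaps html5lib's document tree for the event-based `html.parser` and a counter that tracks whether we are inside a link. The names `linkify_sketch`, `Linkifier`, `outbound` and the `/outbound?url=` path are hypothetical, and the URL pattern is deliberately naive.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote

# Deliberately naive URL pattern; only ever applied to text outside links.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def outbound(url):
    """Hypothetical href filter: route a link through a redirect page."""
    return "/outbound?url=" + quote(url, safe="")


class Linkifier(HTMLParser):
    def __init__(self, nofollow=True, filter_url=lambda url: url):
        super().__init__()
        self.nofollow = nofollow
        self.filter_url = filter_url
        self.in_link = 0  # number of <a> elements we are currently inside
        self.out = []

    def _a_tag(self, attrs):
        # Existing and newly created links go through the same href filter.
        attrs = dict(attrs)
        if "href" in attrs:
            attrs["href"] = self.filter_url(attrs["href"] or "")
        if self.nofollow:
            attrs["rel"] = "nofollow"
        rendered = "".join(
            ' {}="{}"'.format(k, escape(v or "")) for k, v in attrs.items()
        )
        return "<a{}>".format(rendered)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
            self.out.append(self._a_tag(attrs))
        else:
            # Other tags pass through verbatim; attributes are never scanned.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br />

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.in_link:
            self.out.append(escape(data, quote=False))  # already linked
            return
        pos = 0
        for match in URL_RE.finditer(data):
            url = match.group(0)
            self.out.append(escape(data[pos:match.start()], quote=False))
            self.out.append(
                self._a_tag([("href", url)]) + escape(url, quote=False) + "</a>"
            )
            pos = match.end()
        self.out.append(escape(data[pos:], quote=False))


def linkify_sketch(text, nofollow=True, filter_url=lambda url: url):
    parser = Linkifier(nofollow, filter_url)
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(linkify_sketch("see http://example.com now", filter_url=outbound))
```

Because the regular expression only ever sees text that sits outside an `<a>` element, URLs in attributes and in existing link text are left alone. A real tree, as html5lib builds, also copes with malformed markup, which this sketch does not try to do.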

Bad HTML

Because both clean() and linkify() use html5lib and construct document trees, using either will fix up code mistakes, like unclosed tags, and escape bare entities. linkify() allows basically every tag and attribute, so if you need to limit the legal HTML to a subset, use clean() (or the shortcut bleach() to clean then linkify).

Getting Bleach

Bleach is available on GitHub, or can be installed via pip or easy_install. Improvements and test cases are very welcome! Actually, there’s one disabled test right now covering a case that is not supported. If you can make it work, that would be pretty great!

Work Pattern: Designing Web Sites

James Socol — Mon, 26 May 2008 09:40:02 GMT

The premise of Design Patterns is that similar problems have similar solutions. In the same vein, I propose this Work Pattern: a set of common steps that I use when I create a web site, and that maybe you can use, too.

Elements and Outline

My first step is usually to create an un-styled outline of a “typical” page. I fire up my editor, fill in the basic XHTML, and then go to work inside the `<body>` tag.

Most sites have this fairly common structure: header, content, footer. And just for fun, let’s throw in navigation between the header and the content. It’s pretty easy to represent this in XHTML:

This is my first skeleton for >90% of the sites I design. It’s a very standard document. Sometimes navigation will be inside the header, but most often it goes like this.

Now you have to start thinking about what elements will be on the page. On this site, a blog, I used “articles” instead of “content” for the main div. I also added two side bars, and I knew that inside the articles div I’d want, well, articles.

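    <div id="articles">
      <h2>Recent Articles</h2>
      <div class="article">
        <h2>Article Title</h2>
      </div>
    </div>
    <div id="theblog">
      <h2>Sidebar heading</h2>
      <p>Sidebar paragraph</p>
    </div>
    <div id="theworld">
      <h2>Sidebar heading</h2>
      <ul>
        <li>Sidebar</li>
        <li>list</li>
      </ul>
    </div>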

I won’t bore you with more code examples; I think you get the idea. I make an outline. I know at this point that my source is nice and valid, and that it will make sense when I turn off the stylesheet. I use semantic names for everything.

It’s not very pretty, but I now have a workable XHTML document, with a properly-nested outline, and most of the important elements. Good for me, because now I can start to style them.

Layout and Style

Now, I know what visual elements will need to go on the page. I know what page elements I need to style. Now I’ll start creating a style sheet.

My first style sheet will contain a few basic HTML tags and the elements of my document. I could probably write an XML-to-CSS generator with how strict I am with this step.

Ok, one more code example:

    body {}
    h1, h2, h3, h4, h5, h6 {}
    a:link {}
    a:visited {}
    a:hover {}
    #header {}
    #header h1 {}
    #navigation {}
    #navigation ul {}
    #navigation ul li {}
    #articles {}
    #articles h2 {}
    #articles div.article {}
    #articles div.article h2 {}
    #theblog {}
    #theblog h2 {}
    #theworld {}
    #theworld h2 {}
    #theworld ul {}
    #theworld ul li {}
    #footer {}

One of my favorite things about this is it’s almost impossible for a mistake in one section to mess up anything else.

But obviously there’s a lot in there I can combine, can shorten. Almost anything that’s true for #theblog will also be true for #theworld in this case, so DRY, and keep things together as much as you can. But, when you’re just starting the style sheet, this is a good place to start.
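The “XML-to-CSS generator” mentioned above is easy enough to sketch. The Python below is only an illustration of that idea, not a tool from this workflow: `SelectorCollector`, `css_skeleton` and the `OUTLINE` sample are hypothetical, and it assumes well-formed XHTML in which every element is closed.

```python
from html.parser import HTMLParser


class SelectorCollector(HTMLParser):
    """Collect one descendant selector per element under an id'd ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []      # selector part for each currently open element
        self.selectors = []  # unique selectors, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            part = "#" + attrs["id"]
        elif attrs.get("class"):
            part = tag + "." + attrs["class"].split()[0]
        else:
            part = tag
        self.stack.append(part)
        # Start each selector at the nearest element (self included) with an id.
        id_positions = [i for i, p in enumerate(self.stack) if p.startswith("#")]
        if id_positions:
            selector = " ".join(self.stack[id_positions[-1]:])
            if selector not in self.selectors:
                self.selectors.append(selector)

    def handle_endtag(self, tag):
        # Assumes well-formed XHTML, so every start tag has a matching end.
        if self.stack:
            self.stack.pop()


def css_skeleton(xhtml):
    collector = SelectorCollector()
    collector.feed(xhtml)
    collector.close()
    return "\n".join(selector + " {}" for selector in collector.selectors)


# Hypothetical outline, shaped like the skeleton described in this post.
OUTLINE = """
<div id="header"><h1>Page Title</h1></div>
<div id="navigation"><ul><li>Home</li></ul></div>
<div id="articles">
  <h2>Recent Articles</h2>
  <div class="article"><h2>Article Title</h2></div>
</div>
<div id="footer"></div>
"""

print(css_skeleton(OUTLINE))
```

Rooting every selector at the nearest id keeps each section's rules together, which is the property that stops a mistake in one section from leaking into another.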

As I’m going, I add a lot to the style sheet. I also add a lot to the XHTML template. Pixels get tweaked left and right and I swear at IE6, of course.

Building Templates

Once I have a complete, or near-complete, mock up, it’s time to start building templates for your CMS of choice. This is mostly copy-and-paste work at this point. Your #header and #navigation go into the header template. #footer goes into footer. #content goes in the content template.

See how easy that is?

Then you get to go through and actually add the template mark up. Whether it’s Smarty or PHP or ASP doesn’t really matter; you just replace your dummy text with the right tags.

Starting Out

I love this process, but there is one thing you really need for it to go smoothly:

You need to know what kind of content you’ll have. When you’re redesigning your blog, or building an in-house site, it’s pretty easy to know. When you’re working for a client, you may need to twist some arms to get this information. (I love this A List Apart article for advice on communicating with clients.)

One final thought: use comments. Any time I create a div, I wrap it in comments like this:

I usually use the CSS selector because it’s specific, so #articles, .article, and so on. These comments—which I left out here to save space—have saved me so much time and effort compared to relying on indentation that I can’t imagine working without them.

I didn’t set out this process as a way to streamline my work, but rather, as I started noticing patterns that worked well, I started thinking about the process. Much like Rails, which was already running Basecamp before it was a framework, I’ve been using more-and-more-polished versions of this work flow for months.

Maybe you’ll find it helpful, maybe not. Maybe you already have a “system” in place. If you do, what is it?

The W3C Sucks

James Socol — Thu, 22 May 2008 16:54:53 GMT

“If you wish to be a success in the world, promise everything, deliver nothing.”

If you want to remain the standard-setting body for the web, promise new recommendations, never deliver.

![CSS 2.1 is not even a published recommendation. Off with their (the W3C) heads.](http://coffeeonthekeyboard.com/wp-content/uploads/2008/05/css.png)

A decade ago, the W3C was actively working to improve the standards we designers and developers use every day. Sure, there were some controversial things (HTML 3.0, XML 1.1) that never caught on, but at least there was discussion, thought, and sometimes even action.

The W3C started work on the CSS3 specification the same year they published CSS2—1998. Ten years later, CSS2.1 is still not technically a published recommendation.

Between 1995, a year after the W3C was founded, and 1999, HTML went from version 2, an RFC, to version 4.01. Where is 5? In January of this year it became a Working Draft.

When was XHTML last updated? 2001. The DOM? 2004. MathML? 2003.

What happened?

When did “do nothing group” replace “working group” over there? (Probably around 2004.)

I realize that implementing new standards is not trivial. I also realize that standards are crucial to the continued growth of the web—this site is valid XHTML and uses valid CSS.

However, without updates, these “standards” will get old and die. Something else, or someone else, will replace them. We’ve already used CSS2 for a decade. Will we use it for another? (I want my drop shadows! I want my opacity! I want my rounded corners!)

I led with a quote from Napoleon, so I’ll finish with the French Revolution: Off with their heads. The W3C needs a change in leadership or a vigorous shakedown to get off their asses and do something.

If they’re not willing to put forth the effort, then let them eat cake while someone else does.
