September 21st, 2009

A better strip_tags: pkHtml::simplify

Many web sites allow users to edit content via a rich text editor. And we all know what happens if the user pastes a Word document in there: the styles of the page wind up hopelessly munged. PHP coders can use strip_tags() to limit the HTML tags that get pasted in, but strip_tags() leaves the attributes of the tags it allows untouched, so inline CSS style attributes come along for the ride. So your page is still a mess. Often it is so broken that the user can't figure out how to edit it again. This is the point where they pick up the phone and plead with you for help.

strip_tags also won't help if the HTML is invalid. Unclosed tags can create nightmares elsewhere in the page when users are allowed to submit snippets of HTML that leave a bold tag or a header tag unclosed.
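To make both limitations concrete, here is a small illustration. The pasted markup is invented for the example; only strip_tags() itself is a real PHP function.

```php
<?php
// The tags we are willing to accept, in strip_tags() format.
$allowed = '<p><b>';

// Invented sample of pasted markup: an inline style, a <font> tag
// we do not allow, and a <b> that is never closed.
$pasted = '<p style="margin: 0in"><font face="Calibri">Hello <b>world</font></p>';

$result = strip_tags($pasted, $allowed);

// The <font> tag is gone, but the style attribute and the
// unclosed <b> pass straight through.
echo $result, "\n";
// prints: <p style="margin: 0in">Hello <b>world</p>
```

Anything that renders after this snippet on the page is now bold, and the inline style is still in place.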

A common workaround is to use the FCK (aka CK) rich text editor's "paste as plaintext" mode, which thwarts attempts to paste rich text from another program. That works, after a fashion, but users are forced to do all of their styling all over again. Bold, italic and bulleted lists must be recreated from scratch.

And none of the workarounds are sufficient if you can't at least trust the intentions of your users. This is not a problem when your client logs in to update his site, but it's a big issue when his customers are creating personal profiles and the like.

HTML Tidy can tackle this sort of filtering for you, but it is not standard equipment, so it is often missing from your build of PHP (although some Linux distributions do include it in the box). That matters most on client sites where you've been asked to live with the setup that is available.

HTML Purifier does much the same thing in pure PHP, but since it parses the document directly in PHP it is a bit slow and heavy, which makes you long for the raw performance of strip_tags.

Enter pkHtml::simplify():

    pkHtml::simplify($richTextHTML,
      "<h3><h4><h5><h6><blockquote><p><a><ul><ol><nl><li><b><i>" .
      "<strong><em><strike><code><hr><br><div>" .
      "<table><thead><caption><tbody><tr><th><td>");

If that looks a lot like the arguments to strip_tags(), you're right. That's because pkHtml::simplify() uses strip_tags() as a starting point. But pkHtml::simplify() follows up strip_tags() with a DOMDocument-based filter that removes attributes too, except for the attributes that actually make sense to permit for certain tags. That is, if you are permitting the tag at all, you need those attributes for it to be useful.

Currently these tags and attributes are:

a tag -> href and name attributes
img tag -> src attribute
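The two-pass idea described above can be sketched roughly as follows. This is not the pkHtml::simplify() source: the function name simplify_sketch() and its internals are made up for illustration, it assumes PHP 7 or later, and it ignores character encoding and does not inspect URL schemes in href or src.

```php
<?php
// Hypothetical sketch of a strip_tags() + DOMDocument filter.
// NOT the actual pkHtml::simplify() implementation.
function simplify_sketch(string $html, string $allowedTags): string
{
    // Attributes to keep, per tag (the list given in the article).
    $allowedAttributes = [
        'a'   => ['href', 'name'],
        'img' => ['src'],
    ];

    // Pass 1: drop every tag that is not on the allowed list.
    $html = strip_tags($html, $allowedTags);
    if (trim($html) === '') {
        return '';
    }

    // Pass 2: parse the snippet. The parser closes unclosed tags.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $element) {
        $keep = $allowedAttributes[strtolower($element->nodeName)] ?? [];

        // Collect names first so we never modify the map while iterating it.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $keep, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Return only the contents of <body>, not the whole document.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Invented sample input: Word-style attributes, an onclick handler,
// and an unclosed <b>.
$clean = simplify_sketch(
    '<p style="margin: 0in" class="MsoNormal">See <a href="/about" onclick="evil()">this <b>page</a></p>',
    '<p><a><b>'
);
echo $clean, "\n";
```

In the cleaned result the style, class and onclick attributes are gone, the href survives, and the stray bold tag has been closed by the parser.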

If you're coding for Symfony, you don't have to call pkHtml::simplify() directly. Instead, you can use the sfValidatorHtml validator, also found in pkToolkit, which allows the above list of tags by default because they are well-suited to user-entered content. For the convenience of Symfony developers we also package Dominic Schierlinck's sfWidgetFormRichTextarea widget in pkToolkit. It's meant to be compatible with both MCE and FCK, although we always use FCK.

pkHtml::simplify is short and sweet because it uses the most straightforward tool for each job. For removing disallowed tags, nothing beats the raw speed of strip_tags. But strip_tags doesn't know anything about attributes, so for that purpose we use DOMDocument. Many PHP developers aren't yet familiar with this class, which is standard equipment in PHP 5.

DOMDocument has three great features from our perspective. It can clean up HTML automatically, ensuring that tags are closed. It lets you manipulate attributes painlessly. And since it is built into PHP as a compiled extension, it is much faster than parsing HTML "from scratch" in PHP code.

DOMDocument does want to give you back a complete HTML document with nifty things like a document type and html, head and body elements that we don't need when we're manipulating snippets of user-entered HTML, not truly complete pages. So pkHtml::simplify optionally undoes that for you, leaving you with an HTML container element that's ready to snick into place in the middle of your page body.
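Here is a rough illustration of that wrapper and one way to get rid of it. This is an assumption about how such a step can be done, not necessarily how pkHtml::simplify does it; the sample markup is invented.

```php
<?php
$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
$doc->loadHTML('<p>Hello <b>world</p>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serializing the whole document brings the html and body
// wrapper along with it.
$whole = $doc->saveHTML();

// Serializing only the children of <body> gives back a snippet
// that can be dropped into the middle of a page.
$body = $doc->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $doc->saveHTML($child);
}

echo $whole, "\n";
echo $fragment, "\n";
```

Passing a node to saveHTML() requires PHP 5.3.6 or later. Newer PHP builds also offer the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options to loadHTML(), which ask libxml not to add the wrapper in the first place.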

So: how good is the performance of pkHtml::simplify? Very good indeed. On one project we need to separately clean hundreds of untrusted potential HTML containers in a single XML document, in real time, before presenting portions of that information to the user. pkHtml::simplify has no trouble keeping up in this scenario.

Our Apostrophe CMS takes advantage of pkHtml::simplify() to allow rich text editing without the constant "oops I screwed up my site" issues that come up without a robust server-side filter.

pkToolkitPlugin is available from the Symfony repository and is released under the open-source MIT license. Keep in mind that the pkHtml class is useful to all PHP developers, not just Symfony developers. Just download the tarball and copy the class from the lib folder.

I recommend picking up the latest version of the code from svn. You can also use the tarball versions, but you may need to remove a few Symfony logging calls to use the code in a non-Symfony context.