Understanding Input Formats

Yesterday I had to do some major filter debugging, which taught me a lot about Drupal's filter system. Here are some observations, basic and less basic, on how Drupal's formats and filters work:

Formats vs Filters

The basic relationship between formats and filters is confusing in Drupal, partly because the two words are so similar. An input format is made up of one or more filters. For example, the 'Full HTML' format includes several filters by default ('HTML corrector', 'Line break converter', and 'URL filter'). Clicking on Admin: Site Configuration: Input Formats takes you to /admin/settings/filters, where you can administer formats and their filters.
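
To make the relationship concrete, here is a minimal sketch in Python (not Drupal's own PHP code) that models a format as a named, ordered list of filters. The two toy filters, the `FORMATS` dictionary, and `apply_format` are all hypothetical stand-ins, and the filter order shown is illustrative rather than Drupal's actual default.

```python
import re


def url_filter(text):
    """Toy stand-in for a URL filter: wrap bare http(s) URLs in links."""
    return re.sub(r"(https?://[^\s<]+)", r'<a href="\1">\1</a>', text)


def line_break_converter(text):
    """Toy stand-in for a line break converter: add <br /> before newlines."""
    return text.replace("\n", "<br />\n")


# Hypothetical structure: a format is just a name plus an ordered filter list.
FORMATS = {
    "Full HTML": [url_filter, line_break_converter],
}


def apply_format(text, format_name):
    """Run the text through every filter of the format, in order."""
    for text_filter in FORMATS[format_name]:
        text = text_filter(text)
    return text
```

Calling `apply_format("see http://example.com\nbye", "Full HTML")` runs both filters in turn. The point is only that the format itself does no work: it is the list of filters that transforms the text.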

Input Formats vs. Output Filters

Though formats are sometimes called 'Input Formats', filtering actually takes place when the text is displayed, not when it is entered. The term is still accurate in one sense: a format is chosen when the text is added, and the text is saved along with that choice. The raw text is stored in Drupal's database unchanged and is not filtered until it is displayed.
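
The save-raw, filter-on-display behaviour can be sketched as follows. This is a Python illustration under assumed names (`database`, `save_node`, `render_node`, and the 'Plain' format are all hypothetical), not a description of Drupal's schema.

```python
import html

# Hypothetical stand-in for the table that stores node bodies.
database = {}


def escape_filter(text):
    """Toy filter: escape HTML special characters."""
    return html.escape(text)


# Hypothetical format definitions: format name -> ordered filter list.
FORMATS = {"Plain": [escape_filter]}


def save_node(nid, body, format_name):
    """Store the raw body untouched, together with the chosen format."""
    database[nid] = {"body": body, "format": format_name}


def render_node(nid):
    """Filtering happens here, at display time, using the saved format."""
    row = database[nid]
    text = row["body"]
    for text_filter in FORMATS[row["format"]]:
        text = text_filter(text)
    return text
```

Because only the raw text is stored, changing the filters in a format changes how existing content is displayed without anyone having to re-save that content.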

Filters and Security

Any format which does not contain the HTML filter (for example, the Full HTML format) or which evaluates PHP (the PHP format) is a security risk when untrusted users have access to it. We are currently pushing a patch for D7 which will automatically remind admins of this fact if they set their site up insecurely.
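
The rule above can be written down as a small check. The sketch below is Python and purely illustrative: the data layout, the role names, and the function `risky_formats` are assumptions made for this example, and it is not the D7 patch mentioned above.

```python
def risky_formats(formats, untrusted_roles):
    """Return the names of formats that untrusted roles can use and that
    either lack the HTML filter or include a PHP-evaluating filter.

    `formats` maps a format name to a dict with 'filters' and 'roles'
    lists (a hypothetical layout chosen for this sketch).
    """
    risky = []
    for name, info in formats.items():
        lacks_html_filter = "HTML filter" not in info["filters"]
        evaluates_php = "PHP evaluator" in info["filters"]
        exposed = bool(set(info["roles"]) & set(untrusted_roles))
        if exposed and (lacks_html_filter or evaluates_php):
            risky.append(name)
    return sorted(risky)


# Example data, invented for illustration.
example_formats = {
    "Filtered HTML": {
        "filters": ["URL filter", "HTML filter", "Line break converter"],
        "roles": ["anonymous user", "authenticated user"],
    },
    "Full HTML": {
        "filters": ["URL filter", "Line break converter", "HTML corrector"],
        "roles": ["anonymous user"],
    },
    "PHP code": {
        "filters": ["PHP evaluator"],
        "roles": ["administrator"],
    },
}
```

With this data, only 'Full HTML' is flagged when anonymous and authenticated users are treated as untrusted: it lacks the HTML filter and is exposed to an untrusted role, while the PHP format is limited to administrators.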

Filter Cache

The filter cache serves cached filtered content to anonymous and authenticated users alike. The more complex your filters are, the greater the performance gain from this cache. (I have been working on a site which does extensive XSL transformations to turn XML in node bodies into HTML on the page, making filter caching critical.) Cached filtered content gets cleared out on cron.

Caching happens in the check_markup function. When text with an input format (a node body, comment body, block, etc.) is about to be displayed on the site, the text and its format are passed to check_markup. check_markup takes an md5 hash of the content and looks in the cache_filter table for an entry matching that text and format. If an entry exists, the cached data is returned. If not, the content is run through the filters in its format, and the filtered result is added to the filter cache along with its md5 hash and then displayed.

The cache is not indexed by node ID, so cached content cannot be traced back to its origin. (If you put the same text in two places on your site with the same format, only one cache entry is created for it.) This makes it very difficult to write a filter which needs to know which node its content comes from.

The only exception to filter caching is a format which is not cacheable. A format becomes uncacheable if it contains any uncacheable filter; the PHP filter is one example. Whether or not a filter is cacheable is set by the implementation of hook_filter which defines the filter (there is an $op of 'no cache'). You can set a filter to 'no cache' while you are developing it, but keep in mind that you will have to re-save your format at /admin/settings/filters before the format becomes uncacheable.
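
The lookup described above can be modelled in a few lines. This is a Python sketch, not Drupal's PHP implementation: `check_markup_sketch`, the in-memory `cache_filter` dictionary, the toy filters, and the exact layout of the cache key are all assumptions made for illustration. One deliberate simplification is that cacheability is recomputed on every call here, whereas Drupal only picks up a change once the format is re-saved.

```python
import hashlib


def shout_filter(text):
    """Toy cacheable filter."""
    return text.upper()


def dynamic_filter(text):
    """Toy filter marked uncacheable, standing in for something like PHP."""
    return text


# Hypothetical format definitions: name -> list of (filter, cacheable).
FORMATS = {
    "cached": [(shout_filter, True)],
    "uncached": [(shout_filter, True), (dynamic_filter, False)],
}

cache_filter = {}   # in-memory stand-in for the cache_filter table
filter_runs = []    # records each time the filters actually run


def format_is_cacheable(format_name):
    """A format is cacheable only if every one of its filters is."""
    return all(cacheable for _, cacheable in FORMATS[format_name])


def check_markup_sketch(text, format_name):
    # The key depends only on the format and a hash of the text,
    # never on which node the text came from.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    cache_id = format_name + ":" + digest
    cacheable = format_is_cacheable(format_name)

    if cacheable and cache_id in cache_filter:
        return cache_filter[cache_id]

    filter_runs.append(format_name)
    result = text
    for text_filter, _ in FORMATS[format_name]:
        result = text_filter(result)

    if cacheable:
        cache_filter[cache_id] = result
    return result
```

Rendering the same text twice with the 'cached' format runs the filters once and leaves a single cache entry, no matter how many nodes contain that text. The 'uncached' format runs its filters on every call and never writes to the cache.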
