Premises of WordPress content handling

As we approach the release of WordPress 5.0, which will introduce the new Gutenberg editor, it is worth taking a look at WordPress’s current model for handling user content.

A page request

Consider the following simplification of the data pipeline for a front-end page request:

[Figure: wp-post-pipeline-front-end]

In it, post_content is read from the database, then passed through every filter registered for the_content, most notably the following default filters:

  • do_shortcode, which replaces occurrences of recognized shortcodes with the output of their processing;
  • wpautop, which creates proper <p> paragraphs out of double line breaks;
  • wptexturize, which performs a number of simple typographic improvements, e.g. turning a “dumb” double quote " into “ or ”.
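To make the idea of a filter chain concrete, here is a minimal sketch in Python rather than WordPress's own PHP. The three functions are toy stand-ins named after the filters above, not ports of the real ones; the order in which they run, the [year] shortcode, and the handler registry are assumptions made for illustration.

```python
import re

def do_shortcode(text, handlers):
    """Replace [name] occurrences with the output of a registered handler."""
    def replace(match):
        handler = handlers.get(match.group(1))
        # Unrecognized shortcodes are left untouched.
        return handler() if handler else match.group(0)
    return re.sub(r"\[(\w+)\]", replace, text)

def autop(text):
    """Wrap chunks separated by blank lines in <p> tags (toy version)."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    return "\n".join(f"<p>{c}</p>" for c in chunks)

def texturize(text):
    """Turn paired straight double quotes into curly ones (toy version).

    A real implementation must skip quotes inside HTML tag attributes;
    this sketch does not.
    """
    return re.sub(r'"([^"]*)"', "\u201c\\1\u201d", text)

def apply_filters(content, filters):
    """Pass content through each filter in turn."""
    for f in filters:
        content = f(content)
    return content

# Hypothetical shortcode registry and filter order, for illustration only.
handlers = {"year": lambda: "2018"}
filters = [lambda t: do_shortcode(t, handlers), autop, texturize]

raw = 'She said "hello" in [year].\n\nSecond paragraph.'
html = apply_filters(raw, filters)
print(html)
```

Note that none of these steps builds a model of the document: each one is a string-to-string transformation, which is the point the observations below return to.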

The filtered output is then inserted into the rest of the computed HTML for a given resource. For instance, a page request for http://example.org/2018/01/01/my-post will typically require constructing a header and a footer, and inserting the filtered post content into the <div id="page"> scaffold.
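As a rough illustration of that last step, the sketch below wraps already-filtered content in a page scaffold. The render_page function and the header and footer markup are hypothetical; a real theme builds these parts from its own templates.

```python
from html import escape

def render_page(title, filtered_content):
    """Wrap already-filtered post content in a hypothetical page scaffold."""
    header = f"<html><head><title>{escape(title)}</title></head><body>"
    footer = "</body></html>"
    # The post content is inserted as-is: it has already been through the filters.
    return f'{header}<div id="page">{filtered_content}</div>{footer}'

page = render_page("My post", "<p>Hello</p>")
print(page)
```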

Finally, this generated HTML document reaches the browser, where it is used to render a full page for the delight of the visitor’s eyes.

Editing a post

Consider now the simplified lifecycle of editing a post:

[Figure: wp-post-pipeline-editing]

In it, post_content is read from the database, then inserted into the “frame” for the editor: the WP-Admin layout and the editing area. From here, TinyMCE takes over, spawning a rich-text editor from the raw content; any subsequent rich-text edits translate to changes in the raw content. Finally, either at the user’s request or as part of the auto-saving feature, the updated raw content is sent back to the server, which saves it more or less directly to the database. Functions are called for a few save-related hooks, such as save_post and post_updated, but most deal with record keeping, e.g. cache flushing.
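The round trip can be sketched as follows, again in Python and under heavy simplification. The in-memory database, the add_action and do_action helpers, the post ID, and the hook arguments and firing order are all assumptions; only the hook names save_post and post_updated come from WordPress.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the posts table: post ID -> post_content.
database = {42: "Hello <em>world</em>"}
hooks = defaultdict(list)

def add_action(name, callback):
    """Register a callback for a named hook."""
    hooks[name].append(callback)

def do_action(name, *args):
    """Call every callback registered for a named hook."""
    for callback in hooks[name]:
        callback(*args)

def load_for_editing(post_id):
    """Hand the raw content to the editor, unfiltered."""
    return database[post_id]

def save(post_id, raw_content):
    """Store the content as received, then notify listeners."""
    database[post_id] = raw_content
    do_action("post_updated", post_id)
    do_action("save_post", post_id)

# Record keeping: pretend to flush a cache whenever a post is saved.
flushed = []
add_action("save_post", lambda post_id: flushed.append(post_id))

raw = load_for_editing(42)
edited = raw.replace("world", "reader")  # stands in for edits made in the browser
save(42, edited)
```

What matters here is what is absent: the server never parses or interprets the content on its way in or out.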

Observations

(1) The first point of the above lengthy and potentially obvious description is that WordPress doesn’t natively or semantically understand its contents. For the purposes of this discussion, we are only considering post_content and disregarding metadata, viz. sidebar fields, metaboxes. Indeed, the aforementioned default filters operate at a very low level (character and whitespace recognition) or independently of content format (shortcodes).

(2) On the other hand, (2.1) browsers do understand the content, to a degree: it’s HTML by default. (2.2) It’s that understanding that powers the editor: the stack supporting TinyMCE consists of the browser VM and DOM combined.

(3) Shortcodes operate orthogonally: they are not HTML—and in fact they can get in the way of HTML whenever these two subsystems intersect—and are subject to their own parsing gaps. Put more cynically: they can handle things such as nesting, but that doesn’t mean they should.

(4) Observation #1 is a good thing in that reading and editing contexts aren’t tied to WordPress, but #3 then breaks that separation. Observation #2 is also good in that there is a common language for content (HTML) that editing contexts understand. Can we build from that, i.e. expand our document language while retaining HTML-centrism and absorbing shortcode expressiveness?

(5) (5.1) The issue with HTML is that it has limited semantics. Compare a sample of tags and of what they cannot express:

  • conveying some meaning: article, section, figure
  • conveying no meaning: div, span
  • can’t express specialization: an img representing a film poster vs. an img representing an actor’s portrait
  • can’t express aggregation: what is a series of img?

(5.2) What if we augment our content with meta-HTML? There is a precedent in WordPress in the form of the <!--more--> and <!--nextpage--> tags. We could end up with annotations like:

<!-- gallery -->
  <img src="…">
  <img src="…">
<!-- /gallery -->
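As a sketch of how such annotations could be read back, the following Python snippet extracts comment-delimited regions with a regular expression. This is an illustration under stated assumptions, not a proposed parser: the find_blocks helper and the image file names are hypothetical, and a single regular expression like this one cannot handle a block nested inside another block of the same name.

```python
import re

# Matches <!-- name --> ... <!-- /name -->, capturing the name and the body.
BLOCK = re.compile(r"<!--\s*(\w+)\s*-->(.*?)<!--\s*/\1\s*-->", re.DOTALL)

def find_blocks(html):
    """Return (name, inner HTML) pairs for each annotated region."""
    return [(m.group(1), m.group(2).strip()) for m in BLOCK.finditer(html)]

# Hypothetical post content; the file names are made up for the example.
sample = """<p>Intro</p>
<!-- gallery -->
  <img src="a.jpg">
  <img src="b.jpg">
<!-- /gallery -->"""

blocks = find_blocks(sample)
name, body = blocks[0]
image_count = len(re.findall(r"<img\b", body))
print(name, image_count)
```

A reader that ignores the comments still sees two plain img elements, so the content remains valid HTML, while a reader that understands them knows the images form a gallery.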

Next

The takeaway from this drastically simplified read should be the duality between WordPress as a format-agnostic (vulgo “dumb”) content management system, and WordPress’s adequate leveraging of the environment (the browser) for parsing and serializing content. It provides a complementary perspective for the next article, The Language of Gutenberg.

Thanks: Riad Benguella for review

Image: WordPress texturization

Author: Miguel Fonseca

Engineer at Automattic. Linguist, cyclist, Lindy Hopper, tree climber, and headbanger.
