Damian Cugley’s Weblog

Feel Mark Pilgrim’s distress at the excision of cite from XHTML 2.0’s Text module. The irony is that cite is one of the ‘semantic’ tags (‘logical’ tags, as they used to be called) that is actually used and supported by web browsers. Meanwhile fossils like dfn, kbd and samp are retained.

In proper English-language typography, italics are used for various purposes:

  • Citing things like movies and books (The Fellowship of the Ring, 2001: A Space Odyssey), but not, for example, short stories (‘The Sentinel’);
  • Names of ships (HMS Beagle, HMS Endeavour), but not, for example, pubs (The Beagle and Hounds);
  • Foreign tags (ad nauseam, bête noire), but not when they have become English words in their own right (café, ångström);
  • Words and letters mentioned rather than used (‘the word complex is often confused with complicated’, ‘mind your ps and qs’);
  • Terms being introduced for the first time* (‘we use equivalence relation to mean a relation that is reflexive, symmetric, and transitive...’);
  • Words and letters used as identifiers in mathematical work† (x, y, α, β), with special exceptions for some standard functions like sin and cos;
  • Ditto for writings about computing by authors who think of computing as being related to maths (gcd(a, b), shortest_path, CLUNK); and
  • To indicate emphasis.

The implication of the XHTML 2 draft is that all of the above actual, real-people uses of mark-up only deserve a single tag, em. If we want to have a single element whose semantic, logical, Platonic ur-meaning is ‘text that is printed in italics’, why not just use i and save us some typing?

Meanwhile there are separate XHTML tags for several esoteric usages that exist only in computer literature, and in fact only in computer manuals: code, samp, kbd, and (in some interpretations) var. Now, even computer-literate types are inconsistent in their use of a special typeface to distinguish ‘computer text’ from other text. For identifiers in programs one might argue that italics work nicely and are easier to read (Bjarne Stroustrup uses italics rather than typewriter in the third edition of The C++ Programming Language). Few, if any, find the time to distinguish between typing foo on the keyboard, the character sequence foo it generates, the program fragment foo and the variable foo it is parsed as. And if you are in that position, you probably ought to be using DocBook instead...

I’d wager that even back in the dawn of the WWW, non-computer-related text dominated the Web, starting with those particle-physics databases and the IMDB. The HTML features designed to support computer manuals are a fossil, left over from when the HTML vocabulary was lifted from GNU Texinfo (or something closely related thereto).

Idle speculation

There are precedents for giving different formatting to things that are normally all italicized. The Texinfo conventions for cite, em, and var are _cite_, *em*, and VAR. Donald E. Knuth in The TeXbook distinguishes citation from emphasis, using oblique type for the former and true italic for the latter (the mad fool).
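To make those three plain-text conventions concrete, here is a minimal sketch in Python. It is only an illustration of the conventions as described above, not how Texinfo itself is implemented; the names RENDERERS and render_plain, and the (element, text) representation of a sentence, are my own assumptions.

```python
# Plain-text conventions for three inline elements, as described in the text:
# cite -> _text_, em -> *text*, var -> TEXT (upper case).
# RENDERERS and render_plain are hypothetical names used for illustration.
RENDERERS = {
    "cite": lambda text: f"_{text}_",
    "em": lambda text: f"*{text}*",
    "var": lambda text: text.upper(),
}


def render_plain(spans):
    """Render a list of (element, text) pairs as plain text.

    `element` is None for ordinary text; unknown elements pass through unchanged.
    """
    parts = []
    for element, text in spans:
        render = RENDERERS.get(element, lambda t: t)
        parts.append(render(text))
    return "".join(parts)


sentence = [
    (None, "See "),
    ("cite", "The TeXbook"),
    (None, "; do "),
    ("em", "not"),
    (None, " change "),
    ("var", "filename"),
    (None, "."),
]
print(render_plain(sentence))
```

This prints "See _The TeXbook_; do *not* change FILENAME." The point is that the three elements stay distinguishable even without any italics at all.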

For what it is worth, if I were king of XHTML for a day I would retain cite. Would it also be appropriate to extend it to other names of things like ship names? Many of the above uses of italics are really a form of quotation; they could use q, which after all has never been very successful at supplying quotation marks:

Strunk abhorred the phrase <q>student body</q> and suggested <q>studentry</q> instead.

producing something like

Strunk abhorred the phrase student body and suggested studentry instead.

In print this would be set with italics, but you can see how it could just as easily have used quotation marks. (In principle the use of typewriter text for computery stuff is also a form of quotation and could arguably be a variation on q!)

Actual reported speech would use actual marks of quotation, which in British tradition are &lsquo; and &rsquo; (‘…’), and in American &ldquo; and &rdquo; (“…”).
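The choice of marks by tradition amounts to a small lookup. The following is a rough sketch in Python, assuming the language tags en-GB and en-US stand for the British and American traditions; QUOTE_MARKS and quote are hypothetical names, and no browser is claimed to work this way.

```python
# Outer quotation marks by tradition, as described in the text.
# QUOTE_MARKS and quote are hypothetical names used for illustration.
QUOTE_MARKS = {
    "en-GB": ("\u2018", "\u2019"),  # &lsquo; and &rsquo;
    "en-US": ("\u201c", "\u201d"),  # &ldquo; and &rdquo;
}


def quote(text, lang="en-GB"):
    """Wrap reported speech in the quotation marks of the given tradition."""
    open_mark, close_mark = QUOTE_MARKS[lang]
    return f"{open_mark}{text}{close_mark}"


print(quote("Mind your ps and qs", "en-GB"))
print(quote("Mind your ps and qs", "en-US"))
```

A fuller version would also need the alternate marks for quotations nested inside quotations, which this sketch leaves out.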

This leaves dfn for definitions, var for variable names and similar (metasyntactic variables, formal parameter names, and mathematical symbols). Oh, and em to indicate emphasis only. Hmm. This almost makes sense.

Conclusion

I think I am sticking with XHTML 1.0 for now. To be honest I am still chary of this new-fangled application/xhtml+xml media-type (I still haven’t found out why they want application rather than text). I think that even if XHTML 2 is not intended to be backward-compatible with XHTML 1, it nevertheless should be rich enough that documents may be converted between formats without loss of information. Folding cite into em, on the face of it, violates that principle.
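The loss of information can be stated mechanically: a conversion is reversible only if no two source elements end up as the same target element. Here is a minimal sketch in Python under that assumption. The two mappings are hypothetical and cover only a handful of elements; they are not taken from either specification.

```python
# Hypothetical element mappings from one vocabulary to another, for
# illustration only. KEEP_CITE retains cite; FOLD_CITE folds it into em.
KEEP_CITE = {"cite": "cite", "em": "em", "dfn": "dfn", "var": "var"}
FOLD_CITE = {"cite": "em", "em": "em", "dfn": "dfn", "var": "var"}


def is_lossless(mapping):
    """A mapping can be inverted only if no two source elements share a target."""
    targets = list(mapping.values())
    return len(targets) == len(set(targets))


def invert(mapping):
    """Return the reverse mapping, or raise if distinct elements were merged."""
    if not is_lossless(mapping):
        raise ValueError("mapping merges distinct elements; cannot invert")
    return {target: source for source, target in mapping.items()}


print(is_lossless(KEEP_CITE))
print(is_lossless(FOLD_CITE))
```

With FOLD_CITE, an em in the converted document might have started life as either cite or em, so converting back would have to guess.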

Footnotes

* The element dfn was lifted from Texinfo to cover this case, but was not supported by browsers (it was not shown italicized), so no-one uses it.

† The element var was originally introduced to cover the mathematical and metasyntactic use (being lifted straight out of the Texinfo conventions), but Microsoft Internet Explorer’s designers got it wrong and used the monospace font for var, in effect changing its meaning. The XHTML 2 description tends towards the latter interpretation.

Update (1 November 2006). Linked to from Cafe con Leche XML News and Resources. Corrected the spelling of bête. Since I wrote this, the Firefox programmers have elaborated their implementation of the q element to generate language-sensitive quotation marks. Joe Clark’s article ‘How not to fix HTML’ has more on how we computer scientists promote computer-specific HTML elements like kbd over additions that might be useful in non-technical publishing.

13 January 2003
