Challenges getting MoinMoin to output XHTML
This page is my personal ramblings as I try to figure this topic out. If this becomes a serious topic in which others are interested, let's then create a separate page outside of my home space.
See Discussion at the end of the page.
As of 1.5.0 there are lots of things inside MoinMoin that prevent it from outputting strict XHTML, or for that matter even just well-formed XML (i.e., the tags are properly nested and closed off).
The patch MoinMoinPatch/FormatterApiConsistencyForHtmlAttributes goes a long way to allowing the basic output formatters (text_html.py in particular) to output well-formed XHTML.
Simple syntax stuff
Self-closing tags: The other HTML fragments that are sitting around in various python files also still need to be changed to conform, such as changing <link> to <link />, <br> to <br />, or <input> to <input />, and so on. This change is quite easy, but tedious. The list of commonly self-closing tags to watch for includes:
<area/>, <base/>, <basefont/>, <br/>, <col/>, <frame/>, <hr/>, <img/>, <input/>, <link/>, <meta/>, <param/>
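As a rough aid for hunting these down, the rewrite can be sketched with a regular expression. This is only a sketch, not part of MoinMoin: the function name `self_close_void_tags` is made up for illustration, and it assumes simple fragments where no attribute value contains a literal `<` or `>`.

```python
import re

# The self-closing elements listed above.
VOID_TAGS = ("area", "base", "basefont", "br", "col", "frame",
             "hr", "img", "input", "link", "meta", "param")

# An opening tag for one of those elements, self-closed or not.
# The \b keeps "col" from matching "colgroup", "frame" from "frameset", etc.
_VOID_RE = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS), re.IGNORECASE)


def self_close_void_tags(fragment):
    """Rewrite <br>, <BR/>, <input ...> and friends as '<name ... />'."""
    def fix(match):
        name = match.group(1).lower()      # XHTML element names are lowercase
        attrs = match.group(2).rstrip()
        return "<%s%s />" % (name, attrs)
    return _VOID_RE.sub(fix, fragment)


print(self_close_void_tags('<link rel="stylesheet" href="a/b.css"><br>'))
```

A regular expression is a heuristic here; it is good enough for finding fragments to fix by hand, but it is not a substitute for generating the tags correctly in the formatter.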
Also watch out for <p> without a closing </p>.
The <script> tag should always have a closing </script>, even if it has no contents (such as if it only links to an external file via the src attribute), since IE will ignore it otherwise.
Tables: All <table> elements being output should also include <tbody> elements. This is not only correct XHTML, but can also have an important influence on CSS (some browsers render CSS incorrectly if there is no tbody element). As Moin rarely uses other table components like thead, tfoot, or caption, the simple solution is to just always output tbody with the table element as one unit:
{{{<table><tbody>
<tr><td> .... </td></tr>
</tbody></table> }}}
Other simple things: Must change the DOCTYPE, use the xmlns attribute on the <html> element, etc.
{{{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 ...
</html> }}}
Content type
True XHTML should be served with a content type (MIME type) of "application/xhtml+xml", and not "text/html". All modern browsers except IE support that. Fortunately you can in almost all cases use the same output for either MIME type: as long as it's valid XHTML, serving it as just text/html will almost always work too. The exceptions are:
- CDATA sections don't work in text/html
- Character entities (such as &nbsp;) must be changed to numeric character references (such as &#160;) when going to application/xhtml+xml. The exceptions are the entities that XML itself predefines: &amp;, &lt;, &gt;, &quot;, and &apos;. Note that &apos; is not defined in HTML 4, so it should still be converted (to &#39;) if the same output may also be served as text/html.
Are you sure? &nbsp; and other entities are correct XHTML AFAIK, I don't know if it changes when you mix XHTML with other XML markup (StefanoZacchiroli)
Yes. I regularly write/use web apps that output XHTML 1.0 strict with application/xhtml+xml (but not MoinMoin yet), and I can tell you that all browsers that support it will barf on an &nbsp; with an XML parsing error. I don't know if there's anything to fix in MoinMoin, but I mentioned it to be complete. (Actually you can provide an internal DTD entity a la <!DOCTYPE> to define those HTML-ish entities explicitly, but that's worse than just using numeric references) -- DeronMeranda 2006-06-21 15:19:29
I (quickly) checked the specification of XHTML and XML and it seems indeed you're right. Nice to know that the W3C checker available at http://validator.w3.org/check/referer is not reliable. It indeed readily claims that the document I've made available at http://mowgli.cs.unibo.it/~zacchiro/a.html is valid XHTML while it is not (since it contains the infamous &nbsp; entity). Still, Firefox has no problem displaying it. I will add changing the entities to &#160; to the TODO list of my patch, then. Thanks! (StefanoZacchiroli)
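The entity conversion discussed above could be sketched roughly as follows. This is not MoinMoin code: the function name `entities_to_numeric` is hypothetical, and the sketch uses Python 3's `html.entities` module (in the Python 2 of Moin 1.5 the same table lives in `htmlentitydefs`). It leaves alone the five entities that XML itself predefines (amp, lt, gt, quot, apos).

```python
import re
from html.entities import name2codepoint

# Entities predefined by XML 1.0; these are safe in application/xhtml+xml.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_numeric(markup):
    """Replace HTML named entities such as &nbsp; with numeric references."""
    def fix(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)          # leave unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return _ENTITY_RE.sub(fix, markup)


print(entities_to_numeric("a&nbsp;b &amp; c"))
```

Numeric references work under both MIME types, so the converted output does not need to know how it will be served.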
Determining which content type to serve is quite easy. Just look for "application/xhtml+xml" in the HTTP request's Accept header. If it's not present you must fall back to text/html.
This is the last thing to do. You should not serve as application/xhtml+xml until all the other problems are fixed and the html output is strictly conforming!
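The Accept header check might look something like the sketch below. The function name `choose_content_type` is made up for illustration and is not an existing MoinMoin API. One detail goes beyond the simple substring test described above: an entry with `q=0` means "not acceptable", so the sketch treats that case as absent.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def choose_content_type(accept_header):
    """Return the MIME type to serve for a given HTTP Accept header value."""
    for item in (accept_header or "").split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != XHTML_TYPE:
            continue
        q = 1.0                            # default quality when no q= is given
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0                # malformed q-value: be conservative
        if q > 0:
            return XHTML_TYPE
    return HTML_TYPE                       # fallback for IE and anything else


print(choose_content_type("text/html,application/xhtml+xml,*/*;q=0.8"))
```

Since the response now varies with a request header, it is probably wise to also send `Vary: Accept` so that proxies do not hand the XHTML variant to a browser that cannot parse it.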
Line number anchors
The wiki parsers put invisible anchors into the output at every point where the wiki source line number increases. For the html formatter, this results in a <span> element (with no visible content). However, this span can be output at any place, even where a span must not occur (such as inside a <table> but outside a <td>, or inside a <ul> but not inside a <li>).
The formatter probably needs to delay writing anchors until the next legal place for a span element.
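To make the idea concrete, here is a toy writer that holds anchors back until a span is legal. Everything in it is hypothetical (the class `AnchorDeferringWriter`, its methods, and the partial `NO_SPAN_PARENTS` list); it is not how text_html.py is structured. It also assumes the output lands inside a container such as the page content <div>, so that a span at the top level is acceptable.

```python
# Parents that may not contain <span> as a direct child
# (a partial list, chosen for illustration).
NO_SPAN_PARENTS = {"table", "tbody", "thead", "tfoot", "tr", "ul", "ol", "dl"}


class AnchorDeferringWriter:
    def __init__(self):
        self.out = []
        self.stack = []      # names of the currently open elements
        self.pending = []    # anchor ids waiting for a legal position

    def _flush(self):
        """Emit queued anchors if the current parent allows a <span>."""
        if self.pending and (not self.stack
                             or self.stack[-1] not in NO_SPAN_PARENTS):
            for anchor_id in self.pending:
                self.out.append(
                    '<span class="anchor" id="%s"></span>' % anchor_id)
            self.pending = []

    def open(self, tag):
        self.out.append("<%s>" % tag)
        self.stack.append(tag)
        self._flush()

    def close(self, tag):
        if not self.stack or self.stack[-1] != tag:
            raise ValueError("mismatched </%s>" % tag)
        self.stack.pop()
        self.out.append("</%s>" % tag)
        self._flush()

    def text(self, s):
        self._flush()
        self.out.append(s)

    def anchor(self, anchor_id):
        self.pending.append(anchor_id)
        self._flush()

    def result(self):
        return "".join(self.out)


w = AnchorDeferringWriter()
w.open("table"); w.anchor("line-1"); w.open("tr"); w.anchor("line-2")
w.open("td"); w.text("x"); w.close("td"); w.close("tr"); w.close("table")
print(w.result())
```

Note that this relies on the formatter seeing its calls in document order, which the page caching described under "Caching and dynamic content" does not currently guarantee.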
Uniqueness of ids: When the [[Include()]] macro is used to include the source of one wiki page inside another, the line number ids used can be duplicated! This is not allowed in XHTML, where all id attribute values must be unique per document.
{{{<span class="anchor" id="line-1"></span>This is in the parent document.
<span class="anchor" id="line-1"></span>This is in the included document. }}}
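One possible way around the duplicates is to route every id through a small per-request registry that appends a suffix on collision. The sketch below is an assumption about how that could look; `IdRegistry` is a hypothetical name, not something that exists in MoinMoin.

```python
class IdRegistry:
    """Hands out id values that are unique within one output document."""

    def __init__(self):
        self._issued = set()

    def unique(self, base):
        candidate, n = base, 1
        # Keep appending a counter until the id has not been used yet.
        while candidate in self._issued:
            n += 1
            candidate = "%s-%d" % (base, n)
        self._issued.add(candidate)
        return candidate


ids = IdRegistry()
print(ids.unique("line-1"))   # id for the parent document
print(ids.unique("line-1"))   # id for the included document
```

An alternative that needs no shared state would be to prefix the ids of included content with something derived from the included page's name; the registry has the drawback that links to an anchor depend on the order in which ids were handed out.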
Caching and dynamic content
Right now the weirdest obstacle seems to be the way the page caching system works. This is done by formatter/text_python.py with its Formatter class. It acts as a front-end to the actual formatter being used, such as formatter/text_html.py. It has a nasty problem where it can re-order the calls to formatter methods. See DeronMeranda/DiscussPythonFormatter.
The current way it works is that it may delay calling some formatter methods, based upon whether it thinks the output is static or dynamic. When creating a cache for the page it actually generates python code. It will go ahead and call the real formatter for all static content and just put the generated html fragments in the cache. But for dynamic content it will instead insert python code that will call the formatter methods when the cached page is retrieved (and not when the cache is created). This is how, for instance, some macros like [[FullSearch]] will always output the latest results, even though the wiki source for the page on which the macro belongs has not changed.
However, this mix of dynamic and static content means that the real formatter gets its methods called all out of order (time-wise), even though the final output will effectively have everything pasted back together in the correct document order. But this also means that the formatter class cannot hope to keep track of any state during the formatting. It cannot do things like keep track of an accurate stack of nested HTML tags.
As a real case, the formatter will currently get called on to output an </h1> before it gets called to output the corresponding <h1> tag!
One possible compromise which may partially help would be to ensure that openers and closers are always called in the correct order (e.g., if <h1> is delayed to page-view time, then </h1> should be as well).
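The reordering and the proposed compromise can be illustrated with a toy model. This is deliberately not the real text_python.py, which generates Python source for the cache; here deferred calls are simply stored as tuples. All names (`TagTrackingFormatter`, `build_cache`, `render_cache`, `pair_closers`) are invented for the sketch.

```python
class TagTrackingFormatter:
    """Stand-in for a real formatter that tries to track open tags."""

    def __init__(self):
        self.stack = []
        self.errors = []

    def open(self, tag):
        self.stack.append(tag)
        return "<%s>" % tag

    def close(self, tag):
        if not self.stack or self.stack[-1] != tag:
            self.errors.append("unexpected </%s>" % tag)
        else:
            self.stack.pop()
        return "</%s>" % tag

    def text(self, s):
        return s


def build_cache(calls, formatter):
    """Run static calls now; keep dynamic ones as (method, arg) tuples."""
    cache = []
    for method, arg, dynamic in calls:
        if dynamic:
            cache.append((method, arg))
        else:
            cache.append(getattr(formatter, method)(arg))
    return cache


def render_cache(cache, formatter):
    """Run the deferred calls and paste everything together in order."""
    parts = []
    for item in cache:
        if isinstance(item, tuple):
            method, arg = item
            parts.append(getattr(formatter, method)(arg))
        else:
            parts.append(item)
    return "".join(parts)


def pair_closers(calls):
    """Give each closer the same static/dynamic flag as its opener."""
    fixed, open_flags = [], []
    for method, arg, dynamic in calls:
        if method == "open":
            open_flags.append(dynamic)
        elif method == "close" and open_flags:
            dynamic = open_flags.pop()
        fixed.append((method, arg, dynamic))
    return fixed


# A heading whose opener was judged dynamic but whose closer was not.
calls = [("open", "h1", True), ("text", "Title", False), ("close", "h1", False)]

f = TagTrackingFormatter()
print(render_cache(build_cache(calls, f), f), f.errors)

f = TagTrackingFormatter()
print(render_cache(build_cache(pair_closers(calls), f), f), f.errors)
```

In both runs the pasted-together output is the same; the difference is only in what the formatter sees. The pairing rule fixes the opener/closer order for a given element, but static elements nested inside a dynamic one are still formatted before their parent is opened, which is why this is only a partial fix.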
Javascript
There are two basic issues with Javascript code that is output:
Use of document.write and such. XHTML can only be actively modified via standard DOM methods.
Simple escaping. Common characters in Javascript like && are reserved XML characters. They need to be escaped.
For the escaping case, it depends on whether the document is served as text/html or application/xhtml+xml, as well as whether the Javascript is inside <script> elements or directly inline, such as in an onchange event handler attribute.
- text/html case
  - inside <script>. Typical comment-wrapper hacks work.
  - inline in an attribute. The code should be HTML-escaped just like anything else. Must call wikiutil.escape().
- application/xhtml+xml case
  - inside <script>. Should wrap it all in a CDATA section. Do not use comment-wrapper hacks, as in XHTML this will be a real comment and the javascript code will not be executed.
  - inline in an attribute. Must HTML-escape the code, same as in the text/html case.
CDATA sections would look like:
{{{<script type="text/javascript"><![CDATA[
 ...
]]></script> }}}
Putting most javascript in external *.js files is probably a good thing. Also, any complex code inside attribute values should probably be converted into function calls, where the code is defined inside a <script> instead.
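The escaping rules above could be collected into two small helpers along these lines. This is a sketch with made-up names (`script_element`, `event_attribute`); it uses Python 3's `html.escape` as a stand-in for MoinMoin's own wikiutil.escape(), and the `xhtml` flag is assumed to reflect whichever content type is being served.

```python
from html import escape


def script_element(js_code, xhtml):
    """Wrap inline Javascript in a <script> element for either MIME type."""
    if xhtml:
        # application/xhtml+xml: a CDATA section, never a comment wrapper.
        if "]]>" in js_code:
            raise ValueError("']]>' cannot appear inside a CDATA section")
        body = "<![CDATA[\n%s\n]]>" % js_code
    else:
        # text/html: the traditional comment-wrapper hack.
        body = "<!--\n%s\n// -->" % js_code
    return '<script type="text/javascript">%s</script>' % body


def event_attribute(name, js_code):
    """Build an event handler attribute; escaping is the same in both cases."""
    return '%s="%s"' % (name, escape(js_code, quote=True))


print(script_element("if (a && b) { go(); }", xhtml=True))
print(event_attribute("onchange", 'if (a && b) go("x")'))
```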
CSS stylesheet namespace
Furthermore, for inline CSS stylesheets using <style> elements in the <head> section, it can be quite useful (especially for mixed XML, such as with MathML or SVG) if the default namespace is set via the @namespace at-rule (with the same value as the <html> xmlns attribute has):
{{{<style type="text/css"><![CDATA[
@namespace url(http://www.w3.org/1999/xhtml);
 ...
]]></style> }}}
This is not terribly important, but is probably easy to do.
Discussion
I added a feature request on this subject, see FeatureRequests/ValidXHTMLOutput -- StefanoZacchiroli