Project: Tree-based output formatter

This project is part of GoogleSoc2008.

This project adds a new tree-based interface between the different parts of the output rendering. This tree can be modified in different ways during rendering. It gets a distinct MIME type so that it fits into the conversions described below. All conversions done by this project use MIME types as identifiers. The input can be of any type, e.g. a page with MIME type text/x-moin1.7, an image with image/jpeg or raw data with application/octet-stream. Each converter may support several input/output MIME types.

The rendering of a wiki page is done in several steps:

Converter text/x-moin1.7 -> application/x-moin-document.
- If other types such as Python source are included, each is converted on its own with the appropriate converter and embedded into the tree.
Macro handling mangles the tree.
- The "Include" macro dumps the tree of another document into the current tree.
- The "TOC" macro creates the table of contents of the complete document and embeds it in the tree.
Converter application/x-moin-document -> application/x-xhtml-moin-page
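The three steps above can be sketched as a small pipeline. This is only an illustration, not the project's actual code: the CONVERTERS registry, the converter decorator, the toy line-based parser and the element names (page, p) are all assumptions made up for the example, and xml.etree.ElementTree stands in for whatever tree implementation is finally chosen.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: (input type, output type) -> converter function.
CONVERTERS = {}

def converter(input_type, output_type):
    """Register a function as the converter between two MIME types."""
    def register(func):
        CONVERTERS[(input_type, output_type)] = func
        return func
    return register

@converter("text/x-moin1.7", "application/x-moin-document")
def wiki_to_tree(source):
    # Toy parser: every non-empty line becomes one paragraph element.
    page = ET.Element("page")
    for line in source.splitlines():
        if line.strip():
            ET.SubElement(page, "p").text = line.strip()
    return page

def expand_macros(tree):
    # Placeholder for the macro step; a real one would modify the tree.
    return tree

@converter("application/x-moin-document", "application/x-xhtml-moin-page")
def tree_to_xhtml(tree):
    # The output type uses div as its root element.
    div = ET.Element("div")
    for para in tree.iter("p"):
        ET.SubElement(div, "p").text = para.text
    return div

def render(source, input_type):
    """Run the three steps: parse, handle macros, convert to XHTML."""
    tree = CONVERTERS[(input_type, "application/x-moin-document")](source)
    tree = expand_macros(tree)
    return CONVERTERS[("application/x-moin-document",
                       "application/x-xhtml-moin-page")](tree)

html = render("Hello\n\nWorld", "text/x-moin1.7")
print(ET.tostring(html, encoding="unicode"))
# prints <div><p>Hello</p><p>World</p></div>
```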

The same can be used for image/* (and also application/octet-stream):

Converter image/* -> application/x-moin-document
- The generated tree includes an image element which embeds the image into the page but may also render further information such as EXIF data.
Macro handling (may not do anything useful with this tree, but it is generic)
Converter application/x-moin-document -> application/x-xhtml-moin-page
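Selecting a converter for a wildcard type such as image/* could work roughly as follows. The REGISTERED list and find_converter are hypothetical names, and shell-style matching via fnmatch is an assumption; the page does not say how wildcard types are resolved.

```python
from fnmatch import fnmatchcase

# Hypothetical registrations: (input type pattern, converter name).
REGISTERED = [
    ("text/x-moin1.7", "wiki converter"),
    ("image/*", "image converter"),
    ("application/octet-stream", "raw data converter"),
]

def find_converter(mime_type):
    """Return the first converter whose pattern matches the given type."""
    for pattern, name in REGISTERED:
        if fnmatchcase(mime_type, pattern):
            return name
    return None

print(find_converter("image/jpeg"))
print(find_converter("application/octet-stream"))
```

With this lookup, image/jpeg and image/png both end up at the same image converter, while a type nobody registered yields None.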

Types

text/x-moin1.7

Wiki source as used in MoinMoin 1.7.

application/x-moin-document

New intermediate tree format.

Like any spec, it should reuse appropriate standards. What comes to mind is DC (Dublin Core) for author and similar information (see [DCMI]). The text part may be a proper subset of OpenDocument (see [ODF]) or DocBook. It will also allow HTML in its own (real) namespace (see [XHTML1.1]).

Includes are done with XInclude and XPointer (see [XInclude] and [XPointer]). It may need a special XPointer function to support everything the current Include macro supports.
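A minimal sketch of an XInclude-based include, using the xml.etree.ElementInclude module from the Python standard library. The PAGES dictionary and the loader function are stand-ins for a real page store, and the element names are made up. The sketch includes whole documents only and does not use XPointer.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical page store: page name -> already converted tree (as XML text).
PAGES = {
    "OtherPage": "<page><p>Included text</p></page>",
}

def loader(href, parse, encoding=None):
    # ElementInclude calls this for every xi:include element it finds.
    return ET.fromstring(PAGES[href])

doc = ET.fromstring(
    '<page xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<p>Before</p><xi:include href="OtherPage"/></page>'
)

# Replaces the xi:include element in place with the loaded tree.
ElementInclude.include(doc, loader=loader)

print(doc[1].tag, doc[1][0].text)
```

After the call, the second child of the outer page is the root of OtherPage, so the included tree is an ordinary part of the document.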

application/x-xhtml-moin-page

XHTML subset without html, head (and its contents) and body. Unlike application/xhtml+xml, it specifies div as the root element and can be embedded into the theme to generate the real output. This type is only used internally.
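Embedding such a div-rooted fragment into the theme might look like the following. embed_in_theme is a hypothetical helper; a real theme would add navigation, stylesheets and so on around the fragment.

```python
import xml.etree.ElementTree as ET

def embed_in_theme(fragment, title):
    """Wrap a div-rooted page fragment in a complete HTML document."""
    if fragment.tag != "div":
        raise ValueError("page fragment must have div as root element")
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    body.append(fragment)
    return html

fragment = ET.fromstring("<div><p>Page content</p></div>")
document = embed_in_theme(fragment, "FrontPage")
print(ET.tostring(document, encoding="unicode"))
```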

Macro handling

The internal tree will use a slightly different macro definition than the wiki input. Some macros like BR, Include and TOC will be promoted to pseudo-macros and interpreted (_not_ expanded) by the wiki parser.

Macros need to know the context (block vs. inline) they are used in.

BR: It needs to be represented in the tree anyway because it is highly output-dependent. HTML implements it as a br element, ODF as text:line-break.
Include: It needs to be handled specially because normal macro results should not be macro-expanded again. May use XInclude (see [XInclude]).
TOC: This is not yet decided, but many output formats support automatic TOC generation.
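To illustrate why BR has to survive in the tree until the output format is known, here is a sketch that maps a BR node to the element of the chosen output format. The macro element with its name attribute is an invented representation; the two target elements are the ones named above (br in XHTML, text:line-break in ODF).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ODF_TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Output format -> element that represents a line break there.
LINE_BREAK_TAGS = {
    "html": "{%s}br" % XHTML_NS,
    "odf": "{%s}line-break" % ODF_TEXT_NS,
}

def convert_line_breaks(tree, output):
    """Rewrite every BR pseudo-macro node for the given output format."""
    # list() so that renaming elements cannot disturb the iteration.
    for elem in list(tree.iter("macro")):
        if elem.get("name") == "BR":
            elem.tag = LINE_BREAK_TAGS[output]
            elem.attrib.clear()
    return tree

para = ET.fromstring('<p>first line<macro name="BR"/>second line</p>')
convert_line_breaks(para, "odf")
print(para[0].tag)
```

The text after the break stays attached to the renamed element, so the paragraph content is unchanged apart from the break itself.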

Plugin compatibility

The modifications affect three types of MoinMoin plugins: parser, macro and formatter. Parser and macro plugins which only use the public formatter API should work with a special implementation of this API which produces a tree instead of complete output; plugins which generate output directly or even use request.write will not work. Compatibility support for formatter plugins will not be provided.
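Such a compatibility implementation could look roughly like this. TreeFormatter is a hypothetical class; its three methods are only loosely modelled on the on/off style of the existing formatter calls, and the real API is much larger. Each call returns an empty string because plugins usually concatenate the return values, while the actual result is collected in the tree.

```python
import xml.etree.ElementTree as ET

class TreeFormatter:
    """Formatter-like object that builds a tree instead of output text."""

    def __init__(self):
        self.root = ET.Element("page")
        self._stack = [self.root]

    def _open(self, tag):
        self._stack.append(ET.SubElement(self._stack[-1], tag))
        return ""

    def _close(self):
        self._stack.pop()
        return ""

    def paragraph(self, on):
        return self._open("p") if on else self._close()

    def strong(self, on):
        return self._open("strong") if on else self._close()

    def text(self, text):
        parent = self._stack[-1]
        if len(parent):
            # ElementTree stores text after a child in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + text
        else:
            parent.text = (parent.text or "") + text
        return ""

f = TreeFormatter()
out = (f.paragraph(1) + f.text("Hello ") + f.strong(1) + f.text("wiki")
       + f.strong(0) + f.text("!") + f.paragraph(0))
print(ET.tostring(f.root, encoding="unicode"))
# prints <page><p>Hello <strong>wiki</strong>!</p></page>
```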

Macros

AbandonedPages: See RecentChanges
Action: unused?
AdvancedSearch: Raw HTML
Anchor: Formatter only, unknown
AttachInfo: unknown
AttachList: unknown
BR: Move to parser
Date, DateTime: Formatter only
EditedSystemPages: Formatter only
EditTemplates: unknown
EmbedObject: Raw HTML
FootNote: Move to parser
FullSearch: Raw HTML
FullSearchCached: See FullSearch
GetText: Formatter only
GetText2: Formatter only
GetVal: Formatter only
GoTo: Raw HTML
Hits: Raw text
Icon: Formatter only
Include: Move to parser
InterWiki: Formatter only
LikePages: Formatter only, recheck
MailTo: Formatter only
MonthCalendar: Raw HTML
Navigator: Raw HTML
NewPage: Raw HTML
OrphanedPages: Formatter only
PageCount: Formatter only
PageHits: Formatter only
PageList: Raw HTML
PageSize: Formatter only
RandomPage: Formatter only
RandomQuote: unknown
RecentChanges: Raw HTML
ShowSmileys: widget.browser.DataBrowserWidget
StatsChart: unknown
SystemAdmin: Formatter, raw HTML
SystemInfo: Raw HTML
TableOfContents: Move to parser
TemplateList: Formatter only
TitleIndex: unknown
TeudView: Raw HTML
TitleSearch: unknown
Verbatim: Formatter only
WantedPages: Formatter, raw HTML
WordIndex: unknown

Parser

text/*: Formatter only
text/cplusplus: ParserBase based.
text/creole: Own tree, formatter
text/csv: widget.browser.DataBrowserWidget
text/diff: ParserBase based.
text/docbook: Reuses wiki parser.
text/html: Raw HTML
text/irssi: Formatter
text/java: ParserBase based
text/moin-wiki: Formatter only
text/pascal: ParserBase based
text/python: To be removed (compiles into python code)
text/rst: unknown
text/xslt: unknown, raw HTML

In-memory tree format

There are two approaches to implementing tree structures. One is low-level like DOM, which only defines elementary types like text, comment and node. The other is a high-level tree which includes nodes for paragraphs, links and so on. My intention was to use a low-level tree because I want extensibility. In the discussion I wrote the following:

There is no usual way. DOM uses a low-level set of items (node, attribute, text and some more) which can represent the whole set of inputs, see [DOM]. Encoding the node types into the classes only works if you know all possible inputs; otherwise you end up with a catch-all node again.

Let's take an example: I want to include MathML, see [MathML]. MathML is an XML application. There are two ways to do that:

Use it literally as text. (This IMHO contradicts the reason for this whole project.)
Parse it and make it part of the tree.
- If using special nodes, you need to either create x node types or have a catch-all node which includes the name.
- If using low-level nodes nothing special needs to be done.
[DOM] - http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/
[MathML] - http://www.w3.org/TR/2003/REC-MathML2-20031021/
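The MathML case with low-level nodes can be sketched with ElementTree as a stand-in for the tree implementation. The page and p elements are invented for the example; the MathML namespace is the one from the specification. No MathML-specific node class is needed, since the namespace-qualified tag name carries all the information.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

page = ET.Element("page")
para = ET.SubElement(page, "p")
para.text = "Euler: "

# Parse the MathML fragment and attach it like any other subtree.
math = ET.fromstring(
    '<math xmlns="%s"><mi>e</mi><mo>=</mo><mn>2.718</mn></math>' % MATHML_NS
)
para.append(math)

# Generic code can still walk the complete tree, MathML included.
tags = [elem.tag for elem in page.iter()]
print(tags)
```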

Also I think the tree should have a stable "dump" format. XML would be a standardized option. Also someone mentioned that it may be easier to compare the dumps in unit tests instead of direct inspection of the tree.

Or use xpath / other xml tools for the tests.
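Both test styles are easy to try with ElementTree, which supports a limited XPath subset in findall. The tree below is invented for the example.

```python
import xml.etree.ElementTree as ET

# Tree as a converter under test might produce it (made up here).
tree = ET.Element("page")
ET.SubElement(tree, "p").text = "Hello"
link = ET.SubElement(tree, "a", href="FrontPage")
link.text = "home"

# Style 1: compare the complete dump against the expected serialization.
dump = ET.tostring(tree, encoding="unicode")
assert dump == '<page><p>Hello</p><a href="FrontPage">home</a></page>'

# Style 2: query only the part the test cares about.
targets = [a.get("href") for a in tree.findall(".//a[@href]")]
assert targets == ["FrontPage"]
```

The dump comparison catches every change in the output, while the query-based check keeps passing when unrelated parts of the tree change.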

There are several XML and tree implementations: xml, xml.etree, xml.dom.minidom and lxml.

xml: AFAIK unmaintained, libxml as dependency.
xml.etree (ElementTree): Actively maintained, one large API problem: no text nodes.
xml.dom.minidom: Old, was never really usable.
lxml: libxml as dependency.

Because of this, I think the best solution is an ElementTree fork which fixes the API problem. I don't really like forking software, but anything else would introduce compiled extensions.
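The API problem is easy to demonstrate with the unmodified xml.etree.ElementTree. Text is not a node of its own but is stored in the text and tail attributes of the neighbouring elements:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>wiki</b> world</p>")

print(repr(p.text))     # 'Hello '  (text before the first child)
print(repr(p[0].text))  # 'wiki'
print(repr(p[0].tail))  # ' world'  (text after b is stored on b itself)

# Consequence: removing the b element silently drops the text after it.
bold = p[0]
p.remove(bold)
print(ET.tostring(p, encoding="unicode"))
# prints <p>Hello </p>
```

Code that moves or replaces elements therefore has to take care of the tail by hand, which is awkward for the kind of tree manipulation the macro step needs.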

Cacheability

The initial tree only depends on the input page. It should be cached directly after the edit. It is also possible to already expand all "stable" (non-volatile) macros at this time. The tree can be converted to HTML in this half-expanded state and cached.

/!\ How can we convert to HTML without fully expanding? E.g. if there is an Include or TOC macro, this could be a problem IMHO.

The converter to html may be applied several times to the tree and will only touch things it knows but will leave the already existing html intact.
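A sketch of a converter that can be applied repeatedly. It rewrites only the elements it knows and skips everything else, including XHTML produced by an earlier pass and macros that are not yet expanded. MOIN_NS is a placeholder namespace, and the element names, the KNOWN table and the Hits expansion are made up for the example.

```python
import xml.etree.ElementTree as ET

MOIN_NS = "http://example.org/moin-page"  # placeholder, not a real namespace
XHTML_NS = "http://www.w3.org/1999/xhtml"

def q(ns, name):
    """Build a namespace-qualified tag name in ElementTree notation."""
    return "{%s}%s" % (ns, name)

# Internal elements the converter knows and their XHTML counterparts.
KNOWN = {
    q(MOIN_NS, "p"): q(XHTML_NS, "p"),
    q(MOIN_NS, "strong"): q(XHTML_NS, "strong"),
}

def to_html(tree):
    """Convert known elements; leave unknown ones and existing XHTML alone."""
    for elem in list(tree.iter()):
        new_tag = KNOWN.get(elem.tag)
        if new_tag is not None:
            elem.tag = new_tag
    return tree

page = ET.Element(q(MOIN_NS, "page"))
para = ET.SubElement(page, q(MOIN_NS, "p"))
para.text = "Visitors: "
macro = ET.SubElement(para, q(MOIN_NS, "macro"), name="Hits")

to_html(page)  # first pass: this half-expanded state could be cached
first_pass = (para.tag, macro.tag)

# Later, per request: expand the volatile macro, then convert again.
macro.tag = q(MOIN_NS, "strong")
macro.attrib.clear()
macro.text = "42"
to_html(page)
second_pass = (para.tag, macro.tag)
```

After the first pass the paragraph is already XHTML while the macro is still an internal node; the second pass touches only the newly expanded part.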

Project stages

Plan for GSOC 2008, Plan for extending this project

Further possible projects

Section editing: Embed the page source section wise into the tree. This makes it possible to replace one section and dump the source after that.
Conversion between different wiki markups.
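The section editing idea could be sketched like this: each section node keeps its own piece of the page source, so one section can be replaced and the complete source dumped again. The section element, the source attribute and both helper functions are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def build_page(section_sources):
    """Create a tree in which each section carries its wiki source."""
    page = ET.Element("page")
    for source in section_sources:
        ET.SubElement(page, "section").set("source", source)
    return page

def dump_source(page):
    """Reassemble the page source from the per-section pieces."""
    return "".join(sec.get("source") for sec in page.iter("section"))

page = build_page(["= One =\nfirst text\n", "= Two =\nsecond text\n"])

# Replace only the second section and dump the whole page source.
page[1].set("source", "= Two =\nedited text\n")
print(dump_source(page))
```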

Refs

[DCMI] - Dublin Core Metadata Element Set, Version 1.1: Reference Description, http://www.dublincore.org/documents/dces/, Dublin Core Metadata Initiative, 2003
[ODF] - Open Document Format for Office Applications, Version 1.1, http://docs.oasis-open.org/office/v1.1/, OASIS, 2007
[XHTML1.1] - XHTML 1.1 - Module-based XHTML - Second Edition, http://www.w3.org/TR/xhtml11, W3C, 2007
[XInclude] - XML Inclusions (XInclude) Version 1.0 (Second Edition), http://www.w3.org/TR/xinclude/, 2006

MoinMoin: BastianBlank/TreeOutputFormatter (last edited 2009-08-28 18:11:24 by BastianBlank)