{i} Current state: near the end

Questions

How does it work?

We simply iterate over the element tree of the HTML document. For each element, we run the visit method, which checks the namespace of the tag; if it is an XHTML tag, we apply the appropriate behavior, as defined below.

We distinguish three kinds of HTML tags: symmetric tags, which have the same tag name in the DOM tree; simple tags, which do not need attributes and can be converted just by changing the tag name; and all remaining tags, which require more complicated operations.

All the symmetric tags are stored in a set. We check whether an element's tag is in this set and, if so, simply copy the element into the resulting tree under the moin_page namespace.

For the simple tags, we have a dictionary which maps each tag name to the equivalent moin_page element, so we can directly return that element when we encounter such a tag.

For all the other tags, there is a method named visit_xhtml_name_of_the_tag which handles the conversion.
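
The three cases above can be sketched as follows. This is only an illustration written for this page, not the converter's actual code: the namespace URI used for moin_page, the helper names split_tag and copy_as, and the contents of the set and the dictionary (small excerpts from the status tables below) are assumptions. Attributes are dropped, and the text following an ignored child is lost, to keep the sketch short.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MOIN_PAGE_NS = "http://example.org/moin-page"  # hypothetical URI for this sketch

# Symmetric tags keep their name; simple tags only change their name.
SYMMETRIC_TAGS = {"p", "strong", "code", "table"}
SIMPLE_TAGS = {"em": "emphasis", "b": "strong", "i": "emphasis",
               "tt": "code", "samp": "code", "hr": "separator"}


def split_tag(tag):
    """Split ElementTree's '{uri}name' notation into (uri, name)."""
    if tag.startswith("{"):
        uri, _, name = tag[1:].partition("}")
        return uri, name
    return "", tag


class Converter:
    def visit(self, element):
        """Dispatch on the namespace and the tag name of one element."""
        uri, name = split_tag(element.tag)
        if uri != XHTML_NS:
            return None  # not an XHTML tag: ignored in this sketch
        if name in SYMMETRIC_TAGS:
            return self.copy_as(name, element)
        if name in SIMPLE_TAGS:
            return self.copy_as(SIMPLE_TAGS[name], element)
        method = getattr(self, "visit_xhtml_" + name, None)
        return method(element) if method is not None else None

    def copy_as(self, new_name, element):
        """Copy text and converted children into a moin_page element."""
        result = ET.Element("{%s}%s" % (MOIN_PAGE_NS, new_name))
        result.text = element.text
        for child in element:
            converted = self.visit(child)
            if converted is not None:
                converted.tail = child.tail
                result.append(converted)
        return result

    def visit_xhtml_u(self, element):
        """Example of a tag that needs more than a rename."""
        result = self.copy_as("span", element)
        result.set("text-decoration", "underline")
        return result


source = ET.fromstring(
    '<p xmlns="http://www.w3.org/1999/xhtml">Hello <b>world</b> and <u>you</u></p>')
tree = Converter().visit(source)
```

Here `<p>` goes through the set, `<b>` through the dictionary, and `<u>` through a dedicated visit_xhtml_u method.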

What about metadata?

From a MoinMoin point of view, we cannot handle metadata: author, title and so on are defined for each item by its author. So we simply ignore such information.

What about images and objects?

The converter won't check attachments. It will just convert the (X)HTML tag into the appropriate tag in the DOM tree. You have to ensure that you are using absolute paths for the resources, or at least use the <base> tag.

As with images and objects, there is no link checker, and we cannot retrieve the base path, so be sure to use absolute URLs.
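
If your document uses relative paths, you can make them absolute yourself before handing the document to the converter. The following is a minimal sketch using the Python standard library; the function name resolve and the example.org URLs are made up for illustration, and nothing here is part of the converter.

```python
from urllib.parse import urljoin


def resolve(base_href, reference):
    """Resolve a possibly relative reference against a base URL."""
    return urljoin(base_href, reference)


relative = resolve("http://example.org/wiki/page/", "images/logo.png")
absolute = resolve("http://example.org/wiki/page/", "http://other.example/a.png")
```

A reference that is already absolute is returned unchanged, so the function can be applied to every src, data and href value.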

What about styles?

The converter does not currently support styles, but we are going to implement basic support soon.

Where can I check the equivalences between the DOM tree and HTML?

See DOM DocBook and HTML 2010/HTML-DOM Equivalences for more information about the equivalences between HTML elements and the DOM tree.

You can add any question about the converter here

Open issues

CSS

I need to find a way to parse CSS declarations from style attributes, to handle some situations more correctly. In this attribute, the CSS syntax is as follows:

key1 : value1;
key2 : value2;

We can probably write a simple parser to convert this kind of simple CSS into a Python dictionary, which can easily be manipulated afterwards. The drawback of this approach is that it performs a lot of work when only one attribute is needed.
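
A minimal sketch of such a parser is shown below. The function name parse_style is an assumption made for this page. It only handles the simple "key : value;" form: a semicolon inside a quoted string or inside url(...) would be split incorrectly.

```python
def parse_style(style):
    """Turn 'key1 : value1; key2 : value2;' into a dictionary."""
    declarations = {}
    for declaration in style.split(";"):
        name, separator, value = declaration.partition(":")
        if not separator:
            continue  # empty chunk or declaration without a colon
        name = name.strip().lower()
        value = value.strip()
        if name and value:
            declarations[name] = value
    return declarations


style = parse_style("text-decoration : underline;\nfont-size : 120%;")
```

When the same key appears twice, the last value wins, which matches how a browser treats repeated declarations in one style attribute.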

Another approach is to search the CSS string for a substring (the name of the attribute we are looking for) and then extract its value.

At this time, I do not think we should support external CSS files, or even CSS within the <head> tag. We should only support the basic, simple CSS found in the style attribute.
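
The substring approach could look like the sketch below, which uses a regular expression so that looking for "size" does not accidentally match "font-size". The function name find_style_value is an assumption, and the same limitation about semicolons inside values applies.

```python
import re


def find_style_value(style, name):
    """Return the value of one CSS property, or None if it is absent."""
    # The property name must start the string or follow a semicolon.
    pattern = r"(?:^|;)\s*" + re.escape(name) + r"\s*:\s*([^;]*)"
    match = re.search(pattern, style, re.IGNORECASE)
    return match.group(1).strip() if match else None


value = find_style_value("key1 : value1;\nkey2 : value2;", "key2")
```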

Class Attribute

As we can see in the HTML_Out converter, some tags of the DOM tree are converted into HTML using the class attribute. For example, the <error> tag is converted to <span class="error">. The question is: should we convert such things back? That would mean guessing the meaning of the class attribute, since some people may use class="error" with a different meaning. Maybe we could use a prefix to be sure that the class attribute comes from MoinMoin?
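
To make the prefix idea concrete, here is a small sketch. The prefix "moin-" and the function name moin_classes are purely hypothetical; no such convention exists in the converters at the moment.

```python
MOIN_CLASS_PREFIX = "moin-"  # hypothetical prefix, not an existing convention


def moin_classes(class_attribute, prefix=MOIN_CLASS_PREFIX):
    """Return the class names carrying the prefix, with the prefix removed."""
    return [name[len(prefix):] for name in class_attribute.split()
            if name.startswith(prefix) and len(name) > len(prefix)]


from_moin = moin_classes("moin-error highlight")
from_elsewhere = moin_classes("error")
```

With such a prefix, class="moin-error" could safely be converted back to an <error> element, while a plain class="error" written by someone else would be left alone.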

Status of the current work

The tables below indicate the progress of the converter. Check the equivalences for more information.

Page Meta information

|| HTML Tag Name Equivalence || Test || Conversion || Comments ||
|| <base href="uri:test"> || Done || Done || Need to keep the IRI to retrieve full path from relative path. ||

Page Structure

|| HTML Tag Name || Equivalence || Test || Conversion || Comments ||
|| <html> || <html> || Done || Done || Without this tag, the converter will throw an exception for invalid HTML. ||
|| <body> || <body> || WIP || Done || We need to verify the presence of this tag at the end of the conversion. If it is missing, we add a body tag. ||
|| <div> || <div> || Done || Done || ||
|| <hr> || <separator> || Done || Done || ||

Basic Text Structure

|| HTML Tag Name || Equivalence || Test || Conversion || Comments ||
|| <p> || <p> || Done || Done || SYMMETRIC ||
|| <hX> || <h outline-level="X"> || Done || Done || Maybe add some tests? ||
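
The <hX> equivalence can be illustrated with the sketch below. It ignores namespaces for brevity, and the function name convert_heading is an assumption rather than the name used in the converter.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def convert_heading(element):
    """Convert an un-namespaced <hX> element into <h outline-level="X">."""
    match = HEADING.match(element.tag)
    if match is None:
        return None  # not a heading
    result = ET.Element("h", {"outline-level": match.group(1)})
    result.text = element.text
    return result


heading = convert_heading(ET.fromstring("<h2>Open issues</h2>"))
```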

Paragraph Elements Contents

|| HTML Tag Name || Equivalence || Test || Conversion || Comments ||
|| <em> || <emphasis> || Done || Done || ||
|| <strong> || <strong> || Done || Done || SYMMETRIC ||
|| <b> || <strong> || Done || Done || ||
|| <pre> || <blockcode> || Done || Done || /!\ Not sure it is the best choice, but this would allow a symmetric HTML-DOM converter ||
|| <tt> || <code> || Done || Done || ||
|| <sub> || <span base-line-shift="sub"> || Done || Done || ||
|| <sup> || <span base-line-shift="super"> || Done || Done || ||
|| <u> || <span text-decoration="underline"> || Done || Done || ||
|| <ins> || <span text-decoration="underline"> || Done || Done || ||
|| <a href="uri:test"> || <a xlink:href="uri:test"> || Done || Done || ||
|| <br /> || <line-break> || Done || Done || ||
|| <code> || <code> || Done || Done || SYMMETRIC ||
|| <i> || <emphasis> || Done || Done || ||
|| <samp> || <code> || Done || Done || ||
|| <big> || <span font-size="120%"> || Done || Done || ||
|| <small> || <span font-size="85%"> || Done || Done || ||
|| <del> || <span text-decoration="line-through"> || Done || Done || ||
|| <s> || <span text-decoration="line-through"> || Done || Done || ||
|| <strike> || <span text-decoration="line-through"> || Done || Done || ||

List of tags converted with <span html-element="tag.name">

Object Contents

|| HTML Tag Name || Equivalence || Test || Conversion || Comments ||
|| <img src="uri:test"> || <object xlink:href="uri:test"> || Done || Done || /!\ URI conversion ? ||
|| <object data="uri:test"> || <object xlink:href="uri:test"> || WIP || WIP || /!\ URI conversion ||

List

|| HTML Tag Name || Equivalence || Test || Conversion || Comments ||
|| <ul> || <list item-label-generate="unordered"> || Done || Done || ||
|| <ol> || <list item-label-generate="ordered"> || Done || Done || ||
|| <li> || <list-item-body> || Done || Done || ||
|| <dl> || <list> || Done || Done || ||
|| <dt> || <list-item-label> || Done || Done || ||
|| <dd> || <list-item-body> || Done || Done || ||
|| <dir> || <list item-label-generate="unordered"> || Done || Done || ||

Table

|| HTML Tag Name || Equivalence || Test || Conversion || Comments ||
|| <table> || <table> || Done || Done || SYMMETRIC ||
|| <thead> || <table-header> || Done || WIP || Add support for attributes according to the equivalences. ||
|| <tfoot> || <table-footer> || Done || WIP || Add support for attributes according to the equivalences. ||
|| <tbody> || <table-body> || Done || WIP || Add support for attributes according to the equivalences. ||
|| <tr> || <table-row> || Done || WIP || Add support for attributes according to the equivalences. ||
|| <td> || <table-cell> || Done || WIP || Add support for the following attributes: align, bgcolor, colspan, rowspan, valign ||
|| <th> || <table-cell> || Done || WIP || Need to add specific attributes ||
|| <col /> || Nothing || Nothing || Nothing || ||
|| <colgroup> || Nothing || Nothing || Nothing || ||

Ignored Tags

Here is the list of ignored tags; all the children of these tags are ignored as well:

Use the converter in MoinMoin

You can now use the converter in MoinMoin by using an item with the mimetype text/x.moin.html.

Tests

MoinMoin: DOM DocBook and HTML 2010/HTML-DOM (last edited 2010-07-07 17:30:22 by ValentinJaniaut)