ConfluenceConverter/DevelopmentNotes/TransformProcess

A straight dump from Brad's email about this topic:

Depending on the complexity of the data, we might want to consider some sort of intermediate structure, i.e.:

`Confluence DB/Export -> <some other data structure> -> MoinMoin`

The advantage of such an approach is that it would be more easily testable and also easier to break up for concurrent development.

The main disadvantage I can see with this idea is that it creates a layer of complexity that may not be worth the extra effort (not to mention the extra scope for bugs).
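To make the trade-off concrete, here is a minimal sketch of what an intermediate structure could look like. Nothing in the email specifies it: the `Page` and `Revision` types, their fields, and the `emit_snapshot` stub are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Revision:
    """One saved version of a page (hypothetical intermediate type)."""
    body: str
    author: Optional[str] = None
    timestamp: Optional[str] = None  # kept as the raw string from the export


@dataclass
class Page:
    """A page with its revisions, oldest first (hypothetical intermediate type)."""
    title: str
    revisions: List[Revision] = field(default_factory=list)

    def latest(self) -> Optional[Revision]:
        """Return the newest revision, or None if the page has none."""
        return self.revisions[-1] if self.revisions else None


def emit_snapshot(pages: List[Page]) -> dict:
    """Output stage reduced to a stub: map each title to its latest body text."""
    return {p.title: p.latest().body for p in pages if p.latest() is not None}
```

With a structure like this, the Confluence parsing side and the MoinMoin output side can each be tested against hand-built `Page` objects, without needing a real dump or a running wiki.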

Confluence Dump XML

In a word - confusing.

A good first step towards understanding it is to look at how internal ids and id-references are handled:

ids and id-references are not expressed as attributes, as might be expected
an element is assigned an id by giving it a child element like: <id name="id">10</id>
a reference to an element with an id takes the form of a 'referencing-type' element with an id child; for example, the following XML chunk is a reference: <collection name="bodyContents" class="java.util.Collection"><element class="BodyContent" package="com.atlassian.confluence.core"><id name="id">3309603</id></element></collection>
So: an id element is either an id assignment or an id-reference, depending on the context (i.e. the type of the id element's parent element).

Looking at the elements in a Confluence export that are parents to an id element, the rule appears to be:

"property" and "element" elements are reference elements
"object" elements are the only ones with ids assigned

BUT

There is another quirk: ids are not globally unique. They appear to be unique only within a class of object, so we get conflicts like:

<object class="ReferralLink" package="com.atlassian.confluence.links">
<id name="id">104</id>
...
</object>

<object class="SpacePermission" package="com.atlassian.confluence.security">
<id name="id">104</id>
...
</object>

The references to these pseudo-ids include enough information to derive the type:

<element class="ReferralLink" package="com.atlassian.confluence.links"><id name="id">104</id>
</element>

<element class="SpacePermission" package="com.atlassian.confluence.security"><id name="id">104</id>
</element>

To build an id-map, then:

Grab all "object" elements and find their "id" child elements, then map id values to object nodes keyed by the object class (i.e. the value of the class attribute on the object)
Grab all "property" and "element" elements with an "id" child element; these are the references, and the type of each reference is given by the class attribute of the id's parent (the "property" or "element" element itself)
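The two steps above can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration under some assumptions: the `<export>` root and the `links` collection in the sample are placeholders invented for the demo, the function names `build_id_map` and `find_references` are hypothetical, and the map is keyed on the class attribute alone (the package attribute could be added to the key if class names ever turn out to clash).

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; the <export> root and "links" collection are made up.
SAMPLE = """<export>
<object class="ReferralLink" package="com.atlassian.confluence.links">
<id name="id">104</id>
</object>
<object class="SpacePermission" package="com.atlassian.confluence.security">
<id name="id">104</id>
<collection name="links" class="java.util.Collection"><element class="ReferralLink" package="com.atlassian.confluence.links"><id name="id">104</id>
</element></collection>
</object>
</export>"""


def build_id_map(root):
    """Map (class, id) -> the <object> element that is assigned that id."""
    id_map = {}
    for obj in root.iter("object"):
        id_elem = obj.find("id")  # find() only looks at direct children
        if id_elem is not None and id_elem.text:
            id_map[(obj.get("class"), id_elem.text.strip())] = obj
    return id_map


def find_references(root):
    """Collect (class, id) keys from <property>/<element> nodes with an <id> child."""
    refs = []
    for tag in ("property", "element"):
        for node in root.iter(tag):
            id_elem = node.find("id")
            if id_elem is not None and id_elem.text:
                refs.append((node.get("class"), id_elem.text.strip()))
    return refs


root = ET.fromstring(SAMPLE)
id_map = build_id_map(root)
# Resolve each reference to its target object node (None if the target is missing).
resolved = [id_map.get(key) for key in find_references(root)]
```

Keying on the (class, id) pair keeps the two objects with id 104 apart. `ET.fromstring` loads the whole document into memory, so a large dump might call for `ET.iterparse` instead.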

Content Migration Strategy

It might be useful to consider how the content should be migrated. For example:

Should all the history be migrated or just a snapshot of the content?

Migrating the full history might be possible with a page package, whereas migrating just a snapshot definitely is.

The original request was for the migrated wiki to contain a full history (see the announcement links on ConfluenceConverter) -- BradleyDean 2012-03-30 11:43:13
Right. It looks like it might be possible with page packages to write a package installer script that replays at least the edits with their author details, if not including the timestamp information. -- PaulBoddie 2012-03-30 13:41:04
It should be possible to preserve the timestamp information given recent changes to Moin, but the package installer would need to be patched to take advantage of this. -- PaulBoddie 2013-02-18 15:08:19

Should the history preserve metadata such as the editor and the date and time?

If so, the edit log would probably need to be manipulated to reflect the correct history.

As above, ideally the meta-data will be migrated where possible. -- BradleyDean 2012-03-30 11:43:13
As noted above, if the package installer can do this without us having to manipulate the edit log directly, this might not be so difficult. If the timestamp is required, perhaps we could extend the installer to modify such details. -- PaulBoddie 2012-03-30 13:41:04

How should user profiles be migrated?

Even if history metadata isn't important, it would probably be desirable to migrate profiles, although it may not be possible to preserve things like passwords.

How might comments and other non-page features be incorporated into the migrated Wiki?

Moin doesn't directly support various Confluence features, but things like comments could reside on subpages according to a convention.

Should the content be audited and filtered during the process?

If there are spam pages or spammer user profiles, we could filter them out, but this would probably occur between parsing the Confluence XML dump and importing into Moin.
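As a rough illustration of where such a filter would sit, the sketch below takes already-parsed page records and splits them before anything is imported. The record layout (dicts with "title" and "author" keys), the `split_spam` name, and the sample data are all hypothetical; how spam is actually identified is left open.

```python
def split_spam(pages, spam_authors, spam_titles=()):
    """Return (kept, dropped) lists of page records.

    Each page is a dict with hypothetical "title" and "author" keys.
    """
    kept, dropped = [], []
    for page in pages:
        is_spam = (page.get("author") in spam_authors
                   or page.get("title") in spam_titles)
        (dropped if is_spam else kept).append(page)
    return kept, dropped


# Made-up records standing in for the output of the parsing stage.
pages = [
    {"title": "FrontPage", "author": "alice"},
    {"title": "CheapPills", "author": "spambot42"},
]
kept, dropped = split_spam(pages, spam_authors={"spambot42"})
```

Returning the dropped records rather than discarding them leaves something to audit before the import goes ahead.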

MoinMoin: ConfluenceConverter/DevelopmentNotes/TransformProcess (last edited 2013-02-18 15:08:19 by PaulBoddie)