HTML2Moin

Some people want to transfer HTML files into the wiki not merely as attachments but as real wiki pages. In these cases, a converter that turns HTML into wiki markup makes life easier. OpenOffice.org and Word files can be saved as HTML and then simplified with HtmlTidy (which can even "strip surplus tags in Word 2000 pages").
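The clean-up step could be scripted roughly as follows. This is only a sketch: it assumes the `tidy` command-line tool is installed and on the PATH, and the helper names `tidy_command` and `tidy_html` are made up for illustration. If `tidy` is missing or fails, the input is returned unchanged.

```python
import shutil
import subprocess


def tidy_command(word_cleanup=True):
    """Build the tidy command line (hypothetical helper, not part of MoinMoin)."""
    cmd = ["tidy", "-quiet", "-asxhtml", "-utf8", "--show-warnings", "no"]
    if word_cleanup:
        # Tidy's word-2000 option removes the surplus markup Word 2000 inserts.
        cmd += ["--word-2000", "yes"]
    return cmd


def tidy_html(html, word_cleanup=True):
    """Return a cleaned-up copy of html, or html itself if tidy is unavailable."""
    if shutil.which("tidy") is None:
        return html
    result = subprocess.run(
        tidy_command(word_cleanup),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits with 0 (ok), 1 (warnings) or 2 (errors).
    if result.returncode > 1 or not result.stdout:
        return html
    return result.stdout.decode("utf-8", errors="replace")
```

The cleaned output can then be handed to whichever HTML-to-wiki converter is used.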

In Python there are two different HTML parsers that could be useful:

HTMLParser can be used like this:

#!/usr/bin/python2.3
# Python 2 script: fetch a page and print a rough MoinMoin rendering of it.

import urllib
from HTMLParser import HTMLParser

url = "http://jurawiki.org/StartSeite?action=print"  # sample url
verbose = 0  # set to 1 to report tags that have no handler

class MyHTMLParser(HTMLParser):
    # Tag handlers are named do_<tag>_start and do_<tag>_end.

    def do_a_start(self, attrs, tag):
        # Print the link target; anchors without an href are skipped.
        href = dict(attrs).get("href")
        if href:
            print " %s " % href

    def do_b_start(self, attrs, tag):
        print "'''",

    def do_b_end(self, tag):
        print "'''",

    def do_em_start(self, attrs, tag):
        print "''",

    def do_em_end(self, tag):
        print "''",

    def handle_starttag(self, tag, attrs):
        # Dispatch to do_<tag>_start, falling back to the default handler.
        func = getattr(self, "do_%s_start" % tag, self.do_default_start)
        return func(attrs, tag)

    def handle_endtag(self, tag):
        # Dispatch to do_<tag>_end, falling back to the default handler.
        func = getattr(self, "do_%s_end" % tag, self.do_default_end)
        return func(tag)

    def handle_data(self, data):
        print data,

    def do_default_start(self, attrs, tag):
        if verbose:
            print "Encountered the beginning of a %s tag" % tag
            print "Attribs: %s" % attrs

    def do_default_end(self, tag):
        if verbose:
            print "Encountered the end of a %s tag" % tag

htmldata = urllib.urlopen(url).read()

p = MyHTMLParser()
p.feed(htmldata)
p.close()
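
The listing above is written for Python 2. A minimal Python 3 sketch of the same dispatch idea might look like the following. The names `MoinSketchParser` and `html_to_moin` are made up for illustration, the output is collected in a list instead of being printed, and looking up `href` for links is an assumption about what the original intended.

```python
from html.parser import HTMLParser


class MoinSketchParser(HTMLParser):
    """Illustrative only: maps a few HTML tags to MoinMoin markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []  # collected output fragments

    def do_a_start(self, attrs):
        # Emit the link target, if the anchor has one.
        href = dict(attrs).get("href")
        if href:
            self.out.append(" %s " % href)

    def do_b_start(self, attrs):
        self.out.append("'''")

    def do_b_end(self):
        self.out.append("'''")

    def do_em_start(self, attrs):
        self.out.append("''")

    def do_em_end(self):
        self.out.append("''")

    def handle_starttag(self, tag, attrs):
        # Look for a do_<tag>_start method; unknown tags are ignored.
        handler = getattr(self, "do_%s_start" % tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        # Look for a do_<tag>_end method; unknown tags are ignored.
        handler = getattr(self, "do_%s_end" % tag, None)
        if handler is not None:
            handler()

    def handle_data(self, data):
        self.out.append(data)


def html_to_moin(html):
    """Convert an HTML string to (very rough) MoinMoin markup."""
    parser = MoinSketchParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(html_to_moin("<p>A <b>bold</b> and <em>soft</em> word</p>"))
```

Supporting another tag only requires adding a `do_<tag>_start` and `do_<tag>_end` pair; tags without handlers are dropped while their text content is kept.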

FlorianFesti extended ThomasWaldmann's parser slightly (see ActionMarket/HTML2MoinMoin.py) and built a web interface for it (see ActionMarket/ImportHtml.py). A sample implementation can be seen in the ZDI-Wiki.

Other existing HTML2Wiki Converters

Using XSLT

<?xml version="1.0" encoding="Windows-1251"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
<xsl:output method="text" encoding="Windows-1251"/>
<xsl:template match="/">
        <xsl:apply-templates/>
</xsl:template>

<xsl:template match="h4">
  === <xsl:apply-templates /> ===
</xsl:template>

<xsl:template match="p|pre">
        <xsl:text xml:space="preserve">
</xsl:text>
        <xsl:apply-templates/>
        <xsl:text xml:space="preserve">
</xsl:text>
</xsl:template>

<xsl:template match="u"> __<xsl:apply-templates/>__</xsl:template>

<xsl:template match="em"> *<xsl:apply-templates/>*</xsl:template>

<xsl:template match="li[parent::ul]">
<xsl:text disable-output-escaping="yes">
    * </xsl:text> <xsl:apply-templates/>
</xsl:template>

<xsl:template match="li[parent::ol]">
<xsl:text xml:space="preserve">
    </xsl:text>
<xsl:value-of select="position()"/>.<xsl:text xml:space="preserve"> </xsl:text>
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="*">
<xsl:text xml:space="preserve"> </xsl:text><xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

The stylesheet can be extended with templates for other tags.

coconuts uses text_html_text_moin_wiki to mass-convert HTML fragments from our old CMS pages into wiki markup and sends the result to a wiki server via XML-RPC. It is still under development. -- ReimarBauer 2010-03-02 19:29:06

MoinMoin: HtmlConverter (last edited 2018-03-08 08:54:34 by RudolfReuter)