{i} If you are looking for a working PDF generator, visit ActionMarket/PdfAction.

PDF Report

Proposal for a new output formatter

News:

Goal:

The goal is to make it possible to create, on the fly, a PDF document containing a table of contents and any number of wiki pages.

Design ideas:

I think it is somewhat difficult to reuse the current formatter code, as it does not create a clean XML structure. Therefore I want to:

  1. create an empty XML document
  2. parse one wiki page and insert it as an XML node into this document
  3. repeat this step for all pages
  4. insert a table of contents at the beginning
  5. convert this to PDF
  6. correct the page numbers in the table of contents
  7. regenerate the PDF document
  8. write the PDF document to a file
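
The steps above amount to a two-pass layout: the page numbers in the table of contents are only known after the document has been rendered once. Below is a minimal sketch of that idea. It uses the standard library's xml.etree.ElementTree in place of libxml2, and a stand-in layout() function that counts lines instead of calling a real PDF renderer. The element names, function names and LINES_PER_PAGE are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

LINES_PER_PAGE = 40  # assumed page capacity of the stand-in renderer


def build_document(pages):
    """Steps 1-4: build <report> with a <toc> followed by one <page> per wiki page.

    `pages` is an ordered list of (name, list_of_paragraph_strings).
    """
    root = ET.Element("report")
    toc = ET.SubElement(root, "toc")
    for name, paragraphs in pages:
        page = ET.SubElement(root, "page", name=name)
        for text in paragraphs:
            ET.SubElement(page, "p").text = text
        # The page number is unknown until the document has been laid out once.
        ET.SubElement(toc, "entry", name=name, pageno="?")
    return root


def layout(root):
    """Stand-in for steps 5 and 7: return {page name: first physical page}.

    One TOC entry or paragraph counts as one line, and every wiki page
    starts on a fresh physical page. A real renderer would measure text.
    """
    toc_lines = len(root.find("toc"))
    current = -(-toc_lines // LINES_PER_PAGE) + 1  # ceiling division; TOC pages come first
    starts = {}
    for page in root.findall("page"):
        starts[page.get("name")] = current
        current += max(1, -(-len(page) // LINES_PER_PAGE))
    return starts


def fix_toc(root, starts):
    """Step 6: write the measured page numbers back into the TOC."""
    for entry in root.find("toc"):
        entry.set("pageno", str(starts[entry.get("name")]))


pages = [("FrontPage", ["Welcome."] * 3), ("HelpContents", ["Help text."] * 50)]
doc = build_document(pages)
fix_toc(doc, layout(doc))   # first pass measures, then the TOC is corrected
final_starts = layout(doc)  # second pass: lay out again with the corrected TOC
xml_text = ET.tostring(doc, encoding="unicode")
```

In this stand-in the second pass cannot differ from the first, because a TOC entry always occupies one line. With a real renderer the corrected numbers could change the TOC's length, which is why the document is rendered again after step 6.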

Open questions:

  1. How do I create a table of contents?
    One idea was to build the set of pages by passing the name of a category as a parameter.
    But if you do that, then you cannot define the order in which the pages appear, and that can be important.

    • I would suggest writing a PDF processor: one page name per line, plus additional parameters. The content of the text area could be rendered as a bullet list or something similar, and should supply a link to an action that generates and downloads the PDF.
    • Create the TOC from the headings of the included pages. Implement a parameter to shift heading levels up and down. Allow headings below a given level to be ignored, per page.
  2. Is there a DTD that defines how to represent a wiki page or a whole wiki in XML format?
    See also: WikiXml
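
The two suggestions under question 1 can be sketched together: a page list with one page name per line plus optional parameters, and a TOC built from the headings of the listed pages. The parameter names shift and maxlevel, the key=value syntax, and the heading regular expression are all assumptions made for this sketch. The regular expression is a simplification and is not meant to replace the MoinMoin parser.

```python
import re

HEADING = re.compile(r"^\s*(=+)\s+(.*?)\s+\1\s*$")  # simplified "== Title ==" matcher


def parse_page_list(text):
    """Parse 'PageName key=value ...' lines; line order is the order in the PDF."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        entries.append({
            "name": parts[0],
            "shift": int(params.get("shift", 0)),        # hypothetical parameter
            "maxlevel": int(params.get("maxlevel", 6)),  # hypothetical parameter
        })
    return entries


def build_toc(entries, get_body):
    """Return (level, title, page name) tuples for every heading that is kept.

    `get_body` maps a page name to its raw text, so the storage backend
    stays outside this function.
    """
    toc = []
    for entry in entries:
        for line in get_body(entry["name"]).splitlines():
            match = HEADING.match(line)
            if not match:
                continue
            level = len(match.group(1))
            if level > entry["maxlevel"]:
                continue  # ignore headings below the per-page cut-off
            toc.append((max(1, level + entry["shift"]), match.group(2), entry["name"]))
    return toc


bodies = {
    "Intro": "= Intro =\ntext\n== Scope ==\n=== Detail ===\n",
    "Usage": "= Usage =\n== Options ==\n",
}
entries = parse_page_list("Intro maxlevel=2\nUsage shift=1\n")
toc = build_toc(entries, bodies.__getitem__)
```

Because the page list is an ordered text, it also answers the ordering problem of the category approach. Inside MoinMoin, get_body could wrap the Page(request, pagename).get_raw_body() call mentioned in the discussion further down.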

Formal Requirements

Class PdfReport

Todo

  1. install the ReportLab toolkit -- done

  2. install a Python IDE (e.g. eric) -- done

  3. install libxml2-python -- done
    Some comments on why I want to use this XML library:

    • it is very fast
    • it has a great many features, e.g. UTF support, full XPath support and validation of XML documents
    • I know it very well
    I just noticed that it is part of the standard libxml2 package on Gentoo Linux, so no installation effort is required.
  4. create a class definition for PdfReport (attributes and methods) -- done

  5. implement a minimal working prototype
    1. implement appendWikiPage()
      1. create page nodes -- done
      2. create paragraph nodes -- done
      3. create header nodes -- done
      4. create list nodes -- done (well, for unordered lists)
    2. implement convertToPdf() -- partly done (formatting is not implemented yet)
  6. discuss it on the mailing list
  7. implement the class PdfReport

    1. complete appendWikiPage()
      1. create bold nodes
      2. create italic nodes
      3. create link nodes
      4. create BR nodes
    2. complete convertToPdf()
  8. embed it into the MoinMoin framework

-- UweFechner 2004-04-11 14:46:11

I would strongly recommend not implementing another parser! Although this may look easy, it will be very hard to create a parser that handles exactly the same markup as the MoinMoin parser. Such a parser will also break whenever we change the MoinMoin parser. Additionally, it won't be usable with other parsers.

If the dom/xml formatter doesn't produce a proper XML format for you, clone and modify it, or use XSLT to create the format you want.

Please use the MoinMoin infrastructure to access the data. The filesystem layout will change in 1.3, and the default encoding will change too. Use Page(request, pagename).get_raw_body() to load the page content from disk. -- FlorianFesti 2004-10-10 16:23:50

Well, I would prefer to use the existing parser, IF there is a clean and well-defined interface to the existing content. This could be an XML DTD, for example. But I can't find anything like this in any of the existing documents.
Without a clean interface, e.g. a DTD, an XML Schema or an OMG IDL (interface definition language) interface, I can't see how I am supposed to implement a PDF converter based on a clean infrastructure.
Where is this interface defined?

-- UweFechner 2004-10-10 18:12:20

The "clean" interface is the Formatter API. Right now there is no XML format that will match your needs. dom/xml produces a representation of the page source. This is not what you want, as it does not execute macros and processors. Right now dom/xml is not yet used in MoinMoin and is not well tested, but it will (likely) evolve into our internal representation format. I would suggest you fork dom/xml to produce the format you need. Perhaps you just need to remove the macro and processor methods to go back to the default implementation inherited from base.

As an alternative, you can try to produce the PDF directly by implementing a new formatter. Be aware that the parser doesn't reorder the text format tags to guarantee correct nesting. The dom/xml formatter is our prototype for fixing this problem by keeping track of the opened tags and adding closing and reopening tags. For example:

<em>italics<strong>italics/bold</em>bold</strong>
becomes
<em>italics<strong>italics/bold</strong></em><strong>bold</strong>
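
The bookkeeping described above, tracking the open tags and closing and reopening them where needed, can be illustrated with a small standalone sketch. This is not the dom/xml formatter's code. The token format and the function name fix_nesting are assumptions made for this example.

```python
def fix_nesting(tokens):
    """Rewrite a token stream so that inline tags are properly nested.

    Tokens are ("open", tag), ("close", tag) or ("text", string).
    """
    stack = []  # tags currently open, outermost first
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
        elif kind == "open":
            stack.append(value)
            out.append("<%s>" % value)
        elif kind == "close" and value in stack:
            # Find the innermost open occurrence of `value`.
            index = len(stack) - 1 - stack[::-1].index(value)
            reopen = stack[index + 1:]
            # Close everything opened after it, innermost first, then the tag itself.
            for tag in reversed(stack[index:]):
                out.append("</%s>" % tag)
            del stack[index:]
            # Reopen the tags that were only closed to keep the nesting valid.
            for tag in reopen:
                stack.append(tag)
                out.append("<%s>" % tag)
        # A close for a tag that is not open is silently dropped.
    # Close whatever is still open at the end of the stream.
    for tag in reversed(stack):
        out.append("</%s>" % tag)
    return "".join(out)


tokens = [
    ("open", "em"), ("text", "italics"),
    ("open", "strong"), ("text", "italics/bold"),
    ("close", "em"), ("text", "bold"),
    ("close", "strong"),
]
result = fix_nesting(tokens)
```

For the token stream shown, result is the second line of the example above.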

Getting the parser/formatter interface right has been a long process that may still not be 100% complete. (I just found a bug: attachment handling is still done in the parser.) But getting it right is required for caching (see MoinMoinIdeas/WikiApplicationServerPage for details), so problems in this area will get fixed.

-- FlorianFesti 2004-10-10 19:14:46

I have to agree with FlorianFesti. Working within Moin, building upon the code that's already there, would have the benefit of improving the parser and formatter (and thus improving all of Moin). If radical changes to the parser/formatter are needed, we can split it off while ironing out problems, but still work within the Moin framework. Looking at the Formatter code, I don't think it justifies starting from scratch; it's not such a hopeless case :D .

-- MarijnVriens 2004-10-11 05:03:04

Most of the needed improvements should already have been made during 1.3 development.

-- FlorianFesti 2004-10-11 08:11:05

Well, at the moment I just don't want to invest much time in understanding the MoinMoin framework (and learning how to use arch). I might try to improve the existing XML exporter; I am tempted to extract a standalone XML exporter from the existing codebase. It takes much less time to understand a file format than to understand a code framework, and for my needs (wiki to PDF conversion) I just can't see why the effort of understanding the framework should be necessary. I prefer "loosely coupled" software components to strong coupling. It makes independent development easier.

-- UweFechner 2004-10-11 19:47:12

MoinMoin: PdfReport (last edited 2007-10-29 19:08:55 by localhost)