Sometimes you need to write text of a specified length: scientific writing, proposals, and journalism all impose this sort of limit. The WordCount macro can help.
- [[WordCount]] - word count of this page
- [[WordCount()]] - same
- [[WordCount(4000)]] - this page, against a target count of 4000
- [[WordCount(FrontPage)]] - count FrontPage
- [[WordCount(FrontPage,4000)]] - count FrontPage against a target count of 4000
- [[WordCount(PageOne,PageTwo)]] - count some pages
- [[WordCount(PageOne,PageTwo,4000)]] - count some pages against a target count of 4000
[[WordCount(FrontPage,115)]] gives the word count of the front page compared with a target length of 115, and prints the difference. If the page list includes the magic word subpages, the subpages of the current page are counted too.
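The argument forms listed above could be interpreted roughly as in the following sketch. It is not the macro's actual code: the function names parse_wordcount_args and describe, the rule that a purely numeric argument is the target, and the output wording are all assumptions made for illustration.

```python
def parse_wordcount_args(arg_string):
    """Split macro arguments into (pages, target, include_subpages).

    Assumed rules (not taken from the macro source): a purely numeric
    argument is the target count, the literal word "subpages" asks for
    the subpages of the current page, and anything else is a page name.
    An empty page list is taken to mean "this page".
    """
    pages, target, include_subpages = [], None, False
    for raw in (arg_string or "").split(","):
        arg = raw.strip()
        if not arg:
            continue
        if arg.isdigit():
            target = int(arg)
        elif arg == "subpages":
            include_subpages = True
        else:
            pages.append(arg)
    return pages, target, include_subpages


def describe(count, target=None):
    """Format a count, adding the signed difference when a target is given."""
    if target is None:
        return "%d words" % count
    return "%d words (target %d, difference %+d)" % (count, target, count - target)
```

One consequence of this assumed rule is that a page whose name consists only of digits could not be counted by name.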
Discussion
Does this macro ignore wiki markup when counting words?
A: Not currently. Should it? If so, is there a convenient regexp for wiki markup, or should I write my own? Let me see, what markup should not be counted?
- I can easily ignore things like macro calls, the = mark sequences for the header, the horizontal lines.
- Things inside square brackets: I could count the words in the second member.
- Things between angle brackets, like <-2> or <style="whatever">, should be ignored.
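A minimal sketch of the regexp approach for the three cases above. The patterns are hand-written approximations for illustration, not MoinMoin's own markup rules, and the function name count_words is hypothetical.

```python
import re

# Approximate patterns, assumed for illustration only.
MACRO = re.compile(r"\[\[.*?\]\]")                      # [[WordCount(4000)]]
BRACKET_LINK = re.compile(r"\[([^\s\]]+)([^\]]*)\]")    # [target label words]
ANGLE = re.compile(r"<[^<>\n]*>")                       # <-2>, <style="...">
RULE = re.compile(r"^[ \t]*-{4,}[ \t]*$", re.M)         # ---- horizontal line
HEADING = re.compile(r"^[ \t]*=+[ \t]*|[ \t]*=+[ \t]*$", re.M)  # = Title =


def count_words(text):
    """Count whitespace-separated words after removing some wiki markup."""
    text = MACRO.sub(" ", text)                           # drop macro calls first
    text = BRACKET_LINK.sub(lambda m: m.group(2), text)   # keep only the label
    text = ANGLE.sub(" ", text)
    text = RULE.sub(" ", text)
    text = HEADING.sub(" ", text)                         # keep the heading words
    return len(text.split())
```

Macro calls are removed before single-bracket links so that [[...]] is not mistaken for a link. Punctuation left standing alone after a removal would still be counted as a word.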
In an ideal world, the word counting would happen right after the HTML generation. I could easily strip all the HTML taggery and count the words. But I don't know how to do that: MoinMoinGods out there, suggestions?
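If the generated HTML were available as a string, stripping the tags could look like the sketch below. It uses only Python's standard html.parser module; how to get hold of the generated HTML inside MoinMoin is exactly the open question and is not addressed here. The names TextCollector and count_words_in_html are made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text between tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def count_words_in_html(html):
    """Count whitespace-separated words in the text content of an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent elements apart.
    return len(" ".join(parser.chunks).split())
```

Because text chunks are joined with a space, a word with inline markup in the middle of it would be counted as two words.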
Generally, the wiki parser looks for wiki markup and prints the text between the markup. The parser works like this:
    for line in text:
        for markup in line:
            print text before markup
            replace markup
        print text after last markup
To get a correct word count, you would have to write a new parser that counts the words in the text it finds in this loop, and also counts the words in text inside markup. This is not easy, but otherwise your word count will not be correct. Maybe just print "About <wordcount> words" instead.
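The loop described above could be turned into a counter along these lines. This is only a sketch: the MARKUP pattern is a tiny hypothetical stand-in for the real parser's rules, and it does not count text inside markup (link labels, for example), which a complete version would need.

```python
import re

# Hypothetical, deliberately small markup pattern: macro calls, bold/italic
# quotes, heading markers and angle-bracket attributes.
MARKUP = re.compile(r"\[\[.*?\]\]|'{2,3}|^=+\s|\s=+$|<[^<>]*>")


def count_like_parser(text):
    """Walk each line the way the parser does, counting text between markup."""
    total = 0
    for line in text.splitlines():
        pos = 0
        for match in MARKUP.finditer(line):
            # Text before this piece of markup.
            total += len(line[pos:match.start()].split())
            pos = match.end()
        # Text after the last piece of markup.
        total += len(line[pos:].split())
    return total
```

A word interrupted by markup is counted once per fragment, which is one reason the result is better reported as "about" a number.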
Another idea: I think that all text should be printed using formatter.text() calls. So maybe you can simply create a subclass of the text_html formatter that counts the number of words it prints. The problem is that the formatter prints directly to the client, so the number of words is known only after the whole page has been formatted and sent to the client. You can redirect the page output into a buffer, insert a placeholder for the result of the word count, then replace the placeholder with the count and send the page to the client.
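The buffer-and-placeholder idea might look like this in outline. BaseFormatter is a hypothetical stand-in for the text_html formatter with just two methods, not MoinMoin's real class; PLACEHOLDER, CountingFormatter and render_with_count are likewise names invented for the sketch.

```python
import io

PLACEHOLDER = "@@WORDCOUNT@@"  # assumed marker; must not occur in page text


class BaseFormatter:
    """Hypothetical stand-in for a text_html-style formatter."""

    def __init__(self, out):
        self.out = out

    def text(self, s):
        # Page text goes through here.
        self.out.write(s)

    def raw(self, s):
        # Tags and other non-text output.
        self.out.write(s)


class CountingFormatter(BaseFormatter):
    """Count the words passed to text() while still writing them out."""

    def __init__(self, out):
        super().__init__(out)
        self.words = 0

    def text(self, s):
        self.words += len(s.split())
        super().text(s)


def render_with_count(render_page):
    """Render into a buffer, then replace the placeholder with the count."""
    buffer = io.StringIO()
    formatter = CountingFormatter(buffer)
    render_page(formatter)  # the macro would emit PLACEHOLDER via raw()
    return buffer.getvalue().replace(PLACEHOLDER, str(formatter.words))
```

Usage, with a made-up page renderer:

```python
def page(f):
    f.raw("<p>")
    f.text("Hello wiki world. ")
    f.raw("Words: " + PLACEHOLDER)
    f.raw("</p>")

print(render_with_count(page))
```

The count is only as good as the assumption that every piece of page text passes through text() in whole words.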
Interface
About the interface: I think it's confusing and has unneeded options. How about a simpler syntax:
[[WordCount]] - word count of current page
Will print:
- Word Count: about xxxx words
Syntax:
[[WordCount(children)]] - word count of current page with all children
Will print:
- Word Count: about xxxx words (including children)
-- NirSoffer 2005-02-25 16:15:42
Words in this page: WordCount