Moin uses Unicode internally and UTF-8 as its encoding (since moin 1.3).
On input, it decodes from an encoding (config.charset); on output, it encodes to that same encoding.
I/O happens at these places:
- FILES:
- wiki pages and backup pages
- log files, last-edit, edit lock
- dictionaries
- HTML:
- page output
- form input
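A minimal sketch of that decode-on-input, encode-on-output rule. CHARSET stands in for config.charset, and decode_input, encode_output and SketchRequest are hypothetical names made up for illustration, not Moin's code. It is written for current Python, where every str is already a Unicode string.

```python
import io

CHARSET = "utf-8"  # assumption: stands in for config.charset


def decode_input(raw_bytes):
    """Bytes read from a file or a form -> Unicode string for internal use."""
    return raw_bytes.decode(CHARSET)


def encode_output(text):
    """Unicode string used internally -> bytes written to a file or client."""
    return text.encode(CHARSET)


class SketchRequest:
    """Hypothetical stand-in for a request object; not Moin's class."""

    def __init__(self):
        self.out = io.BytesIO()

    def write(self, text):
        # Callers pass Unicode; encoding happens only here, at the boundary.
        self.out.write(encode_output(text))
```

The point of the design is that encoding and decoding happen only at the I/O boundary, so everything in between can treat text as Unicode.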
FILES
- page filenames are hexified utf-8
- page content is in utf-8
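One way to picture "hexified utf-8" filenames, as a sketch only. The exact scheme is an assumption here: ASCII letters, digits and underscore are kept, and every other run of characters is written as the hex of its UTF-8 bytes in parentheses. hexify_pagename and dehexify_pagename are hypothetical names; check Moin's own quoting code for the real rules.

```python
import re

# Assumption: only these characters are stored unchanged in a filename.
UNSAFE = re.compile(r"[^a-zA-Z0-9_]+")
HEX_GROUP = re.compile(r"\(((?:[0-9a-f]{2})+)\)")


def hexify_pagename(pagename):
    """Unicode page name -> ASCII-only filename (sketch, assumed scheme)."""
    def to_hex(match):
        return "(" + match.group(0).encode("utf-8").hex() + ")"
    return UNSAFE.sub(to_hex, pagename)


def dehexify_pagename(filename):
    """Reverse of hexify_pagename: hex groups back to Unicode text."""
    def from_hex(match):
        return bytes.fromhex(match.group(1)).decode("utf-8")
    return HEX_GROUP.sub(from_hex, filename)
```

Because parentheses themselves count as unsafe and get hexified, the mapping in this sketch stays reversible.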
HTML
page output: request.write will accept Unicode strings and transparently encode to the output encoding.
form input: request.setup_args() should return Unicode strings. ONLY PARTLY DONE SO FAR.
- we need to decode (utf-8 encoded) form input
- we MUST NOT decode arbitrary files uploaded
- URLs are currently hexified utf-8
- URLs can and should be in UTF-8; major browsers will support that.
I tested by copying and pasting a Hebrew page name into the URL line. The browser encoded it to %XX%XX%XX - but it worked!
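The form-input and URL points above can be sketched as follows. decode_form_args, pagename_to_url_path and url_path_to_pagename are hypothetical helpers, and the upload_fields parameter is an assumption about how uploaded files would be told apart from text fields. Only urllib.parse.quote and unquote are real library calls. Written for current Python.

```python
from urllib.parse import quote, unquote


def decode_form_args(raw_args, upload_fields=()):
    """Decode utf-8 form values to Unicode, but leave uploads as raw bytes.

    raw_args maps field names to bytes; upload_fields names the fields
    that carry uploaded file content (an assumption of this sketch).
    """
    decoded = {}
    for key, value in raw_args.items():
        if key in upload_fields:
            decoded[key] = value  # arbitrary file content: MUST NOT decode
        else:
            decoded[key] = value.decode("utf-8")
    return decoded


def pagename_to_url_path(pagename):
    """Unicode page name -> percent-encoded UTF-8 (the %XX%XX%XX form)."""
    return quote(pagename, encoding="utf-8")


def url_path_to_pagename(path):
    """Percent-encoded UTF-8 URL path -> Unicode page name."""
    return unquote(path, encoding="utf-8")
```

This mirrors the browser test described above: a pasted Hebrew name travels as %XX escapes of its UTF-8 bytes and decodes back to the same name.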
internals
- request.getText delivers Unicode strings
for the big ugly regex (CamelCase links, WordIndex etc.), we have pre-made config.chars_{lower,upper,digits,spaces} unicode strings (see MoinMoin/util/chartypes.py) that contain all such characters that exist in UCS-2
- unicodestring.isupper()/islower() work, same for .upper()/lower()
for REs you might have to use re.compile(whatever, re.UNICODE)
codecs.open(fname, mode, encoding) transparently encodes/decodes on writing/reading. Nice.
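A small sketch of the case-method, re.UNICODE and codecs.open notes. It is written for current Python, where every str is Unicode and re.UNICODE is already the default for str patterns; on Python 2 the literals would need to be unicode objects and the flag matters. find_words and roundtrip_file are hypothetical helper names.

```python
import codecs
import os
import re
import tempfile

# \w only matches non-ASCII letters when the pattern is Unicode-aware.
WORD_RE = re.compile(r"\w+", re.UNICODE)


def find_words(text):
    """Return all runs of word characters, including non-ASCII letters."""
    return WORD_RE.findall(text)


def roundtrip_file(text, encoding="utf-8"):
    """Write text with codecs.open, then return (text read back, raw bytes)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with codecs.open(path, "w", encoding) as f:
            f.write(text)  # encoded transparently on write
        with codecs.open(path, "r", encoding) as f:
            read_back = f.read()  # decoded transparently on read
        with open(path, "rb") as f:
            raw = f.read()
        return read_back, raw
    finally:
        os.remove(path)
```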
do not use str(s), but unicode(s) - otherwise you will invoke .encode('ascii') implicitly - and this will crash if there are non-ascii characters in s
if you get a UnicodeDecodeError from the ascii decoder (although you didn't use .decode() at all), this is usually because Python implicitly uses it to translate strings to unicode strings, e.g. when adding or joining strings to unicode strings.
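Both pitfalls come down to a hidden ascii conversion. The sketch below makes the conversions explicit so it runs on any Python version: on Python 2, str(s) on a unicode object amounts to s.encode('ascii'), and mixing a byte string into a unicode expression amounts to decoding it with 'ascii'. The function names are made up for illustration.

```python
def ascii_encode_fails(text):
    """True if text cannot be encoded as ascii (what Python 2 str(s) tries)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def ascii_decode_fails(raw_bytes):
    """True if raw_bytes cannot be decoded as ascii, which is what Python 2
    tries when a byte string is added or joined to a unicode string."""
    try:
        raw_bytes.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


def safe_concat(text, raw_bytes, charset="utf-8"):
    """Decode explicitly with the real charset before mixing with Unicode."""
    return text + raw_bytes.decode(charset)
```

Pure-ascii data passes both checks, which is why these bugs tend to show up only once a page or user name contains a non-ascii character.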
Problems
- How to sort the title/page index in multiple languages?
- We can make it simple by sorting first the config.default_lang pages, then English pages (international), then other pages, in (what?) order.
- How can we get the page's language? A Hebrew page might have an English name, so that it is easier to link to it from the web. We can use the #language tag at the start of the page. Or we can just sort such a page by its name, with the other English pages.
- We need to get the language from a word or a title
- A title can use more than one language, like ?????English - we can sort this as a Hebrew name if default_lang is he, or as an English name when default_lang is en.
- We can use dictionaries, but one word can be in more than one dictionary - again we can use the default lang to choose.
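The sorting idea above can be sketched like this. Everything here is an assumption for illustration: page_language reads a "#language xx" line from the leading "#" lines of the page text, and sort_pages groups default-language pages first, then English, then the rest. The order inside each group is the open "(what?)" question, so plain code point order is used as a placeholder.

```python
def page_language(page_text, fallback):
    """Return the language named by a leading '#language xx' line, else fallback."""
    for line in page_text.splitlines():
        if not line.startswith("#"):
            break  # header lines are over, stop looking
        parts = line.split()
        if parts[0] == "#language" and len(parts) > 1:
            return parts[1]
    return fallback


def sort_key(page, default_lang):
    """page is a (name, lang) pair; lower group numbers sort first."""
    name, lang = page
    if lang == default_lang:
        group = 0
    elif lang == "en":
        group = 1
    else:
        group = 2
    return (group, name)  # placeholder: code point order within a group


def sort_pages(pages, default_lang):
    return sorted(pages, key=lambda page: sort_key(page, default_lang))
```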
How can we get the words from a non-CamelCase page name?
Splitting extended links by spaces might serve everyone. The rule for authors should be simple: Use CamelCase or separated words.
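A rough sketch of that rule, assuming a page name is either space-separated words or CamelCase. split_pagename is a hypothetical helper. It uses the Unicode-aware isupper(), so it also handles non-ascii capitals, and scripts without letter case (such as Hebrew) are split by spaces only.

```python
def split_pagename(name):
    """Split a page name into words: by spaces if any, else at CamelCase humps."""
    if any(ch.isspace() for ch in name):
        return name.split()
    words = []
    current = ""
    for ch in name:
        # Start a new word at an uppercase letter that follows a non-uppercase one.
        if ch.isupper() and current and not current[-1].isupper():
            words.append(current)
            current = ch
        else:
            current += ch
    if current:
        words.append(current)
    return words
```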
other stuff
http://www.cl.cam.ac.uk/~mgk25/unicode.html Unicode FAQ