Description

When config.charset is set to a non-utf-8 charset, saving or creating new pages with non-ascii content is broken.

Steps to reproduce:

  1. Create a new wiki instance, with empty data and underlay dir
  2. Create one page by hand in the underlay, like FrontPage. Use iso-8859-1 encoding for the text file.

  3. Set config.charset to 'iso-8859-1'
  4. Try to edit the page or create new pages using urls with non-ascii page names.
  5. Backtrace :)

Details

MoinMoin Version

1.3 devel

Discussion

It is caused by our wrong strategy for decoding user input. We try to decode all user input, from both the textarea and the query string, as utf-8; if that fails, we try all charsets in http_accept_charsets. This is wrong because:

  1. All user input is sent by browsers using config.charset, which is not utf-8 in this case.
  2. There is no connection between http_accept_charset and the charset used to encode user input.

Fixed in http://nirs.dyndns.org:8000/FrontPage

The fix is based on the old code from moin-1.2, which seems to work fine for most cases. First, we try to decode using utf-8. Because non-ascii characters in utf-8 use at least 2 bytes, starting with a special first byte value, input that is not utf-8 will usually raise UnicodeError, and we then try again with config.charset.
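The fallback described above can be sketched roughly as follows. This is a minimal illustration in modern Python 3, not the actual MoinMoin code; the function name decode_user_input and the charset parameter (standing in for config.charset) are assumptions made for the example.

```python
def decode_user_input(raw, charset="iso-8859-1"):
    """Decode raw request bytes to text.

    Hypothetical helper: `charset` stands in for config.charset.
    """
    try:
        # Non-utf-8 input with non-ascii bytes usually fails here.
        return raw.decode("utf-8")
    except UnicodeError:
        # Fall back to the charset the wiki is configured with.
        return raw.decode(charset, "replace")


utf8_bytes = "caf\u00e9".encode("utf-8")
iso_bytes = "caf\u00e9".encode("iso-8859-1")
print(decode_user_input(utf8_bytes) == decode_user_input(iso_bytes))
```

Both byte strings decode to the same text, because the iso-8859-1 bytes are not valid utf-8 and so take the fallback path.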

The fix contains:

  1. Page name is NOT sent to unquoteWikiname - do not try to use the internal quoting in page names. Now we can change our quoting without affecting links in pages or on the web.
  2. white space and slashes are normalized, as described in PageNames

  3. Each subpage is decoded separately, so we can decode a Hebrew sub page within an iso parent page
  4. What we can't decode is replaced with "?", so if you try to use characters which are not supported on the wiki, you simply get a name with ??? instead of crashing.
  5. Group page names are cleaned of characters that are not acl friendly (the same function will be used for user names). Each path component is cleaned separately.
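Items 2 to 4 above might look something like the sketch below. It is written in Python 3 for illustration and is not the code from the fix; normalize_pagename and decode_component are hypothetical names, and the whitespace and slash rules shown (collapse runs of whitespace, drop empty path components) are an assumption about what PageNames specifies.

```python
def decode_component(raw, charset):
    """Decode one path component: utf-8 first, then the wiki charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeError:
        # Undecodable bytes become "?" instead of raising.
        return raw.decode(charset, "replace").replace("\ufffd", "?")


def normalize_pagename(raw, charset="iso-8859-1"):
    """Decode and normalize a raw page name given as bytes."""
    parts = []
    for component in raw.split(b"/"):
        text = decode_component(component, charset)
        # Assumed rule: collapse whitespace runs, strip the ends.
        text = " ".join(text.split())
        if text:
            # Assumed rule: empty components (doubled slashes) are dropped.
            parts.append(text)
    return "/".join(parts)


# An iso-8859-1 parent page with a utf-8 Hebrew sub page.
raw = "Caf\u00e9".encode("iso-8859-1") + b"/" + "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")
print(normalize_pagename(raw))
```

Decoding per component is what lets the two parts of the name use different encodings without either one spoiling the other.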

Unsolved problems

The fix does not fix the case where you add a word to a page name that uses iso characters. In this case, the name can't be decoded as utf-8 because of the existing iso characters, so it is decoded using iso. This converts the unicode to garbage.
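One plausible reading of this failure is a single name that mixes iso-8859-1 bytes with a utf-8 encoded word. The Python 3 sketch below illustrates that reading only; decode_whole_name is a hypothetical helper, not a MoinMoin function.

```python
def decode_whole_name(raw, charset="iso-8859-1"):
    """Decode a whole name at once: utf-8 first, then the wiki charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeError:
        return raw.decode(charset, "replace")


intended = "Caf\u00e9 \u05e9\u05dc\u05d5\u05dd"
# Existing iso-8859-1 name, plus an added word encoded as utf-8.
mixed = "Caf\u00e9 ".encode("iso-8859-1") + "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")

decoded = decode_whole_name(mixed)
# The iso part survives; the added word is decoded byte by byte.
print(decoded == intended, len(intended), len(decoded))
```

The iso byte makes the utf-8 attempt fail, so the added word is decoded one byte at a time as iso-8859-1 and comes out as twice as many unrelated characters.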

There is a bug in Mozilla, https://bugzilla.mozilla.org/show_bug.cgi?id=261929, where a url you type is sent neither in utf-8 nor in config.charset. You can "fix" Mozilla by going to about:config, filtering by utf, then double clicking "network.standard-url.encode-utf8". After that, all urls are encoded as utf-8.

A workaround to these problems is to add the page with a link on the wiki, or through FindPage "Go" field.

Plan


CategoryMoinMoinBugFixed

MoinMoin: MoinMoinBugs/NonUtf8Charset (last edited 2007-10-29 19:05:59 by localhost)