Description
When using a non-utf-8 charset as config.charset, saving or creating pages with non-ascii content is broken.
Steps to reproduce:
- Create a new wiki instance, with empty data and underlay dir
- Create one page by hand in the underlay, like FrontPage. Use iso-8859-1 encoding for the text file.
- Set config.charset to 'iso-8859-1'
- Try to edit the page or create new pages using urls with non-ascii page names.
Backtrace
Details
MoinMoin Version: 1.3 devel
Discussion
It is caused by our wrong strategy for decoding user input. We try to decode all user input, both from the textarea and the query string, as utf-8; if that fails, we try all charsets in http_accept_charsets. This is wrong because:
- All user input is sent by browsers using config.charset, which is not utf-8 in this case.
- There is no connection between http_accept_charset and the charset used to encode user input.
Fixed in http://nirs.dyndns.org:8000/FrontPage
The fix is based on the old code from moin-1.2, which seems to work fine for most cases. First, we try to decode using utf-8. Because non-ascii characters are encoded in utf-8 as multi-byte sequences with distinctive lead bytes, text that is not utf-8 will usually raise UnicodeError, and then we try again with config.charset.
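The strategy can be sketched in Python (a minimal illustration, not the actual patch code; the function name and the charset default are made up):

```python
def decode_user_input(raw, config_charset='iso-8859-1'):
    """Decode bytes sent by the browser, preferring utf-8.

    Invalid utf-8 almost always raises UnicodeError, so a failure
    is a strong hint the browser used config.charset instead.
    """
    try:
        return raw.decode('utf-8')
    except UnicodeError:
        # Not valid utf-8 -- fall back to the wiki's configured charset.
        return raw.decode(config_charset, 'replace')
```

For example, the iso-8859-1 bytes for "Jürgen" fail the utf-8 attempt and are correctly decoded by the fallback, while genuine utf-8 input passes the first attempt untouched.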
The fix contains:
- Page name is NOT sent to unquoteWikiname - do not try to use the internal quoting in page names. Now we can change our quoting without affecting links in pages or on the web.
Don't do this: http://nirs.dyndns.org:8000/FrontPage(%2f)SubPage
- White space and slashes are normalized, as described in PageNames
- Try:
http://nirs.dyndns.org:8000/%00 - Null invisible page
http://nirs.dyndns.org:8000/__FrontPage__ - leading and trailing spaces
- Each subpage is decoded separately, so we can decode a Hebrew subpage within an iso parent page
- What we can't decode is replaced with "?", so if you try to use characters which are not supported on the wiki, you simply get a name with "???" instead of a crash.
- Clean up group pages from non-ACL-friendly characters (the same function will be used for user names). Each path component is cleaned separately.
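The per-component handling above can be sketched as follows (a hypothetical illustration assuming these behaviors, not the patch itself; the real code replaces undecodable characters with "?", which the sketch mimics explicitly):

```python
def decode_page_name(raw, config_charset='iso-8859-1'):
    """Decode a raw page name, handling each subpage separately."""
    parts = []
    for component in raw.split(b'/'):
        try:
            name = component.decode('utf-8')
        except UnicodeError:
            # Fall back to config.charset; anything still undecodable
            # becomes U+FFFD, which we map to '?' instead of crashing.
            name = component.decode(config_charset, 'replace')
            name = name.replace('\ufffd', '?')
        # Normalize runs of white space within the component.
        parts.append(' '.join(name.split()))
    return '/'.join(parts)
```

Because each component is decoded on its own, a utf-8 Hebrew subpage name can live under an iso-8859-1 parent name without forcing one charset on the whole path.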
Unsolved problems
The fix does not fix the case when you add a word to a page name that already uses iso characters. In this case, the name can't be decoded as utf-8, because of the existing iso characters, and is decoded using iso. This converts the unicode to garbage.
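A small demonstration of the garbling (assuming the browser sends the existing part of the name in iso-8859-1 and the newly typed word in utf-8):

```python
# Existing iso-8859-1 bytes for "Jürgen" plus a new word in utf-8.
raw = b'J\xfcrgen' + 'Ünicode'.encode('utf-8')

try:
    name = raw.decode('utf-8')        # fails: \xfc is not valid utf-8
except UnicodeError:
    name = raw.decode('iso-8859-1')   # succeeds, but the utf-8 bytes of
                                      # 'Ü' now read as two bogus chars
```

The iso fallback decodes every byte, so no error is raised, and the utf-8-encoded "Ü" silently turns into two garbage characters in the resulting page name.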
There is a bug in Mozilla (https://bugzilla.mozilla.org/show_bug.cgi?id=261929) where a URL you type is sent neither in utf-8 nor in config.charset. You can "fix" Mozilla by going to about:config, filtering by "utf", then double-clicking "network.standard-url.encode-utf8". After that, all URLs are encoded as utf-8.
A workaround for these problems is to add the page via a link on the wiki, or through the FindPage "Go" field.
Trying to type in the browser url box: http://nirs.dyndns.org:8000/J%fcrgenHermann%c3%9cnicode
Creating from another page: http://nirs.dyndns.org:8000/J%fcrgenHermann%dcnicode
Plan
- Priority: Medium
- Assigned to:
- Status: Fixed in patch-206