Description

This bug occurs in version 1.5.8 and also in 1.6.0 (2007.10.22). When we try to upload an attachment whose name contains Chinese characters, such as "这是一个附件.jpg", MoinMoin shows an error, something like a utf8 error.

Steps to reproduce

Upload an attachment named "这是一个附件.jpg", or rename it to that during upload, and the error occurs.

Example

Component selection

general

Details

MoinMoin Version	1.5.8 and 1.6.0
OS and Version	windows xp
Python Version	python 2.4
Server Setup	apache 2.2.4
Server Details	mod_python or cgi
Language you are using the wiki in (set in the browser/UserPreferences)	any language

Workaround

I think I have already fixed it, but I do not know how to submit the fix to MoinMoin's source.

The cause of this bug is that the current AttachFile.py always uses "fn.decode(config.charset)" or "fn.encode(config.charset)" to handle file names, and I found that config.charset is "utf-8".

I think this is the wrong concept. The file system has its own encoding, which can be obtained with "sys.getfilesystemencoding()". So every "config.charset" in AttachFile.py should be replaced by "sys.getfilesystemencoding()".

Attached is a fixed AttachFile.py, based on the 1.5.8 release.

AttachFile_base1.5.8_modified.py

BTW, about upgrading:

You have to rename all attachments to the right encoding, especially file names that contain characters between unichr(255) and unichr(65536). Otherwise, you will get wrong file names in your MoinMoin.

Discussion

I think using anything other than utf-8 would be wrong, because our goal is not to have nice filenames in the filesystem, but to be able to encode any filename we get - and that is only possible with unicode encodings (like utf-8). There are lots of systems out there with iso-8859-1 as the file system encoding, and that certainly won't encode Chinese chars.
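The encoding argument above can be checked directly. This is a minimal sketch written for Python 3 (the thread itself uses Python 2.4); the helper name `can_encode` is made up for illustration and is not part of MoinMoin.

```python
def can_encode(name, encoding):
    """Return True if the text `name` can be represented in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


filename = "这是一个附件.jpg"

# utf-8 is a Unicode encoding, so the Chinese name can be encoded.
utf8_ok = can_encode(filename, "utf-8")

# iso-8859-1 only covers U+0000..U+00FF, so the Chinese characters fail.
latin1_ok = can_encode(filename, "iso-8859-1")

print(utf8_ok, latin1_ok)
```

On a system whose file system encoding is iso-8859-1, replacing config.charset with sys.getfilesystemencoding() would therefore run into the second case.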

But if you get some exception when using this, the reason is a different one anyway. We need: traceback.html

-- ThomasWaldmann 2007-10-23 06:34:27

As you wish, I have submitted the traceback.html. I hope this problem can be fixed in the next version of MoinMoin. Many Chinese, Japanese, Korean and other users have been complaining about this problem for a long time. -- Timesking

Can you please run this and submit the results (I checked the filename and it is valid utf-8, so the question is rather why it is not there):

dir D:\moin\mywiki\data\pages\(e58685e5ad98e7aea1e79086)\attachments\

Can you please also run this:

cd D:\moin\mywiki\data\pages\(e58685e5ad98e7aea1e79086)\attachments\
python
>>> import os
>>> os.listdir('.')

BTW, the long term fix is UnifyPagesAndAttachments.

Sorry, that traceback.html was wrong; I have uploaded it again. BTW, when uploading the file, the rename field is "这是一个附件", not "这是一个附件.jpg".

dir result

2007-10-23  15:39            21,016 杩欐槸涓€涓檮浠

That dir result is not a valid Chinese word, right?

os.listdir('.') result

['\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb']

That is not a valid utf-8 string. And utf-8 encoding the "dir result" does not even look similar to that string. So it is not quite clear to me what's happening.

>>> print '\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84'.decode('utf-8')
这是一个附件

It looks like it doesn't get the chars at the end of the string right. Maybe this is due to some Windows internal recoding?

"这是一个附件" written to a binary file as utf8 gives the following:
\xef\xbb\xbf\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb\xb6

but from listdir we only get this file name:

'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb'

It seems some bytes were lost when writing the file name.

At the start it loses
\xef\xbb\xbf
and at the end it loses
\xb6

BTW, it is easy to reproduce: just select any file to upload, but set the rename input field to "这是一个附件".

\xef\xbb\xbf is the utf-8 encoding of 0xfeff. It is the utf-8 header (byte order mark), so we can ignore it.
So the most important question is why \xb6 has been lost.
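The byte comparison above can be reproduced as follows. This is a small sketch in Python 3, not code from the report; the variable names are made up for illustration.

```python
import codecs

# The name as it should be stored (utf-8 encoded, no byte order mark).
expected = "这是一个附件".encode("utf-8")

# The name that os.listdir('.') returned in this report.
listed = b"\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb"

# The listed name is the expected name with bytes missing at the end.
is_truncated = expected.startswith(listed)
missing = expected[len(listed):]

# The leading EF BB BF seen in the editor output is the utf-8 byte order mark.
bom = codecs.BOM_UTF8

# A name cut in the middle of a character no longer decodes as utf-8.
try:
    listed.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False

print(is_truncated, missing, bom, decodes)
```

So only the final byte of the last character is gone, which is why the whole name becomes undecodable.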

Anyway, if you run the following code in a Python environment:

import struct
stream = open("c:\\abc\xb6", 'wb')
for i in xrange(65536):
    stream.write(struct.pack("H", i))
stream.close()

you will only get the file c:\abc. It is strange: os.listdir shows only 'abc'. Python ignores the '\xb6' at the end.

This test is invalid - you should try a valid utf-8 string.

But if you write it like this:

import struct
stream = open("c:\\abc\xb6def", 'wb')
for i in xrange(65536):
    stream.write(struct.pack("H", i))
stream.close()

you will get 'abc\xb6def' from os.listdir.

This test may also be invalid.
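One possible explanation for the two results above, which nobody in this thread has confirmed: Python 2 hands a byte-string filename to Windows unchanged, and Windows interprets it in the local ANSI code page. Assuming that code page is cp936 (GBK) on a Simplified Chinese Windows XP, a trailing \xb6 is the first half of a two-byte character with nothing after it, while \xb6 followed by 'd' can pair up with that byte. The sketch below (Python 3) only uses Python's gbk codec as a stand-in for the Windows conversion, which may differ in detail; `decodes_as` is a made-up helper.

```python
def decodes_as(data, encoding):
    """Return True if the bytes `data` are valid in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


dangling = b"abc\xb6"     # \xb6 is the last byte of the name
embedded = b"abc\xb6def"  # \xb6 is followed by further bytes

# Neither name is valid utf-8: \xb6 is a continuation byte with no lead byte.
dangling_utf8 = decodes_as(dangling, "utf-8")
embedded_utf8 = decodes_as(embedded, "utf-8")

# In GBK, \xb6 starts a two-byte character, so it cannot stand alone at the end.
dangling_gbk = decodes_as(dangling, "gbk")

# Followed by another byte, \xb6 may pair with it to form one character.
embedded_gbk = decodes_as(embedded, "gbk")

print(dangling_utf8, embedded_utf8, dangling_gbk, embedded_gbk)
```

If this assumption holds, it would also explain why the last byte of the utf-8 encoded attachment name disappears: the final \xb6 of "这是一个附件" is left over after the preceding bytes have been paired up.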

Why? Who knows. It is really a bad idea to use decode('utf-8') for the final attachment file name.

I guess this is a windows specific problem. Does it happen on this wiki here?

No, I have tried it on /testing1 and it works well. So, how can we deal with this problem without changing the server to Linux?

Or will MoinMoin drop Windows server users?

Please try this on Windows (tested here on Mac OS X, Python 2.4.4):

>>> import os
>>> dir = 'unicode-test'
>>> filename = u'\u8fd9\u662f\u4e00\u4e2a\u9644\u4ef6'.encode('utf-8')
>>> os.mkdir(dir)
>>> open(os.path.join(dir, filename), 'wb')
<open file 'unicode-test/这是一个附件', mode 'wb' at 0x4c4a0>
>>> os.listdir(dir)
['\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb\xb6']
>>> filename in os.listdir(dir)
True

I got the same problem when I attached some files that have Japanese characters in their filenames. Maybe it has the same cause as the problem on this page.

I tried to find the cause in traceback_attached_jpn.txt, but failed.
Found on these platforms:
- Moin 1.6.3, Moin 1.7.0rc2 with DesktopEdition Mode on VistaSP1 [Japanese] NTFS format
- Moin 1.5.8 on WS2003 SP1 [Japanese] NTFS format

-- HidekazuHara 2008-06-12 12:00:00

I tried a fix against 1.7.0 using punycode and urllib. Please see the pydoc in AttachFile_base170_modified.py.
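The attached file is not reproduced here. As a rough illustration of the general idea only, both punycode and urllib quoting can turn a unicode name into an ASCII-only string that any file system can store. The sketch uses Python 3 standard library calls (`urllib.parse.quote`; in Python 2 the function is `urllib.quote`) and does not claim to match what the patch actually does.

```python
from urllib.parse import quote, unquote

name = "这是一个附件.jpg"

# Percent-encode the utf-8 bytes of the name; the result is pure ASCII.
quoted = quote(name, safe="")

# Punycode is another ASCII-only representation of the same text.
puny = name.encode("punycode")

# Both representations can be converted back to the original name.
from_quoted = unquote(quoted)
from_puny = puny.decode("punycode")

print(quoted)
print(puny)
```

Either form avoids depending on the file system encoding, at the price of unreadable names in the attachments directory.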

Anyway, the first line of the code below is (logically) invalid on MS Windows. It is possibly also invalid on some Linux systems which don't use a utf-8 file system.

f = open(u"SomeUnicodeFilename".encode("someCodec")) # Invalid!
f = open(u"SomeUnicodeFilename") # Valid, we should give unicode object itself.
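A runnable version of the second, valid form might look like this. It is a sketch for Python 3, where every str is already a unicode object; it uses a temporary directory so that nothing in a real wiki is touched, and it assumes the local file system can store the name at all.

```python
import os
import tempfile

# The attachment name from this report, written with escapes.
name = "\u8fd9\u662f\u4e00\u4e2a\u9644\u4ef6"

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, name)

    # Pass the text string itself; Python converts it for the platform.
    with open(path, "wb") as f:
        f.write(b"test")

    # Listing a directory given as text returns text names.
    found = name in os.listdir(directory)

print(found)
```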

-- SuzumizakiKimitaka 2008-06-26 09:43:05

Many POSIX systems accept every bytestring (no matter whether they use utf-8 encoding in the file system or not), so we never heard of problems there (we know that it might not display correctly when using ls to see the filename, but that never was a requirement).

For Win32, we need more details:

- does open() really accept a unicode object (can be ucs4, 32bit unicode, depending on how your python was compiled) as you say above, or is it rather utf-16 maybe (note that ucs2 and utf-16 give the same result except for stuff that can't be represented in 16 bits)?
- does this depend on the filesystem used (well, we do recommend using ntfs, but still it would be interesting what happens with fat32)?

-- ThomasWaldmann 2008-06-26 10:26:35

Python on Windows has been accepting unicode strings for years (for ntfs as well as fat32), it will do the conversion internally. For historical reasons, Moin has been storing the filenames without respecting the fs encoding (and thereby not using this feature). Nevertheless a migration script would be needed to make a transition to the fs-encoding-driven approach. Technically, there is no problem on windows besides the fact that it changes the data storage semantics. -- AlexanderSchremmer

Well, let's discuss how to proceed with this. Some info first:

- see GoogleSoc2008 - ChristopherDenter is working on getting a new storage / storage backend api production ready - introducing this in some future moin version will have to include migration scripts that read 1.5/6/7 compatible storage and write to new storage backend.
- when looking at pagename storage, we already solved the problem using wikiutil.quoteWikinameFS(). You can even transfer page dirs from windows to posix and vice versa without having any problems there.
- pages and attachments will get unified to mimetype items in the future (with new storage backend, likely using quoteWikinameFS for the FS backend)

Looking at this, we have some options that are easier than others about how to proceed:

do nothing:
- the new storage backend will solve this anyway
- we have to write a migration script for it anyway, too
- likely not released before end 2010
use quoteWikinameFS for encoding of attachment filenames:
- needs no configuration
- platform independent storage
- needs a migration script for renaming all attachments
- needs good testing
- could be done earlier if someone provides good code for it as a patch to current code (the smaller and cleaner the code is, the more likely it will get accepted - if we want to apply it within 1.7.x we don't want big code changes, and the next release, 1.8, might use the new storage code already).
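To make the second option more concrete, here is a sketch of a quoting scheme in the style of the page directory name quoted earlier on this page, (e58685e5ad98e7aea1e79086). It is not MoinMoin's wikiutil.quoteWikinameFS; the function names `quote_fs_name` and `unquote_fs_name` are hypothetical and the exact rules of the real function may differ. Written for Python 3.

```python
import re


def quote_fs_name(name):
    """Replace each run of characters outside [A-Za-z0-9_] by '(hex)'.

    The hex digits are the utf-8 bytes of the run, so the result is
    pure ASCII and does not depend on the file system encoding.
    """
    def encode_run(match):
        return "(" + match.group(0).encode("utf-8").hex() + ")"

    return re.sub(r"[^A-Za-z0-9_]+", encode_run, name)


def unquote_fs_name(quoted):
    """Reverse quote_fs_name()."""
    def decode_run(match):
        return bytes.fromhex(match.group(1)).decode("utf-8")

    return re.sub(r"\(([0-9a-f]+)\)", decode_run, quoted)


page_dir = quote_fs_name("内存管理")
attachment = quote_fs_name("这是一个附件.jpg")
print(page_dir)
print(attachment)
```

Because parentheses are themselves quoted, a stored name can always be converted back without ambiguity, which is what makes the storage platform independent.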

-- ThomasWaldmann 2008-06-26 11:28:58

OK, I see. And sorry, I am a MoinMoin newbie. My understanding is as follows; is this right?

- The current attachment file system will be replaced in the near future, as described under ChristopherDenter.
- The problem discussed on this page will be solved by that work.
- But the replacement will not be done until the end of 2010 (or later).
- If someone wants a fix earlier, someone has to provide the migration script (and good test cases) to rename all existing attachments.

Hmm. It seems hard for me to write migration code just now, because I haven't launched a MoinMoin wiki yet; I am just testing.
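For orientation, the renaming step of such a migration could look roughly like the sketch below. It is not an actual MoinMoin migration script: `plan_renames` and `migrate` are hypothetical names, the new naming scheme is passed in as a function (here `urllib.parse.quote` as a placeholder), and it assumes the old names are stored as utf-8 bytes. It is written for Python 3 and runs against a temporary directory only.

```python
import os
import tempfile
from urllib.parse import quote


def plan_renames(attach_dir, new_name):
    """Return (old, new) byte-name pairs for attachments that would change."""
    plan = []
    for raw in sorted(os.listdir(os.fsencode(attach_dir))):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # damaged name: leave it for manual repair
        target = new_name(text).encode("ascii")
        if target != raw:
            plan.append((raw, target))
    return plan


def migrate(attach_dir, new_name):
    """Rename every attachment in attach_dir according to new_name()."""
    base = os.fsencode(attach_dir)
    for old, new in plan_renames(attach_dir, new_name):
        os.rename(os.path.join(base, old), os.path.join(base, new))


# Demo: create two attachments with utf-8 byte names, then migrate them.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_base = os.fsencode(demo_dir)
    for text in ("plain.txt", "这是一个附件.jpg"):
        with open(os.path.join(demo_base, text.encode("utf-8")), "wb"):
            pass
    migrate(demo_dir, lambda text: quote(text, safe=""))
    migrated = sorted(os.listdir(demo_dir))

print(migrated)
```

Separating the plan from the rename makes a dry run possible, which helps with the good testing asked for above.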

-- SuzumizakiKimitaka 2008-06-26 22:03:34

Yes, you understood it right. As the new storage code is still work in progress, we can't be sure about when it will be production-ready. We hope that 1.8-storage repo will be in that state at end of SOC 2008, but we will need some time after that for merging the code, testing and finally creating a new release from it.

As 1.7 is already released, we primarily want to do bugfixes in 1.7.x, not bigger changes (1.8 branch is for developing bigger changes).

I will have a look later at how big the minimum changes are (e.g. when just using quoteWikinameFS instead of encode(config.charset)) and whether we can do that within 1.7.x. BTW, if you are interested, maybe join us on MoinMoinChat.

-- ThomasWaldmann 2008-06-26 15:03:27

As discussed above, some files with Japanese characters in their filenames cannot be attached.

UnicodeDecodeError
'utf8' codec can't decode bytes in position 19-21: invalid data

Once this error appears, the page becomes inaccessible. To rescue the page, the attached file must be manually removed from the attachments folder, and the last line in the edit-log file must also be removed.

This is known and there are bug reports for this. It may work if your (file)system encoding is utf-8. Fixing this would need major changes in AttachFile and other parts of moin, so we decided to fix that in moin 2.0 (please use ascii filenames until then).
- Thanks for your consideration for non-ascii characters. It is great to hear that moin 2.0 will support these characters in filenames, but until then, is it somehow possible to avoid getting stuck on an error page? For example, how about checking the filename before processing the attachment action, and if non-ascii characters (or only invalid ascii characters) are found, going back to the upload page with a warning such as "Use ascii character for filename"? Another simple (but not very effective?) idea may be to display a message recommending the use of ascii filenames on the upload page itself if the user's language setting is non-ascii. I myself never use Japanese characters in any filenames, but I see many others do, and I am afraid that they will be puzzled if they hit the error and find the page inaccessible.
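The check suggested here could be as small as the following sketch. It is written for Python 3, the function name `check_attachment_name` is hypothetical, and where such a check would hook into the AttachFile action is not shown because that depends on MoinMoin's code.

```python
def check_attachment_name(name):
    """Return None if the name is plain ascii, otherwise a warning text."""
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        return "Use ascii character for filename"
    return None


# An ascii name passes; the name from this report produces the warning.
ok_result = check_attachment_name("report.jpg")
warn_result = check_attachment_name("这是一个附件.jpg")

print(ok_result, warn_result)
```

Rejecting the name before anything is written would avoid both the broken file in the attachments folder and the edit-log entry that currently have to be removed by hand.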

-- TakeoKatsuki 2009-07-05 00:11:49

Plan

Priority:
Assigned to:
Status:

CategoryMoinMoinBug

MoinMoin: MoinMoinBugs/Non-ASCII attachment names on Windows (last edited 2009-11-28 08:27:26 by ReimarBauer)