Description

This bug occurs in version 1.5.8 and also in 1.6.0 (2007-10-22). When we try to upload an attachment whose name is a non-ASCII string such as "这是一个附件.jpg" (Chinese), MoinMoin shows an error (a utf8 decoding error).

Steps to reproduce

  1. Upload an attachment named "这是一个附件.jpg", or use that string as the rename value during upload. The error then appears.

Example

Component selection

Details

MoinMoin Version

1.5.8 and 1.6.0

OS and Version

Windows XP

Python Version

Python 2.4

Server Setup

Apache 2.2.4

Server Details

mod_python or cgi

Language you are using the wiki in (set in the browser/UserPreferences)

any language

Workaround

I think I have already fixed it, but I do not know how to submit the fix to MoinMoin's source.

The cause of this bug is that the current AttachFile.py always uses "fn.decode(config.charset)" or "fn.encode(config.charset)" to handle file names, and I found that config.charset is "utf-8".

I believe this is the wrong approach. The file system has its own encoding, which can be obtained with "sys.getfilesystemencoding()". So every "config.charset" in AttachFile.py should be replaced by "sys.getfilesystemencoding()".
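To see how the two encodings differ on a given machine, the same name can be encoded both ways. This is only a minimal sketch in modern Python 3 (the thread itself uses Python 2.4); the variable names are illustrative and are not taken from AttachFile.py.

```python
import sys

name = "这是一个附件.jpg"

# What the wiki uses: config.charset, which the report says is "utf-8".
as_charset = name.encode("utf-8")

# What the report proposes instead: the file system encoding.
fs_encoding = sys.getfilesystemencoding()
try:
    as_fs = name.encode(fs_encoding)
except UnicodeEncodeError:
    # The file system encoding cannot represent this name at all.
    as_fs = None

print("config.charset bytes:", as_charset)
print("file system encoding:", fs_encoding)
print("file system bytes:   ", as_fs)
```

If the two byte strings differ, files written under one scheme will not be found under the other, which is why the upgrade note further down asks for existing attachments to be renamed.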

Attached is a fixed AttachFile.py, based on the 1.5.8 release:

AttachFile_base1.5.8_modified.py

BTW, about upgrading:

You have to rename all existing attachments to the right encoding, especially file names that contain characters between unichr(255) and unichr(65536). Otherwise, you will get wrong file names in your MoinMoin.

Discussion

I think using anything other than utf-8 would be wrong, because our goal is not to have nice filenames in the filesystem, but to be able to encode any filename we get, and that is only possible with unicode encodings (like utf-8). There are lots of systems out there having iso-8859-1 as file system encoding, and this for sure won't encode Chinese characters.

But if you get some exception when using this, the reason is a different one anyway. We need: traceback.html

-- ThomasWaldmann 2007-10-23 06:34:27
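The objection above, that a non-Unicode file system encoding cannot represent every name, can be checked directly. This is a small sketch in modern Python 3; can_encode is a hypothetical helper, not a MoinMoin function.

```python
def can_encode(name, encoding):
    """Return True if every character of name is representable in encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


name = "这是一个附件.jpg"
for encoding in ("utf-8", "iso-8859-1", "ascii"):
    print(encoding, can_encode(name, encoding))
```

Only the Unicode encoding accepts the name; a wiki that encoded names with iso-8859-1 would have to reject the upload or lose characters.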

As you wish, I have submitted the traceback.html. I hope this problem can be fixed in the next version of MoinMoin. Many Chinese, Japanese, Korean and other users have complained about this problem for a long time. -- Timesking

Can you please run this and submit the results (I checked the filename and it is valid utf-8, so the question is rather why it is not there):

dir D:\moin\mywiki\data\pages\(e58685e5ad98e7aea1e79086)\attachments\

Can you please also run this:

cd D:\moin\mywiki\data\pages\(e58685e5ad98e7aea1e79086)\attachments\
python
>>> import os
>>> os.listdir('.')

(!) BTW, the long term fix is UnifyPagesAndAttachments.


Sorry, that traceback.html was wrong; I have uploaded it again. BTW, when uploading the file, the rename value was "这是一个附件", not "这是一个附件.jpg".

dir result

2007-10-23  15:39            21,016 杩欐槸涓€涓檮浠

That dir result is not valid Chinese, right?

os.listdir('.') result

['\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb']

>>> print '\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84'.decode('utf-8')
这是一个附

It looks like it doesn't get the chars at the end of the string right. Maybe this is due to some windows internal recoding?

"这是一个附件" written to a binary file as utf8 gives the following bytes:
\xef\xbb\xbf\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb\xb6

but from listdir we only get this file name:

'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb'

It seems that some bytes were lost when the file name was written.

Lost at the start:
\xef\xbb\xbf
Lost at the end:
\xb6

BTW, it is easy to reproduce: select any file to upload, but set the rename input field to "这是一个附件".


\xef\xbb\xbf is the utf-8 encoding of U+FEFF, the byte order mark that some tools write at the start of a file. It is not part of the name, so we can ignore it.
So the important question is why \xb6 has been lost.
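The byte comparison can be reproduced with a short sketch in modern Python 3 (the thread uses Python 2.4). It shows that the leading \xef\xbb\xbf is what the "utf-8-sig" codec prepends to file content, and that the only byte missing from the listed name is the final \xb6. The variable names are illustrative.

```python
import codecs

name = "这是一个附件"
expected = name.encode("utf-8")

# File name reported by os.listdir in the thread.
listed = b"\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb"

# "utf-8-sig" prepends the byte order mark; it belongs to file content,
# not to file names.
with_bom = name.encode("utf-8-sig")
assert with_bom == codecs.BOM_UTF8 + expected

# Whatever remains of the expected bytes after the listed prefix was lost.
missing = expected[len(listed):]
print("expected:", expected)
print("listed:  ", listed)
print("missing: ", missing)
```

Because the last character is cut in the middle of its three-byte sequence, the listed name is no longer valid utf-8, which matches the utf8 error in the report.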

Anyway, if you run the following code in a Python environment:

import struct
stream = open("c:\\abc\xb6", 'wb')
for i in xrange(65536):
    stream.write(struct.pack("H", i))
stream.close()

you will only get the file c:\abc, which is strange. os.listdir shows only 'abc'; the '\xb6' at the end is dropped.

But if you write it like this:

import struct
stream = open("c:\\abc\xb6def", 'wb')
for i in xrange(65536):
    stream.write(struct.pack("H", i))
stream.close()

you will get 'abc\xb6def' from os.listdir.

Why? Who knows. It is really a bad idea to use decode('utf-8') for the final attachment file name.
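One possible explanation, which nobody in this thread confirms, is that Windows interprets a byte-string file name in the system ANSI code page. On a Chinese system that would be code page 936 (GBK), where \xb6 is the first byte of a two-byte character: at the end of a name it is incomplete and gets dropped, while in "\xb6d" it pairs with the following byte and survives. The sketch below imitates this in modern Python 3. The code page, the errors="ignore" policy and the helper name ansi_roundtrip are all assumptions, and Python's gbk codec only approximates what Windows does.

```python
def ansi_roundtrip(raw, codepage="gbk"):
    """Imitate bytes -> code page text -> bytes, dropping undecodable bytes.

    Hypothetical helper: the code page and the "ignore" policy are
    assumptions about Windows behaviour, not facts from this thread.
    """
    text = raw.decode(codepage, errors="ignore")
    return text.encode(codepage)


# A lone lead byte at the end of the name has no partner and is dropped.
print(ansi_roundtrip(b"abc\xb6"))

# Followed by "d", the same byte forms a complete two-byte character.
print(ansi_roundtrip(b"abc\xb6def"))
```

If this is what happens, utf-8 byte strings are only safe as file names on such a system when they also happen to be complete sequences in the ANSI code page.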

I guess this is a windows specific problem. Does it happen on this wiki here?

No, I have tried it on /testing1 and it works well. So how can this problem be solved without moving the server to Linux?

Or will MoinMoin drop Windows server users?

Please try this on Windows (tested here on Mac OS X, Python 2.4.4):

>>> import os
>>> dir = 'unicode-test'
>>> filename = u'\u8fd9\u662f\u4e00\u4e2a\u9644\u4ef6'.encode('utf-8')
>>> os.mkdir(dir)
>>> open(os.path.join(dir, filename), 'wb')
<open file 'unicode-test/这是一个附件', mode 'wb' at 0x4c4a0>
>>> os.listdir(dir)
['\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe9\x99\x84\xe4\xbb\xb6']
>>> filename in os.listdir(dir)
True


I got the same problem when I attached some files with Japanese characters in the filename. It probably has the same cause as the problem on this page.

-- HidekazuHara 2008-06-12 12:00:00

I tried a fix against 1.7.0 using punycode and urllib. Please see the pydoc in AttachFile_base170_modified.py.

Anyway, the first line of the code below is (logically) invalid on MS Windows. It is possibly also invalid on some Linux systems whose file system does not use utf-8.

f = open(u"SomeUnicodeFilename".encode("someCodec")) # Invalid!
f = open(u"SomeUnicodeFilename") # Valid, we should give unicode object itself.

-- SuzumizakiKimitaka 2008-06-26 09:43:05
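In Python 3, file names are text by default, which corresponds to the second, valid form above. A minimal sketch (not MoinMoin code), assuming the platform's file system can represent the name:

```python
import os
import tempfile

name = "这是一个附件"

with tempfile.TemporaryDirectory() as directory:
    # Pass the text name itself; Python converts it for the platform.
    with open(os.path.join(directory, name), "wb") as stream:
        stream.write(b"test")
    # Listing a directory given as text returns text names as well.
    listed = os.listdir(directory)

print(name in listed)
```

No explicit encode call appears anywhere, so there is no codec to choose wrongly.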

Many POSIX systems accept every bytestring (no matter whether they use utf-8 encoding in the file system or not), so we never heard of problems there (we know that it might not display correctly when using ls to see the filename, but that never was a requirement).

For Win32, we need more details:

-- ThomasWaldmann 2008-06-26 10:26:35

Well, let's discuss how to proceed with this. Some information first:

Looking at this, we have some options that are easier than others about how to proceed:

-- ThomasWaldmann 2008-06-26 11:28:58

OK, I see. And sorry that I am a MoinMoin newbie. My understanding is as follows; is it right?

Hmm. It seems too hard for me to write migration code right now, because I have not launched a MoinMoin wiki yet; I am just testing.

-- SuzumizakiKimitaka 2008-06-26 22:03:34

Yes, you understood it right. As the new storage code is still work in progress, we can't be sure about when it will be production-ready. We hope that 1.8-storage repo will be in that state at end of SOC 2008, but we will need some time after that for merging the code, testing and finally creating a new release from it.

As 1.7 is already released, we primarily want to do bugfixes in 1.7.x, not bigger changes (1.8 branch is for developing bigger changes).

I will have a look later at how big the minimum changes are (e.g. when just using quoteWikinameFS instead of encode(config.charset)) and whether we can do that within 1.7.x. BTW, if you are interested, maybe join us on MoinMoinChat.

-- ThomasWaldmann 2008-06-26 15:03:27
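The page directory earlier in this thread, (e58685e5ad98e7aea1e79086), suggests how an ASCII-only quoting avoids the file system encoding altogether. The sketch below, in modern Python 3, is a hypothetical imitation of that format and not the actual quoteWikinameFS implementation; quote_fs_name and unquote_fs_name are illustrative names, and the set of characters left unquoted is an assumption.

```python
import re

# Runs of anything except these ASCII characters get hex-quoted (assumption).
_UNSAFE = re.compile(r"[^A-Za-z0-9_.\-]+")
_QUOTED = re.compile(r"\(([0-9a-f]+)\)")


def quote_fs_name(name):
    """Replace each unsafe run with '(' + hex of its utf-8 bytes + ')'."""
    return _UNSAFE.sub(
        lambda match: "(" + match.group(0).encode("utf-8").hex() + ")", name
    )


def unquote_fs_name(quoted):
    """Reverse quote_fs_name."""
    return _QUOTED.sub(
        lambda match: bytes.fromhex(match.group(1)).decode("utf-8"), quoted
    )


quoted = quote_fs_name("这是一个附件.jpg")
print(quoted)
print(unquote_fs_name(quoted) == "这是一个附件.jpg")
```

The quoted name contains only ASCII, so it is stored identically whatever code page the operating system uses, at the cost of unreadable names in a directory listing.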

As discussed above, some files with Japanese characters in their filenames cannot be attached.

UnicodeDecodeError
'utf8' codec can't decode bytes in position 19-21: invalid data

Once this error appears, the page becomes inaccessible. To rescue the page, the attached file must be manually removed from the attachments folder, and the last line in the edit-log file must also be removed.

-- TakeoKatsuki 2009-07-05 00:11:49
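The manual rescue described above could be scripted roughly as follows. This is a sketch only, written for modern Python 3: rescue_page is a hypothetical helper, both paths must be supplied by the administrator, and it should be tried on a backup copy first.

```python
import os


def rescue_page(attachment_path, edit_log_path):
    """Delete the offending attachment and drop the last edit-log line."""
    os.remove(attachment_path)

    # Work on bytes so that an undecodable file name in the log cannot
    # raise another decoding error here.
    with open(edit_log_path, "rb") as stream:
        lines = stream.readlines()
    with open(edit_log_path, "wb") as stream:
        stream.writelines(lines[:-1])
```

The function assumes the problematic upload is the most recent entry in the edit-log; if other edits happened afterwards, the last line is not the one to remove.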

Plan


CategoryMoinMoinBug

MoinMoin: MoinMoinBugs/Non-ASCII attachment names on Windows (last edited 2009-11-28 08:27:26 by ReimarBauer)