Description
When text is copied from an MS Office Word document opened in Open Office Writer and pasted into the GUI editor (FCKeditor), strange characters (anchors and undefined Unicode character codes) are displayed in the editor window. Trying to Preview or Save Changes then raises an exception:
ConvertError: ExpatError: not well-formed (invalid token): line 376, column 1508 (see dump in /home/andydj/moin-1.8.5/wiki/data/expaterror.log)
This occurs on both the apache-cgi and stand-alone versions of MoinMoin release 1.8.5.
Steps to reproduce
- Load an MS Word document containing tables and "type here" fields into Open Office Writer.
- Select and copy some or all of the text.
- Edit a page in MoinMoin and select GUI Mode.
- Paste the text directly into the GUI editor window, or click the Paste Word icon and paste it into the dialogue box.
- The text appears in the editor window, but includes strange characters (boxes for the undefined Unicode characters 0004 and 0005) and also anchor symbols.
- Preview or Save Changes.
- Exception above is displayed.
Example
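The parser failure can probably be reproduced without the GUI editor. The following is a minimal sketch, assuming the control characters U+0004 and U+0005 seen in the editor window are what expat rejects. The fragment and the helper name try_parse are hypothetical, and the code is written for current Python 3 rather than the Python 2.4/2.6 used in this report:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for pasted editor content: U+0004 and U+0005
# are the control characters reported in the editor window.
fragment = '<p>Type here\x04\x05</p>'


def try_parse(text):
    """Return None if the text parses, or the ExpatError message if not."""
    try:
        xml.dom.minidom.parseString(text.encode('utf-8'))
        return None
    except ExpatError as err:
        return str(err)


error = try_parse(fragment)
print(error)
```

The same fragment without the two control characters parses cleanly, which points at those characters rather than at the surrounding markup.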
Component selection
- MoinMoin/converter/text_html_text_moin_wiki.py
Details
The traceback:
Traceback (most recent call last):
  File "/home/andydj/moin-1.8.5/MoinMoin/request/__init__.py", line 1311, in run
    handler(self.page.page_name, self)
  File "/home/andydj/moin-1.8.5/MoinMoin/action/edit.py", line 97, in execute
    savetext = convert(request, pagename, savetext)
  File "/home/andydj/moin-1.8.5/MoinMoin/converter/text_html_text_moin_wiki.py", line 1441, in convert
    tree = parse(request, text)
  File "/home/andydj/moin-1.8.5/MoinMoin/converter/text_html_text_moin_wiki.py", line 1419, in parse
    raise ConvertError('ExpatError: %s (see dump in %s)' % (msg, logname))
ConvertError: ExpatError: not well-formed (invalid token): line 376, column 1512 (see dump in /home/andydj/moin-1.8.5/wiki/data/expaterror.log)
MoinMoin Version | 1.8.5 (this wiki) |
OS and Version | CentOS 5.2 and Ubuntu 9.04 |
Python Version | 2.4.3 and 2.6.2 respectively |
Server Setup | Apache-CGI and Standalone, respectively |
Server Details | |
Language you are using the wiki in (set in the browser/UserPreferences) | en-uk |
Workaround
When I looked at the log file in vim, it showed countless Ctrl-D and Ctrl-E characters (0x04 and 0x05). I guessed that expat was choking on these, so I modified parse() in MoinMoin/converter/text_html_text_moin_wiki.py, adding a text.translate() call that deletes the control characters 0x00 to 0x1e before the text is passed to xml.dom.minidom.parseString(text). It's a bit of a kludge, but it works:
def parse(request, text):
    text = u'<?xml version="1.0"?>%s%s' % (dtd, text)
    text = text.encode(config.charset)
    try:
        # Identity translation table plus a delete list: removes the bytes
        # 0x00-0x1e (this includes tab, LF and CR) before expat sees the text.
        identity = ''.join([chr(i) for i in range(256)])
        control_chars = ''.join([chr(i) for i in range(0x1f)])
        text = text.translate(identity, control_chars)
        return xml.dom.minidom.parseString(text)
    except xml.parsers.expat.ExpatError, msg:
        # this sometimes crashes when it should not, so save the stuff to analyze it:
        logname = os.path.join(request.cfg.data_dir, "expaterror.log")
        f = file(logname, "w")
        f.write(text)
        f.write("\n" + "-"*80 + "\n" + str(msg))
        f.close()
        raise ConvertError('ExpatError: %s (see dump in %s)' % (msg, logname))
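A narrower filter might be preferable, since the workaround also deletes tab, line feed and carriage return, which XML 1.0 permits. The following is a sketch only, not MoinMoin code: the names XML_ILLEGAL_CONTROLS and strip_illegal_controls and the sample fragment are hypothetical, and it targets current Python 3 rather than the Python 2 used above:

```python
import re
import xml.dom.minidom

# XML 1.0 forbids the C0 control characters except tab (0x09),
# line feed (0x0A) and carriage return (0x0D).
XML_ILLEGAL_CONTROLS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_illegal_controls(text):
    """Remove the control characters that an XML 1.0 parser rejects."""
    return XML_ILLEGAL_CONTROLS.sub('', text)


# Hypothetical pasted fragment containing Ctrl-D, Ctrl-E and a newline.
pasted = '<p>Type\x04 here\x05\nnext line</p>'
cleaned = strip_illegal_controls(pasted)

# The cleaned text parses, and the newline inside the paragraph survives.
document = xml.dom.minidom.parseString(cleaned.encode('utf-8'))
print(repr(document.documentElement.firstChild.data))
```

Filtering the decoded text rather than the encoded bytes keeps the pattern independent of config.charset, at the cost of having to run before the encode step in parse().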
Thanks to Thomas Waldmann for some pointers as to where I might find the issue.
Discussion
Plan
- Priority:
- Assigned to:
- Status: