Comparison of existing Free Software for processing MS-Office documents
Project |
License |
Language |
Input formats |
Last commit? |
Code quality? |
Projects using lib |
Notes |
Unknown (presumably GPL or public domain) |
C |
Doc 95/97 (possibly 2000/2003) |
5 months ago |
Dubious (online demo removed due to "security concerns") |
wvWare/Abiword |
C++ bindings available |
|
GNU GPL (any) |
C |
.doc, .ppt, .xls (but with no support for formatting) |
~2006 |
Simple, although GCC -Wall does complain about various signed/unsigned type casting issues (sample output here) |
catdoc/catppt/xls2csv |
Does pretty-print some formatting |
|
LGPLv3 |
C++/Java |
Office Open XML (pptx, docx, xlsx), Office 95-2003 (ppt, doc, xls) |
Today (well maintained) |
Unknown |
Libreoffice |
Best format support, but the code is difficult to use outside of Libreoffice because of the amount of Libreoffice code required to get a working solution. It does include command-line utilities, though. Some Python libraries that wrap Libreoffice functionality also exist, such as py3o.renderserver |
|
Various (mostly GNU GPL) |
C/C++ |
Office 95-2003 (doc) |
Today (well maintained) |
Unknown (good GUI/lib separation) |
Abiword |
Good Word document format support. Utilises libwv. Abiword also has --to= and --to-name= options for conversion, and includes plugin functionality for creating new output filters |
|
GNU GPL |
C |
Office 95-2003 (doc) |
2005 |
Unknown (again, good GUI/lib separation though) |
Antiword |
Reasonably simple code with good cross-platform support. No deps apart from the C stdlib |
|
MIT |
Python |
Office Open XML (docx) |
January 2011 |
Good (only 200 lines of source) |
- |
Demonstrates how easy it is to extract information from Office Open XML documents; it is probably even feasible to create an input converter for them |
|
BSD |
Python |
Excel 95-2003 (xls) |
Last PyPI upload January 2011 |
Good |
- |
Python library to read and write Excel files |
|
GPL v2 |
C |
Office 95-2003 (xls) |
Today (well maintained) |
Good (good UI/parser code separation) |
GOffice/Gnumeric |
- |
|
BSD |
Python |
Office 95-2003 (xls) |
2009 |
Good |
- |
Another pure-Python library for reading and writing Excel files |
|
MIT |
Python |
Office Open XML Documents (currently xlsx only) |
March 2011 |
Good |
- |
Pure-Python library for reading/writing OOXML documents, but currently only supports Excel 2007+ files (and appears unmaintained) |
|
GPL v2 |
Python |
Office Open XML Documents (xlsx, pptx, docx) |
2010 |
Okay (not terribly Pythonic API, some methods like .allText do not work as advertised) |
- |
- |
Testing them out
Example Word document
Example Excel spreadsheet
Example PowerPoint presentation
Program |
Output |
catppt |
No output produced |
Libreoffice |
openxmllib
This is a test of openxmllib's .indexableText() method, note that if more styling is required the user can override the .textFromTree() method and use it to output text in a format of their choice.
Analysis
For parsing Office 2007+ documents (Office Open XML), the easiest solution is likely a roll-your-own input converter similar to python-docx, following ECMA-376. Libraries like openxmllib can also assist in this task.
it looks like (as the name says) python-docx only deals with .docx. but yes, the .docx extracting code looks rather easy. it would be interesting to see what can be extracted with a converter as simple as the one we use for open document format (odf). also i have found http://code.google.com/p/openxmllib/ it would be interesting whether that could generally solve our "openxml document indexable text extraction" problem. Can you try it? -- ThomasWaldmann 2011-12-19 20:31:41
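To give a feel for how small such a roll-your-own .docx converter could be, here is a minimal sketch using only the Python standard library. It is not the python-docx or openxmllib code; the function name docx_paragraphs is made up for this example, and it assumes the main document part is stored at word/document.xml, which is the usual location but not something the format guarantees.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace as defined by ECMA-376
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the plain text of each paragraph in a .docx file.

    `path` may be a filename or a file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main part lives at word/document.xml
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # w:t elements carry the text of each run inside the paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Joining the returned list with newlines gives indexable text. Formatting, headers, footers and footnotes are ignored in this sketch.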
- Word 95-2003 documents are best handled by Abiword, Antiword or catdoc. The easiest solution would be to check for the presence of those programs on the host machine and use them to convert the uploaded item into DocBook or plaintext format for presentation.
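The "check for the presence of those programs" idea could look roughly like the sketch below. The helper names find_converter and doc_to_text are hypothetical, and the argument lists assume the basic usage of antiword and catdoc (input file as the last argument, plain text written to stdout); the local man pages should be checked before relying on this. Abiword is left out because it writes its result to a file rather than to stdout.

```python
import shutil
import subprocess

# Candidate converters in order of preference. The argument lists are
# assumptions based on each tool's basic usage (file name last, text on stdout).
CONVERTERS = [
    ("antiword", ["antiword"]),
    ("catdoc", ["catdoc"]),
]


def find_converter(which=shutil.which):
    """Return (name, argv_prefix) for the first installed converter, or None."""
    for name, argv in CONVERTERS:
        if which(name) is not None:
            return name, argv
    return None


def doc_to_text(path, which=shutil.which):
    """Convert a Word 95-2003 file to plain text with an external program."""
    found = find_converter(which)
    if found is None:
        raise RuntimeError("no .doc converter installed")
    _, argv = found
    result = subprocess.run(argv + [path], capture_output=True, check=True)
    # The output encoding depends on the tool and locale; UTF-8 is assumed here
    return result.stdout.decode("utf-8", errors="replace")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers would leave it at its default.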
Fortunately xlrd exists for working with Excel files from within Python, and being pure-Python it fits the bill perfectly.
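As a rough illustration, the sketch below flattens a workbook into indexable text. The function name workbook_text is invented for this example. It relies only on the sheets(), nrows and row_values() members that xlrd workbook and sheet objects provide, so it takes the workbook as an argument and does not import xlrd itself.

```python
def workbook_text(book):
    """Flatten every sheet of a workbook into tab/newline separated text.

    `book` is expected to behave like the object returned by
    xlrd.open_workbook(): it must offer sheets(), and each sheet must
    offer nrows and row_values(rowx).
    """
    lines = []
    for sheet in book.sheets():
        for rowx in range(sheet.nrows):
            cells = sheet.row_values(rowx)
            # Cell values may be strings or numbers, so convert each one
            lines.append("\t".join(str(cell) for cell in cells))
    return "\n".join(lines)
```

With xlrd installed, usage would be along the lines of `workbook_text(xlrd.open_workbook("example.xls"))`, where example.xls is a placeholder file name.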
This leaves only Office 95-2003 PowerPoint files, which catppt failed to convert in my test. It would appear that the only options are to roll our own converter (which is difficult with a proprietary binary format) or to use the Libreoffice/Openoffice SDK or its Python wrappers (see table above).
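One way the Libreoffice route might work without touching the SDK is to shell out to its command-line converter and turn the .ppt into an OpenDocument presentation, which an ODF converter could then process. The sketch below only builds and runs the command. The function names are hypothetical, and the --headless, --convert-to and --outdir options are those of recent Libreoffice releases; older Openoffice versions may spell them differently, so treat this as an assumption to verify locally.

```python
import subprocess


def soffice_convert_command(path, outdir, target="odp", binary="soffice"):
    """Build the argument list for a headless Libreoffice conversion.

    Assumption: `binary` is on PATH and accepts the options used below.
    """
    return [binary, "--headless", "--convert-to", target, "--outdir", outdir, path]


def convert_presentation(path, outdir):
    """Convert `path` to .odp inside `outdir`; raises if soffice fails."""
    subprocess.run(soffice_convert_command(path, outdir), check=True)
```

Starting Libreoffice for every upload is slow, so a real integration would probably want to cache the converted output.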