Comparison of existing Free Software for processing MS-Office documents

Project

License

Language

Input formats

Last commit?

Code quality?

Projects using lib

Notes

libwv (wvWare)

Unknown (presumably GPL or public domain)

C

Doc 95/97 (possibly 2000/2003)

5 months ago

Dubious (online demo removed due to "security concerns")

wvWare/Abiword

C++ bindings available

catdoc/catppt/xls2csv

GNU GPL (any)

C

.doc, .ppt, .xls (but with no support for formatting)

~2006

Simple; GCC -Wall reports various signed/unsigned type-casting warnings (sample output here)

catdoc/catppt/xls2csv

Does pretty-print some formatting

Libreoffice core

LGPLv3

C++/Java

Office Open XML (pptx, docx, xlsx), Office 95-2003 (ppt, doc, xls)

Today (well maintained)

Unknown

Libreoffice

Best format support, but difficult to use outside of Libreoffice because of the amount of Libreoffice code required to get a working solution. It does include command-line utilities, though. Some Python libraries that wrap Libreoffice functionality also exist, such as py3o.renderserver.

Abiword core

Various (mostly GNU GPL)

C/C++

Office 95-2003 (doc)

Today (well maintained)

Unknown (good GUI/lib separation)

Abiword

Good Word document format support. Uses libwv. Abiword also has --to= and --to-name= options for conversion, and its plugin mechanism allows new output filters to be created.

Antiword

GNU GPL

C

Office 95-2003 (doc)

2005

Unknown (again, good GUI/lib separation though)

Antiword

Reasonably simple code with good cross-platform support. No dependencies apart from the C standard library.

python-docx

MIT

Python

Office Open XML (docx)

January 2011

Good (only 200 lines of source)

-

Demonstrates how easy it is to extract information from Office Open XML documents; it is probably even feasible to create an input converter for them.

xlrd

BSD

Python

Excel 95-2003 (xls)

Last PyPI upload January 2011

Good

-

Python library for reading Excel files

Gnumeric

GPL v2

C

Office 95-2003 (xls)

Today (well maintained)

Good (good UI/parser code separation)

GOffice/Gnumeric

-

pyExcelerator

BSD

Python

Office 95-2003 (xls)

2009

Good

-

Another pure-Python library for reading and writing Excel files

ooxml

MIT

Python

Office Open XML Documents (currently xlsx only)

March 2011

Good

-

Pure-Python library for reading/writing OOXML documents, but currently only supports Excel 2007+ files (and appears unmaintained)

openxmllib

GPL v2

Python

Office Open XML Documents (xlsx, pptx, docx)

2010

Okay (the API is not very Pythonic, and some methods, such as .allText, do not work as advertised)

-

-
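The python-docx note above says that extracting information from Office Open XML documents is easy. The following is a minimal sketch of why, using only the Python standard library: a .docx file is a zip archive, and the body text lives in the word/document.xml part. The function names docx_paragraphs and make_sample_docx are made up for this illustration and are not taken from python-docx; the sample archive contains only the one part needed here and is not a complete docx file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    `source` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across runs; each run keeps it in w:t.
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        paragraphs.append(text)
    return paragraphs


def make_sample_docx():
    """Build an in-memory archive holding only word/document.xml (demo data)."""
    body = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>" % W_NS
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in docx_paragraphs(make_sample_docx()):
        print(line)
```

Formatting, tables, headers and footnotes are ignored in this sketch; a real input converter would need to walk more of the tree.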

Testing them out

Example Word document

Input document

Program

Output

Abiword

txt, html

Antiword

txt, docbook

catdoc

txt

Libreoffice

txt, html
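Conversions like the ones listed above can be scripted. The sketch below builds the command line for each tool and optionally runs it. It assumes the tools are installed and on the PATH, and that the Libreoffice binary is called soffice (on some systems it is libreoffice). The helper names conversion_command and convert are hypothetical. Note that Abiword and Libreoffice write an output file, while antiword and catdoc print to stdout.

```python
import subprocess


def conversion_command(tool, source, target_format="txt"):
    """Return the argv list for converting `source` with the named tool."""
    if tool == "abiword":
        # --to= selects the output format; the result is written to a file.
        return ["abiword", "--to=" + target_format, source]
    if tool == "libreoffice":
        # Assumption: the binary is named "soffice" on this system.
        return ["soffice", "--headless", "--convert-to", target_format, source]
    if tool == "antiword":
        if target_format == "txt":
            return ["antiword", source]
        if target_format == "docbook":
            # -x db asks antiword for DocBook XML instead of plain text.
            return ["antiword", "-x", "db", source]
    if tool == "catdoc" and target_format == "txt":
        return ["catdoc", source]
    raise ValueError("unsupported combination: %s -> %s" % (tool, target_format))


def convert(tool, source, target_format="txt"):
    """Run the converter; return what it printed to stdout (may be empty)."""
    result = subprocess.run(
        conversion_command(tool, source, target_format),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the commands here so the sketch runs without the tools.
    for tool, fmt in [("abiword", "html"), ("antiword", "docbook"),
                      ("catdoc", "txt"), ("libreoffice", "txt")]:
        print(" ".join(conversion_command(tool, "example.doc", fmt)))
```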

Example Excel spreadsheet

Input document

Program

Output

Gnumeric

html

Libreoffice

csv, html

xls2csv

csv

Example PowerPoint presentation

Input document

Program

Output

catppt

No output produced

Libreoffice

html

openxmllib

This is a test of openxmllib's .indexableText() method. If more styling is required, the user can override the .textFromTree() method and use it to output text in a format of their choice.

Input

Output

spreadsheet

indexable text

presentation

indexable text

document

indexable text
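The override mechanism described for openxmllib can be illustrated with a small sketch. This is not openxmllib's code or API: the class names and the snake_case methods indexable_text and text_from_tree are hypothetical stand-ins that only mirror the roles of .indexableText() and .textFromTree(). The input is a list of XML strings, standing in for the parts of an Office Open XML archive.

```python
import xml.etree.ElementTree as ET


class TextExtractor:
    """Hypothetical base class; not openxmllib's actual implementation."""

    def indexable_text(self, xml_parts):
        """Render every XML part and join the results with a space."""
        trees = (ET.fromstring(part) for part in xml_parts)
        return " ".join(self.text_from_tree(tree) for tree in trees)

    def text_from_tree(self, tree):
        """Default rendering: all non-empty text nodes, space separated."""
        return " ".join(t.strip() for t in tree.itertext() if t.strip())


class LineExtractor(TextExtractor):
    """Override that puts each text node on its own line."""

    def text_from_tree(self, tree):
        return "\n".join(t.strip() for t in tree.itertext() if t.strip())


if __name__ == "__main__":
    slide = "<slide><t>Title</t><t>Body text</t></slide>"
    print(TextExtractor().indexable_text([slide]))
    print(LineExtractor().indexable_text([slide]))
```

Only the per-tree rendering changes in the subclass; the traversal of the document parts stays in the base class.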

Analysis

MoinMoin: EasyToDo/TextExtractors/Comparison (last edited 2011-12-22 07:28:07 by ReimarBauer)