Overview

Title: Research existing MS Office text extractors
Duration: 180 [time to delivery, in hours] (ttd = workhours * 12)
Difficulty: Medium
Types: Research
Tags: python,search
Mentors: thomaswaldmann,rb_proj,waldi,esyr,pkumar_7
Count: -1

Description

Abstract

Research existing solutions for extracting text from proprietary Microsoft file formats.

Details

For moin2, we already have a number of converters (including one for the Open Document Format used by OpenOffice / LibreOffice), but nothing for Microsoft Office formats. We now need a survey of existing code that can extract text from these proprietary file formats and whose license is compatible with GPL2+.

We need to know:

- Is the license compatible with GPL2+?
  - for Python libraries, e.g. GPL, BSD, MIT, ... (not: Apache License 2)
  - in general: a free software license, not a proprietary one
- Which programming language is used?
  - library code in Python is strongly preferred (we can just call it)
  - a command-line tool that we can call as a subprocess might also work (which platforms does it support?)
- Windows-only solutions are not wanted.
- Compatibility with different file formats (mainly Word, but also Excel and PowerPoint)
- Compatibility with different versions (e.g. .DOC and .DOCX)
- Reliability (is the code well maintained and recently updated?)
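
To make the subprocess option from the list above more concrete, here is a minimal sketch of how an external command-line extractor might be wrapped. It assumes Python 3 and a tool that takes the file path as its last argument and writes plain text to stdout; the function name `extract_text_via_cli` and its parameters are made up for illustration and do not refer to any existing moin2 code or to a specific tool.

```python
import subprocess


def extract_text_via_cli(command, path, timeout=30):
    """Run an external extractor and return what it wrote to stdout.

    `command` is the tool's argument list (hypothetical, e.g.
    ["some-extractor"]); `path` is appended as the last argument.
    Returns None if the tool is missing, exits with an error,
    or does not finish within `timeout` seconds.
    """
    try:
        result = subprocess.run(
            command + [path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,  # non-zero exit status raises CalledProcessError
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # OSError covers "executable not found" on all platforms
        return None
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced
    return result.stdout.decode("utf-8", errors="replace")
```

A wrapper like this keeps indexing robust when the tool is not installed, which is one reason the survey should record the supported platforms for every command-line candidate.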

Deliverable: wiki page

Benefits

Many Moin users would like to have a platform-independent, pure-Python way to extract text for indexing.

Researching the existing code base is a first step in this direction.
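
As a rough illustration of what "pure Python" can mean for the newer formats: a .docx file is a ZIP archive whose main text lives in `word/document.xml`, so the standard library alone can pull out paragraph text. The sketch below assumes Python 3; the function name `docx_to_text` is hypothetical, and it ignores headers, footnotes and other parts of the archive. It does not help with the older binary .DOC format, which is where the survey matters most.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; the text sits in <w:t> elements
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Something this small is enough for indexing experiments, but a surveyed library would still be expected to cover more of the format than this sketch does.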

Skill Requirements

You'll need to do a lot of searching on the Web. Discuss your findings with the moin devs on IRC.

Discussion

MoinMoin: EasyToDo/TextExtractors (last edited 2011-12-16 00:33:23 by ThomasWaldmann)
