Overview
- Title
- Research existing MS Office text extractors
- Duration
- 180 [time to delivery, in hours] (ttd = workhours * 12)
- Difficulty
- Medium
- Types
- Research
- Tags
- python,search
- Mentors
- thomaswaldmann,rb_proj,waldi,esyr,pkumar_7
- Count
- -1
Description
Abstract
Research existing solutions for extracting text from proprietary Microsoft file formats.
Details
For moin2, we already have a number of converters (including Open Document Format [OpenOffice / LibreOffice]), but nothing for Microsoft Office formats. We now need a survey of existing code that can extract text from these proprietary file formats and whose license is compatible with GPL2+.
We need to know:
- is the license compatible with GPL2+?
- for python libraries e.g.: GPL, BSD, MIT, ... (not: Apache License 2)
- in general: a free software license, not any proprietary license
- the programming language used
- library code in Python is strongly preferred (we can just call it)
- a command-line tool that we can call as a subprocess might also work (which platforms does it support?)
- windows-only solutions are not wanted
- compatibility with different file formats (mainly Word, but also Excel and PowerPoint)
- compatibility with different format versions (e.g. .DOC and .DOCX)
- reliability (is it well-maintained code, is it recently updated?)
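To illustrate the two integration styles named in the list above (Python code we call directly, and a command-line tool run as a subprocess), here is a minimal sketch using only the Python standard library. It is not a candidate for the survey and not moin2 code. It relies on the fact that a .DOCX file is a ZIP archive whose main text lives in `word/document.xml`; the older binary .DOC format cannot be read this way. The function names `extract_docx_text` and `extract_with_tool` are hypothetical, and the second function assumes an external extractor that takes the file name as its last argument and prints the text to stdout.

```python
import subprocess
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml of a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: it ignores headers, footers, footnotes and
    comments, which are stored in other parts of the archive.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (w:t elements) that belong to this paragraph
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def extract_with_tool(command, path, timeout=30):
    """Run an external extractor and return what it prints to stdout.

    Assumption: `command` is a list such as ["some-tool", "--flag"] for a
    tool that accepts the input file as its final argument.
    """
    result = subprocess.run(
        list(command) + [path],
        capture_output=True,  # collect stdout and stderr instead of printing
        timeout=timeout,      # do not let a hanging tool block indexing
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A surveyed library or tool would take the place of one of these functions. The subprocess variant is the one where platform support matters most, since the external program has to be installed on every system that runs moin2.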
Deliverable: wiki page
Benefits
Many Moin users would like to have a platform-independent, pure Python way to extract text for indexing.
Researching the existing code base is a first step in this direction.
Skill Requirements
You'll need to do a lot of searching on the Web and to discuss your findings with the moin devs online on IRC.
Links
This task refers to moin2 (http://moinmo.in/MoinMoin2.0)!
http://hg.moinmo.in/moin/2.0 or http://bitbucket.org/thomaswaldmann/moin-2.0 - repository of moin2
http://moinmo.in/MoinMoinChat - please join us on IRC #moin-dev