Overview

Title
Research existing MS Office text extractors
Duration
180 [time to delivery, in hours] (ttd = workhours * 12)
Difficulty
Medium
Types
Research
Tags
python,search
Mentors
thomaswaldmann,rb_proj,waldi,esyr,pkumar_7
Count
-1

Description

Abstract

Research existing solutions for extracting text from proprietary Microsoft file formats.

Details

For moin2, we already have quite a few converters (including one for the Open Document Format used by OpenOffice / LibreOffice), but nothing for Microsoft Office formats. We now need a survey of existing code that can extract text from these proprietary file formats and whose license is compatible with GPL2+.

We need to know:

Deliverable: wiki page

Benefits

Many Moin users would like to have a platform-independent, pure-Python way to extract text for indexing.

Researching the existing code base is a first step in this direction.
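
To give a feel for what "pure Python text extraction" can look like, here is a minimal sketch using only the standard library. It assumes the newer .docx format, which is a ZIP archive whose main text lives in word/document.xml; it does not cover the legacy binary formats (.doc, .xls, .ppt), which are the harder part of the survey. The names extract_docx_text and make_sample_docx are hypothetical and not part of moin2.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in .docx files
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    `source` is a path or a binary file-like object.
    Hypothetical helper for illustration, not part of moin2.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (<w:t>) found inside this paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs):
    """Build a minimal in-memory archive holding only word/document.xml.

    Demo only: a real .docx contains more parts, and the paragraph
    strings are not XML-escaped here.
    """
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % p for p in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w='
        '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body>%s</w:body></w:document>" % body
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    sample = make_sample_docx(["Hello", "MoinMoin"])
    print(extract_docx_text(sample))
```

A sketch like this ignores headers, footers, footnotes and embedded objects, so it is only a baseline against which the surveyed extractors could be compared.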

Skill Requirements

You'll need to do a lot of searching on the Web and discuss your findings with the moin developers on IRC.

This task refers to moin2 (http://moinmo.in/MoinMoin2.0)!

Discussion

MoinMoin: EasyToDo/TextExtractors (last edited 2011-12-16 00:33:23 by ThomasWaldmann)