Overview

Summary: Search for Python code (libraries) for extracting text from file attachments and/or implement a pure Python filter
Count: 1
Label: Research

Short Description

Currently we use external programs for several MIME types to create the search index for Xapian search (see MoinMoin.filter). Each of these programs is an additional dependency, and they might not be easily available on every platform moin runs on.

Thus we would like to replace those external binary programs by python code we can bundle with moin, to get rid of those dependencies and to make installation of moin easier.

This applies, for example, to application/msword, application/pdf and text/rtf. Have a look at the MoinMoin.filter package: we want to either replace the code there that uses external programs or write new pure Python filters.

Document every piece of Python code that could do the job.

pyPdf sounds like it could be used to replace pdftotext. If you like, you could try to write a new filter or use pyPdf instead of pdftotext to implement the PDF filter.

You have to deliver a list (at least 10 entries, in wiki markup) of Python libraries/modules that qualify for this. For every library you find, briefly describe how it could be used.

We estimate that this task takes 10h work time and you must complete this task within 7 days.

Detailed Description

Discussion

Answer from tuxella :

excel

After some research on the Internet, it appears that two Python libraries can open .xls files:

excelerator
xlrd

According to its documentation, xlrd is compatible with more versions of the Excel file format, and it is less crash-prone than excelerator when it encounters quirks in files. Moreover, xlrd is pure Python. I have written a filter based on this library:

application_vnd_ms_excel.py

The filter is based on the example shipped with xlrd, so it should work quite well.
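For readers without access to the attached filter, here is a minimal sketch of how text could be pulled out of a workbook with xlrd. It is not the attached filter: the function names `rows_to_text` and `extract_xls_text` are hypothetical, and the sketch assumes xlrd is installed. It only uses `xlrd.open_workbook`, `Book.sheets()`, `Sheet.nrows` and `Sheet.row_values()`.

```python
def rows_to_text(rows):
    """Turn an iterable of rows (lists of cell values) into plain text.

    Empty cells are skipped; each non-empty row becomes one line.
    """
    lines = []
    for row in rows:
        cells = [str(value).strip() for value in row if value not in ("", None)]
        if cells:
            lines.append(" ".join(cells))
    return "\n".join(lines)


def extract_xls_text(filename):
    """Return the text of every sheet of an .xls file (hypothetical helper)."""
    import xlrd  # third-party; imported lazily so rows_to_text works without it

    book = xlrd.open_workbook(filename)
    parts = []
    for sheet in book.sheets():
        rows = (sheet.row_values(rowx) for rowx in range(sheet.nrows))
        parts.append(rows_to_text(rows))
    return "\n".join(part for part in parts if part)
```

Keeping the text assembly in `rows_to_text` separate from the xlrd calls makes that part easy to check without a real spreadsheet.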

word

Here is another issue: the Word file format is definitely more complex than the Excel one. Thus there isn't any pure Python library that can read and extract text from a .doc file. However, libgsf, which is used by AbiWord and KWord, is able to parse any OLE2 file format. It seems to include Python bindings (a --with-python option in the configure script and a python directory), but nothing functional. Moreover, according to what I found while googling, these bindings were written a long time ago for some tests and have not been maintained since, which could explain why I didn't manage to get anything working.

pdf

As advised in the task description, I began with pyPdf, and anyway there isn't any other working library. I have written a filter for this file format too. But when I tried it on various files, it turned out that there are still some problems with the extractText method (as its documentation says, by the way). For instance, I tried it with a file in which the text is embedded in a frame, and in that case the text isn't extracted. On the plus side, pyPdf is pure Python. So, here is the filter (definitely simpler than the xls one):

application_pdf.py
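As an illustration only (this is not the attached filter), page-by-page extraction could look roughly like the sketch below. It assumes the present-day `pypdf` package, the successor of pyPdf, with its `PdfReader` class and `extract_text()` method; the original pyPdf exposed the same idea as `PdfFileReader(...).getPage(n).extractText()`. The function names `normalize_whitespace` and `extract_pdf_text` are hypothetical.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())


def extract_pdf_text(filename):
    """Return the extractable text of a PDF, one line per page (sketch)."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(filename)
    # extract_text() may yield an empty result for pages without a text layer
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(normalize_whitespace(page) for page in pages if page.strip())
```

Whitespace is normalized because extracted PDF text often contains irregular spacing, which does not matter for a search index.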

Some questions:

There is also python-pdftools; did you look at it?
Did you compare pyPdf's extraction capabilities to pdftotext (from xpdf-utils)?

rtf

There isn't any Python library that can read RTF files. However, the RTF file format isn't very complicated, so with just a few regular expressions (slightly more complicated than the one used in the OOo filter) we could remove the formatting from RTF documents and extract the text. This would require getting the list of formatting tokens from the Microsoft web page and then removing the keywords that appear inside the braces and may be followed by a parameter.
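The regexp idea can be sketched as a small tokenizer. This is only a minimal sketch under stated assumptions, not an existing moin filter: the name `strip_rtf`, the list of skipped destination groups, and the cp1252 decoding of `\'xx` escapes are all assumptions made for illustration. It targets Python 3 and does not handle `\uN` Unicode escapes.

```python
import re

# One token per match: escaped literal, hex escape, control word,
# other control symbol, group brace, or a run of plain text.
_TOKEN = re.compile(
    r"\\([\\{}])"                  # 1: escaped literal \\ \{ \}
    r"|\\'([0-9a-fA-F]{2})"        # 2: hex escape such as \'e9
    r"|\\([a-zA-Z]+)(-?\d+)? ?"    # 3, 4: control word, optional parameter
    r"|\\(.)"                      # 5: any other control symbol (\*, \~, ...)
    r"|([{}])"                     # 6: group delimiters
    r"|([^\\{}]+)",                # 7: plain text
    re.DOTALL,
)

# Assumed subset of destinations whose content is not document text.
_SKIP = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}
# Control words that stand for whitespace in the output.
_BREAKS = {"par": "\n", "line": "\n", "tab": "\t"}


def strip_rtf(rtf):
    """Return the plain text of an RTF string (rough sketch)."""
    out = []
    stack = []        # "skipping" flag saved at each opening brace
    skipping = False  # True while inside a destination we ignore
    for match in _TOKEN.finditer(rtf):
        literal, hexcode, word, _param, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif symbol == "*" or word in _SKIP:
            skipping = True  # ignore everything up to the end of this group
        elif skipping:
            continue
        elif literal:
            out.append(literal)
        elif hexcode:
            # Assumption: 8-bit escapes are Windows-1252 encoded
            out.append(bytes([int(hexcode, 16)]).decode("cp1252", "replace"))
        elif word in _BREAKS:
            out.append(_BREAKS[word])
        elif text:
            # Raw line breaks in the RTF source carry no meaning
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)
```

A stack is used instead of a single regexp substitution because RTF groups can nest, for example inside the font table, and a plain regexp cannot match balanced braces.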

others

hachoir-metadata

How to use it :

create a parser from the name of the file and its unicode translation (that's where Hachoir guesses the file format)
extract metadata according to a certain level of depth (metadata are sorted in a logical order; for example, for an MP3, the artist name is more important than the bitrate)

pros :

very simple
already supports a lot of formats
easy to add support for new formats
pure python

cons :

quite rare dependency: hachoir-metadata (which in turn depends on hachoir-parser)

Supported file formats : Total: 33 file formats.

Audio

aiff: Audio Interchange File Format (AIFF)
mpeg_audio: MPEG audio version 1, 2, 2.5
real_audio: Real audio (.ra)
sun_next_snd: Sun/NeXT audio

Container

matroska: Matroska multimedia container
ogg: Ogg multimedia container
real_media: RealMedia (rm) Container File
riff: Microsoft RIFF container

Image

bmp: Microsoft bitmap (BMP) picture
gif: GIF picture
ico: Microsoft Windows icon or cursor
jpeg: JPEG picture
pcx: PC Paintbrush (PCX) picture
png: Portable Network Graphics (PNG) picture
psd: Photoshop (PSD) picture
targa: Truevision Targa Graphic (TGA)
tiff: TIFF picture
wmf: Microsoft Windows Metafile (WMF)
xcf: Gimp (XCF) picture

Misc

ole2: Microsoft Office document
pcf: X11 Portable Compiled Font (pcf)
torrent: Torrent metainfo file
ttf: TrueType font

Program

exe: Microsoft Windows Portable Executable

Video

asf: Advanced Streaming Format (ASF), used for WMV (video) and WMA (audio)
flv: Macromedia Flash video
mov: Apple QuickTime movie

kaa-metadata

Syntax : info = kaa.metadata.parse(fileName)

pros :

Very simple too

cons :

only able to parse multimedia files (music, movies ...) because it is part of the Freevo project
consequently, it is quite hard to add a new file format

Supported File formats: Total : 27

Audio: ac3, dts, flac, mp3 (with id3 tag support), ogg, pcm, m4a, wma.
Video: avi, mkv, mpg, ogm, asf, wmv, flv, mov, dvd iso, vcd iso.
Media: vcd, cd, dvd.
Image: jpeg (with exif and iptc support), bmp, gif, png, tiff.

extractor

These are the Python bindings for the GNU libextractor project. They are pretty straightforward to use too.

Pros :

Gnu Project : sure
Adding a new file format is easy

Cons :

Not pure python

Supported file formats : Total : 28

HTML, PDF, PS, OLE2 (DOC, XLS, PPT), Open Office (sxw), Star Office (sdw), DVI, MAN, MP3 (ID3v1 and ID3v2), NSF (NES Sound Format), SID, OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, REAL, RIFF (AVI), MPEG, QT and ASF.

Text files

chardet (http://chardet.feedparser.org/) is a library that tries to guess the encoding of a file. It returns the guessed encoding and the confidence of its prediction. It is a pure Python port of the algorithm used in the Mozilla browser.

Here is an example of how you can use it to guess the encoding of a file:

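    from chardet.universaldetector import UniversalDetector

    def guessEncoding(filename):
        # Feed the file to the detector line by line and stop as soon as
        # the detector is confident enough about the encoding.
        detector = UniversalDetector()
        with open(filename, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # The result holds the guessed encoding and its confidence
        return detector.result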

and it returns something that looks like