Overview
- Summary: Search for python code (libraries) for extracting text from file attachments and/or implement a pure python filter
- Count: 1
- Label: Research
Short Description
Currently we use external programs for several mimetypes to create the search index for xapian search (see MoinMoin.filter). Each of these programs is an additional dependency, and they might not be easily available on every platform moin runs on.
Thus we would like to replace those external binary programs by python code we can bundle with moin, to get rid of those dependencies and to make installation of moin easier.
This applies, for example, to application/msword, application/pdf and text/rtf. Just have a look at the MoinMoin.filter package - we want to either replace the code there that uses external programs or write new pure python filters.
Document every piece of python code that could do the job.
pyPdf sounds like it could be used to replace pdftotext. If you like you could try to write a new filter or use pypdf instead of pdftotext for implementing the pdf filter.
You have to deliver a list (at least 10 entries, in wiki markup) of Python libraries/modules that qualify to be used for this. For every library or module you find, briefly describe how it could be used.
We estimate that this task takes 10h work time and you must complete this task within 7 days.
Detailed Description
Discussion
Answer from tuxella :
excel
After some research on the internet, it appears that two python libraries can open .xls files:
- excelerator
- xlrd
According to the documentation, xlrd is compatible with more versions of the excel file format, and it is less crash-prone than excelerator when it finds quirks in files. I have written a filter based on this library:
application_vnd_ms_excel.py. Moreover, xlrd is pure python. The filter is based on the example shipped with xlrd, so it might work quite well.
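A pure python filter built on xlrd can stay short. The sketch below is not the attached application_vnd_ms_excel.py; it only illustrates the idea. It assumes xlrd is installed and uses its `open_workbook()`, `sheets()`, `nrows` and `row_values()` calls. The function names `rows_to_text` and `extract_xls_text` are made up for this example.

```python
def rows_to_text(rows):
    """Join the non-empty cells of each row with spaces, one row per line."""
    lines = []
    for row in rows:
        cells = [str(cell).strip() for cell in row if str(cell).strip()]
        if cells:
            lines.append(" ".join(cells))
    return "\n".join(lines)


def extract_xls_text(filename):
    """Return the cell contents of every sheet of an .xls file as plain text."""
    import xlrd  # third-party, imported lazily (assumed to be installed)

    book = xlrd.open_workbook(filename)
    parts = []
    for sheet in book.sheets():
        rows = (sheet.row_values(i) for i in range(sheet.nrows))
        parts.append(rows_to_text(rows))
    return "\n".join(part for part in parts if part)
```

Numeric cells come back from xlrd as floats, so a cell holding 3 is indexed as "3.0" in this sketch; a real filter may want to format numbers differently.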
word
Here is another issue... The Word file format is definitely more complex than the excel one, and there isn't any pure python library that can read and extract text from a .doc file. However, the gsf library (libgsf), used by AbiWord and KWord, is able to parse any OLE2 file format. It seems to come with python bindings (--with-python in the configure script and a python directory), but nothing there is functional. Moreover, according to what I found while googling, these bindings were written a long time ago for some tests and haven't been maintained since, which could explain why I didn't manage to get anything working...
pdf
As advised in the issue description, I started with pypdf; in any case, there isn't any other working library. I have written a filter for this file format too. But when I tried it on various files, it appeared that there are still some problems with the extractText method (as the documentation says, by the way). For instance, I tried a file in which the text is embedded in a frame, and in that case the text isn't extracted. On the plus side, pypdf is pure python. So, here is the filter (definitely simpler than the xls one):
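To give an idea of how small a pure python PDF filter can be, here is a minimal sketch. It is not the filter referred to above. It assumes the current `pypdf` package (the maintained successor of pyPdf), where `PdfReader` and `extract_text()` take the place of the older `PdfFileReader` and `extractText()`. The function names `collapse_whitespace` and `extract_pdf_text` are made up for this example.

```python
import re


def collapse_whitespace(text):
    """Collapse runs of whitespace so the indexer sees plain words."""
    return re.sub(r"\s+", " ", text).strip()


def extract_pdf_text(filename):
    """Return the text of all pages of a PDF file as one string."""
    from pypdf import PdfReader  # third-party, imported lazily (assumed to be installed)

    reader = PdfReader(filename)
    # extract_text() may yield nothing for a page, so fall back to "".
    pages = [page.extract_text() or "" for page in reader.pages]
    return collapse_whitespace(" ".join(pages))
```

The limitation described above still applies: whatever the library fails to extract (for example text inside some frames) will simply be missing from the index.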
Some questions:
- there is also python-pdftools, did you look at it?
- did you compare pypdf extraction capabilities to pdftotext (from xpdf-utils)?
rtf
There isn't any python library that can read rtf files. However, the rtf file format isn't very complicated, so with just a few regexps (slightly more complicated than the one used in the OOo filter) we could remove the formatting from rtf documents and extract the text. This would require getting the list of formatting tokens from the Microsoft web page and then removing the keywords that appear between brackets, possibly followed by a parameter.
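The regexp approach can be sketched as follows. This is a minimal illustration, not a complete rtf parser: it drops control words (with their optional numeric parameter), decodes `\'hh` hex escapes, and skips a few groups that hold no document text. The function name `rtf_to_text`, the short list in `SKIPPED_DESTINATIONS` and the default `cp1252` encoding are assumptions made for this example.

```python
import re

# Destinations whose content is not document text (an assumed, incomplete list).
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"    # control word, optional parameter, optional space
    r"|\\'([0-9a-fA-F]{2})"    # \'hh: one 8-bit character written in hex
    r"|\\([^a-z'])"            # control symbol such as \{ \} \\ \*
    r"|([{}])"                 # group start / end
    r"|([^\\{}]+)"             # run of plain text
)


def rtf_to_text(rtf, encoding="cp1252"):
    """Strip formatting from an rtf string and return the remaining text."""
    out = []
    stack = []        # 'skipping' state saved at each opening brace
    skipping = False  # True inside a group we do not want to index
    for match in TOKEN.finditer(rtf):
        word, _param, hexcode, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif symbol == "*" or word in SKIPPED_DESTINATIONS:
            skipping = True  # ignorable or non-text destination
        elif skipping:
            continue
        elif word in ("par", "line"):
            out.append("\n")
        elif word == "tab":
            out.append("\t")
        elif hexcode:
            out.append(bytes([int(hexcode, 16)]).decode(encoding, "replace"))
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)  # escaped literal character
        elif text:
            # Raw line breaks carry no meaning in rtf source.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)
```

All other control words are simply dropped, which is enough for indexing purposes; a real filter would also need to handle unicode escapes and the code page declared in the document header.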
others
hachoir-metadata
How to use it :
- create a parser from the name of the file and its unicode translation (that's where Hachoir guesses the file format)
- extract metadata according to a certain level of depth (metadata are sorted in a logical order; for example, for an mp3 the artist name is more important than the bitrate ...)
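The two steps above can be sketched as follows. This is only a sketch under assumptions: it uses the module layout of the current hachoir package (`hachoir.parser.createParser` and `hachoir.metadata.extractMetadata`), passes only the file name to `createParser`, and keeps the default extraction depth. The function names `lines_to_text` and `extract_metadata_text` are made up for this example.

```python
def lines_to_text(lines):
    """Join metadata lines into one string; tolerate None (nothing found)."""
    return "\n".join(lines or [])


def extract_metadata_text(filename):
    """Return the metadata hachoir finds in a file as plain text."""
    # Third-party imports, done lazily (hachoir is assumed to be installed).
    from hachoir.parser import createParser
    from hachoir.metadata import extractMetadata

    parser = createParser(filename)  # step 1: the file format is guessed here
    if parser is None:               # unknown or unparsable format
        return ""
    with parser:
        metadata = extractMetadata(parser)  # step 2: pull out the metadata
    if metadata is None:
        return ""
    return lines_to_text(metadata.exportPlaintext())
```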
pros :
- very simple
- already supports a lot of formats
- easy to add support for new formats
- pure python
cons :
- quite rare dependency: hachoir-metadata (which in turn depends on hachoir-parser)
Supported file formats : Total: 33 file formats.
Archive
- bzip2: bzip2 archive
- cab: Microsoft Cabinet archive
- gzip: gzip archive
- mar: Microsoft Archive
- tar: TAR archive
- zip: ZIP archive
Audio
- aiff: Audio Interchange File Format (AIFF)
- mpeg_audio: MPEG audio version 1, 2, 2.5
- real_audio: Real audio (.ra)
- sun_next_snd: Sun/NeXT audio
Container
- matroska: Matroska multimedia container
- ogg: Ogg multimedia container
- real_media: RealMedia (rm) Container File
- riff: Microsoft RIFF container
Image
- bmp: Microsoft bitmap (BMP) picture
- gif: GIF picture
- ico: Microsoft Windows icon or cursor
- jpeg: JPEG picture
- pcx: PC Paintbrush (PCX) picture
- png: Portable Network Graphics (PNG) picture
- psd: Photoshop (PSD) picture
- targa: Truevision Targa Graphic (TGA)
- tiff: TIFF picture
- wmf: Microsoft Windows Metafile (WMF)
- xcf: Gimp (XCF) picture
Misc
- ole2: Microsoft Office document
- pcf: X11 Portable Compiled Font (pcf)
- torrent: Torrent metainfo file
- ttf: TrueType font
Program
- exe: Microsoft Windows Portable Executable
Video
- asf: Advanced Streaming Format (ASF), used for WMV (video) and WMA (audio)
- flv: Macromedia Flash video
- mov: Apple QuickTime movie
kaa-metadata
Syntax : info = kaa.metadata.parse(fileName)
pros :
- Very simple too
cons :
- only able to parse multimedia files (music, movies ...) because it's part of the freevo project
- as a consequence, it's quite hard to add a new file format
Supported File formats: Total : 27
- Audio: ac3, dts, flac, mp3 (with id3 tag support), ogg, pcm, m4a, wma.
- Video: avi, mkv, mpg, ogm, asf, wmv, flv, mov, dvd iso, vcd iso.
- Media: vcd, cd, dvd.
- Image: jpeg (with exif and iptc support), bmp, gif, png, tiff.
extractor
These are the python bindings for libextractor, a GNU project. They are pretty straightforward to use too.
Pros :
- GNU project: sure
- Adding a new file format is easy
Cons :
- Not pure python
Supported file formats : Total : 28
HTML, PDF, PS, OLE2 (DOC, XLS, PPT), Open Office (sxw), Star Office (sdw), DVI, MAN, MP3 (ID3v1 and ID3v2), NSF (NES Sound Format), SID, OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, REAL, RIFF (AVI), MPEG, QT and ASF.
Text files
http://chardet.feedparser.org/ is a library that tries to guess the encoding of a file. It returns the encoding guessed and the confidence you can give to its prediction. It's a pure python port of the algorithm used in the Mozilla browser.
Here is an example of how you can use it to guess the encoding of a file:
- and it returns something that looks like
So it is then easy to decide whether or not to use the encoding guessed by this library.
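A minimal sketch of how that decision could be made. `chardet.detect()` takes raw bytes and returns a dict with `encoding` and `confidence` keys; everything else here is an assumption made for illustration: the helper names `pick_encoding` and `read_text_file`, the 0.8 confidence threshold and the utf-8 fallback.

```python
def pick_encoding(detection, min_confidence=0.8, fallback="utf-8"):
    """Return the guessed encoding if the guess is confident enough, else the fallback."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback


def read_text_file(path, min_confidence=0.8, fallback="utf-8"):
    """Read a text file, decoding it with the encoding chardet guesses."""
    import chardet  # third-party, imported lazily (assumed to be installed)

    with open(path, "rb") as f:
        raw = f.read()
    detection = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
    encoding = pick_encoding(detection, min_confidence, fallback)
    # Undecodable bytes are replaced rather than aborting the indexing run.
    return raw.decode(encoding, errors="replace")
```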
Some stuff copied here from the google ghop tracker
I had already looked at these possibilities, but none of them deserves to be in my report:
- PyRTF only GENERATES rtf files, while we want to read it
- zopyx converter converts from html TO rtf, so it's useless in our case; moreover, it comes with scores of dependencies
- rtf2xml is a command line utility and as such doesn't come with a well documented API. Moreover, it isn't meant to extract content from an RTF file, so it isn't usable for our purpose "out of the box". It could certainly serve as a base for writing a proper rtf filter, even if I still think it would be easier to just remove the formatting markup.