The cache generator
This tool scans your wiki's data/pages paths for attachments. If it finds a new PDF in some path, it creates a searchcache directory for that page and renders a text file containing all words in the PDF, sorted and de-duplicated:
#
# Copyright (c) 2003 Thomas Renard <CyBaer42@web.de>
# All rights reserved, see COPYING for details.
#
# This script extracts word lists from attachments
#
# $Id$
WIKIROOT=/your/wiki/root/here
CACHEREF=$WIKIROOT/cacheref
find "$WIKIROOT"/data/pages/*/attachments/ -newer "$CACHEREF" -type f -print 2>/dev/null |
while read -r i
do
    if file "$i" | grep -q PDF
    then
        j=`echo "$i" | sed "s/attachments.*/searchcache/"`
        k=`echo "$i" | sed "s/attachments/searchcache/"`
        mkdir -p "$j"
        pstotext "$i" | sed "s/[[:space:]]/\n/g" | \
            sed "s/[\"':,!0><=.;^*|+-]//g" | \
            tr A-Z a-z | \
            sort | uniq >"$k"
    fi
done
touch "$CACHEREF"

This script should be run via cron. The last sed strips a few characters I did not want to keep; maybe this can be made a little smoother in future releases. $CACHEREF is a timestamp file used to check whether any attachment has changed since the last run of this script.
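To see what the word-extraction pipeline does, it can be fed sample text directly in place of the pstotext output (a quick illustration only; the \n in the sed replacement assumes GNU sed):

```shell
# Normalization stage of the cache script, applied to a sample string:
# split on whitespace, strip punctuation, lowercase, sort, de-duplicate.
echo 'The Wiki, the PDF: a test -- a TEST!' |
    sed "s/[[:space:]]/\n/g" |
    sed "s/[\"':,!0><=.;^*|+-]//g" |
    tr A-Z a-z |
    sort | uniq
```

Each distinct lowercase word comes out exactly once; a token that was all punctuation (like the "--") leaves one empty line, which the real script would also emit. For the cron part, an entry along the lines of `0 * * * * /path/to/makecache.sh` (path hypothetical) in the wiki owner's crontab would rebuild the caches hourly.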
With the next release I will try to render M$ Word documents via wvText. It works the same as the PDF handling, except that it uses wvText instead of pstotext and checks for Word documents in the output of "file $i".
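A rough sketch of that Word variant might look as follows (hypothetical; it assumes wvText is installed, that "file" reports the type with the string "Microsoft Word", and that wvText's two-argument source/destination form is used):

```shell
WIKIROOT=/your/wiki/root/here
CACHEREF=$WIKIROOT/cacheref

# Same loop as the PDF version, but matching Word attachments and
# converting them with wvText, which writes plain text to a named file.
find "$WIKIROOT"/data/pages/*/attachments/ -newer "$CACHEREF" -type f -print 2>/dev/null |
while read -r i
do
    if file "$i" | grep -q "Microsoft Word"
    then
        j=`echo "$i" | sed "s/attachments.*/searchcache/"`   # cache directory
        k=`echo "$i" | sed "s/attachments/searchcache/"`     # cache file
        mkdir -p "$j"
        wvText "$i" "$k.txt" &&
            sed "s/[[:space:]]/\n/g" "$k.txt" |
            sed "s/[\"':,!0><=.;^*|+-]//g" |
            tr A-Z a-z |
            sort | uniq >"$k" &&
            rm -f "$k.txt"
    fi
done
```

The path rewriting is identical to the PDF case: the first sed turns the attachment path into the page's searchcache directory, the second into the cache file name inside it.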
Remarks and Questions
Wouldn't it be easier and more convenient if this script were only run when a PDF file has actually been attached?
