Previously, log files were loaded completely for every request that needed to read them. To make this faster and to use less memory, the log file handling was completely rewritten.
Implementation Log
The new logfile classes use a double buffer for sequential access. Line numbers are calculated where possible.
class LogFile:
    def __init__(self, filename):
        self.filter = None
    def next(self):
        ...
        return self.parser(line)

    def previous(self):
        pass
    def __iter__(self):
        return self
    def reverse(self):
        self.to_end()
        while 1:
            yield self.previous()
    def to_begin(self):
        pass
    def to_end(self):
        pass
    def seek(self, position, lineno=None):
        # position is an implementation-dependent value that can be
        # printed with backticks (repr) and rebuilt from that string;
        # it is an integer for a plain text file
        pass
    def peek(self, lines):
        # moves the position by `lines` lines
        # O(|lines|), O(1) in the cached area
        pass
    def lineno(self):
        # returns None if the current line number is unknown
        pass
    def position(self):
        pass
    def calculate_line_no(self):
        # may be expensive
        pass

    def add(self, *data):
        pass
    def parser(self, line):
        pass
There are now three logfile classes:
LogFile (base class)
EditLog
- .parser()
- .add()
- .set_filter(self, **kw)
- uses an empty Python class to return entry contents
EventLog
- returns a tuple
- .parser()
- .add()
- .set_filter()
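To illustrate the difference between the two return styles, here is a minimal sketch. The tab-separated line layouts, the field names, and the helper names (`LogEntry`, `parse_edit_line`, `parse_event_line`, `matches`) are assumptions made up for this example and are not taken from the actual classes.

```python
class LogEntry:
    """Empty class; the parser attaches the fields as attributes."""


def parse_edit_line(line):
    # Hypothetical layout: timestamp <TAB> pagename <TAB> action
    fields = line.rstrip("\n").split("\t")
    entry = LogEntry()
    entry.timestamp = int(fields[0])
    entry.pagename = fields[1]
    entry.action = fields[2]
    return entry


def parse_event_line(line):
    # Hypothetical layout: timestamp <TAB> eventtype <TAB> value
    timestamp, eventtype, value = line.rstrip("\n").split("\t")
    return (int(timestamp), eventtype, value)


def matches(entry, **kw):
    # Keyword filter in the spirit of set_filter(**kw): every given
    # attribute must equal the requested value.
    return all(getattr(entry, name, None) == wanted
               for name, wanted in kw.items())
```

An object with named attributes is convenient when callers pick out single fields, while a tuple is cheaper when many entries are only counted or unpacked once.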
Todo:
testing
editlog
kB restriction removed (StringIO buffer is gone)
keep printing until the bookmark or at most 90 days
RC RSS
Page
PageEditor
SystemInfo (wikimacro)
page info (wikiaction)
eventlog
request (commented out .getEventLogger())
PageEditor
SystemInfo (wikimacro)
stats/hitcounts
maintain file day: hits, edit
make edits more readable (scaled up by adaptable factor 10^n)
stats/useragents
keep results
The logfile code moved into its own subdirectory (MoinMoin/logfile).
Results
- RecentChanges is now about 4 times as fast
- The speedup for hitcounts and useragents depends on the logfile size
34 MB: 32 s -> 1.5 s
- hitcounts is now even doable for an event.log of 400MB (== 1.5 years of linuxwiki.org). After initial creation of cache, it will only take a few moments to show an updated graph.
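One way such a cache can make updates cheap is to remember how far the log has already been counted and to read only the lines appended since then. The following is a minimal sketch of that idea. The function name `update_hitcounts` and the cache keys `position` and `hits` are hypothetical, and the real hitcounts cache may store different data.

```python
import io


def update_hitcounts(logfile, cache):
    """Fold the lines added since the last call into cache.

    cache is a dict with 'position' (byte offset already processed)
    and 'hits' (running total); both keys are made up for this sketch.
    logfile must be opened in binary mode.
    """
    logfile.seek(cache["position"])
    for line in logfile:
        if not line.endswith(b"\n"):
            break  # incomplete last line; pick it up on the next call
        cache["hits"] += 1
        cache["position"] += len(line)
    return cache


# Usage with an in-memory stand-in for event.log
log = io.BytesIO(b"1\tVIEWPAGE\n2\tVIEWPAGE\n")
cache = update_hitcounts(log, {"position": 0, "hits": 0})
```

The first call costs one full pass over the file. Later calls only touch the bytes after `cache["position"]`.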
Ideas and Discussion
Iterate from a date on
- Additionally, it is possible to search for a date in the log using binary search.
- O(log n) with a relatively large constant factor
- needs a compare function between entries
for entry in logfile.from_date(timestamp): updatestats(entry)
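The binary search can work on byte offsets instead of line numbers, so no index is needed. The sketch below assumes that every line starts with an integer timestamp followed by a tab and that the file is sorted by timestamp. The helper names `timestamp_of`, `line_start_at_or_after` and `find_date` are hypothetical, and `from_date` is written as a free function rather than as a method.

```python
import io


def timestamp_of(line):
    # Assumed layout: integer timestamp, then a tab, then the rest.
    return int(line.split(b"\t", 1)[0])


def line_start_at_or_after(f, offset):
    """Position f at the first line start >= offset and return it."""
    if offset == 0:
        f.seek(0)
    else:
        f.seek(offset - 1)
        f.readline()  # consume up to and including the next newline
    return f.tell()


def find_date(f, target):
    """Return the byte offset of the first line with timestamp >= target."""
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        start = line_start_at_or_after(f, mid)
        line = f.readline()
        if line and timestamp_of(line) < target:
            lo = start + 1   # this line and everything before it is too old
        else:
            hi = mid         # the wanted line starts at or before `start`
    return line_start_at_or_after(f, lo)


def from_date(f, target):
    """Yield all lines whose timestamp is >= target."""
    f.seek(find_date(f, target))
    for line in f:
        yield line


log = io.BytesIO(b"10\ta\n20\tb\n20\tc\n30\td\n")
recent = list(from_date(log, 20))
```

Each step costs a seek and up to two `readline()` calls, which is where the large constant factor comes from.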
Loading
Another question is whether the log file class should keep loaded entries. If entries are accessed twice, keeping them is faster. If we process whole files, it takes much more memory (with the current implementation this is in the range of xxx MB!).
The only part where log entries are processed several times is the RecentChanges macro, which operates on a short section at the end of the log file. Perhaps it is a good idea to cache a relatively small area (50 kB) around the current file position. This must be done for moving backwards anyway, so we can make this buffer a bit bigger.
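As a rough illustration of why moving backwards needs a buffer, the sketch below reads a file from the end in chunks of 50 kB and yields the lines in reverse order. It only covers the backward direction and does not keep the buffer for reuse. The function name `previous_lines` and its parameters are hypothetical.

```python
import io


def previous_lines(f, chunk_size=50 * 1024):
    """Yield the lines of binary file f from last to first."""
    f.seek(0, io.SEEK_END)
    position = f.tell()
    tail = b""  # line fragment whose beginning is not loaded yet
    while position > 0:
        step = min(chunk_size, position)
        position -= step
        f.seek(position)
        buffer = f.read(step) + tail
        lines = buffer.splitlines(True)  # keep the line endings
        # Unless we reached the file start, the first piece may be
        # incomplete, so it is carried over to the next chunk.
        tail = lines.pop(0) if position > 0 else b""
        for line in reversed(lines):
            yield line


log = io.BytesIO(b"one\ntwo\nthree\n")
backwards = list(previous_lines(log, chunk_size=4))
```

A macro like RecentChanges that only needs the newest entries can stop iterating early, so only the last chunk or two are ever read.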
Additional implementation ideas are welcome.