See also: AntiSpamGlobalSolution/ReFactoring
Description
Moin was too slow when verifying BadContent. This patch adds a cache.
Steps to reproduce
Create a page, hit the save button, drink a coffee, wait until the save is finished ...
Example
http://pipapo.org/iowiki/SystemInfo
Improvements
Caching antispam regular expressions
--- antispam.py.org	2005-02-26 00:03:50.000000000 +0100
+++ antispam.py	2005-02-26 23:46:34.000000000 +0100
@@ -21,6 +21,7 @@
 from MoinMoin.security import Permissions
 from MoinMoin import caching, wikiutil

+mmblcache = []

 # Errors ---------------------------------------------------------------

@@ -123,6 +124,10 @@
                                        response)
                 p._write_file(response)
+                # clear the cache
+                global mmblcache
+                mmblcache = []
+

         except (timeoutsocket.Timeout, timeoutsocket.error), err:
             # Log the error
             # TODO: check if this does not fill the logs!
@@ -158,13 +163,23 @@
             for pn in BLACKLISTPAGES:
                 do_update = (pn != "LocalBadContent")
                 blacklist += getblacklist(request, pn, do_update)
+            global mmblcache
             if blacklist:
-                for blacklist_re in blacklist:
-                    try:
-                        match = re.search(blacklist_re, newtext, re.I)
-                    except re.error, err:
-                        dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
-                        continue
+                request.clock.start('mmblcache')
+                if not mmblcache:
+                    for blacklist_re in blacklist:
+                        try:
+                            mmblcache.append(re.compile(blacklist_re, re.I))
+                        except re.error, err:
+                            dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
+                            continue
+                request.clock.stop('mmblcache')
+
+                for blacklist_re in mmblcache:
+                    request.clock.start('regexmatch')
+                    match = blacklist_re.search(newtext)
+                    request.clock.stop('regexmatch')
+
                     if match:
                         # Log error and raise SaveError, PageEditor
                         # should handle this.
Note: this will only help persistent servers, not CGI.
Discussion
Gives a huge speedup on persistent servers. I don't know how much it affects CGI servers; it probably causes a small slowdown.
Could be a workaround for 1.3 until the final persistent caching of compiled regex is implemented in 1.4 (and hopefully backported to 1.3).
This cache saves the compile time, but most of the time is spent searching the page. For a big page this can be 10s on a fast machine, while compiling takes about 1s. For small pages, like this one, it does help. See AntiSpamGlobalSolution/ReFactoring, section "Searching only in links". If you combine the two, antispam will run well even on low-end machines.
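To illustrate that ratio, here is a rough measurement sketch (not part of any patch on this page; the pattern list and page sizes are made up for illustration, real BadContent numbers will differ):

import re
import time

# Hypothetical blacklist and pages, only to show compile vs. search cost.
patterns = [r"spam-domain-%d\.example\.com/\S+" % i for i in range(1000)]
small_page = "Some harmless wiki text. " * 40        # roughly 1 KB
big_page = "Some harmless wiki text. " * 40000       # roughly 1 MB

t0 = time.time()
compiled = [re.compile(p, re.I) for p in patterns]   # compile the whole blacklist once
print("compile: %.3fs" % (time.time() - t0))

for name, text in [("small page", small_page), ("big page", big_page)]:
    t0 = time.time()
    for scanner in compiled:
        scanner.search(text)                          # searching dominates on big pages
    print("search %s: %.3fs" % (name, time.time() - t0))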
This is only meant as a partial workaround until something better is finished; luckily most pages are small, and small pages are where long save times annoy the most (like PasteBins and such). The next optimization could be to check only the lines added in an edit; then even small edits on huge pages will speed up, and diffing out the added lines is likely trivial, or? When that works, the searching-in-links approach could be obsolete. Note that I just deleted some non-link spam from this wiki! -- CehTeh 2005-02-26 04:36:39
I think that your first patch could be much simpler using Python's builtin re cache. When you create a regex object (using re.compile(pattern)), Python caches the object (probably by pattern), and the next compile of the same pattern costs nothing. So to have this cache, all we have to do is create a list of re objects and then use the objects' search method instead of re.search. Check how this is done in the time_antispam.py script in AntiSpamGlobalSolution/ReFactoring, and check the example timings on that page. -- NirSoffer 2005-02-27 02:05:04
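A minimal sketch of what is meant here (the pattern list and page text are invented for illustration): calling re.search() with a pattern string goes through the re module's internal pattern cache on every call, while keeping a list of compiled pattern objects, as both patches on this page end up doing, pays that cost only once:

import re

blacklist = [r"viagra", r"casino-\w+\.example", r"cheap[- ]pills"]   # illustrative patterns
newtext = "Some page content to be checked on save."

# Variant 1: re.search() with pattern strings. The re module keeps an
# internal, size-limited cache of compiled patterns, so repeated calls
# with the same string are usually cheap, but the cache can thrash when
# many different patterns are in play.
for pattern in blacklist:
    if re.search(pattern, newtext, re.I):
        print("matched: " + pattern)

# Variant 2: compile once, keep the pattern objects, reuse their search().
scanners = [re.compile(pattern, re.I) for pattern in blacklist]
for scanner in scanners:
    if scanner.search(newtext):
        print("matched: " + scanner.pattern)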
Err... isn't that exactly what I am doing? Besides, I suggested that optimization ages ago in IRC and AlexanderSchremmer made the proposal after that conversation :P. The only missing piece (for 1.4) is to cache the compiled regexes on disk. Currently they are still recompiled too often, because FastCGI (that's what I use) spawns several processes and kills them after some time (I'll try to increase that timeout), so each new process has to compile the regexps again on its own. -- CehTeh 2005-02-27 03:36:07
No, it's not exactly what you are doing; anyway, it does not work like I described. It does work as shown in the time_antispam.py code: when you load the antispam list from a pickle, the load takes no time after the first load, i.e. no regex recompilation is done.
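A hedged sketch of that pickle idea (the cache file name and helper names are invented; compiled pattern objects are picklable because the re module registers a pickler that stores the pattern string plus flags, so loading still goes through re.compile and its per-process cache):

import pickle
import re

CACHE_FILE = "badcontent.pickle"   # hypothetical on-disk cache location

def save_blacklist(lines):
    # Compile the raw blacklist lines and pickle the pattern objects.
    compiled = []
    for line in lines:
        try:
            compiled.append(re.compile(line, re.I))
        except re.error:
            pass   # skip broken regexes, like the patches on this page do
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(compiled, f)

def load_blacklist():
    # Unpickling recreates the patterns via re.compile, which is served
    # from the re module's per-process cache for already-seen patterns.
    with open(CACHE_FILE, "rb") as f:
        return pickle.load(f)

save_blacklist([r"buy\s+cheap", r"casino-\w+\.example"])
print("%d patterns loaded" % len(load_blacklist()))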
Sorry, I am not a regular Python programmer; this was meant as a proof of concept and it works fine on my wiki. I would like something like this to get included; maybe someone who knows Python better can go over it. -- CehTeh 2005-02-27 02:18:49
Another problem with this patch is that you empty the whole cache when either BadContent or LocalBadContent changes, but you don't need to do this. Here is a patch that holds a cache for each page and updates the cache for each page separately. It's not ready for merging; I don't have time to test it now.
* looking for arch@arch.thinkmo.de--2003-archives/moin--main--1.3--patch-633 to compare with
* comparing to arch@arch.thinkmo.de--2003-archives/moin--main--1.3--patch-633
M  MoinMoin/util/antispam.py

* modified files

--- orig/MoinMoin/util/antispam.py
+++ mod/MoinMoin/util/antispam.py
@@ -21,6 +21,8 @@
 from MoinMoin.security import Permissions
 from MoinMoin import caching, wikiutil

+BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
+_cache = {}

 # Errors ---------------------------------------------------------------

@@ -53,14 +55,22 @@


 def makelist(text):
-    """ Split text into lines, strip them, skip # comments """
+    """ Split text into lines, strip them, skip # comments
+
+    Return list of re objects.
+    """
     lines = text.splitlines()
     list = []
     for line in lines:
         line = line.split(' # ', 1)[0] # rest of line comment
         line = line.strip()
         if line and not line.startswith('#'):
-            list.append(line)
+            try:
+                scanner = re.compile(line, re.I)
+                list.append(scanner)
+            except re.error, err:
+                dprint("Error in regex '%s': %s. Please check the pages %s." % (
+                       line, str(err), ', '.join(BLACKLISTPAGES)))
     return list


@@ -72,6 +82,8 @@
     @rtype: list
     @return: list of blacklisted regular expressions
     """
+    global _cache
+
     from MoinMoin.PageEditor import PageEditor
     p = PageEditor(request, pagename, uid_override="Antispam subsystem")
     if do_update:
@@ -122,6 +134,8 @@
                     raise WikirpcError("failed to get BadContent data",
                                        response)
                 p._write_file(response)
+                # Delete cache for this page
+                _cache[pagename] = None

         except (timeoutsocket.Timeout, timeoutsocket.error), err:
             # Log the error
@@ -138,34 +152,37 @@

         # set back socket timeout
         timeoutsocket.setDefaultSocketTimeout(old_timeout)
-
-    blacklist = p.get_raw_body()
-    return makelist(blacklist)
+
+    # Return cached blacklist or create new
+    if not pagename in _cache:
+        blacklist = p.get_raw_body()
+        _cache[pagename] = makelist(blacklist)
+
+    return _cache[pagename]


 class SecurityPolicy(Permissions):
     """ Extend the default security policy with antispam feature """

     def save(self, editor, newtext, rev, **kw):
-        BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
         if not editor.page_name in BLACKLISTPAGES:
             request = editor.request

             # Start timing of antispam operation
             request.clock.start('antispam')
-
+
+            request.clock.start('antispam-get-blacklist')
             blacklist = []
             for pn in BLACKLISTPAGES:
                 do_update = (pn != "LocalBadContent")
                 blacklist += getblacklist(request, pn, do_update)
+            request.clock.stop('antispam-get-blacklist')
             if blacklist:
-                for blacklist_re in blacklist:
-                    try:
-                        match = re.search(blacklist_re, newtext, re.I)
-                    except re.error, err:
-                        dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
-                        continue
+                request.clock.start('antispam-match')
+                for scanner in blacklist:
+                    match = scanner.search(newtext)
                     if match:
+                        request.clock.stop('antispam-match')
                         # Log error and raise SaveError, PageEditor
                         # should handle this.
                         _ = editor.request.getText
@@ -175,6 +192,7 @@
                         }
                         dprint(msg)
                         raise editor.SaveError(msg)
+                request.clock.stop('antispam-match')
             request.clock.stop('antispam')

         # No problem to save if my base class agree
About sharing the cache between processes: this introduces new problems and probably needs locking. As processes are not created and killed often, it's not very important.
Check only diffs
Here is the patch which checks only newly added lines against the BadContent regexps. This gives a HUGE speed improvement even on HUGE pages.
Note:
Lines which are already on the page are not checked again. With the old method, someone could be refused an edit even if he didn't add any lines matching BadContent himself; it's certainly not intended to blame someone for things he hasn't done. Things which get added to BadContent afterwards (funny note: imagine some joker pastes LocalSpellingWords into LocalBadContent) will not affect the wiki experience that much. Something which was accepted on a wiki should be legal until it is explicitly deleted. That's a good thing[tm] imo. -- CehTeh 2005-02-27 01:16:21
- This makes sense - unless there is a way to make two edits which each pass antispam but together create bad content. I think that will not happen because we use line diffs. (A sketch of the diff-only check follows below.)
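The diff-checking patch itself is not reproduced on this page; here is a minimal sketch of the idea, using Python's difflib to pull out only the added lines (the function names and the example blacklist are invented for illustration):

import difflib
import re

def added_lines(oldtext, newtext):
    # Keep only the lines that are new in this edit.
    diff = difflib.ndiff(oldtext.splitlines(), newtext.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def check_spam(oldtext, newtext, scanners):
    # Run the compiled blacklist only over the added lines.
    for line in added_lines(oldtext, newtext):
        for scanner in scanners:
            match = scanner.search(line)
            if match:
                return match   # caller raises SaveError, as in the patches above
    return None

scanners = [re.compile(r"cheap\s+pills", re.I)]   # illustrative blacklist
old = "A harmless page.\n"
new = "A harmless page.\nBuy cheap pills here!\n"
print(check_spam(old, new, scanners))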
Nice, but where are the numbers that show the HUGE difference?
I have no test setup and am testing constantly on my production server, so if you want numbers, just try it yourself or play with my test file at http://cehteh.homeunix.org/pipawiki/PasteBin/test (the server is quite overloaded and currently reassembling some RAIDs, but I can see that saving small changes on that big file takes only a few seconds; there is no measurement of how long it would take without the patch, but from my experience it's in the range of 30 seconds (VIA C3 processor, 1 GHz, quite loaded system)). -- CehTeh 2005-02-27 02:18:49
Addendum: http://cehteh.homeunix.org/pipawiki/PasteBin/absmi ... a really big page (page size is 297156 bytes); it takes ages to render in my browser:
- antispam = 0.428s
- compile_huge_and_ugly = 0.001s
- diff = 0.231s
- getACL = 0.077s
- mmblcache = 0.000s
- oldpage = 0.000s
- oldtext = 0.056s
- regexmatch = 0.054s
- run = 18.490s
- send_page = 17.278s
- send_page_content = 17.125s
- total = 18.493s
Plan
- Priority: Not really a bug, but would be nice to improve
Assigned to: ThomasWaldmann
Status: Committed by ThomasWaldmann, fixed by AlexanderSchremmer.