See also: AntiSpamGlobalSolution/ReFactoring
Description
Moin was too slow when verifying BadContent. This patch adds a cache.
Steps to reproduce
Create a page, hit the save button, drink a coffee, wait until the save is finished ...
Example
http://pipapo.org/iowiki/SystemInfo
Improvements
Caching antispam regular expressions
--- antispam.py.org	2005-02-26 00:03:50.000000000 +0100
+++ antispam.py	2005-02-26 23:46:34.000000000 +0100
@@ -21,6 +21,7 @@
 from MoinMoin.security import Permissions
 from MoinMoin import caching, wikiutil

+mmblcache = []

 # Errors ---------------------------------------------------------------

@@ -123,6 +124,10 @@
                                        response)
                 p._write_file(response)
+                # clear the cache
+                global mmblcache
+                mmblcache = []
+

         except (timeoutsocket.Timeout, timeoutsocket.error), err:
             # Log the error
             # TODO: check if this does not fill the logs!
@@ -158,13 +163,23 @@
             for pn in BLACKLISTPAGES:
                 do_update = (pn != "LocalBadContent")
                 blacklist += getblacklist(request, pn, do_update)
+            global mmblcache
             if blacklist:
-                for blacklist_re in blacklist:
-                    try:
-                        match = re.search(blacklist_re, newtext, re.I)
-                    except re.error, err:
-                        dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
-                        continue
+                request.clock.start('mmblcache')
+                if not mmblcache:
+                    for blacklist_re in blacklist:
+                        try:
+                            mmblcache.append(re.compile(blacklist_re, re.I))
+                        except re.error, err:
+                            dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
+                            continue
+                request.clock.stop('mmblcache')
+
+                for blacklist_re in mmblcache:
+                    request.clock.start('regexmatch')
+                    match = blacklist_re.search(newtext)
+                    request.clock.stop('regexmatch')
+
                     if match:
                         # Log error and raise SaveError, PageEditor
                         # should handle this.
Note: this will only help persistent servers, not CGI.
Discussion
Gives a huge speedup on persistent servers. I don't know how much it affects CGI servers; it probably causes a small slowdown.
Could be a workaround for 1.3 until the final persistent caching of compiled regex is implemented in 1.4 (and hopefully backported to 1.3).
This cache saves the compile time, but most of the time is spent searching the page. For a big page this can be 10s on a fast machine, while compiling takes about 1s. For small pages, like this one, it does help. See AntiSpamGlobalSolution/ReFactoring, section "Searching only in links". If you combine the two, antispam will run well even on low-end machines.
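To illustrate that ratio, here is a rough measurement sketch (not part of any patch on this page; the pattern list and page sizes are made up for illustration, real BadContent numbers will differ):

import re
import time

# Hypothetical blacklist and pages, only to show compile vs. search cost.
patterns = [r"spam-domain-%d\.example\.com/\S+" % i for i in range(1000)]
small_page = "Some harmless wiki text. " * 40        # roughly 1 KB
big_page = "Some harmless wiki text. " * 40000       # roughly 1 MB

t0 = time.time()
compiled = [re.compile(p, re.I) for p in patterns]   # compile the whole blacklist once
print("compile: %.3fs" % (time.time() - t0))

for name, text in [("small page", small_page), ("big page", big_page)]:
    t0 = time.time()
    for scanner in compiled:
        scanner.search(text)                          # searching dominates on big pages
    print("search %s: %.3fs" % (name, time.time() - t0))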
This is only meant as a partial workaround until something better is finished; luckily most pages are small, and small pages are where long save times annoy the most (like PasteBins and such). The next optimization could be to check only the lines added in an edit; then even small edits on huge pages will speed up, and diffing out the added lines is likely trivial, or? When that works, the searching-in-links approach could be obsolete. Note that I just deleted some non-link spam from this wiki! -- CehTeh 2005-02-26 04:36:39
I think that your first patch could be much simpler using Python's builtin re cache. When you create a regex object (using re.compile(pattern)), Python caches the object (probably by pattern), and the next compile of the same pattern costs nothing. So to have this cache, all we have to do is create a list of re objects and then use the objects' search method instead of re.search. Check how this is done in the time_antispam.py script in AntiSpamGlobalSolution/ReFactoring, and check the example timings on that page. -- NirSoffer 2005-02-27 02:05:04
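A minimal sketch of what is meant here (the pattern list and page text are invented for illustration): calling re.search() with a pattern string goes through the re module's internal pattern cache on every call, while keeping a list of compiled pattern objects, as both patches on this page end up doing, pays that cost only once:

import re

blacklist = [r"viagra", r"casino-\w+\.example", r"cheap[- ]pills"]   # illustrative patterns
newtext = "Some page content to be checked on save."

# Variant 1: re.search() with pattern strings. The re module keeps an
# internal, size-limited cache of compiled patterns, so repeated calls
# with the same string are usually cheap, but the cache can thrash when
# many different patterns are in play.
for pattern in blacklist:
    if re.search(pattern, newtext, re.I):
        print("matched: " + pattern)

# Variant 2: compile once, keep the pattern objects, reuse their search().
scanners = [re.compile(pattern, re.I) for pattern in blacklist]
for scanner in scanners:
    if scanner.search(newtext):
        print("matched: " + scanner.pattern)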
Err... isn't that exactly what I am doing? Besides, I suggested that optimization ages ago in IRC and AlexanderSchremmer made the proposal after that conversation :P. The only missing piece (for 1.4) is to cache the compiled regexes on disk. Currently they are still recompiled too often, because FastCGI (that's what I use) spawns several processes and kills them after some time (I'll try to increase that timeout), so each new process has to compile the regexps again on its own. -- CehTeh 2005-02-27 03:36:07
No, it's not exactly what you are doing; anyway, it does not work like I described. It does work as shown in the time_antispam.py code: when you load the antispam list from a pickle, the load takes no time after the first load, i.e. no regex recompilation is done.
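A hedged sketch of that pickle idea (the cache file name and helper names are invented; compiled pattern objects are picklable because the re module registers a pickler that stores the pattern string plus flags, so loading still goes through re.compile and its per-process cache):

import pickle
import re

CACHE_FILE = "badcontent.pickle"   # hypothetical on-disk cache location

def save_blacklist(lines):
    # Compile the raw blacklist lines and pickle the pattern objects.
    compiled = []
    for line in lines:
        try:
            compiled.append(re.compile(line, re.I))
        except re.error:
            pass   # skip broken regexes, like the patches on this page do
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(compiled, f)

def load_blacklist():
    # Unpickling recreates the patterns via re.compile, which is served
    # from the re module's per-process cache for already-seen patterns.
    with open(CACHE_FILE, "rb") as f:
        return pickle.load(f)

save_blacklist([r"buy\s+cheap", r"casino-\w+\.example"])
print("%d patterns loaded" % len(load_blacklist()))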
Sorry, I am not a regular Python programmer; this was meant as a proof of concept and it works fine on my wiki. I would like something like this to get included; maybe someone who knows Python better can go over it. -- CehTeh 2005-02-27 02:18:49
Another problem with this patch is that you empty the whole cache when either BadContent or LocalBadContent changes, but you don't need to do this. Here is a patch that holds a cache for each page and updates the cache for each page separately. It's not ready for merging; I don't have time to test it now.
* looking for arch@arch.thinkmo.de--2003-archives/moin--main--1.3--patch-633 to compare with
* comparing to arch@arch.thinkmo.de--2003-archives/moin--main--1.3--patch-633
M  MoinMoin/util/antispam.py

* modified files

--- orig/MoinMoin/util/antispam.py
+++ mod/MoinMoin/util/antispam.py
@@ -21,6 +21,8 @@
 from MoinMoin.security import Permissions
 from MoinMoin import caching, wikiutil

+BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
+_cache = {}

 # Errors ---------------------------------------------------------------

@@ -53,14 +55,22 @@


 def makelist(text):
-    """ Split text into lines, strip them, skip # comments """
+    """ Split text into lines, strip them, skip # comments
+
+    Return list of re objects.
+    """
     lines = text.splitlines()
     list = []
     for line in lines:
         line = line.split(' # ', 1)[0] # rest of line comment
         line = line.strip()
         if line and not line.startswith('#'):
-            list.append(line)
+            try:
+                scanner = re.compile(line, re.I)
+                list.append(scanner)
+            except re.error, err:
+                dprint("Error in regex '%s': %s. Please check the pages %s." % (
+                       line, str(err), ', '.join(BLACKLISTPAGES)))
     return list


@@ -72,6 +82,8 @@
     @rtype: list
     @return: list of blacklisted regular expressions
     """
+    global _cache
+
     from MoinMoin.PageEditor import PageEditor
     p = PageEditor(request, pagename, uid_override="Antispam subsystem")
     if do_update:
@@ -122,6 +134,8 @@
                     raise WikirpcError("failed to get BadContent data",
                                        response)
                 p._write_file(response)
+                # Delete cache for this page
+                _cache[pagename] = None

         except (timeoutsocket.Timeout, timeoutsocket.error), err:
             # Log the error
@@ -138,34 +152,37 @@

         # set back socket timeout
         timeoutsocket.setDefaultSocketTimeout(old_timeout)
-
-    blacklist = p.get_raw_body()
-    return makelist(blacklist)
+
+    # Return cached blacklist or create new
+    if not pagename in _cache:
+        blacklist = p.get_raw_body()
+        _cache[pagename] = makelist(blacklist)
+
+    return _cache[pagename]


 class SecurityPolicy(Permissions):
     """ Extend the default security policy with antispam feature """

     def save(self, editor, newtext, rev, **kw):
-        BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
         if not editor.page_name in BLACKLISTPAGES:
             request = editor.request

             # Start timing of antispam operation
             request.clock.start('antispam')
-
+
+            request.clock.start('antispam-get-blacklist')
             blacklist = []
             for pn in BLACKLISTPAGES:
                 do_update = (pn != "LocalBadContent")
                 blacklist += getblacklist(request, pn, do_update)
+            request.clock.stop('antispam-get-blacklist')
             if blacklist:
-                for blacklist_re in blacklist:
-                    try:
-                        match = re.search(blacklist_re, newtext, re.I)
-                    except re.error, err:
-                        dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
-                        continue
+                request.clock.start('antispam-match')
+                for scanner in blacklist:
+                    match = scanner.search(newtext)
                     if match:
+                        request.clock.stop('antispam-match')
                         # Log error and raise SaveError, PageEditor
                         # should handle this.
                         _ = editor.request.getText
@@ -175,6 +192,7 @@
                         }
                         dprint(msg)
                         raise editor.SaveError(msg)
+                request.clock.stop('antispam-match')
             request.clock.stop('antispam')

         # No problem to save if my base class agree
About sharing the cache between processes: this introduces new problems and probably needs locking. As processes are not created and killed often, it's not very important.
Check only diffs
Here is the patch which checks only newly added lines against the BadContent regexps. This gives a HUGE speed improvement even on HUGE pages.
Note:
Lines which are already on the page are not checked again. With the old method, someone could be refused an edit even if he didn't add any lines matching BadContent himself; it's certainly not intended to blame someone for things he hasn't done. Things which get added to BadContent afterwards (funny note: imagine some joker pastes LocalSpellingWords into LocalBadContent) will not affect the wiki experience that much. Something which was accepted on a wiki should be legal until it is explicitly deleted. That's a good thing[tm] imo. -- CehTeh 2005-02-27 01:16:21
- This makes sense - unless there is a way to make two edits which each pass antispam but together create bad content. I think that will not happen because we use line diffs. (A sketch of the diff-only check follows below.)
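The diff-checking patch itself is not reproduced on this page; here is a minimal sketch of the idea, using Python's difflib to pull out only the added lines (the function names and the example blacklist are invented for illustration):

import difflib
import re

def added_lines(oldtext, newtext):
    # Keep only the lines that are new in this edit.
    diff = difflib.ndiff(oldtext.splitlines(), newtext.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def check_spam(oldtext, newtext, scanners):
    # Run the compiled blacklist only over the added lines.
    for line in added_lines(oldtext, newtext):
        for scanner in scanners:
            match = scanner.search(line)
            if match:
                return match   # caller raises SaveError, as in the patches above
    return None

scanners = [re.compile(r"cheap\s+pills", re.I)]   # illustrative blacklist
old = "A harmless page.\n"
new = "A harmless page.\nBuy cheap pills here!\n"
print(check_spam(old, new, scanners))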
Nice, but where are the numbers that show the HUGE difference?
I have no test setup and am testing constantly on my production server, so if you want numbers, just try it yourself or play with my test file at http://cehteh.homeunix.org/pipawiki/PasteBin/test (the server is quite overloaded and currently reassembling some RAIDs, but I can see that saving small changes on that big file takes only a few seconds; there is no measurement of how long it would take without the patch, but from my experience it's in the range of 30 seconds (VIA C3 processor, 1 GHz, quite loaded system)). -- CehTeh 2005-02-27 02:18:49
Addendum: http://cehteh.homeunix.org/pipawiki/PasteBin/absmi ... a really big page (page size is 297156 bytes); it takes ages to render in my browser:
- antispam = 0.428s
- compile_huge_and_ugly = 0.001s
- diff = 0.231s
- getACL = 0.077s
- mmblcache = 0.000s
- oldpage = 0.000s
- oldtext = 0.056s
- regexmatch = 0.054s
- run = 18.490s
- send_page = 17.278s
- send_page_content = 17.125s
- total = 18.493s
Plan
- Priority: Not really a bug, but would be nice to improve
Assigned to: ThomasWaldmann
Status: Committed by ThomasWaldmann, fixed by AlexanderSchremmer.