This code implements three ways of reading the BadContent page and matching its patterns against a normal page.

Discussion

Code

    import time, re, urllib

    old = False     # use the old code
    tw_new = True   # use tw's recommendation (batched compiled re objects)

    def timeit():
        return time.time()

    #c = urllib.urlopen("http://moinmaster.wikiwikiweb.de:8000/BadContent?action=raw").read()
    c = open(r"H:\BadContent").read()        # spam regex file

    #t = urllib.urlopen("http://moinmaster.wikiwikiweb.de:8000/SyntaxReference?action=raw").read()
    t = open(r"H:\SyntaxReference").read()   # page file

    if old:
        def makelist(text):
            """ Split text into lines, strip them, skip # comments """
            result = []
            for line in text.splitlines():
                line = line.split(' # ', 1)[0]   # strip rest-of-line comment
                line = line.strip()
                if line and not line.startswith('#'):
                    result.append(line)
            return result
    elif tw_new:
        def reduce_res(rlist):
            """ Batch the patterns into compiled re objects, up to 100 per object """
            tlist = []
            result = []
            for pos in range(len(rlist)):
                tlist.append(rlist[pos])
                if len(tlist) >= 100 or pos + 1 == len(rlist):
                    # compile with re.I here: passing a flags argument to re.search
                    # together with an already compiled pattern does not work
                    result.append(re.compile('|'.join(tlist), re.I))
                    tlist = []
            print "%i;" % (len(result),),
            return result

        def makelist(text):
            """
            Split text into lines, strip them, skip # comments;
            sort them into regular expressions and strings
            """
            rlist = []
            for line in text.splitlines():
                line = line.split(' # ', 1)[0].strip()   # strip rest-of-line comment
                if line and not line.startswith('#'):
                    rlist.append(line)
            return ([], reduce_res(rlist))
    else:
        def makelist(text):
            """
            Split text into lines, strip them, skip # comments;
            sort them into regular expressions and strings
            """
            rlist = []
            slist = []
            is_regex = False
            for line in text.splitlines():
                line = line.split(' # ', 1)[0].strip()   # strip rest-of-line comment
                if line == '#regex on':
                    is_regex = True
                elif line == '#regex off':
                    is_regex = False
                elif line and not line.startswith('#'):
                    if is_regex:
                        rlist.append(line)
                    else:
                        slist.append(line)
            return (slist, rlist)

    b = timeit()

    if old:
        blacklist = makelist(c)
        for blacklist_re in blacklist:
            match = re.search(blacklist_re, t, re.I)
            if match:
                raise StandardError
    elif tw_new:
        rlist = makelist(c)[1]
        print "%f;" % (timeit() - b),        # time to build the compiled list

        b = timeit()
        for blacklist_re in rlist:
            match = blacklist_re.search(t)   # patterns were compiled with re.I
            if match:
                raise StandardError
        print "%f" % (timeit() - b)          # time to match
        b = timeit()
    else:
        slist, rlist = makelist(c)
        t = t.lower()
        for blacklist_re in rlist:
            match = re.search(blacklist_re, t, re.I)
            if match:
                raise StandardError
        for blacklist_str in slist:
            match = (t.find(blacklist_str.lower()) != -1)
            if match:
                raise StandardError

    print timeit() - b
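
As a quick aside on why the '|'.join batching in reduce_res pays off: one combined, compiled re replaces a Python-level loop of many separate re.search calls. A toy illustration with made-up patterns (not part of the benchmark above):

    import re

    patterns = ['viagra', 'casino', 'cheap-pills']   # made-up sample patterns
    combined = re.compile('|'.join(patterns), re.I)

    # one C-level scan of the text instead of three separate re.search calls
    print combined.search('Buy CHEAP-PILLS now!') is not None   # True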

Comments

There is a major problem with this code: it does not use Unicode. To compare it with the current code, you must use Unicode for both the page text and the bad content.
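
As a rough idea of what a Unicode-aware comparison could look like, here is a minimal sketch; the UTF-8 encoding and the re.U flag are assumptions for illustration, not what MoinMoin itself does:

    # Sketch only: decode both the page text and the BadContent patterns to
    # unicode before matching (assumes the files are UTF-8 encoded).
    import re, codecs

    patterns_text = codecs.open(r"H:\BadContent", "r", "utf-8").read()
    page_text = codecs.open(r"H:\SyntaxReference", "r", "utf-8").read()

    for line in patterns_text.splitlines():
        line = line.split(u' # ', 1)[0].strip()
        if line and not line.startswith(u'#'):
            if re.search(line, page_text, re.I | re.U):
                raise StandardError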

Searching only in links

I wrote a timing script to compare the current code with the new methods. I tested three setups:

  1. Current code copied from antispam - create a list of patterns, then iterate over the list and call re.search for each item on the whole page body.
  2. The usual re optimization - create a list of compiled re objects, each containing up to 100 patterns or up to 100 groups (whichever limit is hit first), based on Alexander's finding that about 100 patterns is the sweet spot. This should probably be checked again, because he did not use Unicode. The list is pickled to disk and loaded for each request; then the search method of each compiled re is called on the whole page text. This saves the compile step for the second request in a long-running process, because Python automatically caches compiled re objects. (A sketch of this follows the list.)
  3. Same as 2, but instead of matching all the text, extract the links from the text first, then match against the links only (sketched after the next paragraph).
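
A minimal sketch of setup 2's batch-and-pickle idea; the file name, helper names, and flags are assumptions for illustration, not the actual antispam code:

    import re, pickle

    def build_and_save(patterns, cache_path='blacklist.pickle', batch=100):
        """ Batch patterns into compiled re objects, 100 per object, and pickle them """
        compiled = [re.compile('|'.join(patterns[i:i + batch]), re.I)
                    for i in range(0, len(patterns), batch)]
        pickle.dump(compiled, open(cache_path, 'wb'))   # compiled re objects pickle fine
        return compiled

    def load_blacklist(cache_path='blacklist.pickle'):
        """ Load the previously pickled list for the next request """
        return pickle.load(open(cache_path, 'rb'))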

All tests try all the re patterns - this is the typical case and also the worst case we have: an honest user who tries to save and waits for the wiki. A spammer will get the result much faster, because a match stops the iteration.
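
For setup 3, here is a rough sketch of the link-first idea; the URL regex is a deliberately simplified assumption, not MoinMoin's actual link parser. It also shows the early exit on a match:

    import re

    link_re = re.compile(r'https?://[^\s\'"<>]+', re.I)   # simplified link pattern

    def check_links(page_text, compiled_blacklist):
        links = link_re.findall(page_text)
        print "found %i links" % len(links)
        haystack = '\n'.join(links)            # far shorter than the whole page body
        for big_re in compiled_blacklist:
            if big_re.search(haystack):        # a hit stops the iteration immediately
                raise StandardError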

Here are the results for the same file Alexander tested (I get much slower times on much faster hardware; I don't know why, as I have not run his code yet):

Aluminum:~/Desktop nir$ py time_antispam.py current
testing current code, using re.search
first request on long running process / cgi: 1.57658982
second request on long running process: 1.51904202
Aluminum:~/Desktop nir$ py time_antispam.py compiled
testing new code, using few big compiled re objects
first request on long running process / cgi: 1.59112692
second request on long running process: 0.52169299
Aluminum:~/Desktop nir$ py time_antispam.py compiled-links
testing new code, using compiled re objects on page links only
found 12 links
first request on long running process / cgi: 1.12022805
found 12 links
second request on long running process: 0.05015397

And here are the results for a small page, FrontPage:

Aluminum:~/Desktop nir$ py time_antispam.py current
testing current code, using re.search
first request on long running process / cgi: 1.29024196
second request on long running process: 1.21046686
Aluminum:~/Desktop nir$ py time_antispam.py compiled
testing new code, using few big compiled re objects
first request on long running process / cgi: 1.26842117
second request on long running process: 0.20568013
Aluminum:~/Desktop nir$ py time_antispam.py compiled-links
testing new code, using compiled re objects on page links only
found 3 links
first request on long running process / cgi: 1.07657290
found 3 links
second request on long running process: 0.00885296

And here are the results for a big page, MoinMoinQuestions:

Aluminum:~/Desktop nir$ py time_antispam.py current
testing current code, using re.search
first request on long running process / cgi: 9.15815902
second request on long running process: 9.08648396
Aluminum:~/Desktop nir$ py time_antispam.py compiled
testing new code, using few big compiled re objects
first request on long running process / cgi: 8.97461605
second request on long running process: 7.92046213
Aluminum:~/Desktop nir$ py time_antispam.py compiled-links
testing new code, using compiled re objects on page links only
found 49 links
first request on long running process / cgi: 1.27297783
found 49 links
second request on long running process: 0.19782519

Summary

Checking links only is clearly the fastest approach: on the big MoinMoinQuestions page, the second request drops from about 9.09 seconds with the current code to about 0.20 seconds.

Here is the test code: time_antispam.py

-- NirSoffer 2004-12-10 12:57:41
