This code implements three ways of reading the BadContent page and matching its data against a normal page.
- The times below can be taken as the antispam delay for one page save on my computer (1533 MHz).
- old is the code currently in use (performs worst: about 1.5 seconds in my test case).
- tw_new uses a grouped compilation algorithm; the first number it prints is the count of compiled pattern objects.
- else: is the new code suggested by AlexanderSchremmer; it needs less than 0.13 seconds in this test case.
Discussion
How about merging the else and the tw_new branches?
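One possible shape for the merge (a sketch only, not benchmarked; it keeps the #regex on / #regex off split from the else branch and feeds the regex part through the grouped compilation of tw_new, with the chunk size of 100 simply carried over):

import re

def compile_grouped(patterns, chunk=100):
    """ Join up to `chunk` patterns with '|' and compile each group """
    groups = []
    for i in range(0, len(patterns), chunk):
        groups.append(re.compile('|'.join(patterns[i:i + chunk]), re.IGNORECASE))
    return groups

def makelist(text):
    """ Split BadContent into plain strings and grouped, compiled regexes """
    slist = []
    rlist = []
    is_regex = False
    for line in text.splitlines():
        line = line.split(' # ', 1)[0].strip()
        if line == '#regex on':
            is_regex = True
        elif line == '#regex off':
            is_regex = False
        elif line and not line.startswith('#'):
            if is_regex:
                rlist.append(line)
            else:
                slist.append(line)
    return slist, compile_grouped(rlist)

Matching would then call group.search(page_text) for each compiled group and use a plain substring test for the strings, as in the else branch.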
Code
import time, re, urllib

old = False      # use old code
tw_new = True    # use tw's recommendation
                 # if both are False, the else branch (AlexanderSchremmer's code) runs

def timeit():
    return time.time()

#c = urllib.urlopen("http://moinmaster.wikiwikiweb.de:8000/BadContent?action=raw").read()
c = file(r"H:\BadContent").read()        # spam regex file

#t = urllib.urlopen("http://moinmaster.wikiwikiweb.de:8000/SyntaxReference?action=raw").read()
t = file(r"H:\SyntaxReference").read()   # page file

if old:
    def makelist(text):
        """ Split text into lines, strip them, skip # comments """
        lines = text.splitlines()
        list = []
        for line in lines:
            line = line.split(' # ', 1)[0]   # strip rest-of-line comment
            line = line.strip()
            if line and not line.startswith('#'):
                list.append(line)
        return list

elif tw_new:
    def reduce_res(rlist):
        """ Join the patterns into groups of up to 100 and compile each group """
        tlist = []
        result = []
        for pos in range(len(rlist)):
            tlist.append(rlist[pos])
            if len(tlist) >= 100 or pos + 1 == len(rlist):
                # compile with IGNORECASE here; flags passed to re.search are
                # ignored for already-compiled patterns
                result.append(re.compile('|'.join(tlist), re.IGNORECASE))
                tlist = []
        print "%i;" % (len(result),),        # number of compiled pattern objects
        return result

    def makelist(text):
        """
        Split text into lines, strip them, skip # comments;
        sort them into regular expressions and strings
        """
        lines = text.splitlines()
        rlist = []
        for line in lines:
            line = line.split(' # ', 1)[0].strip()   # strip the comment on the line
            if line and not line.startswith('#'):
                rlist.append(line)
        return ([], reduce_res(rlist))

else:
    def makelist(text):
        """
        Split text into lines, strip them, skip # comments;
        sort them into regular expressions and strings
        """
        lines = text.splitlines()
        rlist = []
        slist = []
        IsRegEx = False
        for line in lines:
            line = line.split(' # ', 1)[0].strip()   # strip the comment on the line
            if line == '#regex on':
                IsRegEx = True
            elif line == '#regex off':
                IsRegEx = False
            elif line and not line.startswith('#'):
                if IsRegEx:
                    rlist.append(line)
                else:
                    slist.append(line)
        return (slist, rlist)

b = timeit()

if old:
    blacklist = makelist(c)
    for blacklist_re in blacklist:
        match = re.search(blacklist_re, t, re.I)
        if match:
            raise StandardError
elif tw_new:
    rlist = makelist(c)[1]
    print "%f;" % (timeit() - b),            # compile time

    b = timeit()
    for blacklist_re in rlist:
        match = blacklist_re.search(t)       # patterns are pre-compiled with re.IGNORECASE
        if match:
            raise StandardError
    print "%f" % (timeit() - b)              # search time
    b = timeit()
else:
    slist, rlist = makelist(c)
    t = t.lower()
    for blacklist_re in rlist:
        match = re.search(blacklist_re, t, re.I)
        if match:
            raise StandardError
    for blacklist_str in slist:
        match = (t.find(blacklist_str.lower()) != -1)
        if match:
            raise StandardError

print timeit() - b
Comments
There is a major problem with this code: it does not use Unicode. To compare with the current code, you must use Unicode for both the page text and the bad content.
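A minimal adjustment for the test script, assuming the local test files are UTF-8 encoded (the real wiki already hands the page body around as a unicode object):

c = unicode(file(r"H:\BadContent").read(), 'utf-8')        # spam regex file, decoded
t = unicode(file(r"H:\SyntaxReference").read(), 'utf-8')   # page file, decoded

The patterns may also need to be compiled with re.UNICODE so that \w and friends match non-ASCII word characters.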
Searching only in links
I wrote a timing script to compare the current code with the new methods. I tested three setups:
- Current code, copied from antispam: create a list of patterns, then iterate over the list and run re.search with each item on the whole page body.
- Usual re optimization: create a list of compiled regular expressions, each containing up to 100 patterns or up to 100 groups (whichever limit is hit first), based on Alexander's result that about 100 patterns is the sweet spot. This should probably be checked again, because he did not use Unicode. The list is pickled to disk and loaded for each request, and then each compiled expression's search method is called on the whole page text. This saves the compile step for the second request in a long running process, because Python automatically caches compiled re objects.
- Same as 2, but instead of matching all the text, extract the links from the text first and match against the links only (see the sketch after this list).
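A rough sketch of what setup 3 amounts to (the URL-extraction regex below is a simplified stand-in, not the one used by time_antispam.py, and check_links is a hypothetical helper name):

import re

url_re = re.compile(r'(?:http|https|ftp)://\S+', re.IGNORECASE)

def check_links(page_text, compiled_groups):
    """ Run the chunked, compiled BadContent patterns over the page links only """
    links = url_re.findall(page_text)
    text = '\n'.join(links)           # usually far shorter than the whole page body
    for group in compiled_groups:
        if group.search(text):
            return True               # a spam pattern matched one of the links
    return False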
All tests try all re patterns: this is the typical case and the worst case we have, an honest user who tries to save and waits for the wiki. A spammer will get the result much faster, because a match stops the iteration.
Here are the results for the same file Alexander tested (I get much slower times on much faster hardware; I don't know why, as I have not run his code yet):
Aluminum:~/Desktop nir$ py time_antispam.py current
testing current code, using re.search
first request on long running process / cgi: 1.57658982
second request on long running process: 1.51904202

Aluminum:~/Desktop nir$ py time_antispam.py compiled
testing new code, using few big compiled re objects
first request on long running process / cgi: 1.59112692
second request on long running process: 0.52169299

Aluminum:~/Desktop nir$ py time_antispam.py compiled-links
testing new code, using compiled re objects on page links only
found 12 links
first request on long running process / cgi: 1.12022805
found 12 links
second request on long running process: 0.05015397
And here are the results for a small page, FrontPage:
Aluminum:~/Desktop nir$ py time_antispam.py current
testing current code, using re.search
first request on long running process / cgi: 1.29024196
second request on long running process: 1.21046686

Aluminum:~/Desktop nir$ py time_antispam.py compiled
testing new code, using few big compiled re objects
first request on long running process / cgi: 1.26842117
second request on long running process: 0.20568013

Aluminum:~/Desktop nir$ py time_antispam.py compiled-links
testing new code, using compiled re objects on page links only
found 3 links
first request on long running process / cgi: 1.07657290
found 3 links
second request on long running process: 0.00885296
And here are the results for a big page, MoinMoinQuestions:
Aluminum:~/Desktop nir$ py time_antispam.py current
testing current code, using re.search
first request on long running process / cgi: 9.15815902
second request on long running process: 9.08648396

Aluminum:~/Desktop nir$ py time_antispam.py compiled
testing new code, using few big compiled re objects
first request on long running process / cgi: 8.97461605
second request on long running process: 7.92046213

Aluminum:~/Desktop nir$ py time_antispam.py compiled-links
testing new code, using compiled re objects on page links only
found 49 links
first request on long running process / cgi: 1.27297783
found 49 links
second request on long running process: 0.19782519
Summary
Checking links only:
- 1-10x faster than the current code for CGI
- About 100x faster than the current code for a long running process
Here is the test code: time_antispam.py
Note: the code that fetches the pages tends to fail on MoinMoinQuestions, so I downloaded the text manually with a browser.
-- NirSoffer 2004-12-10 12:57:41