Attachment 'UrlGrab.py'
"""
MoinMoin - UrlGrab Macro

Get content from a URL and insert it into the wiki page.

Also allows access to pages on sites requiring cookie-remembered
logins.

ATTENTION: although it is possible to hide login information from the
wiki source text, absolute security is not guaranteed. Use at your
own risk.

To do:
- more tests, especially with [] of filters, separators
- handle session id when passed in URL

@copyright: Pascal Bauermeister <pascal DOT bauermeister AT hispeed DOT ch>
@license: GPL

Updates:

* [v0.1.1] 2006/01/05
  Added CookieCopy

* [v0.1.0] 2005/12/97
  Original version

----

Usage:
  [[ UrlGrab ]]
  [[ UrlGrab (KEYWORD=VALUE [, ...] ) ]]

If no arguments are given, the usage is inserted in the HTML result.
Possible keywords:

Help = 0, 1, 2
  Displays 1:short or 2:full help in the page.
  Default: 0 (i.e. no help).

Url = 'STRING' or '$varname'
  URL of the page to grab.
  Mandatory.

LoginUrl = 'STRING' or '$varname'
LoginForm = {'name': 'value', ...}
  URL of the page to visit to perform a login, and the form fields.
  Default: empty (i.e. do not perform login).

CookieCopy = 'STRING' or '$varname'
  Creates a new cookie by duplicating an existing one.
  Form:
    ExistingCookieName NewName Value
  Example:
    Bugzilla_login COLUMNLIST priority assigned_to status_whiteboard short_desc

Debug = 0 or 1
  Dump the fetched and filtered content to display the HTML. Useful
  for tuning filters and for reverse-engineering login forms.
  Default: 0 (i.e. no debug).

Encoding = 'STRING'
  Specifies the name of the encoding used by the source HTML.

Separator = 'HTML_STRING' or '$varname'
  HTML text inserted between matches, if the filter includes lists or tuples.
  Default: empty (i.e. no separator).
Filter = 'FILTER_STRING', or list of Filter, or tuple of Filter,
         or '$varname'
  A filter string has one of these forms:
    REGEX : if no match, use input and stop processing (default)
    *S*REGEX : if no match, use input and stop processing (default)
    *C*REGEX : if no match, use input and continue processing
    *s*REGEX : if no match, just stop processing
    *c*REGEX : if no match, just continue processing
    *TEXT*REGEX : if no match, fail with TEXT as error message
  The prefix can also be e.g. *=s* (etc.), in which case a
  case-sensitive match is done.

  A regex may contain expressions between ()'s, in which case the
  result will be the concatenation of the matches between ()'s.

  It is possible to chain filters as follows:
  - Tuple of filters, i.e. filters between ()'s:
    the filters are applied in sequence, until one fails and
    requires stopping; a filter in the sequence can be a string, a
    list or a tuple.
  - List of filters, i.e. filters between []'s:
    the filters are applied in parallel; results are concatenated;
    a filter in the list can be a string, a list or a tuple.

  The filter parameter is mandatory.
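The chaining rules above can be tried in isolation. The sketch below is a Python 3 illustration of the semantics only (not the macro's actual code); the `run` helper and the sample HTML are made up, and the `*X*` failure options are omitted:

```python
import re

def run(filters, text):
    """Sketch of the chaining semantics: tuples chain sequentially,
    lists fan out in parallel, strings are regexes whose ()-groups
    are concatenated."""
    if isinstance(filters, tuple):   # sequential: feed each result forward
        for f in filters:
            text = run(f, text)
        return text
    if isinstance(filters, list):    # parallel: concatenate all results
        return "".join(run(f, text) for f in filters)
    return "".join(re.findall(filters, text, re.S | re.I))

html = "<body><b>bug 1</b> and <i>bug 2</i></body>"
# sequential: keep the body, then keep only the bold part
print(run(("<body>(.*)</body>", "<b>(.*?)</b>"), html))   # -> bug 1
# parallel: bold and italic parts, concatenated
print(run(["<b>(.*?)</b>", "<i>(.*?)</i>"], html))        # -> bug 1bug 2
```

In the macro itself, the pieces collected by a list filter are joined with the Separator option rather than with an empty string.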

Keywords can also be given in upper or lower case, or abbreviated.
Example: LoginUrl, loginurl, LOGINURL, lu, LU, etc.

Some values may be a string beginning with '$', in which case the rest
of the value is a variable name, whose value is defined in the
wikiconfig.py file as follows:
  class Macro_UrlGrab:
    my_variable1 = "my string value"
    my_variable2 = {"my": "dict", "value": ""}
This allows defining confidential values (like credentials) that are
hidden in wikiconfig.py, known only to the wiki site admin, and used
unrevealed in the wiki pages.
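The '$' resolution described above can be sketched as follows (Python 3; only the class name `Macro_UrlGrab` comes from the macro, the `resolve` helper and variable values are illustrative):

```python
class Macro_UrlGrab:                     # as it would appear in wikiconfig.py
    my_variable1 = "my string value"
    my_variable2 = {"my": "dict", "value": ""}

# collect the class attributes, dropping __ system members
conf_vars = {k: v for k, v in Macro_UrlGrab.__dict__.items()
             if not k.startswith('__')}

def resolve(arg):
    """Return the config value for '$name' arguments, others unchanged."""
    if isinstance(arg, str) and arg.startswith('$'):
        return conf_vars[arg[1:]]        # KeyError -> unknown conf var
    return arg

print(resolve("$my_variable1"))   # -> my string value
print(resolve("plain text"))      # -> plain text
```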

----

Sample 1: Grab a bugzilla page

Wiki page:
  ...
  [[UrlGrab(LoginUrl="$bz_login_url", Filter="$bz_filter", URL="http://my.bugzilla.site/cgi-bin/bugzilla/buglist.cgi?bug_status=__open__")]]
  ...

wikiconfig.py:
  ...
  class Macro_UrlGrab:
      # Bugzilla login URL to a generic account:
      bz_login_url = "http://my.bugzilla.site/cgi-bin/bugzilla/query.cgi?GoAheadAndLogIn=1&Bugzilla_login=lucky.starr@asimov.com&Bugzilla_password=SpaceRanger"
      # chained filters to keep only the buglist table:
      bz_filter = (
          # keep bugs table:
          '(<table class="bz_buglist".*)<div id="footer">',
          # remove footer box:
          '(.*)<table>.*action="long_list.cgi">.*</table>'
      )

"""

# Imports
import os, re, sys, StringIO, urllib, urllib2
from string import ascii_lowercase, maketrans
from MoinMoin import config, wikiutil, version
from MoinMoin.Page import Page
from MoinMoin.parser import wiki
from MoinMoin.action import AttachFile

Dependencies = ["time"] # macro cannot be cached
FAKETRANS = maketrans ("","")

###############################################################################
# Macro general utilities
###############################################################################

_trace = []

def escape (str):
    return str.replace ('&','&amp;').replace ('<','&lt;').replace ('>','&gt;')


def TRACE (message=None):
    return ## COMMENT ME TO ENABLE TRACES !!!
    global _trace
    if message is None: _trace = []
    else: _trace.append (message)


def GET_TRACE ():
    if len (_trace) == 0: return ""
    return "<pre>%s</pre>" % escape ('\n'.join (_trace))


def _delparam (keyword, params):
    value = params [keyword]
    del params [keyword]
    return value


def _param_get (params, spec, default):

    """Returns the value for a parameter, if specified with one of
    several acceptable keyword names, or returns its default value if
    it is missing from the macro call. If the parameter is specified,
    it is removed from the list, so that remaining params can be
    signalled as unknown"""

    # param name is literal ?
    if params.has_key (spec): return _delparam (spec, params)

    # param name is all lower or all upper ?
    lspec = spec.lower ()
    if params.has_key (lspec): return _delparam (lspec, params)
    uspec = spec.upper ()
    if params.has_key (uspec): return _delparam (uspec, params)

    # param name is abbreviated ?
    cspec = spec [0].upper () + spec [1:] # capitalize 1st letter
    cspec = cspec.translate (FAKETRANS, ascii_lowercase)
    if params.has_key (cspec): return _delparam (cspec, params)
    cspec = cspec.lower ()
    if params.has_key (cspec): return _delparam (cspec, params)

    # nope: return default value
    return default
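The spellings accepted by `_param_get` can be listed in isolation. A Python 3 sketch (where `str.maketrans` replaces the Python 2 `string.maketrans`; the `keyword_forms` helper is illustrative, not part of the macro):

```python
from string import ascii_lowercase

def keyword_forms(spec):
    """All spellings accepted for a keyword: literal, lower, upper,
    and the capital-letter abbreviation (upper and lower)."""
    cap = spec[0].upper() + spec[1:]
    # drop the lowercase letters to keep only the capitals, e.g. 'LU'
    abbrev = cap.translate(str.maketrans('', '', ascii_lowercase))
    return [spec, spec.lower(), spec.upper(), abbrev, abbrev.lower()]

print(keyword_forms('LoginUrl'))
# -> ['LoginUrl', 'loginurl', 'LOGINURL', 'LU', 'lu']
```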


def _usage (full = False):

    """Returns the interesting part of the module's doc"""

    if full: return __doc__

    lines = __doc__.replace ('\\n', '\\\\n').splitlines ()
    start = 0
    end = len (lines)
    for i in range (end):
        if lines [i].strip ().lower () == "usage:":
            start = i
            break
    for i in range (start, end):
        if lines [i].startswith ('--'):
            end = i
            break
    return '\n'.join (lines [start:end])


def _re_compile (text, name):
    try:
        return re.compile (text, re.IGNORECASE)
    except Exception, msg:
        raise _Error ("%s for regex argument %s: '%s'" % (msg, name, text))


class _Error (Exception): pass


def execute (macro, text, args_re=None):
    try: res = _execute (macro, text)
    except _Error, msg:
        return """
%s
<p><strong class="error">
Error: macro UrlGrab: %s</strong> </p>
""" % (GET_TRACE (), msg)
    return res


###############################################################################
# Macro specific utilities
###############################################################################

# build a regex to match text within a given HTML tag
def within (tag):
    return '<%s.*?>(.*)</%s>' % (tag, tag)
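The pattern built by `within` can be exercised on its own (Python 3 sketch; the sample HTML is made up):

```python
import re

def within(tag):
    # same pattern as in the macro: capture everything inside <tag ...>...</tag>
    return '<%s.*?>(.*)</%s>' % (tag, tag)

html = '<html><body class="x">Hello</body></html>'
match = re.search(within('body'), html, re.S | re.I)
print(match.group(1))   # -> Hello
```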


# Recursively apply filters
def apply_filters (filters, inputs):
    # inputs is a list: recurse for each input
    if inputs.__class__ == list:
        TRACE('### i:list f:%s' % `filters`)
        out = []
        for i in inputs:
            ok, res = apply_filters (filters, i)
            if res is not None: out.append (res) # create a new result branch
            if not ok: return False, [] # if one fails, cancel all
        return ok, out # done

    # apply sequentially
    elif filters.__class__ == tuple :
        TRACE('### f:():%s' % `filters`)
        for f in filters:
            ok, out = apply_filters (f, inputs)
            if not ok: break # fail
            inputs = out # prepare for next loop, using result as input
        return ok, out # sequence done

    # apply in parallel
    elif filters.__class__ == list :
        TRACE('### f:[]:%s' % `filters`)
        out = []
        for f in filters:
            ok, res = apply_filters (f, inputs)
            if res is not None: out.append (res) # create a new result branch
            if not ok: return False, [] # if one fails, cancel all
        return ok, out # done

    # filters is (hopefully) a string: execute filter
    else:
        TRACE('### f:str:%s' % `filters`)
        return apply_filter (filters, inputs)


OPT_CONT = 1
OPT_DROP = 2
OPT_CASE = 4
OPTIONS = {
    'S' : 0,
    'C' : OPT_CONT,
    's' : OPT_DROP,
    'c' : OPT_CONT + OPT_DROP,
}

def apply_filter (filt, text):
    if len (filt) == 0: return True, text
    options = 0
    fail_text = None

    # parse optional prefix
    parts = filt.split ('*')
    if parts [0] == '' and len (parts) >= 3:
        # rejoin with '*' so that regexes containing '*' survive
        prefix, filt = parts [1], '*'.join (parts [2:])
        if prefix.startswith ('='):
            options, prefix = options+OPT_CASE, prefix [1:]
        if prefix in OPTIONS.keys ():
            options += OPTIONS [prefix]
        else: fail_text = prefix

    # compile RE and apply it
    rx_opt = re.DOTALL + re.MULTILINE
    if options & OPT_CASE: pass
    else: rx_opt += re.I
    rx = re.compile (filt, rx_opt)
    res = rx.findall (text)

    # return according to options
    if len (res): return True, "".join (res) # success
    res = None
    TRACE('### apply_filter failed1')
    if fail_text: return False, fail_text # fail and return err mesg
    TRACE('### apply_filter failed2')
    if not (options & OPT_DROP):
        TRACE('### apply_filter failed3')
        res = text # return input
    return (options & OPT_CONT) != 0, res # fail and continue or stop
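The `*X*` prefix grammar handled above can be exercised in isolation. A Python 3 sketch mirroring only the parsing step (`parse_prefix` is an illustrative name, not part of the macro):

```python
def parse_prefix(filt):
    """Split a filter string into (case_sensitive, option_or_error_text, regex)."""
    parts = filt.split('*')
    if parts[0] != '' or len(parts) < 3:
        return False, 'S', filt              # no prefix: default behaviour
    # rejoin with '*' so the regex body may itself contain '*'
    prefix, regex = parts[1], '*'.join(parts[2:])
    case = prefix.startswith('=')
    if case:
        prefix = prefix[1:]
    return case, prefix, regex

print(parse_prefix('*c*foo.*bar'))   # -> (False, 'c', 'foo.*bar')
print(parse_prefix('*=s*REGEX'))     # -> (True, 's', 'REGEX')
print(parse_prefix('plain'))         # -> (False, 'S', 'plain')
```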


# unroll a list of (strings or lists) to a string (recursively)
def list2str (x, separator):
    if isinstance (x, basestring): return x # accept str and unicode
    l = []
    for i in x: l.append (list2str (i, separator))
    return separator.join (l)
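The recursion behaves as follows (a Python 3 sketch of the same unrolling; in Python 3 a plain `str` check suffices):

```python
def list2str(x, separator):
    # unroll nested lists of strings into one separator-joined string
    if isinstance(x, str):
        return x
    return separator.join(list2str(i, separator) for i in x)

print(list2str([['a', 'b'], 'c'], ', '))   # -> a, b, c
```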


# run the filters, and assemble the result
def filter_string (filters, input_str, separator):
    ok, res = apply_filters (filters, input_str)
    if res.__class__ == list:
        res = list2str (res, separator)
    return res


###############################################################################
# The "raison d'etre" of this module
###############################################################################

def _execute (macro, text):

    result = ""
    TRACE ()

    # get args
    try:
        # eval macro params
        params = eval ("(lambda **opts: opts)(%s)" % text,
                       {'__builtins__': []}, {})
    except Exception, msg:
        raise _Error ("""<pre>malformed arguments list:
%s<br>cause:
%s
</pre>
<br> usage:
<pre>%s</pre>
""" % (text, msg, _usage () ) )
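The lambda trick used above turns the macro's raw argument text into a keyword dict, with builtins disabled so the text cannot call anything. In isolation (Python 3; the sample argument text is made up):

```python
text = 'Url="http://example.com", Debug=1'
# (lambda **opts: opts) captures the keyword arguments as a dict;
# an empty __builtins__ keeps the evaluated text from calling anything
params = eval("(lambda **opts: opts)(%s)" % text, {'__builtins__': {}}, {})
print(params)   # -> {'Url': 'http://example.com', 'Debug': 1}
```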

    # by default, only keep the body; this should avoid loading stylesheets
    # (I know at last why they should be loaded in <head>! ;-)
    def_filter = (within ('html'), within ('body'))

    # see if there is a class named 'Macro_UrlGrab' in the
    # wikiconfig.py config file, and load variables defined in
    # that class
    conf_vars = {}
    if 'Macro_UrlGrab' in dir(macro.request.cfg):
        conf_vars = macro.request.cfg.Macro_UrlGrab.__dict__
        # remove system members:
        for k in conf_vars.keys ():
            if k.startswith ('__'): del conf_vars [k]

    # if an arg begins with '$', resolve it from the conf vars
    def r (arg):
        if arg.__class__ == str and arg.startswith ('$'):
            var = arg [1:]
            if conf_vars.has_key (var): return conf_vars [var]
            raise _Error ("No such conf var: '%s'" % var)
        else: return arg

    # get macro arguments
    arg_url = r (_param_get (params, 'Url', None))
    arg_filter = r (_param_get (params, 'Filter', def_filter))

    opt_login_url = r (_param_get (params, 'LoginUrl', None))
    opt_login_form = r (_param_get (params, 'LoginForm', None))

    opt_encoding = r (_param_get (params, 'Encoding', None))

    opt_cookie_copy = r (_param_get (params, 'CookieCopy', ""))

    opt_separator = r (_param_get (params, 'Separator', ''))

    opt_help = _param_get (params, 'Help', 0)
    opt_debug = _param_get (params, 'Debug', 0)

    # help ?
    if opt_help:
        return """
<p>
Macro UrlGrab usage:
<pre>%s</pre></p>
""" % _usage (opt_help==2)

    # check the args a little bit
    if len (params):
        raise _Error ("""unknown argument(s): %s
<br> usage:
<pre>%s</pre>
""" % (`params.keys ()`, _usage () ) )

    if arg_url is None:
        raise _Error ("missing 'Url' argument")

    # get ready: clean up the cookie file
    # (If I remember well, Apache queues the requests, so we should not
    # have concurrent access to that file; what about Twisted?)
    pagename = macro.formatter.page.page_name
    request = macro.request
    attdir = AttachFile.getAttachDir(request, pagename, create=1)
    cookiefile = os.path.join (attdir, "cookies.lwp")
    page_text = ""
    try: os.remove (cookiefile)
    except: pass

    # grab with cookies!
    if opt_login_url:
        try:
            # prepare login form
            if opt_login_form: form_data = urllib.urlencode (opt_login_form)
            else: form_data = None

            # prepare cookie creation by copy
            cookie_copy = opt_cookie_copy.split (" ")
            if len (cookie_copy) >= 3:
                copy_from, copy_to = cookie_copy [0:2]
                copy_val = ' '.join (cookie_copy [2:])
            else:
                copy_from = copy_to = copy_val = None

            # load login page
            dummy = urlopen_cookie (cookiefile, opt_login_url, 1, form_data,
                                    copy_from, copy_to, copy_val)

            # load page
            page_text = urlopen_cookie (cookiefile, arg_url)
        finally:
            # clean up cookie file
            try: os.remove (cookiefile)
            except: pass

    # grab w/o cookies!
    else:
        try:
            page_text = urllib2.urlopen (arg_url).read ()
        except IOError, e:
            msg = 'Could not open URL "%s"' % arg_url
            if hasattr(e, 'code'): msg += ' : %s.' % e.code
            raise _Error (msg)

    # post-process the result
    try:
        if opt_encoding:
            page_text = unicode (page_text, opt_encoding)
        else:
            page_text = unicode (page_text)
    except UnicodeDecodeError, e:
        msg = 'Could not convert to unicode'
        if opt_encoding: msg += '. Try another encoding than %s' % opt_encoding
        else: msg += '. Try calling the macro with the parameter Encoding=...'
        if hasattr(e, 'code'): msg += ' : %s.' % e.code
        raise _Error (msg)

    res = filter_string (arg_filter, page_text, opt_separator)

    if opt_debug:
        res = "<pre>%s</pre>" % escape (res)
    else:
        # set the base
        res = '<base href="%s"/>\n%s\n<base href="%s"/>\n' % (
            arg_url, # base for grabbed page
            res, # grabbed text
            macro.request.getBaseURL() # set back our base
        )

    # done
    return GET_TRACE () + res


# Taken from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302930
import os.path, urllib
def urlopen_cookie (cookiefile, theurl, terse=False, txdata=None,
                    copy_name_from=None, copy_name_to=None, copy_value=None):
    cj = None
    ClientCookie = None
    cookielib = None

    try:
        # Let's see if cookielib is available
        import cookielib
    except ImportError:
        pass
    else:
        import urllib2
        urlopen = urllib2.urlopen

        # This is a subclass of FileCookieJar that has useful load and
        # save methods
        cj = cookielib.LWPCookieJar()

        Request = urllib2.Request

    if not cookielib:
        # If importing cookielib fails, let's try ClientCookie
        try:
            import ClientCookie
        except ImportError:
            import urllib2
            urlopen = urllib2.urlopen
            Request = urllib2.Request
        else:
            urlopen = ClientCookie.urlopen
            cj = ClientCookie.LWPCookieJar()
            Request = ClientCookie.Request

    ####################################################
    # We've now imported the relevant library; whichever library is
    # being used, urlopen is bound to the right function for retrieving
    # URLs and Request is bound to the right class for creating Request
    # objects. Let's load the cookies, if they exist.

    if cj is not None:
        # now we have to install our CookieJar so that it is used as
        # the default CookieProcessor in the default opener handler
        if os.path.isfile(cookiefile):
            cj.load(cookiefile)
        if cookielib:
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            urllib2.install_opener(opener)
        else:
            opener = ClientCookie.build_opener(
                ClientCookie.HTTPCookieProcessor(cj)
            )
            ClientCookie.install_opener(opener)

    # If one of the cookie libraries is available, any call to urlopen
    # will handle cookies using the CookieJar instance we've created
    # (note that if we are using ClientCookie we haven't explicitly
    # imported urllib2).
    # Fake a user agent; some websites (like google) don't like scripts:
    txheaders = {
        'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    }
    try:
        # create a request object
        req = Request(theurl, txdata, txheaders)
        # and open it to return a handle on the url
        handle = urlopen(req)
    except IOError, e:
        msg = 'Could not open URL'
        if not terse: msg += ' "%s"' % theurl
        if hasattr(e, 'code'): msg += ' : %s.' % e.code
        raise _Error (msg)
    else:
        if 0:
            print 'Here are the headers of the page :'
            print handle.info()
        # handle.read() returns the page, handle.geturl() returns the
        # true url of the page fetched (in case urlopen has followed
        # any redirects, which it sometimes does)

    if cj is None:
        if 0:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        raise _Error ("No cookie library available. " +
                      "Please install python-clientcookie or " +
                      "python-cookielib on this server")
    else:
        # Create a new cookie out of an old one
        if copy_name_from:
            for index, cookie in enumerate(cj):
                if cookie.name == copy_name_from:
                    if cookielib:
                        new_cookie = cookielib.Cookie (
                            cookie.version,
                            copy_name_to,
                            copy_value,
                            cookie.port,
                            cookie.port_specified,
                            cookie.domain,
                            cookie.domain_specified,
                            cookie.domain_initial_dot,
                            cookie.path,
                            cookie.path_specified,
                            cookie.secure,
                            cookie.expires,
                            cookie.discard,
                            cookie.comment,
                            cookie.comment_url,
                            {})
                    elif ClientCookie:
                        new_cookie = ClientCookie.Cookie (
                            cookie.version,
                            copy_name_to,
                            copy_value,
                            cookie.port,
                            cookie.port_specified,
                            cookie.domain,
                            cookie.domain_specified,
                            cookie.domain_initial_dot,
                            cookie.path,
                            cookie.path_specified,
                            cookie.secure,
                            cookie.expires,
                            cookie.discard,
                            cookie.comment,
                            cookie.comment_url,
                            {},
                            cookie.rfc2109)
                    cj.set_cookie(new_cookie)
                    break

        if 0:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, ' : ', cookie
        # save the cookies back
        cj.save(cookiefile)
    return handle.read()
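On Python 3, the same cookie-duplication step maps onto the standard-library `http.cookiejar` (the successor of `cookielib`). A minimal sketch with made-up cookie names and values; the `make_cookie` helper is illustrative and fills only the `Cookie` constructor fields that matter here:

```python
from http.cookiejar import Cookie, LWPCookieJar

def make_cookie(name, value):
    # minimal Cookie: version, name, value, port, port_specified, domain,
    # domain_specified, domain_initial_dot, path, path_specified, secure,
    # expires, discard, comment, comment_url, rest
    return Cookie(0, name, value, None, False, 'example.com', True, False,
                  '/', True, False, None, True, None, None, {})

jar = LWPCookieJar()
jar.set_cookie(make_cookie('Bugzilla_login', 'lucky.starr'))

# duplicate an existing cookie under a new name, with a new value
for cookie in list(jar):
    if cookie.name == 'Bugzilla_login':
        jar.set_cookie(make_cookie('COLUMNLIST', 'priority assigned_to'))
        break

print(sorted(c.name for c in jar))   # -> ['Bugzilla_login', 'COLUMNLIST']
```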