attachment:UrlGrab.py of MacroMarket/UrlGrab

Attachment 'UrlGrab.py'

   1 """
   2 MoinMoin - UrlGrab Macro
   3 
   4 Get content from a URL and insert it into the wiki page.
   5 
   6 Can also access pages on sites that require cookie-remembered
   7 logins.
   8 
   9 ATTENTION: although it is possible to hide login information from wiki
  10 source text, absolute security is not guaranteed. Use at your own
  11 risk.
  12 
  13 To do:
  14 - more tests, especially with [] of filters, separators
  15 - handle session id when passed in URL
  16 
  17 @copyright: Pascal Bauermeister <pascal DOT bauermeister AT hispeed DOT ch>
  18 @license: GPL
  19 
  20 Updates:
  21 
  22   * [v0.1.1] 2006/01/05
  23     Added CookieCopy
  24 
  25   * [v0.1.0] 2005/12/97
  26     Original version
  27 
  28 ----
  29 
  30 Usage:
  31   [[ UrlGrab ]]
  32   [[ UrlGrab (KEYWORD=VALUE [, ...] ) ]]
  33 
  34 If no arguments are given, the usage is inserted in the HTML result.
  35 Possible keywords:
  36 
  37   Help           = 0, 1, 2
  38     Displays 1:short or 2:full help in the page.
  39     Default: 0 (i.e. no help).
  40 
  41   Url            = 'STRING' or '$varname'
  42     URL of page to grab.
  43     Mandatory.
  44 
  45   LoginUrl       = 'STRING' or '$varname'
  46   LoginForm      = {'name': 'value', ...}
  47     URL of page to visit to perform a login, and the form fields.
  48     Default: empty (i.e. do not perform login).
  49 
  50   CookieCopy     = 'STRING' or '$varname'
  51     Creates a new cookie by duplicating an existing one.
  52     Form:
  53       ExistingCookieName NewName Value
  54     Example:
  55       Bugzilla_login COLUMNLIST priority assigned_to status_whiteboard short_desc
  56     
  57   Debug          = 0 or 1
  58     Dumps the fetched and filtered content as escaped HTML source. Useful
  59     to tune filters and to reverse-engineer login forms.
  60     Default: 0 (i.e. no debug).
  61 
  62   Encoding       = 'STRING'
  63     Specifies the name of the encoding used by the source HTML.
  64 
  65   Separator      = 'HTML_STRING' or '$varname'
  66     HTML text inserted between matches, if the filter includes lists or tuples.
  67     Default: empty (i.e. no separator).
  68 
  69   Filter         = 'FILTER_STRING', or list of Filter, or tuple of Filter
  70                    or '$varname'
  71     A filter string has one of these forms:
  72       REGEX       : if no match, use input and stop processing (default)
  73       *S*REGEX    : if no match, use input and stop processing (default)
  74       *C*REGEX    : if no match, use input and continue processing
  75       *s*REGEX    : if no match, just stop processing
  76       *c*REGEX    : if no match, just continue processing
  77       *TEXT*Regex : if no match, fail with TEXT as error message
  78     The prefix can also be e.g. *=s* (etc) in which case a case-sensitive
  79     match is done.
  80 
  81     A regex may contain expressions between ()'s, in which case the
  82     result will be the concatenation of the matches between ()'s.
  83 
  84     It is possible to chain filters as follows:
  85     - Tuple of filters, i.e. filters between ()'s:
  86         the filters are applied in sequence, until one fails and
  87         requires processing to stop; a filter in the sequence can be a
  88         string, a list or a tuple.
  89     - List of filters, i.e. filters between []'s:   
  90         the filters are applied in parallel; results are concatenated;
  91         a filter in the list can be a string, a list or a tuple.
  92 
  93     Default: keep only the body of the grabbed HTML page.
  94     
  95 Keywords can also be given in upper or lower case, or abbreviated.
  96 Example: LoginUrl, loginurl, LOGINURL, LU, lu, Filter, F, f, etc.
  97 
  98 Some values may be a string beginning with '$', in which case the rest
  99 of the value is a variable name, whose value is defined in the
 100 wikiconfig.py file as follows:
 101     class Macro_UrlGrab:
 102         my_variable1 = "my string value"
 103         my_variable2 = {"my": "dict", "value": ""}
 104 This allows confidential values (like credentials) to be kept hidden in
 105 the wikiconfig.py, known only to the wiki site admin, and used
 106 unrevealed in the wiki pages.
 107                         
 108 ----
 109 
 110 Sample 1: Grab a bugzilla page
 111 
 112   Wiki page:
 113     ...
 114     [[UrlGrab(LoginUrl="$bz_login_url", Filter="$bz_filter", URL="http://my.bugzilla.site/cgi-bin/bugzilla/buglist.cgi?bug_status=__open__")]]
 115     ...
 116 
 117   wikiconfig.py:
 118     ...
 119     class Macro_UrlGrab:
 120         # Bugzilla login URL to a generic account:
 121         bz_login_url = "http://my.bugzilla.site/cgi-bin/bugzilla/query.cgi?GoAheadAndLogIn=1&Bugzilla_login=lucky.starr@asimov.com&Bugzilla_password=SpaceRanger"
 122         # chained filters to keep only the buglist table:
 123         bz_filter = (
 124             # keep bugs table:
 125             '(<table class="bz_buglist".*)<div id="footer">',
 126             # remove footer box:
 127             '(.*)<table>.*action="long_list.cgi">.*</table>'
 128             )
 129 
 130 """
 131 
 132 # Imports
 133 import os, re, sys, StringIO, urllib, urllib2
 134 from string import ascii_lowercase, maketrans
 135 from MoinMoin import config, wikiutil, version
 136 from MoinMoin.Page import Page
 137 from MoinMoin.parser import wiki
 138 from MoinMoin.action import AttachFile
 139 from MoinMoin import config
 140 
 141 Dependencies = ["time"] # macro cannot be cached
 142 FAKETRANS = maketrans ("","")
 143 
 144 ###############################################################################
 145 # Macro general utilities
 146 ###############################################################################
 147 
 148 _trace = []
 149 
 150 def escape (str):
 151     return str.replace ('&','&amp;').replace ('<','&lt;').replace ('>','&gt;')
 152 
 153 
 154 def TRACE (message=None):
 155     return ## COMMENT ME TO ENABLE TRACES !!!
 156     global _trace
 157     if message is None: _trace = []
 158     else: _trace.append (message)
 159 
 160 
 161 def GET_TRACE ():
 162     if len (_trace) == 0: return ""
 163     return "<pre>%s</pre>" % escape ('\n'.join (_trace))
 164 
 165 
 166 def _delparam (keyword, params):
 167     value = params [keyword]
 168     del params [keyword]
 169     return value
 170 
 171 
 172 def _param_get (params, spec, default):
 173 
 174     """Returns the value for a parameter, if specified with one of
 175     several acceptable keyword names, or returns its default value if
 176     it is missing from the macro call. If the parameter is specified,
 177     it is removed from the list, so that remaining params can be
 178     signalled as unknown"""
 179 
 180     # param name is literal ?
 181     if params.has_key (spec): return _delparam (spec, params)
 182 
 183     # param name is all lower or all upper ?
 184     lspec = spec.lower ()
 185     if params.has_key (lspec): return _delparam (lspec, params)
 186     uspec = spec.upper ()
 187     if params.has_key (uspec): return _delparam (uspec, params)
 188 
 189     # param name is abbreviated ?
 190     cspec = spec [0].upper () + spec [1:] # capitalize 1st letter
 191     cspec = cspec.translate (FAKETRANS, ascii_lowercase)
 192     if params.has_key (cspec): return _delparam (cspec, params)
 193     cspec = cspec.lower ()
 194     if params.has_key (cspec): return _delparam (cspec, params)
 195 
 196     # nope: return default value
 197     return default
 198 
 199 
 200 def _usage (full = False):
 201 
 202     """Returns the interesting part of the module's doc"""
 203 
 204     if full: return __doc__
 205 
 206     lines = __doc__.replace ('\\n', '\\\\n'). splitlines ()
 207     start = 0
 208     end = len (lines)
 209     for i in range (end):
 210         if lines [i].strip ().lower () == "usage:":
 211             start = i
 212             break
 213     for i in range (start, end):
 214         if lines [i].startswith ('--'):
 215             end = i
 216             break
 217     return '\n'.join (lines [start:end])
 218 
 219 
 220 def _re_compile (text, name):
 221     try:
 222         return re.compile (text, re.IGNORECASE)
 223     except Exception, msg:
 224         raise _Error ("%s for regex argument %s: '%s'" % (msg, name, text))
 225 
 226 
 227 class _Error (Exception): pass
 228 
 229 
 230 def execute (macro, text, args_re=None):
 231     try:     res = _execute (macro, text)
 232     except _Error, msg:
 233         return """
 234         %s
 235         <p><strong class="error">
 236         Error: macro UrlGrab: %s</strong> </p>
 237         """ % (GET_TRACE (), msg)
 238     return res
 239 
 240 
 241 ###############################################################################
 242 # Macro specific utilities
 243 ###############################################################################
 244 
 245 # build a regex to match text within a given HTML tag
 246 def within (tag):
 247     return '<%s.*?>(.*)</%s>' % (tag, tag)
 248     
 249 
 250 # Recursively apply filters
 251 def apply_filters (filters, inputs):
 252     # inputs is an array: recurse for each input
 253     if inputs.__class__ == list:
 254         TRACE('### i:list f:%s' % `filters`)
 255         out = []
 256         for i in inputs:
 257             ok, res = apply_filters (filters, i)
 258             if res is not None: out.append (res) # create a new result branch
 259             if not ok: return False, []          # if one fails, cancel all
 260         return ok, out                           # done            
 261 
 262     # apply sequentially
 263     elif filters.__class__ == tuple :
 264         TRACE('### f:():%s' % `filters`)
 265         for f in filters:
 266             ok, out = apply_filters (f, inputs)
 267             if not ok: break  # fail
 268             inputs = out      # prepare for next loop, using result as input
 269         return ok, out        # sequence done
 270 
 271     # apply in parallel
 272     elif filters.__class__ == list :
 273         TRACE('### f:[]:%s' % `filters`)
 274         out = []
 275         for f in filters:
 276             ok, res = apply_filters (f, inputs)
 277             if res is not None: out.append (res) # create a new result branch
 278             if not ok: return False, []          # if one fails, cancel all
 279         return ok, out                           # done
 280 
 281     # filters is (hopefully) a string: execute filter
 282     else:
 283         TRACE('### f:str:%s' % `filters`)
 284         return apply_filter (filters, inputs)
 285 
 286 OPT_CONT = 1
 287 OPT_DROP = 2
 288 OPT_CASE = 4
 289 OPTIONS = {
 290     'S' : 0,
 291     'C' : OPT_CONT,
 292     's' : OPT_DROP,
 293     'c' : OPT_CONT + OPT_DROP,
 294     }
 295 
 296 def apply_filter (filt, text):
 297     if len (filt) == 0: return True, text
 298     options = 0
 299     fail_text = None
 300 
 301     # parse optional prefix
 302     parts = filt.split ('*')
 303     if parts [0] == '' and len (parts) >= 3:
 304         prefix, filt = parts [1], '*'.join (parts [2:])
 305         if prefix.startswith ('='):
 306             options, prefix = options+OPT_CASE, prefix [1:]
 307         if prefix in OPTIONS.keys ():
 308             options += OPTIONS [prefix]
 309         else: fail_text = prefix
 310 
 311     # compile RE and apply it
 312     rx_opt = re.DOTALL + re.MULTILINE
 313     if options & OPT_CASE: pass
 314     else: rx_opt += re.I
 315     rx = re.compile (filt, rx_opt)
 316     res = rx.findall (text)
 317 
 318     # return according to options
 319     if len (res): return True, "".join (res) # success
 320     res = None
 321     TRACE('### apply_filter failed1')
 322     if fail_text: return False, fail_text    # fail and return err mesg
 323     TRACE('### apply_filter failed2')
 324     if not (options & OPT_DROP):
 325         TRACE('### apply_filter failed3')
 326         res = text                           # return input
 327     return (options & OPT_CONT) != 0, res    # fail and continue or stop
 328 
 329 
 330 # unroll list of (strings or lists) to a string (recursively)
 331 def list2str (x, separator):
 332     if x.__class__ == str : return x
 333     l = [] 
 334     for i in x: l.append (list2str (i, separator))
 335     return separator.join (l)
 336 
 337 
 338 # run the filters, and assemble the result
 339 def filter_string (filters, input_str, separator):
 340     ok, res = apply_filters (filters, input_str)
 341     if res.__class__ == list:
 342         res = list2str (res, separator)
 343     return res
 344 
 345 
 346 ###############################################################################
 347 # The "raison d'etre" of this module
 348 ###############################################################################
 349 
 350 def _execute (macro, text):
 351 
 352     result = ""
 353     TRACE ()
 354 
 355     # get args
 356     try:
 357         # eval macro params
 358         params = eval ("(lambda **opts: opts)(%s)" % text,
 359                        {'__builtins__': []}, {})
 360     except Exception, msg:
 361         raise _Error ("""<pre>malformed arguments list:
 362         %s<br>cause:
 363         %s
 364         </pre>
 365         <br> usage:
 366         <pre>%s</pre>
 367         """ % (text, msg, _usage () ) )
 368 
 369     # by default, only keep the body; this should avoid loading stylesheets
 370     # (I know at last why they should be loaded in <head>! ;-)
 371     def_filter = (within ('html'), within ('body'))
 372 
 373     # see if there is a class named 'Macro_UrlGrab' in the
 374     # wikiconfig.py config file, and load variables defined in
 375     # that class        
 376     conf_vars = {}
 377     if 'Macro_UrlGrab' in dir(macro.request.cfg):
 378         conf_vars = dict (macro.request.cfg.Macro_UrlGrab.__dict__)
 379         # remove system members:
 380         for k in conf_vars.keys ():
 381             if k.startswith ('__'): del conf_vars [k]
 382 
 383     # if an arg begins with '$', resolve from conf vars
 384     def r (arg):
 385         if arg.__class__ == str and arg.startswith ('$'):
 386             var = arg [1:]
 387             if conf_vars.has_key (var): return conf_vars [var]
 388             raise _Error ("No such conf var: '%s'" % var)
 389         else: return arg
 390 
 391     # get macro arguments
 392     arg_url             = r (_param_get (params, 'Url',          None))
 393     arg_filter          = r (_param_get (params, 'Filter',       def_filter))
 394 
 395     opt_login_url       = r (_param_get (params, 'LoginUrl',     None))
 396     opt_login_form      = r (_param_get (params, 'LoginForm',    None))
 397 
 398     opt_encoding        = r (_param_get (params, 'Encoding',     None))
 399 
 400     opt_cookie_copy     = r (_param_get (params, 'CookieCopy',   ""))
 401 
 402     opt_separator       = r (_param_get (params, 'Separator',    ''))
 403 
 404     opt_help            =   _param_get (params, 'Help',         0)
 405     opt_debug           =   _param_get (params, 'Debug',        0)
 406         
 407     # help ?
 408     if opt_help:
 409         return """
 410         <p>
 411         Macro UrlGrab usage:
 412         <pre>%s</pre></p>
 413         """ % _usage (opt_help==2)
 414 
 415     # check the args a little bit
 416     if len (params):
 417         raise _Error ("""unknown argument(s): %s
 418         <br> usage:
 419         <pre>%s</pre>
 420         """ % (`params.keys ()`, _usage () ) )
 421 
 422     if arg_url is None:
 423         raise _Error ("missing 'Url' argument")
 424 
 425     # get ready: clean up cookie file
 426     # (If I remember well, Apache queues the requests, so we should not
 427     # have concurrent access to that file; what does Twisted do?)
 428     pagename = macro.formatter.page.page_name
 429     request = macro.request
 430     attdir = AttachFile.getAttachDir(request, pagename, create=1)
 431     cookiefile = os.path.join (attdir, "cookies.lwp")
 432     page_text = ""
 433     try: os.remove (cookiefile)
 434     except: pass
 435 
 436     # grab with cookies!
 437     if opt_login_url:
 438         try:
 439             # prepare login form
 440             if opt_login_form: form_data = urllib.urlencode (opt_login_form)
 441             else: form_data = None
 442 
 443             # prepare cookie creation by copy
 444             cookie_copy = opt_cookie_copy.split (" ")
 445             if len (cookie_copy) >=3:
 446                 copy_from, copy_to = cookie_copy [0:2]
 447                 copy_val = ' '.join (cookie_copy [2:])
 448             else:
 449                 copy_from = copy_to = copy_val = None
 450 
 451             # load login page
 452             dummy = urlopen_cookie (cookiefile, opt_login_url, 1, form_data,
 453                                     copy_from, copy_to, copy_val)
 454 
 455             # load page
 456             page_text = urlopen_cookie (cookiefile, arg_url)
 457         finally:
 458             # clean up cookie file
 459             try: os.remove (cookiefile)
 460             except: pass
 461 
 462     # grab w/o cookies!
 463     else:
 464         try:
 465             page_text = urllib2.urlopen (arg_url).read ()
 466         except IOError, e:
 467             msg = 'Could not open URL "%s"' % arg_url
 468             if hasattr(e, 'code'): msg += ' : %s.' % e.code
 469             raise _Error (msg)
 470                                 
 471 
 472     # post-process the result
 473     try:
 474         if opt_encoding:
 475             page_text = unicode (page_text, opt_encoding)
 476         else:
 477             page_text = unicode (page_text)
 478     except UnicodeDecodeError, e:
 479         msg = 'Could not convert to unicode'
 480         if opt_encoding: msg += '. Try an encoding other than %s' % opt_encoding
 481         else: msg += '. Try calling the macro with the parameter Encoding=...'
 482         if hasattr(e, 'code'): msg += ' : %s.' % e.code
 483         raise _Error (msg)
 484 
 485     res = filter_string (arg_filter, page_text, opt_separator)
 486 
 487                 
 488     if opt_debug:
 489         res = "<pre>%s</pre>" % escape (res)
 490     else:
 491         # set the base
 492         res = '<base href="%s"/>\n%s\n<base href="%s"/>\n' % (
 493             arg_url,                   # base for grabbed page
 494             res,                       # grabbed text
 495             macro.request.getBaseURL() # set back our base
 496             )
 497         
 498     # done
 499     return GET_TRACE () + res
 500 
 501 
 502 # Taken from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302930
 503 import os.path, urllib
 504 def urlopen_cookie (cookiefile, theurl, terse=False, txdata=None,
 505                     copy_name_from=None, copy_name_to=None, copy_value=None):
 506     cj = None
 507     ClientCookie = None
 508     cookielib = None
 509 
 510     try:
 511         # Let's see if cookielib is available
 512         import cookielib            
 513     except ImportError:
 514         pass
 515     else:
 516         import urllib2    
 517         urlopen = urllib2.urlopen
 518 
 519         # This is a subclass of FileCookieJar that has useful load and
 520         # save methods
 521         cj = cookielib.LWPCookieJar()
 522 
 523         Request = urllib2.Request
 524 
 525     if not cookielib:
 526         # If importing cookielib fails let's try ClientCookie
 527         try:                                            
 528             import ClientCookie 
 529         except ImportError:
 530             import urllib2
 531             urlopen = urllib2.urlopen
 532             Request = urllib2.Request
 533         else:
 534             urlopen = ClientCookie.urlopen
 535             cj = ClientCookie.LWPCookieJar()
 536             Request = ClientCookie.Request
 537             
 538     ####################################################
 539     # We've now imported the relevant library. Whichever library is
 540     # being used, urlopen is bound to the right function for retrieving
 541     # URLs, and Request is bound to the right function for creating
 542     # Request objects. Let's load the cookies, if they exist.
 543         
 544     if cj != None:
 545         # now we have to install our CookieJar so that it is used as
 546         # the default CookieProcessor in the default opener handler
 547         if os.path.isfile(cookiefile):
 548             cj.load(cookiefile)
 549         if cookielib:
 550             opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
 551             urllib2.install_opener(opener)
 552         else:
 553             opener = ClientCookie.build_opener(
 554                 ClientCookie.HTTPCookieProcessor(cj)
 555                 )
 556             ClientCookie.install_opener(opener)
 557     
 558     # If one of the cookie libraries is available, any call to urlopen
 559     # will handle cookies using the CookieJar instance we've created
 560     # (Note that if we are using ClientCookie we haven't explicitly
 561     # imported urllib2) as an example :
 562     # fake a user agent, some websites (like google) don't like scripts
 563     txheaders =  {
 564         'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
 565         }
 566     try:
 567         # create a request object
 568         req = Request(theurl, txdata, txheaders)
 569         # and open it to return a handle on the url
 570         handle = urlopen(req)
 571     except IOError, e:
 572         msg = 'Could not open URL'
 573         if not terse: msg += ' "%s"' % theurl
 574         if hasattr(e, 'code'): msg += ' : %s.' % e.code
 575         raise _Error (msg)
 576     else:
 577         if 0:
 578             print 'Here are the headers of the page :'
 579             print handle.info()
 580         # handle.read() returns the page, handle.geturl() returns the
 581         # true url of the page fetched (in case urlopen has followed
 582         # any redirects, which it sometimes does)
 583 
 584     if cj == None:
 585         if 0:
 586             print "We don't have a cookie library available - sorry."
 587             print "I can't show you any cookies."
 588         raise _Error ("No cookie library available. " +
 589                       "Please install python-clientcookie or " +
 590                       "python-cookielib on this server")
 591     else:
 592         # Create new cookie out of an old one
 593         if copy_name_from:
 594             for index, cookie in enumerate(cj):
 595                 if cookie.name == copy_name_from:
 596                     if cookielib:
 597                         new_cookie = cookielib.Cookie (
 598                             cookie.version,
 599                             copy_name_to,
 600                             copy_value,
 601                             cookie.port,
 602                             cookie.port_specified,
 603                             cookie.domain,
 604                             cookie.domain_specified,
 605                             cookie.domain_initial_dot,
 606                             cookie.path,
 607                             cookie.path_specified,
 608                             cookie.secure,
 609                             cookie.expires,
 610                             cookie.discard,
 611                             cookie.comment,
 612                             cookie.comment_url,
 613                             {})
 614                     elif ClientCookie:
 615                         new_cookie = ClientCookie.Cookie (
 616                             cookie.version,
 617                             copy_name_to,
 618                             copy_value,
 619                             cookie.port,
 620                             cookie.port_specified,
 621                             cookie.domain,
 622                             cookie.domain_specified,
 623                             cookie.domain_initial_dot,
 624                             cookie.path,
 625                             cookie.path_specified,
 626                             cookie.secure,
 627                             cookie.expires,
 628                             cookie.discard,
 629                             cookie.comment,
 630                             cookie.comment_url,
 631                             {},
 632                             cookie.rfc2109)
 633                     cj.set_cookie(new_cookie)
 634                     break
 635     
 636         if 0:
 637             print 'These are the cookies we have received so far :'
 638             for index, cookie in enumerate(cj):
 639                 print index, '  :  ', cookie        
 640         # save the cookies back
 641         cj.save(cookiefile)
 642     return handle.read()
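
The filter prefix rules described in the module docstring (`*S*`, `*C*`, `*s*`, `*c*`, `*TEXT*`, and the `=` case-sensitivity marker) can be tried outside MoinMoin with a small standalone sketch. This is an illustration only, written in Python 3 syntax rather than the Python 2 of the macro; the names `parse_filter`, `run_filter` and `_flatten` are hypothetical and do not exist in UrlGrab.py. It assumes the prefix is split off with `split("*", 2)` so that any `*` inside the regex itself is left intact, and that regexes with several groups should have their groups concatenated, as the docstring describes.

```python
import re

# Behaviour flags, using the same names as the macro source.
OPT_CONT = 1   # keep processing later filters after a failed match
OPT_DROP = 2   # discard the input instead of passing it through
OPT_CASE = 4   # match case-sensitively

PREFIXES = {"S": 0, "C": OPT_CONT, "s": OPT_DROP, "c": OPT_CONT | OPT_DROP}


def parse_filter(filt):
    """Split '*PREFIX*REGEX' into (options, fail_text, regex)."""
    options, fail_text = 0, None
    parts = filt.split("*", 2)           # at most: '', prefix, regex
    if len(parts) == 3 and parts[0] == "":
        prefix, filt = parts[1], parts[2]
        if prefix.startswith("="):       # '=' requests a case-sensitive match
            options |= OPT_CASE
            prefix = prefix[1:]
        if prefix in PREFIXES:
            options |= PREFIXES[prefix]
        else:
            fail_text = prefix           # the '*TEXT*REGEX' form
    return options, fail_text, filt


def _flatten(match):
    """findall() yields tuples when the regex has several groups."""
    return match if isinstance(match, str) else "".join(match)


def run_filter(filt, text):
    """Return (keep_going, result) for a single filter string."""
    options, fail_text, regex = parse_filter(filt)
    flags = re.DOTALL | re.MULTILINE
    if not options & OPT_CASE:
        flags |= re.IGNORECASE
    matches = re.findall(regex, text, flags)
    if matches:
        return True, "".join(_flatten(m) for m in matches)
    if fail_text:
        return False, fail_text          # fail with the custom message
    result = None if options & OPT_DROP else text
    return bool(options & OPT_CONT), result


page = "<html><body><b>Hello</b> world</body></html>"
print(run_filter("<b>(.*?)</b>", page))              # match found
print(run_filter("*s*<i>(.*?)</i>", page))           # no match: drop, stop
print(run_filter("*C*<i>(.*?)</i>", page))           # no match: keep input, continue
print(run_filter("*no italics*<i>(.*?)</i>", page))  # no match: custom error text
print(run_filter("*=S*<B>(.*?)</B>", page))          # case-sensitive, so no match
```

The first element of each returned pair says whether a chain of filters should carry on, and the second is the text handed to the next filter (or the error text for the `*TEXT*` form).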