"""
MoinMoin - UrlGrab Macro

Get content from a URL and insert it into the wiki page.

Also allows accessing pages on sites that require cookie-remembered
logins.

ATTENTION: although it is possible to hide login information from wiki
source text, absolute security is not guaranteed. Use at your own
risk.

To do:
- more tests, especially with [] of filters, separators
- handle session id when passed in URL

@copyright: Pascal Bauermeister <pascal DOT bauermeister AT hispeed DOT ch>
@license: GPL

Updates:

  * [v0.1.1] 2006/01/05
    Added CookieCopy

  * [v0.1.0] 2005/12/97
    Original version

----

Usage:
  [[ UrlGrab ]]
  [[ UrlGrab (KEYWORD=VALUE [, ...] ) ]]

If no arguments are given, the usage is inserted in the HTML result.
Possible keywords:

  Help           = 0, 1, 2
    Displays 1:short or 2:full help in the page.
    Default: 0 (i.e. no help).

  Url            = 'STRING' or '$varname'
    URL of page to grab.
    Mandatory.

  LoginUrl       = 'STRING' or '$varname'
  LoginForm      = {'name': 'value', ...}
    URL of page to visit to perform a login, and the form fields.
    Default: empty (i.e. do not perform login).

  CookieCopy     = 'STRING' or '$varname'
    Allows creating a new cookie by duplicating an existing one.
    Form:
      ExistingCookieName NewName Value
    Example:
      Bugzilla_login COLUMNLIST priority assigned_to status_whiteboard short_desc
    
  Debug          = 0 or 1
    Dump the fetched and filtered content as escaped HTML source.
    Useful to tune filters and to reverse-engineer login forms.
    Default: 0 (i.e. no debug).

  Encoding       = 'STRING'
    Specifies the name of the encoding used by the source HTML page.
    Default: none (Python's default encoding is used).

  Separator      = 'HTML_STRING' or '$varname'
    HTML text inserted between matches, if the filter includes lists or tuples.
    Default: empty (i.e. no separator).

  Filter         = 'FILTER_STRING', or list of Filter, or tuple of Filter
                   or '$varname'
    A filter string has one of these forms:
      REGEX       : if no match, use input and stop processing (default)
      *S*REGEX    : if no match, use input and stop processing (default)
      *C*REGEX    : if no match, use input and continue processing
      *s*REGEX    : if no match, just stop processing
      *c*REGEX    : if no match, just continue processing
      *TEXT*REGEX : if no match, fail with TEXT as error message
    The prefix may also include '=' (e.g. *=s*), in which case the
    match is case-sensitive.

    A regex may contain groups between ()'s, in which case the
    result will be the concatenation of the group matches.

    It is possible to chain filters as follows:
    - Tuple of filters, i.e. filters between ()'s:
        the filters are applied in sequence, until one fails and
        requires processing to stop; a filter in the sequence can be
        a string, a list or a tuple.
    - List of filters, i.e. filters between []'s:
        the filters are applied in parallel and the results are
        concatenated; a filter in the list can be a string, a list or
        a tuple.

    Default: keep only the page body (the content of the <body> tag).
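    Example (illustrative): this chained filter first keeps the page
    body, then grabs all h1 and h2 headings in parallel; the parallel
    results are joined using Separator:
      Filter=('<body.*?>(.*)</body>',
              ['<h1>(.*?)</h1>', '<h2>(.*?)</h2>'])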
    
Keywords can also be given in all-upper or all-lower case, or
abbreviated to their capital letters.
Example: LoginUrl, loginurl, LOGINURL, lu, LU; Filter, f, F, etc.

Some values may be a string beginning with '$', in which case the rest
of the value is a variable name, whose value is defined in the
wikiconfig.py file as follows:
    class Macro_UrlGrab:
        my_variable1 = "my string value"
        my_variable2 = {"my": "dict", "value": ""}
This allows defining confidential values (like credentials) in
wikiconfig.py, known only by the wiki site admin, and using them
unrevealed in the wiki pages.
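
For instance (illustrative), a wiki page can then reference these
variables without exposing their values:
    [[UrlGrab(Url="$my_variable1", LoginForm="$my_variable2")]]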
                        
----

Sample 1: Grab a bugzilla page

  Wiki page:
    ...
    [[UrlGrab(LoginUrl="$bz_login_url", Filter="$bz_filter", URL="http://my.bugzilla.site/cgi-bin/bugzilla/buglist.cgi?bug_status=__open__")]]
    ...

  wikiconfig.py:
    ...
    class Macro_UrlGrab:
        # Bugzilla login URL to a generic account:
        bz_login_url = "http://my.bugzilla.site/cgi-bin/bugzilla/query.cgi?GoAheadAndLogIn=1&Bugzilla_login=lucky.starr@asimov.com&Bugzilla_password=SpaceRanger"
        # chained filters to keep only the buglist table:
        bz_filter = (
            # keep bugs table:
            '(<table class="bz_buglist".*)<div id="footer">',
            # remove footer box:
            '(.*)<table>.*action="long_list.cgi">.*</table>'
            )
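
Sample 2: Grab a page without login, keeping only its title
(illustrative URL and filter):

  Wiki page:
    ...
    [[UrlGrab(Url="http://www.example.com/", Filter='<title>(.*?)</title>')]]
    ...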

"""

# Imports
import os, re, urllib, urllib2
from string import ascii_lowercase, maketrans
from MoinMoin import config, wikiutil, version
from MoinMoin.Page import Page
from MoinMoin.parser import wiki
from MoinMoin.action import AttachFile

Dependencies = ["time"] # macro cannot be cached
FAKETRANS = maketrans ("","")

###############################################################################
# Macro general utilities
###############################################################################

_trace = []

def escape (text):
    return text.replace ('&','&amp;').replace ('<','&lt;').replace ('>','&gt;')


def TRACE (message=None):
    return ## COMMENT ME TO ENABLE TRACES !!!
    global _trace
    if message is None: _trace = []
    else: _trace.append (message)


def GET_TRACE ():
    if len (_trace) == 0: return ""
    return "<pre>%s</pre>" % escape ('\n'.join (_trace))


def _delparam (keyword, params):
    value = params [keyword]
    del params [keyword]
    return value


def _param_get (params, spec, default):

    """Returns the value for a parameter, if specified with one of
    several acceptable keyword names, or returns its default value if
    it is missing from the macro call. If the parameter is specified,
    it is removed from the list, so that remaining params can be
    signalled as unknown"""

    # param name is literal ?
    if params.has_key (spec): return _delparam (spec, params)

    # param name is all lower or all upper ?
    lspec = spec.lower ()
    if params.has_key (lspec): return _delparam (lspec, params)
    uspec = spec.upper ()
    if params.has_key (uspec): return _delparam (uspec, params)

    # param name is abbreviated ?
    cspec = spec [0].upper () + spec [1:] # capitalize 1st letter
    cspec = cspec.translate (FAKETRANS, ascii_lowercase)
    if params.has_key (cspec): return _delparam (cspec, params)
    cspec = cspec.lower ()
    if params.has_key (cspec): return _delparam (cspec, params)

    # nope: return default value
    return default
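
# Illustration of _param_get's abbreviation handling: for spec
# 'LoginUrl', the accepted keyword spellings are 'LoginUrl',
# 'loginurl', 'LOGINURL', 'LU' and 'lu' (the capital letters of the
# spec).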


def _usage (full = False):

    """Returns the interesting part of the module's doc"""

    if full: return __doc__

    lines = __doc__.replace ('\\n', '\\\\n'). splitlines ()
    start = 0
    end = len (lines)
    for i in range (end):
        if lines [i].strip ().lower () == "usage:":
            start = i
            break
    for i in range (start, end):
        if lines [i].startswith ('--'):
            end = i
            break
    return '\n'.join (lines [start:end])


def _re_compile (text, name):
    try:
        return re.compile (text, re.IGNORECASE)
    except Exception, msg:
        raise _Error ("%s for regex argument %s: '%s'" % (msg, name, text))


class _Error (Exception): pass


def execute (macro, text, args_re=None):
    try:     res = _execute (macro, text)
    except _Error, msg:
        return """
        %s
        <p><strong class="error">
        Error: macro UrlGrab: %s</strong> </p>
        """ % (GET_TRACE (), msg)
    return res


###############################################################################
# Macro specific utilities
###############################################################################

# build a regex to match text within a given HTML tag
def within (tag):
    return '<%s.*?>(.*)</%s>' % (tag, tag)
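# e.g. within ('body') yields the filter '<body.*?>(.*)</body>'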
    

# Recursively apply filters
def apply_filters (filters, inputs):
    # inputs is an array: recurse for each input
    if inputs.__class__ == list:
        TRACE('### i:list f:%s' % `filters`)
        out = []
        for i in inputs:
            ok, res = apply_filters (filters, i)
            if res is not None: out.append (res) # create a new result branch
            if not ok: return False, []          # if one fails, cancel all
        return ok, out                           # done            

    # apply sequentially
    elif filters.__class__ == tuple :
        TRACE('### f:():%s' % `filters`)
        for f in filters:
            ok, out = apply_filters (f, inputs)
            if not ok: break  # fail
            inputs = out      # prepare for next loop, using result as input
        return ok, out        # sequence done

    # apply in parallel
    elif filters.__class__ == list :
        TRACE('### f:[]:%s' % `filters`)
        out = []
        for f in filters:
            ok, res = apply_filters (f, inputs)
            if res is not None: out.append (res) # create a new result branch
            if not ok: return False, []          # if one fails, cancel all
        return ok, out                           # done

    # filters is (hopefully) a string: execute filter
    else:
        TRACE('### f:str:%s' % `filters`)
        return apply_filter (filters, inputs)
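
# Example (illustrative): apply_filters (('A', ['B', 'C']), text)
# applies filter 'A' to text, then applies 'B' and 'C' in parallel to
# the result and returns both branches in a list.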

OPT_CONT = 1
OPT_DROP = 2
OPT_CASE = 4
OPTIONS = {
    'S' : 0,
    'C' : OPT_CONT,
    's' : OPT_DROP,
    'c' : OPT_CONT + OPT_DROP,
    }
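
# Example (illustrative): the prefix in '*c*REGEX' selects
# OPT_CONT+OPT_DROP; a leading '=' as in '*=s*REGEX' adds OPT_CASE
# (case-sensitive matching).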

def apply_filter (filt, text):
    if len (filt) == 0: return True, text
    options = 0
    fail_text = None

    # parse optional prefix
    parts = filt.split ('*')
    if parts [0] == '' and len (parts) >= 3:
        prefix, filt = parts [1], '*'.join (parts [2:])
        if prefix.startswith ('='):
            options, prefix = options+OPT_CASE, prefix [1:]
        if prefix in OPTIONS.keys ():
            options += OPTIONS [prefix]
        else: fail_text = prefix

    # compile RE and apply it
    rx_opt = re.DOTALL + re.MULTILINE
    if options & OPT_CASE: pass
    else: rx_opt += re.I
    rx = re.compile (filt, rx_opt)
    res = rx.findall (text)

    # return according to options
    # findall returns tuples when the regex has several groups;
    # flatten them before concatenating
    if res and isinstance (res [0], tuple):
        res = [''.join (r) for r in res]
    if len (res): return True, "".join (res) # success
    res = None
    TRACE('### apply_filter failed1')
    if fail_text: return False, fail_text    # fail and return err mesg
    TRACE('### apply_filter failed2')
    if not (options & OPT_DROP):
        TRACE('### apply_filter failed3')
        res = text                           # return input
    return (options & OPT_CONT) != 0, res    # fail and continue or stop


# unroll list of (strings or lists) to a string (recursively)
def list2str (x, separator):
    if isinstance (x, basestring): return x # str or unicode
    l = [] 
    for i in x: l.append (list2str (i, separator))
    return separator.join (l)
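# e.g. list2str (['a', ['b', 'c']], ', ') yields 'a, b, c'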


# run the filters, and assemble the result
def filter_string (filters, input_str, separator):
    ok, res = apply_filters (filters, input_str)
    if res.__class__ == list:
        res = list2str (res, separator)
    return res


###############################################################################
# The "raison d'etre" of this module
###############################################################################

def _execute (macro, text):

    result = ""
    TRACE ()

    # get args
    try:
        # eval macro params
        params = eval ("(lambda **opts: opts)(%s)" % text,
                       {'__builtins__': []}, {})
    except Exception, msg:
        raise _Error ("""<pre>malformed arguments list:
        %s<br>cause:
        %s
        </pre>
        <br> usage:
        <pre>%s</pre>
        """ % (text, msg, _usage () ) )

    # by default, only keep the body; this should avoid loading stylesheets
    # (I know at last why they should be loaded in <head>! ;-)
    def_filter = (within ('html'), within ('body'))

    # see if there is a class named 'Macro_UrlGrab' in the
    # wikiconfig.py config file, and load variables defined in
    # that class        
    conf_vars = {}
    if 'Macro_UrlGrab' in dir(macro.request.cfg):
        conf_vars = macro.request.cfg.Macro_UrlGrab.__dict__
        # remove system members:
        for k in conf_vars.keys ():
            if k.startswith ('__'): del conf_vars [k]

    # if an arg begins with '$', resolve from conf vars
    def r (arg):
        if isinstance (arg, basestring) and arg.startswith ('$'):
            var = arg [1:]
            if conf_vars.has_key (var): return conf_vars [var]
            raise _Error ("No such conf var: '%s'" % var)
        else: return arg

    # get macro arguments
    arg_url             = r (_param_get (params, 'Url',          None))
    arg_filter          = r (_param_get (params, 'Filter',       def_filter))

    opt_login_url       = r (_param_get (params, 'LoginUrl',     None))
    opt_login_form      = r (_param_get (params, 'LoginForm',    None))

    opt_encoding        = r (_param_get (params, 'Encoding',     None))

    opt_cookie_copy     = r (_param_get (params, 'CookieCopy',   ""))

    opt_separator       = r (_param_get (params, 'Separator',    ''))

    opt_help            =   _param_get (params, 'Help',         0)
    opt_debug           =   _param_get (params, 'Debug',        0)
        
    # help ?
    if opt_help:
        return """
        <p>
        Macro UrlGrab usage:
        <pre>%s</pre></p>
        """ % _usage (opt_help==2)

    # check the args a little bit
    if len (params):
        raise _Error ("""unknown argument(s): %s
        <br> usage:
        <pre>%s</pre>
        """ % (`params.keys ()`, _usage () ) )

    if arg_url is None:
        raise _Error ("""missing 'Url' argument
        <br> usage:
        <pre>%s</pre>
        """ % _usage () )

    # get ready: clean up cookie file
    # (If I remember well, Apache queues the requests, so we should not
    # have concurrent access to that file; what about Twisted?)
    pagename = macro.formatter.page.page_name
    request = macro.request
    attdir = AttachFile.getAttachDir(request, pagename, create=1)
    cookiefile = os.path.join (attdir, "cookies.lwp")
    page_text = ""
    try: os.remove (cookiefile)
    except OSError: pass

    # grab with cookies!
    if opt_login_url:
        try:
            # prepare login form
            if opt_login_form: form_data = urllib.urlencode (opt_login_form)
            else: form_data = None

            # prepare cookie creation by copy
            cookie_copy = opt_cookie_copy.split (" ")
            if len (cookie_copy) >=3:
                copy_from, copy_to = cookie_copy [0:2]
                copy_val = ' '.join (cookie_copy [2:])
            else:
                copy_from = copy_to = copy_val = None

            # load login page
            dummy = urlopen_cookie (cookiefile, opt_login_url, 1, form_data,
                                    copy_from, copy_to, copy_val)

            # load page
            page_text = urlopen_cookie (cookiefile, arg_url)
        finally:
            # clean up cookie file
            try: os.remove (cookiefile)
            except OSError: pass

    # grab w/o cookies!
    else:
        try:
            page_text = urllib2.urlopen (arg_url).read ()
        except IOError, e:
            msg = 'Could not open URL "%s"' % arg_url
            if hasattr(e, 'code'): msg += ' : %s.' % e.code
            raise _Error (msg)
                                

    # post-process the result
    try:
        if opt_encoding:
            page_text = unicode (page_text, opt_encoding)
        else:
            page_text = unicode (page_text)
    except UnicodeDecodeError, e:
        msg = 'Could not convert to unicode'
        if opt_encoding: msg += '. Try another encoding than %s' % opt_encoding
        else: msg += '. Try calling the macro with the parameter Encoding=...'
        if hasattr(e, 'code'): msg += ' : %s.' % e.code
        raise _Error (msg)

    res = filter_string (arg_filter, page_text, opt_separator)

                
    if opt_debug:
        res = "<pre>%s</pre>" % escape (res)
    else:
        # set the base
        res = '<base href="%s"/>\n%s\n<base href="%s"/>\n' % (
            arg_url,                   # base for grabbed page
            res,                       # grabbed text
            macro.request.getBaseURL() # set back our base
            )
        
    # done
    return GET_TRACE () + res


# Taken from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302930
def urlopen_cookie (cookiefile, theurl, terse=False, txdata=None,
                    copy_name_from=None, copy_name_to=None, copy_value=None):
    cj = None
    ClientCookie = None
    cookielib = None

    try:
        # Let's see if cookielib is available
        import cookielib            
    except ImportError:
        pass
    else:
        import urllib2    
        urlopen = urllib2.urlopen

        # This is a subclass of FileCookieJar that has useful load and
        # save methods
        cj = cookielib.LWPCookieJar()

        Request = urllib2.Request

    if not cookielib:
        # If importing cookielib fails, let's try ClientCookie
        try:                                            
            import ClientCookie 
        except ImportError:
            import urllib2
            urlopen = urllib2.urlopen
            Request = urllib2.Request
        else:
            urlopen = ClientCookie.urlopen
            cj = ClientCookie.LWPCookieJar()
            Request = ClientCookie.Request
            
    ####################################################
    # We've now imported the relevant library; whichever library is
    # being used, urlopen is bound to the right function for
    # retrieving URLs and Request is bound to the right class for
    # creating request objects. Let's load the cookies, if they
    # exist.
        
    if cj is not None:
        # now we have to install our CookieJar so that it is used as
        # the default CookieProcessor in the default opener handler
        if os.path.isfile(cookiefile):
            cj.load(cookiefile)
        if cookielib:
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            urllib2.install_opener(opener)
        else:
            opener = ClientCookie.build_opener(
                ClientCookie.HTTPCookieProcessor(cj)
                )
            ClientCookie.install_opener(opener)
    
    # If one of the cookie libraries is available, any call to urlopen
    # will now handle cookies using the CookieJar instance we've
    # created (note that if we are using ClientCookie we haven't
    # explicitly imported urllib2).
    # Fake a user agent; some websites (like Google) don't like scripts.
    txheaders =  {
        'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        }
    try:
        # create a request object
        req = Request(theurl, txdata, txheaders)
        # and open it to return a handle on the url
        handle = urlopen(req)
    except IOError, e:
        msg = 'Could not open URL'
        if not terse: msg += ' "%s"' % theurl
        if hasattr(e, 'code'): msg += ' : %s.' % e.code
        raise _Error (msg)
    else:
        if 0:
            print 'Here are the headers of the page :'
            print handle.info()
        # handle.read() returns the page, handle.geturl() returns the
        # true url of the page fetched (in case urlopen has followed
        # any redirects, which it sometimes does)

    if cj is None:
        if 0:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        raise _Error ("No cookie library available. " +
                      "Please install python-clientcookie or " +
                      "python-cookielib on this server")
    else:
        # Create new cookie out of an old one
        if copy_name_from:
            for index, cookie in enumerate(cj):
                if cookie.name == copy_name_from:
                    if cookielib:
                        new_cookie = cookielib.Cookie (
                            cookie.version,
                            copy_name_to,
                            copy_value,
                            cookie.port,
                            cookie.port_specified,
                            cookie.domain,
                            cookie.domain_specified,
                            cookie.domain_initial_dot,
                            cookie.path,
                            cookie.path_specified,
                            cookie.secure,
                            cookie.expires,
                            cookie.discard,
                            cookie.comment,
                            cookie.comment_url,
                            {})
                    elif ClientCookie:
                        new_cookie = ClientCookie.Cookie (
                            cookie.version,
                            copy_name_to,
                            copy_value,
                            cookie.port,
                            cookie.port_specified,
                            cookie.domain,
                            cookie.domain_specified,
                            cookie.domain_initial_dot,
                            cookie.path,
                            cookie.path_specified,
                            cookie.secure,
                            cookie.expires,
                            cookie.discard,
                            cookie.comment,
                            cookie.comment_url,
                            {},
                            cookie.rfc2109)
                    cj.set_cookie(new_cookie)
                    break
    
        if 0:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, '  :  ', cookie        
        # save the cookies back
        cj.save(cookiefile)
    return handle.read()
