I tried to write a macro that shows (part of) a Wikipedia page in MoinMoin. Here is the prototype:
import urllib, re

def execute(macro, args):
    url = "http://de.wikipedia.org/wiki/%s" % args
    # Send a browser-like User-Agent instead of urllib's default.
    urllib.URLopener.version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    page = urllib.urlopen(url).read().decode('utf-8')
    # Non-greedy ".*?" stops at the first </p>, so only the first paragraph is captured.
    regexp = re.compile(r"<p>(?P<text>.*?)</p>", re.DOTALL)
    result = regexp.search(page)
    if result is None:
        return ""  # no <p>...</p> found, show nothing
    text = result.group("text")

    # German UI text: "<title> at Wikipedia" and "... more at Wikipedia".
    results = """<table align="right" width="100" border="0" style="background-color:#FFDDDD;">
<tr>
<td>
<b>%(args)s</b> bei Wikipedia<br>
%(text)s<br>
... mehr bei <a href="%(url)s">Wikipedia</a>
</td>
</tr>
</table>
""" % locals()
    return results
It looks like this: http://zosel.dyndns.org/testwiki/BundesVerfassungsGericht
The regex matches the whole text (everything up to the last </p>) if you remove the "?" after the ".*" in the line
regexp = re.compile(r"<p>(?P<text>.*?)</p>", re.DOTALL)
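To see the difference between the two forms, here is a small self-contained sketch. The `page` string is a made-up stand-in for a fetched page, not real Wikipedia output.

```python
import re

# Hypothetical sample standing in for a fetched Wikipedia page.
page = "<p>First paragraph.</p>\n<div>nav</div>\n<p>Second paragraph.</p>"

# Non-greedy: ".*?" stops at the first "</p>" it can reach.
lazy = re.compile(r"<p>(?P<text>.*?)</p>", re.DOTALL)
# Greedy: ".*" runs on to the last "</p>" in the page.
greedy = re.compile(r"<p>(?P<text>.*)</p>", re.DOTALL)

lazy_text = lazy.search(page).group("text")
greedy_text = greedy.search(page).group("text")

print(repr(lazy_text))    # only the first paragraph
print(repr(greedy_text))  # everything between the first <p> and the last </p>
```

So the "?" is what limits the macro to the first paragraph; without it you get the markup in between as well.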
ToDo:
- What about timeouts?
- Get it using the caching system so the data will persist if wikipedia is down.
- How to rewrite links?
- It shouldn't be too hard to do with a regex; unfortunately I'm just not very familiar with Python regexes. -- Adam.
Not perfect, but a beginning (look here: http://zosel.dyndns.org/testwiki/BundesVerfassungsGericht):
{{{
text = re.sub("(<a[^>]*href=[\"'])(?!http://|ftp://)", "\\1http://de.wikipedia.org", text)
}}}
This is getting closer, thanks! I added another similar regex, and touched up yours, to match "img src" links and leave the table of contents anchor links alone (see the version below for code and NelsonMandela as an example). Unfortunately the regexes don't work when there is a line feed in the middle of the link. We will probably have to parse it as XML rather than using regexes if we want any sort of reliability in the parsing. -- Adam.
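As a rough sketch of the parser-based idea, the following uses the standard library's HTML parser instead of a regex, so a line feed inside a tag no longer matters. It is written for Python 3 (in Python 2 the corresponding modules are `HTMLParser` and `urlparse`, and the code would need adapting). The names `LinkRewriter` and `rewrite_links` and the sample markup are made up for illustration; this is not what the macro currently does.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkRewriter(HTMLParser):
    """Re-emit an HTML fragment with relative href/src values made absolute."""

    def __init__(self, base_url):
        # Keep entity references as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _fix(self, tag, name, value):
        is_link = (tag == "a" and name == "href") or (tag == "img" and name == "src")
        # Leave in-page anchors such as "#History" alone.
        if is_link and value and not value.startswith("#"):
            return urljoin(self.base_url, value)
        return value

    def _emit_tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                fixed = escape(self._fix(tag, name, value), quote=True)
                parts.append('%s="%s"' % (name, fixed))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def rewrite_links(html, base_url):
    """Return html with <a href> and <img src> resolved against base_url."""
    parser = LinkRewriter(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Made-up fragment with a line feed in the middle of the first link.
sample = ('<p>See <a\n   href="/wiki/South_Africa">South Africa</a>, '
          '<a href="#History">History</a> and '
          '<img src="/images/flag.png" alt="flag"></p>')
fixed = rewrite_links(sample, "http://en.wikipedia.org/wiki/Nelson_Mandela")
print(fixed)
```

Using `urljoin` rather than prepending the host also leaves already absolute URLs untouched, so no negative lookahead for "http://" or "ftp://" is needed.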
- What does the Wikipedia license say about this?
It should be okay so long as we maintain a direct link back, which I believe is just a link. -- AdamShand
I think you should use a different User-Agent string; best would be the user's own. Maybe pass it as an argument to the script? -- TomK32 (from deWikiPedia)
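For the timeout, caching and User-Agent points from the ToDo list, here is one possible sketch in Python 3. Everything in it is an assumption for illustration: `fetch_page` is a hypothetical helper, `_cache` is a plain dict standing in for MoinMoin's caching system (which is not used here), and the caller is assumed to supply the User-Agent string, for example the visiting user's own.

```python
import urllib.request

# Hypothetical in-memory stand-in for MoinMoin's caching system.
_cache = {}


def fetch_page(url, user_agent, timeout=5, opener=urllib.request.urlopen):
    """Fetch url, falling back to the last good copy if the fetch fails.

    opener is injectable so the function can be tried without network access.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        response = opener(request, timeout=timeout)
        try:
            body = response.read().decode("utf-8")
        finally:
            response.close()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        if url in _cache:
            return _cache[url]  # Wikipedia is down: serve the stale copy
        raise
    _cache[url] = body
    return body
```

A call would look like `fetch_page("http://en.wikipedia.org/wiki/Nelson_Mandela", user_agent)`. A dict only lasts as long as the process, so a real version would need persistent storage for the data to survive a restart.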
This is very cool, I've just wasted much of my morning playing with it! I've made a slightly different English version and converted it from using tables to using div tags. The multiple div tags are there to get around badly formatted content from Wikipedia, which otherwise steals my closing div tag.
import urllib, re

def execute(macro, args):
    url = "http://en.wikipedia.org/wiki/%s" % args
    # Send a browser-like User-Agent instead of urllib's default.
    urllib.URLopener.version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    page = urllib.urlopen(url).read().decode('utf-8')
    # Greedy ".*" on purpose: capture from the first <p> to the last </p>.
    regexp = re.compile(r"<p>(?P<text>.*)</p>", re.DOTALL)
    result = regexp.search(page)
    if result is None:
        return ""  # no <p>...</p> found, show nothing
    text = result.group("text")
    # Make relative links absolute; leave "#..." table of contents anchors alone.
    text = re.sub("(<a[^>]*href=[\"'])(?!http://|ftp://|#)", "\\1http://en.wikipedia.org", text)
    # Same for image sources.
    text = re.sub("(<img[^>]*src=[\"'])(?!http://|ftp://)", "\\1http://en.wikipedia.org", text)

    # The license link is spelled out in full; appending it to the article URL gave a wrong address.
    results = """
<p align="center"><font size="-1"><i>
The below content is from <a href="http://www.wikipedia.org/">wikipedia.org</a> and is licensed under the terms of the
<a href="http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License">GNU Free Documentation License</a>.
</i></font></p>
<div><div><div style="border: 1px solid black; background-color:#f8f8f8; padding: .5em">
<h2>%(args)s</h2>
%(text)s
<p align="right">Source: <a href="%(url)s">Wikipedia:%(args)s</a></p>
</div></div></div>
""" % locals()
    return results