Human readable heading anchors for MoinMoin
1.7 has nice heading IDs now.
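Headings in MoinMoin are identified by anchors, which is fine, but the text of the anchor id is close to unreadable. It is actually made of the prefix "head-", the SHA hex digest of the page name concatenated with the heading text, and a "-N" suffix when the same heading occurs more than once
(more or less extracted from wiki.py).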
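The scheme described above can be sketched as follows. This is a Python 3 reconstruction based on the lines the patches below remove, not the original wiki.py code: the function name `legacy_heading_id`, the `charset` default and the sample page name are assumptions made for illustration, and `hashlib.sha1` stands in for the old `sha` module.

```python
import hashlib


def legacy_heading_id(titles, page_name, title_text, charset="utf-8"):
    """Hypothetical helper: build the old-style, SHA-based heading id."""
    # The hash input is the page name followed by the heading text.
    pntt = page_name + title_text
    # Count how often this exact page/heading pair has been seen.
    titles[pntt] = titles.get(pntt, 0) + 1
    unique_id = ""
    if titles[pntt] > 1:
        unique_id = "-%d" % titles[pntt]
    # Hashing the encoded bytes sidesteps any character-set problem in the id.
    digest = hashlib.sha1(pntt.encode(charset)).hexdigest()
    return "head-" + digest + unique_id


titles = {}
first = legacy_heading_id(titles, "FrontPage", "Introduction")
second = legacy_heading_id(titles, "FrontPage", "Introduction")
print(first)   # "head-" followed by 40 hexadecimal characters
print(second)  # the same id with "-2" appended
```

The digest is always valid in an id attribute, but it tells the reader nothing about the heading it points to.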
We see that a SHA hash is used to avoid encoding issues. It is not, as I understand the code, used to avoid clashes, because identical titles will logically result in identical SHA hashes, hence the unique_id appended when clashes are found.
What I propose is that we use the title itself as the anchor id. I have written a crude patch for this that rips out illegal characters and replaces them with dashes. Unicode normalisation is also attempted, but it does not do exactly what I want.
(I would like "é" to turn into "e", but instead it turns into "e" followed by a combining acute accent.)
If someone finds a nicer solution, please mention it here...
RadomirDopieralski says: Well, maybe you could normalize first, and then just remove all non-ascii characters...
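The suggestion above (normalize first, then drop everything that is not ASCII) can be sketched in Python 3 as below. The helper name `strip_accents` is an assumption for illustration; `unicodedata.normalize` and the `"ignore"` error handler are standard library features.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose characters, then drop non-ASCII ones."""
    # NFKD splits "é" into a plain "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" discards the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


decomposed = unicodedata.normalize("NFKD", "\u00e9")
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
print(strip_accents("R\u00e9sum\u00e9 d\u00e9taill\u00e9"))
# Resume detaille
```

Characters with no ASCII base letter (CJK text, for instance) are dropped entirely by this approach, which is why the patches below need a special case for headings that end up empty.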
Patch (applies to 1.5.5a and 1.5.7)
macro/Include.py | 11 +----------
macro/TableOfContents.py | 12 ++----------
parser/wiki.py | 10 +---------
wikiutil.py | 19 +++++++++++++++++++
4 files changed, 23 insertions(+), 29 deletions(-)
--- wikiutil.py.orig Thu May 10 16:35:47 2007
+++ wikiutil.py Thu May 10 18:44:33 2007
@@ -274,6 +274,25 @@
newtext.append(part)
return " ".join(newtext)

+def unique_heading_id(headings, text):
+ """ generate an ID for a heading that is unique to this request, human-readable and HTML-compliant
+ """
+ import unicodedata
+ # ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
+ # followed by any number of letters, digits ([0-9]), hyphens ("-"),
+ # underscores ("_"), colons (":"), and periods (".").
+ # http://www.w3.org/TR/html4/types.html
+ pntt = re.sub('[^-A-Za-z0-9_:.]+', '-', unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')).lower()
+ hid = "head-" + pntt # basic heading structure
+ # count the number of times this heading is found in this request
+ headings.setdefault(pntt, 0)
+ headings[pntt] += 1
+ # spcial case: if the text is strictly non-ascii, add a number anyways so it looks nicer
+ if headings[pntt] > 1 or pntt == "-":
+ hid += '-%d' % (headings[pntt], ) # increment the heading id, to avoid duplicates
+ return re.sub('--+', '-', hid) # necessary because the last line might have added another duplicate -
+
+
########################################################################
### Storage
########################################################################
--- parser/wiki.py.orig Sat Sep 16 18:21:52 2006
+++ parser/wiki.py Thu May 10 18:42:59 2007
@@ -744,8 +744,6 @@

def _heading_repl(self, word):
"""Handle section headings."""
- import sha
-
h = word.strip()
level = 1
while h[level:level+1] == '=':
@@ -756,15 +754,9 @@
# TODO but it might still result in unpredictable results
# when included the same page multiple times
title_text = h[level:-level].strip()
- pntt = self.formatter.page.page_name + title_text
- self.titles.setdefault(pntt, 0)
- self.titles[pntt] += 1

- unique_id = ''
- if self.titles[pntt] > 1:
- unique_id = '-%d' % self.titles[pntt]
result = self._closeP()
- result += self.formatter.heading(1, depth, id="head-"+sha.new(pntt.encode(config.charset)).hexdigest()+unique_id)
+ result += self.formatter.heading(1, depth, id=wikiutil.unique_heading_id(self.request._page_headings, title_text))

return (result + self.formatter.text(title_text) +
self.formatter.heading(0, depth))
--- macro/Include.py.orig Wed Apr 18 14:56:22 2007
+++ macro/Include.py Thu May 10 18:43:11 2007
@@ -190,21 +190,12 @@
macro.formatter.text(heading) +
macro.formatter.heading(0, level))
else:
- import sha
- from MoinMoin import config
# this heading id might produce duplicate ids,
# if the same page is included multiple times
- # Encode stuf we feed into sha module.
- pntt = (inc_name + heading).encode(config.charset)
- hid = "head-" + sha.new(pntt).hexdigest()
- request._page_headings.setdefault(pntt, 0)
- request._page_headings[pntt] += 1
- if request._page_headings[pntt] > 1:
- hid += '-%d'%(request._page_headings[pntt],)
result.append(
#macro.formatter.heading(1, level, hid,
# icons=edit_icon.replace('<img ', '<img align="right" ')) +
- macro.formatter.heading(1, level, id=hid) +
+ macro.formatter.heading(1, level, id=wikiutil.unique_heading_id(request._page_headings, heading)) +
inc_page.link_to(request, heading, css_class="include-heading-link") +
macro.formatter.heading(0, level)
)
--- macro/TableOfContents.py.orig Fri Nov 10 17:02:52 2006
+++ macro/TableOfContents.py Thu May 10 18:43:32 2007
@@ -8,7 +8,7 @@
@license: GNU GPL, see COPYING for details.
"""

-import re, sha
+import re
from MoinMoin import config, wikiutil

#Dependencies = ["page"]
@@ -125,9 +125,6 @@
match = self.head_re.match(line)
if not match: return
title_text = match.group('htext').strip()
- pntt = pagename + title_text
- self.titles.setdefault(pntt, 0)
- self.titles[pntt] += 1

# Get new indent level
newindent = len(match.group('hmarker'))
@@ -147,11 +144,6 @@
self.result.append(self.macro.formatter.number_list(1))
self.result.append(self.macro.formatter.listitem(1))

- # Add the heading
- unique_id = ''
- if self.titles[pntt] > 1:
- unique_id = '-%d' % (self.titles[pntt],)
-
# close last listitem if same level
if self.indent == newindent:
self.result.append(self.macro.formatter.listitem(0))
@@ -159,7 +151,7 @@
if self.indent >= newindent:
self.result.append(self.macro.formatter.listitem(1))
self.result.append(self.macro.formatter.anchorlink(1,
- "head-" + sha.new(pntt.encode(config.charset)).hexdigest() + unique_id) +
+ wikiutil.unique_heading_id(self.titles, title_text)) +
self.macro.formatter.text(title_text) +
self.macro.formatter.anchorlink(0))
Changelog:
fix the display of accented characters using a technique that removes those pesky dashes -- TheAnarcat 2006-10-16 17:37:16
use proper spacing (PEP8) and don't use the string module. Upload a copy of the patch instead of relying on my svn version and remove the old versions in this page. -- TheAnarcat 2006-10-15 22:52:20
fix all the remaining todos -- TheAnarcat 2007-05-10 23:05:26
1.6 patch
# HG changeset patch
# User anarcat@titine.anarcat.ath.cx
# Date 1178839159 14400
# Node ID 7de937813f1a07b3ff98f7de4b68092780ab7e11
# Parent dc9a3809af61aa74bdb4861f1ab7d02f8b730c0e
factor out the heading uniqueness code into wikiutil

rework the code so that ascii charsets are readable (and not SHA-1 encrypted)

non-ascii charsets will receive incremental headings

all tests show that heading ids are still unique after this, and this actually fixes a bug in the Include macro where the generated heading had a duplicate id

Ref: MoinMoin:FeatureRequests/NicerHeadingIds

diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/macro/Include.py
--- a/MoinMoin/macro/Include.py Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/macro/Include.py Thu May 10 19:19:19 2007 -0400
@@ -188,19 +188,8 @@ def execute(macro, text, args_re=re.comp
macro.formatter.text(heading) +
macro.formatter.heading(0, level))
else:
- import sha
- from MoinMoin import config
- # this heading id might produce duplicate ids,
- # if the same page is included multiple times
- # Encode stuf we feed into sha module.
- pntt = (inc_name + heading).encode(config.charset)
- hid = "head-" + sha.new(pntt).hexdigest()
- request._page_headings.setdefault(pntt, 0)
- request._page_headings[pntt] += 1
- if request._page_headings[pntt] > 1:
- hid += '-%d' % (request._page_headings[pntt], )
result.append(
- macro.formatter.heading(1, level, id=hid) +
+ macro.formatter.heading(1, level, id=wikiutil.unique_heading_id(request._page_headings, heading)) +
inc_page.link_to(request, heading, css_class="include-heading-link") +
macro.formatter.heading(0, level)
)
diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/macro/TableOfContents.py
--- a/MoinMoin/macro/TableOfContents.py Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/macro/TableOfContents.py Thu May 10 19:19:19 2007 -0400
@@ -8,7 +8,7 @@
@license: GNU GPL, see COPYING for details.
"""

-import re, sha
+import re
from MoinMoin import config, wikiutil

#Dependencies = ["page"]
@@ -126,9 +126,6 @@ class TableOfContents:
if not match:
return
title_text = match.group('htext').strip()
- pntt = pagename + title_text
- self.titles.setdefault(pntt, 0)
- self.titles[pntt] += 1

# Get new indent level
newindent = len(match.group('hmarker'))
@@ -148,11 +145,6 @@ class TableOfContents:
self.result.append(self.macro.formatter.number_list(1))
self.result.append(self.macro.formatter.listitem(1))

- # Add the heading
- unique_id = ''
- if self.titles[pntt] > 1:
- unique_id = '-%d' % (self.titles[pntt],)
-
# close last listitem if same level
if self.indent == newindent:
self.result.append(self.macro.formatter.listitem(0))
@@ -160,7 +152,7 @@ class TableOfContents:
if self.indent >= newindent:
self.result.append(self.macro.formatter.listitem(1))
self.result.append(self.macro.formatter.anchorlink(1,
- "head-" + sha.new(pntt.encode(config.charset)).hexdigest() + unique_id) +
+ wikiutil.unique_heading_id(self.titles, title_text)) +
self.macro.formatter.text(title_text) +
self.macro.formatter.anchorlink(0))

diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/parser/text_moin_wiki.py
--- a/MoinMoin/parser/text_moin_wiki.py Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/parser/text_moin_wiki.py Thu May 10 19:19:19 2007 -0400
@@ -777,8 +777,6 @@ class Parser:

def _heading_repl(self, word):
"""Handle section headings."""
- import sha
-
h = word.strip()
level = 1
while h[level:level+1] == '=':
@@ -788,15 +786,8 @@ class Parser:
# FIXME: needed for Included pages but might still result in unpredictable results
# when included the same page multiple times
title_text = h[level:-level].strip()
- pntt = self.formatter.page.page_name + title_text
- self.titles.setdefault(pntt, 0)
- self.titles[pntt] += 1
-
- unique_id = ''
- if self.titles[pntt] > 1:
- unique_id = '-%d' % self.titles[pntt]
result = self._closeP()
- result += self.formatter.heading(1, depth, id="head-"+sha.new(pntt.encode(config.charset)).hexdigest()+unique_id)
+ result += self.formatter.heading(1, depth, id=wikiutil.unique_heading_id(self.request._page_headings, title_text))

return (result + self.formatter.text(title_text) +
self.formatter.heading(0, depth))
diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/wikiutil.py
--- a/MoinMoin/wikiutil.py Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/wikiutil.py Thu May 10 19:19:19 2007 -0400
@@ -271,6 +271,25 @@ def make_breakable(text, maxlen):
else:
newtext.append(part)
return " ".join(newtext)
+
+def unique_heading_id(headings, text):
+ """ generate an ID for a heading that is unique to this request, human-readable and HTML-compliant
+ """
+ import unicodedata
+ # ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
+ # followed by any number of letters, digits ([0-9]), hyphens ("-"),
+ # underscores ("_"), colons (":"), and periods (".").
+ # http://www.w3.org/TR/html4/types.html
+ pntt = re.sub('[^-A-Za-z0-9_:.]+', '-', unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')).lower()
+ hid = "head-" + pntt # basic heading structure
+ # count the number of times this heading is found in this request
+ headings.setdefault(pntt, 0)
+ headings[pntt] += 1
+ # spcial case: if the text is strictly non-ascii, add a number anyways so it looks nicer
+ if headings[pntt] > 1 or pntt == "-":
+ hid += '-%d' % (headings[pntt], ) # increment the heading id, to avoid duplicates
+ return re.sub('--+', '-', hid) # necessary because the last line might have added another duplicate -
+

########################################################################
### Storage
Note: this patch is as broken as the default behavior for multiple includes of the same page, because 1.6.x now caches included pages, which 1.5 does not. It should otherwise work correctly.
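For readers who want to try the helper outside MoinMoin, here is a Python 3 port of unique_heading_id as a minimal sketch. It is an illustration, not the patch itself: the only intended change is the extra `.decode("ascii")`, needed because Python 3 keeps bytes and text apart, and the `headings` dict stands in for `request._page_headings`.

```python
import re
import unicodedata


def unique_heading_id(headings, text):
    """Python 3 sketch of the helper the patch adds to wikiutil.py."""
    # Decompose accented characters and drop whatever is not ASCII.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace each run of characters that HTML 4 forbids in an id with a dash.
    pntt = re.sub(r"[^-A-Za-z0-9_:.]+", "-", ascii_text).lower()
    hid = "head-" + pntt
    # Count how many times this heading was seen in this request.
    headings[pntt] = headings.get(pntt, 0) + 1
    # Append the counter on repeats, or when only a dash survived.
    if headings[pntt] > 1 or pntt == "-":
        hid += "-%d" % headings[pntt]
    # Collapse the doubled dashes the previous step may have produced.
    return re.sub("--+", "-", hid)


page_headings = {}
ids = [unique_heading_id(page_headings, t) for t in ("Foo", "Foo", "Caf\u00e9 au lait")]
print(ids)  # ['head-foo', 'head-foo-2', 'head-cafe-au-lait']
```

One thing this port makes visible: a heading with no ASCII characters and no spaces reduces to the empty string rather than to "-", so the special case does not fire for it and the first such heading gets the bare id "head-".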
1.7 patch
This patch is a crude adaptation of the above patch. It tries to guess whether it did a good job of creating a nice heading id by using the following heuristic: a conversion is good if the result is longer than one character and longer than half the length of the original. Otherwise it falls back to the existing UTF-7 based encoding.
It is much simpler than the 1.6 patch since it assumes that the work done by JohannesBerg deals with the other issues my patch was solving. Indeed, it already factors out id sanitization and gets rid of the ugly SHA hashes. I assume it also deals properly with cross-page numbering (i.e. through includes and TOC).
# HG changeset patch
# User anarcat@titine.anarcat.ath.cx
# Date 1190262113 14400
# Node ID 3fcaf6561a8915f4eb83f5737fe258c6888509a9
# Parent 93be75db205186c2932e6512b9a9c803aba83da1
make nicer headings for latin1 charsets

we use a trivial heuristic to guess if our nicer heading is really nicer. the converted string is accepted if:

* it's longer than 1 characters
* it's longer than half the length of the original string

diff -r 93be75db2051 -r 3fcaf6561a89 MoinMoin/wikiutil.py
--- a/MoinMoin/wikiutil.py Wed Sep 19 21:39:48 2007 +0200
+++ b/MoinMoin/wikiutil.py Thu Sep 20 00:21:53 2007 -0400
@@ -2154,8 +2154,16 @@ def anchor_name_from_text(text):
Generate an anchor name from the given text
This function generates valid HTML IDs.
'''
- quoted = urllib.quote_plus(text.encode('utf-7'))
- res = quoted.replace('%', '.').replace('+', '').replace('_', '')
+ import unicodedata
+ if not isinstance(text, unicode):
+ text = unicode(text, 'utf8')
+ res = re.sub('[^-A-Za-z0-9_:.]+', '-', unicodedata.normalize('NFKD', text).encode('ascii', 'ignore'))
+ # Heuristic to guess if we made a good job at interpreting the string, if:
+ # the resulted string is too small OR
+ # the resulting string is more that 50% smaller
+ # then we consider that we failed and revert to a systematic utf7 encoding
+ if len(res) <= 1 or len(res) <= (len(text) / 2):
+ res = urllib.quote_plus(text.encode('utf-7')).replace('%', '.').replace('+', '').replace('_', '')
if not res[:1].isalpha():
return 'A%s' % res
return res
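The heuristic in the 1.7 patch can be tried out with the Python 3 sketch below. It is a port for illustration under a few assumptions: `urllib.parse.quote_plus` replaces the Python 2 `urllib.quote_plus`, the `unicode` type check is dropped because Python 3 strings are already Unicode, and `//` reproduces the integer division of the original.

```python
import re
import unicodedata
from urllib.parse import quote_plus


def anchor_name_from_text(text):
    """Python 3 sketch of the patched anchor_name_from_text."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    res = re.sub(r"[^-A-Za-z0-9_:.]+", "-", ascii_text)
    # Too short, or at most half of the original: treat the conversion as failed.
    if len(res) <= 1 or len(res) <= len(text) // 2:
        # Fall back to the systematic UTF-7 based encoding.
        quoted = quote_plus(text.encode("utf-7"))
        res = quoted.replace("%", ".").replace("+", "").replace("_", "")
    # HTML 4 ids must start with a letter.
    if not res[:1].isalpha():
        return "A%s" % res
    return res


print(anchor_name_from_text("Getting started"))  # Getting-started
print(anchor_name_from_text("Caf\u00e9"))        # Cafe
print(anchor_name_from_text("\u65e5\u672c\u8a9e"))  # falls back to the UTF-7 form
```

Mostly-Latin headings keep a readable id, while headings that lose most of their characters get the older, uglier but lossless encoding.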
Current issues
- The name of the link could be shortened by dropping the "head-" prefix, but this might cause conflicts with other anchors in the rendered HTML.
This patch even fixes an old bug that occurred when the Include and TableOfContents macros were used together: the ID generated from the Include title was wrong, so the link did not work when clicked in the TOC. It's now a "nice heading" and actually works. -- TheAnarcat 2006-10-03 01:31:09
todo: add the page name to the anchor id so as not to create new problems, e.g. #Pagename:headingstring
- otherwise you get duplicate IDs if you include multiple other pages that have same headline texts (e.g. because they were created from the same template)
- if you add the pagename, you will run into the same problems (pure non-ascii pagenames) as with the heading texts
My position is that there is no way to fix this without reverting to the old cryptic behaviour or having extremely long anchor names, neither of which is desirable. Another, better approach would be to use that magic WASP caching system to have Python code interpreted each time a heading is sent, regardless of the cache setting.
The simple solution that will work everywhere: use the heading numbers as ids, e.g.
<h1 id="sec1">Heading text</h1>
...
<h2 id="sec1.1">First sub heading</h2>
...
<h2 id="sec1.2">Heading from included page</h2>
...
- This is actually very similar to what the page already does and doesn't fix the issue at hand. To be really clear, the problem boils down to this use case:
- PageOne
= Foo =
- PageTwo
= Foo =
- PageThree
[[Include(PageOne)]]
[[Include(PageTwo)]]
- This will generate, under 1.6, without the patch:
<h1 id="head-VERY_LONG_SHA_HASH">Foo</h1>
<h1 id="head-VERY_LONG_SHA_HASH">Foo</h1>
- With the patch, under 1.6:
<h1 id="head-foo">Foo</h1>
<h1 id="head-foo">Foo</h1>
- With the patch, under 1.5:
<h1 id="head-foo">Foo</h1>
<h1 id="head-foo-2">Foo</h1>
Only the last result is valid HTML; the other two contain duplicate ids. The difference between 1.5 and 1.6 is due to do_cache=False being removed from the send_page call.
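The difference between the two behaviours can be reproduced with a toy model. This is a simplified sketch, not MoinMoin code: `heading_id` and `render_page` are hypothetical helpers, and caching is modelled (as an assumption) by rendering each included page with its own private counter, so the ids are fixed before the pages are combined.

```python
import re


def heading_id(counts, text):
    """Hypothetical helper: readable id plus a counter suffix on repeats."""
    slug = re.sub(r"[^-A-Za-z0-9_:.]+", "-", text).lower()
    counts[slug] = counts.get(slug, 0) + 1
    suffix = "-%d" % counts[slug] if counts[slug] > 1 else ""
    return "head-" + slug + suffix


def render_page(counts, headings):
    """Render a page's headings, drawing ids from the given counter dict."""
    return ['<h1 id="%s">%s</h1>' % (heading_id(counts, h), h) for h in headings]


# 1.5-like: both includes are rendered within one request and share a counter.
shared = {}
uncached = render_page(shared, ["Foo"]) + render_page(shared, ["Foo"])

# 1.6-like: each included page was rendered (and cached) on its own.
cached = render_page({}, ["Foo"]) + render_page({}, ["Foo"])

print(uncached)  # ids head-foo and head-foo-2
print(cached)    # id head-foo twice
```

Whatever the id scheme, uniqueness can only be guaranteed if the counter is consulted when the final page is assembled, which is the point argued in the rest of this discussion.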
- Don't use the heading name for the id but the heading serial numbers, which can never be the same and are created at cache run time.
- If I understand your idea properly, this is what's currently going on: the unique IDs are created at cache generation. Now, if you mean that the ids should be generated when the cache is "run" (not when it's generated), then yes, I agree, but I don't know how to do that. Note that the current code does generate unique IDs at cache "compile time", but not at cache "run time".
Check text_python.py (1.5.7):
def heading(self, on, depth, **kw):
    if on:
        code = [
            self.__adjust_language_state(),
            'request.write(%s.heading(%r, %r, **%r))' % (self.__formatter, on, depth, kw),
        ]
        return self.__insert_code(''.join(code))
    else:
        return self.formatter.heading(on, depth, **kw)
Calls to formatter.heading(1, ...) happen at cache run time. Section numbers are generated on those calls (currently only if section numbers are enabled, but you can change that). You can use the section number as a unique identifier.
Examples: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.8
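The section-number proposal can be sketched as a small counter object that hands out ids such as "sec1.2" each time a heading is opened. This is a hypothetical illustration: `SectionNumberer` and `next_id` are invented names, not part of the MoinMoin formatter API.

```python
class SectionNumberer:
    """Hypothetical helper: hierarchical section numbers used as heading ids."""

    def __init__(self):
        self.counters = []  # one counter per heading depth

    def next_id(self, depth):
        # Pad with zeros if the document skips a level (e.g. h1 then h3).
        while len(self.counters) < depth:
            self.counters.append(0)
        # Entering a shallower heading resets every deeper counter.
        del self.counters[depth:]
        self.counters[depth - 1] += 1
        return "sec" + ".".join(str(n) for n in self.counters)


numberer = SectionNumberer()
section_ids = [numberer.next_id(d) for d in (1, 2, 2, 1, 2)]
print(section_ids)  # ['sec1', 'sec1.1', 'sec1.2', 'sec2', 'sec2.1']
```

Because the counter advances on every heading, two headings with the same text can never receive the same id, provided the numbering happens when the page is sent rather than when it is cached. The trade-off is that links break whenever sections are inserted or reordered.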
- If you included those pages multiple times, you'd have the same problem.
To some extent, it is the responsibility of the wiki editors to take care of those issues and make sure there are no duplicate ids... I suspect similar issues exist with the [[Anchor(Foo)]] macro.
- No, it is the responsibility of the wiki engine to create unique ids. Anchors are different - if you let the user add ids to the page, you can't control the output.
Testing
Testing area: wsb on koumbit
Related pages
See also: MoinMoinBugs/ReImplementCleanerIncludeMacro - meta bug discussion related issues