Human readable heading anchors for MoinMoin

MoinMoin 1.7 now has nice heading IDs.

Headings in MoinMoin are identified by anchors, which is fine, but the anchor id itself is close to unreadable. It is built as follows:

pntt = self.formatter.page.page_name + title_text
id = "head-"+sha.new(pntt.encode(config.charset)).hexdigest()+unique_id

(more or less extracted from wiki.py)

We see that a SHA hash is used to avoid encoding issues. As I understand the code, it is not used to avoid clashes: identical titles on the same page logically produce identical SHA hashes, which is why unique_id is appended when clashes are found.
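For readers on a current Python, the scheme above can be approximated as follows. This is only a sketch: the old sha module is replaced by hashlib, config.charset is assumed to be UTF-8, and legacy_heading_id with its parameters is a name made up for this example, not MoinMoin API.

```python
import hashlib


def legacy_heading_id(page_name, title_text, titles, charset="utf-8"):
    """Approximate the old scheme: "head-" + SHA-1(page name + title).

    `titles` is a dict counting how often each page/title pair was seen;
    `charset` stands in for config.charset (assumed UTF-8 here).
    """
    pntt = page_name + title_text
    titles[pntt] = titles.get(pntt, 0) + 1
    # Identical titles hash identically, so a counter suffix is needed.
    unique_id = "-%d" % titles[pntt] if titles[pntt] > 1 else ""
    return "head-" + hashlib.sha1(pntt.encode(charset)).hexdigest() + unique_id


titles = {}
first = legacy_heading_id("FrontPage", "Introduction", titles)
second = legacy_heading_id("FrontPage", "Introduction", titles)
print(first)
print(second)
```

The second id is the first one with "-2" appended, which shows that the hash alone does nothing for uniqueness.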

What I propose is that we use the title itself as the anchor id. I have written a crude patch for this that rips out illegal characters and replaces them with dashes. Unicode normalisation is also attempted, but it does not do exactly what I want.

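I would like "é" to turn into "e", but the decomposed forms (NFD, NFKD) turn it into "e" followed by a combining acute accent, and the composed forms (NFC, NFKC) leave it as a single "é":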

>>> unicodedata.normalize('NFKD', u'é')
u'e\u0301'
>>> unicodedata.normalize('NFKC', u'é')
u'\xe9'
>>> unicodedata.normalize('NFD', u'é')
u'e\u0301'
>>> unicodedata.normalize('NFC', u'é')
u'\xe9'

If someone finds a nicer solution, please mention it here...

RadomirDopieralski says: Well, maybe you could normalize first, and then just remove all non-ascii characters...
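A minimal sketch of that suggestion, written for Python 3 (the code on this page is Python 2). The helper names ascii_fold and slugify_heading are invented for this example; the character class is the one allowed for HTML 4 ID tokens.

```python
import re
import unicodedata


def ascii_fold(text):
    """Decompose accented characters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # "ignore" silently discards the combining marks left by NFKD,
    # as well as any character that has no ASCII base letter at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def slugify_heading(text):
    """Fold to ASCII, then replace runs of illegal ID characters with a dash."""
    return re.sub(r"[^-A-Za-z0-9_:.]+", "-", ascii_fold(text)).lower()


print(ascii_fold("\u00e9"))                          # "é" (U+00E9)
print(slugify_heading("R\u00e9sum\u00e9 d\u00e9taill\u00e9"))
print(repr(ascii_fold("\u65e5\u672c\u8a9e")))        # a title in CJK characters
```

This works for Latin scripts with diacritics. A title written entirely in a non-Latin script folds to an empty string, so some fallback is still needed for those.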

Patch (applies to 1.5.5a and 1.5.7)

 macro/Include.py         |   11 +----------
 macro/TableOfContents.py |   12 ++----------
 parser/wiki.py           |   10 +---------
 wikiutil.py              |   19 +++++++++++++++++++
 4 files changed, 23 insertions(+), 29 deletions(-)
--- wikiutil.py.orig	Thu May 10 16:35:47 2007
+++ wikiutil.py	Thu May 10 18:44:33 2007
@@ -274,6 +274,25 @@
             newtext.append(part)
     return " ".join(newtext)
 
+def unique_heading_id(headings, text):
+    """ generate an ID for a heading that is unique to this request, human-readable and HTML-compliant
+    """
+    import unicodedata
+    # ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
+    # followed by any number of letters, digits ([0-9]), hyphens ("-"),
+    # underscores ("_"), colons (":"), and periods (".").
+    # http://www.w3.org/TR/html4/types.html
+    pntt = re.sub('[^-A-Za-z0-9_:.]+', '-', unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')).lower()
+    hid = "head-" + pntt # basic heading structure
+    # count the number of times this heading is found in this request
+    headings.setdefault(pntt, 0)
+    headings[pntt] += 1
+    # spcial case: if the text is strictly non-ascii, add a number anyways so it looks nicer
+    if headings[pntt] > 1 or pntt == "-":
+        hid += '-%d' % (headings[pntt], ) # increment the heading id, to avoid duplicates
+    return re.sub('--+', '-', hid) # necessary because the last line might have added another duplicate -
+
+
 ########################################################################
 ### Storage
 ########################################################################
--- parser/wiki.py.orig	Sat Sep 16 18:21:52 2006
+++ parser/wiki.py	Thu May 10 18:42:59 2007
@@ -744,8 +744,6 @@
 
     def _heading_repl(self, word):
         """Handle section headings."""
-        import sha
-
         h = word.strip()
         level = 1
         while h[level:level+1] == '=':
@@ -756,15 +754,9 @@
         # TODO but it might still result in unpredictable results
         # when included the same page multiple times
         title_text = h[level:-level].strip()
-        pntt = self.formatter.page.page_name + title_text
-        self.titles.setdefault(pntt, 0)
-        self.titles[pntt] += 1

-        unique_id = ''
-        if self.titles[pntt] > 1:
-            unique_id = '-%d' % self.titles[pntt]
         result = self._closeP()
-        result += self.formatter.heading(1, depth, id="head-"+sha.new(pntt.encode(config.charset)).hexdigest()+unique_id)
+        result += self.formatter.heading(1, depth, id=wikiutil.unique_heading_id(self.request._page_headings, title_text))
                                     
         return (result + self.formatter.text(title_text) +
                 self.formatter.heading(0, depth))
--- macro/Include.py.orig	Wed Apr 18 14:56:22 2007
+++ macro/Include.py	Thu May 10 18:43:11 2007
@@ -190,21 +190,12 @@
                               macro.formatter.text(heading) +
                               macro.formatter.heading(0, level))
             else:
-                import sha
-                from MoinMoin import config
                 # this heading id might produce duplicate ids,
                 # if the same page is included multiple times
-                # Encode stuf we feed into sha module.
-                pntt = (inc_name + heading).encode(config.charset)
-                hid = "head-" + sha.new(pntt).hexdigest()
-                request._page_headings.setdefault(pntt, 0)
-                request._page_headings[pntt] += 1
-                if request._page_headings[pntt] > 1:
-                    hid += '-%d'%(request._page_headings[pntt],)
                 result.append(
                     #macro.formatter.heading(1, level, hid,
                     #    icons=edit_icon.replace('<img ', '<img align="right" ')) +
-                    macro.formatter.heading(1, level, id=hid) +
+                    macro.formatter.heading(1, level, id=wikiutil.unique_heading_id(request._page_headings, heading)) +
                     inc_page.link_to(request, heading, css_class="include-heading-link") +
                     macro.formatter.heading(0, level)
                 )
--- macro/TableOfContents.py.orig	Fri Nov 10 17:02:52 2006
+++ macro/TableOfContents.py	Thu May 10 18:43:32 2007
@@ -8,7 +8,7 @@
     @license: GNU GPL, see COPYING for details.
 """
 
-import re, sha
+import re
 from MoinMoin import config, wikiutil
 
 #Dependencies = ["page"]
@@ -125,9 +125,6 @@
         match = self.head_re.match(line)
         if not match: return
         title_text = match.group('htext').strip()
-        pntt = pagename + title_text
-        self.titles.setdefault(pntt, 0)
-        self.titles[pntt] += 1
 
         # Get new indent level
         newindent = len(match.group('hmarker'))
@@ -147,11 +144,6 @@
             self.result.append(self.macro.formatter.number_list(1))
             self.result.append(self.macro.formatter.listitem(1))
 
-        # Add the heading
-        unique_id = ''
-        if self.titles[pntt] > 1:
-            unique_id = '-%d' % (self.titles[pntt],)
-
         # close last listitem if same level
         if self.indent == newindent:
             self.result.append(self.macro.formatter.listitem(0))
@@ -159,7 +151,7 @@
         if self.indent >= newindent:
             self.result.append(self.macro.formatter.listitem(1))
         self.result.append(self.macro.formatter.anchorlink(1,
-            "head-" + sha.new(pntt.encode(config.charset)).hexdigest() + unique_id) +
+                           wikiutil.unique_heading_id(self.titles, title_text)) +
                            self.macro.formatter.text(title_text) +
                            self.macro.formatter.anchorlink(0))

nice_headings.diff
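To try the behaviour of unique_heading_id outside MoinMoin, here is a rough Python 3 port of the function from the patch. It is a sketch, not the patch itself: the only intended change is the explicit .decode("ascii"), which Python 3 needs because encode() returns bytes.

```python
import re
import unicodedata


def unique_heading_id(headings, text):
    """Python 3 port (sketch) of the patch's unique_heading_id.

    `headings` is a dict shared by all headings of one request; it counts
    how many times each sanitized title has been seen so far.
    """
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    pntt = re.sub(r"[^-A-Za-z0-9_:.]+", "-", folded).lower()
    hid = "head-" + pntt
    headings[pntt] = headings.get(pntt, 0) + 1
    # Number repeated titles, and titles that were reduced to a lone dash.
    if headings[pntt] > 1 or pntt == "-":
        hid += "-%d" % headings[pntt]
    # Collapse the dash runs that the suffix may have created.
    return re.sub(r"--+", "-", hid)


headings = {}
print(unique_heading_id(headings, "Foo"))
print(unique_heading_id(headings, "Foo"))
print(unique_heading_id(headings, "!!!"))
print(unique_heading_id({}, "\u65e5\u672c\u8a9e"))
```

In this port, a title such as "!!!" is reduced to a single dash and gets a number, but a title made up only of non-ASCII characters is reduced to an empty string, so its first occurrence comes out as a bare "head-". The special case in the patch may therefore not cover every title it was meant for.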

Changelog:

fix the display of accented characters using a technique that removes those pesky dashes -- TheAnarcat 2006-10-16 17:37:16
use proper spacing (PEP8) and don't use the string module. Upload a copy of the patch instead of relying on my svn version and remove the old versions in this page. -- TheAnarcat 2006-10-15 22:52:20
fix all the remaining todos -- TheAnarcat 2007-05-10 23:05:26

1.6 patch

# HG changeset patch
# User anarcat@titine.anarcat.ath.cx
# Date 1178839159 14400
# Node ID 7de937813f1a07b3ff98f7de4b68092780ab7e11
# Parent  dc9a3809af61aa74bdb4861f1ab7d02f8b730c0e
factor out the heading uniqueness code into wikiutil

rework the code so that ascii charsets are readable (and not SHA-1 encrypted)

non-ascii charsets will receive incremental headings

all tests show that heading ids are still unique after this, and this actually fixes a bug in the Include macro where the generated heading had a duplicate id

Ref: MoinMoin:FeatureRequests/NicerHeadingIds

diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/macro/Include.py
--- a/MoinMoin/macro/Include.py	Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/macro/Include.py	Thu May 10 19:19:19 2007 -0400
@@ -188,19 +188,8 @@ def execute(macro, text, args_re=re.comp
                               macro.formatter.text(heading) +
                               macro.formatter.heading(0, level))
             else:
-                import sha
-                from MoinMoin import config
-                # this heading id might produce duplicate ids,
-                # if the same page is included multiple times
-                # Encode stuf we feed into sha module.
-                pntt = (inc_name + heading).encode(config.charset)
-                hid = "head-" + sha.new(pntt).hexdigest()
-                request._page_headings.setdefault(pntt, 0)
-                request._page_headings[pntt] += 1
-                if request._page_headings[pntt] > 1:
-                    hid += '-%d' % (request._page_headings[pntt], )
                 result.append(
-                    macro.formatter.heading(1, level, id=hid) +
+                    macro.formatter.heading(1, level, id=wikiutil.unique_heading_id(request._page_headings, heading)) +
                     inc_page.link_to(request, heading, css_class="include-heading-link") +
                     macro.formatter.heading(0, level)
                 )
diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/macro/TableOfContents.py
--- a/MoinMoin/macro/TableOfContents.py	Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/macro/TableOfContents.py	Thu May 10 19:19:19 2007 -0400
@@ -8,7 +8,7 @@
     @license: GNU GPL, see COPYING for details.
 """
 
-import re, sha
+import re
 from MoinMoin import config, wikiutil
 
 #Dependencies = ["page"]
@@ -126,9 +126,6 @@ class TableOfContents:
         if not match:
             return
         title_text = match.group('htext').strip()
-        pntt = pagename + title_text
-        self.titles.setdefault(pntt, 0)
-        self.titles[pntt] += 1
 
         # Get new indent level
         newindent = len(match.group('hmarker'))
@@ -148,11 +145,6 @@ class TableOfContents:
             self.result.append(self.macro.formatter.number_list(1))
             self.result.append(self.macro.formatter.listitem(1))
 
-        # Add the heading
-        unique_id = ''
-        if self.titles[pntt] > 1:
-            unique_id = '-%d' % (self.titles[pntt],)
-
         # close last listitem if same level
         if self.indent == newindent:
             self.result.append(self.macro.formatter.listitem(0))
@@ -160,7 +152,7 @@ class TableOfContents:
         if self.indent >= newindent:
             self.result.append(self.macro.formatter.listitem(1))
         self.result.append(self.macro.formatter.anchorlink(1,
-            "head-" + sha.new(pntt.encode(config.charset)).hexdigest() + unique_id) +
+                           wikiutil.unique_heading_id(self.titles, title_text)) +
                            self.macro.formatter.text(title_text) +
                            self.macro.formatter.anchorlink(0))
 
diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/parser/text_moin_wiki.py
--- a/MoinMoin/parser/text_moin_wiki.py	Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/parser/text_moin_wiki.py	Thu May 10 19:19:19 2007 -0400
@@ -777,8 +777,6 @@ class Parser:
 
     def _heading_repl(self, word):
         """Handle section headings."""
-        import sha
-
         h = word.strip()
         level = 1
         while h[level:level+1] == '=':
@@ -788,15 +786,8 @@ class Parser:
         # FIXME: needed for Included pages but might still result in unpredictable results
         # when included the same page multiple times
         title_text = h[level:-level].strip()
-        pntt = self.formatter.page.page_name + title_text
-        self.titles.setdefault(pntt, 0)
-        self.titles[pntt] += 1
-
-        unique_id = ''
-        if self.titles[pntt] > 1:
-            unique_id = '-%d' % self.titles[pntt]
         result = self._closeP()
-        result += self.formatter.heading(1, depth, id="head-"+sha.new(pntt.encode(config.charset)).hexdigest()+unique_id)
+        result += self.formatter.heading(1, depth, id=wikiutil.unique_heading_id(self.request._page_headings, title_text))
 
         return (result + self.formatter.text(title_text) +
                 self.formatter.heading(0, depth))
diff -r dc9a3809af61 -r 7de937813f1a MoinMoin/wikiutil.py
--- a/MoinMoin/wikiutil.py	Mon May 07 22:50:51 2007 +0200
+++ b/MoinMoin/wikiutil.py	Thu May 10 19:19:19 2007 -0400
@@ -271,6 +271,25 @@ def make_breakable(text, maxlen):
         else:
             newtext.append(part)
     return " ".join(newtext)
+
+def unique_heading_id(headings, text):
+    """ generate an ID for a heading that is unique to this request, human-readable and HTML-compliant
+    """
+    import unicodedata
+    # ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
+    # followed by any number of letters, digits ([0-9]), hyphens ("-"),
+    # underscores ("_"), colons (":"), and periods (".").
+    # http://www.w3.org/TR/html4/types.html
+    pntt = re.sub('[^-A-Za-z0-9_:.]+', '-', unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')).lower()
+    hid = "head-" + pntt # basic heading structure
+    # count the number of times this heading is found in this request
+    headings.setdefault(pntt, 0)
+    headings[pntt] += 1
+    # spcial case: if the text is strictly non-ascii, add a number anyways so it looks nicer
+    if headings[pntt] > 1 or pntt == "-":
+        hid += '-%d' % (headings[pntt], ) # increment the heading id, to avoid duplicates
+    return re.sub('--+', '-', hid) # necessary because the last line might have added another duplicate -
+
 
 ########################################################################
 ### Storage

nice_headings-1.6.diff

Note: this patch is as broken as the default behavior for multiple includes of the same page, because 1.6.x now caches included pages, which 1.5 does not. It should otherwise work correctly.

1.7 patch

This patch is a crude adaptation of the above patch. It tries to guess whether it did a good job of creating a nice heading id, using the following heuristic: a good conversion is one that:

- has a len() of 2 or more
- has a len() greater than half that of the original

It is much simpler than the 1.6 patch since it assumes that the work done by JohannesBerg deals with the other issues my patch was solving. Indeed, that work already factors out id sanitization and gets rid of the ugly SHA hashes. I assume it also deals properly with cross-page numbering (i.e. through includes and the TOC).
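The heuristic can be tried on its own with the following Python 3 sketch. It mirrors the patched anchor_name_from_text but is not the MoinMoin function: urllib.quote_plus becomes urllib.parse.quote_plus, and the input is assumed to already be a str.

```python
import re
import unicodedata
from urllib.parse import quote_plus


def anchor_name_from_text(text):
    """Sketch: prefer a readable ASCII anchor, fall back to quoted UTF-7."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    res = re.sub(r"[^-A-Za-z0-9_:.]+", "-", folded)
    # The conversion is rejected if too little of the title survived:
    # one character or fewer, or no more than half the original length.
    if len(res) <= 1 or len(res) <= len(text) / 2:
        quoted = quote_plus(text.encode("utf-7"))
        res = quoted.replace("%", ".").replace("+", "").replace("_", "")
    # HTML 4 IDs must start with a letter.
    if not res[:1].isalpha():
        return "A%s" % res
    return res


print(anchor_name_from_text("Heading text"))
print(anchor_name_from_text("R\u00e9sum\u00e9"))
print(anchor_name_from_text("\u65e5\u672c\u8a9e"))
```

Titles in Latin scripts keep a readable anchor, while a title that loses most of its characters falls back to the UTF-7 based encoding.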

# HG changeset patch
# User anarcat@titine.anarcat.ath.cx
# Date 1190262113 14400
# Node ID 3fcaf6561a8915f4eb83f5737fe258c6888509a9
# Parent  93be75db205186c2932e6512b9a9c803aba83da1
make nicer headings for latin1 charsets

we use a trivial heuristic to guess if our nicer heading is really nicer. the converted string is accepted if:

 * it's longer than 1 characters
 * it's longer than half the length of the original string

diff -r 93be75db2051 -r 3fcaf6561a89 MoinMoin/wikiutil.py
--- a/MoinMoin/wikiutil.py	Wed Sep 19 21:39:48 2007 +0200
+++ b/MoinMoin/wikiutil.py	Thu Sep 20 00:21:53 2007 -0400
@@ -2154,8 +2154,16 @@ def anchor_name_from_text(text):
     Generate an anchor name from the given text
     This function generates valid HTML IDs.
     '''
-    quoted = urllib.quote_plus(text.encode('utf-7'))
-    res = quoted.replace('%', '.').replace('+', '').replace('_', '')
+    import unicodedata
+    if not isinstance(text, unicode):
+        text = unicode(text, 'utf8')
+    res = re.sub('[^-A-Za-z0-9_:.]+', '-', unicodedata.normalize('NFKD', text).encode('ascii', 'ignore'))
+    # Heuristic to guess if we made a good job at interpreting the string, if:
+    # the resulted string is too small OR
+    # the resulting string is more that 50% smaller
+    # then we consider that we failed and revert to a systematic utf7 encoding
+    if len(res) <= 1 or len(res) <= (len(text) / 2):
+        res = urllib.quote_plus(text.encode('utf-7')).replace('%', '.').replace('+', '').replace('_', '')
     if not res[:1].isalpha():
         return 'A%s' % res
     return res

nice_headings-1.7.diff

Current issues

The name of the link could be shorter by removing the "head-" prefix. This might cause problems because of conflicts with other anchors in the rendered HTML.

This patch even fixes an old bug that occurred when the Include and TableOfContents macros were used together: the ID generated from the Include title was wrong and did not work when clicked in the TOC. It's now a "nice heading" and actually works. -- TheAnarcat 2006-10-03 01:31:09

todo: add pagename into anchor id to not create new problems, e.g. #Pagename:headingstring
- otherwise you get duplicate IDs if you include multiple other pages that have same headline texts (e.g. because they were created from the same template)
- if you add the pagename, you will run into the same problems (pure non-ascii pagenames) as with the heading texts
- My position is that there is no way to fix this without reverting to the old cryptic behaviour or having extremely long anchor names, neither of which is desirable. Another, better approach would be to use that magic WASP caching system to have Python code interpreted each time a heading is sent, regardless of the cache setting.

The simple solution that will work everywhere: use the heading numbers as the id, e.g.

<h1 id="sec1">Heading text</h1>
...
<h2 id="sec1.1">First sub heading</h2>
...
<h2 id="sec1.2">Heading from included page</h2>
...

This is actually very similar to what the page already does and doesn't fix the issue at hand. To be really clear, the problem boils down to this use case:

PageOne

```
= Foo =
```

PageTwo

```
= Foo =
```

PageThree

[[Include(PageOne)]]
[[Include(PageTwo)]]

This will generate, under 1.6, without the patch

<h1 id="head-VERY_LONG_SHA_HASH">Foo</h1>
<h1 id="head-VERY_LONG_SHA_HASH">Foo</h1>

With the patch, under 1.6

<h1 id="head-foo">Foo</h1>
<h1 id="head-foo">Foo</h1>

With the patch, under 1.5

<h1 id="head-foo">Foo</h1>
<h1 id="head-foo-2">Foo</h1>

Only the last result is valid HTML; the other two cases produce duplicate ids. The difference between 1.5 and 1.6 is due to do_cache=False being removed from the send_page call.
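The three outcomes above can be reproduced with a toy model. This is a deliberate simplification and not MoinMoin code: heading_id, render_uncached and render_cached are hypothetical names, and "cached" here only means that each included page computed its ids in isolation, before being included.

```python
def heading_id(counters, title):
    """Readable id with a numeric suffix for repeated titles."""
    slug = title.lower()
    counters[slug] = counters.get(slug, 0) + 1
    suffix = "-%d" % counters[slug] if counters[slug] > 1 else ""
    return "head-" + slug + suffix


def render_uncached(included_titles):
    """Like 1.5: ids are computed while the including page is sent,
    so every heading shares one per-request counter."""
    counters = {}
    return [heading_id(counters, title) for title in included_titles]


def render_cached(included_titles):
    """Like 1.6: each included page had its ids fixed when its own cache
    was built, so the counter never sees the other page."""
    return [heading_id({}, title) for title in included_titles]


# PageThree includes PageOne and PageTwo, which both contain "= Foo =".
print(render_uncached(["Foo", "Foo"]))
print(render_cached(["Foo", "Foo"]))
```

Only the shared counter yields distinct ids, which is why the same patch behaves differently on the two versions.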

Don't use the heading name for the id but the heading serial numbers, which can never be the same and are created at cache run time.
- If I understand your idea properly, this is what's currently going on: the unique IDs are created at cache generation. Now, if you mean that the ids should be generated when the cache is "run" (not when it's generated), then yes, I agree, but I don't know how to do that. Note that the current code does generate unique IDs at cache "compile time", but not at cache "run time".
  - Check text_python.py (1.5.7):
```
    def heading(self, on, depth, **kw):        
        if on:
            code = [
                self.__adjust_language_state(),
                'request.write(%s.heading(%r, %r, **%r))' % (self.__formatter,
                                                             on, depth, kw),
                ]     
            return self.__insert_code(''.join(code))
        else:
            return self.formatter.heading(on, depth, **kw)
```
    Calls to formatter.heading(1, ...) happen at cache runtime. Section numbers are generated on those calls (currently only if section numbers are enabled, but you can change that). You can use the section number as a unique identifier.
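A minimal sketch of the section-number idea, independent of MoinMoin. SectionNumberer and next_id are hypothetical names, and this does not reproduce how the formatter actually tracks section numbers; it only shows that one counter per rendered page, advanced on each heading call, gives ids that cannot repeat within that page.

```python
class SectionNumberer:
    """Hands out ids such as sec1, sec1.1, sec1.2, sec2 in document order."""

    def __init__(self):
        self.counters = []  # one counter per heading depth currently open

    def next_id(self, depth):
        # Leaving a deeper level discards its counters.
        del self.counters[depth:]
        # Entering a deeper level starts new counters at zero.
        while len(self.counters) < depth:
            self.counters.append(0)
        self.counters[depth - 1] += 1
        return "sec" + ".".join(str(n) for n in self.counters)


numberer = SectionNumberer()
for depth in (1, 2, 2, 1):
    print(numberer.next_id(depth))
```

Because the id depends only on the position of the heading in the rendered output, two included pages with identical titles no longer collide, provided next_id runs when the page is sent rather than when the cache is built.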

Examples: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.8

If you included those pages multiple times, you'd have the same problem.

To some extent, it is the responsibility of the wiki editors to take care of those issues and make sure there are no duplicate ids... I suspect similar issues exist with the [[Anchor(Foo)]] macro.

No, it is the responsibility of the wiki engine to create unique ids. Anchors are different - if you let the user add ids to the page, you can't control the output.

Testing

Testing area: wsb on koumbit

See also: MoinMoinBugs/ReImplementCleanerIncludeMacro - meta bug discussing related issues
FeatureRequests/ShorterPageInternalLinks

CategoryFeatureImplemented

MoinMoin: FeatureRequests/NicerHeadingIds (last edited 2008-06-18 10:55:48 by JohannesBerg)