These notes represent FlorianFesti & LionKimbro's work making CommunityWiki:MachineCodeBlocks. They are hosted here on the MoinMoin wiki because this wiki is regularly read by FlorianFesti. (I understand that this is probably okay with ThomasWaldmann.)
For more on machine code blocks and past notes, see CommunityWiki:MachineCodeBlocks.
Machine Code Blocks Specification
Abstract: The "machine code blocks" format is a way to encode key-value pairs on wiki pages and within text files. The format includes the ability to link across the web, and is made to work well with most wikis.
Format
A page can contain several machine code blocks.
The block is written in the following form:
MACHINECODEBLOCK ::= START LINE* END
START ::= "BEGINBLOCK"
END ::= "ENDBLOCK"
LINE ::= WS KEY WS VALUE
VALUE ::= DELIM1 PLAINVALUE DELIMEND WS
        | DELIM2 MACHINECODEBLOCK WS DELIMEND
        | DELIM3 RAWHTMLVALUE DELIMEND
KEY ::= ([:alphanum:]|_)*
DELIM1 ::= ":"
DELIM2 ::= "?"
DELIM3 ::= "$"
DELIMEND ::= ";"
WS ::= (\s*)
PLAINVALUE ::= ([^;]|;;)*
RAWHTMLVALUE ::= ((<[^>]*?>|[^;]|;;)*)
It gets problematic if you want to use the wiki output directly as an HTML fragment. The problem is that you don't know how to treat the HTML you find there. We could introduce a special marker that allows the HTML to be used directly without unquoting.
XXX/TODO: Rewrite HTML mode to fit this.
We talked about this, but it is not stated here: in RAWHTMLVALUE, semicolons that are part of entities are interpreted as part of the HTML. The semicolon in an entity is not a delimiter.
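The grammar above can be sketched as a small recursive parser for the plain-text case. This is only a sketch under assumptions: WS is read as optional whitespace, only ":" (plain) and "?" (nested block) values are handled, "$" raw HTML values are left out, and surrounding whitespace is stripped from plain values. The name parse_block is made up for illustration.

```python
import re

_BEGIN = re.compile(r"\s*BEGINBLOCK\b")
_END = re.compile(r"\s*ENDBLOCK\b")
_KEY = re.compile(r"\s*(\w*)\s*")               # WS KEY WS
_PLAIN = re.compile(r"((?:[^;]|;;)*);", re.S)   # PLAINVALUE DELIMEND
_NESTED_END = re.compile(r"\s*;")               # WS DELIMEND


def parse_block(text, pos=0):
    """Parse one block starting at pos; return (block, position after it)."""
    m = _BEGIN.match(text, pos)
    if not m:
        raise ValueError("expected BEGINBLOCK at offset %d" % pos)
    pos = m.end()
    block = {}
    while True:
        m = _END.match(text, pos)
        if m:
            return block, m.end()
        m = _KEY.match(text, pos)
        key, pos = m.group(1), m.end()
        delim = text[pos:pos + 1]
        if delim == ":":
            m = _PLAIN.match(text, pos + 1)
            if not m:
                raise ValueError("unterminated value for key %r" % key)
            # Assumption: surrounding whitespace is not part of the value.
            value = m.group(1).replace(";;", ";").strip()
        elif delim == "?":
            value, end = parse_block(text, pos + 1)
            m = _NESTED_END.match(text, end)
            if not m:
                raise ValueError("missing ';' after nested block %r" % key)
        else:
            raise ValueError("unexpected input at offset %d" % pos)
        pos = m.end()
        # Each key is bound to a list; repeated keys append to it.
        block.setdefault(key, []).append(value)
```

Calling parse_block on a text that starts with BEGINBLOCK returns the block as a dict of lists, plus the offset where parsing stopped, so a page with several blocks can be read by calling it repeatedly.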
Page Processing
HTML Mode
Idea: Strip out most XML/HTML tags, because wikis (and other engines) insert a lot of tags.
There are two modes:
- unquoting mode
  - ignore all tags
  - unquote HTML entities
- raw mode
  - keep the tags and entities
  - restrict the search for the end delimiter (";") to text nodes
An MCB parser MUST check whether the document is HTML or plain text. If it is HTML, it must parse the HTML file and start in "unquoting" mode. In plain text files, raw mode must be used. "RAWHTMLVALUE"s are always parsed in raw mode. In HTML documents, the parser must switch back to unquoting mode after the value has ended.
In both modes, two directly adjacent semicolons (";;") are treated as one semicolon within the value. Semicolons that are part of HTML tags or entities belong to those and therefore must not be treated as the end delimiter; they also cannot form part of a ";;" pair.
On the one hand, there is redundancy (the end delimiter is ;\n); on the other hand, you cannot represent "ä;" as "&auml;;". Thus it is obviously broken.
Is there a difference between raw mode and plain text? Should raw mode do HTML processing while text mode does not?
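The two modes described above could be sketched with a simple tokenizer. This is an illustration only, not a full HTML parser: the function name read_value and the token regular expression are assumptions, and tags are recognised with a naive pattern.

```python
import html
import re

# One token is an HTML tag, an entity (including its own ";"), an escaped
# ";;", a lone ";" (the end delimiter), or any other single character.
_TOKEN = re.compile(r"<[^>]*>|&#?\w+;|;;|;|.", re.S)


def read_value(markup, pos=0, raw=False):
    """Read one value; return (value, position after the end delimiter)."""
    out = []
    for m in _TOKEN.finditer(markup, pos):
        tok = m.group()
        if tok == ";":
            # A lone semicolon outside tags and entities ends the value.
            return "".join(out), m.end()
        if tok == ";;":
            out.append(";")              # both modes: ";;" is one semicolon
        elif tok.startswith("<") and len(tok) > 1:
            if raw:
                out.append(tok)          # raw mode keeps tags
        elif tok.startswith("&") and len(tok) > 1:
            # raw mode keeps entities, unquoting mode decodes them
            out.append(tok if raw else html.unescape(tok))
        else:
            out.append(tok)
    raise ValueError("no end delimiter found")
```

Because tags and entities are matched before a bare ";", a semicolon inside either one is never taken as the end delimiter and never pairs up with a following semicolon.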
Consult the Unicode tables to identify whitespace and alphanumeric characters:
- Alphabetic characters, or use u'string'.isalnum()
- White_Space characters, or use u'string'.isspace()
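In Python, the two string methods mentioned above are enough for a first approximation of the KEY and WS character classes. The helper names below are hypothetical, and Python's notion of "alphanumeric" and "whitespace" may not match the Unicode property tables exactly.

```python
def is_key_char(ch):
    """True if ch may appear in a KEY: alphanumeric or underscore."""
    return ch == "_" or ch.isalnum()


def is_ws(ch):
    """True if ch counts as whitespace according to str.isspace()."""
    return ch.isspace()


def is_valid_key(key):
    """True if every character of key is a KEY character."""
    return all(is_key_char(ch) for ch in key)
```

Note that under this reading a key such as community_name is valid, while community-name is not, because the dash is neither alphanumeric nor an underscore.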
Interpretation
Each key is bound to a list of strings. Key definitions append strings to the end of the list. XXX Blocks!
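A minimal sketch of this interpretation, assuming the parser hands over (key, value) pairs in document order. The function name interpret is made up for illustration, and nested blocks are not treated specially here.

```python
from collections import defaultdict


def interpret(pairs):
    """Bind each key to the list of its values, in order of appearance."""
    bindings = defaultdict(list)
    for key, value in pairs:
        bindings[key].append(value)   # a repeated key appends to its list
    return dict(bindings)
```

For example, two "author" lines and one "title" line yield a dict with a two-element list under "author" and a one-element list under "title".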
Machine Code Block Schema
After some thought and a deeper look at RDF, I think we should not create our own schema format. Schemas are always complicated and hard to read and write. Any schema format we could invent wouldn't be much easier than RDF Schema, so I propose to use RDF Schema (RDFS) internally. See /RdfIntegration for details. -- FlorianFesti 2005-06-09 10:37:06
per schema:
- id: ID for addressing the block, no dashes allowed
- type: Schema;
- label: (string);
- attribute: BLOCK(Attribute);
per attribute:
- "key" - the key that shows up in a machine code block
- "label" - may be a longer name than the key, including white space
- "description" - human readable description, may be a paragraph long
- "required" - 1 (required) or 0 (optional)
- "multiple" - 1 (multiple values allowed) or 0 (single value only)
- "type"
see /MetaSchema
Schema Example:
BEGINBLOCK
id: #community;
type: Schema;
label: Internet Community;
attribute $ BEGINBLOCK
  key: community-name;
  label: Community Name;
  description: Name of the Internet Community;
  required: 1;
  multiple: 0;
  type: string;
ENDBLOCK
attribute $ BEGINBLOCK
  key: community-member;
  label: Community Member;
  description: Block representing a member of the Internet Community;
  required: 0;
  multiple: 1;
  type: block; <----- this is the right name, right?
ENDBLOCK
... and so on and so forth ...
ENDBLOCK
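To show how the "required" and "multiple" attributes might be used, here is a rough validation sketch. It assumes a block is already available as a dict of lists and that each schema attribute has been reduced to a plain dict; the function name validate and the message texts are invented for this example.

```python
def validate(block, attributes):
    """Return a list of problems found when checking block against a schema."""
    problems = []
    for attr in attributes:
        values = block.get(attr["key"], [])
        if attr["required"] and not values:
            problems.append("missing required key: " + attr["key"])
        if not attr["multiple"] and len(values) > 1:
            problems.append("multiple values not allowed: " + attr["key"])
    return problems


# The two attributes from the schema example, reduced to the checked fields.
community_attributes = [
    {"key": "community-name", "required": 1, "multiple": 0},
    {"key": "community-member", "required": 0, "multiple": 1},
]
```

An empty result means the block satisfies the checked constraints; the "type" attribute is not checked in this sketch.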
Types
Literal Types
May be present within a block as value. They are detected by the parser.
- _string (HTML tags are ignored, HTML entities unquoted)
- _raw (becomes an HTML fragment)
- _block
Real Types
Types further restrict the values of an attribute and give them an interpretation. Some types define translations that have to be applied to the literal values.
Is every _raw value also a _string, with only the read process being different?
- string - use as is
  - _string: the text of the string
  - _raw: default to _string
  - _block: error/ignore/decay
- number - try to convert to a float, use the string if this fails
  - _string: as string, then convert using \s*(\d+)\s* or \s*(.+)\s* (not sure of the syntax)
  - _raw: default to _string
- date - try to convert, use the string if this fails (XXX look for a more precise definition)
  - _string: ISO 8601
- URL - to a webpage, conventionally understood by web browsers and programs
  - _string: href replacement, accept first URL
  - _raw: accept first URL
- HTML fragment - use as is
  - as string
- Machine code block (of a given type, identified by, ...)
  - _string: as URL, then load block from URL and assign it
  - _block: nested block (the text describing the nested block)
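The "try to convert, fall back to the string" rule for number and date could look roughly like this. The number pattern is an assumption (optional sign, digits, optional decimal part), since the exact syntax is still open above, and the date conversion only accepts what Python's date.fromisoformat accepts, which is narrower than full ISO 8601 on older Python versions.

```python
import re
from datetime import date

# Assumed number syntax: optional sign, digits, optional decimal fraction.
_NUMBER = re.compile(r"\s*([+-]?\d+(?:\.\d+)?)\s*")


def to_number(text):
    """Return a float if text looks like a number, else the original string."""
    m = _NUMBER.fullmatch(text)
    return float(m.group(1)) if m else text


def to_date(text):
    """Return a date for an ISO-style calendar date, else the original string."""
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return text
```

Both helpers return the unchanged input when conversion fails, so a caller can tell the cases apart by checking the type of the result.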
TODO
- define block identification & addressing
  - use id attribute
  - addressing by hash (http://example.com/#foo identifies block "id:foo" on page http://example.com/)
  - drill down with -'s to identify particular elements
  - name spaces?
- define schema (see /MetaSchema)
- define internal representation
  - dicts of lists of values (dicts, strings)
- mapping to RDF (see /RdfIntegration)
- longer term:
  - write formal spec
  - write implementation
Addressing
lk: We could say that machinecodeblocks are named with only letters, and then use reserved characters for going in deep.
(leapfrogging from page to page to other block on page to item within list to page to...)
F: Nice idea. But we need only one step right now.
lk: Well, if you're nesting more than one level deep, ...
This is no problem if each block has a name. So we have a reserved attribute "id" and you access example.com/MyBlocks#thridsubblockofweiredblock
hm, but then other people can't address a block if you didn't name it.
You can still leapfrog. But this has to be done by the application and not by the standard, included URL mangling.
okay- we use a combination of allowed leapfrogging, and recommended naming.
Names must be unique on the page. Yes, of course, they are URLs/URIs.
We should probably recommend that the names attempt to be different than any <a name>'s that might be on the page itself, too.
Though, in fact, they can co-exist, if an <a name> collides.. (..!)
I can imagine a "smart wiki" that would identify MCB's and notice their names and then generate <a name> tags around the name: keys. So that a named MCB would also be a valid identifier to web browsers as well. Yes.
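Addressing a named block by URL fragment, as discussed above, might be sketched like this. It assumes the page's blocks have already been parsed into dicts of lists and that the reserved "id" key holds the block name without the leading "#"; the function names are hypothetical.

```python
from urllib.parse import urldefrag


def split_block_url(url):
    """Split a block URL into (page URL, block id)."""
    page, block_id = urldefrag(url)
    return page, block_id


def find_block(blocks, block_id):
    """Return the first block on a page whose id matches, or None."""
    for block in blocks:
        if block_id in block.get("id", []):
            return block
    return None
```

Leapfrogging into unnamed or nested blocks is left to the application here; this sketch only covers the single named step.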
Wiki Engine Description Schema
Wiki Description Schema
Future ideas:
Machine Code Block Web Service
Define an XML-RPC interface that maps block URLs to real blocks. Define how the blocks are returned.
- Dicts for blocks, with lists for the values.
- Retrieve all blocks recursively? Depth limit? Links only? Links for external blocks only?