I will put here the different unanswered questions I have while coding. Feel free to give me a clue if you understand what I am talking about.

XPath

I am trying to use XPath for the unit tests of the converter; however, there are some issues.

Order of the tags

I do not have a solution for a generic XPath test that won't fail on two identical trees like these: "text bold and em" and "text bold and em".

Current solution: it is not worth worrying about at this time.

/!\ Problem unclear.

XML is not HTML (where autoclosing of tags can apply), so if XML is erroneous, nothing should work (and it wouldn't work). -- EugeneSyromyatnikov 2010-05-26 18:17:18

How to use XPath

I made some experiments with XPath; it is not really working well at this time ...

    def test_style_xpath(self):
        test_input = '<page><body><p><span baseline-shift="sub">sub</span>script</p></body></page>'
        # Tag names expected on the way from the root down to the innermost element.
        result_tree = ['div', 'p', 'sub']

        out = self.conv(self.handle_input(test_input), )
        assert out.tag.name in result_tree
        # Descend through the first child element at each level.
        result = list(out.findall(u'*'))
        count = len(result_tree) - 2
        while count > 0:
            assert result[0].tag.name in result_tree
            result = list(result[0].findall(u'*'))
            count = count - 1
        # result[0] is now the innermost element; compare its tag name, not the element itself.
        assert result[0].tag.name == 'sub'

/!\ Problem unclear; please either use the wiki for pasting or a non-dark pastebin.

XPath in EmeraldTree

Here is a test of XPath with EmeraldTree:

    c = Element(u'c', children=(u"Content of C", ))
    b = Element(u'b', children=(c, ))
    a = Element(u'a', children=(b, ))
    tree = ElementTree(a)

    result = list(a.findall(u"a/b/c"))
    assert len(result) == 1

It should normally return the element c, but it does not return anything.

Here is the tree (with XML syntax) :

<a><b><c>Content of C</c></b></a>
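One possible explanation for the empty result, sketched with the standard library's xml.etree.ElementTree as a stand-in (this is an assumption: EmeraldTree is modelled on ElementTree, but its behaviour may differ). In that library a path is evaluated relative to the element it is called on, so calling findall on a with "a/b/c" looks for a child of a that is itself named a.

```python
import xml.etree.ElementTree as ET

# Stand-in tree built with the standard library, not with EmeraldTree.
a = ET.fromstring("<a><b><c>Content of C</c></b></a>")

# The path is resolved against the children of `a`, so "a/b/c" matches nothing.
from_root_name = a.findall("a/b/c")

# Dropping the leading "a", or searching all descendants, finds the element.
relative = a.findall("b/c")
descendants = a.findall(".//c")

print(len(from_root_name), len(relative), len(descendants))
```

If EmeraldTree resolves paths the same way, the test above would need "b/c" rather than "a/b/c".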

I finally found the correct syntax in the ElementTree self-tests: http://svn.effbot.org/public/tags/elementtree-1.3a3-20070912/selftest.py so we have:

    def test_Element_find_child():
        c = Element(u'c', children=(u"Content of C", ))
        b = Element(u'b', children=(c, ))
        a = Element(u'a', children=(b, ))
        tree = ElementTree(a)

        assert tree.find("a/b/c").findtext == 'Content of C'
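For comparison, here is a minimal sketch of the same lookup with the standard library's xml.etree.ElementTree. It assumes EmeraldTree behaves alike, which may not hold: EmeraldTree stores text as children, so its text accessors can differ. In the standard library, findtext is a method that takes a path, and find returns an element (or None) whose text is read through the text attribute.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c>Content of C</c></b></a>")
tree = ET.ElementTree(root)

# ElementTree.find searches relative to the root element `a`.
c = tree.find("b/c")
assert c is not None and c.text == "Content of C"

# findtext is called with a path and returns the text of the first match.
assert tree.findtext("b/c") == "Content of C"

# A path that matches nothing gives None, so chaining on the result would fail.
assert tree.find("a/b/c") is None
```

Comparing the bound method findtext itself to a string is always false in Python, so an assertion written that way cannot pass whatever the tree contains.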

XPath and Namespaces

I tried to use the previous work in the HTML converter, without any success.

Here is the code:

    def test_style_xpath(self):
        test_input = '<page><body><p><span baseline-shift="sub">sub</span>script</p></body></page>'
        out = self.conv(self.handle_input(test_input), )
        tree = ET.ElementTree(out)
        dump(tree)
        assert tree.find('div/p/sub').findtext == 'sub'
        assert tree.find('div/p').findtext == 'p'

Here is a dump of the resulting tree:

<ns0:div xmlns:ns0="http://www.w3.org/1999/xhtml"><ns0:p><ns0:sub>sub</ns0:sub>script</ns0:p></ns0:div>

I think the problem comes from the namespaces. I will try to add some tests for that to EmeraldTree.

After some research, I found why. For an element without a namespace, tag returns the name of the element as a string. For instance:

    a = Element('a')
    print a.tag
    # prints: a

But for an element with a namespace, we get the string representation of a Name object as the tag. For instance:

    # a is an element like this: <ns0:div xmlns:ns0="http://www.w3.org/1999/xhtml"></ns0:div>
    print a.tag
    # prints: {http://www.w3.org/1999/xhtml}div
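The effect on path queries can be sketched with the standard library's xml.etree.ElementTree, used here as a stand-in for EmeraldTree (an assumption; EmeraldTree may handle qualified names differently). The input string is the dump shown earlier. An unqualified step such as p does not match an element whose tag is {http://www.w3.org/1999/xhtml}p, so each step has to carry the namespace.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
dumped = ('<ns0:div xmlns:ns0="%s"><ns0:p><ns0:sub>sub</ns0:sub>script'
          '</ns0:p></ns0:div>' % XHTML)
div = ET.fromstring(dumped)

# The tag carries the namespace URI in braces.
print(div.tag)  # {http://www.w3.org/1999/xhtml}div

# An unqualified path does not match namespaced elements.
assert div.find("p/sub") is None

# Qualify each step, either in brace notation or through a prefix map.
sub = div.find("{%s}p/{%s}sub" % (XHTML, XHTML))
same = div.find("h:p/h:sub", namespaces={"h": XHTML})
assert sub is same and sub.text == "sub"
```

The prefix h is arbitrary; only the URI it maps to has to match the one in the tree.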

XPath with ET vs. XPath with lxml

With ET: we can directly run an XPath query on the tree we obtained from the converter.

With lxml: we must first write the ET tree to a string using the write method from ET, then parse it again with lxml's own parser, then run the XPath query.

I would rather not use lxml, because it adds two steps that we can avoid by using only ET. I think these steps can produce errors that would not happen normally. We would introduce something specific to the tests that will not be used in production.

However, this will not hide the problem, so we can use lxml.
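The lxml round trip could look roughly like the sketch below. It is an illustration under assumptions, not the converter's test code: the helper name xpath_via_lxml is hypothetical, the input is rebuilt from the dump above with the standard library's ElementTree instead of coming from the converter, and ET.tostring replaces the write method so that everything stays in memory. lxml is a third-party package, so the import is guarded.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree as LET  # third-party; optional in this sketch
except ImportError:
    LET = None

XHTML = "http://www.w3.org/1999/xhtml"


def xpath_via_lxml(et_element, expression, namespaces):
    """Serialize an ET element, reparse it with lxml, and run an XPath query."""
    serialized = ET.tostring(et_element)   # step 1: ET tree -> bytes
    reparsed = LET.fromstring(serialized)  # step 2: bytes -> lxml tree
    return reparsed.xpath(expression, namespaces=namespaces)


# Stand-in for the converter output, rebuilt from the dump.
out = ET.fromstring('<div xmlns="%s"><p><sub>sub</sub>script</p></div>' % XHTML)

if LET is not None:
    texts = xpath_via_lxml(out, "/h:div/h:p/h:sub/text()", {"h": XHTML})
    print(texts)
```

The two extra steps are the serialization and the second parse; a failure in either one would be specific to the test setup.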

MoinMoin: ValentinJaniaut/GSoC/ProblemsAndQuestion (last edited 2010-05-26 22:00:04 by ValentinJaniaut)