XPath results contain namespace in the keys #20

@aemreunal

Hello,

First of all, commendable job. Thank you for your work.

I'm working on a Jupyter notebook, which will be a tutorial on how to use Riko to access unstructured website data in a structured manner. When I finish it, I will send you a pull request with the notebook (or get it to you in an alternative way), as I think it could be a great beginner's guide for everyone who'd like to use Riko.

While preparing the notebook, I ran into an interesting situation: when I parse <li> elements using xpathfetchpage and those elements have other elements nested underneath them, the keys for the nested elements carry a strange {http://www.w3.org/1999/xhtml} prefix. The following code snippet illustrates it:

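# Import path may differ by riko version (older releases: riko.collections.sync)
from riko.collections import SyncPipe

url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'
xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'
xpath_conf = {'xpath': xpath, 'url': url}
flow_main = SyncPipe('xpathfetchpage', conf=xpath_conf)
# Print only the first matched item
print(next(flow_main.output))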

This prints:

{
    u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/', 
    u'{http://www.w3.org/1999/xhtml}p': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n', 
    u'{http://www.w3.org/1999/xhtml}span': {
        u'content': u'16 Ekim 2016', 
        u'class': u'date'
    }, 
    u'title': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n'
}

for the fetched structure:

<a href="http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/" title="GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın">
    <p>GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın</p> 
    <span class="date">16 Ekim 2016</span>
</a>

(This page is updated daily, so the exact output might differ when you run it, but the structure remains the same.)

I was unable to figure out why the nested keys have that '{http://www.w3.org/1999/xhtml}' prefix or how to get rid of it. I understand that it distinguishes a tag's attributes from its nested elements, but maybe there is a flag (that I was unable to find) to retrieve the nested elements as a list under a key like 'child' in the top-level dictionary.
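
As a stopgap, the prefix can be removed after the fact. The notation looks like the `{namespace-uri}localname` form that Python's ElementTree uses for namespaced tags, so a minimal sketch, assuming the items are plain nested dicts and lists, is to strip that leading qualifier from every key. `strip_namespaces` is a hypothetical helper written for illustration, not part of riko, and the sample item below uses placeholder values rather than real output.

```python
import re

# Matches a leading "{namespace-uri}" qualifier on a key.
_NS_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_namespaces(item):
    """Recursively drop a leading '{uri}' qualifier from every dict key.

    Hypothetical helper (not a riko API). Non-dict, non-list values are
    returned unchanged.
    """
    if isinstance(item, dict):
        return {
            _NS_PREFIX.sub('', key): strip_namespaces(value)
            for key, value in item.items()
        }
    if isinstance(item, list):
        return [strip_namespaces(value) for value in item]
    return item


# Placeholder item shaped like the output shown above.
item = {
    'href': 'http://example.com/article',
    '{http://www.w3.org/1999/xhtml}p': 'Headline',
    '{http://www.w3.org/1999/xhtml}span': {'content': '16 Ekim 2016', 'class': 'date'},
    'title': 'Headline',
}

cleaned = strip_namespaces(item)
print(sorted(cleaned))  # ['href', 'p', 'span', 'title']
```

One caveat with this approach: if an element has an attribute and a child element with the same name (for example a `title` attribute and a nested `<title>`), the stripped keys collide and one value silently overwrites the other, so it is only safe where such overlaps cannot occur.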

Thank you for your assistance.
