Expansion: XPath Template Store #9

@mfan

The XPath Template Server drives both expansion (based on list templates) and content extraction (based on content templates).

Two kinds of templates are kept in the store (Redis):

  • list templates. These templates extract more link URLs from a page. The URLs are then used to crawl deeper into more pages. For example, a list template could be applied to "category listing" pages, "related contents" pages, "most popular items" pages, etc.
  • content templates. These templates extract one or more entities from a page. The extracted data is structured and can be added to or merged into an existing database.
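
As a rough illustration of the two template kinds, here is a minimal sketch. Everything in it is an assumption made for the example: the key layout (`template:<kind>:<domain>`), the JSON template shape, and the helper names are hypothetical. A plain dict stands in for Redis, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for a full XPath engine such as lxml.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Stand-in for the Redis store: key -> JSON string.
# The key layout and the template fields are hypothetical.
STORE = {
    "template:list:example.com": json.dumps(
        {"xpath": ".//ul[@class='items']/li/a", "attr": "href"}
    ),
    "template:content:example.com": json.dumps(
        {
            "entity": ".//div[@class='product']",
            "fields": {"name": "./h2", "price": "./span[@class='price']"},
        }
    ),
}


def load_template(kind, domain):
    """Fetch and decode one template from the stand-in store."""
    return json.loads(STORE[f"template:{kind}:{domain}"])


def extract_links(root, template, base_url):
    """Apply a list template: collect absolute URLs to crawl next."""
    links = []
    for node in root.findall(template["xpath"]):
        value = node.get(template["attr"])
        if value:
            links.append(urljoin(base_url, value))
    return links


def extract_entities(root, template):
    """Apply a content template: one dict per matched entity node."""
    entities = []
    for node in root.findall(template["entity"]):
        entity = {}
        for field, path in template["fields"].items():
            found = node.find(path)
            entity[field] = (found.text or "").strip() if found is not None else None
        entities.append(entity)
    return entities


PAGE = """<html><body>
<ul class="items">
  <li><a href="/item/1">One</a></li>
  <li><a href="/item/2">Two</a></li>
</ul>
<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
</body></html>"""

root = ET.fromstring(PAGE)
links = extract_links(root, load_template("list", "example.com"), "http://example.com/list")
entities = extract_entities(root, load_template("content", "example.com"))
```

Keeping each template as a small JSON value under a per-domain key would let the crawler look up both kinds with a single key fetch each, but the real store layout may differ.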

TODO:

  • Microformats should be supported as one kind of content template. Parsing microformats is supported by the lxml library. We need to keep track of how many sites currently use microformats.
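
To make the microformat idea concrete, here is a small sketch of treating hCard as a built-in content template. It is only an assumption about how this might look: the helper names are hypothetical, only the `fn`, `org`, and `url` properties are handled, nested cards are not considered, and the standard library's `xml.etree.ElementTree` is used instead of lxml so the example stays self-contained.

```python
import xml.etree.ElementTree as ET


def has_class(node, name):
    """True if `name` is one of the space-separated class tokens."""
    return name in (node.get("class") or "").split()


def extract_hcards(root):
    """Collect a dict of selected hCard properties for each class="vcard" element."""
    cards = []
    for node in root.iter():
        if not has_class(node, "vcard"):
            continue
        card = {}
        for child in node.iter():
            for prop in ("fn", "org", "url"):
                if has_class(child, prop) and prop not in card:
                    if prop == "url" and child.get("href"):
                        # For links, the property value is the href.
                        card[prop] = child.get("href")
                    else:
                        card[prop] = "".join(child.itertext()).strip()
        cards.append(card)
    return cards


HCARD_PAGE = """<div>
<div class="vcard">
  <a class="url fn" href="http://example.com/jane">Jane Doe</a>
  <span class="org">Example Corp</span>
</div>
</div>"""

hcards = extract_hcards(ET.fromstring(HCARD_PAGE))
```

Because microformat class names are the same on every site, a template like this would not need to be stored per domain.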
