HPT: Host property table #4

@mfan

Store host-related information to help eliminate duplicate URLs and optimize downloading:

  1. Host normalization. For example, when a host answers with a 301 redirect, the target host is the winner. This information is used to normalize URLs and remove duplicates.
  2. Host robots.txt information.
  3. Host properties:
    • Friendliness: how the host behaved during previous crawls.
    • Stability: derived from previous crawls; ranking also helps here.
    • Ranking: taken from other sources or assigned.
    • Other information, e.g. IP address caching.
  4. A timestamp to decide record freshness.
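
As a rough illustration of the fields listed above, one possible record layout is sketched below in Python. The names `HostRecord`, `canonical_host`, `is_fresh`, and `normalize_url` are hypothetical and not part of any existing code, and the numeric score types are an assumption. The sketch only shows how a recorded 301 winner could be used to rewrite a URL and how the timestamp could gate freshness.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlsplit, urlunsplit
import time


@dataclass
class HostRecord:
    """One hypothetical HPT row, keyed by host name."""
    host: str
    canonical_host: Optional[str] = None  # winner of a 301 redirect, if any
    robots_txt: Optional[str] = None      # raw robots.txt body
    friendliness: float = 0.0             # assumed score from previous crawls
    stability: float = 0.0                # assumed score from previous crawls
    ranking: float = 0.0                  # from other sources or assigned
    ip_address: Optional[str] = None      # cached IP address
    updated_at: float = 0.0               # Unix timestamp of the last refresh

    def is_fresh(self, max_age_seconds: float, now: Optional[float] = None) -> bool:
        """Return True if the record was refreshed within max_age_seconds."""
        now = time.time() if now is None else now
        return (now - self.updated_at) <= max_age_seconds


def normalize_url(url: str, table: Dict[str, HostRecord]) -> str:
    """Rewrite the host of url to its recorded 301 winner, if one is known.

    Simplification: the whole netloc is replaced, so any port or
    user info in the original URL is dropped.
    """
    parts = urlsplit(url)
    record = table.get(parts.hostname or "")
    if record is None or not record.canonical_host:
        return url
    return urlunsplit(parts._replace(netloc=record.canonical_host))


table = {
    "example.com": HostRecord(
        host="example.com",
        canonical_host="www.example.com",
        updated_at=1000.0,
    )
}
normalized = normalize_url("http://example.com/a?b=1", table)
```

With this shape, two URLs that differ only in a redirecting host normalize to the same string, which is what makes duplicate removal a plain string comparison.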

Design:

  • This table is stored in Redis.
  • It shall stay in sync with the same table in HBase.
  • Records might be loaded on demand from HBase into Redis.
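
The design points above could be read as a two-tier lookup. The sketch below uses plain dictionaries as stand-ins for Redis and HBase so that it runs without either service. The class name `HostPropertyTable`, the `cold_reads` counter, and the write-to-both `put` are assumptions for illustration, not a decided design.

```python
from typing import Dict, Optional

Record = Dict[str, str]


class HostPropertyTable:
    """Two-tier table: a hot store in front of a cold store."""

    def __init__(self, cold_store: Dict[str, Record]) -> None:
        self.hot: Dict[str, Record] = {}  # stand-in for Redis
        self.cold = cold_store            # stand-in for HBase
        self.cold_reads = 0               # counts fallbacks to the cold store

    def get(self, host: str) -> Optional[Record]:
        """Return the record for host, loading it into the hot store on a miss."""
        record = self.hot.get(host)
        if record is not None:
            return record
        self.cold_reads += 1
        cold_record = self.cold.get(host)
        if cold_record is None:
            return None
        loaded = dict(cold_record)  # copy so the two tiers do not share state
        self.hot[host] = loaded
        return loaded

    def put(self, host: str, record: Record) -> None:
        """Write to both tiers so they stay in sync (assumed write-through)."""
        self.hot[host] = dict(record)
        self.cold[host] = dict(record)


cold = {"example.com": {"ranking": "5"}}
hpt = HostPropertyTable(cold)
first = hpt.get("example.com")   # miss in hot store, loaded from cold store
second = hpt.get("example.com")  # served from the hot store
```

Writing through on every `put` is the simplest way to keep the two copies identical; a batched or asynchronous sync would be an alternative if write volume to HBase becomes a concern.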

The implementation will proceed in stages, and this issue will be split into sub-issues.
