store host-related information to help get rid of duplicated URLs and to optimize downloading (see the record sketch after the list):
- host normalization, e.g. a 301 redirect decides that the target host is the winner; this info is used to normalize URLs and remove dups.
- host robots.txt info.
- host properties:
  - friendliness, how the host behaves, based on previous crawling.
  - stability, ranking helps here, also derived from previous crawling.
  - ranking, from other sources or assigned.
- other info, e.g. IP address caching.
- timestamp to decide record freshness.
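
A minimal sketch of what a per-host record might hold, assuming Python; the field names below are placeholders to be settled in the sub-issues, not the final schema:

```python
from dataclasses import dataclass, field
import time

@dataclass
class HostInfo:
    host: str                       # normalized host name
    canonical_host: str = ""        # winner host after 301 normalization, "" if the host is its own winner
    robots_txt: str = ""            # cached robots.txt body (or parsed rules)
    friendliness: float = 0.0       # host behavior observed in previous crawls
    stability: float = 0.0          # derived from ranking and previous crawls
    ranking: float = 0.0            # from other sources, or assigned
    ip_addresses: list = field(default_factory=list)       # cached DNS results
    updated_at: float = field(default_factory=time.time)   # record freshness timestamp
```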
design:
- this table is stored in Redis.
- it shall sync with the same table in HBase.
- it might load on demand from HBase to Redis (see the sketch below).
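
A rough sketch of the Redis-first, HBase-backed lookup, assuming redis-py and happybase; the key layout (`host:<name>`), the HBase table name (`host_info`), and the column family (`info`) are placeholders, not the final design:

```python
import json
import redis
import happybase

r = redis.Redis(host="localhost", port=6379)
hbase = happybase.Connection("localhost")
hbase_table = hbase.table("host_info")

def get_host_info(host: str) -> dict:
    """Return host info from Redis; fall back to HBase and cache on miss."""
    key = f"host:{host}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    # Load on demand from HBase (column family 'info' is a placeholder).
    row = hbase_table.row(host.encode())
    info = {k.decode().split(":", 1)[1]: v.decode() for k, v in row.items()}
    # Write back to Redis with a TTL so stale records eventually refresh.
    r.set(key, json.dumps(info), ex=3600)
    return info

def put_host_info(host: str, info: dict) -> None:
    """Write host info to Redis and sync it to HBase."""
    r.set(f"host:{host}", json.dumps(info), ex=3600)
    hbase_table.put(host.encode(),
                    {f"info:{k}".encode(): str(v).encode() for k, v in info.items()})
```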
The implementation will proceed in stages, and this issue will be split into sub-issues.