Section 6 of the CDXJ spec defines the fields to be included in a JSON block as url, digest, mime, payload, filename, offset, length, and status. It might be useful to document where each of these can be found when parsing a WARC file, as some of them come from the WARC header and some from the HTTP header.
From what I can see from looking through the CDXJ-indexer code, the fields map as follows:
- url
  - WARC-Target-URI
- digest
  - WARC-Payload-Digest
- mime
  - If the WARC record type is 'revisit', the type should be "warc/revisit"; otherwise use the HTTP Content-Type header.
- filename
  - WARC-Filename
- offset
  - Is this calculated by counting through the WARC file byte by byte? I can't find it in the WARC spec.
- length
  - WARC Content-Length
- status
  - Status code from the HTTP header
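To make the mapping above concrete, here is a rough Python sketch of how it could be applied to a single uncompressed WARC record. It is not taken from the CDXJ-indexer code: `read_headers`, `cdxj_fields`, and the sample record are hypothetical names and data made up for illustration. It assumes that offset is the byte position at which the record starts in the file, that offset and length are emitted as integers, and that the file is not gzip-compressed. The payload field is left out because the mapping above does not cover it.

```python
import io
import json


def read_headers(stream):
    """Read a first line plus 'Name: value' lines, stopping at the first blank line."""
    first_line = stream.readline().decode("utf-8", "replace").strip()
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()
    return first_line, headers


def cdxj_fields(stream):
    """Build the JSON block for the record starting at the stream's current position."""
    offset = stream.tell()  # assumption: offset = byte position where the record starts
    _version, warc = read_headers(stream)  # "WARC/1.0" line, then the WARC header fields
    block = stream.read(int(warc["content-length"]))
    status_line, http = read_headers(io.BytesIO(block))  # e.g. "HTTP/1.1 200 OK"
    parts = status_line.split()
    is_http = len(parts) > 1 and parts[0].startswith("HTTP/")
    is_revisit = warc.get("warc-type") == "revisit"
    return {
        "url": warc.get("warc-target-uri"),
        "digest": warc.get("warc-payload-digest"),
        "mime": "warc/revisit" if is_revisit else http.get("content-type"),
        "filename": warc.get("warc-filename"),
        "offset": offset,
        "length": int(warc["content-length"]),  # per the mapping listed above
        "status": parts[1] if is_http else None,
    }


# Made-up sample record, for illustration only.
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Payload-Digest: sha1:EXAMPLEDIGEST\r\n"
    b"WARC-Filename: example.warc\r\n"
    b"Content-Length: " + str(len(http_block)).encode() + b"\r\n"
    b"\r\n" + http_block + b"\r\n\r\n"
)

fields = cdxj_fields(io.BytesIO(record))
print(json.dumps(fields))
```

For a file with several records, the same function could be called in a loop after skipping the two CRLFs that end each record, so `stream.tell()` would give each record's starting offset. For gzip-compressed WARC files the offset would presumably have to be measured in the compressed file, which this sketch does not attempt.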