Tie CDXJ fields to WARC / HTTP headers #162

@extua


Section 6 of the CDXJ spec defines the fields to be included in a JSON block as url, digest, mime, payload, filename, offset, length, and status. It might be useful to document where each of these can be found when parsing a WARC file, as some come from the WARC record header and some from the HTTP header.

From what I can see from looking through the CDXJ-indexer code, the fields map as follows:

- url: WARC-Target-URI
- digest: WARC-Payload-Digest
- mime: "warc/revisit" if the WARC record type (WARC-Type) is 'revisit', otherwise the value of the HTTP Content-Type header.
- filename: WARC-Filename
- offset: Is this calculated by counting through the WARC file byte by byte? I can't find it in the WARC spec.
- length: Content-Length from the WARC header
- status: Status code from the HTTP header
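To make the proposed mapping concrete, here is a minimal Python sketch of it. It is only an illustration of my reading above, not the CDXJ-indexer implementation. It assumes an uncompressed WARC held in memory, assumes that offset means the byte position at which a record starts, and emits every value as a string. The names `iter_records` and `cdxj_block` are hypothetical, and the `payload` field is left out because I have not mapped it.

```python
import json


def iter_records(data: bytes):
    """Yield (offset, warc_headers, block) for each record.

    Assumption: `data` is an uncompressed WARC, so a record's offset is
    simply the byte position at which its "WARC/x.y" version line starts.
    """
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in lines[1:]:  # lines[0] is the version line
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        block_start = header_end + 4
        block_len = int(headers["Content-Length"])
        yield pos, headers, data[block_start:block_start + block_len]
        # Each record block is followed by two CRLFs.
        pos = block_start + block_len + 4


def cdxj_block(offset, warc_headers, block, filename):
    """Build the JSON block using the mapping proposed in this issue."""
    head, _, _ = block.partition(b"\r\n\r\n")
    http_lines = head.decode("iso-8859-1").split("\r\n")
    status = None
    http_headers = {}
    if http_lines[0].startswith("HTTP/"):
        status = http_lines[0].split(" ")[1]
        for line in http_lines[1:]:
            name, _, value = line.partition(":")
            http_headers[name.strip().lower()] = value.strip()

    if warc_headers.get("WARC-Type") == "revisit":
        mime = "warc/revisit"
    else:
        # Drop parameters such as "; charset=utf-8".
        mime = http_headers.get("content-type", "").split(";")[0].strip()

    fields = {
        "url": warc_headers.get("WARC-Target-URI"),
        "digest": warc_headers.get("WARC-Payload-Digest"),
        "mime": mime,
        # Assumption: fall back to the caller's file name when the record
        # carries no WARC-Filename header.
        "filename": warc_headers.get("WARC-Filename", filename),
        "offset": str(offset),
        # Follows the mapping above; an indexer may measure length differently.
        "length": warc_headers.get("Content-Length"),
        "status": status,
    }
    return json.dumps(fields)
```

Calling `cdxj_block(offset, headers, block, "example.warc")` for each tuple yielded by `iter_records(data)` gives one JSON block per record. A gzip-compressed WARC would need offsets counted in the compressed file, which this sketch does not attempt.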
