Skip to content

Support payload digest of revisit records #15

@Arkiver2

Description

@Arkiver2

Currently warcat gives the following error on revisit records from a deduplicated WARC:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action
    action(record)
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 298, in verify_payload_digest
    raise VerifyProblem('Bad payload digest.', '5.9')
warcat.tool.VerifyProblem: ('Bad payload digest.', '5.9', True)

The payload digest of a revisit record should be the payload digest of the record the revisit record points to, see 6.7.2 on page 15 (page 21 in the PDF) on http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf:

To report the payload digest used for comparison, a 'revisit' record using this profile shall include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload.
(...)
For records using this profile, the payload is defined as the original payload content whose digest value was unchanged.

Currently warcat reports an error for the payload digest, it would be nice if it would check the WARC for the record the revisit record refers to. If that record is in the WARC, compare the payload digest with that. If the record is not in the WARC, throw a warning or info that the record the revisit record refers to is not in the WARC.

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions