Description
Currently warcat gives the following error on revisit records from a deduplicated WARC:
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action
    action(record)
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 298, in verify_payload_digest
    raise VerifyProblem('Bad payload digest.', '5.9')
warcat.tool.VerifyProblem: ('Bad payload digest.', '5.9', True)
The payload digest of a revisit record should be the payload digest of the record that the revisit record points to. See section 6.7.2 on page 15 (page 21 of the PDF) of http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf:
To report the payload digest used for comparison, a 'revisit' record using this profile shall include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload.
(...)
For records using this profile, the payload is defined as the original payload content whose digest value was unchanged.
Currently warcat reports an error for the payload digest of such records. It would be nice if it instead looked in the WARC for the record that the revisit record refers to. If that record is in the WARC, compare the revisit record's payload digest with that record's payload digest. If the record is not in the WARC, emit a warning or info message saying that the record the revisit record refers to is not in the WARC.
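The check proposed above could look roughly like the sketch below. It does not use warcat's actual API: records are represented as plain dicts of WARC header fields, and the function name `check_revisit_digests` is hypothetical. It also assumes that the revisit record names its target in the WARC-Refers-To header and that comparing the two WARC-Payload-Digest header values as strings is sufficient; a real implementation might recompute the digest from the referenced record's payload instead.

```python
from typing import Dict, Iterable, List, Tuple

Headers = Dict[str, str]  # one record's WARC header fields (illustrative stand-in)


def check_revisit_digests(records: Iterable[Headers]) -> List[Tuple[str, str, str]]:
    """Return (level, record_id, message) findings for revisit records.

    Hypothetical helper, not part of warcat.
    """
    records = list(records)

    # Index the payload digests of all non-revisit records by record ID.
    digests = {
        r["WARC-Record-ID"]: r.get("WARC-Payload-Digest")
        for r in records
        if r.get("WARC-Type") != "revisit" and "WARC-Record-ID" in r
    }

    findings = []
    for r in records:
        if r.get("WARC-Type") != "revisit":
            continue
        record_id = r.get("WARC-Record-ID", "<unknown>")
        target = r.get("WARC-Refers-To")
        if target not in digests:
            # Referenced record is absent: warn instead of failing verification.
            findings.append(
                ("warning", record_id, "referenced record is not in this WARC")
            )
        elif digests[target] != r.get("WARC-Payload-Digest"):
            findings.append(
                ("error", record_id, "payload digest differs from referenced record")
            )
    return findings
```

With this approach, a revisit record whose digest matches the record it refers to produces no finding, a mismatch is reported as an error, and a revisit record whose target lives in a different WARC only produces a warning.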