Fix error-path leak in CPU JPEG decode (setjmp/longjmp)#9429
Open

MPSFuzz wants to merge 2 commits into pytorch:main from
Conversation
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9429

Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Contributor (Author):
Following the suggestion in #9423, I applied the same error-path handling/refactoring pattern to the PNG decode path as well.

I tested this with `python3 poc_png.py torchvision_png_leak_case --iters 100000 --report-every 10000`.
With the rebuilt patched `torchvision`, RSS stays flat and `delta_kb` no longer increases:
```
[INFO] input_file=torchvision_png_leak_case
[INFO] input_size=297
[INFO] iters=100000, report_every=10000
[INFO] start_rss_kb=627612
[REPORT] iter=1 rss_kb=627612 delta_kb=0 elapsed=0.00s
[REPORT] iter=10000 rss_kb=627612 delta_kb=0 elapsed=0.15s
[REPORT] iter=20000 rss_kb=627612 delta_kb=0 elapsed=0.31s
[REPORT] iter=30000 rss_kb=627612 delta_kb=0 elapsed=0.47s
[REPORT] iter=40000 rss_kb=627612 delta_kb=0 elapsed=0.63s
[REPORT] iter=50000 rss_kb=627612 delta_kb=0 elapsed=0.78s
[REPORT] iter=60000 rss_kb=627612 delta_kb=0 elapsed=0.93s
[REPORT] iter=70000 rss_kb=627612 delta_kb=0 elapsed=1.08s
[REPORT] iter=80000 rss_kb=627612 delta_kb=0 elapsed=1.24s
[REPORT] iter=90000 rss_kb=627612 delta_kb=0 elapsed=1.38s
[REPORT] iter=100000 rss_kb=627612 delta_kb=0 elapsed=1.53s
[DONE] end_rss_kb=627612 total_delta_kb=0
```
For comparison, in the other environment without the updated build, the same PoC still shows near-linear RSS growth:
```
[INFO] input_file=torchvision_png_leak_case
[INFO] input_size=297
[INFO] iters=100000, report_every=10000
[INFO] start_rss_kb=637760
[REPORT] iter=1 rss_kb=637760 delta_kb=0 elapsed=0.00s
[REPORT] iter=10000 rss_kb=654504 delta_kb=16744 elapsed=0.18s
[REPORT] iter=20000 rss_kb=670344 delta_kb=32584 elapsed=0.37s
[REPORT] iter=30000 rss_kb=686184 delta_kb=48424 elapsed=0.55s
[REPORT] iter=40000 rss_kb=701760 delta_kb=64000 elapsed=0.73s
[REPORT] iter=50000 rss_kb=717600 delta_kb=79840 elapsed=0.91s
[REPORT] iter=60000 rss_kb=733440 delta_kb=95680 elapsed=1.10s
[REPORT] iter=70000 rss_kb=749280 delta_kb=111520 elapsed=1.29s
[REPORT] iter=80000 rss_kb=764856 delta_kb=127096 elapsed=1.47s
[REPORT] iter=90000 rss_kb=780696 delta_kb=142936 elapsed=1.65s
[REPORT] iter=100000 rss_kb=796536 delta_kb=158776 elapsed=1.83s
[DONE] end_rss_kb=796536 total_delta_kb=158776
```

poc_png.py:

```python
#!/usr/bin/env python3
import sys
import time
import argparse

import torch
from torchvision.io import ImageReadMode
from torchvision.io.image import decode_png


def get_rss_kb():
    with open("/proc/self/status", "r", encoding="utf-8") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", help="path to input sample")
    parser.add_argument("--iters", type=int, default=100000, help="number of iterations")
    parser.add_argument(
        "--report-every", type=int, default=1000, help="report RSS every N iterations"
    )
    args = parser.parse_args()

    with open(args.input_file, "rb") as f:
        data = f.read()
    u8 = torch.frombuffer(data, dtype=torch.uint8)

    rss0 = get_rss_kb()
    t0 = time.time()
    print(f"[INFO] input_file={args.input_file}")
    print(f"[INFO] input_size={len(data)}")
    print(f"[INFO] iters={args.iters}, report_every={args.report_every}")
    print(f"[INFO] start_rss_kb={rss0}")

    for i in range(1, args.iters + 1):
        try:
            decode_png(u8, mode=ImageReadMode.UNCHANGED)
        except Exception:
            pass
        if i == 1 or i % args.report_every == 0 or i == args.iters:
            rss = get_rss_kb()
            print(
                f"[REPORT] iter={i} rss_kb={rss} delta_kb={rss - rss0} elapsed={time.time() - t0:.2f}s",
                flush=True,
            )

    rss1 = get_rss_kb()
    print(f"[DONE] end_rss_kb={rss1} total_delta_kb={rss1 - rss0}")


if __name__ == "__main__":
    main()
```
Refs #9383
Thanks for the feedback and suggestions! Following the guidance provided, I've refactored the code to improve its structure. The setjmp/longjmp (libjpeg) region is now separated from potentially-throwing Torch operations. Specifically, the libjpeg decoding path is now encapsulated in the helper function `decode_jpeg_hwc_impl`, which returns an HWC tensor and optionally reads the EXIF orientation. The outer `decode_jpeg` function performs the `permute()` and EXIF transformation after libjpeg has finished and the decompression object has been destroyed, ensuring that throwing operations stay outside the longjmp zone.

I have validated this fix with Valgrind on several broken JPEG files, including those from the PoC case mentioned in #9383 (https://github.com/MPSFuzz/images/tree/master/torch_poc) as well as additional files in `test/assets/damaged_jpeg`. Below are the leak summaries detected by Valgrind:

Case 1:

Case 2:

Official assert:

The "possibly lost" memory comes from allocations Valgrind cannot fully track (as expected); nothing is definitely lost.