
Fix error-path leak in CPU JPEG decode (setjmp/longjmp) #9429

Open
MPSFuzz wants to merge 2 commits into pytorch:main from MPSFuzz:fix/jpeg-decode-errorpath-leak


MPSFuzz commented Mar 9, 2026

Refs #9383
Thanks for the feedback and suggestions. Following that guidance, I refactored the code so that the libjpeg setjmp/longjmp region is separated from Torch operations that may throw. The libjpeg decoding path now lives in a helper function, decode_jpeg_hwc_impl, which returns an HWC tensor and optionally reads the EXIF orientation. The outer decode_jpeg function applies permute() and the EXIF transformation only after libjpeg has finished and the decompression object has been destroyed, so every throwing operation runs outside the longjmp region.
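The ordering described above can be sketched in Python as a loose analogy. Python has no setjmp/longjmp, so this only illustrates the two-phase structure: the decoder object is created and destroyed inside a helper, and the layout change runs afterwards. FakeDecoder, decode_hwc_impl, and decode are hypothetical names for illustration, not the torchvision C++ code.

```python
# Illustrative analogy only. FakeDecoder stands in for libjpeg's decompression
# object; none of these names exist in torchvision.


class FakeDecoder:
    live = 0  # number of decoder objects created but not yet destroyed

    def __init__(self, data):
        self.data = data
        FakeDecoder.live += 1

    def read_hwc(self, height, width, channels):
        # A size mismatch plays the role of a libjpeg error (the longjmp case).
        if len(self.data) != height * width * channels:
            raise ValueError("corrupt input")
        it = iter(self.data)
        return [
            [[next(it) for _ in range(channels)] for _ in range(width)]
            for _ in range(height)
        ]

    def destroy(self):
        FakeDecoder.live -= 1


def decode_hwc_impl(data, height, width, channels):
    """Phase 1: only decoder work happens here; the decoder is always destroyed."""
    dec = FakeDecoder(data)
    try:
        return dec.read_hwc(height, width, channels)
    finally:
        dec.destroy()


def decode(data, height, width, channels):
    """Phase 2: post-processing runs after the decoder is already gone."""
    hwc = decode_hwc_impl(data, height, width, channels)
    # HWC -> CHW, the role permute() plays in the description above.
    return [
        [[hwc[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]
```

In this sketch a failure in the post-processing step cannot leave a decoder object alive, because the helper has already returned and cleaned up by then.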

I have validated this fix using Valgrind on several broken JPEG files, including those from the PoC case mentioned in #9383 (https://github.com/MPSFuzz/images/tree/master/torch_poc) as well as additional files in test/assets/damaged_jpeg. Below are the leak summaries detected by Valgrind:

Case 1:

root@f6dd0ec04f4a:~# grep -n "LEAK SUMMARY" -A25 /root/valgrind_case1.log
881500:==2225004== LEAK SUMMARY:
881501-==2225004==    definitely lost: 0 bytes in 1 blocks
881502-==2225004==    indirectly lost: 0 bytes in 0 blocks
881503-==2225004==      possibly lost: 492,322 bytes in 310 blocks
881504-==2225004==    still reachable: 25,377,811 bytes in 101,103 blocks
881505-==2225004==         suppressed: 0 bytes in 0 blocks

Case 2:

root@f6dd0ec04f4a:~# grep -n "LEAK SUMMARY" -A25 /root/valgrind_case2.log
609727:==2232009== LEAK SUMMARY:
609728-==2232009==    definitely lost: 0 bytes in 1 blocks
609729-==2232009==    indirectly lost: 0 bytes in 0 blocks
609730-==2232009==      possibly lost: 492,322 bytes in 310 blocks
609731-==2232009==    still reachable: 25,377,811 bytes in 101,103 blocks
609732-==2232009==         suppressed: 0 bytes in 0 blocks

Official assets (test/assets/damaged_jpeg):

==== /root/vg_bad_huffman.log ====
609717:==2246780== LEAK SUMMARY:
609718-==2246780==    definitely lost: 0 bytes in 1 blocks
609719-==2246780==    indirectly lost: 0 bytes in 0 blocks
609720-==2246780==      possibly lost: 492,322 bytes in 310 blocks
609721-==2246780==    still reachable: 25,377,796 bytes in 101,103 blocks
609722-==2246780==         suppressed: 0 bytes in 0 blocks
609723-==2246780== 
==== /root/vg_corrupt.log ====
609717:==2247747== LEAK SUMMARY:
609718-==2247747==    definitely lost: 0 bytes in 1 blocks
609719-==2247747==    indirectly lost: 0 bytes in 0 blocks
609720-==2247747==      possibly lost: 492,322 bytes in 310 blocks
609721-==2247747==    still reachable: 25,377,802 bytes in 101,103 blocks
609722-==2247747==         suppressed: 0 bytes in 0 blocks
609723-==2247747== 
==== /root/vg_corrupt34_2.log ====
609717:==2248607== LEAK SUMMARY:
609718-==2248607==    definitely lost: 0 bytes in 1 blocks
609719-==2248607==    indirectly lost: 0 bytes in 0 blocks
609720-==2248607==      possibly lost: 492,322 bytes in 310 blocks
609721-==2248607==    still reachable: 25,377,796 bytes in 101,103 blocks
609722-==2248607==         suppressed: 0 bytes in 0 blocks
609723-==2248607== 
==== /root/vg_corrupt34_3.log ====
609717:==2249501== LEAK SUMMARY:
609718-==2249501==    definitely lost: 0 bytes in 1 blocks
609719-==2249501==    indirectly lost: 0 bytes in 0 blocks
609720-==2249501==      possibly lost: 492,322 bytes in 310 blocks
609721-==2249501==    still reachable: 25,377,799 bytes in 101,103 blocks
609722-==2249501==         suppressed: 0 bytes in 0 blocks
609723-==2249501== 
==== /root/vg_corrupt34_4.log ====
609717:==2250387== LEAK SUMMARY:
609718-==2250387==    definitely lost: 0 bytes in 1 blocks
609719-==2250387==    indirectly lost: 0 bytes in 0 blocks
609720-==2250387==      possibly lost: 492,322 bytes in 310 blocks
609721-==2250387==    still reachable: 25,377,787 bytes in 101,103 blocks
609722-==2250387==         suppressed: 0 bytes in 0 blocks
609723-==2250387== 

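In every run, Valgrind reports 0 bytes definitely lost and 0 bytes indirectly lost. The "possibly lost" figure is identical across all runs (492,322 bytes in 310 blocks), so it does not appear to depend on the input being decoded; it comes from untracked allocations, as expected.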
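When there are many logs, a check like the one above can be scripted. The following is a minimal sketch that parses the LEAK SUMMARY lines in the format shown in this comment (including the grep line-number prefix) and flags any non-zero "definitely lost" or "indirectly lost" byte count. The function names are hypothetical, and the sample text is copied from Case 1.

```python
import re

# Matches lines such as:
#   881501-==2225004==    definitely lost: 0 bytes in 1 blocks
LEAK_LINE = re.compile(
    r"==\d+==\s+"
    r"(definitely lost|indirectly lost|possibly lost|still reachable|suppressed):"
    r"\s+([\d,]+) bytes in ([\d,]+) blocks"
)


def parse_leak_summary(text):
    """Return {category: (bytes, blocks)} for every summary line found in text."""
    summary = {}
    for kind, nbytes, nblocks in LEAK_LINE.findall(text):
        summary[kind] = (
            int(nbytes.replace(",", "")),
            int(nblocks.replace(",", "")),
        )
    return summary


def has_definite_leak(summary):
    """True if any bytes are definitely or indirectly lost."""
    return summary["definitely lost"][0] > 0 or summary["indirectly lost"][0] > 0


# Sample copied from the Case 1 log above.
SAMPLE = """\
881500:==2225004== LEAK SUMMARY:
881501-==2225004==    definitely lost: 0 bytes in 1 blocks
881502-==2225004==    indirectly lost: 0 bytes in 0 blocks
881503-==2225004==      possibly lost: 492,322 bytes in 310 blocks
881504-==2225004==    still reachable: 25,377,811 bytes in 101,103 blocks
881505-==2225004==         suppressed: 0 bytes in 0 blocks
"""

if __name__ == "__main__":
    parsed = parse_leak_summary(SAMPLE)
    print(parsed)
    print("definite leak:", has_definite_leak(parsed))
```

The check keys on bytes, not blocks, because the logs above report "0 bytes in 1 blocks" for the definitely-lost category.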

pytorch-bot bot commented Mar 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9429

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


MPSFuzz commented Mar 12, 2026

Following the suggestion in #9423, I applied the same error-path handling/refactoring pattern to the PNG decode path as well.

In particular, decode_png is now structured similarly to the updated decode_jpeg path:

  • keep the libpng setjmp/longjmp region contained,
  • move potentially-throwing Torch ops outside that region,
  • and align the PNG error-path cleanup logic with the JPEG fix.

I tested this with torchvision_png_leak_case from:
https://github.com/MPSFuzz/images/tree/master/torch_poc

python3 poc_png.py torchvision_png_leak_case --iters 100000 --report-every 10000

With the rebuilt patched `torchvision`, RSS stays flat and `delta_kb` no longer increases:

[INFO] input_file=torchvision_png_leak_case  
[INFO] input_size=297  
[INFO] iters=100000, report_every=10000  
[INFO] start_rss_kb=627612  
[REPORT] iter=1 rss_kb=627612 delta_kb=0 elapsed=0.00s  
[REPORT] iter=10000 rss_kb=627612 delta_kb=0 elapsed=0.15s  
[REPORT] iter=20000 rss_kb=627612 delta_kb=0 elapsed=0.31s  
[REPORT] iter=30000 rss_kb=627612 delta_kb=0 elapsed=0.47s  
[REPORT] iter=40000 rss_kb=627612 delta_kb=0 elapsed=0.63s  
[REPORT] iter=50000 rss_kb=627612 delta_kb=0 elapsed=0.78s  
[REPORT] iter=60000 rss_kb=627612 delta_kb=0 elapsed=0.93s  
[REPORT] iter=70000 rss_kb=627612 delta_kb=0 elapsed=1.08s  
[REPORT] iter=80000 rss_kb=627612 delta_kb=0 elapsed=1.24s  
[REPORT] iter=90000 rss_kb=627612 delta_kb=0 elapsed=1.38s  
[REPORT] iter=100000 rss_kb=627612 delta_kb=0 elapsed=1.53s  
[DONE] end_rss_kb=627612 total_delta_kb=0

For comparison, in an environment without the updated build, the same PoC still shows near-linear RSS growth:

[INFO] input_file=torchvision_png_leak_case  
[INFO] input_size=297  
[INFO] iters=100000, report_every=10000  
[INFO] start_rss_kb=637760  
[REPORT] iter=1 rss_kb=637760 delta_kb=0 elapsed=0.00s  
[REPORT] iter=10000 rss_kb=654504 delta_kb=16744 elapsed=0.18s  
[REPORT] iter=20000 rss_kb=670344 delta_kb=32584 elapsed=0.37s  
[REPORT] iter=30000 rss_kb=686184 delta_kb=48424 elapsed=0.55s  
[REPORT] iter=40000 rss_kb=701760 delta_kb=64000 elapsed=0.73s  
[REPORT] iter=50000 rss_kb=717600 delta_kb=79840 elapsed=0.91s  
[REPORT] iter=60000 rss_kb=733440 delta_kb=95680 elapsed=1.10s  
[REPORT] iter=70000 rss_kb=749280 delta_kb=111520 elapsed=1.29s  
[REPORT] iter=80000 rss_kb=764856 delta_kb=127096 elapsed=1.47s  
[REPORT] iter=90000 rss_kb=780696 delta_kb=142936 elapsed=1.65s  
[REPORT] iter=100000 rss_kb=796536 delta_kb=158776 elapsed=1.83s  
[DONE] end_rss_kb=796536 total_delta_kb=158776
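To put a number on "near-linear", the report lines above can be reduced to an average growth rate and a list of per-interval increments. This is a small sketch that uses only the (iter, delta_kb) pairs from the unpatched run; the function names are hypothetical.

```python
# (iteration, delta_kb) pairs copied from the unpatched run above.
REPORTS = [
    (1, 0),
    (10000, 16744),
    (20000, 32584),
    (30000, 48424),
    (40000, 64000),
    (50000, 79840),
    (60000, 95680),
    (70000, 111520),
    (80000, 127096),
    (90000, 142936),
    (100000, 158776),
]


def growth_per_iter_kb(reports):
    """Average RSS growth per iteration (kB) between the first and last report."""
    (i0, d0), (i1, d1) = reports[0], reports[-1]
    return (d1 - d0) / (i1 - i0)


def step_increments(reports):
    """RSS growth (kB) between each pair of consecutive reports."""
    return [b[1] - a[1] for a, b in zip(reports, reports[1:])]


if __name__ == "__main__":
    print(f"average growth: {growth_per_iter_kb(REPORTS):.3f} kB/iter")
    print("increments per interval:", step_increments(REPORTS))
```

On this data the average works out to roughly 1.6 kB per decode call, and the increments per 10,000 iterations stay within a narrow band (15,576 to 16,744 kB), which is what a fixed-size leak on each failing call would look like.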

poc_png.py:

#!/usr/bin/env python3
# PoC: decode the same PNG input repeatedly and report how the process RSS changes.
import argparse
import time

import torch
from torchvision.io import ImageReadMode
from torchvision.io.image import decode_png


def get_rss_kb():
    """Return the resident set size in kB from /proc/self/status (Linux only), or -1 if not found."""
    with open("/proc/self/status", "r", encoding="utf-8") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", help="path to input sample")
    parser.add_argument("--iters", type=int, default=100000, help="number of iterations")
    parser.add_argument(
        "--report-every", type=int, default=1000, help="report RSS every N iterations"
    )
    args = parser.parse_args()

    with open(args.input_file, "rb") as f:
        data = f.read()

    u8 = torch.frombuffer(data, dtype=torch.uint8)

    rss0 = get_rss_kb()
    t0 = time.time()

    print(f"[INFO] input_file={args.input_file}")
    print(f"[INFO] input_size={len(data)}")
    print(f"[INFO] iters={args.iters}, report_every={args.report_every}")
    print(f"[INFO] start_rss_kb={rss0}")

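    for i in range(1, args.iters + 1):
        # Decode errors are ignored on purpose: the PoC measures memory
        # behavior across repeated calls, including calls that fail.
        try:
            decode_png(u8, mode=ImageReadMode.UNCHANGED)
        except Exception:
            pass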

        if i == 1 or i % args.report_every == 0 or i == args.iters:
            rss = get_rss_kb()
            print(
                f"[REPORT] iter={i} rss_kb={rss} delta_kb={rss - rss0} elapsed={time.time() - t0:.2f}s",
                flush=True,
            )

    rss1 = get_rss_kb()
    print(f"[DONE] end_rss_kb={rss1} total_delta_kb={rss1 - rss0}")


if __name__ == "__main__":
    main()
