tar: fix etc remapping of paths with non-ASCII characters #2073
cgwalters merged 3 commits into bootc-dev:main
PAX extended headers take precedence over basic tar header fields per POSIX. When a container layer contains PAX `path` or `linkpath` headers (e.g. for non-ASCII filenames), they override the remapped path written to the basic header, causing files that should land under /usr/etc to remain under /etc. Filter out `path` and `linkpath` from PAX extensions before writing the output entry. The tar crate regenerates them from the remapped path passed to append_data/append_link. Signed-off-by: Peter Siegel <psiegel2000@icloud.com>
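The filtering step described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: the function names `parse_pax_records` and `strip_path_records` are hypothetical, and the sketch works on the raw body of a PAX extended header, where each record has the form `<len> <key>=<value>\n` and `<len>` counts the whole record in bytes.

```rust
/// Parse the body of a PAX extended header into (key, value) byte pairs.
/// Returns None if a record is malformed. (Hypothetical helper.)
fn parse_pax_records(mut data: &[u8]) -> Option<Vec<(Vec<u8>, Vec<u8>)>> {
    let mut records = Vec::new();
    while !data.is_empty() {
        // The decimal length runs up to the first space.
        let space = data.iter().position(|&b| b == b' ')?;
        let len: usize = std::str::from_utf8(&data[..space]).ok()?.parse().ok()?;
        // The length covers digits, space, key, '=', value and the newline.
        if len <= space + 1 || len > data.len() || data[len - 1] != b'\n' {
            return None;
        }
        let body = &data[space + 1..len - 1];
        let eq = body.iter().position(|&b| b == b'=')?;
        // Values are kept as raw bytes; they need not be valid UTF-8.
        records.push((body[..eq].to_vec(), body[eq + 1..].to_vec()));
        data = &data[len..];
    }
    Some(records)
}

/// Drop the records that would override a remapped path in the basic header.
/// (Hypothetical helper.)
fn strip_path_records(records: Vec<(Vec<u8>, Vec<u8>)>) -> Vec<(Vec<u8>, Vec<u8>)> {
    records
        .into_iter()
        .filter(|(key, _)| !matches!(key.as_slice(), b"path" | b"linkpath"))
        .collect()
}
```

All other records (for example `mtime`) pass through unchanged, so only the two keys that carry a path are lost and later regenerated by the writer.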
Verifies that PAX `path` headers (as produced by Docker/BuildKit for non-ASCII filenames) do not bypass the /etc -> /usr/etc remap. Checks both that no unremapped /etc PAX headers remain in the output and that the remapped file appears under usr/etc. Signed-off-by: Peter Siegel <psiegel2000@icloud.com>
Code Review
This pull request addresses a bug where PAX headers in tar archives were overriding path remapping, specifically for /etc to /usr/etc. The proposed solution correctly filters out the problematic path and linkpath headers. The changes include a new regression test to ensure the fix works as expected, including for paths with non-ASCII characters. My review focuses on improving the implementation's efficiency and making the new test more robust.
Signed-off-by: Peter Siegel <psiegel2000@icloud.com>
gursewak1997 left a comment
Would be nice to have just two commits (add gemini suggestions into one of the previous commits) but changes lgtm
Sure, happy to do that. Thanks for the quick review.
I have an agent running on this doing some research, and man: tar is just a giant mess. There are so many incompatible extensions and variants, and some codebases don't follow the spec (because they can't, since they need to care about things that don't honor the spec) etc. In this specific case, from some agent research it looks like go archive/tar switches to PAX when it sees non-UTF8, but the Rust tar crate instead prefers GNU. And...hmmm when we do this filtering...I think tar-rs will see that even though the header is pax it will cut over to GNU for emitting? Let me see...
On the positive side we don't do this tar filtering in the composefs path at least, so no bugs there.
Unfortunately I'm stuck on kernel 5.15, so I wasn't quite able to get the composefs pathway working for my use case. My end goal is to get bootc working on Ubuntu for NVIDIA Jetson devices. Any tips would be welcome, but I guess bootc on Ubuntu + Jetson is probably uncharted territory :)
OK after some analysis...I think this is viable, though we could clearly do some more cleanup here. I did some in composefs/tar-core#19 (which will be used by composefs, not the ostree side yet). Also on that topic we've got a lot of ongoing work on #20 which will eventually get us to the point where we may be able to completely rework the ostree-container storage as well such that it's based on the composefs storage, which would help avoid this (there's better stuff we can do there than filtering tar).
Motivated by bootc-dev/bootc#2073, where Go's archive/tar (used by Docker/BuildKit) emits PAX path headers for non-ASCII filenames like Főtanúsítvány.pem (valid UTF-8, but non-ASCII). PAX headers take precedence over basic tar headers per POSIX, so code that remaps paths by rewriting the basic header must also update or strip PAX path/linkpath records. tar-core already handles non-UTF-8 PAX path values correctly (raw `&[u8]` throughout, matching Go archive/tar and Rust tar crate), but this was untested. Add tests covering: parser acceptance of non-UTF-8 PAX path bytes, lossy conversion, builder->parser roundtrip with a >100 byte path (to actually trigger PAX emission), linkpath preservation, and PaxExtension value_bytes() vs value() behavior. Assisted-by: OpenCode (Claude Opus 4) Signed-off-by: Colin Walters <walters@verbum.org>
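For readers unfamiliar with the record format, here is a small sketch of how a PAX `path` record for a non-ASCII name can be encoded. It is an illustration under stated assumptions, not tar-core's builder: `encode_pax_record` is a hypothetical name, and the only format rule relied on is that the leading decimal length counts every byte of the record, including its own digits.

```rust
/// Encode one PAX record as "<len> <key>=<value>\n". (Hypothetical helper.)
/// The value is raw bytes, so non-ASCII and even non-UTF-8 names are fine.
fn encode_pax_record(key: &str, value: &[u8]) -> Vec<u8> {
    // Bytes excluding the length digits: ' ' + key + '=' + value + '\n'.
    let rest = key.len() + value.len() + 3;
    // The length includes its own digits, so iterate until it is stable.
    let mut total = rest + 1;
    loop {
        let next = rest + total.to_string().len();
        if next == total {
            break;
        }
        total = next;
    }
    let mut record = format!("{} {}=", total, key).into_bytes();
    record.extend_from_slice(value);
    record.push(b'\n');
    record
}
```

Lengths are measured in bytes rather than characters, which is why a name containing `ő` (two bytes in UTF-8) produces a record one byte longer than its character count suggests.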
Test the PAX 'x' -> GNU 'L' -> real entry ordering, which is what tar-rs's builder produces when you call append_pax_extensions() followed by append_data() with a long path. This matters for ecosystem compatibility -- bootc's copy_entry (bootc-dev/bootc#2073) generates exactly this layout when filtering PAX extensions during path remapping. The parser already handles this correctly via PendingMetadata accumulation across recursive parse_header calls, but the reversed ordering was untested. Also test that PAX path still wins over GNU long name regardless of which comes first in the byte stream. Assisted-by: OpenCode (Claude Opus 4) Signed-off-by: Colin Walters <walters@verbum.org>
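The precedence rule these tests exercise can be stated as a tiny function. This is a sketch of the rule only, with a hypothetical name (`effective_path`); it does not reflect how tar-core's `PendingMetadata` is structured internally.

```rust
/// Pick the path a reader should use for an entry. (Hypothetical helper.)
/// A PAX `path` record wins over a GNU long name, which wins over the
/// name stored in the basic header, regardless of stream order.
fn effective_path<'a>(
    pax_path: Option<&'a [u8]>,
    gnu_long_name: Option<&'a [u8]>,
    header_name: &'a [u8],
) -> &'a [u8] {
    pax_path.or(gnu_long_name).unwrap_or(header_name)
}
```

This also shows why the original bug occurred: if a stale PAX `path` of `etc/...` is copied through, it wins over a basic header that was rewritten to `usr/etc/...`.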
Various patches motivated by bootc-dev/bootc#2073
Hmm rhel9 is 5.14+lots-of-patches, but I am not aware of a hard reason that 5.15 (to be clear: kernel.org?) wouldn't work. Would take some investigation of course, IIRC rhel9 did backport some EROFS work, and I think the new mount API too...
I think that's exactly what I'm missing. I get this error testing in an aarch64 qemu vm: Unfortunately this isn't vanilla 5.15, as NVIDIA requires some out-of-tree modules for Jetson. I had hoped to get away with using Ubuntu's prebuilt linux-nvidia-tegra-jetson kernel metapackage, but I guess if that worked it would be too easy 😅 I guess I'm in for some more investigation on whether I can use a newer version of that package (they seem to also have a 6.8 kernel build) or build a custom kernel which has the fixes I need.
This is probably an issue for composefs-rs; offhand it might even have been fixed by composefs/composefs-rs#265. I have no issues with trying to support 5.15 offhand, but that said we'd need to have a clean agreed reproducer environment.
Thanks for the tip. I'll take another look. The good thing is I have bootc working in an aarch64 vm and on real hw for the jetson orin nano with ubuntu 24.04 and the jetson flavor of kernel 6.8. If I run into any more issues with 5.15 that are relevant for bootc I'll create some github issues / PRs and include a reproducible example.
PAX extended headers take precedence over basic tar header fields per POSIX. When a container layer contains PAX `path` or `linkpath` headers (e.g. for non-ASCII filenames), they override the remapped path written to the basic header, causing files that should land under `/usr/etc` to remain under `/etc`.
This PR filters out `path` and `linkpath` from PAX extensions in `copy_entry` before writing the output entry. The tar crate regenerates them from the remapped path passed to `append_data`/`append_link`. I also add a test to ensure that paths containing non-ASCII characters are remapped properly.
Fixes bootcrew/ubuntu-bootc#4 (I ran into the same issue building my own ubuntu-based bootc image).
Example error:
Disclaimer: This bug was quite hard to track down. I used claude to help find the root cause.