RFC: overlay dirmeta_delegate design#701
Conversation
…yer unpack When unpacking container image layers, directories may be created implicitly as structural parents for files in the tar stream, without having their own tar entry with proper metadata. These structural directories get default metadata which shadows the meaningful metadata from base image layers when the layers are composed via overlayfs. With the new kernel overlayfs dirmeta_delegate feature, directories marked with the trusted.overlay.dirmeta_delegate xattr (or user.overlay.dirmeta_delegate in rootless mode) will have their metadata delegated to a lower layer in the overlay stack. This change: - Adds a DirmetaDelegate field to TarOptions to opt in to the behavior - Updates UnpackLayer() to set the dirmeta_delegate xattr on directories that are implicitly created as parents for files in the tar stream, while leaving explicitly-defined directories unmarked - Adds a dirmeta_delegate driver option to the overlay driver (default true) and threads it through to the tar unpack path - Adds tests verifying the xattr is set on implicit dirs, not set on explicit dirs, and not set when the feature is disabled Signed-off-by: John Eckersberg <jeckersb@redhat.com>
|
This relates to opencontainers/image-spec#1221 and opencontainers/image-spec#737 and also I think has implications for e.g. https://github.com/jlebon/chunkah last we talked about it right? |
mtrmac
left a comment
There was a problem hiding this comment.
*shrug* I can imagine such semantics.
In principle, I think it’s important that the layers should behave the same in all execution environments (different graph drivers / backing stores). In that sense, relying on an overlay-only feature seems problematic. (An alternative view is that most snapshot-based graph drivers like btrfs don’t have this issue at all, and that this arises because overly needs the intermediate parents to exist in the child’s diff directory structure, and that this is really a design bug in overlay. I don’t think that’s unreasonable.) So, I’d really prefer the behavior to be defined by OCI.
Pragmatically, for a rechunker, the rechunker can, in principle, figure out the intended directory attributes, and explicitly include the intermediate parents in all layers it creates, can’t it?
| } | ||
| case "dirmeta_delegate": | ||
| logrus.Debugf("overlay: dirmeta_delegate=%s", val) | ||
| o.dirmetaDelegate, err = strconv.ParseBool(val) |
There was a problem hiding this comment.
I don’t know why this should be an option, conceptually the images should have a single set of semantics. (Now, following that thought strictly, either all existing implementations are violating the spec, or the spec needs to specify the existing behavior and the new feature should not be adopted … but let’s ignore that.)
Who would want to set this on/off, per host? (I can imagine per-image, although that would be a totally unreasonable hassle, but per host just looks like a wrong scope for the decision.)
There was a problem hiding this comment.
I have no strong opinions on the configurability of this, it's only like this because doing AI iteration it asked how it should work and I said to have it on by default but be able to disable it with the config. But the only real reason I did that was for my own testing purposes.
| // marked with the overlay dirmeta_delegate xattr. This tells overlayfs to | ||
| // delegate metadata (timestamps, ownership, mode) for these directories to | ||
| // a lower layer, preserving the meaningful metadata from base image layers. | ||
| DirmetaDelegate bool |
There was a problem hiding this comment.
What happens if no layer contains an entry for the parent directory?
There was a problem hiding this comment.
As written today, if it cannot find a non-delegated entry in any of the lowerdirs, it will fall back to the current behavior of using whatever metadata happens to be present on the topmost dir in the stack. That could be subject to change during kernel review though.
There was a problem hiding this comment.
This is also a question for the OCI spec…
| parentPath := filepath.Join(dest, parent) | ||
| if err := fileutils.Lexists(parentPath); err != nil && os.IsNotExist(err) { | ||
| err = idtools.MkdirAllAndChownNew(parentPath, 0o777, rootIDs) | ||
| if options.DirmetaDelegate { |
There was a problem hiding this comment.
I guess there could be a cleaner implementation but I don’t worry about that now.
| err = mkdirAllAndChownWithDirmetaDelegate(parentPath, 0o777, rootIDs) | ||
| } else { | ||
| err = idtools.MkdirAllAndChownNew(parentPath, 0o777, rootIDs) | ||
| } |
There was a problem hiding this comment.
What happens if a layer (this one or a later one) contains a directory entry for one of the intermediate directories after we create it? I suppose we should clean the attribute.
There was a problem hiding this comment.
Yeah this is a good point, if the ordering of the tar stream in this particular layer is something like...
/usr/bin/foo
... some more entries ...
/usr/bin
We would need to set the xattr on /usr/bin when creating /usr/bin/foo but ensure it's cleared when we encounter /usr/bin itself later. Or otherwise look ahead to know not to set it in the first place.
If instead it's spread across two layers, like:
-
Earlier / lower / more "base" layer defines
/usr/bin/foo -
Later / higher in the stack / derived layer defines
/usr/bin
Then I don't think there needs to be any special handling because when traversing through the lowerdir stack overlay will find /usr/bin from the derived layer first without the xattr set and use the metadata from there.
But more generally this is precisely the sort of thing that the OCI spec should address and clarify.
Yeah this definitely has implications for chunkah. This whole thing can be worked around by always including the full parent directory metadata (the rpm-ostree rechunker supports that today) but per the OCI image spec that's not really how it should work, and I tend to agree with the spirit of the spec. Here's a contrived example why this distinction can be important: Base image Base image Now, we have some big 100GB If I want to build derived images from both base images, ideally I have just one layer blob with the content of |
Agree 100% that this should be also be driven through the spec in the places where @cgwalters noted above. This is just a piece of the puzzle so that if we get the spec clarified, we're already moving things in the right direction to be able to support it properly with the overlay driver. |
|
We also need to worry, at least for a time, about older kernels. I suppose adding an xattr should be a no-op on older kernels, but just for the record. |
|
@jlebon tagging you here explicitly since this is highly relevant to your interests as colin noted above 😄 |
|
It's also worth emphasizing that this is not an issue for composefs-based storage. |
I'm looking for some early feedback on this. Don't worry too much about all of the details, this is just AI-guided PoC level for now to show integration with potential kernel bits. For example, this very intentionally ignores the zstd:chunked path but that will need considered in a final version.
Mostly I want to make sure this is a reasonable approach before I start trying to clean up things and push on the kernel side. I want to avoid investing a nontrivial amount of time fixing the kernel only to find this isn't going to fly over here. If everything pans out I'll tackle a more stringent final change here.
For some background around this, I've written about it on the bootc blog. It's a bit long, so the tl;dr is that the handling of implied directories in tar streams is not great and causes us quite a bit of pain in bootc. I won't reiterate here but the blog also touches on other places in the community where this struggle comes up, it is not exclusive to bootc.
Here's the basic demonstration of the problem:
bootc (via ostree in this case) goes out of its way to set the mtime to 0 everywhere. During rechunking the layers get shuffled around a bit, but the gist is that for the above,
/usr/bin(as an example) ends up getting defined in the base layer with the correct mtime of 0. Later layers end up including specific binaries under/usr/binbut none of them explicitly re-define/usr/binitself. We end up with implicit directories created via those layers that "shadow" the intended base layer metadata. Meanwhile for things like/usr/localno later layers add files to that tree, so the metadata is correct since there's nothing to shadow it.I've drafted up what I'm calling dirmeta_delegate for overlayfs, which is what this patch is utilizing to correct this problem. By setting an xattr, one can instruct overlayfs to delegate directory metadata to a lower layer. We use this to label the implied directories during unpacking.
With a patched kernel and a podman built using a patched containers/storage: