RFC: overlay dirmeta_delegate design #701

Draft
jeckersb wants to merge 1 commit into containers:main from jeckersb:dirmeta_delegate

@jeckersb

I'm looking for some early feedback on this. Don't worry too much about the details; this is just an AI-guided, PoC-level draft for now to show integration with potential kernel bits. For example, this very intentionally ignores the zstd:chunked path, but that will need to be considered in a final version.

Mostly I want to make sure this is a reasonable approach before I start trying to clean up things and push on the kernel side. I want to avoid investing a nontrivial amount of time fixing the kernel only to find this isn't going to fly over here. If everything pans out I'll tackle a more stringent final change here.

For some background around this, I've written about it on the bootc blog. It's a bit long, so the tl;dr is that the handling of implied directories in tar streams is not great and causes us quite a bit of pain in bootc. I won't reiterate it here, but the blog also touches on other places in the community where this struggle comes up; it is not exclusive to bootc.

Here's the basic demonstration of the problem:

[root@rawhide ~]# podman system reset -f
[root@rawhide ~]# /usr/bin/podman run --quiet --rm quay.io/fedora/fedora-bootc ls -l /usr
total 8
drwxr-xr-x. 1 root root   36 Mar 17 14:21 bin
drwxr-xr-x. 1 root root    0 Jan  1  1970 games
drwxr-xr-x. 1 root root   16 Mar 17 14:21 i686-w64-mingw32
drwxr-xr-x. 1 root root   20 Mar 17 14:21 include
drwxr-xr-x. 1 root root  152 Mar 17 14:21 lib
drwxr-xr-x. 1 root root 1716 Mar 17 14:21 lib64
drwxr-xr-x. 1 root root   82 Mar 17 14:21 libexec
drwxr-xr-x. 1 root root   82 Jan  1  1970 local
lrwxrwxrwx. 2 root root    3 Jan  1  1970 sbin -> bin
drwxr-xr-x. 1 root root   78 Mar 17 14:21 share
drwxr-xr-x. 1 root root   24 Jan  1  1970 src
lrwxrwxrwx. 2 root root   10 Jan  1  1970 tmp -> ../var/tmp
drwxr-xr-x. 1 root root   16 Mar 17 14:21 x86_64-w64-mingw32

bootc (via ostree in this case) goes out of its way to set the mtime to 0 everywhere. During rechunking the layers get shuffled around a bit, but the gist is that for the above, /usr/bin (as an example) ends up getting defined in the base layer with the correct mtime of 0. Later layers end up including specific binaries under /usr/bin but none of them explicitly re-define /usr/bin itself. We end up with implicit directories created via those layers that "shadow" the intended base layer metadata. Meanwhile for things like /usr/local no later layers add files to that tree, so the metadata is correct since there's nothing to shadow it.

I've drafted up what I'm calling dirmeta_delegate for overlayfs, which is what this patch is utilizing to correct this problem. By setting an xattr, one can instruct overlayfs to delegate directory metadata to a lower layer. We use this to label the implied directories during unpacking.

With a patched kernel and a podman built using a patched containers/storage:

[root@rawhide ~]# grep DELEGATE /boot/config-7.0.0-0.rc3.28.fc45.x86_64
CONFIG_OVERLAY_FS_DIRMETA_DELEGATE=y
[root@rawhide ~]# podman system reset -f
[root@rawhide ~]# /root/bin/podman run --quiet --rm quay.io/fedora/fedora-bootc ls -l /usr
total 8
drwxr-xr-x. 1 root root   36 Jan  1  1970 bin
drwxr-xr-x. 1 root root    0 Jan  1  1970 games
drwxr-xr-x. 1 root root   16 Jan  1  1970 i686-w64-mingw32
drwxr-xr-x. 1 root root   20 Jan  1  1970 include
drwxr-xr-x. 1 root root  152 Jan  1  1970 lib
drwxr-xr-x. 1 root root 1716 Jan  1  1970 lib64
drwxr-xr-x. 1 root root   82 Jan  1  1970 libexec
drwxr-xr-x. 1 root root   82 Jan  1  1970 local
lrwxrwxrwx. 2 root root    3 Jan  1  1970 sbin -> bin
drwxr-xr-x. 1 root root   78 Jan  1  1970 share
drwxr-xr-x. 1 root root   24 Jan  1  1970 src
lrwxrwxrwx. 2 root root   10 Jan  1  1970 tmp -> ../var/tmp
drwxr-xr-x. 1 root root   16 Jan  1  1970 x86_64-w64-mingw32

…yer unpack

When unpacking container image layers, directories may be created
implicitly as structural parents for files in the tar stream, without
having their own tar entry with proper metadata.  These structural
directories get default metadata which shadows the meaningful metadata
from base image layers when the layers are composed via overlayfs.

With the new kernel overlayfs dirmeta_delegate feature, directories
marked with the trusted.overlay.dirmeta_delegate xattr (or
user.overlay.dirmeta_delegate in rootless mode) will have their metadata
delegated to a lower layer in the overlay stack.

This change:

- Adds a DirmetaDelegate field to TarOptions to opt in to the behavior
- Updates UnpackLayer() to set the dirmeta_delegate xattr on directories
  that are implicitly created as parents for files in the tar stream,
  while leaving explicitly-defined directories unmarked
- Adds a dirmeta_delegate driver option to the overlay driver (default
  true) and threads it through to the tar unpack path
- Adds tests verifying the xattr is set on implicit dirs, not set on
  explicit dirs, and not set when the feature is disabled

Signed-off-by: John Eckersberg <jeckersb@redhat.com>
@github-actions github-actions bot added the storage label Mar 17, 2026
@cgwalters
Contributor

This relates to opencontainers/image-spec#1221 and opencontainers/image-spec#737 and also I think has implications for e.g. https://github.com/jlebon/chunkah last we talked about it right?

@mtrmac mtrmac left a comment
Contributor

*shrug* I can imagine such semantics.

In principle, I think it’s important that the layers should behave the same in all execution environments (different graph drivers / backing stores). In that sense, relying on an overlay-only feature seems problematic. (An alternative view is that most snapshot-based graph drivers like btrfs don’t have this issue at all, and that this arises because overlay needs the intermediate parents to exist in the child’s diff directory structure, and that this is really a design bug in overlay. I don’t think that’s unreasonable.) So, I’d really prefer the behavior to be defined by OCI.

Pragmatically, for a rechunker, the rechunker can, in principle, figure out the intended directory attributes, and explicitly include the intermediate parents in all layers it creates, can’t it?

		}
	case "dirmeta_delegate":
		logrus.Debugf("overlay: dirmeta_delegate=%s", val)
		o.dirmetaDelegate, err = strconv.ParseBool(val)
Contributor

I don’t know why this should be an option, conceptually the images should have a single set of semantics. (Now, following that thought strictly, either all existing implementations are violating the spec, or the spec needs to specify the existing behavior and the new feature should not be adopted … but let’s ignore that.)

Who would want to set this on/off, per host? (I can imagine per-image, although that would be a totally unreasonable hassle, but per host just looks like a wrong scope for the decision.)

Author

I have no strong opinions on the configurability of this. It's only like this because, during AI iteration, it asked how this should work and I said to have it on by default but be able to disable it via the config. The only real reason I did that was for my own testing purposes.

// marked with the overlay dirmeta_delegate xattr. This tells overlayfs to
// delegate metadata (timestamps, ownership, mode) for these directories to
// a lower layer, preserving the meaningful metadata from base image layers.
DirmetaDelegate bool
Contributor

What happens if no layer contains an entry for the parent directory?

Author

As written today, if it cannot find a non-delegated entry in any of the lowerdirs, it will fall back to the current behavior of using whatever metadata happens to be present on the topmost dir in the stack. That could be subject to change during kernel review though.
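
To make the lookup order concrete, here is a small user-space model of the behavior as described in this comment. It is only a sketch: the real logic lives in the proposed kernel patch, `dirEntry` and `resolveMtime` are hypothetical names, and only mtime is modeled even though ownership and mode would be delegated the same way.

```go
package main

import "fmt"

// dirEntry models one layer's view of a single directory.
type dirEntry struct {
	present  bool  // the layer contains this directory at all
	delegate bool  // the dirmeta_delegate xattr is set on it
	mtime    int64 // seconds since the epoch
}

// resolveMtime walks the stack from the topmost layer down and returns the
// mtime of the first present, non-delegated entry. If every present entry is
// delegated, it falls back to the topmost present entry. The second result is
// false only if no layer contains the directory.
func resolveMtime(stack []dirEntry) (int64, bool) {
	top := -1
	for i, e := range stack {
		if !e.present {
			continue
		}
		if top < 0 {
			top = i
		}
		if !e.delegate {
			return e.mtime, true
		}
	}
	if top >= 0 {
		return stack[top].mtime, true
	}
	return 0, false
}

func main() {
	// Index 0 is the topmost layer.
	usrBin := []dirEntry{
		{present: true, delegate: true, mtime: 1000}, // implicit parent, marked
		{present: true, delegate: true, mtime: 2000}, // implicit parent, marked
		{present: true, delegate: false, mtime: 0},   // base layer, explicit entry
	}
	fmt.Println(resolveMtime(usrBin))

	// No layer defines the directory explicitly: fall back to the topmost one.
	orphan := []dirEntry{
		{present: true, delegate: true, mtime: 1000},
		{present: true, delegate: true, mtime: 2000},
	}
	fmt.Println(resolveMtime(orphan))
}
```

The first case mirrors /usr/bin from the demonstration above (the base layer's mtime of 0 wins); the second is the fallback case asked about here.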

Contributor

This is also a question for the OCI spec…

parentPath := filepath.Join(dest, parent)
if err := fileutils.Lexists(parentPath); err != nil && os.IsNotExist(err) {
err = idtools.MkdirAllAndChownNew(parentPath, 0o777, rootIDs)
if options.DirmetaDelegate {
Contributor

I guess there could be a cleaner implementation but I don’t worry about that now.

err = mkdirAllAndChownWithDirmetaDelegate(parentPath, 0o777, rootIDs)
} else {
err = idtools.MkdirAllAndChownNew(parentPath, 0o777, rootIDs)
}
Contributor

What happens if a layer (this one or a later one) contains a directory entry for one of the intermediate directories after we create it? I suppose we should clear the attribute.

Author

Yeah, this is a good point. If the ordering of the tar stream in this particular layer is something like:

/usr/bin/foo
... some more entries ...
/usr/bin

We would need to set the xattr on /usr/bin when creating /usr/bin/foo but ensure it's cleared when we encounter /usr/bin itself later. Or otherwise look ahead to know not to set it in the first place.

If instead it's spread across two layers, like:

  • Earlier / lower / more "base" layer defines /usr/bin/foo

  • Later / higher in the stack / derived layer defines /usr/bin

Then I don't think there needs to be any special handling because when traversing through the lowerdir stack overlay will find /usr/bin from the derived layer first without the xattr set and use the metadata from there.

But more generally this is precisely the sort of thing that the OCI spec should address and clarify.

@jeckersb
Author

This relates to opencontainers/image-spec#1221 and opencontainers/image-spec#737 and also I think has implications for e.g. https://github.com/jlebon/chunkah last we talked about it right?

Yeah this definitely has implications for chunkah. This whole thing can be worked around by always including the full parent directory metadata (the rpm-ostree rechunker supports that today) but per the OCI image spec that's not really how it should work, and I tend to agree with the spirit of the spec.

Here's a contrived example why this distinction can be important:

Base image foo has /srv with mode 0700.

Base image bar has /srv with mode 0755.

Now, we have some big 100GB blob that we want to add at /srv/blob.

If I want to build derived images from both base images, ideally I have just one layer blob with the content of /srv/blob, and that can be shared between both. Net used storage is 100GB. If I have to represent (contrary to spec) the full parent tree, that means distinct layers. Now I have two layers, each with the 100GB blob plus the tiny metadata delta. Net used storage is now 200GB.
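
The sharing argument comes down to layers being identified by their content digest: any difference in the tar stream, even one directory header, yields a different blob. A small sketch of that effect follows; `entry` and `layerDigest` are hypothetical names, a few bytes stand in for the 100GB blob, and the digest here is taken over the uncompressed tar, similar in spirit to an OCI DiffID.

```go
package main

import (
	"archive/tar"
	"bytes"
	"crypto/sha256"
	"fmt"
)

// entry is one tar member; dir selects a directory header instead of a file.
type entry struct {
	name string
	mode int64
	dir  bool
	body []byte
}

// layerDigest serializes the entries into a tar stream and returns the
// SHA-256 of the uncompressed bytes.
func layerDigest(entries []entry) string {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, e := range entries {
		hdr := &tar.Header{Name: e.name, Mode: e.mode, Typeflag: tar.TypeReg, Size: int64(len(e.body))}
		if e.dir {
			hdr.Typeflag = tar.TypeDir
		}
		if err := tw.WriteHeader(hdr); err != nil {
			panic(err)
		}
		if len(e.body) > 0 {
			if _, err := tw.Write(e.body); err != nil {
				panic(err)
			}
		}
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	return fmt.Sprintf("sha256:%x", sha256.Sum256(buf.Bytes()))
}

func main() {
	blob := entry{name: "srv/blob", mode: 0o644, body: []byte("stand-in for the 100GB blob")}

	// Layer carrying only the blob: identical for both base images.
	onFoo := layerDigest([]entry{blob})
	onBar := layerDigest([]entry{blob})
	fmt.Println("blob-only layers identical:", onFoo == onBar)

	// Layers that must restate /srv with each base image's mode.
	withFoo := layerDigest([]entry{{name: "srv/", mode: 0o700, dir: true}, blob})
	withBar := layerDigest([]entry{{name: "srv/", mode: 0o755, dir: true}, blob})
	fmt.Println("layers with explicit /srv identical:", withFoo == withBar)
}
```

The blob-only layer hashes the same regardless of the base image, so it is stored once. Restating /srv with mode 0700 versus 0755 produces two distinct digests and therefore two full copies of the blob.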

@jeckersb
Author

In principle, I think it’s important that the layers should behave the same in all execution environments (different graph drivers / backing stores). In that sense, relying on an overlay-only feature seems problematic. (An alternative view is that most snapshot-based graph drivers like btrfs don’t have this issue at all, and that this arises because overlay needs the intermediate parents to exist in the child’s diff directory structure, and that this is really a design bug in overlay. I don’t think that’s unreasonable.) So, I’d really prefer the behavior to be defined by OCI.

Agree 100% that this should also be driven through the spec in the places that @cgwalters noted above. This is just a piece of the puzzle so that if we get the spec clarified, we're already moving things in the right direction to be able to support it properly with the overlay driver.

@mtrmac
Contributor

mtrmac commented Mar 17, 2026

We also need to worry, at least for a time, about older kernels. I suppose adding an xattr should be a no-op on older kernels, but just for the record.

@jeckersb
Author

@jlebon tagging you here explicitly since this is highly relevant to your interests, as Colin noted above 😄

@cgwalters
Contributor

It's also worth emphasizing that this is not an issue for composefs-based storage.
