Skip to content

fix unshare infinite loop without CGO_ENABLED#590

Open
lyp256 wants to merge 1 commit intocontainers:mainfrom
lyp256:unshre
Open

fix unshare infinite loop without CGO_ENABLED#590
lyp256 wants to merge 1 commit intocontainers:mainfrom
lyp256:unshre

Conversation

@lyp256
Copy link
Contributor

@lyp256 lyp256 commented Jan 15, 2026

fix unshare.MaybeReexecUsingUserNamespace() infinite loop without CGO_ENABLED.

fixes: #160

@github-actions github-actions bot added the storage Related to "storage" package label Jan 15, 2026
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Jan 15, 2026
@podmanbot
Copy link

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6637

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Jan 15, 2026
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Jan 15, 2026
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I can’t see how this can possibly work. AFAICT the code with _Containers-pid-pipe and _Containers-continue-pipe exists in C became we need it to run (and give all the process environment changes in unshare.Cmd.Start an opportunity to happen) before the Go runtime starts creating any OS-level threads.

We can mimic the child process’ behavior in Go but I don’t see how we can reliably achieve the same outcome.

Cc: @giuseppe

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this supposed to work?

Is this AI-generated?

Comment on lines +29 to +37
attr.AmbientCaps = []uintptr{
unix.CAP_CHOWN,
unix.CAP_DAC_OVERRIDE,
unix.CAP_DAC_READ_SEARCH,
unix.CAP_FOWNER,
unix.CAP_FSETID,
unix.CAP_SYS_ADMIN,
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not tested this PR but this won't work, we need all caps because we launch containers from this environment (and they don't need to be in the ambient set)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've compared processes with C code and those without, and you're right—the C process does include all capabilities. However, I'm not entirely sure if adding all privileges is the correct decision. The reason I added these capabilities in my code was that when using storage, it directly attempted to delete a folder with permissions r-xr-xr-x, which resulted in a Permission deniederror. Adding the capabilities was my workaround to resolve this permission issue. Based on your suggestion, I will proceed with adding all the capabilities.

Copy link
Contributor Author

@lyp256 lyp256 Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found any alternative methods to add capabilities other than using the ambient set. Do you have any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you create a user namespace you automatically get them. Is Go dropping them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my testing, I found that child processes will not have capabilities unless the ambient method is used. I collected relevant child process information from the /proc/<pid>/status file under three scenarios: using CGO, not using CGO without setting ambient capabilities, and not using CGO with ambient capabilities set.

Using CGO:

CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

Not using CGO and not setting ambient capabilities:

CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

Not using CGO but setting ambient capabilities:

CapInh: 000001ffffffffff
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 000001ffffffffff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This affects also processes executed by Podman, could you investigate why this is happening? An alternative would be to drop the ambient caps once we are in the namespace

@lyp256
Copy link
Contributor Author

lyp256 commented Jan 16, 2026

Thanks!

I can’t see how this can possibly work. AFAICT the code with _Containers-pid-pipe and _Containers-continue-pipe exists in C became we need it to run (and give all the process environment changes in unshare.Cmd.Start an opportunity to happen) before the Go runtime starts creating any OS-level threads.

We can mimic the child process’ behavior in Go but I don’t see how we can reliably achieve the same outcome.

Cc: @giuseppe

how is this supposed to work?

Is this AI-generated?

I needed to use statically compiled Skopeo and Buildah for container image migration tasks in an offline environment. I compiled the necessary executables using commands similar to:

CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -tags exclude_graphdriver_btrfs,containers_image_openpgp skopeo/cmd/skopeo.

However, I encountered an issue where the unsharepackage did not function properly when CGO_ENABLED=0. While Skopeo worked correctly in most scenarios, operations involving containers-storage(which call unshare.MaybeReexecUsingUserNamespace, e.g. skopeo copy docker://docker.io/library/alpine:3 containers-storage:alpine:3) resulted in an infinite loop of child process creation.

To resolve this, I modified the code to directly pass syscall.CLONE_NEWUSER via exec.Cmd.SysProcAttr.Cloneflags. In the initfunction, I implemented logic to mimic the operations related to _Containers-pid-pipe and _Containers-continue-pipe as done in the C code. After recompiling both Buildah and Skopeo with these modifications, they now function as expected.

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Jan 16, 2026
@mtrmac
Copy link
Contributor

mtrmac commented Jan 16, 2026

I implemented logic to mimic the operations related to _Containers-pid-pipe and _Containers-continue-pipe as done in the C code. After recompiling both Buildah and Skopeo with these modifications, they now function as expected.

None of this addresses the concern that doing any such changes after the Go runtime starts is too late. What am I missing? (Giuseppe?)

The conversation in #160 was AFAICT saying that to resolve this

MaybeReexecUsingUserNamespace can fail with an useful error message

I realize that does not help your use case but that doesn’t mean that we can ignore the constraints (if there indeed are any).

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Jan 19, 2026
@lyp256
Copy link
Contributor Author

lyp256 commented Jan 19, 2026

Under Giuseppe guidance, I re-investigated the implementation of unshare. After reading the source code related to golang's exec.Cmd, I found that exec.Cmd.Start() handles uid/gid mapping by writing directly to the u(g)id_map files. Due to Linux kernel restrictions, ordinary users cannot write additional ID mappings to the uid_mapand gid_mapfiles directly; this requires using the newuidmap and newgidmap commands in conjunction with the /etc/subuid and /etc/subgid files.

The golang exec.Cmd.Start() method does not use the newu(g)idmap commands. Writing directly to the u(g)id_map files results in permission issues. If the child process does not have the uid/gid mapping set, the system will not grant capabilities to the child process.

The current approach uses the EXECVE system call to handle the issue of writing ID mappings too late (a step I had previously missed, which necessitated the Ambient set). The complete process is roughly as follows:

The main process uses clone() to create a child process -> The child process uses execve() to start the child program -> The parent and child processes interactively complete the ID mapping -> The child process uses execve() again to re-execute the child program.

Please review @mtrmac @giuseppe

Signed-off-by: lyp256 <lyp256@qq.com>
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Jan 19, 2026
@lyp256 lyp256 requested a review from giuseppe January 20, 2026 01:27
Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested locally.

LGTM

@lyp256 lyp256 requested a review from mtrmac January 22, 2026 01:30
@TomSweeneyRedHat
Copy link
Member

LGTM

@TomSweeneyRedHat
Copy link
Member

re-ping @mtrmac

Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really know ~nothing about the problem space; even if this is correct, reviewing it would require much more familiarity with the user namespace mechanics than I now have.

Comment on lines +14 to +17
if c.Cmd.SysProcAttr == nil {
c.Cmd.SysProcAttr = &syscall.SysProcAttr{}
}
attr := c.Cmd.SysProcAttr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this could happen only inside the if below.

}
attr := c.Cmd.SysProcAttr
if c.UnshareFlags&syscall.CLONE_NEWUSER != 0 {
attr.Cloneflags = uintptr(c.UnshareFlags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing documents why we are not using “unshare” flags for unshare.

AFAICS the two functions don’t accept exactly the same sets of flags, so even if this were correct for the callers we are currently anticipating, I’d expect the code here to report a failure if any unexpected flag value is set.


(And, well, if Cloneflags works here fine nowadays, wouldn’t we want to use it in the CGo code path as well, and save us the unshare trouble? Is there more subtlety to this? There probably is, given the two separate unshare calls.)

This sort of applies generally — if we can do something natively using SysProcAttr nowadays, I’d expect that to always be done, rather than having two different code paths.)

attr := c.Cmd.SysProcAttr
if c.UnshareFlags&syscall.CLONE_NEWUSER != 0 {
attr.Cloneflags = uintptr(c.UnshareFlags)
attr.GidMappingsEnableSetgroups = c.GidMappingsEnableSetgroups
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does nothing at all unless …

if c.Ctty != nil {
index := len(c.Cmd.ExtraFiles)
c.Cmd.ExtraFiles = append(c.Cmd.ExtraFiles, c.Ctty)
attr.Ctty = index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does nothing unless …

_ = os.Unsetenv(name)
v, err := strconv.Atoi(env)
if err != nil {
return -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C code differentiates between “missing” and “invalid format”

bailOnError(err, "Read containers continue pipe")
}
if n > 0 {
bailOnError(fmt.Errorf(string(buf)), "Unexpected containers continue pipe read")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format string vulnerability.


Nothing is all that unexpected about this error reporting path, the data exists only to be reported this way.

bailOnError(err, "Error during setsid")
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setpgrp handling?

bailOnError(err, "Setresgid failed")
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C version does a two-stage unshare. Why is it necessary there and not necessary here?

}

// Re-invoke the execve system call to obtain capabilities.
err := syscall.Exec("/proc/self/exe", os.Args, os.Environ())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containers_reexec does a lot of things. I have no idea why it does all of them (and the code doesn’t document it :| ); either way, I must ask: why is all of that not necessary here?

Comment on lines +76 to +81
if err := syscall.Setresuid(0, 0, 0); err != nil {
bailOnError(err, "Setresuid failed")
}
if err := syscall.Setresgid(0, 0, 0); err != nil {
bailOnError(err, "Setresgid failed")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the UID/GID order differ from the C version?

@giuseppe
Copy link
Member

@lyp256 are you still working on this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

storage Related to "storage" package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reexec causes infinite loop, if built without CGO_ENABLED

5 participants