Proposal: User namespace support for re-mapped root per daemon setting by estesp · Pull Request #11253 · moby/moby

Proposal for User Namespace Daemon Support

This is a docs-first proposal to get review/feedback on the UX for specifying a per-daemon-instance remapping of container root to an unprivileged user.

Depends on libcontainer API/userns support

The support for user namespaces already exists in libcontainer. The PR for bringing that new API and functionality is open for review: #11208. This present proposal cannot be implemented until that PR is merged and the libcontainer vendor is updated in Docker itself.

A per-daemon setting for root

The documentation change in this PR introduces a new flag, -r, --root="", which specifies the requested user and group name (or uid:gid) that the daemon will use as the remapped root when it instantiates new containers. Specifying the special value default asks Docker to create (or reuse, if it already exists) a special user/group named dockroot (the docker group is already taken and used by Docker itself).
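To make the accepted values concrete, here is a minimal Go sketch of how the flag value might be interpreted. Everything in it is an assumption for discussion, not the implementation: the RootSpec type and parseRootSpec function are hypothetical names, the user:group syntax for names is a guess (the proposal only fixes the uid:gid form), and name-to-ID lookup is left out entirely.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RootSpec is a hypothetical parsed form of the proposed --root value.
type RootSpec struct {
	UseDefault bool   // value was "default": create or reuse dockroot
	Numeric    bool   // value was given as uid:gid
	UID        int    // set only when Numeric is true
	GID        int    // set only when Numeric is true
	User       string // set for name-based values
	Group      string // set for name-based values
}

// parseRootSpec interprets "default", "uid:gid", "user:group", or "user".
// Assumptions: a bare name implies a group of the same name, and a value
// with only one numeric half is treated as a pair of names.
func parseRootSpec(value string) (RootSpec, error) {
	switch value {
	case "":
		return RootSpec{}, fmt.Errorf("empty --root value")
	case "default":
		return RootSpec{UseDefault: true, User: "dockroot", Group: "dockroot"}, nil
	}

	parts := strings.SplitN(value, ":", 2)
	user := parts[0]
	group := user
	if len(parts) == 2 {
		group = parts[1]
	}
	if user == "" || group == "" {
		return RootSpec{}, fmt.Errorf("invalid --root value %q", value)
	}

	uid, uerr := strconv.Atoi(user)
	gid, gerr := strconv.Atoi(group)
	if uerr == nil && gerr == nil {
		if uid < 0 || gid < 0 {
			return RootSpec{}, fmt.Errorf("negative id in --root value %q", value)
		}
		return RootSpec{Numeric: true, UID: uid, GID: gid}, nil
	}
	return RootSpec{User: user, Group: group}, nil
}

func main() {
	for _, v := range []string{"default", "2000:2000", "alice:staff"} {
		spec, err := parseRootSpec(v)
		fmt.Printf("%q -> %+v (err: %v)\n", v, spec, err)
	}
}
```

Whether a bare user name should imply a same-named group, or be rejected, is exactly the kind of UX detail this proposal is asking for feedback on.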

When specified, this new flag would cause a new template (see https://github.com/docker/docker/blob/master/daemon/execdriver/native/template/default_template.go for the current template) to be used when containers are created with the native execdriver. That template would additionally create a user namespace, using the specified uid:gid as the remapped root within the container.
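For readers less familiar with user namespaces, the kernel expresses a remapping as lines in /proc/<pid>/uid_map and /proc/<pid>/gid_map, each of the form "ID-inside-namespace ID-outside-namespace length". The sketch below shows what a root-only remapping would look like in that format. The IDMap type and rootMapping function are hypothetical names (they are not the libcontainer API), and the mapping length of 1 is an assumption based on this proposal remapping only root.

```go
package main

import "fmt"

// IDMap mirrors one line of /proc/<pid>/uid_map or gid_map.
// This is an illustrative type, not the libcontainer configuration type.
type IDMap struct {
	ContainerID int // first ID as seen inside the user namespace
	HostID      int // first ID as seen on the host
	Size        int // number of consecutive IDs covered by the mapping
}

// rootMapping maps container ID 0 to the given unprivileged host ID.
// Assumption: only root is mapped, so the range has length 1.
func rootMapping(hostID int) IDMap {
	return IDMap{ContainerID: 0, HostID: hostID, Size: 1}
}

// String renders the mapping in the kernel's uid_map/gid_map line format.
func (m IDMap) String() string {
	return fmt.Sprintf("%d %d %d", m.ContainerID, m.HostID, m.Size)
}

func main() {
	// Example for a daemon started with --root=2000:2000.
	fmt.Println("uid_map:", rootMapping(2000))
	fmt.Println("gid_map:", rootMapping(2000))
}
```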

Other required modifications to support user namespaces

Locally hosted layer content

Because most files in the filesystem layers of any image are owned by root:root, a remapping operation will also need to occur on untar and tar of image layers. This is one of the reasons to start with user namespaces as a daemon-level option: to avoid potentially significant churn of chown activity, all image layers for a specific daemon can be untarred with root:root ownership mapped to the remapped unprivileged container root uid:gid, so that any container within that daemon can use those images. When images are pushed, a tar action happens anyway, so the image layers can be mapped back to root:root at that point. This work is underway, but has no UX component for review.
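The ownership remapping described above amounts to rewriting the uid/gid fields of each tar header as it passes through. A minimal sketch using the standard archive/tar package follows. The function names are hypothetical and this is not the in-progress implementation; it also assumes that entries owned by anyone other than root are left untouched, which the proposal does not specify.

```go
package main

import (
	"archive/tar"
	"fmt"
)

// newHeader builds a tar header with the given ownership (demo helper).
func newHeader(name string, uid, gid int) *tar.Header {
	return &tar.Header{Name: name, Uid: uid, Gid: gid}
}

// remapOnUntar rewrites root ownership to the daemon's remapped root,
// as would happen when a pulled layer is extracted to local storage.
// Assumption: entries not owned by root are left as they are.
func remapOnUntar(hdr *tar.Header, rootUID, rootGID int) {
	if hdr.Uid == 0 {
		hdr.Uid = rootUID
	}
	if hdr.Gid == 0 {
		hdr.Gid = rootGID
	}
}

// remapOnTar reverses the mapping, as would happen when a layer is
// tarred up again for a push.
func remapOnTar(hdr *tar.Header, rootUID, rootGID int) {
	if hdr.Uid == rootUID {
		hdr.Uid = 0
	}
	if hdr.Gid == rootGID {
		hdr.Gid = 0
	}
}

func main() {
	hdr := newHeader("etc/passwd", 0, 0)

	remapOnUntar(hdr, 2000, 2000)
	fmt.Printf("on disk:   %s owned by %d:%d\n", hdr.Name, hdr.Uid, hdr.Gid)

	remapOnTar(hdr, 2000, 2000)
	fmt.Printf("on push:   %s owned by %d:%d\n", hdr.Name, hdr.Uid, hdr.Gid)
}
```

One caveat with this naive reverse step: a file that a container deliberately created with the remapped uid would also be pushed as root-owned, which is the intended behavior here but worth keeping in mind.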

Daemon root

Currently the daemon, by default, creates a directory /var/lib/docker owned by root:root with permissions 0700. Because the actual container root filesystems live underneath this hierarchy, a user-namespaced container will not even start today: the early chdir() call fails because the remapped user has no access to the root-only directory hierarchy under /var/lib/docker.

My proposal is that /var/lib/docker becomes a super-root holding any number of daemon roots, each owned by the uid:gid of the daemon's remapped root, if one is provided. If root is not remapped (user namespaces are "off"), a daemon root under /var/lib/docker (or the user-specified location) named simply "0.0" will be used. For migration purposes, the first time the daemon runs with this feature, existing data in /var/lib/docker (or the user-specified root) will be moved to the subdirectory ./0.0 (the user-namespaces-off case). The super-root directory permissions would change to 0755, but the subdirectories would still use 0700, with ownership matching either the remapped root or real root, depending on the case. An example of three subdirectories under a /var/lib/docker with 0755 permissions is shown below:

drwx------ 2 2000 2000 4096 Mar 13 13:34 2000.2000
drwx------ 2  500  500 4096 Mar 13 13:34 500.500
drwx------ 2 root root 4096 Mar 13 13:34 0.0
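The naming scheme in the listing above can be sketched as a small helper that derives the daemon root from the super-root and the remapped uid:gid. The function and constant names are hypothetical; only the "uid.gid" naming and the 0755/0700 permissions come from this proposal.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const (
	superRootPerm  os.FileMode = 0755 // proposed mode for /var/lib/docker itself
	daemonRootPerm os.FileMode = 0700 // proposed mode for each uid.gid subdirectory
)

// daemonRoot returns the per-mapping daemon root under the super-root.
// With user namespaces off, uid and gid are both 0, which yields "0.0".
func daemonRoot(superRoot string, uid, gid int) string {
	return filepath.Join(superRoot, fmt.Sprintf("%d.%d", uid, gid))
}

func main() {
	superRoot := "/var/lib/docker"
	fmt.Printf("%s (%v)\n", superRoot, superRootPerm)
	for _, id := range [][2]int{{2000, 2000}, {500, 500}, {0, 0}} {
		fmt.Printf("%s (%v)\n", daemonRoot(superRoot, id[0], id[1]), daemonRootPerm)
	}
}
```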

Questions

Open questions/concerns from early review/discussion

Is this a bad starting point if we support full specification of user/group maps in the future?

I believe that if we later support more complete user control over the namespace capabilities available at the libcontainer/Linux kernel level, that will not deprecate this "mode" of user namespace support. Instead, I expect it will be a "Conflicts:" scenario between --root and {future map config option(s)}. Because we expect --root can be supported by the daemon and the image/graph subsystem, it will probably be the more likely path, and custom uid/gid maps will need a set of restrictions around using Hub images or other "migration" scenarios.

How does this impact the use of `--privileged`?

Given that user namespaces are about restricting privileges inside the container, you can guess that the general answer is "--privileged is incompatible with user namespaces". However, --privileged maps to a varied set of actions at container setup (from making /sys read-write instead of read-only to allowing more CAP_* capabilities to remain), so it will require a deeper look at which of those may be compatible or reasonable with user namespace restrictions.