Regarding using ACLs for persistence: I wonder if you could establish a convention where an ACL on the root directory of a filesystem stores the 32-bit user & group namespace id, such that default mounts would mount the entire filesystem within that user & group namespace id. For processes that are not in that user and/or group namespace, you squash uids/gids to that namespace id. Processes inside the user & group namespace see the full 32-bit isolated namespace. This ACL would have to be unwritable from within the user/group namespace for security. In essence, you "taint" the entire filesystem so that even though it might contain persisted files that have uid 0 and setuid, it is hard to mount it in the host's default user/group namespace id (which I assume is 0/0) and thus expose the system to them (i.e. risk actually honouring the uid & setuid of an executable in the FS). This convention does not suffer from the scalability problem since it uses only one ACL (well, two: one for user, one for group) per persistent FS. It is also a fairly easy concept to understand, I think, and also easy to reason about with respect to security. All just handwaving ideas on my part as I am not a developer.
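A toy Python model of the squashing rule described above, for illustration only. This is not kernel code and no such ACL convention exists today; the names `Owner`, `NamespaceTag` and `owner_as_seen_by`, the example id 100000, and the choice to drop the setuid bit for outsiders are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    """Ownership as persisted in the filesystem (classic 32-bit ids)."""
    uid: int
    gid: int
    setuid: bool = False


@dataclass(frozen=True)
class NamespaceTag:
    """Hypothetical pair of ids stored in the ACL on the FS root."""
    user_ns_id: int
    group_ns_id: int


def owner_as_seen_by(owner: Owner, fs_tag: NamespaceTag,
                     proc_tag: NamespaceTag) -> Owner:
    """Return the ownership a process would observe for a file.

    Inside the tagged namespace the persisted ids are shown unchanged.
    Outside it, every file is squashed to the namespace ids and (an
    assumption of this sketch) the setuid bit is not honoured.
    """
    if proc_tag == fs_tag:
        return owner
    return Owner(uid=fs_tag.user_ns_id, gid=fs_tag.group_ns_id, setuid=False)


# Assumed example values: host namespace 0/0, container namespace 100000/100000.
HOST = NamespaceTag(0, 0)
CONTAINER = NamespaceTag(100000, 100000)
root_binary = Owner(uid=0, gid=0, setuid=True)

inside = owner_as_seen_by(root_binary, CONTAINER, CONTAINER)
outside = owner_as_seen_by(root_binary, CONTAINER, HOST)
print("inside: ", inside)
print("outside:", outside)
```

The point of the sketch is that a single tag per filesystem is enough to decide the view for every file, which is why the per-file scalability problem does not arise.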
@virtuous-sloth 9 months ago
It occurs to me that when I say 'default mount' I should really be talking about mount namespaces. In the default mount namespace, mounting a FS with the ACL set would mount that FS squashed to the owner user/group. If you create a mount namespace matching the owner user/group (is that a thing? I don't really understand all the details of the various namespaces), then within that mount namespace the default mount would use the user/group namespace id and would expose the full isolated user/group namespace. It makes my head hurt having to picture multiple isolated syscall boundaries on a per-process basis.
@josephprice5872 9 months ago
This seems to be addressed at @27:30
@virtuous-sloth 9 months ago
@josephprice5872 That's the part that prompted my post. There they are talking about persisting the entire 64-bit kernel uid/gid into the filesystem on a per-file basis, presumably with the classic 32-bit part stored in the normal dirent and the additional namespace part in a per-file ACL. Thus the scalability problem. Instead, what I am suggesting is to store only a single namespace part in an ACL in the root of any FS and to use that as the basis for establishing which user/group namespace is used at mount time. However, upon further reflection it occurs to me that all the namespace settings are per-process, so I'm thinking my idea may not even make sense, because the mount happens in the process's mount namespace, presumably after the user/group namespace has already been established for that process. *shrug* Like I said, I only have a fuzzy notion of how it works.
@josephprice5872 9 months ago
@virtuous-sloth Gotcha, sorry for the misunderstanding. I figure conventions will always be less safe than implementations.
@virtuous-sloth 9 months ago
@josephprice5872 I agree that conventions are less safe than implementations. However, from what I got from the talk, there will be no implementation that prevents the host root user from doing something dangerous, like taking a filesystem whose contents were created inside a user/group namespace, and which contains potentially dangerous executables with uid 0 and setuid, and mounting it in the default user/group namespace of the host. They basically said you just shouldn't do that. And, of course, since it is root, that is pretty much all you can do: avoid doing stupid things. What I'm saying is that the default userspace tools, such as the mount executable, could implement safe behaviours by default (requiring explicit options when you try to do a stupid thing) or simply not allow you to do the stupid thing, thus forcing you to write your own userspace program to be run as root if you really want to do it. Presumably a container-management userspace suite like Incus will do the right thing, but it will store the metadata about the user/group namespace associated with an Incus volume in the Incus DB. In that case, my idea provides no real advantage. My idea would only help your average system admin who creates filesystems, containers, and mounts with the basic userspace tools to do the right thing with the least effort. Maybe.
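The "safe by default, explicit option to override" behaviour suggested for a mount tool could look roughly like the Python sketch below. It only models the decision; it does not call mount, and the function `decide_mount`, the `allow_unsquashed` option and the tag values are hypothetical names invented for this example, not flags of any real tool.

```python
from typing import Optional, Tuple

# Hypothetical tag read from the ACL on the filesystem root:
# (user namespace id, group namespace id). None means the FS is untagged.
Tag = Tuple[int, int]


def decide_mount(fs_tag: Optional[Tag], target_tag: Tag,
                 allow_unsquashed: bool = False) -> str:
    """Decide how a cautious mount tool would expose a filesystem.

    Returns "full" when persisted uids/gids and setuid bits are exposed
    as-is, or "squashed" when everything is mapped to the tagged ids.
    """
    if fs_tag is None:
        # Untagged filesystem: nothing to protect against, behave as today.
        return "full"
    if fs_tag == target_tag:
        # Mounting inside the namespace the filesystem belongs to.
        return "full"
    if allow_unsquashed:
        # The admin explicitly asked for the dangerous behaviour.
        return "full"
    # Safe default: tagged FS mounted from a different namespace.
    return "squashed"


HOST: Tag = (0, 0)
CONTAINER: Tag = (100000, 100000)  # assumed example value

print(decide_mount(CONTAINER, HOST))                         # squashed
print(decide_mount(CONTAINER, CONTAINER))                    # full
print(decide_mount(CONTAINER, HOST, allow_unsquashed=True))  # full
```

A stricter variant could raise an error instead of returning "full" when `allow_unsquashed` is set, which would correspond to the "simply not allow it" option.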