All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@amacapital.net>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: "Stas Sergeev" <stsp2@yandex.ru>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	linux-kernel@vger.kernel.org,
	"Stefan Metzmacher" <metze@samba.org>,
	"Eric Biederman" <ebiederm@xmission.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Jan Kara" <jack@suse.cz>, "Jeff Layton" <jlayton@kernel.org>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Alexander Aring" <alex.aring@gmail.com>,
	"David Laight" <David.Laight@aculab.com>,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Christian Göttsche" <cgzones@googlemail.com>
Subject: Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
Date: Mon, 6 May 2024 10:29:52 -0700	[thread overview]
Message-ID: <CALCETrU2VwCF-o7E5sc8FN_LBs3Q-vNMBf7N4rm0PAWFRo5QWw@mail.gmail.com> (raw)
In-Reply-To: <20240428.171236-tangy.giblet.idle.helpline-y9LqufL7EAAV@cyphar.com>

Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote:
> > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > > It is needed to perform an open operation with the creds that were in
> > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > > flag. This allows the process to pre-open some dirs and switch eUID
> > > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > > the possibility to open/create files within the pre-opened directory set.
> > >
> >
> > I’ve been contemplating this, and I want to propose a different solution.
> >
> > First, the problem Stas is solving is quite narrow and doesn’t
> > actually need kernel support: if I want to write a user program that
> > sandboxes itself, I have at least three solutions already.  I can make
> > a userns and a mountns; I can use landlock; and I can have a separate
> > process that brokers filesystem access using SCM_RIGHTS.
> >
> > But what if I want to run a container, where the container can access
> > a specific host directory, and the contained application is not aware
> > of the exact technology being used?  I recently started using
> > containers in anger in a production setting, and “anger” was
> > definitely the right word: binding part of a filesystem in is
> > *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
> > Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> > actually gave up on making one of my use cases work on a Fedora
> > system.
> >
> > Here’s what I wanted to do, logically, in production: pick a host
> > directory, pick a host *principal* (UID, GID, label, etc), and have
> > the *entire container* access the directory as that principal. This is
> > what happens automatically if I run the whole container as a userns
> > with only a single UID mapped, but I don’t really want to do that for
> > a whole variety and of reasons.
> >
> > So maybe reimagining Stas’ feature a bit can actually solve this
> > problem.  Instead of a special dirfd, what if there was a special
> > subtree (in the sense of open_tree) that captures a set of creds and
> > does all opens inside the subtree using those creds?
> >
> > This isn’t a fully formed proposal, but I *think* it should be
> > generally fairly safe for even an unprivileged user to clone a subtree
> > with a specific flag set to do this. Maybe a capability would be
> > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> > this to a daemon if a privilege is needed, and getting the API right
> > might be a bit tricky.
>
> Tying this to an actual mount rather than a file handle sounds like a
> more plausible proposal than OA2_CRED_INHERIT, but it just seems that
> this is going to re-create all of the work that went into id-mapped
> mounts but with the extra-special step of making the generic VFS
> permissions no longer work normally (unless the idea is that everything
> would pretend to be owned by current_fsuid()?).

I was assuming that the owner uid and gid would be show to stat, etc
as usual.  But the permission checks would be done against the
captured creds.

>
> IMHO it also isn't enough to just make open work, you need to make all
> operations work (which leads to a non-trivial amount of
> filesystem-specific handling), which is just idmapped mounts. A lot of
> work was put into making sure that is safe, and collapsing owners seems
> like it will cause a lot of headaches.
>
> I also find it somewhat amusing that this proposal is to basically give
> up on multi-user permissions for this one directory tree because it's
> too annoying to deal with. In that case, isn't chmod 777 a simpler
> solution? (I'm being a bit flippant, of course there is a difference,
> but the net result is that all users in the container would have the
> same permissions with all of the fun issues that implies.)
>
> In short, AFAICS idmapped mounts pretty much solve this problem (minus
> the ability to collapse users, which I suspect is not a good idea in
> general)?
>

With my kernel hat on, maybe I agree.  But with my *user* hat on, I
think I pretty strongly disagree.  Look, idmapis lousy for
unprivileged use:

$ install -m 0700 -d test_directory
$ echo 'hi there' >test_directory/file
$ podman run -it --rm
--mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
# cat /tmp/file
hi there

<-- Hey, look, this kind of works!

# setpriv --reuid=1 ls /tmp
ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!


Obviously this is a made up example.  But it's quite analogous to a
real example.  Suppose I want to make a directory that will contain
some MySQL data.  I don't want to share this directory with anyone
else, so I set its mode to 0700.  Then I want to fire up an
unprivileged MySQL container, so I build or download it, and then I
run it and bind my directory to /var/lib/mysql and I run it.  I don't
need to think about UIDs or anything because it's 2024 and containers
just work.  Okay, I need to setenforce 0 because I'm on Fedora and
SELinux makes absolutely no sense in a container world, but I can live
with that.

Except that it doesn't work!  Because unless I want to manually futz
with the idmaps to get mysql to have access to the directory inside
the container, only *root* gets to get in.  But I bet that even
futzing with the idmap doesn't work, because software like mysql often
expects that root *and* a user can access data.  And some software
even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory.
Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
could learn a way to squash multiple ids down to one.  Or maybe
something like my silly credential-capturing mount proposal could
work.  But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any
opinion as to whether squashing UID is practical or whether there's
any sensible way to specify it in the configuration.

> On Apr 29, 2024, at 2:12 AM, Christian Brauner <brauner@kernel.org> wrote:
>
> Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
> move_mount(). I rewrote the bind-mount logic in systemd based on that
> and util-linux uses that as well now.
> https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
>

Yep, I remember that.

>> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
>> actually gave up on making one of my use cases work on a Fedora
>> system.
>>
>> Here’s what I wanted to do, logically, in production: pick a host
>> directory, pick a host *principal* (UID, GID, label, etc), and have
>> the *entire container* access the directory as that principal. This is
>> what happens automatically if I run the whole container as a userns
>> with only a single UID mapped, but I don’t really want to do that for
>> a whole variety and of reasons.
>
> You're describing idmapped mounts for the most part which are upstream
> and are used in exactly that way by a lot of userspace.
>

See above...

>>
>> So maybe reimagining Stas’ feature a bit can actually solve this
>> problem.  Instead of a special dirfd, what if there was a special
>> subtree (in the sense of open_tree) that captures a set of creds and
>> does all opens inside the subtree using those creds?
>
> That would mean override creds in the VFS layer when accessing a
> specific subtree which is a terrible idea imho. Not just because it will
> quickly become a potential dos when you do that with a lot of subtrees
> it will also have complex interactions with overlayfs.

I was deliberately talking about semantics, not implementation. This
may well be impossible to implement straightforwardly.

  reply	other threads:[~2024-05-06 17:30 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
2024-04-26 13:33 ` [PATCH v5 1/3] fs: reorganize path_openat() Stas Sergeev
2024-04-26 13:33 ` [PATCH v5 2/3] open: add O_CRED_ALLOW flag Stas Sergeev
2024-04-27  2:12   ` kernel test robot
2024-04-26 13:33 ` [PATCH v5 3/3] openat2: add OA2_CRED_INHERIT flag Stas Sergeev
2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
2024-04-28 17:39   ` stsp
2024-04-28 19:15     ` stsp
2024-04-28 20:19     ` Andy Lutomirski
2024-04-28 21:14       ` stsp
2024-04-28 21:30         ` Andy Lutomirski
2024-04-28 22:12           ` stsp
2024-04-29  1:12             ` stsp
2024-04-29  9:12   ` Christian Brauner
2024-05-06  7:13   ` Aleksa Sarai
2024-05-06 17:29     ` Andy Lutomirski [this message]
2024-05-06 17:34       ` Andy Lutomirski
2024-05-06 19:34       ` David Laight
2024-05-06 21:53         ` Andy Lutomirski
2024-05-07  7:42       ` Christian Brauner
2024-05-07 20:38         ` Andy Lutomirski
2024-05-08  7:32           ` Christian Brauner
2024-05-08 17:30             ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALCETrU2VwCF-o7E5sc8FN_LBs3Q-vNMBf7N4rm0PAWFRo5QWw@mail.gmail.com \
    --to=luto@amacapital.net \
    --cc=David.Laight@aculab.com \
    --cc=alex.aring@gmail.com \
    --cc=brauner@kernel.org \
    --cc=cgzones@googlemail.com \
    --cc=chuck.lever@oracle.com \
    --cc=cyphar@cyphar.com \
    --cc=ebiederm@xmission.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=metze@samba.org \
    --cc=pbonzini@redhat.com \
    --cc=serge@hallyn.com \
    --cc=stsp2@yandex.ru \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.