Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

From: Quentin Perret <qperret@google.com>
To: Qais Yousef <qais.yousef@arm.com>
Cc: mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rickyiu@google.com, wvw@google.com, patrick.bellasi@matbug.net,
	xuewen.yan94@gmail.com, linux-kernel@vger.kernel.org,
	kernel-team@android.com
Subject: Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
Date: Mon, 21 Jun 2021 10:52:28 +0000	[thread overview]
Message-ID: <YNBvbEv2UTzSHBP5@google.com> (raw)
In-Reply-To: <20210614150327.3humrvztv3fxurvk@e107158-lin.cambridge.arm.com>

Hi Qais,

Apologies for the delayed reply, I was away last week.

On Monday 14 Jun 2021 at 16:03:27 (+0100), Qais Yousef wrote:
> On 06/11/21 14:43, Quentin Perret wrote:
> > On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> > > On 06/11/21 13:49, Quentin Perret wrote:
> > > > Thinking about it a bit more, a more involved option would be to have
> > > > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > > > it. The semantics could be:
> > > > 
> > > >   - if the clamp requested by the non-privileged task is lower than its
> > > >     existing clamp, then allow;
> > > >   - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> > > >     allow;
> > > >   - otherwise, deny,
> > > > 
> > > > And the same principle would apply to both uclamp.min and uclamp.max,
> > > > and UCLAMP_RLIMIT would default to 0.
> > > > 
> > > > Thoughts?
> > > 
> > > That could work. But then I'd prefer your patch to go as-is. I don't think
> > > uclamp can do with this extra complexity in using it.
> > 
> > Sorry I'm not sure what you mean here?
> 
> Hmm. I understood this as a new flag to sched_setattr() syscall first, but now
> I get it. You want to use getrlimit()/setrlimit()/prlimit() API to impose
> a restriction. My comment was in regard to this being a sys call extension,
> which it isn't. So please ignore it.
> 
> > 
> > > We basically want to specify we want to be paranoid about uclamp CAP or not. In
> > > my view that is simple and can't see why it would be a big deal to have
> > > a procfs entry to define the level of paranoia the system wants to impose. If
> > > it is a big deal though (would love to hear the arguments);
> > 
> > Not saying it's a big deal, but I think there are a few arguments in
> > favor of using rlimit instead of a sysfs knob. It allows for a much
> > finer grain configuration  -- constraints can be set per-task as well as
> > system wide if needed, and it is the standard way of limiting resources
> > that tasks can ask for.
> 
> Is it system wide or per user?

Right, so calling this 'system-wide' is probably an abuse, but IIRC
rlimits are per-process, and are inherited accross fork/exec. So the
usual trick to have a default value is to set the rlimits on the init
task accordingly. Android for instance already does that for a few
things, and I would guess that systemd and friends have equivalents
(though admittedly I should check that).

> > 
> > > requiring apps that
> > > want to self regulate to have CAP_SYS_NICE is better approach.
> > 
> > Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
> > grants you a lot more power than just clamps ...
> 
> Now I better understand your suggestion. It seems a viable option I agree.
> I need to digest it more still though. The devil is in the details :)
> 
> Shouldn't the default be RLIM_INIFINITY? ie: no limit?

I guess so yes.

> We will need to add two limit, RLIMIT_UCLAMP_MIN/MAX, right?

Not sure, but I was originally envisioning to have only one that applies
to both min and max. In which would we need separate ones?

> We have the following hierarchy now:
> 
> 	1. System Wide (/proc/sys/kerenl/sched_util_clamp_min/max)
> 	2. Cgroup
> 	3. Per-Task
> 
> In that order of priority where 1 limits/overrides 2 and 3. And
> 2 limits/overrides 3.
> 
> Where do you see the RLIMIT fit in this hierarchy? It should be between 2 and
> 3, right? Cgroup settings should still win even if the user/processes were
> limited?

Yes, the rlimit stuff would just apply the syscall interface.

> If the framework decided a user can't request any boost at all (can't increase
> its uclamp_min above 0). IIUC then setting the hard limit of RLIMIT_UCLAMP_MIN
> to 0 would achieve that, right?

Exactly.

> Since the framework and the task itself would go through the same
> sched_setattr() call, how would the framework circumvent this limit? IIUC it
> has to raise the RLIMIT_UCLAMP_MIN first then perform sched_setattr() to
> request the boost value, right? Would this overhead be acceptable? It looks
> considerable to me.

The framework needs to have CAP_SYS_NICE to change another process'
clamps, and generally rlimit checks don't apply to CAP_SYS_NICE-capable
processes -- see __sched_setscheduler(). So I think we should be fine.
IOW, rlimits are just constraining what unprivileged tasks are allowed
to request for themselves IIUC.

> Also, Will prlimit() allow you to go outside what was set for the user via
> setrlimit()? Reading the man pages it seems to override, so that should be
> fine.

IIRC rlimit are per-process properties, not per-user, so I think we
should be fine here as well?

> For 1 (System Wide) limits, sched_setattr() requests are accepted, but the
> effective uclamp is *capped by* the system wide limit.
> 
> Were you thinking RLIMIT_UCLAMP* will behave similarly?

Nope, I was actually thinking of having the syscall return -EPERM in
this case, as we already do for nice values or RT priorities.

> If they do, we have
> consistent behavior with how the current system wide limits work; but this will
> break your use case because tasks can change the requested uclamp value for
> a task, albeit the effective value will be limited.
> 
> 	RLIMIT_UCLAMP_MIN=512
> 	p->uclamp[UCLAMP_min] = 800	// this request is allowed but
> 					// Effective UCLAMP_MIN = 512
> 
> If not, then
> 
> 	RLIMIT_UCLAMP_MIN=no limit
> 	p->uclamp[UCLAMP_min] = 800	// task changed its uclamp_min to 800
> 	RLIMIT_UCLAMP_MIN=512		// limit was lowered for task/user
> 
> what will happen to p->uclamp[UCLAMP_MIN] in this case? Will it be lowered to
> match the new limit? And this will be inconsistent with the current system wide
> limits we already have.

As per the above, if the syscall returns -EPERM we can leave the
integration with system-wide defaults and such untouched I think.

> Sorry too many questions. I was mainly thinking loudly. I need to spend more
> time to dig into the details of how RLIMITs are imposed to understand how this
> could be a good fit. I already see some friction points that needs more
> thinking.

No need to apologize, this would be a new userspace-visible interface,
so you're right that we need to think it through.

Thanks for the feedback,
Quentin

     prev parent reply	other threads:[~2021-06-21 10:52 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-10 15:13 [PATCH v2 0/3] A few uclamp fixes Quentin Perret
2021-06-10 15:13 ` [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting Quentin Perret
2021-06-10 19:05   ` Peter Zijlstra
2021-06-11  7:25     ` Quentin Perret
2021-06-17 15:27       ` Dietmar Eggemann
2021-06-21 10:57         ` Quentin Perret
2021-06-10 15:13 ` [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS Quentin Perret
2021-06-10 19:15   ` Peter Zijlstra
2021-06-11  8:59     ` Quentin Perret
2021-06-11  9:07       ` Quentin Perret
2021-06-11  9:20       ` Peter Zijlstra
2021-06-10 15:13 ` [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE Quentin Perret
2021-06-11 12:48   ` Qais Yousef
2021-06-11 13:08     ` Quentin Perret
2021-06-11 13:26       ` Qais Yousef
2021-06-11 13:49         ` Quentin Perret
2021-06-11 14:17           ` Qais Yousef
2021-06-11 14:43             ` Quentin Perret
2021-06-14 15:03               ` Qais Yousef
2021-06-21 10:52                 ` Quentin Perret [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YNBvbEv2UTzSHBP5@google.com \
    --to=qperret@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=kernel-team@android.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=patrick.bellasi@matbug.net \
    --cc=peterz@infradead.org \
    --cc=qais.yousef@arm.com \
    --cc=rickyiu@google.com \
    --cc=vincent.guittot@linaro.org \
    --cc=wvw@google.com \
    --cc=xuewen.yan94@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.