2021-06-23 12:38:11

by Quentin Perret

Subject: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

There is currently nothing preventing tasks from changing their per-task
clamp values in any way they like. The rationale is probably that system
administrators are still able to limit those clamps thanks to the cgroup
interface. While this is probably fine in many systems where userspace
apps are expected to drive their own power-performance tuning, it causes
pain in systems where both per-task and per-cgroup clamp values are
expected to be under the control of core system components (as is the
case for Android).

To fix this, let's introduce a new rlimit to control the uclamp
behaviour. This allows unprivileged tasks to lower their uclamp
requests, but not increase them unless they have been allowed to do so
via rlimit. This is consistent with the existing behaviour for nice
values or RT priorities.

The default RLIMIT_UCLAMP is set to RLIM_INFINITY to keep the existing
behaviour.
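
For illustration, a privileged component could restrict an app thread
roughly as follows. This is only a sketch: RLIMIT_UCLAMP is defined
locally to the value introduced by this patch, since libc headers won't
know about it yet.

	#include <stdio.h>
	#include <sys/resource.h>

	#ifndef RLIMIT_UCLAMP
	#define RLIMIT_UCLAMP 16	/* value introduced by this patch */
	#endif

	int main(void)
	{
		struct rlimit rlim = { .rlim_cur = 512, .rlim_max = 512 };

		/*
		 * From here on, sched_setattr() requests from this
		 * (unprivileged) process asking for a uclamp.min or
		 * uclamp.max above 512 fail with -EPERM; lowering the
		 * clamps remains allowed.
		 */
		if (setrlimit(RLIMIT_UCLAMP, &rlim))
			perror("setrlimit(RLIMIT_UCLAMP)");

		return 0;
	}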

Signed-off-by: Quentin Perret <[email protected]>
---
fs/proc/base.c | 1 +
include/asm-generic/resource.h | 1 +
include/uapi/asm-generic/resource.h | 3 +-
kernel/sched/core.c | 48 ++++++++++++++++++++++++-----
4 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9cbd915025ad..91a78cf1fe79 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -586,6 +586,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
[RLIMIT_NICE] = {"Max nice priority", NULL},
[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+ [RLIMIT_UCLAMP] = {"Max utilization clamp", NULL},
};

/* Display limits for a process */
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 8874f681b056..53483b7cd4d7 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -26,6 +26,7 @@
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
[RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
+ [RLIMIT_UCLAMP] = { RLIM_INFINITY, RLIM_INFINITY }, \
}

#endif
diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
index f12db7a0da64..4d0fe4d564bf 100644
--- a/include/uapi/asm-generic/resource.h
+++ b/include/uapi/asm-generic/resource.h
@@ -46,7 +46,8 @@
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
-#define RLIM_NLIMITS 16
+#define RLIMIT_UCLAMP 16 /* maximum utilization clamp */
+#define RLIM_NLIMITS 17

/*
* SuS says limits have to be unsigned.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad055fb9ed2d..b094da4c5fea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1430,6 +1430,11 @@ static int uclamp_validate(struct task_struct *p,
if (util_min != -1 && util_max != -1 && util_min > util_max)
return -EINVAL;

+ return 0;
+}
+
+static void uclamp_enable(void)
+{
/*
* We have valid uclamp attributes; make sure uclamp is enabled.
*
@@ -1438,8 +1443,20 @@ static int uclamp_validate(struct task_struct *p,
* scheduler locks.
*/
static_branch_enable(&sched_uclamp_used);
+}

- return 0;
+static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
+{
+ unsigned long uc_rlimit = task_rlimit(p, RLIMIT_UCLAMP);
+
+ if (value == -1) {
+ if (rt_task(p) && clamp_id == UCLAMP_MIN)
+ value = sysctl_sched_uclamp_util_min_rt_default;
+ else
+ value = uclamp_none(clamp_id);
+ }
+
+ return value <= p->uclamp_req[clamp_id].value || value <= uc_rlimit;
}

static bool uclamp_reset(const struct sched_attr *attr,
@@ -1580,6 +1597,11 @@ static inline int uclamp_validate(struct task_struct *p,
{
return -EOPNOTSUPP;
}
+static inline void uclamp_enable(void) { }
+static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
+{
+ return true;
+}
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr) { }
static inline void uclamp_fork(struct task_struct *p) { }
@@ -6116,6 +6138,13 @@ static int __sched_setscheduler(struct task_struct *p,
(rt_policy(policy) != (attr->sched_priority != 0)))
return -EINVAL;

+ /* Update task specific "requested" clamps */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = uclamp_validate(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Allow unprivileged RT tasks to decrease priority:
*/
@@ -6165,6 +6194,15 @@ static int __sched_setscheduler(struct task_struct *p,
/* Normal users shall not reset the sched_reset_on_fork flag: */
if (p->sched_reset_on_fork && !reset_on_fork)
return -EPERM;
+
+ /* Can't increase util-clamps */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
+ !can_uclamp(p, attr->sched_util_min, UCLAMP_MIN))
+ return -EPERM;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
+ !can_uclamp(p, attr->sched_util_max, UCLAMP_MAX))
+ return -EPERM;
}

if (user) {
@@ -6176,12 +6214,8 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

- /* Update task specific "requested" clamps */
- if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
- retval = uclamp_validate(p, attr);
- if (retval)
- return retval;
- }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
+ uclamp_enable();

if (pi)
cpuset_read_lock();
--
2.32.0.288.g62a8d224e6-goog


2021-07-01 10:52:12

by Qais Yousef

Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

Hi Quentin

Thanks for the patch!

+CC Morten

On 06/23/21 12:34, Quentin Perret wrote:
> There is currently nothing preventing tasks from changing their per-task
> clamp values in anyway that they like. The rationale is probably that
> system administrators are still able to limit those clamps thanks to the
> cgroup interface. While this is probably fine in many systems where
> userspace apps are expected to drive their own power-performance, this
> causes pain in a system where both per-task and per-cgroup clamp values
> are expected to be under the control of core system components (as is
> the case for Android).

Yeah when there's a framework that wants full control of how uclamp is set for
each task/app, a mechanism to allow that is necessary.

> To fix this, let's introduce a new rlimit to control the uclamp
> behaviour. This allows unprivileged tasks to lower their uclamp
> requests, but not increase them unless they have been allowed to do so
> via rlimit. This is consistent with the existing behaviour for nice
> values or RT priorities.

I'm still trying to digest the full implications of this new API to be honest.
So take my comments with a pinch of salt from someone who's trying to build
a full mental picture of how all of this should really work :-)

At the moment we have: the system-wide sysctl trumps the cgroup limits, which
in turn trump per-task requests.
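
Roughly, for UCLAMP_MAX that layering means something like (numbers made up
for illustration): a task requesting 700, in a cgroup whose effective
cpu.uclamp.max works out to 614, on a system with sched_util_clamp_max = 512,
ends up with an effective clamp of min(700, 614, 512) = 512.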

The new RLIMIT_UCLAMP will be a layer below cgroup but above per-task, right?

And IIUC, you just want it to limit per-task requests; it doesn't change
the currently set values. I think this is a crucial design decision for this
mechanism.

Is this usage of rlimit, constraining requests without impacting the currently
set value, an accepted one? It's not really limiting a resource; it acts more
like a permission control since it doesn't touch the currently set value.

>
> The default RLIMIT_UCLAMP is set to RLIMIT_INFINITY to keep the existing
> behaviour.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> fs/proc/base.c | 1 +
> include/asm-generic/resource.h | 1 +
> include/uapi/asm-generic/resource.h | 3 +-
> kernel/sched/core.c | 48 ++++++++++++++++++++++++-----
> 4 files changed, 45 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9cbd915025ad..91a78cf1fe79 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -586,6 +586,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> [RLIMIT_NICE] = {"Max nice priority", NULL},
> [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> + [RLIMIT_UCLAMP] = {"Max utilization clamp", NULL},

I think a single RLIMIT_UCLAMP is fine for pure permission control. But if we
have to do something with the currently requested values we'd need to split it
IMO.

> };
>
> /* Display limits for a process */
> diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> index 8874f681b056..53483b7cd4d7 100644
> --- a/include/asm-generic/resource.h
> +++ b/include/asm-generic/resource.h
> @@ -26,6 +26,7 @@
> [RLIMIT_NICE] = { 0, 0 }, \
> [RLIMIT_RTPRIO] = { 0, 0 }, \
> [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
> + [RLIMIT_UCLAMP] = { RLIM_INFINITY, RLIM_INFINITY }, \
> }
>
> #endif
> diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
> index f12db7a0da64..4d0fe4d564bf 100644
> --- a/include/uapi/asm-generic/resource.h
> +++ b/include/uapi/asm-generic/resource.h
> @@ -46,7 +46,8 @@
> 0-39 for nice level 19 .. -20 */
> #define RLIMIT_RTPRIO 14 /* maximum realtime priority */
> #define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
> -#define RLIM_NLIMITS 16
> +#define RLIMIT_UCLAMP 16 /* maximum utilization clamp */
> +#define RLIM_NLIMITS 17
>
> /*
> * SuS says limits have to be unsigned.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ad055fb9ed2d..b094da4c5fea 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1430,6 +1430,11 @@ static int uclamp_validate(struct task_struct *p,
> if (util_min != -1 && util_max != -1 && util_min > util_max)
> return -EINVAL;
>
> + return 0;
> +}
> +
> +static void uclamp_enable(void)
> +{
> /*
> * We have valid uclamp attributes; make sure uclamp is enabled.
> *
> @@ -1438,8 +1443,20 @@ static int uclamp_validate(struct task_struct *p,
> * scheduler locks.
> */
> static_branch_enable(&sched_uclamp_used);
> +}
>
> - return 0;
> +static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
> +{
> + unsigned long uc_rlimit = task_rlimit(p, RLIMIT_UCLAMP);
> +
> + if (value == -1) {
> + if (rt_task(p) && clamp_id == UCLAMP_MIN)
> + value = sysctl_sched_uclamp_util_min_rt_default;
> + else
> + value = uclamp_none(clamp_id);
> + }
> +
> + return value <= p->uclamp_req[clamp_id].value || value <= uc_rlimit;

Hmm why do we still need to prevent the task from changing the uclamp value
upward? It just shouldn't be outside the specified limit, no?

And I think there's a bug in this logic. If UCLAMP_MIN was 1024 and
RLIMIT_UCLAMP was then lowered to 512, the user will still be able to change
UCLAMP_MIN to 700, for example, because of the

return value <= p->uclamp_req[clamp_id].value || ...

I think we should just prevent the requested value from going above the limit,
but the user can lower and raise it within that range. I.e. for RLIMIT_UCLAMP
= 512, any request in the [0:512] range is fine.

Also if we set RLIMIT_UCLAMP = 0, then the user will still be able to change
the uclamp value to 0, which is not what we want. We need a special value for
*all requests are invalid*.

I'm not against this, but my instinct tells me that a simple sysctl knob to
define the paranoia/privileged level for uclamp would be a simpler and more
straightforward control. I still can't get my head around the full implications
of the rlimit and what it should really deliver. It being a pure permission
control mechanism feels off to me, like a misuse of its purpose.

Thanks

--
Qais Yousef

> }
>
> static bool uclamp_reset(const struct sched_attr *attr,
> @@ -1580,6 +1597,11 @@ static inline int uclamp_validate(struct task_struct *p,
> {
> return -EOPNOTSUPP;
> }
> +static inline void uclamp_enable(void) { }
> +static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
> +{
> + return true;
> +}
> static void __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr) { }
> static inline void uclamp_fork(struct task_struct *p) { }
> @@ -6116,6 +6138,13 @@ static int __sched_setscheduler(struct task_struct *p,
> (rt_policy(policy) != (attr->sched_priority != 0)))
> return -EINVAL;
>
> + /* Update task specific "requested" clamps */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> + retval = uclamp_validate(p, attr);
> + if (retval)
> + return retval;
> + }
> +
> /*
> * Allow unprivileged RT tasks to decrease priority:
> */
> @@ -6165,6 +6194,15 @@ static int __sched_setscheduler(struct task_struct *p,
> /* Normal users shall not reset the sched_reset_on_fork flag: */
> if (p->sched_reset_on_fork && !reset_on_fork)
> return -EPERM;
> +
> + /* Can't increase util-clamps */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
> + !can_uclamp(p, attr->sched_util_min, UCLAMP_MIN))
> + return -EPERM;
> +
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
> + !can_uclamp(p, attr->sched_util_max, UCLAMP_MAX))
> + return -EPERM;
> }
>
> if (user) {
> @@ -6176,12 +6214,8 @@ static int __sched_setscheduler(struct task_struct *p,
> return retval;
> }
>
> - /* Update task specific "requested" clamps */
> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> - retval = uclamp_validate(p, attr);
> - if (retval)
> - return retval;
> - }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> + uclamp_enable();
>
> if (pi)
> cpuset_read_lock();
> --
> 2.32.0.288.g62a8d224e6-goog
>

2021-07-01 12:19:30

by Quentin Perret

Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

Hi Qais,

On Thursday 01 Jul 2021 at 11:50:14 (+0100), Qais Yousef wrote:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 9cbd915025ad..91a78cf1fe79 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -586,6 +586,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> > [RLIMIT_NICE] = {"Max nice priority", NULL},
> > [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> > [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> > + [RLIMIT_UCLAMP] = {"Max utilization clamp", NULL},
>
> I think a single RLIMIT_UCLAMP is fine for pure permission control. But if we
> have to do something with the currently requested values we'd need to split it
> IMO.

I don't see why we'd need to TBH. Increasing the uclamp min of a task will
request a higher capacity for the task, and increasing the uclamp max will
_allow_ the task to ask for a higher capacity. So at the end of the day,
what we want to limit is how much a task can request, no matter how it does
it. It's all the same thing in my mind, but if you have a clear idea of what
could go wrong, then I'm happy to think again :)

> > };
> >
> > /* Display limits for a process */
> > diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> > index 8874f681b056..53483b7cd4d7 100644
> > --- a/include/asm-generic/resource.h
> > +++ b/include/asm-generic/resource.h
> > @@ -26,6 +26,7 @@
> > [RLIMIT_NICE] = { 0, 0 }, \
> > [RLIMIT_RTPRIO] = { 0, 0 }, \
> > [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
> > + [RLIMIT_UCLAMP] = { RLIM_INFINITY, RLIM_INFINITY }, \
> > }
> >
> > #endif
> > diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
> > index f12db7a0da64..4d0fe4d564bf 100644
> > --- a/include/uapi/asm-generic/resource.h
> > +++ b/include/uapi/asm-generic/resource.h
> > @@ -46,7 +46,8 @@
> > 0-39 for nice level 19 .. -20 */
> > #define RLIMIT_RTPRIO 14 /* maximum realtime priority */
> > #define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
> > -#define RLIM_NLIMITS 16
> > +#define RLIMIT_UCLAMP 16 /* maximum utilization clamp */
> > +#define RLIM_NLIMITS 17
> >
> > /*
> > * SuS says limits have to be unsigned.
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ad055fb9ed2d..b094da4c5fea 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1430,6 +1430,11 @@ static int uclamp_validate(struct task_struct *p,
> > if (util_min != -1 && util_max != -1 && util_min > util_max)
> > return -EINVAL;
> >
> > + return 0;
> > +}
> > +
> > +static void uclamp_enable(void)
> > +{
> > /*
> > * We have valid uclamp attributes; make sure uclamp is enabled.
> > *
> > @@ -1438,8 +1443,20 @@ static int uclamp_validate(struct task_struct *p,
> > * scheduler locks.
> > */
> > static_branch_enable(&sched_uclamp_used);
> > +}
> >
> > - return 0;
> > +static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
> > +{
> > + unsigned long uc_rlimit = task_rlimit(p, RLIMIT_UCLAMP);
> > +
> > + if (value == -1) {
> > + if (rt_task(p) && clamp_id == UCLAMP_MIN)
> > + value = sysctl_sched_uclamp_util_min_rt_default;
> > + else
> > + value = uclamp_none(clamp_id);
> > + }
> > +
> > + return value <= p->uclamp_req[clamp_id].value || value <= uc_rlimit;
>
> Hmm why do we still need to prevent the task from changing the uclamp value
> upward?

Because this is exactly how rlimit already works for nice or RT priorities.
Tasks are always allowed to decrease their 'importance' (that is, increase
their nice values for example), but are limited in how far they can increase
it.

See the __sched_setscheduler() permission checks for nice values and
such.
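
For reference, the relevant check and helper look roughly like this (quoted
from memory against v5.13, so treat it as a sketch rather than an exact copy):

	/* __sched_setscheduler(), on the unprivileged path: */
	if (attr->sched_nice < task_nice(p) &&
	    !can_nice(p, attr->sched_nice))
		return -EPERM;

	/* kernel/sched/core.c */
	static bool can_nice(const struct task_struct *p, const int nice)
	{
		/* Convert nice value [19,-20] to rlimit style value [1,40]: */
		int nice_rlim = nice_to_rlimit(nice);

		return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
			capable(CAP_SYS_NICE));
	}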

> It just shouldn't be outside the specified limit, no?
>
> And I think there's a bug in this logic. If UCLAMP_MIN was 1024 then the
> RLIMIT_UCLAMP was lowered to 512, the user will be able to change UCLAMP_MIN to
> 700 for example because of the
>
> return value <= p->uclamp_req[clamp_id].value || ...

Right, but again this is very much intentional and consistent with the
existing behaviour for RLIMIT_NICE and friends. I think we should stick
with that for the new uclamp limit unless there is a good reason to
change it.

> I think we should just prevent the requested value to be above the limit. But
> the user can lower and increase it within that range. ie: for RLIMIT_UCLAMP
> = 512, any request in the [0:512] range is fine.
>
> Also if we set RLIMIT_UCLAMP = 0, then the user will still be able to change
> the uclamp value to 0, which is not what we want. We need a special value for
> *all requests are invalid*.

And on this one again this is all for consistency :)

> I'm not against this, but my instinct tells me that the simple sysctl knob to
> define the paranoia/priviliged level for uclamp is a lot simpler and more
> straightforward control.

It is indeed simpler, but either way we're committing to a new
userspace-visible interface. I feel that the rlimit stuff is going to be a lot
more future-proof, because it allows for much finer-grained configurations
and as such is likely to cover more use-cases in the long run.

Thanks,
Quentin

2021-07-01 17:54:08

by Qais Yousef

Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

On 07/01/21 12:05, Quentin Perret wrote:
> Hi Qais,
>
> On Thursday 01 Jul 2021 at 11:50:14 (+0100), Qais Yousef wrote:
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index 9cbd915025ad..91a78cf1fe79 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -586,6 +586,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> > > [RLIMIT_NICE] = {"Max nice priority", NULL},
> > > [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> > > [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> > > + [RLIMIT_UCLAMP] = {"Max utilization clamp", NULL},
> >
> > I think a single RLIMIT_UCLAMP is fine for pure permission control. But if we
> > have to do something with the currently requested values we'd need to split it
> > IMO.
>
> I don't see why we'd need to TBH. Increasing the uclamp min of task will
> request a higher capacity for the task, and increasing the uclamp min
> will _allow_ the task to ask for a higher capacity. So at the end of the
> day, what we want to limit is how much can a task request, no matter
> how it does it. It's all the same thing in my mind, but if you have a
> clear idea of what could go wrong, then I'm happy to think again :)

There are several thoughts actually. A bit hard to articulate at this time of
day, but let me try.

/proc/sys/kernel/sched_util_clamp_min/max are system-wide limits. RLIMIT_UCLAMP
seems to want to mimic them, so it makes sense for both to behave similarly.
Preventing a task from requesting a boost (raising UCLAMP_MIN) is different from
preventing a task from going above a performance point (raising UCLAMP_MAX).
One could want to control one without impacting the other.

Also I'm not sure about the relationship between RLIMIT_UCLAMP and the effective
uclamp value. It seems off to me that by default p->uclamp_req[UCLAMP_MAX]
= 1024, but setting RLIMIT_UCLAMP to 512 will keep all tasks uncapped by
default (i.e. exceeding the limit).

>
> > > };
> > >
> > > /* Display limits for a process */
> > > diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> > > index 8874f681b056..53483b7cd4d7 100644
> > > --- a/include/asm-generic/resource.h
> > > +++ b/include/asm-generic/resource.h
> > > @@ -26,6 +26,7 @@
> > > [RLIMIT_NICE] = { 0, 0 }, \
> > > [RLIMIT_RTPRIO] = { 0, 0 }, \
> > > [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
> > > + [RLIMIT_UCLAMP] = { RLIM_INFINITY, RLIM_INFINITY }, \
> > > }
> > >
> > > #endif
> > > diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
> > > index f12db7a0da64..4d0fe4d564bf 100644
> > > --- a/include/uapi/asm-generic/resource.h
> > > +++ b/include/uapi/asm-generic/resource.h
> > > @@ -46,7 +46,8 @@
> > > 0-39 for nice level 19 .. -20 */
> > > #define RLIMIT_RTPRIO 14 /* maximum realtime priority */
> > > #define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
> > > -#define RLIM_NLIMITS 16
> > > +#define RLIMIT_UCLAMP 16 /* maximum utilization clamp */
> > > +#define RLIM_NLIMITS 17
> > >
> > > /*
> > > * SuS says limits have to be unsigned.
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index ad055fb9ed2d..b094da4c5fea 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1430,6 +1430,11 @@ static int uclamp_validate(struct task_struct *p,
> > > if (util_min != -1 && util_max != -1 && util_min > util_max)
> > > return -EINVAL;
> > >
> > > + return 0;
> > > +}
> > > +
> > > +static void uclamp_enable(void)
> > > +{
> > > /*
> > > * We have valid uclamp attributes; make sure uclamp is enabled.
> > > *
> > > @@ -1438,8 +1443,20 @@ static int uclamp_validate(struct task_struct *p,
> > > * scheduler locks.
> > > */
> > > static_branch_enable(&sched_uclamp_used);
> > > +}
> > >
> > > - return 0;
> > > +static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
> > > +{
> > > + unsigned long uc_rlimit = task_rlimit(p, RLIMIT_UCLAMP);
> > > +
> > > + if (value == -1) {
> > > + if (rt_task(p) && clamp_id == UCLAMP_MIN)
> > > + value = sysctl_sched_uclamp_util_min_rt_default;
> > > + else
> > > + value = uclamp_none(clamp_id);
> > > + }
> > > +
> > > + return value <= p->uclamp_req[clamp_id].value || value <= uc_rlimit;
> >
> > Hmm why do we still need to prevent the task from changing the uclamp value
> > upward?
>
> Because this is exactly how rlimit already works for nice or rt
> priorities. Tasks are always allowed to decrease their 'importance'
> (that is, increase their nice values for ex.), but are limited in how
> they can increase it.
>
> See the __sched_setscheduler() permission checks for nice values and
> such.

I thought we already established that uclamp doesn't have security or fairness
implications and tasks are free to change it the way they want? This
implementation is close to v1 otherwise; we wanted to improve on that?

I think RLIMIT_UCLAMP should set an upper bound, and that's it.

>
> > It just shouldn't be outside the specified limit, no?
> >
> > And I think there's a bug in this logic. If UCLAMP_MIN was 1024 then the
> > RLIMIT_UCLAMP was lowered to 512, the user will be able to change UCLAMP_MIN to
> > 700 for example because of the
> >
> > return value <= p->uclamp_req[clamp_id].value || ...
>
> Right, but again this is very much intentional and consistent with the
> existing behaviour for RLIMIT_NICE and friends. I think we should stick
> with that for the new uclamp limit unless there is a good reason to
> change it.

Like above, I don't see the two limits as the same. Uclamp manages
a different resource that doesn't behave like nice IMO. Apps are free to
lower/raise their uclamp value as long as they don't exceed the limit set by
RLIMIT_UCLAMP.

>
> > I think we should just prevent the requested value to be above the limit. But
> > the user can lower and increase it within that range. ie: for RLIMIT_UCLAMP
> > = 512, any request in the [0:512] range is fine.
> >
> > Also if we set RLIMIT_UCLAMP = 0, then the user will still be able to change
> > the uclamp value to 0, which is not what we want. We need a special value for
> > *all requests are invalid*.
>
> And on this one again this is all for consistency :)

But this will break your use case. If the Android framework decides to boost
a task to 300, and then the task itself decides to set its boost to 0, it'll
override that setting, no? Isn't that against what you want?

We could make 0 actually behave as *all requests are invalid*.

>
> > I'm not against this, but my instinct tells me that the simple sysctl knob to
> > define the paranoia/priviliged level for uclamp is a lot simpler and more
> > straightforward control.
>
> It is indeed simpler, but either way we're committing to a new
> userspace-visible. I feel that the rlimit stuff is going to be a lot
> more future-proof, because it allows for much finer grain configurations
> and as such is likely to cover more use-cases in the long run.

Yeah as I said I'm not against it in principle. What is tripping me up is
mainly the relationship between RLIMIT_UCLAMP and the effective uclamp value.
But maybe I'm overthinking it.

Thanks

--
Qais Yousef

2021-07-02 12:31:57

by Quentin Perret

Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

On Thursday 01 Jul 2021 at 18:52:48 (+0100), Qais Yousef wrote:
> On 07/01/21 12:05, Quentin Perret wrote:
> > Hi Qais,
> >
> > On Thursday 01 Jul 2021 at 11:50:14 (+0100), Qais Yousef wrote:
> > > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > > index 9cbd915025ad..91a78cf1fe79 100644
> > > > --- a/fs/proc/base.c
> > > > +++ b/fs/proc/base.c
> > > > @@ -586,6 +586,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> > > > [RLIMIT_NICE] = {"Max nice priority", NULL},
> > > > [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> > > > [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> > > > + [RLIMIT_UCLAMP] = {"Max utilization clamp", NULL},
> > >
> > > I think a single RLIMIT_UCLAMP is fine for pure permission control. But if we
> > > have to do something with the currently requested values we'd need to split it
> > > IMO.
> >
> > I don't see why we'd need to TBH. Increasing the uclamp min of task will
> > request a higher capacity for the task, and increasing the uclamp min
> > will _allow_ the task to ask for a higher capacity. So at the end of the
> > day, what we want to limit is how much can a task request, no matter
> > how it does it. It's all the same thing in my mind, but if you have a
> > clear idea of what could go wrong, then I'm happy to think again :)
>
> There are several thoughts actually. A bit hard to articulate at this time of
> day, but let me try.
>
> /proc/sys/kernel/sched_util_clamp_min/max are system wide limits. RLIMIT_UCLAMP
> seems to want to mimic it, so it makes sense for both to behave similarly.
> Preventing task from requesting a boost (raising UCLAMP_MIN) is different from
> preventing a task going above performance point (raising UCLAMP_MAX).

I don't really see why -- as you've already explained, tasks can just
busy-loop to inflate their util values. So limiting the min clamp of a task
alone will never be an effective mechanism to limit how much capacity a task
can ask for.

> One could want to control one without impacting the other.
>
> Also I'm not sure about the relationship between RLIMIT_UCLAMP on the effective
> uclamp value. It seems off to me that by default p->uclamp_req[UCLAMP_MAX]
> = 1024, but setting RLIMIT_UCLAMP to 512 will keep all tasks uncapped by
> default (ie: exceeding the limit).
>
> >
> > > > };
> > > >
> > > > /* Display limits for a process */
> > > > diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> > > > index 8874f681b056..53483b7cd4d7 100644
> > > > --- a/include/asm-generic/resource.h
> > > > +++ b/include/asm-generic/resource.h
> > > > @@ -26,6 +26,7 @@
> > > > [RLIMIT_NICE] = { 0, 0 }, \
> > > > [RLIMIT_RTPRIO] = { 0, 0 }, \
> > > > [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
> > > > + [RLIMIT_UCLAMP] = { RLIM_INFINITY, RLIM_INFINITY }, \
> > > > }
> > > >
> > > > #endif
> > > > diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
> > > > index f12db7a0da64..4d0fe4d564bf 100644
> > > > --- a/include/uapi/asm-generic/resource.h
> > > > +++ b/include/uapi/asm-generic/resource.h
> > > > @@ -46,7 +46,8 @@
> > > > 0-39 for nice level 19 .. -20 */
> > > > #define RLIMIT_RTPRIO 14 /* maximum realtime priority */
> > > > #define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
> > > > -#define RLIM_NLIMITS 16
> > > > +#define RLIMIT_UCLAMP 16 /* maximum utilization clamp */
> > > > +#define RLIM_NLIMITS 17
> > > >
> > > > /*
> > > > * SuS says limits have to be unsigned.
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index ad055fb9ed2d..b094da4c5fea 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -1430,6 +1430,11 @@ static int uclamp_validate(struct task_struct *p,
> > > > if (util_min != -1 && util_max != -1 && util_min > util_max)
> > > > return -EINVAL;
> > > >
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void uclamp_enable(void)
> > > > +{
> > > > /*
> > > > * We have valid uclamp attributes; make sure uclamp is enabled.
> > > > *
> > > > @@ -1438,8 +1443,20 @@ static int uclamp_validate(struct task_struct *p,
> > > > * scheduler locks.
> > > > */
> > > > static_branch_enable(&sched_uclamp_used);
> > > > +}
> > > >
> > > > - return 0;
> > > > +static bool can_uclamp(struct task_struct *p, int value, enum uclamp_id clamp_id)
> > > > +{
> > > > + unsigned long uc_rlimit = task_rlimit(p, RLIMIT_UCLAMP);
> > > > +
> > > > + if (value == -1) {
> > > > + if (rt_task(p) && clamp_id == UCLAMP_MIN)
> > > > + value = sysctl_sched_uclamp_util_min_rt_default;
> > > > + else
> > > > + value = uclamp_none(clamp_id);
> > > > + }
> > > > +
> > > > + return value <= p->uclamp_req[clamp_id].value || value <= uc_rlimit;
> > >
> > > Hmm why do we still need to prevent the task from changing the uclamp value
> > > upward?
> >
> > Because this is exactly how rlimit already works for nice or rt
> > priorities. Tasks are always allowed to decrease their 'importance'
> > (that is, increase their nice values for ex.), but are limited in how
> > they can increase it.
> >
> > See the __sched_setscheduler() permission checks for nice values and
> > such.
>
> I thought we already established that uclamp doesn't have security or fairness
> implications and tasks are free to change it the way they want? This
> implementation is close to v1 otherwise; we wanted to improve on that?

Sorry but I'm not sure I understand what you mean here :/
I thought we agreed that it was _not_ always OK to let tasks drive their
own clamp values ...

> I think RLIMIT_UCLAMP should set an upper bound, and that's it.

Well that is the core of our disagreement, but I think we should be
consistent with the existing mechanisms.

Today any unprivileged task can increase its nice value, and there is
nothing root can do to prevent that, irrespective of rlimit.

If root uses rlimit to prevent an unprivileged task from lowering its
nice value below, say, 0 (in the [-20, 19] range), then if that task
already has nice -15, it will be allowed to increase it to e.g. nice
-10, even if that exceeds the rlimit:

https://elixir.bootlin.com/linux/v5.13/source/kernel/sched/core.c#L6127

Tasks are always allowed to decrease their own 'importance' for nice and
RT priorities, and I don't see why we should do anything different for
uclamp.

Rlimit only comes into play when a task tries to increase its importance.
So that's what the above check tries to implement.
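
For example, with RLIMIT_UCLAMP = 512 and a current
p->uclamp_req[UCLAMP_MIN] of 700 (set earlier by a privileged manager),
the check above gives:

	new value 600: allowed (600 <= 700, lowers the current request)
	new value 400: allowed (400 <= 512, within the rlimit)
	new value 800: -EPERM  (exceeds both)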

>
> >
> > > It just shouldn't be outside the specified limit, no?
> > >
> > > And I think there's a bug in this logic. If UCLAMP_MIN was 1024 then the
> > > RLIMIT_UCLAMP was lowered to 512, the user will be able to change UCLAMP_MIN to
> > > 700 for example because of the
> > >
> > > return value <= p->uclamp_req[clamp_id].value || ...
> >
> > Right, but again this is very much intentional and consistent with the
> > existing behaviour for RLIMIT_NICE and friends. I think we should stick
> > with that for the new uclamp limit unless there is a good reason to
> > change it.
>
> Like above. I don't see the two limits are the same. Uclamp is managing
> a different resource that doesn't behave like nice IMO. Apps are free to
> lower/raise their uclamp value as long as they don't exceed the limit set by
> RLIMIT_UCLAMP.

And like above, I don't see why uclamp should not behave like the
existing scheduler-related rlimits :)

> > > I think we should just prevent the requested value to be above the limit. But
> > > the user can lower and increase it within that range. ie: for RLIMIT_UCLAMP
> > > = 512, any request in the [0:512] range is fine.
> > >
> > > Also if we set RLIMIT_UCLAMP = 0, then the user will still be able to change
> > > the uclamp value to 0, which is not what we want. We need a special value for
> > > *all requests are invalid*.
> >
> > And on this one again this is all for consistency :)
>
> But this will break your use case. If android framework decided to boost a task
> to 300, then the task itself decides to set its boost to 0, it'll override that
> setting, no? Isn't against what you want?

No, I don't think we have a problem with that. The problem we have is
that today the framework has essentially no control whatsoever over
per-task clamp values. With the rlimit stuff we can at least limit the
range that tasks are allowed to ask for.

> We could make 0 actually behave as *all requests are invalid*.

I don't have a fundamental problem with that. Feels a little odd to have
a special value in the range, but that could probably be useful, so why
not ...

Thanks!
Quentin

2021-07-08 11:37:26

by Qais Yousef

Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

Hi Quentin

Apologies for the delayed response.

After I replied to most of your email I discovered the cause of the confusion,
but left my reply intact. So please read till the end before you reply ;-)

On 07/02/21 12:28, Quentin Perret wrote:
> > There are several thoughts actually. A bit hard to articulate at this time of
> > day, but let me try.
> >
> > /proc/sys/kernel/sched_util_clamp_min/max are system wide limits. RLIMIT_UCLAMP
> > seems to want to mimic it, so it makes sense for both to behave similarly.
> > Preventing task from requesting a boost (raising UCLAMP_MIN) is different from
> > preventing a task going above performance point (raising UCLAMP_MAX).
>
> I don't really see why -- as you've already explained tasks can just
> busy loop to inflate their util values. So limiting the min clamp of
> task alone will never be an effective mechanism to limit how much
> capacity a task can ask for.

It's about what the API means and how ultimately the end user should use and
benefit from it. I don't think we're on the same page here. Let me try to
explain better (hopefully) :)

The system-wide controls already split uclamp min and uclamp max. What I'm
saying is that each request carries a different meaning. So lumping them under
one control needs at least more explanation of the big picture behind it, to
warrant contradicting how the system-wide limits already behave, and of what
this single knob means from a user perspective outside your specific use
case.

For instance, why is this scenario not allowed?

RLIMIT_UCLAMP_MIN = 200
RLIMIT_UCLAMP_MAX = 1024

This task can boost itself up to 200, and then must 'suffer' DVFS/migration
delays to reach the maximum performance point.

Which is basically what you can do with the sysctl_sched_util_clamp_min/max but
they apply globally (and impact the effective uclamp value).

It's a valid way to control resource requests. Uclamp doesn't guarantee any
allocations anyway; it's just a hint that will hopefully work most of the time
and will be limited by things like thermal issues or an oversubscribed
system.

A more realistic use case would be

RLIMIT_UCLAMP_MIN = 0
RLIMIT_UCLAMP_MAX = 512

So a task can't boost itself, but is allowed to run at half the system
performance point if it becomes busy. This will effectively prevent short
bursty tasks from achieving high performance points, but allow long-running
ones to achieve higher performance points to meet their demand, capped at 512.

In all these cases we're assuming cooperative apps and framework. If there's
a malicious app, then this API could still be useful to block these power
viruses without having to kill them.

The way I see it is that we're introducing an rlimit API to deal with more
than just an on/off switch. I think we need to discuss these potential other
use cases and make sure this API is generic enough to suit them.

The rlimit API will become most useful in a world where we have apps that
dynamically manage their performance, running in an environment where there's
a 'master' uclamp planner that wants to limit what such an app can 'perceive'
and is allowed to achieve in terms of performance/power.

I'd be very happy if this world becomes a reality; I think it's the right way
forward. But for the time being we're just anticipating and guessing. The only
use case we have is on/off. This use case can't dictate the overall theme of
the API though. You can actually view it as a corner case.

> > > Because this is exactly how rlimit already works for nice or rt
> > > priorities. Tasks are always allowed to decrease their 'importance'
> > > (that is, increase their nice values for ex.), but are limited in how
> > > they can increase it.
> > >
> > > See the __sched_setscheduler() permission checks for nice values and
> > > such.
> >
> > I thought we already established that uclamp doesn't have security or fairness
> > implications and tasks are free to change it the way they want? This
> > implementation is close to v1 otherwise; we wanted to improve on that?
>
> Sorry but I'm not sure to understand what you mean here :/
> I thought we agreed that it was _not_ always OK to let tasks drive their
> own clamps values ...

Yeah that's still the case. What I'm saying is that if we set

RLIMIT_UCLAMP = 512

means the task is allowed to change its uclamp value in the 0-512 range. It
just can't go above. The default value of RLIM_INFINITY should preserve the
old behavior as we agreed, which is being able to set it anywhere in the
0-1024 range.

Why should the rlimit impose the 'can only be lowered' behavior?

The 'can only be lowered' behavior means:

p->uclamp[UCLAMP_MIN] = 300

with RLIMIT_UCLAMP = 512

according to you what should happen:

p->uclamp[UCLAMP_MIN] can only be changed to 0-300. If the user changes
it to 200 for instance, then the new range is 0-200. And so on.
A new value of 400 will return -EPERM.

according to what I'm saying:

p->uclamp[UCLAMP_MIN] can only be changed to 0-512. Only values above
512 will get -EPERM. So a new value of 400 will be accepted since it
doesn't break RLIMIT_UCLAMP = 512.

As of mainline code today, a task can choose any value in the 0-1024 range
for itself. I don't agree with imposing a new behavior that only allows
lowering the value. I don't think this behavior fits the uclamp story, as
I tried to explain before. Apps are free to manage their uclamp values.
I appreciate that a framework could set a limit on what they can request, but
they should still be free to pick any value within this limit.
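
In code terms, the difference boils down to (sketch only, reusing the names
from the patch):

	/* v3 check: allow lowering the current request, or anything within the rlimit */
	return value <= p->uclamp_req[clamp_id].value || value <= uc_rlimit;

	/* what I'm suggesting: only the rlimit bound matters */
	return value <= uc_rlimit;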

>
> > I think RLIMIT_UCLAMP should set an upper bound, and that's it.
>
> Well that is the core of our disagreement, but I think we should be
> consitent with the existing mechanisms.

An rlimit only imposes not going above a value, no? Does it say the attribute
may only be 'lowered' once the rlimit is set? I don't see the correlation.

>
> Today any unprivileged task can increase its nice value, and there is
> nothing root can do to prevent that, irrespective of rlimit.

Can or can't? Unprivileged users can only lower nice values even without
rlimit, no? That could be what I'm missing, although when I read the code it
clearly will always fail when the requested nice value is not >= the current
task nice value. The rlimit check is 'don't care' if that's not true.

if (attr->sched_nice < task_nice(p) && ...

>
> If root uses rlimit to prevent an unprivileged task from lowering its
> nice value below, say, 0 (in the [-20, 19] range), then if that task
> already has nice -15, it will be allowed to increase it to e.g. nice
> -10, even if that exceeds the rlimit:
>
> https://elixir.bootlin.com/linux/v5.13/source/kernel/sched/core.c#L6127
>
> Tasks are always allowed to decrease their own 'importance' for nice and
> RT priorities, and I don't see why we should do anything different for
> uclamp.

Ah, I think the inverted meaning of 'increase' for nice might be catching us
here. Yes, for nice, unprivileged tasks can decrease their importance (which
means increase their nice value).

Hopefully I explained my thoughts about uclamp better above. I think for uclamp
we just need to impose that it doesn't exceed the rlimit. It can be increased
or decreased within the rlimit range.

And I think I get your point now about 'exceeding' the RLIMIT. You're talking
about the corner case I mentioned before of

p->uclamp[UCLAMP_MAX] = 1024

but then

RLIMIT_UCLAMP = 512

what you're saying is that it's okay for a task to decrease it to 800 here
although that will exceed the RLIMIT, right?

Okay. I'm not sure how this corner case should be handled. But I hear you; it
could be the consistent behavior to adopt here if that's what nice and priority
already allow to happen.

>
> Rlimit only comes into play when a task tries to increase its importance.
> So that's what the above check tries to implement.

Hmm, the way I read the code is that we unconditionally prevent a task from
increasing its importance. And that's what I was objecting to (besides the
question of the corner case above).

/me revisits the code again

Okay I indeed misread the code. Sorry about that. I think we're aligned now ;-)

It indeed allows moving in the [0:RLIMIT_UCLAMP] range, modulo the corner
case above.

>
> >
> > >
> > > > It just shouldn't be outside the specified limit, no?
> > > >
> > > > And I think there's a bug in this logic. If UCLAMP_MIN was 1024 then the
> > > > RLIMIT_UCLAMP was lowered to 512, the user will be able to change UCLAMP_MIN to
> > > > 700 for example because of the
> > > >
> > > > return value <= p->uclamp_req[clamp_id].value || ...
> > >
> > > Right, but again this is very much intentional and consistent with the
> > > existing behaviour for RLIMIT_NICE and friends. I think we should stick
> > > with that for the new uclamp limit unless there is a good reason to
> > > change it.
> >
> > Like above. I don't see the two limits are the same. Uclamp is managing
> > a different resource that doesn't behave like nice IMO. Apps are free to
> > lower/raise their uclamp value as long as they don't exceed the limit set by
> > RLIMIT_UCLAMP.
>
> And like above, I don't see why uclamp should not behave like the
> existing scheduler-related rlimits :)

Hopefully we're on the same page now. For some reason I was reading this
condition as always imposing lowering, and that's what created the confusion.

>
> > > > I think we should just prevent the requested value to be above the limit. But
> > > > the user can lower and increase it within that range. ie: for RLIMIT_UCLAMP
> > > > = 512, any request in the [0:512] range is fine.
> > > >
> > > > Also if we set RLIMIT_UCLAMP = 0, then the user will still be able to change
> > > > the uclamp value to 0, which is not what we want. We need a special value for
> > > > *all requests are invalid*.
> > >
> > > And on this one again this is all for consistency :)
> >
> > But this will break your use case. If android framework decided to boost a task
> > to 300, then the task itself decides to set its boost to 0, it'll override that
> > setting, no? Isn't against what you want?
>
> No, I don't think we have a problem with that. The problem we have is
> that today the framework has essentially no control what-so-ever over
> per-task clamp values. With the rlimit stuff we can at least limit the
> range that tasks are allowed to ask for.
>
> > We could make 0 actually behave as *all requests are invalid*.
>
> I don't have a fundamental problem with that. Feels a little odd to have
> a special value in the range, but that could probably be useful, so why
> not ...

If setting it to 0 will not break your use case, then we can look at not
imposing that. Although if the API doesn't allow preventing the user from
modifying the uclamp values at all, something will be missing in the API IMO.

Let me think if there are other alternatives to consider. A limit of 0 could be
interpreted as no requests allowed. It makes sense logically to me. But maybe
we can do better, hmmm

Thanks!

--
Qais Yousef

2021-07-19 11:47:02

by Quentin Perret

Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

Hi Qais,

On Thursday 08 Jul 2021 at 12:36:02 (+0100), Qais Yousef wrote:
> Hi Quentin
>
> Apologies about the delayed response.

It's my turn to apologize -- the last few days have been pretty busy :/

> After I replied to most of your email I discovered the cause of the confusion,
> but left my reply intact. So please read till the end before you reply ;-)
>
> On 07/02/21 12:28, Quentin Perret wrote:
> > > There are several thoughts actually. A bit hard to articulate at this time of
> > > day, but let me try.
> > >
> > > /proc/sys/kernel/sched_util_clamp_min/max are system wide limits. RLIMIT_UCLAMP
> > > seems to want to mimic it, so it makes sense for both to behave similarly.
> > > Preventing task from requesting a boost (raising UCLAMP_MIN) is different from
> > > preventing a task going above performance point (raising UCLAMP_MAX).
> >
> > I don't really see why -- as you've already explained tasks can just
> > busy loop to inflate their util values. So limiting the min clamp of
> > task alone will never be an effective mechanism to limit how much
> > capacity a task can ask for.
>
> It's about what the API means and how ultimately the end user should be use and
> benefit from it. I don't think we're on the same page here. Let me try to
> explain better (hopefully) :)
>
> The system wide controls already split the uclamp min and uclamp max. Each
> request delivers a different meaning is what I'm saying. So lumping them under
> one control needs at least more explanation of what is the big picture behind
> it to warrant contradicting how the system wide limits already behave. And what
> this single system knob means from user perspective; outside your specific use
> case.
>
> For instance, why this scenario is not allowed?
>
> RLIMIT_UCLAMP_MIN = 200
> RLIMIT_UCLAMP_MAX = 1024
>
> This task can boost itself up to 200 then must 'suffer' DVFS/migration
> delays to reach maximum performance point.

I just don't see how this would help at all. IMO the rlimit stuff is
only useful to allow root to limit how much a task can ask for itself,
and it doesn't really matter if the task inflates its request via
uclamp.min, or by increasing its uclamp.max and spinning on the CPU.
That is, I just can't imagine how having different values for
RLIMIT_UCLAMP_MIN and RLIMIT_UCLAMP_MAX would be useful in practice.

> Which is basically what you can do with the sysctl_sched_util_clamp_min/max but
> they apply globally (and impact the effective uclamp value).
>
> It's a valid way to control resource requests. Uclamp doesn't guarantee any
> allocations anyway; it's just a hint that hopefully will work most of the time
> and will be limited by thermal issues or oversubscribed system type of
> problems.
>
> A more realistic use case would be
>
> RLIMIT_UCLAMP_MIN = 0
> RLIMIT_UCLAMP_MAX = 512
>
> So a task can't boost itself but allowed to run at half the system performance
> point if it becomes busy.

That's not really what this would mean. What the above means is that a
task is allowed to override the uclamp max set by root as long as it
stays below 512. The actual uclamp.max set by root could be different
from 512.

And IMO the only realistic configurations would be either

RLIMIT_UCLAMP_MIN = 0
RLIMIT_UCLAMP_MAX = 0

which means that tasks can never increase their clamps, root decides
their importance, but they're free to opt out.

Or:

RLIMIT_UCLAMP_MIN = 512
RLIMIT_UCLAMP_MAX = 512

Which means that root allows this task to ask for anything in the 0-512
range. The use case I'd see for this is: root sets the uclamp.max of the
task to 512 and the rlimit to the same value. With that, root is
guaranteed that the clamped util of the task will never exceed 512. The
task can do whatever it wants in that range (spin on the CPU or boost
itself via uclamp min), it doesn't matter: it can't escape the limit.

This is what I feel is the use for the rlimit stuff: it allows root to
put restrictions, and to have guarantees about those restrictions
being respected. So, I don't see how separate rlimits for uclamp min
and max would help really.
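
Something like the following (a sketch only: RLIMIT_UCLAMP is defined locally
to the value from this series, and the sched_setattr() step is only described
in a comment since it needs the raw syscall plus struct sched_attr):

	#define _GNU_SOURCE
	#include <sys/resource.h>
	#include <sys/types.h>

	#ifndef RLIMIT_UCLAMP
	#define RLIMIT_UCLAMP 16	/* from this series */
	#endif

	/* Called by root for the task it wants to restrict. */
	static int restrict_uclamp(pid_t pid)
	{
		struct rlimit r = { .rlim_cur = 512, .rlim_max = 512 };

		/*
		 * Step 1 (not shown): set the task's uclamp.max to 512 via
		 * sched_setattr() with SCHED_FLAG_UTIL_CLAMP_MAX.
		 * Step 2: make sure the task cannot raise it back above 512:
		 */
		return prlimit(pid, RLIMIT_UCLAMP, &r, NULL);
	}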

> Which effectively will prevent short bursty tasks
> from achieving high performance points, but allow long running ones to achieve
> higher performance points to meet their demand and would be capped at 512.

I doubt anybody could make use of this TBH. I can imagine the framework
saying "I don't want this task to ask for more than X of capacity", but I
can't really see how it could reason about allowing long-running tasks
but not short ones, or anything like that...

> In all these cases we're assuming a cooperative apps and framework. If there's
> a malicious app, then this API could still be useful to block these power
> viruses without having to kill them.
>
> The way I see it is that we're introducing RLIMIT API to deal with more than
> just on/off switch. I think we need to discuss these potential other use cases
> and make sure this API is generic enough to suite them.
>
> The RLIMIT API will become most useful in a world where we have apps that
> dynamically manage their performance running in an evnrionment where there's
> a 'master' uclamp planner that wants to limit what this app can 'perceive' and
> are allowed to achieve in terms of performance/power.
>
> I'd be very happy if this world becomes a reality, I think it's the right way
> forward. But for the time being we're just anticipating and guessing. The only
> use case we have is on/off. This use case can't dictate the overall theme of
> the API though. You can actually view it as a corner case.
>
> > > > Because this is exactly how rlimit already works for nice or rt
> > > > priorities. Tasks are always allowed to decrease their 'importance'
> > > > (that is, increase their nice values for ex.), but are limited in how
> > > > they can increase it.
> > > >
> > > > See the __sched_setscheduler() permission checks for nice values and
> > > > such.
> > >
> > > I thought we already established that uclamp doesn't have security or fairness
> > > implications and tasks are free to change it the way they want? This
> > > implementation is close to v1 otherwise; we wanted to improve on that?
> >
> > Sorry but I'm not sure to understand what you mean here :/
> > I thought we agreed that it was _not_ always OK to let tasks drive their
> > own clamps values ...
>
> Yeah that's still the case. What I'm saying is that if we set
>
> RLIMIT_UCLAMP = 512
>
> means the task is allowed to change its uclamp value in the 0-512 range. It
> just can't go above. The default value RLIMIT_INIFINITY should preserve the
> old behavior as we agreed which is set it anywhere in the range between 0-1024.
>
> Why the rlimit should impose the 'only can be lowered' behavior?
>
> The 'only can be lowered' means:
>
> p->uclamp[UCLAMP_MIN] = 300
>
> with RLIMIT_UCLAMP = 512
>
> according to you what should happen:
>
> p->uclamp[UCLAMP_MIN] can only be changed to 0-300. If the user changes
> it to 200 for instance, then the new range is 0-200. And so on.
> A new value of 400 will return EPERM.
>
> according to what I'm saying:
>
> p->uclamp[UCLAMP_MIN] can only be changed to 0-512. Only value above
> 512 will get -EPERM. So a new value of 400 will be accepted since it
> doesn't break RLIMIT_UCLAMP = 512.
>
> As of mainline code today, a task can choose any value in the range 1024 for
> its own. I don't agree to impose a new behavior to impose lowering the value
> only. I don't think this behavior fits the uclamp story as I tried to explain
> before. Apps are free to manage their uclamp values. I appreciate a framework
> could set a limit of what they can request, but they should still be free to
> pick any value within this limit.
>
> >
> > > I think RLIMIT_UCLAMP should set an upper bound, and that's it.
> >
> > Well that is the core of our disagreement, but I think we should be
> > consitent with the existing mechanisms.
>
> RLIMIT only imposes not going above a value, no? Does it say the attribute must
> be 'lowered' only if RLIMIT is set? I don't see a correlation.
>
> >
> > Today any unprivileged task can increase its nice value, and there is
> > nothing root can do to prevent that, irrespective of rlimit.
>
> Can or can't? Unpriviliged users can only lower nice values even without
> RLIMIT, no? That could be what I'm missing although when I read the code it
> clearly will always fail when the requested nice value is not >= the current
> task nice value. The rlimit check is 'don't care' if that's not true.
>
> if (attr->sched_nice < task_nice(p) && ...
>
> >
> > If root uses rlimit to prevent an unprivileged task from lowering its
> > nice value below, say, 0 (in the [-20, 19] range), then if that task
> > already has nice -15, it will be allowed to increase it to e.g. nice
> > -10, even if that exceeds the rlimit:
> >
> > https://elixir.bootlin.com/linux/v5.13/source/kernel/sched/core.c#L6127
> >
> > Tasks are always allowed to decrease their own 'importance' for nice and
> > RT priorities, and I don't see why we should do anything different for
> > uclamp.
>
> Ah, I think the inverted meaning of 'increase' for nice might be catching us
> here. Yes for nice unprivileged can decrease their importance (which means
> increase their nice value).
>
> Hopefully I explained my thoughts about uclamp better above. I think for uclamp
> we just need to impose it doesn't exceed the RLIMIT. It can be increased or
> decreased within the RLIMIT range.
>
> And I think I get your point now about 'exceeding' the RLIMIT. You're talking
> about the corner case I mentioned before of
>
> p->uclamp[UCLAMP_MAX] = 1024
>
> but then
>
> RLIMIT_UCALMP = 512
>
> what you're saying is that it's okay for a task to decrease it to 800 here
> although that will exceed the RLIMIT, right?
>
> Okay. I'm not sure how this corner case should be handled. But I hear your it
> could be the consistent behavior to adopt here if that' what nice and priority
> allow to happen already.
>
> >
> > Rlimit only comes into play when a task tries to increase its importance.
> > So that's what the above check tries to implement.
>
> Hmm the way I read the code is that we unconditionally prevent a task from
> increasing its importance. And that what I was objecting to (beside the
> question of the corner case above).
>
> /me revisits the code again
>
> Okay I indeed misread the code. Sorry about that. I think we're aligned now ;-)
>
> It indeed allows moving in the [0:RLIMIT_UCLAMP] range except to handle the
> corner case above.

Right, so I won't reply to all the above comments, but yes I think we're
in line now ;-)

>
> >
> > >
> > > >
> > > > > It just shouldn't be outside the specified limit, no?
> > > > >
> > > > > And I think there's a bug in this logic. If UCLAMP_MIN was 1024 then the
> > > > > RLIMIT_UCLAMP was lowered to 512, the user will be able to change UCLAMP_MIN to
> > > > > 700 for example because of the
> > > > >
> > > > > return value <= p->uclamp_req[clamp_id].value || ...
> > > >
> > > > Right, but again this is very much intentional and consistent with the
> > > > existing behaviour for RLIMIT_NICE and friends. I think we should stick
> > > > with that for the new uclamp limit unless there is a good reason to
> > > > change it.
> > >
> > > Like above. I don't see how the two limits are the same. Uclamp is managing
> > > a different resource that doesn't behave like nice IMO. Apps are free to
> > > lower/raise their uclamp value as long as they don't exceed the limit set by
> > > RLIMIT_UCLAMP.
> >
> > And like above, I don't see why uclamp should not behave like the
> > existing scheduler-related rlimits :)
>
> Hopefully we're on the same page now. I was reading this condition as always
> imposing lowering for some reason, and that's what created the confusion.
>
> >
> > > > > I think we should just prevent the requested value from being above the limit. But
> > > > > the user can lower and increase it within that range. ie: for RLIMIT_UCLAMP
> > > > > = 512, any request in the [0:512] range is fine.
> > > > >
> > > > > Also if we set RLIMIT_UCLAMP = 0, then the user will still be able to change
> > > > > the uclamp value to 0, which is not what we want. We need a special value for
> > > > > *all requests are invalid*.
> > > >
> > > > And on this one again this is all for consistency :)
> > >
> > > But this will break your use case. If the Android framework decides to boost
> > > a task to 300, and then the task itself decides to set its boost to 0, it'll
> > > override that setting, no? Isn't that against what you want?
> >
> > No, I don't think we have a problem with that. The problem we have is
> > that today the framework has essentially no control whatsoever over
> > per-task clamp values. With the rlimit stuff we can at least limit the
> > range that tasks are allowed to ask for.
> >
> > > We could make 0 actually behave as *all requests are invalid*.
> >
> > I don't have a fundamental problem with that. Feels a little odd to have
> > a special value in the range, but that could probably be useful, so why
> > not ...
>
> If setting it to 0 will not break your use case, then we can look at not
> imposing that. Although if the API doesn't allow preventing the user from
> modifying the uclamp values at all, something will be missing in the API IMO.
>
> Let me think if there are other alternatives to consider. A limit of 0 could be
> interpreted as no requests allowed. It makes sense logically to me. But maybe
> we can do better, hmmm

Alright, I feel like this rlimit stuff is going to need a bit more
discussion, so I'll re-post patches 01 and 02 separately as they're
really just fixes, and we can continue bike-shedding on this one in the
meantime :)

Cheers,
Quentin

2021-07-26 16:40:27

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] sched: Introduce RLIMIT_UCLAMP

Hey Quentin

On 07/19/21 12:44, Quentin Perret wrote:
> Hi Qais,
>
> On Thursday 08 Jul 2021 at 12:36:02 (+0100), Qais Yousef wrote:
> > Hi Quentin
> >
> > Apologies about the delayed response.
>
> It's my turn to apologize -- the last few days have been pretty busy :/
>
> > After I replied to most of your email I discovered the cause of the confusion,
> > but left my reply intact. So please read till the end before you reply ;-)
> >
> > On 07/02/21 12:28, Quentin Perret wrote:
> > > > There are several thoughts actually. A bit hard to articulate at this time of
> > > > day, but let me try.
> > > >
> > > > /proc/sys/kernel/sched_util_clamp_min/max are system wide limits. RLIMIT_UCLAMP
> > > > seems to want to mimic it, so it makes sense for both to behave similarly.
> > > > Preventing task from requesting a boost (raising UCLAMP_MIN) is different from
> > > > preventing a task going above performance point (raising UCLAMP_MAX).
> > >
> > > I don't really see why -- as you've already explained tasks can just
> > > busy loop to inflate their util values. So limiting the min clamp of
> > > a task alone will never be an effective mechanism to limit how much
> > > capacity a task can ask for.
> >
> > It's about what the API means and how ultimately the end user should use and
> > benefit from it. I don't think we're on the same page here. Let me try to
> > explain better (hopefully) :)
> >
> > The system wide controls already split the uclamp min and uclamp max. What
> > I'm saying is that each request carries a different meaning. So lumping them
> > under one control needs at least more explanation of the big picture behind
> > it, to warrant contradicting how the system wide limits already behave, and
> > of what this single knob means from a user perspective, outside your specific
> > use case.
> >
> > For instance, why is this scenario not allowed?
> >
> > RLIMIT_UCLAMP_MIN = 200
> > RLIMIT_UCLAMP_MAX = 1024
> >
> > This task can boost itself up to 200 then must 'suffer' DVFS/migration
> > delays to reach the maximum performance point.
>
> I just don't see how this would help at all. IMO the rlimit stuff is
> only useful to allow root to limit how much a task can ask for itself,
> and it doesn't really matter if the task inflates its request via
> uclamp.min, or by increasing its uclamp.max and spinning on the CPU.
> That is, I just can't imagine how having different values for
> RLIMIT_UCLAMP_MIN and RLIMIT_UCLAMP_MAX would be useful in practice.

The way I see it is that we already have a sysctl that behaves like a limit for
UCLAMP_MIN and UCLAMP_MAX. The above scenario is something you can do with the
global sysctl. So do you think the system wide sysctl should be a single limit
too, then?
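
(For reference, this is roughly how the system wide limits are applied today --
simplified from kernel/sched/core.c, so take the details with a pinch of salt.
uclamp_default[] is what the sched_util_clamp_min/max sysctls feed into, and
note that the same restriction applies to both UCLAMP_MIN and UCLAMP_MAX.)

static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
{
	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
	struct uclamp_se uc_max = uclamp_default[clamp_id];

	/* System default restrictions always apply */
	if (unlikely(uc_req.value > uc_max.value))
		return uc_max;

	return uc_req;
}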

>
> > Which is basically what you can do with the sysctl_sched_util_clamp_min/max but
> > they apply globally (and impact the effective uclamp value).
> >
> > It's a valid way to control resource requests. Uclamp doesn't guarantee any
> > allocations anyway; it's just a hint that hopefully will work most of the time
> > and will be limited by thermal issues or oversubscribed system type of
> > problems.
> >
> > A more realistic use case would be
> >
> > RLIMIT_UCLAMP_MIN = 0
> > RLIMIT_UCLAMP_MAX = 512
> >
> > So a task can't boost itself but is allowed to run at half the system
> > performance point if it becomes busy.
>
> That's not really what this would mean. What the above means is that a
> task is allowed to override the uclamp max set by root as long as it
> stays below 512. The actual uclamp.max set by root could be different
> from 512.
>
> And IMO the only realistic configurations would be either
>
> RLIMIT_UCLAMP_MIN = 0
> RLIMIT_UCLAMP_MAX = 0
>
> which means that tasks can never increase their clamps, root decides
> their importance, but they're free to opt out.
>
> Or:
>
> RLIMIT_UCLAMP_MIN = 512
> RLIMIT_UCLAMP_MAX = 512
>
> Which means that root allows this task to ask for anything in the 0-512
> range. The use case I'd see for this is: root sets the uclamp.max of the
> task to 512 and the rlimit to the same value. With that root is
> guaranteed that the clamped util of the task will never exceed 512. The
> task can do whatever it wants in that range (spin on the CPU or boost
> itself via uclamp min) it doesn't matter: it can't escape the limit.
>
> This is what I feel is the use for the rlimit stuff: it allows root to
> put restrictions, and to have guarantees about those restrictions
> being respected. So, I don't see how separate rlimits for uclamp min
> and max would help really.

Fair enough. I still can't see why the system wide controls can set these limits
separately but the rlimit can't, though. The rlimit just applies them to specific
task(s) rather than every task on the system. I appreciate the way you envisage
it to work, but what I can't connect is why it's fine for the system wide uclamp
limits but not for the per-task rlimits.

>
> > Which effectively will prevent short bursty tasks
> > from achieving high performance points, but allow long running ones to reach
> > higher performance points to meet their demand, capped at 512.
>
> I doubt anybody could make use of this TBH. I can imagine the framework
> saying "I don't want this task to ask for more than X of capacity", but I
> can't really see how it could reason about allowing long running tasks
> but not short ones, or anything like that...

This for me is an argument against RLIMIT, or the current design of the system
wide uclamp limits.

Ultimately this interface tries to regulate an application that uses uclamp to
boost/cap some of its tasks, while a system wide framework tries to regulate
these tasks too. The application can certainly reason about long running vs
short running tasks. The framework can reason about what this app is allowed to
achieve via uclamp. That's the way I see it at least :)

[...]

> Alright, I feel like this rlimit stuff is going to need a bit more
> discussion, so I'll re-post patches 01 and 02 separately as they're
> really just fixes, and we can continue bike-shedding on this one in the
> meantime :)

:)

I'm sorry if I'm being painful, but I honestly just think the story still lacks
clarity. I am working on a uclamp doc, and explaining what the current
interfaces are, how they relate to each other, how they should be used and what
they mean is hard enough. When I think of where this RLIMIT sits in the story,
especially since it so closely resembles how the system wide uclamp sysctls
behave, my mind keeps tripping over it...

The most coherent story IMHO is to always have UCLAMP_MIN and UCLAMP_MAX and
always make UCLAMP_MIN behave as a protection and UCLAMP_MAX as a limit. Then
we'd have a very consistent story across all layers of control. It would just
mean we'd need to change how the system wide UCLAMP_MIN behaves, which could be
an ABI problem.

If we go down this route, I would see the RLIMIT then directly impacting the
effective uclamp value of the task rather than causing -EPERM, the same way the
rest of the layers work. I.e. we'd have the following hierarchy to get the
effective uclamp of a task:

per-task request -> rlimit -> cgroup -> system
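
A sketch of what I mean, modelled on how the cgroup restriction is applied
today. uclamp_rlimit_restrict() is a hypothetical helper, not something from
this series; the point is just that the rlimit would silently clamp the request
(with RLIM_INFINITY naturally meaning "no restriction") instead of rejecting it:

static struct uclamp_se
uclamp_rlimit_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
	unsigned long uc_rlimit = task_rlimit(p, RLIMIT_UCLAMP);

	/* Clamp the per-task request instead of failing with -EPERM. */
	if (uc_req.value > uc_rlimit)
		uclamp_se_set(&uc_req, uc_rlimit, uc_req.user_defined);

	return uc_req;
}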

Would be good to hear what others have to say.

Cheers

--
Qais Yousef