2023-10-26 16:41:52

by Johannes Weiner

Subject: [PATCH] sched: psi: fix unprivileged polling against cgroups

519fabc7aaba ("psi: remove 500ms min window size limitation for
triggers") breaks unprivileged psi polling on cgroups.

Historically, we had a privilege check for polling in the open() of a
pressure file in /proc, but were erroneously missing it for the open()
of cgroup pressure files.

When unprivileged polling was introduced in d82caa273565 ("sched/psi:
Allow unprivileged polling of N*2s period"), it needed to filter
privileges depending on the exact polling parameters, and as such
moved the CAP_SYS_RESOURCE check from the proc open() callback to
psi_trigger_create(). Both the proc files as well as cgroup files go
through this during write(). This implicitly added the missing check
for privileges required for HT polling for cgroups.

When 519fabc7aaba ("psi: remove 500ms min window size limitation for
triggers") followed right after to remove further restrictions on the
RT polling window, it incorrectly assumed the cgroup privilege check
was still missing and added it to the cgroup open(), mirroring what we
used to do for proc files in the past.

As a result, unprivileged poll requests that would be supported now
get rejected when opening the cgroup pressure file for writing.

Remove the cgroup open() check. psi_trigger_create() handles it.

Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers")
Cc: [email protected] # 6.5+
Reported-by: Luca Boccassi <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/cgroup/cgroup.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f11488b18ceb..2069ee98da60 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3879,14 +3879,6 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 	return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
 }
 
-static int cgroup_pressure_open(struct kernfs_open_file *of)
-{
-	if (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
-	return 0;
-}
-
 static void cgroup_pressure_release(struct kernfs_open_file *of)
 {
 	struct cgroup_file_ctx *ctx = of->priv;
@@ -5287,7 +5279,6 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "io.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
-		.open = cgroup_pressure_open,
 		.seq_show = cgroup_io_pressure_show,
 		.write = cgroup_io_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5296,7 +5287,6 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "memory.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
-		.open = cgroup_pressure_open,
 		.seq_show = cgroup_memory_pressure_show,
 		.write = cgroup_memory_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5305,7 +5295,6 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "cpu.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
-		.open = cgroup_pressure_open,
 		.seq_show = cgroup_cpu_pressure_show,
 		.write = cgroup_cpu_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5315,7 +5304,6 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "irq.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
-		.open = cgroup_pressure_open,
 		.seq_show = cgroup_irq_pressure_show,
 		.write = cgroup_irq_pressure_write,
 		.poll = cgroup_pressure_poll,
--
2.42.0
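
The interaction between the two checks described in the commit message can be modelled in a few lines of userspace C. This is only an illustrative sketch, not kernel code: the function names, the `fixed_kernel` flag and the `UNPRIV_WINDOW_STEP_US` constant are hypothetical, the 2s step is taken from the "N*2s period" wording of d82caa273565, and all other trigger validation is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: unprivileged windows must be a multiple of 2s ("N*2s"). */
#define UNPRIV_WINDOW_STEP_US 2000000ULL

/* Model of the cgroup open() check that this patch removes. */
bool cgroup_open_allowed_before_fix(bool opened_for_write, bool has_cap_sys_resource)
{
	return !opened_for_write || has_cap_sys_resource;
}

/*
 * Model of the parameter-dependent check done at trigger creation.
 * Other validation of the trigger string is deliberately omitted.
 */
bool trigger_create_allowed(bool has_cap_sys_resource, uint64_t window_us)
{
	if (has_cap_sys_resource)
		return true;
	return window_us % UNPRIV_WINDOW_STEP_US == 0;
}

/* Whole request: open the cgroup pressure file for writing, then write a trigger. */
bool request_allowed(bool fixed_kernel, bool has_cap_sys_resource, uint64_t window_us)
{
	if (!fixed_kernel && !cgroup_open_allowed_before_fix(true, has_cap_sys_resource))
		return false; /* rejected at open(), before the trigger is ever looked at */
	return trigger_create_allowed(has_cap_sys_resource, window_us);
}
```

In this model, an unprivileged request with a 2s window is rejected when `fixed_kernel` is false and accepted when it is true, while an unprivileged request with a 500ms window is rejected either way.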


2023-10-26 16:50:04

by Luca Boccassi

Subject: Re: [PATCH] sched: psi: fix unprivileged polling against cgroups

On Thu, 26 Oct 2023 at 17:41, Johannes Weiner <[email protected]> wrote:
>
> 519fabc7aaba ("psi: remove 500ms min window size limitation for
> triggers") breaks unprivileged psi polling on cgroups.
>
> Historically, we had a privilege check for polling in the open() of a
> pressure file in /proc, but were erroneously missing it for the open()
> of cgroup pressure files.
>
> When unprivileged polling was introduced in d82caa273565 ("sched/psi:
> Allow unprivileged polling of N*2s period"), it needed to filter
> privileges depending on the exact polling parameters, and as such
> moved the CAP_SYS_RESOURCE check from the proc open() callback to
> psi_trigger_create(). Both the proc files as well as cgroup files go
> through this during write(). This implicitly added the missing check
> for privileges required for HT polling for cgroups.
>
> When 519fabc7aaba ("psi: remove 500ms min window size limitation for
> triggers") followed right after to remove further restrictions on the
> RT polling window, it incorrectly assumed the cgroup privilege check
> was still missing and added it to the cgroup open(), mirroring what we
> used to do for proc files in the past.
>
> As a result, unprivileged poll requests that would be supported now
> get rejected when opening the cgroup pressure file for writing.
>
> Remove the cgroup open() check. psi_trigger_create() handles it.
>
> Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers")
> Cc: [email protected] # 6.5+
> Reported-by: Luca Boccassi <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Luca Boccassi <[email protected]>

Thank you very much for the quick fix - this was reported originally
on the systemd bug tracker by Daniel Black (I do not have an email
address):

https://github.com/systemd/systemd/issues/29723

It is very important for systemd services to be able to do this
without capabilities, as using capabilities means in turn user
namespaces cannot be used (PrivateUsers=yes in systemd parlance).

Kind regards,
Luca Boccassi
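
For readers unfamiliar with the use case, the kind of unprivileged consumer discussed here might look roughly like the sketch below. It assumes the trigger format from the kernel's PSI documentation ("<some|full> <stall amount in us> <time window in us>"); the helper names are hypothetical, and the path passed in would be a pressure file such as memory.pressure inside the service's own cgroup.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a pressure file for read/write and register a trigger on it.
 * Returns the open fd on success, or a negative errno value on failure.
 */
int psi_register_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -errno;

	/* The trigger string is written including its terminating NUL. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}

/*
 * Block until the trigger fires once.
 * Returns 1 on an event, 0 if the file reported an error condition
 * (for example, the cgroup went away), or a negative errno value.
 */
int psi_wait_event(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (poll(&pfd, 1, -1) < 0)
		return -errno;
	if (pfd.revents & POLLERR)
		return 0;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A service without capabilities would call something like `psi_register_trigger(path, "some 150000 2000000")`, that is, a window that is a multiple of 2s. With the open() check in place, the `open()` call for writing is the step that fails for such a process.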

2023-10-26 16:55:55

by Suren Baghdasaryan

Subject: Re: [PATCH] sched: psi: fix unprivileged polling against cgroups

On Thu, Oct 26, 2023 at 9:49 AM Luca Boccassi <[email protected]> wrote:
>
> On Thu, 26 Oct 2023 at 17:41, Johannes Weiner <[email protected]> wrote:
> >
> > 519fabc7aaba ("psi: remove 500ms min window size limitation for
> > triggers") breaks unprivileged psi polling on cgroups.
> >
> > Historically, we had a privilege check for polling in the open() of a
> > pressure file in /proc, but were erroneously missing it for the open()
> > of cgroup pressure files.
> >
> > When unprivileged polling was introduced in d82caa273565 ("sched/psi:
> > Allow unprivileged polling of N*2s period"), it needed to filter
> > privileges depending on the exact polling parameters, and as such
> > moved the CAP_SYS_RESOURCE check from the proc open() callback to
> > psi_trigger_create(). Both the proc files as well as cgroup files go
> > through this during write(). This implicitly added the missing check
> > for privileges required for HT polling for cgroups.
> >
> > When 519fabc7aaba ("psi: remove 500ms min window size limitation for
> > triggers") followed right after to remove further restrictions on the
> > RT polling window, it incorrectly assumed the cgroup privilege check
> > was still missing and added it to the cgroup open(), mirroring what we
> > used to do for proc files in the past.
> >
> > As a result, unprivileged poll requests that would be supported now
> > get rejected when opening the cgroup pressure file for writing.

Ah, I see the problem. In our discussion
https://lore.kernel.org/all/ZADj4YX4uftK%[email protected]/ we decided
to have the check in open() to fail early but we never considered
unprivileged processes which only poll and never create any triggers.
Makes sense.

> >
> > Remove the cgroup open() check. psi_trigger_create() handles it.
> >
> > Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers")
> > Cc: [email protected] # 6.5+
> > Reported-by: Luca Boccassi <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Acked-by: Luca Boccassi <[email protected]>

Acked-by: Suren Baghdasaryan <[email protected]>

>
> Thank you very much for the quick fix - this was reported originally
> on the systemd bug tracker by Daniel Black (I do not have an email
> address):
>
> https://github.com/systemd/systemd/issues/29723
>
> It is very important for systemd services to be able to do this
> without capabilities, as using capabilities means in turn user
> namespaces cannot be used (PrivateUsers=yes in systemd parlance).
>
> Kind regards,
> Luca Boccassi

2023-10-26 17:02:54

by Johannes Weiner

Subject: Re: [PATCH] sched: psi: fix unprivileged polling against cgroups

On Thu, Oct 26, 2023 at 09:55:23AM -0700, Suren Baghdasaryan wrote:
> On Thu, Oct 26, 2023 at 9:49 AM Luca Boccassi <[email protected]> wrote:
> >
> > On Thu, 26 Oct 2023 at 17:41, Johannes Weiner <[email protected]> wrote:
> > >
> > > 519fabc7aaba ("psi: remove 500ms min window size limitation for
> > > triggers") breaks unprivileged psi polling on cgroups.
> > >
> > > Historically, we had a privilege check for polling in the open() of a
> > > pressure file in /proc, but were erroneously missing it for the open()
> > > of cgroup pressure files.
> > >
> > > When unprivileged polling was introduced in d82caa273565 ("sched/psi:
> > > Allow unprivileged polling of N*2s period"), it needed to filter
> > > privileges depending on the exact polling parameters, and as such
> > > moved the CAP_SYS_RESOURCE check from the proc open() callback to
> > > psi_trigger_create(). Both the proc files as well as cgroup files go
> > > through this during write(). This implicitly added the missing check
> > > for privileges required for HT polling for cgroups.
> > >
> > > When 519fabc7aaba ("psi: remove 500ms min window size limitation for
> > > triggers") followed right after to remove further restrictions on the
> > > RT polling window, it incorrectly assumed the cgroup privilege check
> > > was still missing and added it to the cgroup open(), mirroring what we
> > > used to do for proc files in the past.
> > >
> > > As a result, unprivileged poll requests that would be supported now
> > > get rejected when opening the cgroup pressure file for writing.
>
> Ah, I see the problem. In our discussion
> https://lore.kernel.org/all/ZADj4YX4uftK%[email protected]/ we decided
> to have the check in open() to fail early but we never considered
> unprivileged processes which only poll and never create any triggers.
> Makes sense.

Yeah, the two patches just ended up clashing. We made that open()
decision before unprivileged polling was merged, then ended up merging
it before the window patch.

Thanks!

2023-10-26 18:53:00

by Daniel Black


2023-10-31 20:05:59

by Peter Zijlstra

Subject: Re: [PATCH] sched: psi: fix unprivileged polling against cgroups


+cc tj because cgroup

On Thu, Oct 26, 2023 at 12:41:14PM -0400, Johannes Weiner wrote:
> 519fabc7aaba ("psi: remove 500ms min window size limitation for
> triggers") breaks unprivileged psi polling on cgroups.
>
> Historically, we had a privilege check for polling in the open() of a
> pressure file in /proc, but were erroneously missing it for the open()
> of cgroup pressure files.
>
> When unprivileged polling was introduced in d82caa273565 ("sched/psi:
> Allow unprivileged polling of N*2s period"), it needed to filter
> privileges depending on the exact polling parameters, and as such
> moved the CAP_SYS_RESOURCE check from the proc open() callback to
> psi_trigger_create(). Both the proc files as well as cgroup files go
> through this during write(). This implicitly added the missing check
> for privileges required for HT polling for cgroups.
>
> When 519fabc7aaba ("psi: remove 500ms min window size limitation for
> triggers") followed right after to remove further restrictions on the
> RT polling window, it incorrectly assumed the cgroup privilege check
> was still missing and added it to the cgroup open(), mirroring what we
> used to do for proc files in the past.
>
> As a result, unprivileged poll requests that would be supported now
> get rejected when opening the cgroup pressure file for writing.
>
> Remove the cgroup open() check. psi_trigger_create() handles it.
>
> Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers")
> Cc: [email protected] # 6.5+
> Reported-by: Luca Boccassi <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Since the merge window is upon us, I've queued this with the intent to stick
it into sched/urgent after rc1.

> ---
> kernel/cgroup/cgroup.c | 12 ------------
> 1 file changed, 12 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index f11488b18ceb..2069ee98da60 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3879,14 +3879,6 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
>  	return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
>  }
>
> -static int cgroup_pressure_open(struct kernfs_open_file *of)
> -{
> -	if (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
> -		return -EPERM;
> -
> -	return 0;
> -}
> -
>  static void cgroup_pressure_release(struct kernfs_open_file *of)
>  {
>  	struct cgroup_file_ctx *ctx = of->priv;
> @@ -5287,7 +5279,6 @@ static struct cftype cgroup_psi_files[] = {
>  	{
>  		.name = "io.pressure",
>  		.file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
> -		.open = cgroup_pressure_open,
>  		.seq_show = cgroup_io_pressure_show,
>  		.write = cgroup_io_pressure_write,
>  		.poll = cgroup_pressure_poll,
> @@ -5296,7 +5287,6 @@ static struct cftype cgroup_psi_files[] = {
>  	{
>  		.name = "memory.pressure",
>  		.file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
> -		.open = cgroup_pressure_open,
>  		.seq_show = cgroup_memory_pressure_show,
>  		.write = cgroup_memory_pressure_write,
>  		.poll = cgroup_pressure_poll,
> @@ -5305,7 +5295,6 @@ static struct cftype cgroup_psi_files[] = {
>  	{
>  		.name = "cpu.pressure",
>  		.file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
> -		.open = cgroup_pressure_open,
>  		.seq_show = cgroup_cpu_pressure_show,
>  		.write = cgroup_cpu_pressure_write,
>  		.poll = cgroup_pressure_poll,
> @@ -5315,7 +5304,6 @@ static struct cftype cgroup_psi_files[] = {
>  	{
>  		.name = "irq.pressure",
>  		.file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
> -		.open = cgroup_pressure_open,
>  		.seq_show = cgroup_irq_pressure_show,
>  		.write = cgroup_irq_pressure_write,
>  		.poll = cgroup_pressure_poll,
> --
> 2.42.0
>