2022-01-11 07:12:19

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger->event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.

Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: [email protected]
Reported-by: [email protected]
Analyzed-by: Eric Biggers <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
Changes in v2:
- Added Fixes tag, per Eric Biggers
- CC'ed [email protected], per Eric Biggers
- Removed unnecessary READ_ONCE/WRITE_ONCE, per Eric Biggers
- Changed psi_trigger_destroy to accept psi_trigger pointer, per Eric Biggers

Documentation/accounting/psi.rst | 3 +-
include/linux/psi.h | 2 +-
include/linux/psi_types.h | 3 --
kernel/cgroup/cgroup.c | 11 ++++--
kernel/sched/psi.c | 66 ++++++++++++++------------------
5 files changed, 40 insertions(+), 45 deletions(-)

diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
index f2b3439edcc2..860fe651d645 100644
--- a/Documentation/accounting/psi.rst
+++ b/Documentation/accounting/psi.rst
@@ -92,7 +92,8 @@ Triggers can be set on more than one psi metric and more than one trigger
for the same psi metric can be specified. However for each trigger a separate
file descriptor is required to be able to poll it separately from others,
therefore for each trigger a separate open() syscall should be made even
-when opening the same psi interface file.
+when opening the same psi interface file. Write operations to a file descriptor
+with an already existing psi trigger will fail with EBUSY.

Monitors activate only when system enters stall state for the monitored
psi metric and deactivates upon exit from the stall state. While system is
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 65eb1476ac70..74f7148dfb9f 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -32,7 +32,7 @@ void cgroup_move_task(struct task_struct *p, struct css_set *to);

struct psi_trigger *psi_trigger_create(struct psi_group *group,
char *buf, size_t nbytes, enum psi_res res);
-void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t);
+void psi_trigger_destroy(struct psi_trigger *t);

__poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 0a23300d49af..6537d0c92825 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -129,9 +129,6 @@ struct psi_trigger {
* events to one per window
*/
u64 last_event_time;
-
- /* Refcounting to prevent premature destruction */
- struct kref refcount;
};

struct psi_group {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index cafb8c114a21..93b51a2104f7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3642,6 +3642,12 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
cgroup_get(cgrp);
cgroup_kn_unlock(of->kn);

+ /* Allow only one trigger per file descriptor */
+ if (ctx->psi.trigger) {
+ cgroup_put(cgrp);
+ return -EBUSY;
+ }
+
psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
new = psi_trigger_create(psi, buf, nbytes, res);
if (IS_ERR(new)) {
@@ -3649,8 +3655,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return PTR_ERR(new);
}

- psi_trigger_replace(&ctx->psi.trigger, new);
-
+ ctx->psi.trigger = new;
cgroup_put(cgrp);

return nbytes;
@@ -3689,7 +3694,7 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
{
struct cgroup_file_ctx *ctx = of->priv;

- psi_trigger_replace(&ctx->psi.trigger, NULL);
+ psi_trigger_destroy(ctx->psi.trigger);
}

bool cgroup_psi_enabled(void)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1652f2bb54b7..1fe4b8d3ae98 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1151,7 +1151,6 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
t->event = 0;
t->last_event_time = 0;
init_waitqueue_head(&t->event_wait);
- kref_init(&t->refcount);

mutex_lock(&group->trigger_lock);

@@ -1180,15 +1179,19 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
return t;
}

-static void psi_trigger_destroy(struct kref *ref)
+void psi_trigger_destroy(struct psi_trigger *t)
{
- struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount);
- struct psi_group *group = t->group;
+ struct psi_group *group;
struct task_struct *task_to_destroy = NULL;

- if (static_branch_likely(&psi_disabled))
+ /*
+ * We do not check psi_disabled since it might have been disabled after
+ * the trigger got created.
+ */
+ if (!t)
return;

+ group = t->group;
/*
* Wakeup waiters to stop polling. Can happen if cgroup is deleted
* from under a polling process.
@@ -1224,9 +1227,9 @@ static void psi_trigger_destroy(struct kref *ref)
mutex_unlock(&group->trigger_lock);

/*
- * Wait for both *trigger_ptr from psi_trigger_replace and
- * poll_task RCUs to complete their read-side critical sections
- * before destroying the trigger and optionally the poll_task
+ * Wait for psi_schedule_poll_work RCU to complete its read-side
+ * critical section before destroying the trigger and optionally the
+ * poll_task.
*/
synchronize_rcu();
/*
@@ -1243,18 +1246,6 @@ static void psi_trigger_destroy(struct kref *ref)
kfree(t);
}

-void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new)
-{
- struct psi_trigger *old = *trigger_ptr;
-
- if (static_branch_likely(&psi_disabled))
- return;
-
- rcu_assign_pointer(*trigger_ptr, new);
- if (old)
- kref_put(&old->refcount, psi_trigger_destroy);
-}
-
__poll_t psi_trigger_poll(void **trigger_ptr,
struct file *file, poll_table *wait)
{
@@ -1264,24 +1255,15 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
if (static_branch_likely(&psi_disabled))
return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;

- rcu_read_lock();
-
- t = rcu_dereference(*(void __rcu __force **)trigger_ptr);
- if (!t) {
- rcu_read_unlock();
+ t = READ_ONCE(*trigger_ptr);
+ if (!t)
return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
- }
- kref_get(&t->refcount);
-
- rcu_read_unlock();

poll_wait(file, &t->event_wait, wait);

if (cmpxchg(&t->event, 1, 0) == 1)
ret |= EPOLLPRI;

- kref_put(&t->refcount, psi_trigger_destroy);
-
return ret;
}

@@ -1305,14 +1287,24 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,

buf[buf_size - 1] = '\0';

- new = psi_trigger_create(&psi_system, buf, nbytes, res);
- if (IS_ERR(new))
- return PTR_ERR(new);
-
seq = file->private_data;
+
/* Take seq->lock to protect seq->private from concurrent writes */
mutex_lock(&seq->lock);
- psi_trigger_replace(&seq->private, new);
+
+ /* Allow only one trigger per file descriptor */
+ if (seq->private) {
+ mutex_unlock(&seq->lock);
+ return -EBUSY;
+ }
+
+ new = psi_trigger_create(&psi_system, buf, nbytes, res);
+ if (IS_ERR(new)) {
+ mutex_unlock(&seq->lock);
+ return PTR_ERR(new);
+ }
+
+ seq->private = new;
mutex_unlock(&seq->lock);

return nbytes;
@@ -1347,7 +1339,7 @@ static int psi_fop_release(struct inode *inode, struct file *file)
{
struct seq_file *seq = file->private_data;

- psi_trigger_replace(&seq->private, NULL);
+ psi_trigger_destroy(seq->private);
return single_release(inode, file);
}

--
2.34.1.575.g55b058a8bb-goog



2022-01-11 18:48:50

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Mon, Jan 10, 2022 at 11:12:12PM -0800, Suren Baghdasaryan wrote:
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index cafb8c114a21..93b51a2104f7 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3642,6 +3642,12 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
> cgroup_get(cgrp);
> cgroup_kn_unlock(of->kn);
>
> + /* Allow only one trigger per file descriptor */
> + if (ctx->psi.trigger) {
> + cgroup_put(cgrp);
> + return -EBUSY;
> + }
> +
> psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
> new = psi_trigger_create(psi, buf, nbytes, res);
> if (IS_ERR(new)) {
> @@ -3649,8 +3655,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
> return PTR_ERR(new);
> }
>
> - psi_trigger_replace(&ctx->psi.trigger, new);
> -
> + ctx->psi.trigger = new;
> cgroup_put(cgrp);

The write here needs to use smp_store_release(), since it is paired with the
concurrent READ_ONCE() in psi_trigger_poll().

> @@ -1305,14 +1287,24 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
>
> buf[buf_size - 1] = '\0';
>
> - new = psi_trigger_create(&psi_system, buf, nbytes, res);
> - if (IS_ERR(new))
> - return PTR_ERR(new);
> -
> seq = file->private_data;
> +
> /* Take seq->lock to protect seq->private from concurrent writes */
> mutex_lock(&seq->lock);
> - psi_trigger_replace(&seq->private, new);
> +
> + /* Allow only one trigger per file descriptor */
> + if (seq->private) {
> + mutex_unlock(&seq->lock);
> + return -EBUSY;
> + }
> +
> + new = psi_trigger_create(&psi_system, buf, nbytes, res);
> + if (IS_ERR(new)) {
> + mutex_unlock(&seq->lock);
> + return PTR_ERR(new);
> + }
> +
> + seq->private = new;

Likewise here.

- Eric

2022-01-11 19:20:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <[email protected]> wrote:
>
> The write here needs to use smp_store_release(), since it is paired with the
> concurrent READ_ONCE() in psi_trigger_poll().

A smp_store_release() doesn't make sense pairing with a READ_ONCE().

Any memory ordering that the smp_store_release() does on the writing
side is entirely irrelevant, since the READ_ONCE() doesn't imply any
ordering on the reading side. Ordering one but not the other is
nonsensical.

So the proper pattern is to use a WRITE_ONCE() to pair with a
READ_ONCE() (when you don't care about memory ordering, or you handle
it explicitly), or a smp_load_acquire() with a smp_store_release() (in
which case writes before the smp_store_release() on the writing side
will be ordered wrt accesses after smp_load_acquire() on the reading
side).

Of course, in practice, for pointers, the whole "dereference off a
pointer" on the read side *does* imply a barrier in all relevant
situations. So yes, a smp_store_release() -> READ_ONCE() does work in
practice, although it's technically wrong (in particular, it's wrong
on alpha, because of the completely broken memory ordering that alpha
has that doesn't even honor data dependencies as read-side orderings)

But in this case, I do think that since there's some setup involved
with the trigger pointer, the proper serialization is to use
smp_store_release() to set the pointer, and then smp_load_acquire() on
the reading side.

Or just use the RCU primitives - they are even better optimized, and
handle exactly that case, and can be more efficient on some
architectures if release->acquire isn't already cheap.

That said, we've pretty much always accepted that normal word writes
are not going to tear, so we *have* also accepted just

- do any normal store of a value on the write side

- do a READ_ONCE() on the reading side

where the reading side doesn't actually care *what* value it gets, it
only cares that the value it gets is *stable* (ie no compiler reloads
that might show up as two different values on the reading side).

Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need
to worry about memory ordering separately.

> > + seq->private = new;
>
> Likewise here.

Yeah, same deal, except here you can't even use the RCU ones, because
'seq->private' isn't annotated for RCU.

Or you'd do the casting, of course.

Linus

2022-01-11 19:27:13

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 11:11 AM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <[email protected]> wrote:
> >
> > The write here needs to use smp_store_release(), since it is paired with the
> > concurrent READ_ONCE() in psi_trigger_poll().
>
> A smp_store_release() doesn't make sense pairing with a READ_ONCE().
>
> Any memory ordering that the smp_store_release() does on the writing
> side is entirely irrelevant, since the READ_ONCE() doesn't imply any
> ordering on the reading side. Ordering one but not the other is
> nonsensical.
>
> So the proper pattern is to use a WRITE_ONCE() to pair with a
> READ_ONCE() (when you don't care about memory ordering, or you handle
> it explicitly), or a smp_load_acquire() with a smp_store_release() (in
> which case writes before the smp_store_release() on the writing side
> will be ordered wrt accesses after smp_load_acquire() on the reading
> side).
>
> Of course, in practice, for pointers, the whole "dereference off a
> pointer" on the read side *does* imply a barrier in all relevant
> situations. So yes, a smp_store_release() -> READ_ONCE() does work in
> practice, although it's technically wrong (in particular, it's wrong
> on alpha, because of the completely broken memory ordering that alpha
> has that doesn't even honor data dependencies as read-side orderings)
>
> But in this case, I do think that since there's some setup involved
> with the trigger pointer, the proper serialization is to use
> smp_store_release() to set the pointer, and then smp_load_acquire() on
> the reading side.
>
> Or just use the RCU primitives - they are even better optimized, and
> handle exactly that case, and can be more efficient on some
> architectures if release->acquire isn't already cheap.
>
> That said, we've pretty much always accepted that normal word writes
> are not going to tear, so we *have* also accepted just
>
> - do any normal store of a value on the write side
>
> - do a READ_ONCE() on the reading side
>
> where the reading side doesn't actually care *what* value it gets, it
> only cares that the value it gets is *stable* (ie no compiler reloads
> that might show up as two different values on the reading side).
>
> Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need
> to worry about memory ordering separately.
>
> > > + seq->private = new;
> >
> > Likewise here.
>
> Yeah, same deal, except here you can't even use the RCU ones, because
> 'seq->private' isn't annotated for RCU.
>
> Or you'd do the casting, of course.

Thanks for the explanation!
So, it sounds like the best (semantically correct) option I have here
is smp_store_release() to set the pointer, and then smp_load_acquire()
to read it. Is my understanding correct?

>
> Linus

2022-01-11 19:41:36

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 11:11:32AM -0800, Linus Torvalds wrote:
> On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <[email protected]> wrote:
> >
> > The write here needs to use smp_store_release(), since it is paired with the
> > concurrent READ_ONCE() in psi_trigger_poll().
>
> A smp_store_release() doesn't make sense pairing with a READ_ONCE().
>
> Any memory ordering that the smp_store_release() does on the writing
> side is entirely irrelevant, since the READ_ONCE() doesn't imply any
> ordering on the reading side. Ordering one but not the other is
> nonsensical.
>
> So the proper pattern is to use a WRITE_ONCE() to pair with a
> READ_ONCE() (when you don't care about memory ordering, or you handle
> it explicitly), or a smp_load_acquire() with a smp_store_release() (in
> which case writes before the smp_store_release() on the writing side
> will be ordered wrt accesses after smp_load_acquire() on the reading
> side).
>
> Of course, in practice, for pointers, the whole "dereference off a
> pointer" on the read side *does* imply a barrier in all relevant
> situations. So yes, a smp_store_release() -> READ_ONCE() does work in
> practice, although it's technically wrong (in particular, it's wrong
> on alpha, because of the completely broken memory ordering that alpha
> has that doesn't even honor data dependencies as read-side orderings)
>
> But in this case, I do think that since there's some setup involved
> with the trigger pointer, the proper serialization is to use
> smp_store_release() to set the pointer, and then smp_load_acquire() on
> the reading side.
>
> Or just use the RCU primitives - they are even better optimized, and
> handle exactly that case, and can be more efficient on some
> architectures if release->acquire isn't already cheap.
>
> That said, we've pretty much always accepted that normal word writes
> are not going to tear, so we *have* also accepted just
>
> - do any normal store of a value on the write side
>
> - do a READ_ONCE() on the reading side
>
> where the reading side doesn't actually care *what* value it gets, it
> only cares that the value it gets is *stable* (ie no compiler reloads
> that might show up as two different values on the reading side).
>
> Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need
> to worry about memory ordering separately.
>
> > > + seq->private = new;
> >
> > Likewise here.
>
> Yeah, same deal, except here you can't even use the RCU ones, because
> 'seq->private' isn't annotated for RCU.
>
> Or you'd do the casting, of course.
>

This is yet another case of "one time init". There have been long discussions
on this topic before:
* https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u
* https://lore.kernel.org/lkml/[email protected]/T/#u
* https://lwn.net/Articles/827180/

I even attempted to document the best practices:
* https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u

However, no one could agree on whether READ_ONCE() or smp_load_acquire() should
be used. smp_load_acquire() is always correct, so it remains my preference.
However, READ_ONCE() is correct in some cases, and some people (including the
primary LKMM maintainer) insist that it be used in all such cases, as well as in
rcu_dereference() even though this places difficult-to-understand constraints on
how rcu_dereference() can be used.

My preference is that smp_load_acquire() be used. But be aware that this risks
the READ_ONCE() people coming out of the woodwork and arguing for READ_ONCE().

- Eric

2022-01-11 19:43:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 11:27 AM Suren Baghdasaryan <[email protected]> wrote:
>
> Thanks for the explanation!
> So, it sounds like the best (semantically correct) option I have here
> is smp_store_release() to set the pointer, and then smp_load_acquire()
> to read it. Is my understanding correct?

Yeah, that's the clearest one from a memory ordering standpoint, and
is generally also cheap.

Linus

2022-01-11 19:46:05

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 11:41 AM Eric Biggers <[email protected]> wrote:
>
> On Tue, Jan 11, 2022 at 11:11:32AM -0800, Linus Torvalds wrote:
> > On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <[email protected]> wrote:
> > >
> > > The write here needs to use smp_store_release(), since it is paired with the
> > > concurrent READ_ONCE() in psi_trigger_poll().
> >
> > A smp_store_release() doesn't make sense pairing with a READ_ONCE().
> >
> > Any memory ordering that the smp_store_release() does on the writing
> > side is entirely irrelevant, since the READ_ONCE() doesn't imply any
> > ordering on the reading side. Ordering one but not the other is
> > nonsensical.
> >
> > So the proper pattern is to use a WRITE_ONCE() to pair with a
> > READ_ONCE() (when you don't care about memory ordering, or you handle
> > it explicitly), or a smp_load_acquire() with a smp_store_release() (in
> > which case writes before the smp_store_release() on the writing side
> > will be ordered wrt accesses after smp_load_acquire() on the reading
> > side).
> >
> > Of course, in practice, for pointers, the whole "dereference off a
> > pointer" on the read side *does* imply a barrier in all relevant
> > situations. So yes, a smp_store_release() -> READ_ONCE() does work in
> > practice, although it's technically wrong (in particular, it's wrong
> > on alpha, because of the completely broken memory ordering that alpha
> > has that doesn't even honor data dependencies as read-side orderings)
> >
> > But in this case, I do think that since there's some setup involved
> > with the trigger pointer, the proper serialization is to use
> > smp_store_release() to set the pointer, and then smp_load_acquire() on
> > the reading side.
> >
> > Or just use the RCU primitives - they are even better optimized, and
> > handle exactly that case, and can be more efficient on some
> > architectures if release->acquire isn't already cheap.
> >
> > That said, we've pretty much always accepted that normal word writes
> > are not going to tear, so we *have* also accepted just
> >
> > - do any normal store of a value on the write side
> >
> > - do a READ_ONCE() on the reading side
> >
> > where the reading side doesn't actually care *what* value it gets, it
> > only cares that the value it gets is *stable* (ie no compiler reloads
> > that might show up as two different values on the reading side).
> >
> > Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need
> > to worry about memory ordering separately.
> >
> > > > + seq->private = new;
> > >
> > > Likewise here.
> >
> > Yeah, same deal, except here you can't even use the RCU ones, because
> > 'seq->private' isn't annotated for RCU.
> >
> > Or you'd do the casting, of course.
> >
>
> This is yet another case of "one time init". There have been long discussions
> on this topic before:
> * https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u
> * https://lore.kernel.org/lkml/[email protected]/T/#u
> * https://lwn.net/Articles/827180/
>
> I even attempted to document the best practices:
> * https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u
>
> However, no one could agree on whether READ_ONCE() or smp_load_acquire() should
> be used. smp_load_acquire() is always correct, so it remains my preference.
> However, READ_ONCE() is correct in some cases, and some people (including the
> primary LKMM maintainer) insist that it be used in all such cases, as well as in
> rcu_dereference() even though this places difficult-to-understand constraints on
> how rcu_dereference() can be used.
>
> My preference is that smp_load_acquire() be used. But be aware that this risks
> the READ_ONCE() people coming out of the woodwork and arguing for READ_ONCE().

I like my chances here (I believe we do need memory ordering in this
case). I'll post a fix with smp_load_acquire/smp_store_release shortly
after I run my tests. Thanks for the guidance!

>
> - Eric

2022-01-11 20:15:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 11:41 AM Eric Biggers <[email protected]> wrote:
>
> This is yet another case of "one time init".

Ehh. It's somewhat debatable.

For a flag that sets a value once, the rules are somewhat different.
In that case, people may simply not care about memory ordering at all,
because all they care about is the actual flag value, and - thanks to
the one-time behavior - basically whether some transition had happened
or not. That's not all that unusual.

But when you fetch a pointer, things are at least conceptually
slightly different.

Of course, you may use the existence of the pointer itself as a flag
(ie just a "NULL or not"), in which case it's the same as any other
one-time flag thing.

But if you use it to dereference something, then _by_definition_
you're not just fetching a one-time flag - even if the pointer is only
set once. At that point, at a minimum, you require that that thing has
been initialized.

Now, it's then absolutely true that the stuff behind the pointer may
then have other reasons not to care about memory ordering again, and
you may be able to avoid memory ordering even then. If you're just
switching the pointer around between different objects that has been
statically allocated and initialized, then there is no memory ordering
required, for example. You might be back to the "I just want one or
the other of these two pointers".

But if you have something that was initialized before the pointer was
assigned, you really do hit the problem we had on alpha, where even if
you order the pointer write side accesses, the dereferencing of the
pointer may not be ordered on the read side.

Now, alpha is basically dead, and we probably don't really care. Even
on alpha, the whole "data dependency isn't a memory ordering" is
almost impossible to trigger.

And in fact, to avoid too much pain we ended up saying "screw alpha"
and added a memory barrier to READ_ONCE(), so it turns out that
smp_store_release -> READ_ONCE() does work because we just couldn't be
bothered to try something more proper.

So yeah, READ_ONCE() ends up making the "access through a pointer"
thing safe, but that's less of a "it should be safe" and more of a "we
can't waste time dealing with braindamage on platforms that don't
matter".

In general, I think the rule should be that READ_ONCE() is for things
that simply don't care about memory ordering at all (or do whatever
ordering they want explicitly). And yes, one such very common case is
the "one-way flag" where once a certain state has been reached, it's
idempotent.

Of course, then we have the fact that READ_ONCE() can be more
efficient than "smp_load_acquire()" on some platforms, so if something
is *hugely* performance-critical, you might use READ_ONCE() even if
it's not really technically the right thing.

So it's complicated.

A lot of READ_ONCE() users exist just for historical reasons because
they predated smp_store_release/smp_load_acquire. They may well have
been using ACCESS_ONCE() long ago.

And some are there because it's a very critical piece of code, and
it's very intentional.

But if you don't have some huge reasons, I really would prefer people
use "smp_store_release -> smp_load_acquire" as a very clear "handoff"
event.

Linus

2022-01-11 23:37:53

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 12:15 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jan 11, 2022 at 11:41 AM Eric Biggers <[email protected]> wrote:
> >
> > This is yet another case of "one time init".
>
> Ehh. It's somewhat debatable.
>
> For a flag that sets a value once, the rules are somewhat different.
> In that case, people may simply not care about memory ordering at all,
> because all they care about is the actual flag value, and - thanks to
> the one-time behavior - basically whether some transition had happened
> or not. That's not all that unusual.
>
> But when you fetch a pointer, things are at least conceptually
> slightly different.
>
> Of course, you may use the existence of the pointer itself as a flag
> (ie just a "NULL or not"), in which case it's the same as any other
> one-time flag thing.
>
> But if you use it to dereference something, then _by_definition_
> you're not just fetching a one-time flag - even if the pointer is only
> set once. At that point, at a minimum, you require that that thing has
> been initialized.
>
> Now, it's then absolutely true that the stuff behind the pointer may
> then have other reasons not to care about memory ordering again, and
> you may be able to avoid memory ordering even then. If you're just
> switching the pointer around between different objects that has been
> statically allocated and initialized, then there is no memory ordering
> required, for example. You might be back to the "I just want one or
> the other of these two pointers".
>
> But if you have something that was initialized before the pointer was
> assigned, you really do hit the problem we had on alpha, where even if
> you order the pointer write side accesses, the dereferencing of the
> pointer may not be ordered on the read side.
>
> Now, alpha is basically dead, and we probably don't really care. Even
> on alpha, the whole "data dependency isn't a memory ordering" is
> almost impossible to trigger.
>
> And in fact, to avoid too much pain we ended up saying "screw alpha"
> and added a memory barrier to READ_ONCE(), so it turns out that
> smp_store_release -> READ_ONCE() does work because we just couldn't be
> bothered to try something more proper.
>
> So yeah, READ_ONCE() ends up making the "access through a pointer"
> thing safe, but that's less of a "it should be safe" and more of a "we
> can't waste time dealing with braindamage on platforms that don't
> matter".
>
> In general, I think the rule should be that READ_ONCE() is for things
> that simply don't care about memory ordering at all (or do whatever
> ordering they want explicitly). And yes, one such very common case is
> the "one-way flag" where once a certain state has been reached, it's
> idempotent.
>
> Of course, then we have the fact that READ_ONCE() can be more
> efficient than "smp_load_acquire()" on some platforms, so if something
> is *hugely* performance-critical, you might use READ_ONCE() even if
> it's not really technically the right thing.
>
> So it's complicated.
>
> A lot of READ_ONCE() users exist just for historical reasons because
> they predated smp_store_release/smp_load_acquire. They may well have
> been using ACCESS_ONCE() long ago.
>
> And some are there because it's a very critical piece of code, and
> it's very intentional.
>
> But if you don't have some huge reasons, I really would prefer people
> use "smp_store_release -> smp_load_acquire" as a very clear "handoff"
> event.

Posted v3 with smp_store_release/smp_load_acquire:
https://lore.kernel.org/all/[email protected]
Thanks!

>
> Linus

2022-01-12 11:06:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Tue, Jan 11, 2022 at 11:11:32AM -0800, Linus Torvalds wrote:

> Of course, in practice, for pointers, the whole "dereference off a
> pointer" on the read side *does* imply a barrier in all relevant
> situations. So yes, a smp_store_release() -> READ_ONCE() does work in
> practice, although it's technically wrong (in particular, it's wrong
> on alpha, because of the completely broken memory ordering that alpha
> has that doesn't even honor data dependencies as read-side orderings)

On a tangent, that actually works, even on Alpha, see commit
d646285885154 ("alpha: Override READ_ONCE() with barriered
implementation").

2022-01-12 14:37:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

On Mon, Jan 10, 2022 at 11:12:12PM -0800, Suren Baghdasaryan wrote:
> With write operation on psi files replacing old trigger with a new one,
> the lifetime of its waitqueue is totally arbitrary. Overwriting an
> existing trigger causes its waitqueue to be freed and pending poll()
> will stumble on trigger->event_wait which was destroyed.
> Fix this by disallowing to redefine an existing psi trigger. If a write
> operation is used on a file descriptor with an already existing psi
> trigger, the operation will fail with EBUSY error.
> Also bypass a check for psi_disabled in the psi_trigger_destroy as the
> flag can be flipped after the trigger is created, leading to a memory
> leak.
>
> Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> Cc: [email protected]
> Reported-by: [email protected]
> Analyzed-by: Eric Biggers <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

That looks good to me, thanks Suren.

Acked-by: Johannes Weiner <[email protected]>

Peter, would you mind picking this up and routing this to Linus
through the sched tree, please?

2022-01-19 20:11:14

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled

Linus Torvalds <[email protected]> wrote:
>
> Of course, in practice, for pointers, the whole "dereference off a
> pointer" on the read side *does* imply a barrier in all relevant
> situations. So yes, a smp_store_release() -> READ_ONCE() does work in
> practice, although it's technically wrong (in particular, it's wrong
> on alpha, because of the completely broken memory ordering that alpha
> has that doesn't even honor data dependencies as read-side orderings)

READ_ONCE has contained the alpha barrier since 2017:

commit 76ebbe78f7390aee075a7f3768af197ded1bdfbb
Author: Will Deacon <[email protected]>
Date: Tue Oct 24 11:22:47 2017 +0100

locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt