do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call
pipelined_send.
This leads to a very hard to trigger race where a do_mq_timedreceive call
might return and leave do_mq_timedsend to rely on an invalid address,
causing the following crash:
[ 240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
[ 240.739991] Call Trace:
[ 240.739999] __x64_sys_mq_timedsend+0x2a9/0x490
[ 240.740003] ? auditd_test_task+0x38/0x40
[ 240.740007] ? auditd_test_task+0x38/0x40
[ 240.740011] do_syscall_64+0x80/0x680
[ 240.740017] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 240.740019] RIP: 0033:0x7f5928e40343
The race occurs as:
1. do_mq_timedreceive calls wq_sleep with the address of
`struct ext_wait_queue` on function stack (aliased as `ewq_addr` here)
- it holds a valid `struct ext_wait_queue *` as long as the stack has
not been overwritten.
2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.
3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
Here is where the race window begins. (`this` is `ewq_addr`.)
4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break. `ewq_addr` gets removed from
info->e_wait_q[RECV].list.
5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)
6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass
to the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.
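To make the lifetime problem concrete, here is a minimal userspace analogue
of the same pattern (hypothetical illustration only, not the kernel code;
`struct waiter` stands in for `struct ext_wait_queue`, the atomic flag for
`state`):

/* Build with: gcc -pthread race-demo.c -o race-demo. Demo of the bug
 * class only; the window is tiny, so a crash is not guaranteed. */
#include <pthread.h>
#include <stdatomic.h>

struct waiter {
	atomic_int state;		/* 0 = waiting, 1 = "STATE_READY" */
	pthread_t  task;		/* whom the sender would wake */
};

static struct waiter *_Atomic wait_q;	/* stands in for info->e_wait_q */

static void *receiver(void *arg)
{
	struct waiter w = { .state = 0, .task = pthread_self() };

	atomic_store(&wait_q, &w);	/* wq_add: a stack address escapes */
	while (!atomic_load(&w.state))	/* wq_sleep */
		;
	return NULL;			/* w's stack frame dies here (step 5) */
}

static void *sender(void *arg)
{
	struct waiter *this;

	while (!(this = atomic_load(&wait_q)))
		;
	atomic_store(&this->state, 1);	/* STATE_READY: receiver may now return */
	/* BUG: as in step 6, `this` may be a dead stack frame by now */
	pthread_t t = this->task;	/* use-after-return, possibly garbage */
	(void)t;
	return NULL;
}

int main(void)
{
	pthread_t r, s;

	pthread_create(&r, NULL, receiver, NULL);
	pthread_create(&s, NULL, sender, NULL);
	pthread_join(r, NULL);
	pthread_join(s, NULL);
	return 0;
}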
do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing
`this` which sits on the receiver's stack.
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Signed-off-by: Varad Gautam <[email protected]>
Reported-by: Matthias von Faber <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
---
ipc/mqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae2..8f78057c6be53 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1004,12 +1004,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
 				  struct mqueue_inode_info *info,
 				  struct ext_wait_queue *this)
 {
+	struct task_struct *t;
+
 	list_del(&this->list);
-	get_task_struct(this->task);
+	t = get_task_struct(this->task);
 
 	/* see MQ_BARRIER for purpose/pairing */
 	smp_store_release(&this->state, STATE_READY);
-	wake_q_add_safe(wake_q, this->task);
+	wake_q_add_safe(wake_q, t);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
--
2.30.2
On 2021-05-04 08:55, Varad Gautam wrote:
> do_mq_timedreceive calls wq_sleep with a stack local address. The
> sender (do_mq_timedsend) uses this address to later call
> pipelined_send.
>
> This leads to a very hard to trigger race where a do_mq_timedreceive call
> might return and leave do_mq_timedsend to rely on an invalid address,
> causing the following crash:
>
> [ 240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
> [ 240.739991] Call Trace:
> [ 240.739999] __x64_sys_mq_timedsend+0x2a9/0x490
> [ 240.740003] ? auditd_test_task+0x38/0x40
> [ 240.740007] ? auditd_test_task+0x38/0x40
> [ 240.740011] do_syscall_64+0x80/0x680
> [ 240.740017] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 240.740019] RIP: 0033:0x7f5928e40343
>
> The race occurs as:
>
> 1. do_mq_timedreceive calls wq_sleep with the address of
> `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here)
> - it holds a valid `struct ext_wait_queue *` as long as the stack has
> not been overwritten.
>
> 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
> do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
> __pipelined_op.
>
> 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
> Here is where the race window begins. (`this` is `ewq_addr`.)
>
> 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
> will see `state == STATE_READY` and break. `ewq_addr` gets removed from
> info->e_wait_q[RECV].list.
>
> 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
> to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
> stack. (Although the address may not get overwritten until another
> function happens to touch it, which means it can persist around for an
> indefinite time.)
>
> 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
> `struct ext_wait_queue *`, and uses it to find a task_struct to pass
> to the wake_q_add_safe call. In the lucky case where nothing has
> overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
> In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
> bogus address as the receiver's task_struct causing the crash.
>
> do_mq_timedsend::__pipelined_op() should not dereference `this` after
> setting STATE_READY, as the receiver counterpart is now free to return.
> Change __pipelined_op to call wake_q_add_safe on the receiver's
> task_struct returned by get_task_struct, instead of dereferencing
> `this` which sits on the receiver's stack.
>
> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Right, historically we've always ensured that the waker does the state ready
write as the last operation, with or without lockless wake_qs. And this commit
broke this:
@@ -923,16 +988,11 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
 				  struct ext_wait_queue *this)
 {
 	list_del(&this->list);
-	wake_q_add(wake_q, this->task);
-	/*
-	 * Rely on the implicit cmpxchg barrier from wake_q_add such
-	 * that we can ensure that updating receiver->state is the last
-	 * write operation: As once set, the receiver can continue,
-	 * and if we don't have the reference count from the wake_q,
-	 * yet, at that point we can later have a use-after-free
-	 * condition and bogus wakeup.
-	 */
-	this->state = STATE_READY;
+	get_task_struct(this->task);
+
+	/* see MQ_BARRIER for purpose/pairing */
+	smp_store_release(&this->state, STATE_READY);
+	wake_q_add_safe(wake_q, this->task);
 }
.. so while addressing the race against get_task_struct() vs wakee exit we ended
up breaking the case where the task returns before the task is added to the wake_q
(which actually we explicitly re-ordered :). So at this point we know that the
->state = STATE_READY must be done after the whole of the wake_q addition operation.
Instead, how about the following which closes the race altogether and simplifies the
code. This basically goes back to a correct version of fa6004ad4528
(ipc/mqueue: Implement lockless pipelined wakeups). And by correct I mean keeping the
smp_store_release() which ensures the proper wakeup semantics.
Thanks,
Davidlohr
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..43f0ae61c40b 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -76,14 +76,15 @@ struct posix_msg_tree_node {
  * acquiring info->lock.
  *
  * MQ_BARRIER:
- * To achieve proper release/acquire memory barrier pairing, the state is set to
- * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
- * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ * To achieve proper release/acquire memory barrier pairing, the state is
+ * set to STATE_READY with smp_store_release() such that it is the last write
+ * operation, in which afterwards the blocked task can immediately return and
+ * exit. It is read with READ_ONCE followed by smp_acquire__after_ctrl_dep().
  *
  * This prevents the following races:
  *
- * 1) With the simple wake_q_add(), the task could be gone already before
- *    the increase of the reference happens
+ * 1) With the wake_q_add(), the task could be gone already before
+ *    the increase of the reference happens:
  * Thread A
  *				Thread B
  * WRITE_ONCE(wait.state, STATE_NONE);
@@ -97,10 +98,27 @@ struct posix_msg_tree_node {
  *				sys_exit()
  *				get_task_struct() // UaF
  *
- * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
+ *
+ * 2) With the wake_q_add(), the waiter task could have returned from the
+ *    syscall and overwritten it's task-allocated waiter before the sender
+ *    can be added to the wake_q:
+ * Thread A
+ *				Thread B
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *				->state = STATE_READY
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * sysret to user space
+ * overwrite receiver's stack
+ *				wake_q_add(A)
+ *				if (cmpxchg()) // corrupted waiter
+ *
+ * Solution: Use wake_q_add() and queue the task for wakeup before
  *    the smp_store_release() that does ->state = STATE_READY.
  *
- * 2) Without proper _release/_acquire barriers, the woken up task
+ *
+ * 3) Without proper _release/_acquire barriers, the woken up task
  *    could read stale data
  *
  * Thread A
@@ -116,7 +134,7 @@ struct posix_msg_tree_node {
  *
  * Solution: use _release and _acquire barriers.
  *
- * 3) There is intentionally no barrier when setting current->state
+ * 4) There is intentionally no barrier when setting current->state
  *    to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
  *    release memory barrier, and the wakeup is triggered when holding
  *    info->lock, i.e. spin_lock(&info->lock) provided a pairing
@@ -1005,11 +1023,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
 				  struct ext_wait_queue *this)
 {
 	list_del(&this->list);
-	get_task_struct(this->task);
-
+	wake_q_add(wake_q, this->task);
 	/* see MQ_BARRIER for purpose/pairing */
 	smp_store_release(&this->state, STATE_READY);
-	wake_q_add_safe(wake_q, this->task);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
Hi Davidlohr,
On 5/4/21 11:53 PM, Davidlohr Bueso wrote:
> On 2021-05-04 08:55, Varad Gautam wrote:
>> do_mq_timedreceive calls wq_sleep with a stack local address. The
>> sender (do_mq_timedsend) uses this address to later call
>> pipelined_send.
>>
>> This leads to a very hard to trigger race where a do_mq_timedreceive call
>> might return and leave do_mq_timedsend to rely on an invalid address,
>> causing the following crash:
>>
>> [ 240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
>> [ 240.739991] Call Trace:
>> [ 240.739999] __x64_sys_mq_timedsend+0x2a9/0x490
>> [ 240.740003] ? auditd_test_task+0x38/0x40
>> [ 240.740007] ? auditd_test_task+0x38/0x40
>> [ 240.740011] do_syscall_64+0x80/0x680
>> [ 240.740017] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 240.740019] RIP: 0033:0x7f5928e40343
>>
>> The race occurs as:
>>
>> 1. do_mq_timedreceive calls wq_sleep with the address of
>> `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here)
>> - it holds a valid `struct ext_wait_queue *` as long as the stack has
>> not been overwritten.
>>
>> 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
>> do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
>> __pipelined_op.
>>
>> 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
>> Here is where the race window begins. (`this` is `ewq_addr`.)
>>
>> 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
>> will see `state == STATE_READY` and break. `ewq_addr` gets removed from
>> info->e_wait_q[RECV].list.
>>
>> 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
>> to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
>> stack. (Although the address may not get overwritten until another
>> function happens to touch it, which means it can persist around for an
>> indefinite time.)
>>
>> 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
>> `struct ext_wait_queue *`, and uses it to find a task_struct to pass
>> to the wake_q_add_safe call. In the lucky case where nothing has
>> overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
>> In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
>> bogus address as the receiver's task_struct causing the crash.
>>
>> do_mq_timedsend::__pipelined_op() should not dereference `this` after
>> setting STATE_READY, as the receiver counterpart is now free to return.
>> Change __pipelined_op to call wake_q_add_safe on the receiver's
>> task_struct returned by get_task_struct, instead of dereferencing
>> `this` which sits on the receiver's stack.
>>
>> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
>
> Right, historically we've always ensured that the waker does the state ready
> write as the last operation, with or without lockless wake_qs. And this commit
> broke this:
>
> @@ -923,16 +988,11 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
> struct ext_wait_queue *this)
> {
> list_del(&this->list);
> - wake_q_add(wake_q, this->task);
> - /*
> - * Rely on the implicit cmpxchg barrier from wake_q_add such
> - * that we can ensure that updating receiver->state is the last
> - * write operation: As once set, the receiver can continue,
> - * and if we don't have the reference count from the wake_q,
> - * yet, at that point we can later have a use-after-free
> - * condition and bogus wakeup.
> - */
> - this->state = STATE_READY;
> + get_task_struct(this->task);
> +
> + /* see MQ_BARRIER for purpose/pairing */
> + smp_store_release(&this->state, STATE_READY);
> + wake_q_add_safe(wake_q, this->task);
> }
>
> .. so while addressing the race against get_task_struct() vs wakee exit we ended
> up breaking the case where the task returns before the task is added to the wake_q
> (which actually we explicitly re-ordered :). So at this point we know that the
> ->state = STATE_READY must be done after the whole of the wake_q addition operation.
>
The race here really is about the lifetime of __pipelined_op's `this` argument only
being guaranteed for some duration of the call (ie, until someone sets
->state = STATE_READY). It is not about when wake_q addition happens, as long as it is
being fed a valid task_struct.
Commit c5b2cbdbdac5 (ipc/mqueue.c: update/document memory barriers) aims at the right
spot wrt. reordering, except for relying on the `this` arg to find a task_struct for
wake_q addition.
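To spell that out on the patched code (the function from the diff above,
with lifetime annotations added here for illustration):

static inline void __pipelined_op(struct wake_q_head *wake_q,
				  struct mqueue_inode_info *info,
				  struct ext_wait_queue *this)
{
	struct task_struct *t;

	list_del(&this->list);
	t = get_task_struct(this->task);	/* `this` is still valid here */

	/* see MQ_BARRIER for purpose/pairing */
	smp_store_release(&this->state, STATE_READY);

	/* past this point the receiver may return; `this` may be gone */
	wake_q_add_safe(wake_q, t);		/* uses the pinned task_struct */
}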
> Instead, how about the following which closes the race altogether and simplifies the
> code. This basically goes back to a correct version of fa6004ad4528
> (ipc/mqueue: Implement lockless pipelined wakeups). And by correct I mean keeping the
> smp_store_release() which ensures the proper wakeup semantics.
>
I considered that initially, but given that the race isn't connected with wakeup, I
preferred the current approach which makes this fact explicit by showing what's
valid/invalid during __pipelined_op.
Thanks,
Varad
> Thanks,
> Davidlohr
>
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 8031464ed4ae..43f0ae61c40b 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -76,14 +76,15 @@ struct posix_msg_tree_node {
> * acquiring info->lock.
> *
> * MQ_BARRIER:
> - * To achieve proper release/acquire memory barrier pairing, the state is set to
> - * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
> - * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
> + * To achieve proper release/acquire memory barrier pairing, the state is
> + * set to STATE_READY with smp_store_release() such that it is the last write
> + * operation, in which afterwards the blocked task can immediately return and
> + * exit. It is read with READ_ONCE followed by smp_acquire__after_ctrl_dep().
> *
> * This prevents the following races:
> *
> - * 1) With the simple wake_q_add(), the task could be gone already before
> - * the increase of the reference happens
> + * 1) With the wake_q_add(), the task could be gone already before
> + * the increase of the reference happens:
> * Thread A
> * Thread B
> * WRITE_ONCE(wait.state, STATE_NONE);
> @@ -97,10 +98,27 @@ struct posix_msg_tree_node {
> * sys_exit()
> * get_task_struct() // UaF
> *
> - * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
> + *
> + * 2) With the wake_q_add(), the waiter task could have returned from the
> + * syscall and overwritten it's task-allocated waiter before the sender
> + * can be added to the wake_q:
> + * Thread A
> + * Thread B
> + * WRITE_ONCE(wait.state, STATE_NONE);
> + * schedule_hrtimeout()
> + * ->state = STATE_READY
> + * <timeout returns>
> + * if (wait.state == STATE_READY) return;
> + * sysret to user space
> + * overwrite receiver's stack
> + * wake_q_add(A)
> + * if (cmpxchg()) // corrupted waiter
> + *
> + * Solution: Use wake_q_add() and queue the task for wakeup before
> * the smp_store_release() that does ->state = STATE_READY.
> *
> - * 2) Without proper _release/_acquire barriers, the woken up task
> + *
> + * 3) Without proper _release/_acquire barriers, the woken up task
> * could read stale data
> *
> * Thread A
> @@ -116,7 +134,7 @@ struct posix_msg_tree_node {
> *
> * Solution: use _release and _acquire barriers.
> *
> - * 3) There is intentionally no barrier when setting current->state
> + * 4) There is intentionally no barrier when setting current->state
> * to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
> * release memory barrier, and the wakeup is triggered when holding
> * info->lock, i.e. spin_lock(&info->lock) provided a pairing
> @@ -1005,11 +1023,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
> struct ext_wait_queue *this)
> {
> list_del(&this->list);
> - get_task_struct(this->task);
> -
> + wake_q_add(wake_q, this->task);
> /* see MQ_BARRIER for purpose/pairing */
> smp_store_release(&this->state, STATE_READY);
> - wake_q_add_safe(wake_q, this->task);
> }
>
> /* pipelined_send() - send a message directly to the task waiting in
>
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
HRB 36809, AG Nürnberg
Geschäftsführer: Felix Imendörffer
On 5/5/21 5:11 PM, Davidlohr Bueso wrote:
> On 2021-05-05 00:49, Varad Gautam wrote:
>> The race here really is about the lifetime of __pipelined_op's `this` argument only
>> being guaranteed for some duration of the call (ie, until someone sets
>> ->state = STATE_READY). It is not about when wake_q addition happens, as long as it is
>> being fed a valid task_struct.
>
> Again, it's all about ensuring that STATE_READY is set last, the blocked
> task has no business returning in the first place, making both races (exit and
> the one reported here) similar by ending up using bogus memory.
>
> ...
>
>> I considered that initially, but given that the race isn't connected with wakeup, I
>> preferred the current approach which makes this fact explicit by showing what's
>> valid/invalid during __pipelined_op.
>
> I understand your point, but this is why I updated the ordering comments. Furthermore
> there is no reason to decouple the task's refcount with the wake_q_add operation, it
> just makes the code weird and harder to follow.
>
Different tastes I guess - I'd avoid an additional case to account for in the
MQ_BARRIER comment block and have the __pipelined_op code describe itself,
which avoids some back and forth while reading.
If you're still unconvinced, I'll send out a v2 w/ wake_q_add called before
smp_store_release.
Thanks,
Varad
> Thanks,
> Davidlohr
>
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
HRB 36809, AG Nürnberg
Geschäftsführer: Felix Imendörffer
On 2021-05-05 09:24, Davidlohr Bueso wrote:
> On 2021-05-05 08:36, Varad Gautam wrote:
>> If you're still unconvinced, I'll send out a v2 w/ wake_q_add called before
>> smp_store_release.
>
> Yeah, please send a v2, I believe this is the right way to fix the issue.
Also, it would be good to add:
Cc: <[email protected]> # 5.6
Thanks,
Davidlohr
On 2021-05-05 08:36, Varad Gautam wrote:
> If you're still unconvinced, I'll send out a v2 w/ wake_q_add called before
> smp_store_release.

Yeah, please send a v2, I believe this is the right way to fix the issue.
Thanks,
Davidlohr
On 2021-05-05 00:49, Varad Gautam wrote:
> The race here really is about the lifetime of __pipelined_op's `this` argument only
> being guaranteed for some duration of the call (ie, until someone sets
> ->state = STATE_READY). It is not about when wake_q addition happens, as long as it is
> being fed a valid task_struct.
Again, it's all about ensuring that STATE_READY is set last, the blocked
task has no business returning in the first place, making both races (exit and
the one reported here) similar by ending up using bogus memory.
...
> I considered that initially, but given that the race isn't connected with wakeup, I
> preferred the current approach which makes this fact explicit by showing what's
> valid/invalid during __pipelined_op.
I understand your point, but this is why I updated the ordering comments. Furthermore
there is no reason to decouple the task's refcount with the wake_q_add operation, it
just makes the code weird and harder to follow.
Thanks,
Davidlohr
Hi Varad,
On 5/4/21 5:55 PM, Varad Gautam wrote:
> do_mq_timedsend::__pipelined_op() should not dereference `this` after
> setting STATE_READY, as the receiver counterpart is now free to return.
> Change __pipelined_op to call wake_q_add_safe on the receiver's
> task_struct returned by get_task_struct, instead of dereferencing
> `this` which sits on the receiver's stack.
Correct. I was so concentrated on the risks of reordered memory that I
overlooked the simple bug.
> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Actually, sem.c and msg.c contain the same bug. Thus all three must be fixed.
> Signed-off-by: Varad Gautam <[email protected]>
> Reported-by: Matthias von Faber <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Manfred Spraul <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
>
> ---
> ipc/mqueue.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 8031464ed4ae2..8f78057c6be53 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -1004,12 +1004,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
> struct mqueue_inode_info *info,
> struct ext_wait_queue *this)
> {
> + struct task_struct *t;
> +
> list_del(&this->list);
> - get_task_struct(this->task);
> + t = get_task_struct(this->task);
>
> /* see MQ_BARRIER for purpose/pairing */
> smp_store_release(&this->state, STATE_READY);
> - wake_q_add_safe(wake_q, this->task);
> + wake_q_add_safe(wake_q, t);
> }
The change fixes the issue, but I would prefer to use t = this->task
instead of using the return value of get_task_struct():
Then all wake_q_add_safe() users are identical.
Ok for you?
Slightly tested patch attached.
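(For reference, the shape suggested above would look something like this;
reconstructed from the description, since the attached patch itself is not
reproduced here:

static inline void __pipelined_op(struct wake_q_head *wake_q,
				  struct mqueue_inode_info *info,
				  struct ext_wait_queue *this)
{
	struct task_struct *task = this->task;

	list_del(&this->list);
	get_task_struct(task);

	/* see MQ_BARRIER for purpose/pairing */
	smp_store_release(&this->state, STATE_READY);
	wake_q_add_safe(wake_q, task);
}

This way every wake_q_add_safe() call site reads the same.)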
--
Manfred
Hi Davidlohr,
On 5/4/21 11:53 PM, Davidlohr Bueso wrote:
> @@ -1005,11 +1023,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>  				  struct ext_wait_queue *this)
>  {
>  	list_del(&this->list);
> -	get_task_struct(this->task);
> -
> +	wake_q_add(wake_q, this->task);
>  	/* see MQ_BARRIER for purpose/pairing */
>  	smp_store_release(&this->state, STATE_READY);
> -	wake_q_add_safe(wake_q, this->task);
>  }
> 
> /* pipelined_send() - send a message directly to the task waiting in
This can result in lost wakeups:
wake_q_add() wakes up this->task, and the task reads this->state before
the smp_store_release() writes STATE_READY.
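One interleaving of that sort, sketched in the style of the MQ_BARRIER
comment (my reconstruction; the "already queued" detail is one example of
how the early wakeup can happen):

 Sender					Receiver (wq_sleep)
 wake_q_add(wake_q, this->task)
   // no-op if the task is already
   // queued on another wake_q; that
   // waker delivers the wakeup now
					READ_ONCE(ewp->state)
					// != STATE_READY, so it
					// goes back to sleep
 smp_store_release(&this->state,
		   STATE_READY)
   // no further wakeup is queued:
   // the wakeup is lost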
I would really prefer if we make kernel/futex.c and ipc/whatever as
similar as possible:
The futex slow path, ipc/sem.c, ipc/msg.c and ipc/mqueue.c are all the
same, thus the code should be the same as well.
[a task goes to sleep, another wakes it up, the woken up task doesn't
acquire any locks, and the wake_q framework is used]
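For reference, the waiter side that both proposals must pair with looks
roughly like this (condensed from ipc/mqueue.c's wq_sleep(); signal and
timeout handling omitted):

static int wq_sleep(struct mqueue_inode_info *info, int sr,
		    ktime_t *timeout, struct ext_wait_queue *ewp)
{
	wq_add(info, sr, ewp);

	for (;;) {
		/* no barrier needed, we hold info->lock */
		__set_current_state(TASK_INTERRUPTIBLE);

		spin_unlock(&info->lock);
		schedule_hrtimeout_range_clock(timeout, 0, HRTIMER_MODE_ABS,
					       CLOCK_REALTIME);

		/* lockless exit path: no locks are taken after the wakeup */
		if (READ_ONCE(ewp->state) == STATE_READY) {
			/* see MQ_BARRIER for purpose/pairing */
			smp_acquire__after_ctrl_dep();
			return 0;
		}
		spin_lock(&info->lock);
		/* recheck state under the lock, then signals/timeout ... */
	}
}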
--
Manfred
Hey Manfred,
On 5/8/21 7:12 PM, Manfred Spraul wrote:
> Hi Varad,
>
> On 5/4/21 5:55 PM, Varad Gautam wrote:
>> do_mq_timedsend::__pipelined_op() should not dereference `this` after
>> setting STATE_READY, as the receiver counterpart is now free to return.
>> Change __pipelined_op to call wake_q_add_safe on the receiver's
>> task_struct returned by get_task_struct, instead of dereferencing
>> `this` which sits on the receiver's stack.
> Correct. I was so concentrated on the risks of reordered memory that I overlooked the simple bug.
>> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
> Actually, sem.c and msg.c contain the same bug. Thus all three must be fixed.
You're right, it's the same usage pattern.
>> Signed-off-by: Varad Gautam <[email protected]>
>> Reported-by: Matthias von Faber <[email protected]>
>> Cc: Christian Brauner <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: "Eric W. Biederman" <[email protected]>
>> Cc: Manfred Spraul <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Davidlohr Bueso <[email protected]>
>>
>> ---
>> ipc/mqueue.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
>> index 8031464ed4ae2..8f78057c6be53 100644
>> --- a/ipc/mqueue.c
>> +++ b/ipc/mqueue.c
>> @@ -1004,12 +1004,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>> struct mqueue_inode_info *info,
>> struct ext_wait_queue *this)
>> {
>> + struct task_struct *t;
>> +
>> list_del(&this->list);
>> - get_task_struct(this->task);
>> + t = get_task_struct(this->task);
>> /* see MQ_BARRIER for purpose/pairing */
>> smp_store_release(&this->state, STATE_READY);
>> - wake_q_add_safe(wake_q, this->task);
>> + wake_q_add_safe(wake_q, t);
>> }
>
> The change fixes the issue, but I would prefer to use t = this->task instead of using the return value of get_task_struct():
> Then all wake_q_add_safe() users are identical.
>
> Ok for you?
>
> Slightly tested patch attached.
Thanks, I've sent out a v4 at [1] integrating sem.c/msg.c. Note that I went with
context-local naming and used what get_task_struct returns as I don't see much
difference either way.
[1] https://lore.kernel.org/lkml/[email protected]/
Thanks,
Varad
>
> --
>
> Manfred
>
--
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
HRB 36809, AG Nürnberg
Geschäftsführer: Felix Imendörffer