LinuxLists.cc - [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read

2020-06-11 01:59:25

Subject: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work

p9_read_work and p9_fd_cancelled may be called concurrently.
Before list_del(&m->rreq->req_list) in p9_read_work is called,
the req->req_list may have been deleted in p9_fd_cancelled.
We can fix it by setting req->status to REQ_STATUS_FLSHD after
list_del(&req->req_list) in p9_fd_cancelled.

Before list_del(&req->req_list) in p9_fd_cancelled is called,
the req->req_list may have been deleted in p9_read_work.
We should return when req->status == REQ_STATUS_RCVD, which means
we just received a response for oldreq, so we need to do nothing
in p9_fd_cancelled.

Fixes: 60ff779c4abb ("9p: client: remove unused code and any reference to "cancelled" function")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wang Hai <[email protected]>
---
net/9p/trans_fd.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index f868cf6fba79..a563699629cb 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -718,11 +718,18 @@ static int p9_fd_cancelled(struct p9_client *client, struct p9_req_t *req)
 {
 	p9_debug(P9_DEBUG_TRANS, "client %p req %p\n", client, req);

-	/* we haven't received a response for oldreq,
-	 * remove it from the list.
+	/* If req->status == REQ_STATUS_RCVD, it means we just received a
+	 * response for oldreq, we need do nothing here. Else, remove it from
+	 * the list.
 	 */
 	spin_lock(&client->lock);
+	if (req->status == REQ_STATUS_RCVD) {
+		spin_unlock(&client->lock);
+		return 0;
+	}
+
 	list_del(&req->req_list);
+	req->status = REQ_STATUS_FLSHD;
 	spin_unlock(&client->lock);
 	p9_req_put(req);

--
2.17.1
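
The two orderings described in the commit message can be checked with a small userspace model. This is only a sketch under stated assumptions: the `model_*` names are hypothetical, each function body stands for one critical section that the real code runs under client->lock, and the assert in `model_list_del` stands in for the fault that a second list_del() on the same entry would cause.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model: the names echo the kernel's, but none of
 * this is kernel code. Each function body stands for one critical section
 * that the real code runs under client->lock.
 */
enum model_status { MODEL_SENT, MODEL_RCVD, MODEL_FLSHD };

struct model_req {
	enum model_status status;
	bool on_list;	/* stands in for membership in req_list */
};

/* Unlinking an entry twice is the bug; the assert stands in for the fault. */
static void model_list_del(struct model_req *req)
{
	assert(req->on_list);
	req->on_list = false;
}

/* Read side: deliver the reply only if the request is still pending.
 * Returns true when the reply was delivered.
 */
static bool model_read_reply(struct model_req *req)
{
	if (req->status != MODEL_SENT)
		return false;
	model_list_del(req);
	req->status = MODEL_RCVD;
	return true;
}

/* Cancel side, following the patch: skip the unlink if the reply already
 * arrived, otherwise unlink and mark the request flushed so that the read
 * side no longer treats it as pending.
 */
static void model_cancelled(struct model_req *req)
{
	if (req->status == MODEL_RCVD)
		return;
	model_list_del(req);
	req->status = MODEL_FLSHD;
}
```

Running the read side first and then the cancel side, or the reverse, unlinks the entry exactly once in this model. Dropping either the status check or the status update in `model_cancelled` makes one of the two orderings trip the assert.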

2020-06-11 14:55:35

by Dominique Martinet

Subject: Re: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work

Wang Hai wrote on Thu, Jun 11, 2020:
> p9_read_work and p9_fd_cancelled may be called concurrently.

Good catch. I'm sure this fixes some of the old syzbot bugs...
I'll check other transports handle this properly as well.

> Before list_del(&m->rreq->req_list) in p9_read_work is called,
> the req->req_list may have been deleted in p9_fd_cancelled.
> We can fix it by setting req->status to REQ_STATUS_FLSHD after
> list_del(&req->req_list) in p9_fd_cancelled.

hm if you do that read_work will fail with EIO and all further 9p
messages will not be read?
p9_read_work probably should handle REQ_STATUS_FLSHD in a special case
that just throws the message away without error as well.

> Before list_del(&req->req_list) in p9_fd_cancelled is called,
> the req->req_list may have been deleted in p9_read_work.
> We should return when req->status == REQ_STATUS_RCVD, which means
> we just received a response for oldreq, so we need to do nothing
> in p9_fd_cancelled.

I'll need some time to convince myself the refcounting is correct in
this case.
Pre-ref counting this definitely was wrong, but now it might just work
by chance.... I'll double-check.

> Fixes: 60ff779c4abb ("9p: client: remove unused code and any reference
> to "cancelled" function")

I don't understand how this commit is related?
At least make it afd8d65411 ("9P: Add cancelled() to the transport
functions.") which adds the op, not something that removed a previous
version of cancelled even earlier.

> diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> index f868cf6fba79..a563699629cb 100644
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -718,11 +718,18 @@ static int p9_fd_cancelled(struct p9_client *client, struct p9_req_t *req)
>  {
>  	p9_debug(P9_DEBUG_TRANS, "client %p req %p\n", client, req);
>
> -	/* we haven't received a response for oldreq,
> -	 * remove it from the list.
> +	/* If req->status == REQ_STATUS_RCVD, it means we just received a
> +	 * response for oldreq, we need do nothing here. Else, remove it from
> +	 * the list.

(nitpick) this feels a bit hard to read, and does not give any
information: you're just paraphrasing the C code.

I would suggest moving the comment after the spinlock and say what we
really do ; something as simple as "ignore cancelled request if message
has been received before lock" is enough.

>  	 */
>  	spin_lock(&client->lock);
> +	if (req->status == REQ_STATUS_RCVD) {
> +		spin_unlock(&client->lock);
> +		return 0;
> +	}
> +
>  	list_del(&req->req_list);
> +	req->status = REQ_STATUS_FLSHD;
>  	spin_unlock(&client->lock);
>  	p9_req_put(req);
>
--
Dominique

2020-06-12 06:49:04

by Dominique Martinet

Subject: Re: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work

wanghai (M) wrote on Fri, Jun 12, 2020:
> You are right, I got a syzkaller bug.
>
> "p9_read_work+0x7c3/0xd90" points to list_del(&m->rreq->req_list);
>
> [ 62.733598] kasan: CONFIG_KASAN_INLINE enabled
> [ 62.734484] kasan: GPF could be caused by NULL-ptr deref or user memory access
> [ 62.735670] general protection fault: 0000 [#1] SMP KASAN PTI
> [ 62.736577] CPU: 3 PID: 82 Comm: kworker/3:1 Not tainted 4.19.124+ #2
> [ 62.737582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [ 62.738988] Workqueue: events p9_read_work
> [ 62.739642] RIP: 0010:p9_read_work+0x7c3/0xd90
> [ 62.740348] Code: 48 c1 e9 03 80 3c 01 00 0f 85 cb 05 00 00 48 8d 7a 08 48 b9 00 00 00 00 00 fc ff df 49 8b 87 b8 00 00 00 48 89 fe 48 c1 ee 03 <80> 3c 0e 00 0f 85 89 05 00 00 48 89 c6 48 b9 00 00 00 00 00 fc ff
> [ 62.743236] RSP: 0018:ffff8883ece17ca0 EFLAGS: 00010a06
> [ 62.744059] RAX: dead000000000200 RBX: ffff8883d45666b0 RCX: dffffc0000000000
> [ 62.745173] RDX: dead000000000100 RSI: 1bd5a00000000021 RDI: dead000000000108
> [ 62.746279] RBP: ffff8883d4566590 R08: ffffed107a8acf31 R09: ffffed107a8acf31
> [ 62.747398] R10: 0000000000000001 R11: ffffed107a8acf30 R12: 1ffff1107d9c2f9b
> [ 62.748505] R13: ffff8883d45665d0 R14: ffff8883d4566608 R15: ffff8883e1f1c000
> [ 62.749615] FS: 0000000000000000(0000) GS:ffff8883ef180000(0000) knlGS:0000000000000000
> [ 62.750881] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 62.751784] CR2: 0000000000000000 CR3: 000000009c622003 CR4: 00000000007606e0
> [ 62.752898] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 62.754011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 62.755126] PKRU: 55555554
> [ 62.755561] Call Trace:
> [ 62.755963] ? p9_write_work+0xa00/0xa00
> [ 62.756592] process_one_work+0xae4/0x1b20
> [ 62.757252] ? apply_wqattrs_commit+0x3e0/0x3e0
> [ 62.757985] worker_thread+0x8c/0xe80
> [ 62.758600] ? __kthread_parkme+0xe9/0x190
> [ 62.759254] ? process_one_work+0x1b20/0x1b20
> [ 62.759950] kthread+0x341/0x410
> [ 62.760479] ? kthread_create_worker_on_cpu+0xf0/0xf0
> [ 62.761296] ret_from_fork+0x3a/0x50
> [ 62.761874] Modules linked in:
> [ 62.762378] Dumping ftrace buffer:
> [ 62.762942] (ftrace buffer empty)
> [ 62.763547] ---[ end trace 69672816613947a3 ]---

This looks like:
https://syzkaller.appspot.com/bug?id=5df4f85d764ee89863d0294b4e0c87ef2fd2c624
I'm not sure how active this still is but please also add this
Reported-by tag:
Reported-by: [email protected]

(can keep both)

> Yes, in this case all further 9p messages will not be read.
> >p9_read_work probably should handle REQ_STATUS_FLSHD in a special case
> >that just throws the message away without error as well.
>
> Can it be solved like this?
>
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -362,7 +362,7 @@ static void p9_read_work(struct work_struct *work)
>                 if (m->rreq->status == REQ_STATUS_SENT) {
>                         list_del(&m->rreq->req_list);
>                         p9_client_cb(m->client, m->rreq, REQ_STATUS_RCVD);
> -               } else {
> +               } else if (m->rreq->status != REQ_STATUS_FLSHD) {
>                         spin_unlock(&m->client->lock);
>                         p9_debug(P9_DEBUG_ERROR,
>                                  "Request tag %d errored out while we were reading the reply\n",

Yes that is probably correct.
Please add a comment above saying we ignore replies associated with a
cancelled request.
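
As a rough illustration of the three outcomes the read side would then have, here is a standalone sketch. The `sketch_*` and `REPLY_*` names are hypothetical and are not taken from trans_fd.c; only the ordering of the checks follows the diff quoted above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the request states that matter here. */
enum sketch_status { SKETCH_SENT, SKETCH_FLSHD, SKETCH_ERROR };

/* What the read side does with a reply that has just been read. */
enum reply_action { REPLY_DELIVER, REPLY_DROP, REPLY_FAIL };

static enum reply_action sketch_classify_reply(enum sketch_status status)
{
	switch (status) {
	case SKETCH_SENT:
		/* Still pending: unlink the request and hand the reply over. */
		return REPLY_DELIVER;
	case SKETCH_FLSHD:
		/* Ignore replies associated with a cancelled request. */
		return REPLY_DROP;
	default:
		/* Any other state is unexpected and takes the error path. */
		return REPLY_FAIL;
	}
}

/* Usage example: one check per outcome. */
static void sketch_classify_reply_example(void)
{
	assert(sketch_classify_reply(SKETCH_SENT) == REPLY_DELIVER);
	assert(sketch_classify_reply(SKETCH_FLSHD) == REPLY_DROP);
	assert(sketch_classify_reply(SKETCH_ERROR) == REPLY_FAIL);
}
```

The point of the middle case is that a late reply to a flushed request is dropped without putting the connection into the error path, so later messages are still read.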

> This patch "afd8d65411" just moved list_del into cancelled ops. It
> is not actually the initial patch that caused the bug
>
> In 60ff779c4abb ("9p: client: remove unused code and any reference
> to "cancelled" function")
>
> It moved spin_lock under "if (oldreq->status == REQ_STATUS_FLSH)" .
>
> After "if (oldreq->status == REQ_STATUS_FLSH)", oldreq may be
> changed by other thread.

Ok, thank you for explaining; I agree now.
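
The window described above (status tested first, lock taken afterwards) can be shown with a generic model. This is a sketch only, not the code from either commit: the `toctou_*` names are hypothetical, and the `other` callback simulates the second thread running at the point where the real race would occur.

```c
#include <assert.h>
#include <stddef.h>

enum toctou_status { T_PENDING, T_RCVD };

struct toctou_req {
	enum toctou_status status;
	int unlinks;	/* list_del-style removals; more than one is the bug */
};

typedef void (*other_thread_fn)(struct toctou_req *);

/* The other thread: a reply arrives, the entry is unlinked and marked. */
static void toctou_reply_arrives(struct toctou_req *req)
{
	req->status = T_RCVD;
	req->unlinks++;
}

/* Racy shape: the status is tested before the lock is taken. */
static void toctou_cancel_check_outside(struct toctou_req *req,
					other_thread_fn other)
{
	if (req->status != T_PENDING)
		return;
	if (other)
		other(req);	/* window between the check and the lock */
	/* the lock would be taken here */
	req->unlinks++;
	/* the lock would be released here */
}

/* Safer shape: the status is tested after the lock is taken. */
static void toctou_cancel_check_inside(struct toctou_req *req,
				       other_thread_fn other)
{
	if (other)
		other(req);	/* the other thread got the lock first */
	/* the lock would be taken here */
	if (req->status == T_RCVD)
		return;		/* nothing left to unlink */
	req->unlinks++;
	/* the lock would be released here */
}

/* Usage example: same interleaving, two different results. */
static void toctou_example(void)
{
	struct toctou_req racy = { T_PENDING, 0 };
	struct toctou_req safe = { T_PENDING, 0 };

	toctou_cancel_check_outside(&racy, toctou_reply_arrives);
	toctou_cancel_check_inside(&safe, toctou_reply_arrives);
	assert(racy.unlinks == 2);	/* double removal */
	assert(safe.unlinks == 1);	/* single removal */
}
```

With the check outside the lock the entry is removed twice in this model; with the check inside it is removed once, whichever side runs first.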

--
Dominique