2022-08-31 19:07:22

by Schspa Shi

[permalink] [raw]
Subject: [PATCH] p9: trans_fd: Fix deadlock when connection cancel

There is a deadlock condition when connection canceled.

Please refer to the following scenarios.
task 0 task1
------------------------------------------------------------------
p9_client_rpc
req = p9_client_prepare_req(c, type, c->msize, fmt, ap);
// refcount = 2
err = c->trans_mod->request(c, req);
// req was added to unsent_req_list
wait_event_killable(req->wq, req->status >= REQ_STATUS_RCVD);

p9_read_work
// IO error happen
error:
p9_conn_cancel(m, err);
spin_lock(&m->client->lock);
// hold client->lock now
p9_client_cb
req->status = REQ_STATUS_ERROR;
wake_up(&req->wq);
// task 0 wakeup
<< preempted >>

reterr:
p9_req_put(c, req);
// refcount = 1 now
<< got scheduled >>
p9_req_put
// refcount = 0
p9_tag_remove(c, r);
spin_lock_irqsave(&c->lock, flags);
------------- deadlock -------------

[ 651.564169] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 651.564176] rcu: 3-...0: (8 ticks this GP) idle=40b4/1/0x4000000000000000 softirq=1289/1290 fqs=83762
[ 651.564185] (detected by 2, t=420047 jiffies, g=1601, q=992 ncpus=4)
[ 651.564190] Sending NMI from CPU 2 to CPUs 3:
[ 651.539301] NMI backtrace for cpu 3
[ 651.539301] CPU: 3 PID: 46 Comm: kworker/3:1 Not tainted 6.0.0-rc2-rt3-00493-g2af9a9504166 #3
[ 651.539301] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 651.539301] Workqueue: events p9_read_work
[ 651.539301] RIP: 0010:queued_spin_lock_slowpath+0xfc/0x590
[ 651.539301] Code: 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 a5 04 00 00 48 81 c4 88 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc0
[ 651.539301] RSP: 0018:ffff888002987ad8 EFLAGS: 00000002
[ 651.539301] RAX: 0000000000000000 RBX: 0000000000000001 RCX: dffffc0000000000
[ 651.539301] RDX: 0000000000000003 RSI: 0000000000000004 RDI: ffff888004adf600
[ 651.539301] RBP: ffff888004adf600 R08: ffffffff81d341a0 R09: ffff888004adf603
[ 651.539301] R10: ffffed100095bec0 R11: 0000000000000001 R12: 0000000000000001
[ 651.539301] R13: 1ffff11000530f5c R14: ffff888004adf600 R15: ffff888002987c38
[ 651.539301] FS: 0000000000000000(0000) GS:ffff888036580000(0000) knlGS:0000000000000000
[ 651.539301] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 651.539301] CR2: 00007fe012d608dc CR3: 000000000bc16000 CR4: 0000000000350ee0
[ 651.539301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 651.539301] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 651.539301] Call Trace:
[ 651.539301] <TASK>
[ 651.539301] ? osq_unlock+0x100/0x100
[ 651.539301] ? ret_from_fork+0x1f/0x30
[ 651.539301] do_raw_spin_lock+0x196/0x1a0
[ 651.539301] ? spin_bug+0x90/0x90
[ 651.539301] ? do_raw_spin_lock+0x114/0x1a0
[ 651.539301] _raw_spin_lock_irqsave+0x1c/0x30
[ 651.539301] p9_req_put+0x61/0x130
[ 651.539301] p9_conn_cancel+0x321/0x3b0
[ 651.539301] ? p9_conn_create+0x1f0/0x1f0
[ 651.539301] p9_read_work+0x207/0x7d0
[ 651.539301] ? p9_fd_create+0x1d0/0x1d0
[ 651.539301] ? spin_bug+0x90/0x90
[ 651.539301] ? read_word_at_a_time+0xe/0x20
[ 651.539301] process_one_work+0x420/0x720
[ 651.539301] worker_thread+0x2b9/0x700
[ 651.539301] ? rescuer_thread+0x620/0x620
[ 651.539301] kthread+0x176/0x1b0
[ 651.539301] ? kthread_complete_and_exit+0x20/0x20
[ 651.539301] ret_from_fork+0x1f/0x30

To fix it, we can add extra reference counter to avoid deadlock, and
decrease it after we unlock the client->lock.

Fixes: 67dd8e445ee0 ("9p: roll p9_tag_remove into p9_req_put")

Signed-off-by: Schspa Shi <[email protected]>
---
net/9p/trans_fd.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index e758978b44bee..2e4e039b38e3e 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -205,14 +205,19 @@ static void p9_conn_cancel(struct p9_conn *m, int err)
list_move(&req->req_list, &cancel_list);
}

- list_for_each_entry_safe(req, rtmp, &cancel_list, req_list) {
+ list_for_each_entry(req, &cancel_list, req_list) {
p9_debug(P9_DEBUG_ERROR, "call back req %p\n", req);
- list_del(&req->req_list);
if (!req->t_err)
req->t_err = err;
+ p9_req_get(req);
p9_client_cb(m->client, req, REQ_STATUS_ERROR);
}
spin_unlock(&m->client->lock);
+
+ list_for_each_entry_safe(req, rtmp, &cancel_list, req_list) {
+ list_del(&req->req_list);
+ p9_req_put(m->client, req);
+ }
}

static __poll_t
--
2.37.2


2022-08-31 21:11:57

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] p9: trans_fd: Fix deadlock when connection cancel

Schspa Shi wrote on Thu, Sep 01, 2022 at 02:09:50AM +0800:
> To fix it, we can add extra reference counter to avoid deadlock, and
> decrease it after we unlock the client->lock.

Thanks for the patch!

Unfortunately I already sent a slightly different version to the list,
hidden in another syzbot thread, here:
https://lkml.kernel.org/r/[email protected]

(yes, sorry, not exactly somewhere I'd expect someone to find it... 9p
hasn't had many contributors recently)


Basically instead of taking an extra lock I just released the client
lock before calling p9_client_cb, so it shouldn't hang anymore.

We don't need the lock to call the cb as in p9_conn_cancel we already
won't accept any new request and by this point the requests are in a
local list that isn't shared anywhere.

If you have a test setup, would you mind testing my patch?
That's the main reason I was delaying pushing it.

Since you went out of your way to make this patch if you agree with my
approach I don't mind adding your sign off or another mark of having
worked on it.

Thank you,
--
Dominique

2022-09-01 03:08:52

by Schspa Shi

[permalink] [raw]
Subject: Re: [PATCH] p9: trans_fd: Fix deadlock when connection cancel


[email protected] writes:

> Schspa Shi wrote on Thu, Sep 01, 2022 at 02:09:50AM +0800:
>> To fix it, we can add extra reference counter to avoid deadlock, and
>> decrease it after we unlock the client->lock.
>
> Thanks for the patch!
>
> Unfortunately I already sent a slightly different version to the list,
> hidden in another syzbot thread, here:
> https://lkml.kernel.org/r/[email protected]
>
> (yes, sorry, not exactly somewhere I'd expect someone to find it... 9p
> hasn't had many contributors recently)
>
>
> Basically instead of taking an extra lock I just released the client
> lock before calling p9_client_cb, so it shouldn't hang anymore.
>
> We don't need the lock to call the cb as in p9_conn_cancel we already
> won't accept any new request and by this point the requests are in a
> local list that isn't shared anywhere.
>

Ok, thank you for pointing that out.

> If you have a test setup, would you mind testing my patch?
> That's the main reason I was delaying pushing it.
>

I have test it with my enviroment, it not hang anymore.

> Since you went out of your way to make this patch if you agree with my
> approach I don't mind adding your sign off or another mark of having
> worked on it.
>
> Thank you,

--
BRs
Schspa Shi

2022-09-01 15:46:03

by Christian Schoenebeck

[permalink] [raw]
Subject: Re: [PATCH] p9: trans_fd: Fix deadlock when connection cancel

On Donnerstag, 1. September 2022 04:55:36 CEST Schspa Shi wrote:
> [email protected] writes:
> > Schspa Shi wrote on Thu, Sep 01, 2022 at 02:09:50AM +0800:
> >> To fix it, we can add extra reference counter to avoid deadlock, and
> >> decrease it after we unlock the client->lock.
> >
> > Thanks for the patch!
> >
> > Unfortunately I already sent a slightly different version to the list,
> > hidden in another syzbot thread, here:
> > https://lkml.kernel.org/r/[email protected]
> >
> > (yes, sorry, not exactly somewhere I'd expect someone to find it... 9p
> > hasn't had many contributors recently)
> >
> >
> > Basically instead of taking an extra lock I just released the client
> > lock before calling p9_client_cb, so it shouldn't hang anymore.
> >
> > We don't need the lock to call the cb as in p9_conn_cancel we already
> > won't accept any new request and by this point the requests are in a
> > local list that isn't shared anywhere.
>
> Ok, thank you for pointing that out.
>
> > If you have a test setup, would you mind testing my patch?
> > That's the main reason I was delaying pushing it.
>
> I have test it with my enviroment, it not hang anymore.

Are you fine with that Dominique, or do you want me to test your linked patch
as well?

You can also explicitly tell me if you need something to be reviewed/tested.

> > Since you went out of your way to make this patch if you agree with my
> > approach I don't mind adding your sign off or another mark of having
> > worked on it.
> >
> > Thank you,




2022-09-04 07:28:53

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] p9: trans_fd: Fix deadlock when connection cancel

Christian Schoenebeck wrote on Thu, Sep 01, 2022 at 05:27:53PM +0200:
> > > If you have a test setup, would you mind testing my patch?
> > > That's the main reason I was delaying pushing it.
> >
> > I have test it with my enviroment, it not hang anymore.
>
> Are you fine with that Dominique, or do you want me to test your linked patch
> as well?
>
> You can also explicitly tell me if you need something to be reviewed/tested.

I've just resent both patches properly; that should be better for
everyone. It can't hurt to get more tests :)

I don't think we'll catch anything with Tetsuo Handa's other two fixes
as we don't really test trans_fd all that much, so I'll give it a spin
with ganesha on my end when I can find time.

Thanks!
--
Dominiquem