LinuxLists.cc - [RFC] Bluetooth: Use flush_work instead of cancel

2012-01-13 12:16:50

Subject: [RFC] Bluetooth: Use flush_work instead of cancel_work

From: Andrei Emeltchenko <[email protected]>

Fix deadlock when cancelling delayed work.

[ 584.676126] ======================================================
[ 584.676126] [ INFO: possible circular locking dependency detected ]
[ 584.676126] 3.2.0-rc2niko+ #44
[ 584.676126] -------------------------------------------------------
[ 584.676126] kworker/u:1/30 is trying to acquire lock:
[ 584.676126] (&hdev->lock){+.+.+.}, at: [<f81f001c>] hci_conn_timeout+0x6c/0x190 [bluetooth]
[ 584.676126]
[ 584.676126] but task is already holding lock:
[ 584.676126] ((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460
[ 584.676126]
[ 584.676126] which lock already depends on the new lock.
[ 584.676126]
[ 584.676126]
[ 584.676126] the existing dependency chain (in reverse order) is:
[ 584.676126]
[ 584.676126] -> #1 ((&(&conn->disc_work)->work)){+.+...}:
[ 584.676126] [<c1086748>] lock_acquire+0x88/0x110
[ 584.676126] [<c1066041>] wait_on_work+0x61/0x210
[ 584.676126] [<c106630a>] __cancel_work_timer+0x6a/0x110
[ 584.676126] [<c10663c0>] cancel_delayed_work_sync+0x10/0x20
[ 584.676126] [<f81f935b>] hci_event_packet+0x3b2b/0x4610 [bluetooth]
[ 584.676126] [<f81ea78e>] hci_rx_work+0x20e/0x4c0 [bluetooth]
[ 584.676126] [<c1065aec>] process_one_work+0x17c/0x460
[ 584.676126] [<c10672e4>] worker_thread+0x124/0x2c0
[ 584.676126] [<c106be44>] kthread+0x84/0x90
[ 584.676126] [<c1567f42>] kernel_thread_helper+0x6/0x10
[ 584.676126]
[ 584.676126] -> #0 (&hdev->lock){+.+.+.}:
[ 584.676126] [<c10852cd>] __lock_acquire+0xc0d/0x1ab0
[ 584.676126] [<c1086748>] lock_acquire+0x88/0x110
[ 584.676126] [<c155de50>] mutex_lock_nested+0x70/0x320
[ 584.676126] [<f81f001c>] hci_conn_timeout+0x6c/0x190 [bluetooth]
[ 584.676126] [<c1065aec>] process_one_work+0x17c/0x460
[ 584.676126] [<c10672e4>] worker_thread+0x124/0x2c0
[ 584.676126] [<c106be44>] kthread+0x84/0x90
[ 584.676126] [<c1567f42>] kernel_thread_helper+0x6/0x10
[ 584.676126]
[ 584.676126] other info that might help us debug this:
[ 584.676126]
[ 584.676126] Possible unsafe locking scenario:
[ 584.676126]
[ 584.676126] CPU0 CPU1
[ 584.676126] ---- ----
[ 584.676126] lock((&(&conn->disc_work)->work));
[ 584.676126] lock(&hdev->lock);
[ 584.676126] lock((&(&conn->disc_work)->work));
[ 584.676126] lock(&hdev->lock);
[ 584.676126]
[ 584.676126] *** DEADLOCK ***
[ 584.676126]
[ 584.676126] 2 locks held by kworker/u:1/30:
[ 584.676126] #0: (hdev->name){.+.+.+}, at: [<c1065a78>] process_one_work+0x108/0x460
[ 584.676126] #1: ((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460

Signed-off-by: Andrei Emeltchenko <[email protected]>
---
net/bluetooth/hci_conn.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
index 3db4324..aa33fa2 100644
--- a/net/bluetooth/hci_conn.c
+++ b/net/bluetooth/hci_conn.c
@@ -437,7 +437,7 @@ int hci_conn_del(struct hci_conn *conn)

del_timer(&conn->idle_timer);

- cancel_delayed_work_sync(&conn->disc_work);
+ flush_delayed_work_sync(&conn->disc_work);

del_timer(&conn->auto_accept_timer);

--
1.7.4.1

2012-01-16 12:35:36

by Andrei Emeltchenko

[permalink] [raw]

Subject: Re: [RFC] Bluetooth: Use flush_work instead of cancel_work

Hi Ulisses,

On Sun, Jan 15, 2012 at 07:10:27PM -0200, Ulisses Furquim wrote:
...
> >> > [ ?584.676126] ======================================================
> >> > [ ?584.676126] [ INFO: possible circular locking dependency detected ]
> >> > [ ?584.676126] 3.2.0-rc2niko+ #44
> >> > [ ?584.676126] -------------------------------------------------------
...
> >> > [ ?584.676126] ?Possible unsafe locking scenario:
> >> > [ ?584.676126]
> >> > [ ?584.676126] ? ? ? ?CPU0 ? ? ? ? ? ? ? ? ? ?CPU1
> >> > [ ?584.676126] ? ? ? ?---- ? ? ? ? ? ? ? ? ? ?----
> >> > [ ?584.676126] ? lock((&(&conn->disc_work)->work));
> >> > [ ?584.676126] ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?lock(&hdev->lock);
> >> > [ ?584.676126] ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?lock((&(&conn->disc_work)->work));
> >> > [ ?584.676126] ? lock(&hdev->lock);
> >> > [ ?584.676126]
> >> > [ ?584.676126] ?*** DEADLOCK ***
> >> > [ ?584.676126]
> >> > [ ?584.676126] 2 locks held by kworker/u:1/30:
> >> > [ ?584.676126] ?#0: ?(hdev->name){.+.+.+}, at: [<c1065a78>] process_one_work+0x108/0x460
> >> > [ ?584.676126] ?#1: ?((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460
> > Apparently it does not fix it completely, the reason might be hci_dev_lock
...
> > in hci_conn_timeout. Maybe instead of lock we could use hold/put?
> >
> > I will investigate this issue further.
>
> I believe the real problem is to have a _sync() call for cancelling or
> flushing the delayed work. While you are at it can you try if using
> __cancel_delayed_work() fixes the problem, please? I guess with the
> recent move to workqueues some _sync() calls were added that might
> lead to deadlocks. I'm even thinking we might wanna replace all
> cancel_dealyed_work_sync() calls with __cancel_dealyed_work() ones
> would be good. I don't have time to test now, but if you do, please
> try this too. Thanks a lot.

I've tested cancel_delayed_work and it works in my set of tests. It does
not report deadlock (at least here ;)).

I've just sent modified RFC v2.

Best regards
Andrei Emeltchenko

2012-01-15 21:10:27

by Ulisses Furquim

[permalink] [raw]

Subject: Re: [RFC] Bluetooth: Use flush_work instead of cancel_work

Hi Andrei,

On Fri, Jan 13, 2012 at 1:07 PM, Emeltchenko Andrei
<[email protected]> wrote:
> Hi Marcel,
>
> On Fri, Jan 13, 2012 at 02:18:43PM +0100, Marcel Holtmann wrote:
>> Hi Andrei,
>>
>> > Fix deadlock when cancelling delayed work.
>> >
>> > [ =A0584.676126] =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>> > [ =A0584.676126] [ INFO: possible circular locking dependency detected=
]
>> > [ =A0584.676126] 3.2.0-rc2niko+ #44
>> > [ =A0584.676126] -----------------------------------------------------=
--
>> > [ =A0584.676126] kworker/u:1/30 is trying to acquire lock:
>> > [ =A0584.676126] =A0(&hdev->lock){+.+.+.}, at: [<f81f001c>] hci_conn_t=
imeout+0x6c/0x190 [bluetooth]
>> > [ =A0584.676126]
>> > [ =A0584.676126] but task is already holding lock:
>> > [ =A0584.676126] =A0((&(&conn->disc_work)->work)){+.+...}, at: [<c1065=
a78>] process_one_work+0x108/0x460
>> > [ =A0584.676126]
>> > [ =A0584.676126] which lock already depends on the new lock.
>> > [ =A0584.676126]
>> > [ =A0584.676126]
>> > [ =A0584.676126] the existing dependency chain (in reverse order) is:
>> > [ =A0584.676126]
>> > [ =A0584.676126] -> #1 ((&(&conn->disc_work)->work)){+.+...}:
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1086748>] lock_acquire+0x88/0x110
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1066041>] wait_on_work+0x61/0x210
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c106630a>] __cancel_work_timer+0x6a/=
0x110
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c10663c0>] cancel_delayed_work_sync+=
0x10/0x20
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<f81f935b>] hci_event_packet+0x3b2b/0=
x4610 [bluetooth]
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<f81ea78e>] hci_rx_work+0x20e/0x4c0 [=
bluetooth]
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1065aec>] process_one_work+0x17c/0x=
460
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c10672e4>] worker_thread+0x124/0x2c0
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c106be44>] kthread+0x84/0x90
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1567f42>] kernel_thread_helper+0x6/=
0x10
>> > [ =A0584.676126]
>> > [ =A0584.676126] -> #0 (&hdev->lock){+.+.+.}:
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c10852cd>] __lock_acquire+0xc0d/0x1a=
b0
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1086748>] lock_acquire+0x88/0x110
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c155de50>] mutex_lock_nested+0x70/0x=
320
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<f81f001c>] hci_conn_timeout+0x6c/0x1=
90 [bluetooth]
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1065aec>] process_one_work+0x17c/0x=
460
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c10672e4>] worker_thread+0x124/0x2c0
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c106be44>] kthread+0x84/0x90
>> > [ =A0584.676126] =A0 =A0 =A0 =A0[<c1567f42>] kernel_thread_helper+0x6/=
0x10
>> > [ =A0584.676126]
>> > [ =A0584.676126] other info that might help us debug this:
>> > [ =A0584.676126]
>> > [ =A0584.676126] =A0Possible unsafe locking scenario:
>> > [ =A0584.676126]
>> > [ =A0584.676126] =A0 =A0 =A0 =A0CPU0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0CPU1
>> > [ =A0584.676126] =A0 =A0 =A0 =A0---- =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0----
>> > [ =A0584.676126] =A0 lock((&(&conn->disc_work)->work));
>> > [ =A0584.676126] =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0lock(&hdev->lock);
>> > [ =A0584.676126] =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0lock((&(&conn->disc_work)->work));
>> > [ =A0584.676126] =A0 lock(&hdev->lock);
>> > [ =A0584.676126]
>> > [ =A0584.676126] =A0*** DEADLOCK ***
>> > [ =A0584.676126]
>> > [ =A0584.676126] 2 locks held by kworker/u:1/30:
>> > [ =A0584.676126] =A0#0: =A0(hdev->name){.+.+.+}, at: [<c1065a78>] proc=
ess_one_work+0x108/0x460
>> > [ =A0584.676126] =A0#1: =A0((&(&conn->disc_work)->work)){+.+...}, at: =
[<c1065a78>] process_one_work+0x108/0x460
>> >
>> > Signed-off-by: Andrei Emeltchenko <[email protected]>
>> > ---
>> > =A0net/bluetooth/hci_conn.c | =A0 =A02 +-
>> > =A01 files changed, 1 insertions(+), 1 deletions(-)
>>
>> Acked-by: Marcel Holtmann <[email protected]>
>
> Apparently it does not fix it completely, the reason might be hci_dev_loc=
k
> in hci_conn_timeout. Maybe instead of lock we could use hold/put?
>
> I will investigate this issue further.

I believe the real problem is to have a _sync() call for cancelling or
flushing the delayed work. While you are at it can you try if using
__cancel_delayed_work() fixes the problem, please? I guess with the
recent move to workqueues some _sync() calls were added that might
lead to deadlocks. I'm even thinking we might wanna replace all
cancel_dealyed_work_sync() calls with __cancel_dealyed_work() ones
would be good. I don't have time to test now, but if you do, please
try this too. Thanks a lot.

> Best regards
> Andrei Emeltchenko

Best regards,

--=20
Ulisses Furquim
ProFUSION embedded systems
http://profusion.mobi
Mobile: +55 19 9250 0942
Skype: ulissesffs

2012-01-13 15:07:48

by Andrei Emeltchenko

[permalink] [raw]

Subject: Re: [RFC] Bluetooth: Use flush_work instead of cancel_work

Hi Marcel,

On Fri, Jan 13, 2012 at 02:18:43PM +0100, Marcel Holtmann wrote:
> Hi Andrei,
>
> > Fix deadlock when cancelling delayed work.
> >
> > [ 584.676126] ======================================================
> > [ 584.676126] [ INFO: possible circular locking dependency detected ]
> > [ 584.676126] 3.2.0-rc2niko+ #44
> > [ 584.676126] -------------------------------------------------------
> > [ 584.676126] kworker/u:1/30 is trying to acquire lock:
> > [ 584.676126] (&hdev->lock){+.+.+.}, at: [<f81f001c>] hci_conn_timeout+0x6c/0x190 [bluetooth]
> > [ 584.676126]
> > [ 584.676126] but task is already holding lock:
> > [ 584.676126] ((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460
> > [ 584.676126]
> > [ 584.676126] which lock already depends on the new lock.
> > [ 584.676126]
> > [ 584.676126]
> > [ 584.676126] the existing dependency chain (in reverse order) is:
> > [ 584.676126]
> > [ 584.676126] -> #1 ((&(&conn->disc_work)->work)){+.+...}:
> > [ 584.676126] [<c1086748>] lock_acquire+0x88/0x110
> > [ 584.676126] [<c1066041>] wait_on_work+0x61/0x210
> > [ 584.676126] [<c106630a>] __cancel_work_timer+0x6a/0x110
> > [ 584.676126] [<c10663c0>] cancel_delayed_work_sync+0x10/0x20
> > [ 584.676126] [<f81f935b>] hci_event_packet+0x3b2b/0x4610 [bluetooth]
> > [ 584.676126] [<f81ea78e>] hci_rx_work+0x20e/0x4c0 [bluetooth]
> > [ 584.676126] [<c1065aec>] process_one_work+0x17c/0x460
> > [ 584.676126] [<c10672e4>] worker_thread+0x124/0x2c0
> > [ 584.676126] [<c106be44>] kthread+0x84/0x90
> > [ 584.676126] [<c1567f42>] kernel_thread_helper+0x6/0x10
> > [ 584.676126]
> > [ 584.676126] -> #0 (&hdev->lock){+.+.+.}:
> > [ 584.676126] [<c10852cd>] __lock_acquire+0xc0d/0x1ab0
> > [ 584.676126] [<c1086748>] lock_acquire+0x88/0x110
> > [ 584.676126] [<c155de50>] mutex_lock_nested+0x70/0x320
> > [ 584.676126] [<f81f001c>] hci_conn_timeout+0x6c/0x190 [bluetooth]
> > [ 584.676126] [<c1065aec>] process_one_work+0x17c/0x460
> > [ 584.676126] [<c10672e4>] worker_thread+0x124/0x2c0
> > [ 584.676126] [<c106be44>] kthread+0x84/0x90
> > [ 584.676126] [<c1567f42>] kernel_thread_helper+0x6/0x10
> > [ 584.676126]
> > [ 584.676126] other info that might help us debug this:
> > [ 584.676126]
> > [ 584.676126] Possible unsafe locking scenario:
> > [ 584.676126]
> > [ 584.676126] CPU0 CPU1
> > [ 584.676126] ---- ----
> > [ 584.676126] lock((&(&conn->disc_work)->work));
> > [ 584.676126] lock(&hdev->lock);
> > [ 584.676126] lock((&(&conn->disc_work)->work));
> > [ 584.676126] lock(&hdev->lock);
> > [ 584.676126]
> > [ 584.676126] *** DEADLOCK ***
> > [ 584.676126]
> > [ 584.676126] 2 locks held by kworker/u:1/30:
> > [ 584.676126] #0: (hdev->name){.+.+.+}, at: [<c1065a78>] process_one_work+0x108/0x460
> > [ 584.676126] #1: ((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460
> >
> > Signed-off-by: Andrei Emeltchenko <[email protected]>
> > ---
> > net/bluetooth/hci_conn.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
>
> Acked-by: Marcel Holtmann <[email protected]>

Apparently it does not fix it completely, the reason might be hci_dev_lock
in hci_conn_timeout. Maybe instead of lock we could use hold/put?

I will investigate this issue further.

Best regards
Andrei Emeltchenko

2012-01-13 13:18:43

by Marcel Holtmann

[permalink] [raw]

Subject: Re: [RFC] Bluetooth: Use flush_work instead of cancel_work

Hi Andrei,

> Fix deadlock when cancelling delayed work.
>
> [ 584.676126] ======================================================
> [ 584.676126] [ INFO: possible circular locking dependency detected ]
> [ 584.676126] 3.2.0-rc2niko+ #44
> [ 584.676126] -------------------------------------------------------
> [ 584.676126] kworker/u:1/30 is trying to acquire lock:
> [ 584.676126] (&hdev->lock){+.+.+.}, at: [<f81f001c>] hci_conn_timeout+0x6c/0x190 [bluetooth]
> [ 584.676126]
> [ 584.676126] but task is already holding lock:
> [ 584.676126] ((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460
> [ 584.676126]
> [ 584.676126] which lock already depends on the new lock.
> [ 584.676126]
> [ 584.676126]
> [ 584.676126] the existing dependency chain (in reverse order) is:
> [ 584.676126]
> [ 584.676126] -> #1 ((&(&conn->disc_work)->work)){+.+...}:
> [ 584.676126] [<c1086748>] lock_acquire+0x88/0x110
> [ 584.676126] [<c1066041>] wait_on_work+0x61/0x210
> [ 584.676126] [<c106630a>] __cancel_work_timer+0x6a/0x110
> [ 584.676126] [<c10663c0>] cancel_delayed_work_sync+0x10/0x20
> [ 584.676126] [<f81f935b>] hci_event_packet+0x3b2b/0x4610 [bluetooth]
> [ 584.676126] [<f81ea78e>] hci_rx_work+0x20e/0x4c0 [bluetooth]
> [ 584.676126] [<c1065aec>] process_one_work+0x17c/0x460
> [ 584.676126] [<c10672e4>] worker_thread+0x124/0x2c0
> [ 584.676126] [<c106be44>] kthread+0x84/0x90
> [ 584.676126] [<c1567f42>] kernel_thread_helper+0x6/0x10
> [ 584.676126]
> [ 584.676126] -> #0 (&hdev->lock){+.+.+.}:
> [ 584.676126] [<c10852cd>] __lock_acquire+0xc0d/0x1ab0
> [ 584.676126] [<c1086748>] lock_acquire+0x88/0x110
> [ 584.676126] [<c155de50>] mutex_lock_nested+0x70/0x320
> [ 584.676126] [<f81f001c>] hci_conn_timeout+0x6c/0x190 [bluetooth]
> [ 584.676126] [<c1065aec>] process_one_work+0x17c/0x460
> [ 584.676126] [<c10672e4>] worker_thread+0x124/0x2c0
> [ 584.676126] [<c106be44>] kthread+0x84/0x90
> [ 584.676126] [<c1567f42>] kernel_thread_helper+0x6/0x10
> [ 584.676126]
> [ 584.676126] other info that might help us debug this:
> [ 584.676126]
> [ 584.676126] Possible unsafe locking scenario:
> [ 584.676126]
> [ 584.676126] CPU0 CPU1
> [ 584.676126] ---- ----
> [ 584.676126] lock((&(&conn->disc_work)->work));
> [ 584.676126] lock(&hdev->lock);
> [ 584.676126] lock((&(&conn->disc_work)->work));
> [ 584.676126] lock(&hdev->lock);
> [ 584.676126]
> [ 584.676126] *** DEADLOCK ***
> [ 584.676126]
> [ 584.676126] 2 locks held by kworker/u:1/30:
> [ 584.676126] #0: (hdev->name){.+.+.+}, at: [<c1065a78>] process_one_work+0x108/0x460
> [ 584.676126] #1: ((&(&conn->disc_work)->work)){+.+...}, at: [<c1065a78>] process_one_work+0x108/0x460
>
> Signed-off-by: Andrei Emeltchenko <[email protected]>
> ---
> net/bluetooth/hci_conn.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)

Acked-by: Marcel Holtmann <[email protected]>

Regards

Marcel