2007-04-23 21:56:39

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Mon, 23 Apr 2007, Jeremy Fitzhardinge wrote:

> I got this on resume; it looks like a Bluetooth and/or USB problem.
> PM: Removing info for No Bus:hci0
> BUG: sleeping function called from invalid context at net/core/sock.c:1523
> in_atomic():1, irqs_disabled():0
> 1 lock held by khubd/180:
> #0: (old_style_rw_init#2){-.-?}, at: [<f88c5816>] hci_sock_dev_event+0x42/0xc5 [bluetooth]
> [<c01091b5>] show_trace_log_lvl+0x1a/0x30
> [<c010980c>] show_trace+0x12/0x14
> [<c01098cb>] dump_stack+0x16/0x18
> [<c0124191>] __might_sleep+0xe5/0xeb
> [<c0309ece>] lock_sock_nested+0x1d/0xc4
> [<f88c587b>] hci_sock_dev_event+0xa7/0xc5 [bluetooth]
> [<c037c8d2>] notifier_call_chain+0x20/0x3c
> [<c037c915>] atomic_notifier_call_chain+0x27/0x50
> [<f88c1a55>] hci_notify+0x12/0x14 [bluetooth]
> [<f88c2626>] hci_unregister_dev+0x4c/0x65 [bluetooth]
> [<f89f5d38>] hci_usb_disconnect+0x42/0x6f [hci_usb]
> [<c02d7e49>] usb_unbind_interface+0x33/0x69
> [<c028f3fa>] __device_release_driver+0x74/0x90
> [<c028f870>] device_release_driver+0x33/0x4b
> [<c028ee0f>] bus_remove_device+0x73/0x82
> [<c028d412>] device_del+0x169/0x1cf
> [<c02d5c00>] usb_disable_device+0x62/0xc2
> [<c02d2807>] usb_disconnect+0x95/0x114
> [<c02d34d8>] hub_thread+0x2e2/0x99b
> [<c013c8f7>] kthread+0xb5/0xe2
> [<c0108d97>] kernel_thread_helper+0x7/0x10
> =======================
> PM: Removing info for usb:4-1:1.0

OK, this probably started happening since b40df5743. Before that commit,
hci_sock_dev_event() used bh_lock_sock() to lock the corresponding struct
sock. This was obviously buggy - not deadlock safe against
l2cap_connect_cfm() from softirq context.

That commit however introduced another problem - hci_sock_dev_event() is
now obviously being triggered (for HCI_DEV_UNREG event, when suspending) in
atomic context with preemption disabled. This is what lock_sock_nested()
complains about, as it is allowed to sleep inside __lock_sock(), waiting
for the lock owner.

Hmm, *sigh*. I guess the patch below fixes the problem, but it is a
masterpiece in the field of ugliness. And I am not sure whether it is
completely correct either. Are there any immediate ideas for a better
solution with respect to how struct sock locking works?

diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 71f5cfb..c5c93cd 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,
 		/* Detach sockets from device */
 		read_lock(&hci_sk_list.lock);
 		sk_for_each(sk, node, &hci_sk_list.head) {
-			lock_sock(sk);
+			if (in_atomic())
+				bh_lock_sock(sk);
+			else
+				lock_sock(sk);
 			if (hci_pi(sk)->hdev == hdev) {
 				hci_pi(sk)->hdev = NULL;
 				sk->sk_err = EPIPE;
@@ -665,7 +668,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,

 				hci_dev_put(hdev);
 			}
-			release_sock(sk);
+			if (in_atomic())
+				bh_unlock_sock(sk);
+			else
+				release_sock(sk);
 		}
 		read_unlock(&hci_sk_list.lock);
 	}


2007-04-26 14:31:30

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Mon, 23 Apr 2007, Jiri Kosina wrote:

> > BUG: sleeping function called from invalid context at net/core/sock.c:1523
> > in_atomic():1, irqs_disabled():0
> > 1 lock held by khubd/180:
> > #0: (old_style_rw_init#2){-.-?}, at: [<f88c5816>] hci_sock_dev_event+0x42/0xc5 [bluetooth]
[...]
> OK, this probably started happening since b40df5743. Before that commit,
> hci_sock_dev_event() used bh_lock_sock() to lock the corresponding
> struct sock. This was obviously buggy - not deadlock safe against
> l2cap_connect_cfm() from softirq context. This however introduced
> another problem - hci_sock_dev_event() is now obviously being triggered
> (for HCI_DEV_UNREG event, when suspending) in atomic context with
> preemption disabled. This is what lock_sock_nested() complains about, as
> it is allowed to sleep inside __lock_sock(), waiting for the lock owner.

I guess the patch below is a proper fix. Marcel, does this look okay to
you?



From: Jiri Kosina <[email protected]>

Bluetooth: postpone hci_dev unregistration

Commit b40df57 replaced bh_lock_sock() in hci_sock_dev_event() with
lock_sock() when unregistering HCI device, in order to prevent deadlock
against locking in l2cap_connect_cfm() from softirq context.

This however introduces another problem - hci_sock_dev_event() for
HCI_DEV_UNREG can also be triggered in atomic context, in which calling
lock_sock() is not safe as it could sleep.

This patch moves the detaching of sockets from the hci_dev into a workqueue,
so that lock_sock() can be used safely. This requires moving the
deallocation of hci_dev as well - deallocating the device just after
hci_unregister_dev() would be too soon, as it could happen before the
work item has run.

Signed-off-by: Jiri Kosina <[email protected]>

---
drivers/bluetooth/bfusb.c | 2 -
drivers/bluetooth/bluecard_cs.c | 2 -
drivers/bluetooth/bpa10x.c | 2 -
drivers/bluetooth/bt3c_cs.c | 2 -
drivers/bluetooth/btuart_cs.c | 2 -
drivers/bluetooth/dtl1_cs.c | 2 -
drivers/bluetooth/hci_ldisc.c | 1 -
drivers/bluetooth/hci_usb.c | 2 -
drivers/bluetooth/hci_vhci.c | 2 -
include/net/bluetooth/hci_core.h | 3 ++
net/bluetooth/hci_core.c | 9 +++++++
net/bluetooth/hci_sock.c | 44 ++++++++++++++++++++-----------------
12 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/drivers/bluetooth/bfusb.c b/drivers/bluetooth/bfusb.c
index 4c766f3..db6809e 100644
--- a/drivers/bluetooth/bfusb.c
+++ b/drivers/bluetooth/bfusb.c
@@ -762,8 +762,6 @@ static void bfusb_disconnect(struct usb_

if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);
-
- hci_free_dev(hdev);
}

static struct usb_driver bfusb_driver = {
diff --git a/drivers/bluetooth/bluecard_cs.c b/drivers/bluetooth/bluecard_cs.c
index acfb6a4..1184113 100644
--- a/drivers/bluetooth/bluecard_cs.c
+++ b/drivers/bluetooth/bluecard_cs.c
@@ -851,8 +851,6 @@ static int bluecard_close(bluecard_info_
if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);

- hci_free_dev(hdev);
-
return 0;
}

diff --git a/drivers/bluetooth/bpa10x.c b/drivers/bluetooth/bpa10x.c
index 9fca651..7dfaa95 100644
--- a/drivers/bluetooth/bpa10x.c
+++ b/drivers/bluetooth/bpa10x.c
@@ -613,8 +613,6 @@ static void bpa10x_disconnect(struct usb

if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);
-
- hci_free_dev(hdev);
}

static struct usb_driver bpa10x_driver = {
diff --git a/drivers/bluetooth/bt3c_cs.c b/drivers/bluetooth/bt3c_cs.c
index 18b0f39..6ab7b56 100644
--- a/drivers/bluetooth/bt3c_cs.c
+++ b/drivers/bluetooth/bt3c_cs.c
@@ -640,8 +640,6 @@ static int bt3c_close(bt3c_info_t *info)
if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);

- hci_free_dev(hdev);
-
return 0;
}

diff --git a/drivers/bluetooth/btuart_cs.c b/drivers/bluetooth/btuart_cs.c
index c1bce75..93ca675 100644
--- a/drivers/bluetooth/btuart_cs.c
+++ b/drivers/bluetooth/btuart_cs.c
@@ -570,8 +570,6 @@ static int btuart_close(btuart_info_t *i
if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);

- hci_free_dev(hdev);
-
return 0;
}

diff --git a/drivers/bluetooth/dtl1_cs.c b/drivers/bluetooth/dtl1_cs.c
index 459aa97..4fc7c02 100644
--- a/drivers/bluetooth/dtl1_cs.c
+++ b/drivers/bluetooth/dtl1_cs.c
@@ -552,8 +552,6 @@ static int dtl1_close(dtl1_info_t *info)
if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);

- hci_free_dev(hdev);
-
return 0;
}

diff --git a/drivers/bluetooth/hci_ldisc.c b/drivers/bluetooth/hci_ldisc.c
index 0f4203b..4c4d555 100644
--- a/drivers/bluetooth/hci_ldisc.c
+++ b/drivers/bluetooth/hci_ldisc.c
@@ -312,7 +312,6 @@ static void hci_uart_tty_close(struct tt
if (test_and_clear_bit(HCI_UART_PROTO_SET, &hu->flags)) {
hu->proto->close(hu);
hci_unregister_dev(hdev);
- hci_free_dev(hdev);
}
}
}
diff --git a/drivers/bluetooth/hci_usb.c b/drivers/bluetooth/hci_usb.c
index 406af57..e9e0183 100644
--- a/drivers/bluetooth/hci_usb.c
+++ b/drivers/bluetooth/hci_usb.c
@@ -1069,8 +1069,6 @@ static void hci_usb_disconnect(struct us

if (hci_unregister_dev(hdev) < 0)
BT_ERR("Can't unregister HCI device %s", hdev->name);
-
- hci_free_dev(hdev);
}

static int hci_usb_suspend(struct usb_interface *intf, pm_message_t message)
diff --git a/drivers/bluetooth/hci_vhci.c b/drivers/bluetooth/hci_vhci.c
index b71a5cc..e5a3a8c 100644
--- a/drivers/bluetooth/hci_vhci.c
+++ b/drivers/bluetooth/hci_vhci.c
@@ -308,8 +308,6 @@ static int vhci_release(struct inode *in
BT_ERR("Can't unregister HCI device %s", hdev->name);
}

- hci_free_dev(hdev);
-
file->private_data = NULL;

return 0;
diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
index c0fc396..a0a0a15 100644
--- a/include/net/bluetooth/hci_core.h
+++ b/include/net/bluetooth/hci_core.h
@@ -132,6 +132,8 @@ struct hci_dev {

struct module *owner;

+ struct work_struct hci_dev_unreg_work;
+
int (*open)(struct hci_dev *hdev);
int (*close)(struct hci_dev *hdev);
int (*flush)(struct hci_dev *hdev);
@@ -622,6 +624,7 @@ void hci_si_event(struct hci_dev *hdev,

/* ----- HCI Sockets ----- */
void hci_send_to_sock(struct hci_dev *hdev, struct sk_buff *skb);
+void hci_sock_detach(struct hci_dev *hdev);

/* HCI info for socket */
#define hci_pi(sk) ((struct hci_pinfo *) sk)
diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 4917919..54a50ee 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -795,6 +795,14 @@ int hci_get_dev_info(void __user *arg)
return err;
}

+void hci_dev_unreg(struct work_struct *work)
+{
+ struct hci_dev *hdev =
+ container_of(work, struct hci_dev, hci_dev_unreg_work);
+ hci_sock_detach(hdev);
+ hci_free_dev(hdev);
+}
+
/* ---- Interface to HCI drivers ---- */

/* Alloc HCI device */
@@ -807,6 +815,7 @@ struct hci_dev *hci_alloc_dev(void)
return NULL;

skb_queue_head_init(&hdev->driver_init);
+ INIT_WORK(&hdev->hci_dev_unreg_work, hci_dev_unreg);

return hdev;
}
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 71f5cfb..fb59408 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -141,6 +141,28 @@ void hci_send_to_sock(struct hci_dev *hd
read_unlock(&hci_sk_list.lock);
}

+void hci_sock_detach(struct hci_dev *hdev)
+{
+ struct sock *sk;
+ struct hlist_node *node;
+
+ /* Detach sockets from device */
+ read_lock(&hci_sk_list.lock);
+ sk_for_each(sk, node, &hci_sk_list.head) {
+ lock_sock(sk);
+ if (hci_pi(sk)->hdev == hdev) {
+ hci_pi(sk)->hdev = NULL;
+ sk->sk_err = EPIPE;
+ sk->sk_state = BT_OPEN;
+ sk->sk_state_change(sk);
+
+ hci_dev_put(hdev);
+ }
+ release_sock(sk);
+ }
+ read_unlock(&hci_sk_list.lock);
+}
+
static int hci_sock_release(struct socket *sock)
{
struct sock *sk = sock->sk;
@@ -649,26 +671,8 @@ static int hci_sock_dev_event(struct not
ev.dev_id = hdev->id;
hci_si_event(NULL, HCI_EV_SI_DEVICE, sizeof(ev), &ev);

- if (event == HCI_DEV_UNREG) {
- struct sock *sk;
- struct hlist_node *node;
-
- /* Detach sockets from device */
- read_lock(&hci_sk_list.lock);
- sk_for_each(sk, node, &hci_sk_list.head) {
- lock_sock(sk);
- if (hci_pi(sk)->hdev == hdev) {
- hci_pi(sk)->hdev = NULL;
- sk->sk_err = EPIPE;
- sk->sk_state = BT_OPEN;
- sk->sk_state_change(sk);
-
- hci_dev_put(hdev);
- }
- release_sock(sk);
- }
- read_unlock(&hci_sk_list.lock);
- }
+ if (event == HCI_DEV_UNREG)
+ schedule_work(&hdev->hci_dev_unreg_work);

return NOTIFY_DONE;
}
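
The deferral pattern used by the patch above can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not
the kernel code: toy_dev, toy_schedule_work(), toy_run_pending() and
toy_dev_unreg() are hypothetical names standing in for struct hci_dev,
schedule_work(), the workqueue thread and hci_dev_unreg(). It shows why
the free has to move into the handler: the embedded work item lives
inside the device, so the device must stay allocated until the handler
has run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work_item {
	void (*func)(struct work_item *work);
	struct work_item *next;
};

/* Toy device: the work item is embedded, as in the patch. */
struct toy_dev {
	int id;
	int attached_socks;
	struct work_item unreg_work;
};

static struct work_item *pending;	/* queue of deferred work */
static int freed_id = -1;		/* records which device was freed */

/* Called from the "atomic" path: only queues, never blocks. */
static void toy_schedule_work(struct work_item *work)
{
	assert(work != NULL);
	work->next = pending;
	pending = work;
}

/* Runs later, in a context where blocking would be allowed. */
static void toy_dev_unreg(struct work_item *work)
{
	struct toy_dev *dev = container_of(work, struct toy_dev, unreg_work);

	dev->attached_socks = 0;	/* stands in for hci_sock_detach() */
	freed_id = dev->id;
	free(dev);			/* stands in for hci_free_dev() */
}

/* Drains the queue and returns how many handlers ran. */
static int toy_run_pending(void)
{
	int ran = 0;

	while (pending) {
		struct work_item *work = pending;

		pending = work->next;	/* unlink before the handler frees it */
		work->func(work);
		ran++;
	}
	return ran;
}

static struct toy_dev *toy_alloc_dev(int id)
{
	struct toy_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->id = id;
	dev->attached_socks = 2;
	dev->unreg_work.func = toy_dev_unreg;
	return dev;
}
```

A caller would allocate with toy_alloc_dev(), queue
&dev->unreg_work from the notifier path, and must not touch dev again
after toy_run_pending() has run, which mirrors why the drivers above
drop their own hci_free_dev() calls.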

2007-04-24 07:59:38

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Tue, 24 Apr 2007, Herbert Xu wrote:

> > Hmm, *sigh*. I guess the patch below fixes the problem, but it is a
> > masterpiece in the field of ugliness. And I am not sure whether it is
> > completely correct either. Are there any immediate ideas for better
> > solution with respect to how struct sock locking works?
> Please cc such patches to netdev. Thanks.

Hi Herbert,

well it's pretty much bluetooth-specific, and bluez-devel was CCed, but
OK.

> > diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
> > index 71f5cfb..c5c93cd 100644
> > --- a/net/bluetooth/hci_sock.c
> > +++ b/net/bluetooth/hci_sock.c
> > @@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,
> > /* Detach sockets from device */
> > read_lock(&hci_sk_list.lock);
> > sk_for_each(sk, node, &hci_sk_list.head) {
> > - lock_sock(sk);
> > + if (in_atomic())
> > + bh_lock_sock(sk);
> > + else
> > + lock_sock(sk);
>
> This doesn't do what you think it does. bh_lock_sock can still succeed
> even with lock_sock held by someone else.

I know, this was precisely the reason why I converted the bh_lock_sock()
to lock_sock() here some time ago (as it was racy with
l2cap_connect_cfm()).

> Does this need to occur immediately when an event occurs? If not I'd
> suggest moving this into a workqueue.

Will have to check whether this will be processed properly in time when
going to suspend.

Thanks,

--
Jiri Kosina

2007-04-24 03:30:09

by Herbert Xu

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Jiri Kosina <[email protected]> wrote:
>
> Hmm, *sigh*. I guess the patch below fixes the problem, but it is a
> masterpiece in the field of ugliness. And I am not sure whether it is
> completely correct either. Are there any immediate ideas for better
> solution with respect to how struct sock locking works?

Please cc such patches to netdev. Thanks.

> diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
> index 71f5cfb..c5c93cd 100644
> --- a/net/bluetooth/hci_sock.c
> +++ b/net/bluetooth/hci_sock.c
> @@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,
> /* Detach sockets from device */
> read_lock(&hci_sk_list.lock);
> sk_for_each(sk, node, &hci_sk_list.head) {
> - lock_sock(sk);
> + if (in_atomic())
> + bh_lock_sock(sk);
> + else
> + lock_sock(sk);

This doesn't do what you think it does. bh_lock_sock can still succeed
even with lock_sock held by someone else.

Does this need to occur immediately when an event occurs? If not I'd
suggest moving this into a workqueue.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
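
Herbert's point about bh_lock_sock() can be illustrated with a small
single-threaded model of the socket lock. This is a simplified sketch,
not the kernel implementation: the toy_* names are hypothetical, the
spinlock is modelled as a plain flag, and the sleeping path of
lock_sock() is replaced by an assertion. It assumes the two-level scheme
in which lock_sock() marks the socket as owned and then drops the inner
lock, while bh_lock_sock() takes only the inner lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a socket lock: a short-held lock plus an "owned" flag. */
struct toy_sock_lock {
	bool slock_held;	/* stands in for the inner spinlock */
	bool owned;		/* set while a process-context user owns the socket */
};

static bool toy_spin_trylock(struct toy_sock_lock *l)
{
	if (l->slock_held)
		return false;
	l->slock_held = true;
	return true;
}

static void toy_spin_unlock(struct toy_sock_lock *l)
{
	assert(l->slock_held);
	l->slock_held = false;
}

/*
 * Model of lock_sock(): mark the socket owned, then drop the inner lock.
 * The real function would sleep if the socket were already owned; this
 * single-threaded sketch just asserts that it is not.
 */
static void toy_lock_sock(struct toy_sock_lock *l)
{
	bool got = toy_spin_trylock(l);

	assert(got);
	assert(!l->owned);
	l->owned = true;
	toy_spin_unlock(l);
}

static void toy_release_sock(struct toy_sock_lock *l)
{
	bool got = toy_spin_trylock(l);

	assert(got);
	l->owned = false;
	toy_spin_unlock(l);
}

/* Model of bh_lock_sock(): takes only the inner lock, ignores "owned". */
static bool toy_bh_trylock_sock(struct toy_sock_lock *l)
{
	return toy_spin_trylock(l);
}

static void toy_bh_unlock_sock(struct toy_sock_lock *l)
{
	toy_spin_unlock(l);
}
```

In this model toy_bh_trylock_sock() succeeds while the socket is still
owned through toy_lock_sock(), so choosing between the two based on
in_atomic() gives no mutual exclusion between the two kinds of holders.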

2007-05-17 06:04:29

by Marcel Holtmann

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Hi Jiri,

> > > I have just verified that this locking scheme is indeed correct. So you
> > > can add
> > >
> > > Signed-off-by: Jiri Kosina <[email protected]>
> > >
> > > if you wish to, and submit the patch to Andrew.
> > I guess I don't get sent networking patches any more?
> > :-)
>
> Well, this is bluetooth-specific, but it seemed to me that Marcel wasn't
> going to send pull requests to Linus any time soon, therefore I thought
> going through akpm was the thing to do.

actually everything net/ related goes to Dave first. No exception. This
includes the Bluetooth subsystem. I even send drivers/bluetooth/ through
Dave before they go to Linus.

> Honestly, I really don't care through which tree this goes in, so sorry if
> any offence was caused here :)

Having these small ones passed through Andrew is only a convenience
since he is really good at picking them up and making sure that they get
merged by Linus.

Regards

Marcel



2007-05-16 23:20:03

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Wed, 16 May 2007, David Miller wrote:

> > I have just verified that this locking scheme is indeed correct. So you
> > can add
> >
> > Signed-off-by: Jiri Kosina <[email protected]>
> >
> > if you wish to, and submit the patch to Andrew.
> I guess I don't get sent networking patches any more?
> :-)

Well, this is bluetooth-specific, but it seemed to me that Marcel wasn't
going to send pull requests to Linus any time soon, therefore I thought
going through akpm was the thing to do.

Honestly, I really don't care through which tree this goes in, so sorry if
any offence was caused here :)

--
Jiri Kosina

2007-05-16 23:16:09

by David Miller

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

From: Jiri Kosina <[email protected]>
Date: Thu, 17 May 2007 01:03:55 +0200 (CEST)

> On Wed, 16 May 2007, Jiri Kosina wrote:
>
> > > since Jiri has a good test case for it, I leave it to him for testing.
> > > If he confirms that this fixes the locking issues, then this is
> > > Signed-off-by: Marcel Holtmann <[email protected]>
> > I will verify later this evening and will let you know. I am however
> > pretty convinced now that this is the right fix.
>
> Satyam,
>
> I have just verified that this locking scheme is indeed correct. So you
> can add
>
> Signed-off-by: Jiri Kosina <[email protected]>
>
> if you wish to, and submit the patch to Andrew.

I guess I don't get sent networking patches any more?
:-)

2007-05-16 23:03:55

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Wed, 16 May 2007, Jiri Kosina wrote:

> > since Jiri has a good test case for it, I leave it to him for testing.
> > If he confirms that this fixes the locking issues, then this is
> > Signed-off-by: Marcel Holtmann <[email protected]>
> I will verify later this evening and will let you know. I am however
> pretty convinced now that this is the right fix.

Satyam,

I have just verified that this locking scheme is indeed correct. So you
can add

Signed-off-by: Jiri Kosina <[email protected]>

if you wish to, and submit the patch to Andrew.

Thanks,

--
Jiri Kosina

2007-05-16 12:19:04

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Wed, 16 May 2007, Marcel Holtmann wrote:

> since Jiri has a good test case for it, I leave it to him for testing.
> If he confirms that this fixes the locking issues, then this is
> Signed-off-by: Marcel Holtmann <[email protected]>

I will verify later this evening and will let you know. I am however
pretty convinced now that this is the right fix.

Thanks,

--
Jiri Kosina

2007-05-16 12:16:15

by Marcel Holtmann

Subject: Re: [Bluez-devel] 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Hi Satyam,

> > > > > > (later)
> > > > > > I Googled a bit to see if this problem was faced elsewhere in the kernel
> > > > > > too. Saw the following commit by Ingo Molnar
> > > > > > (9883a13c72dbf8c518814b6091019643cdb34429):
> > > > > > - lock_sock(sock->sk);
> > > > > > + local_bh_disable();
> > > > > > + bh_lock_sock_nested(sock->sk);
> > > > > > rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
> > > > > > - release_sock(sock->sk);
> > > > > > + bh_unlock_sock(sock->sk);
> > > > > > + local_bh_enable();
> > > > > > Is it _really_ *this* simple?
> > > > > [...]
> > > > > actually this *seems* to be proper solution also for our case, thanks for
> > > > > pointing this out. I will think about it once again, do some more tests
> > > > > with this locking scheme, and will let you know.
> > > >
> > > > Yes, I can almost confirm that this (open-coding of spin_lock_bh,
> > > > effectively) is the proper solution (Rusty's unreliable guide to
> > > > kernel-locking needs to be next to every developer's keyboard :-)
> > > > I also came across this idiom in other places in the networking code
> > > > so it seems to be pretty much the standard way. I wish I owned
> > > > bluetooth hardware, could've tested this for you myself.
> > >
> > > does this mean we should revert previous changes to the locking or only
> > > apply this on top of it?
> >
> > I've fixed a simple patch on top of 2.6.22-rc1 below.
>
> Eek, please ignore previous one. This one's correct.
>
> Signed-off-by: Satyam Sharma <[email protected]>
>
> diff -ruNp a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
> --- a/net/bluetooth/hci_sock.c 2007-05-16 17:31:06.000000000 +0530
> +++ b/net/bluetooth/hci_sock.c 2007-05-16 17:38:35.000000000 +0530
> @@ -665,7 +665,8 @@ static int hci_sock_dev_event(struct not
> /* Detach sockets from device */
> read_lock(&hci_sk_list.lock);
> sk_for_each(sk, node, &hci_sk_list.head) {
> - lock_sock(sk);
> + local_bh_disable();
> + bh_lock_sock_nested(sk);
> if (hci_pi(sk)->hdev == hdev) {
> hci_pi(sk)->hdev = NULL;
> sk->sk_err = EPIPE;
> @@ -674,7 +675,8 @@ static int hci_sock_dev_event(struct not
>
> hci_dev_put(hdev);
> }
> - release_sock(sk);
> + bh_unlock_sock(sk);
> + local_bh_enable();
> }
> read_unlock(&hci_sk_list.lock);
> }

since Jiri has a good test case for it, I leave it to him for testing.
If he confirms that this fixes the locking issues, then this is

Signed-off-by: Marcel Holtmann <[email protected]>

Regards

Marcel




2007-05-16 11:59:34

by Satyam Sharma

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On 5/16/07, Satyam Sharma <[email protected]> wrote:
> Hi Marcel,
> [...]
> > > > > (later)
> > > > > I Googled a bit to see if this problem was faced elsewhere in the kernel
> > > > > too. Saw the following commit by Ingo Molnar
> > > > > (9883a13c72dbf8c518814b6091019643cdb34429):
> > > > > - lock_sock(sock->sk);
> > > > > + local_bh_disable();
> > > > > + bh_lock_sock_nested(sock->sk);
> > > > > rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
> > > > > - release_sock(sock->sk);
> > > > > + bh_unlock_sock(sock->sk);
> > > > > + local_bh_enable();
> > > > > Is it _really_ *this* simple?
> > > > [...]
> > > > actually this *seems* to be proper solution also for our case, thanks for
> > > > pointing this out. I will think about it once again, do some more tests
> > > > with this locking scheme, and will let you know.
> > >
> > > Yes, I can almost confirm that this (open-coding of spin_lock_bh,
> > > effectively) is the proper solution (Rusty's unreliable guide to
> > > kernel-locking needs to be next to every developer's keyboard :-)
> > > I also came across this idiom in other places in the networking code
> > > so it seems to be pretty much the standard way. I wish I owned
> > > bluetooth hardware, could've tested this for you myself.
> >
> > does this mean we should revert previous changes to the locking or only
> > apply this on top of it?
>
> I've fixed a simple patch on top of 2.6.22-rc1 below.

Eek, please ignore previous one. This one's correct.

Signed-off-by: Satyam Sharma <[email protected]>

diff -ruNp a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
--- a/net/bluetooth/hci_sock.c 2007-05-16 17:31:06.000000000 +0530
+++ b/net/bluetooth/hci_sock.c 2007-05-16 17:38:35.000000000 +0530
@@ -665,7 +665,8 @@ static int hci_sock_dev_event(struct not
 		/* Detach sockets from device */
 		read_lock(&hci_sk_list.lock);
 		sk_for_each(sk, node, &hci_sk_list.head) {
-			lock_sock(sk);
+			local_bh_disable();
+			bh_lock_sock_nested(sk);
 			if (hci_pi(sk)->hdev == hdev) {
 				hci_pi(sk)->hdev = NULL;
 				sk->sk_err = EPIPE;
@@ -674,7 +675,8 @@ static int hci_sock_dev_event(struct not

 				hci_dev_put(hdev);
 			}
-			release_sock(sk);
+			bh_unlock_sock(sk);
+			local_bh_enable();
 		}
 		read_unlock(&hci_sk_list.lock);
 	}

2007-05-16 11:56:38

by Satyam Sharma

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Hi Marcel,

On 5/16/07, Marcel Holtmann <[email protected]> wrote:
> Hi Satyam,
>
> > > > (later)
> > > > I Googled a bit to see if this problem was faced elsewhere in the kernel
> > > > too. Saw the following commit by Ingo Molnar
> > > > (9883a13c72dbf8c518814b6091019643cdb34429):
> > > > - lock_sock(sock->sk);
> > > > + local_bh_disable();
> > > > + bh_lock_sock_nested(sock->sk);
> > > > rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
> > > > - release_sock(sock->sk);
> > > > + bh_unlock_sock(sock->sk);
> > > > + local_bh_enable();
> > > > Is it _really_ *this* simple?
> > > [...]
> > > actually this *seems* to be proper solution also for our case, thanks for
> > > pointing this out. I will think about it once again, do some more tests
> > > with this locking scheme, and will let you know.
> >
> > Yes, I can almost confirm that this (open-coding of spin_lock_bh,
> > effectively) is the proper solution (Rusty's unreliable guide to
> > kernel-locking needs to be next to every developer's keyboard :-)
> > I also came across this idiom in other places in the networking code
> > so it seems to be pretty much the standard way. I wish I owned
> > bluetooth hardware, could've tested this for you myself.
>
> does this mean we should revert previous changes to the locking or only
> apply this on top of it?

I've fixed a simple patch on top of 2.6.22-rc1 below.

Signed-off-by: Satyam Sharma <[email protected]>

diff -ruNp a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
--- a/net/bluetooth/hci_sock.c 2007-05-16 17:31:06.000000000 +0530
+++ b/net/bluetooth/hci_sock.c 2007-05-16 17:33:36.000000000 +0530
@@ -665,7 +665,8 @@ static int hci_sock_dev_event(struct not
/* Detach sockets from device */
read_lock(&hci_sk_list.lock);
sk_for_each(sk, node, &hci_sk_list.head) {
- lock_sock(sk);
+ local_bh_disable();
+ bh_lock_sock_nested(sk);
if (hci_pi(sk)->hdev == hdev) {
hci_pi(sk)->hdev = NULL;
sk->sk_err = EPIPE;
@@ -674,6 +675,8 @@ static int hci_sock_dev_event(struct not

hci_dev_put(hdev);
}
+ bh_unlock_sock(sk);
+ local_bh_enable();
release_sock(sk);
}
read_unlock(&hci_sk_list.lock);

2007-05-16 11:45:04

by Marcel Holtmann

Subject: Re: [Bluez-devel] 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Hi Satyam,

> > > (later)
> > > I Googled a bit to see if this problem was faced elsewhere in the kernel
> > > too. Saw the following commit by Ingo Molnar
> > > (9883a13c72dbf8c518814b6091019643cdb34429):
> > > - lock_sock(sock->sk);
> > > + local_bh_disable();
> > > + bh_lock_sock_nested(sock->sk);
> > > rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
> > > - release_sock(sock->sk);
> > > + bh_unlock_sock(sock->sk);
> > > + local_bh_enable();
> > > Is it _really_ *this* simple?
> > [...]
> > actually this *seems* to be proper solution also for our case, thanks for
> > pointing this out. I will think about it once again, do some more tests
> > with this locking scheme, and will let you know.
>
> Yes, I can almost confirm that this (open-coding of spin_lock_bh,
> effectively) is the proper solution (Rusty's unreliable guide to
> kernel-locking needs to be next to every developer's keyboard :-)
> I also came across this idiom in other places in the networking code
> so it seems to be pretty much the standard way. I wish I owned
> bluetooth hardware, could've tested this for you myself.

does this mean we should revert previous changes to the locking or only
apply this on top of it?

Regards

Marcel




2007-05-16 11:36:46

by Satyam Sharma

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Hi Jiri,

On 5/16/07, Jiri Kosina <[email protected]> wrote:
> On Fri, 11 May 2007, Satyam Sharma wrote:
> > (later)
> > I Googled a bit to see if this problem was faced elsewhere in the kernel
> > too. Saw the following commit by Ingo Molnar
> > (9883a13c72dbf8c518814b6091019643cdb34429):
> > - lock_sock(sock->sk);
> > + local_bh_disable();
> > + bh_lock_sock_nested(sock->sk);
> > rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
> > - release_sock(sock->sk);
> > + bh_unlock_sock(sock->sk);
> > + local_bh_enable();
> > Is it _really_ *this* simple?
> [...]
> actually this *seems* to be proper solution also for our case, thanks for
> pointing this out. I will think about it once again, do some more tests
> with this locking scheme, and will let you know.

Yes, I can almost confirm that this (open-coding of spin_lock_bh,
effectively) is the proper solution (Rusty's unreliable guide to
kernel-locking needs to be next to every developer's keyboard :-)
I also came across this idiom in other places in the networking code
so it seems to be pretty much the standard way. I wish I owned
bluetooth hardware, could've tested this for you myself.

Thanks,
Satyam
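
The "open-coding of spin_lock_bh" ordering mentioned above can be
sketched as a tiny userspace model. This is purely illustrative: the
toy_* helpers are hypothetical stand-ins for local_bh_disable(),
bh_lock_sock_nested(), bh_unlock_sock() and local_bh_enable(), and they
only track state with counters and flags so the ordering can be checked.

```c
#include <assert.h>
#include <stdbool.h>

static int bh_disable_depth;	/* >0 models softirqs disabled on this CPU */
static bool slock_held;		/* models the socket spinlock */

static void toy_local_bh_disable(void)
{
	bh_disable_depth++;
}

static void toy_local_bh_enable(void)
{
	assert(bh_disable_depth > 0);
	bh_disable_depth--;
}

static void toy_bh_lock_sock(void)
{
	assert(!slock_held);
	slock_held = true;
}

static void toy_bh_unlock_sock(void)
{
	assert(slock_held);
	slock_held = false;
}

/* The open-coded sequence: disable softirqs first, re-enable them last. */
static void toy_detach_one(int *hdev_ref)
{
	toy_local_bh_disable();
	toy_bh_lock_sock();

	/* Critical section: softirqs are off and the lock is held. */
	assert(bh_disable_depth > 0 && slock_held);
	*hdev_ref = 0;

	toy_bh_unlock_sock();
	toy_local_bh_enable();
}
```

The unlock calls mirror the lock calls in reverse order, and nothing
inside the critical section blocks.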

2007-05-16 09:29:45

by Jiri Kosina

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Fri, 11 May 2007, Satyam Sharma wrote:

> (later)
> I Googled a bit to see if this problem was faced elsewhere in the kernel
> too. Saw the following commit by Ingo Molnar
> (9883a13c72dbf8c518814b6091019643cdb34429):
> - lock_sock(sock->sk);
> + local_bh_disable();
> + bh_lock_sock_nested(sock->sk);
> rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
> - release_sock(sock->sk);
> + bh_unlock_sock(sock->sk);
> + local_bh_enable();
> Is it _really_ *this* simple?

Hi Satyam,

actually this *seems* to be proper solution also for our case, thanks for
pointing this out. I will think about it once again, do some more tests
with this locking scheme, and will let you know.

Thanks,

--
Jiri Kosina

2007-05-13 09:20:52

by Greg KH

Subject: Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

On Fri, May 11, 2007 at 06:59:31PM +0530, Satyam Sharma wrote:
> [1] This is the first problem point. However, I didn't find any reason
> why this particular driver's .disconnect() couldn't sleep. In fact, a
> comment in include/linux/usb.h:811 says:
>
> "The probe() and disconnect() methods are called in a context where
> they can sleep, but they should avoid abusing the privilege. Most
> work to connect to a device should be done when the device is opened,
> and undone at the last close. The disconnect code needs to address
> concurrency issues with respect to open() and close() methods, as
> well as forcing all pending I/O requests to complete (by unlinking
> them as necessary, and blocking until the unlinks complete)."
>
> I'm assuming the comment is not obsolete, of course, but although the
> first sentence says .disconnect() shouldn't abuse the privilege to
> sleep, the last sentence makes it quite evident that we are _allowed_
> to do so anyway, and that is how things are (with the hci_usb driver,
> at least, I didn't check the .remove() or .disconnect() functions of other
> USB drivers, however).

Yes, this is true, you are running in thread context for .disconnect of
usb drivers, so you can sleep if you need to, but you will block all
other device's disconnect and probe functions while you do. So, it's
good to try to not abuse this if possible.

thanks,

greg k-h

2007-05-11 13:29:31

by Satyam Sharma

Subject: Re: [Bluez-devel] 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

Hi Jiri,

On 4/26/07, Jiri Kosina wrote:
> On Mon, 23 Apr 2007, Jiri Kosina wrote:
>
> > > BUG: sleeping function called from invalid context at net/core/sock.c:1523
> > > in_atomic():1, irqs_disabled():0
> > > 1 lock held by khubd/180:
> > > #0: (old_style_rw_init#2){-.-?}, at: [<f88c5816>] hci_sock_dev_event+0x42/0xc5 [bluetooth]
> [...]
> > OK, this probably started happening since b40df5743. Before that commit,
> > hci_sock_dev_event() used bh_lock_sock() to lock the corresponding
> > struct sock. This was obviously buggy - not deadlock safe against
> > l2cap_connect_cfm() from softirq context. This however introduced
> > another problem - hci_sock_dev_event() is now obviously being triggered
> > (for HCI_DEV_UNREG event, when suspending) in atomic context with

I saw that hci_sock_dev_event() is _always_ triggered in atomic
context. It's the callout for hci_notifier which is defined as an
atomic notifier chain (hence executed in an RCU read section -- and
sleeping inside that would be illegal).

> > preemption disabled. This is what lock_sock_nested() complains about, as
> > it is allowed to sleep inside __lock_sock(), waiting for the lock owner.
> [...]
> Bluetooth: postpone hci_dev unregistration
>
> Commit b40df57 substituted bh_lock_sock() in hci_sock_dev_event() for
> lock_sock() when unregistering HCI device, in order to prevent deadlock
> against locking in l2cap_connect_cfm() from softirq context.

Isn't this a problem faced by other places in the kernel already
(where simply using bh_lock_sock() would potentially deadlock with
another thread)? I wonder what the "recommended" (or at least the
generally used) way to handle such a case is.

> This however introduces another problem - hci_sock_dev_event() for
> HCI_DEV_UNREG can also be triggered in atomic context, in which calling

Actually, I remember going over the hci_sock_dev_event() calling
codepath (in reverse) quite exhaustively, and did not find a
legitimate reason why anybody would want it to be atomic. hci_notify()
has six call sites, and all are sleep-capable, IMO. In the case of
hci_unregister_dev(), for example, what's happening is as follows:

__device_release_driver          (can sleep)
usb_unbind_interface             (-"-)
hci_usb_disconnect [hci_usb]     (can sleep *[1])
hci_unregister_dev [bluetooth]   (-"-)
hci_notify [bluetooth]           (-"-)
atomic_notifier_call_chain       (contains RCU read section)
notifier_call_chain              (therefore, CANNOT SLEEP [2])
hci_sock_dev_event [bluetooth]   (-"-)
lock_sock_nested                 (MIGHT SLEEP *BUG*)
__might_sleep

[1] This is the first problem point. However, I didn't find any reason
why this particular driver's .disconnect() couldn't sleep. In fact, a
comment in include/linux/usb.h:811 says:

"The probe() and disconnect() methods are called in a context where
they can sleep, but they should avoid abusing the privilege. Most
work to connect to a device should be done when the device is opened,
and undone at the last close. The disconnect code needs to address
concurrency issues with respect to open() and close() methods, as
well as forcing all pending I/O requests to complete (by unlinking
them as necessary, and blocking until the unlinks complete)."

I'm assuming the comment is not obsolete, of course, but although the
first sentence says .disconnect() shouldn't abuse the privilege to
sleep, the last sentence makes it quite evident that we are _allowed_
to do so anyway, and that is how things are (with the hci_usb driver,
at least, I didn't check the .remove() or .disconnect() functions of other
USB drivers, however).

[2] This is a bogus (and unnecessary) can-sleep-to-cannot-sleep
transition point, IMO. I had copied Alan Stern in another thread a few
days back, and he wasn't sure why hci_notifier was classified as an
atomic notifier chain (when that classification happened with the new
notifier chains API). I had submitted a patch that merely changed 4
lines in net/bluetooth/hci_core.c to convert hci_notifier to a blocking
notifier chain, but couldn't test as I own no bluetooth hardware myself.

So do we ever really _need_ hci_sock_dev_event() to run in atomic
context at all?
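
To make the argument in [2] concrete, here is a userspace toy model (NOT
kernel code, and not the patch I submitted). Every model_* name and both
counters are made up for illustration; the only assumption is the one
stated above, namely that an atomic chain runs its callbacks in a section
where sleeping is illegal, while a blocking chain runs them in the
caller's context.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace toy model only; every name here is hypothetical. */

static int atomic_depth;     /* > 0 stands for "in atomic context" */
static int sleep_violations; /* may-sleep calls made while atomic */

/* Stands in for might_sleep(): count a violation instead of printing a trace. */
static void model_might_sleep(void)
{
    if (atomic_depth > 0)
        sleep_violations++;
}

/* Stands in for lock_sock(): it may block, so it checks the context first. */
static void model_lock_sock(void)
{
    model_might_sleep();
}

typedef void (*model_notifier_fn)(void);

/* Stands in for hci_sock_dev_event(): the callback that takes the socket lock. */
static void model_sock_dev_event(void)
{
    model_lock_sock();
}

/* Atomic chain: the callback runs inside a no-sleep section. */
static void model_atomic_call_chain(model_notifier_fn fn)
{
    assert(fn != NULL);
    atomic_depth++;
    fn();
    atomic_depth--;
}

/* Blocking chain: the callback runs in the (sleep-capable) caller's context. */
static void model_blocking_call_chain(model_notifier_fn fn)
{
    assert(fn != NULL);
    fn();
}
```

Calling model_sock_dev_event() through model_atomic_call_chain() bumps
sleep_violations, while the same callback through
model_blocking_call_chain() does not, which is the whole point of
questioning the atomic classification.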

> lock_sock() is not safe as it could sleep.
>
> This patch moves the detaching of sockets from hci_device into workqueue,
> so that lock_sock() can be used safely. This requires movement of

I did a workqueue conversion myself, but ran into the following problem:

In the scheduled work function we have:

> + read_lock(&hci_sk_list.lock);
> + sk_for_each(sk, node, &hci_sk_list.head) {
> + lock_sock(sk);

This would still be illegal: we can't sleep while holding an rwlock
(hci_sk_list.lock above), and lock_sock() may sleep. Converting
hci_sk_list.lock to an rwsem is _even_ more problematic, because
hci_send_to_sock() just *cannot* sleep.

> deallocation of hci_dev - deallocating device just after
> hci_unregister_dev() would be too soon, as it could happen before the
> workqueue has been run.

I'd suggest a simpler solution for this: just add a flush_scheduled_work()
call after hci_unregister_dev() but before hci_free_dev() in all those
places. That would be less disruptive.
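
The ordering I have in mind, again as a userspace toy model rather than
kernel code: the model_* names, the fixed-size queue and the "freed" flag
(which stands in for the real deallocation) are all made up for
illustration. Only the unregister / flush / free ordering is taken from
the suggestion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy model only; every name here is hypothetical. */

struct model_dev {
    bool sockets_detached;
    bool freed; /* a flag stands in for the real free, so the model stays safe */
};

typedef void (*model_work_fn)(struct model_dev *);

#define MODEL_MAX_WORK 8
static struct {
    model_work_fn fn;
    struct model_dev *dev;
} model_queue[MODEL_MAX_WORK];
static size_t model_pending;
static int use_after_free; /* deferred work that ran on a freed device */

static void model_schedule_work(model_work_fn fn, struct model_dev *dev)
{
    assert(model_pending < MODEL_MAX_WORK);
    model_queue[model_pending].fn = fn;
    model_queue[model_pending].dev = dev;
    model_pending++;
}

/* Stands in for flush_scheduled_work(): run everything queued, then return. */
static void model_flush_scheduled_work(void)
{
    for (size_t i = 0; i < model_pending; i++)
        model_queue[i].fn(model_queue[i].dev);
    model_pending = 0;
}

/* The deferred work: detach sockets from the device. */
static void model_detach_sockets(struct model_dev *dev)
{
    if (dev->freed)
        use_after_free++;
    dev->sockets_detached = true;
}

/* Stands in for hci_unregister_dev(): only queues the detach work. */
static void model_unregister_dev(struct model_dev *dev)
{
    model_schedule_work(model_detach_sockets, dev);
}

/* Stands in for hci_free_dev(). */
static void model_free_dev(struct model_dev *dev)
{
    dev->freed = true;
}
```

With unregister, flush, free in that order, the detach work always sees a
live device; swapping flush and free makes the work run on a freed one,
which is exactly the "too soon" case described in the patch text.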

So this is quite an interesting problem indeed, but I can't help thinking
that it must be faced elsewhere in the kernel (by other users of lock_sock)
too. CC'ing netdev@, for any ideas.

(later)

I Googled a bit to see if this problem was faced elsewhere in the kernel
too. Saw the following commit by Ingo Molnar
(9883a13c72dbf8c518814b6091019643cdb34429):

- lock_sock(sock->sk);
+ local_bh_disable();
+ bh_lock_sock_nested(sock->sk);
rc = selinux_netlbl_socket_setsid(sock, sksec->sid);
- release_sock(sock->sk);
+ bh_unlock_sock(sock->sk);
+ local_bh_enable();

Is it _really_ *this* simple?

Satyam
