2014-04-08 19:47:53

by Belisko Marek

[permalink] [raw]
Subject: Re: w1: 3.14-rc7 - possible recursive locking detected

Hi,

On Mon, Mar 24, 2014 at 1:01 PM, Evgeniy Polyakov <[email protected]> wrote:
> Hi everyone
>
> 24.03.2014, 01:50, "David Fries" <[email protected]>:
>> On Mon, Mar 17, 2014 at 10:38:13PM +0100, Belisko Marek wrote:
>>> booting latest 3.14-rc7 on gta04 board gives following warning:
>>>
>>> [ 3.101409] Driver for 1-wire Dallas network protocol.
>>> [ 3.109649] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
>>> [ 3.400146] CPU0
>>> [ 3.402709] ----
>>> [ 3.405273] lock(&(&priv->bus_notifier)->rwsem);
>>> [ 3.410308] lock(&(&priv->bus_notifier)->rwsem);
>>> [ 3.415374]
>>> [ 3.415374] *** DEADLOCK ***
>>>
>>> AFAIK we didn't see that on (at least 3/14-rc2).
>>
>> The last drivers/w1 change went in before v3.14-rc1, so if it is
>> something in the w1 code, it has either been there or something else
>> in the rest of the kernel changed to expose it. I'm using the ds2490
>> USB w1 dongle, I didn't see that problem when enabling lock debugging
>> on 3.14.0-rc5+ which includes some changes I'm working on.
>>
>> https://github.com/dfries/linux.git w1_rework
>
> Can it be a bug in omap w1 driver?
We test with actual Linus master and it's even worse (it hangs completely).
Trace look little bit different and it only happens when booting via
device tree (not with board file).

[ 2.807434] Driver for 1-wire Dallas network protocol.
[ 2.814117] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
Interrupt mode
[ 3.034271] call_modprobe: w1-family-0x1
[ 3.051818]
[ 3.053405] =============================================
[ 3.059051] [ INFO: possible recursive locking detected ]
[ 3.064727] 3.14.0-gta04 #556 Not tainted
[ 3.068939] ---------------------------------------------
[ 3.074615] w1_bus_master1/795 is trying to acquire lock:
[ 3.080261] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
[<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
[ 3.090911]
[ 3.090911] but task is already holding lock:
[ 3.097045] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
[<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
[ 3.107666]
[ 3.107666] other info that might help us debug this:
[ 3.114501] Possible unsafe locking scenario:
[ 3.114501]
[ 3.120727] CPU0
[ 3.123291] ----
[ 3.125854] lock(&(&priv->bus_notifier)->rwsem);
[ 3.130889] lock(&(&priv->bus_notifier)->rwsem);
[ 3.135925]
[ 3.135925] *** DEADLOCK ***
[ 3.135925]
[ 3.142150] May be due to missing lock nesting notation
[ 3.142150]
[ 3.149291] 2 locks held by w1_bus_master1/795:
[ 3.154052] #0: (&dev->mutex#3){+.+.+.}, at: [<c03ba1c8>]
w1_attach_slave_device+0xc4/0x1c8
[ 3.163055] #1: (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
[<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
[ 3.174163]
[ 3.174163] stack backtrace:
[ 3.178741] CPU: 0 PID: 795 Comm: w1_bus_master1 Not tainted
3.14.0-gta04 #556
[ 3.186370] [<c0013ca0>] (unwind_backtrace) from [<c0010f68>]
(show_stack+0x10/0x14)
[ 3.194488] [<c0010f68>] (show_stack) from [<c04f0900>]
(dump_stack+0x6c/0x88)
[ 3.202117] [<c04f0900>] (dump_stack) from [<c00746bc>]
(print_deadlock_bug+0xc0/0xf0)
[ 3.210449] [<c00746bc>] (print_deadlock_bug) from [<c0075f80>]
(validate_chain.isra.29+0x4dc/0x534)
[ 3.220031] [<c0075f80>] (validate_chain.isra.29) from [<c0076dac>]
(__lock_acquire+0x728/0x834)
[ 3.229248] [<c0076dac>] (__lock_acquire) from [<c00774b8>]
(lock_acquire+0xf4/0x118)
[ 3.237487] [<c00774b8>] (lock_acquire) from [<c04f8e04>]
(down_read+0x24/0x34)
[ 3.245178] [<c04f8e04>] (down_read) from [<c0058ab0>]
(__blocking_notifier_call_chain+0x28/0x58)
[ 3.254516] [<c0058ab0>] (__blocking_notifier_call_chain) from
[<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
[ 3.265563] [<c0058af4>] (blocking_notifier_call_chain) from
[<c0301db8>] (device_add+0x1f4/0x368)
[ 3.274993] [<c0301db8>] (device_add) from [<c03055cc>]
(platform_device_add+0x138/0x1c8)
[ 3.283569] [<c03055cc>] (platform_device_add) from [<c03bd124>]
(w1_bq27000_add_slave+0x44/0x74)
[ 3.292907] [<c03bd124>] (w1_bq27000_add_slave) from [<c03b9ed8>]
(w1_bus_notify.part.0+0x48/0xc8)
[ 3.302337] [<c03b9ed8>] (w1_bus_notify.part.0) from [<c04fbe1c>]
(notifier_call_chain+0x38/0x68)
[ 3.311645] [<c04fbe1c>] (notifier_call_chain) from [<c0058acc>]
(__blocking_notifier_call_chain+0x44/0x58)
[ 3.321899] [<c0058acc>] (__blocking_notifier_call_chain) from
[<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
[ 3.332946] [<c0058af4>] (blocking_notifier_call_chain) from
[<c0301db8>] (device_add+0x1f4/0x368)
[ 3.342346] [<c0301db8>] (device_add) from [<c03b9ad0>]
(__w1_attach_slave_device+0x9c/0x134)
[ 3.351318] [<c03b9ad0>] (__w1_attach_slave_device) from
[<c03ba238>] (w1_attach_slave_device+0x134/0x1c8)
[ 3.361480] [<c03ba238>] (w1_attach_slave_device) from [<c03ba348>]
(w1_slave_found+0x7c/0x98)
[ 3.370513] [<c03ba348>] (w1_slave_found) from [<c03bc818>]
(omap_w1_search_bus+0x54/0x5c)
[ 3.379211] [<c03bc818>] (omap_w1_search_bus) from [<c03bc758>]
(w1_search_devices+0x3c/0x48)
[ 3.388153] [<c03bc758>] (w1_search_devices) from [<c03baaec>]
(w1_search_process_cb+0x64/0x108)
[ 3.397399] [<c03baaec>] (w1_search_process_cb) from [<c03bac60>]
(w1_process+0x68/0x14c)
[ 3.405975] [<c03bac60>] (w1_process) from [<c0054bdc>] (kthread+0xdc/0xf0)
[ 3.413299] [<c0054bdc>] (kthread) from [<c000dca8>]
(ret_from_fork+0x14/0x2c)
[ 24.451049] INFO: rcu_sched detected stalls on CPUs/tasks: {}
(detected by 0, t=2102 jiffies, g=4294967086, c=4294967085, q=19)
[ 24.463134] INFO: Stall ended before state dump start
[ 87.501037] INFO: rcu_sched detected stalls on CPUs/tasks: {}
(detected by 0, t=8407 jiffies, g=4294967086, c=4294967085, q=19)
[ 87.513122] INFO: Stall ended before state dump start
[ 150.551055] INFO: rcu_sched detected stalls on CPUs/tasks: {}
(detected by 0, t=14712 jiffies, g=4294967086, c=4294967085, q=19)
[ 150.563232] INFO: Stall ended before state dump start
[ 213.601043] INFO: rcu_sched detected stalls on CPUs/tasks: {}
(detected by 0, t=21017 jiffies, g=4294967086, c=4294967085, q=19)
[ 213.613220] INFO: Stall ended before state dump start
[ 276.651062] INFO: rcu_sched detected stalls on CPUs/tasks: {}
(detected by 0, t=27322 jiffies, g=4294967086, c=4294967085, q=19)
[ 276.663238] INFO: Stall ended before state dump start

BR,

marek

--
as simple and primitive as possible
-------------------------------------------------
Marek Belisko - OPEN-NANDRA
Freelance Developer

Ruska Nova Ves 219 | Presov, 08005 Slovak Republic
Tel: +421 915 052 184
skype: marekwhite
twitter: #opennandra
web: http://open-nandra.com


2014-04-13 22:24:24

by David Fries

[permalink] [raw]
Subject: Re: w1: 3.14-rc7 - possible recursive locking detected

Belisko Marek,
Here is a possible solution, could you give it a try and report back?

Greg Kroah-Hartman,
Evgeniy asked me to look into this report. I don't have the
reporter's hardware configuration, but I wouldn't think that would be
needed, just some w1 bus master (even W1_MASTER_GPIO might work), then
loading the slave device and manually adding a slave device with that
family id. Even then I didn't reproduce the reported recursive
locking error. I saw unrelated locking reports, but not this one. I
wrote up the included patch, but that undoes the notify changes that
you added earlier in commit 47eba33a0997fc7, and I wanted to ask about
that commit. Specifically these two lines,

err = device_register(&sl->dev);
...
+ dev_set_uevent_suppress(&sl->dev, false);
+ kobject_uevent(&sl->dev.kobj, KOBJ_ADD);

Wouldn't the default be to not suppress? Nothing in the W1 system
enables suppressing so is that even needed? (I'm fine with saying
it's a good idea).
device_register at some point must call device_add and device_add
calls kobject_uevent(&dev->kobj, KOBJ_ADD);
so doesn't the KOBJ_ADD send the add a second time? As in it
shouldn't be needed?
Can the suppress be called before device_register to avoid the
automatic notify, then after it returns setup the slave device as this
patch does to avoid this problem report, and then call the KOBJ_ADD to
make everything happy?

I'm based on commit ce7613db2d8d4d5af2587a.

On Tue, Apr 08, 2014 at 09:47:50PM +0200, Belisko Marek wrote:
> Hi,
>
> On Mon, Mar 24, 2014 at 1:01 PM, Evgeniy Polyakov <[email protected]> wrote:
> > Hi everyone
> >
> > 24.03.2014, 01:50, "David Fries" <[email protected]>:
> >> On Mon, Mar 17, 2014 at 10:38:13PM +0100, Belisko Marek wrote:
> >>> booting latest 3.14-rc7 on gta04 board gives following warning:
> >>>
> >>> [ 3.101409] Driver for 1-wire Dallas network protocol.
> >>> [ 3.109649] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
> >>> [ 3.400146] CPU0
> >>> [ 3.402709] ----
> >>> [ 3.405273] lock(&(&priv->bus_notifier)->rwsem);
> >>> [ 3.410308] lock(&(&priv->bus_notifier)->rwsem);
> >>> [ 3.415374]
> >>> [ 3.415374] *** DEADLOCK ***
> >>>
> >>> AFAIK we didn't see that on (at least 3/14-rc2).
> >>
> >> The last drivers/w1 change went in before v3.14-rc1, so if it is
> >> something in the w1 code, it has either been there or something else
> >> in the rest of the kernel changed to expose it. I'm using the ds2490
> >> USB w1 dongle, I didn't see that problem when enabling lock debugging
> >> on 3.14.0-rc5+ which includes some changes I'm working on.
> >>
> >> https://github.com/dfries/linux.git w1_rework
> >
> > Can it be a bug in omap w1 driver?
> We test with actual Linus master and it's even worse (it hangs completely).
> Trace look little bit different and it only happens when booting via
> device tree (not with board file).
>
> [ 2.807434] Driver for 1-wire Dallas network protocol.
> [ 2.814117] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
> Interrupt mode
> [ 3.034271] call_modprobe: w1-family-0x1
> [ 3.051818]
> [ 3.053405] =============================================
> [ 3.059051] [ INFO: possible recursive locking detected ]
> [ 3.064727] 3.14.0-gta04 #556 Not tainted
> [ 3.068939] ---------------------------------------------
> [ 3.074615] w1_bus_master1/795 is trying to acquire lock:
> [ 3.080261] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
> [ 3.090911]
> [ 3.090911] but task is already holding lock:
> [ 3.097045] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
> [ 3.107666]
> [ 3.107666] other info that might help us debug this:
> [ 3.114501] Possible unsafe locking scenario:
> [ 3.114501]
> [ 3.120727] CPU0
> [ 3.123291] ----
> [ 3.125854] lock(&(&priv->bus_notifier)->rwsem);
> [ 3.130889] lock(&(&priv->bus_notifier)->rwsem);
> [ 3.135925]
> [ 3.135925] *** DEADLOCK ***
> [ 3.135925]
> [ 3.142150] May be due to missing lock nesting notation
> [ 3.142150]
> [ 3.149291] 2 locks held by w1_bus_master1/795:
> [ 3.154052] #0: (&dev->mutex#3){+.+.+.}, at: [<c03ba1c8>]
> w1_attach_slave_device+0xc4/0x1c8
> [ 3.163055] #1: (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
> [ 3.174163]
> [ 3.174163] stack backtrace:
> [ 3.178741] CPU: 0 PID: 795 Comm: w1_bus_master1 Not tainted
> 3.14.0-gta04 #556
> [ 3.186370] [<c0013ca0>] (unwind_backtrace) from [<c0010f68>]
> (show_stack+0x10/0x14)
> [ 3.194488] [<c0010f68>] (show_stack) from [<c04f0900>]
> (dump_stack+0x6c/0x88)
> [ 3.202117] [<c04f0900>] (dump_stack) from [<c00746bc>]
> (print_deadlock_bug+0xc0/0xf0)
> [ 3.210449] [<c00746bc>] (print_deadlock_bug) from [<c0075f80>]
> (validate_chain.isra.29+0x4dc/0x534)
> [ 3.220031] [<c0075f80>] (validate_chain.isra.29) from [<c0076dac>]
> (__lock_acquire+0x728/0x834)
> [ 3.229248] [<c0076dac>] (__lock_acquire) from [<c00774b8>]
> (lock_acquire+0xf4/0x118)
> [ 3.237487] [<c00774b8>] (lock_acquire) from [<c04f8e04>]
> (down_read+0x24/0x34)
> [ 3.245178] [<c04f8e04>] (down_read) from [<c0058ab0>]
> (__blocking_notifier_call_chain+0x28/0x58)
> [ 3.254516] [<c0058ab0>] (__blocking_notifier_call_chain) from
> [<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
> [ 3.265563] [<c0058af4>] (blocking_notifier_call_chain) from
> [<c0301db8>] (device_add+0x1f4/0x368)
> [ 3.274993] [<c0301db8>] (device_add) from [<c03055cc>]
> (platform_device_add+0x138/0x1c8)
> [ 3.283569] [<c03055cc>] (platform_device_add) from [<c03bd124>]
> (w1_bq27000_add_slave+0x44/0x74)
> [ 3.292907] [<c03bd124>] (w1_bq27000_add_slave) from [<c03b9ed8>]
> (w1_bus_notify.part.0+0x48/0xc8)
> [ 3.302337] [<c03b9ed8>] (w1_bus_notify.part.0) from [<c04fbe1c>]
> (notifier_call_chain+0x38/0x68)
> [ 3.311645] [<c04fbe1c>] (notifier_call_chain) from [<c0058acc>]
> (__blocking_notifier_call_chain+0x44/0x58)
> [ 3.321899] [<c0058acc>] (__blocking_notifier_call_chain) from
> [<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
> [ 3.332946] [<c0058af4>] (blocking_notifier_call_chain) from
> [<c0301db8>] (device_add+0x1f4/0x368)
> [ 3.342346] [<c0301db8>] (device_add) from [<c03b9ad0>]
> (__w1_attach_slave_device+0x9c/0x134)
> [ 3.351318] [<c03b9ad0>] (__w1_attach_slave_device) from
> [<c03ba238>] (w1_attach_slave_device+0x134/0x1c8)
> [ 3.361480] [<c03ba238>] (w1_attach_slave_device) from [<c03ba348>]
> (w1_slave_found+0x7c/0x98)
> [ 3.370513] [<c03ba348>] (w1_slave_found) from [<c03bc818>]
> (omap_w1_search_bus+0x54/0x5c)
> [ 3.379211] [<c03bc818>] (omap_w1_search_bus) from [<c03bc758>]
> (w1_search_devices+0x3c/0x48)
> [ 3.388153] [<c03bc758>] (w1_search_devices) from [<c03baaec>]
> (w1_search_process_cb+0x64/0x108)
> [ 3.397399] [<c03baaec>] (w1_search_process_cb) from [<c03bac60>]
> (w1_process+0x68/0x14c)
> [ 3.405975] [<c03bac60>] (w1_process) from [<c0054bdc>] (kthread+0xdc/0xf0)
> [ 3.413299] [<c0054bdc>] (kthread) from [<c000dca8>]
> (ret_from_fork+0x14/0x2c)
> [ 24.451049] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=2102 jiffies, g=4294967086, c=4294967085, q=19)
> [ 24.463134] INFO: Stall ended before state dump start
> [ 87.501037] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=8407 jiffies, g=4294967086, c=4294967085, q=19)
> [ 87.513122] INFO: Stall ended before state dump start
> [ 150.551055] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=14712 jiffies, g=4294967086, c=4294967085, q=19)
> [ 150.563232] INFO: Stall ended before state dump start
> [ 213.601043] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=21017 jiffies, g=4294967086, c=4294967085, q=19)
> [ 213.613220] INFO: Stall ended before state dump start
> [ 276.651062] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=27322 jiffies, g=4294967086, c=4294967085, q=19)
> [ 276.663238] INFO: Stall ended before state dump start
>
> BR,
>
> marek
>
> --
> as simple and primitive as possible
> -------------------------------------------------
> Marek Belisko - OPEN-NANDRA
> Freelance Developer
>
> Ruska Nova Ves 219 | Presov, 08005 Slovak Republic
> Tel: +421 915 052 184
> skype: marekwhite
> twitter: #opennandra
> web: http://open-nandra.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
David Fries <[email protected]> PGP pub CB1EE8F0
http://fries.net/~david/


commit b3afc24a3ca6c261a8fe2c08b5bbccb51233c09a
Author: David Fries <[email protected]>
Date: Sun Apr 13 15:36:45 2014 -0500

w1: avoid recursive device_add

__w1_attach_slave_device calls device_add which calls w1_bus_notify
which calls for the w1_bq27000 slave driver, which calls
platform_device_add and device_add and deadlocks on getting
&(&priv->bus_notifier)->rwsem as it is still held in the previous
device_add. This avoids the problem by processing the family
add/remove outside of the slave device_add call. This does mean that
the slave device isn't setup when the notification that a slave is
available goes out.

Reported-by: Belisko Marek <[email protected]>

diff --git a/drivers/w1/w1.c b/drivers/w1/w1.c
index b96f61b..dde09c1 100644
--- a/drivers/w1/w1.c
+++ b/drivers/w1/w1.c
@@ -614,27 +614,11 @@ end:
return err;
}

-/*
- * Handle sysfs file creation and removal here, before userspace is told that
- * the device is added / removed from the system
- */
-static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
- void *data)
+static int w1_family_notify(unsigned long action, struct w1_slave *sl)
{
- struct device *dev = data;
- struct w1_slave *sl;
struct w1_family_ops *fops;
int err;

- /*
- * Only care about slave devices at the moment. Yes, we should use a
- * separate "type" for this, but for now, look at the release function
- * to know which type it is...
- */
- if (dev->release != w1_slave_release)
- return 0;
-
- sl = dev_to_w1_slave(dev);
fops = sl->family->fops;

if (!fops)
@@ -673,10 +657,6 @@ static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
return 0;
}

-static struct notifier_block w1_bus_nb = {
- .notifier_call = w1_bus_notify,
-};
-
static int __w1_attach_slave_device(struct w1_slave *sl)
{
int err;
@@ -705,6 +685,7 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
dev_name(&sl->dev), err);
return err;
}
+ w1_family_notify(BUS_NOTIFY_ADD_DEVICE, sl);


dev_set_uevent_suppress(&sl->dev, false);
@@ -799,6 +780,7 @@ int w1_unref_slave(struct w1_slave *sl)
msg.type = W1_SLAVE_REMOVE;
w1_netlink_send(sl->master, &msg);

+ w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
device_unregister(&sl->dev);
#ifdef DEBUG
memset(sl, 0, sizeof(*sl));
@@ -1186,10 +1168,6 @@ static int __init w1_init(void)
goto err_out_exit_init;
}

- retval = bus_register_notifier(&w1_bus_type, &w1_bus_nb);
- if (retval)
- goto err_out_bus_unregister;
-
retval = driver_register(&w1_master_driver);
if (retval) {
printk(KERN_ERR

2014-04-14 13:10:19

by H. Nikolaus Schaller

[permalink] [raw]
Subject: Re: w1: 3.14-rc7 - possible recursive locking detected

Hi David,
I have tested and it appears to suppress the locking issue on our
GTA04 board.

BR,
Nikolaus

Am 14.04.2014 um 00:24 schrieb David Fries:

> Belisko Marek,
> Here is a possible solution, could you give it a try and report back?
>
> Greg Kroah-Hartman,
> Evgeniy asked me to look into this report. I don't have the
> reporter's hardware configuration, but I wouldn't think that would be
> needed, just some w1 bus master (even W1_MASTER_GPIO might work), then
> loading the slave device and manually adding a slave device with that
> family id. Even then I didn't reproduce the reported recursive
> locking error. I saw unrelated locking reports, but not this one. I
> wrote up the included patch, but that undoes the notify changes that
> you added earlier in commit 47eba33a0997fc7, and I wanted to ask about
> that commit. Specifically these two lines,
>
> err = device_register(&sl->dev);
> ...
> + dev_set_uevent_suppress(&sl->dev, false);
> + kobject_uevent(&sl->dev.kobj, KOBJ_ADD);
>
> Wouldn't the default be to not suppress? Nothing in the W1 system
> enables suppressing so is that even needed? (I'm fine with saying
> it's a good idea).
> device_register at some point must call device_add and device_add
> calls kobject_uevent(&dev->kobj, KOBJ_ADD);
> so doesn't the KOBJ_ADD send the add a second time? As in it
> shouldn't be needed?
> Can the suppress be called before device_register to avoid the
> automatic notify, then after it returns setup the slave device as this
> patch does to avoid this problem report, and then call the KOBJ_ADD to
> make everything happy?
>
> I'm based on commit ce7613db2d8d4d5af2587a.
>
> On Tue, Apr 08, 2014 at 09:47:50PM +0200, Belisko Marek wrote:
>> Hi,
>>
>> On Mon, Mar 24, 2014 at 1:01 PM, Evgeniy Polyakov <[email protected]> wrote:
>>> Hi everyone
>>>
>>> 24.03.2014, 01:50, "David Fries" <[email protected]>:
>>>> On Mon, Mar 17, 2014 at 10:38:13PM +0100, Belisko Marek wrote:
>>>>> booting latest 3.14-rc7 on gta04 board gives following warning:
>>>>>
>>>>> [ 3.101409] Driver for 1-wire Dallas network protocol.
>>>>> [ 3.109649] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
>>>>> [ 3.400146] CPU0
>>>>> [ 3.402709] ----
>>>>> [ 3.405273] lock(&(&priv->bus_notifier)->rwsem);
>>>>> [ 3.410308] lock(&(&priv->bus_notifier)->rwsem);
>>>>> [ 3.415374]
>>>>> [ 3.415374] *** DEADLOCK ***
>>>>>
>>>>> AFAIK we didn't see that on (at least 3/14-rc2).
>>>>
>>>> The last drivers/w1 change went in before v3.14-rc1, so if it is
>>>> something in the w1 code, it has either been there or something else
>>>> in the rest of the kernel changed to expose it. I'm using the ds2490
>>>> USB w1 dongle, I didn't see that problem when enabling lock debugging
>>>> on 3.14.0-rc5+ which includes some changes I'm working on.
>>>>
>>>> https://github.com/dfries/linux.git w1_rework
>>>
>>> Can it be a bug in omap w1 driver?
>> We test with actual Linus master and it's even worse (it hangs completely).
>> Trace look little bit different and it only happens when booting via
>> device tree (not with board file).
>>
>> [ 2.807434] Driver for 1-wire Dallas network protocol.
>> [ 2.814117] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
>> Interrupt mode
>> [ 3.034271] call_modprobe: w1-family-0x1
>> [ 3.051818]
>> [ 3.053405] =============================================
>> [ 3.059051] [ INFO: possible recursive locking detected ]
>> [ 3.064727] 3.14.0-gta04 #556 Not tainted
>> [ 3.068939] ---------------------------------------------
>> [ 3.074615] w1_bus_master1/795 is trying to acquire lock:
>> [ 3.080261] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
>> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
>> [ 3.090911]
>> [ 3.090911] but task is already holding lock:
>> [ 3.097045] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
>> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
>> [ 3.107666]
>> [ 3.107666] other info that might help us debug this:
>> [ 3.114501] Possible unsafe locking scenario:
>> [ 3.114501]
>> [ 3.120727] CPU0
>> [ 3.123291] ----
>> [ 3.125854] lock(&(&priv->bus_notifier)->rwsem);
>> [ 3.130889] lock(&(&priv->bus_notifier)->rwsem);
>> [ 3.135925]
>> [ 3.135925] *** DEADLOCK ***
>> [ 3.135925]
>> [ 3.142150] May be due to missing lock nesting notation
>> [ 3.142150]
>> [ 3.149291] 2 locks held by w1_bus_master1/795:
>> [ 3.154052] #0: (&dev->mutex#3){+.+.+.}, at: [<c03ba1c8>]
>> w1_attach_slave_device+0xc4/0x1c8
>> [ 3.163055] #1: (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
>> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
>> [ 3.174163]
>> [ 3.174163] stack backtrace:
>> [ 3.178741] CPU: 0 PID: 795 Comm: w1_bus_master1 Not tainted
>> 3.14.0-gta04 #556
>> [ 3.186370] [<c0013ca0>] (unwind_backtrace) from [<c0010f68>]
>> (show_stack+0x10/0x14)
>> [ 3.194488] [<c0010f68>] (show_stack) from [<c04f0900>]
>> (dump_stack+0x6c/0x88)
>> [ 3.202117] [<c04f0900>] (dump_stack) from [<c00746bc>]
>> (print_deadlock_bug+0xc0/0xf0)
>> [ 3.210449] [<c00746bc>] (print_deadlock_bug) from [<c0075f80>]
>> (validate_chain.isra.29+0x4dc/0x534)
>> [ 3.220031] [<c0075f80>] (validate_chain.isra.29) from [<c0076dac>]
>> (__lock_acquire+0x728/0x834)
>> [ 3.229248] [<c0076dac>] (__lock_acquire) from [<c00774b8>]
>> (lock_acquire+0xf4/0x118)
>> [ 3.237487] [<c00774b8>] (lock_acquire) from [<c04f8e04>]
>> (down_read+0x24/0x34)
>> [ 3.245178] [<c04f8e04>] (down_read) from [<c0058ab0>]
>> (__blocking_notifier_call_chain+0x28/0x58)
>> [ 3.254516] [<c0058ab0>] (__blocking_notifier_call_chain) from
>> [<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
>> [ 3.265563] [<c0058af4>] (blocking_notifier_call_chain) from
>> [<c0301db8>] (device_add+0x1f4/0x368)
>> [ 3.274993] [<c0301db8>] (device_add) from [<c03055cc>]
>> (platform_device_add+0x138/0x1c8)
>> [ 3.283569] [<c03055cc>] (platform_device_add) from [<c03bd124>]
>> (w1_bq27000_add_slave+0x44/0x74)
>> [ 3.292907] [<c03bd124>] (w1_bq27000_add_slave) from [<c03b9ed8>]
>> (w1_bus_notify.part.0+0x48/0xc8)
>> [ 3.302337] [<c03b9ed8>] (w1_bus_notify.part.0) from [<c04fbe1c>]
>> (notifier_call_chain+0x38/0x68)
>> [ 3.311645] [<c04fbe1c>] (notifier_call_chain) from [<c0058acc>]
>> (__blocking_notifier_call_chain+0x44/0x58)
>> [ 3.321899] [<c0058acc>] (__blocking_notifier_call_chain) from
>> [<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
>> [ 3.332946] [<c0058af4>] (blocking_notifier_call_chain) from
>> [<c0301db8>] (device_add+0x1f4/0x368)
>> [ 3.342346] [<c0301db8>] (device_add) from [<c03b9ad0>]
>> (__w1_attach_slave_device+0x9c/0x134)
>> [ 3.351318] [<c03b9ad0>] (__w1_attach_slave_device) from
>> [<c03ba238>] (w1_attach_slave_device+0x134/0x1c8)
>> [ 3.361480] [<c03ba238>] (w1_attach_slave_device) from [<c03ba348>]
>> (w1_slave_found+0x7c/0x98)
>> [ 3.370513] [<c03ba348>] (w1_slave_found) from [<c03bc818>]
>> (omap_w1_search_bus+0x54/0x5c)
>> [ 3.379211] [<c03bc818>] (omap_w1_search_bus) from [<c03bc758>]
>> (w1_search_devices+0x3c/0x48)
>> [ 3.388153] [<c03bc758>] (w1_search_devices) from [<c03baaec>]
>> (w1_search_process_cb+0x64/0x108)
>> [ 3.397399] [<c03baaec>] (w1_search_process_cb) from [<c03bac60>]
>> (w1_process+0x68/0x14c)
>> [ 3.405975] [<c03bac60>] (w1_process) from [<c0054bdc>] (kthread+0xdc/0xf0)
>> [ 3.413299] [<c0054bdc>] (kthread) from [<c000dca8>]
>> (ret_from_fork+0x14/0x2c)
>> [ 24.451049] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>> (detected by 0, t=2102 jiffies, g=4294967086, c=4294967085, q=19)
>> [ 24.463134] INFO: Stall ended before state dump start
>> [ 87.501037] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>> (detected by 0, t=8407 jiffies, g=4294967086, c=4294967085, q=19)
>> [ 87.513122] INFO: Stall ended before state dump start
>> [ 150.551055] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>> (detected by 0, t=14712 jiffies, g=4294967086, c=4294967085, q=19)
>> [ 150.563232] INFO: Stall ended before state dump start
>> [ 213.601043] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>> (detected by 0, t=21017 jiffies, g=4294967086, c=4294967085, q=19)
>> [ 213.613220] INFO: Stall ended before state dump start
>> [ 276.651062] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>> (detected by 0, t=27322 jiffies, g=4294967086, c=4294967085, q=19)
>> [ 276.663238] INFO: Stall ended before state dump start
>>
>> BR,
>>
>> marek
>>
>> --
>> as simple and primitive as possible
>> -------------------------------------------------
>> Marek Belisko - OPEN-NANDRA
>> Freelance Developer
>>
>> Ruska Nova Ves 219 | Presov, 08005 Slovak Republic
>> Tel: +421 915 052 184
>> skype: marekwhite
>> twitter: #opennandra
>> web: http://open-nandra.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>
> --
> David Fries <[email protected]> PGP pub CB1EE8F0
> http://fries.net/~david/
>
>
> commit b3afc24a3ca6c261a8fe2c08b5bbccb51233c09a
> Author: David Fries <[email protected]>
> Date: Sun Apr 13 15:36:45 2014 -0500
>
> w1: avoid recursive device_add
>
> __w1_attach_slave_device calls device_add which calls w1_bus_notify
> which calls for the w1_bq27000 slave driver, which calls
> platform_device_add and device_add and deadlocks on getting
> &(&priv->bus_notifier)->rwsem as it is still held in the previous
> device_add. This avoids the problem by processing the family
> add/remove outside of the slave device_add call. This does mean that
> the slave device isn't setup when the notification that a slave is
> available goes out.
>
> Reported-by: Belisko Marek <[email protected]>
>
> diff --git a/drivers/w1/w1.c b/drivers/w1/w1.c
> index b96f61b..dde09c1 100644
> --- a/drivers/w1/w1.c
> +++ b/drivers/w1/w1.c
> @@ -614,27 +614,11 @@ end:
> return err;
> }
>
> -/*
> - * Handle sysfs file creation and removal here, before userspace is told that
> - * the device is added / removed from the system
> - */
> -static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
> - void *data)
> +static int w1_family_notify(unsigned long action, struct w1_slave *sl)
> {
> - struct device *dev = data;
> - struct w1_slave *sl;
> struct w1_family_ops *fops;
> int err;
>
> - /*
> - * Only care about slave devices at the moment. Yes, we should use a
> - * separate "type" for this, but for now, look at the release function
> - * to know which type it is...
> - */
> - if (dev->release != w1_slave_release)
> - return 0;
> -
> - sl = dev_to_w1_slave(dev);
> fops = sl->family->fops;
>
> if (!fops)
> @@ -673,10 +657,6 @@ static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
> return 0;
> }
>
> -static struct notifier_block w1_bus_nb = {
> - .notifier_call = w1_bus_notify,
> -};
> -
> static int __w1_attach_slave_device(struct w1_slave *sl)
> {
> int err;
> @@ -705,6 +685,7 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
> dev_name(&sl->dev), err);
> return err;
> }
> + w1_family_notify(BUS_NOTIFY_ADD_DEVICE, sl);
>
>
> dev_set_uevent_suppress(&sl->dev, false);
> @@ -799,6 +780,7 @@ int w1_unref_slave(struct w1_slave *sl)
> msg.type = W1_SLAVE_REMOVE;
> w1_netlink_send(sl->master, &msg);
>
> + w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
> device_unregister(&sl->dev);
> #ifdef DEBUG
> memset(sl, 0, sizeof(*sl));
> @@ -1186,10 +1168,6 @@ static int __init w1_init(void)
> goto err_out_exit_init;
> }
>
> - retval = bus_register_notifier(&w1_bus_type, &w1_bus_nb);
> - if (retval)
> - goto err_out_bus_unregister;
> -
> retval = driver_register(&w1_master_driver);
> if (retval) {
> printk(KERN_ERR

2014-04-14 18:42:05

by Greg KH

[permalink] [raw]
Subject: Re: w1: 3.14-rc7 - possible recursive locking detected

On Sun, Apr 13, 2014 at 05:24:01PM -0500, David Fries wrote:
> Belisko Marek,
> Here is a possible solution, could you give it a try and report back?
>
> Greg Kroah-Hartman,
> Evgeniy asked me to look into this report. I don't have the
> reporter's hardware configuration, but I wouldn't think that would be
> needed, just some w1 bus master (even W1_MASTER_GPIO might work), then
> loading the slave device and manually adding a slave device with that
> family id. Even then I didn't reproduce the reported recursive
> locking error. I saw unrelated locking reports, but not this one. I
> wrote up the included patch, but that undoes the notify changes that
> you added earlier in commit 47eba33a0997fc7, and I wanted to ask about
> that commit. Specifically these two lines,
>
> err = device_register(&sl->dev);
> ...
> + dev_set_uevent_suppress(&sl->dev, false);
> + kobject_uevent(&sl->dev.kobj, KOBJ_ADD);
>
> Wouldn't the default be to not suppress? Nothing in the W1 system
> enables suppressing so is that even needed? (I'm fine with saying
> it's a good idea).
> device_register at some point must call device_add and device_add
> calls kobject_uevent(&dev->kobj, KOBJ_ADD);
> so doesn't the KOBJ_ADD send the add a second time? As in it
> shouldn't be needed?
> Can the suppress be called before device_register to avoid the
> automatic notify, then after it returns setup the slave device as this
> patch does to avoid this problem report, and then call the KOBJ_ADD to
> make everything happy?

I really have no idea, if your fix resolves an issue, that's great, I'll
be glad to take it. I have no w1 devices to test any of this, and don't
even remember writing that kernel patch :)

thanks,

greg k-h

2014-04-16 06:21:53

by David Fries

[permalink] [raw]
Subject: [PATCH] w1: avoid recursive device_add

__w1_attach_slave_device calls device_add which calls w1_bus_notify
which calls the w1_bq27000 slave driver, which calls
platform_device_add and device_add and deadlocks on getting
&(&priv->bus_notifier)->rwsem as it is still held in the previous
device_add. This avoids the problem by processing the family
add/remove outside of the slave device_add call.

Commit 47eba33a0997fc7362a introduced this deadlock and added
a KOBJ_ADD, as the add was already reported in device_register two add
events were being sent. This change suppresses the device_register
add so that any slave device sysfs entries are setup before the add
goes out.

Belisko Marek reported this change fixed the deadlock he was seeing on
ARM device tree, while testing on my x86-64 system never saw the
deadlock.

Reported-by: Belisko Marek <[email protected]>
Signed-off-by: David Fries <[email protected]>
---
Evgeniy any comments? The patch looks good to me.
Dr. H. Nikolaus Schaller, Can you give this patch another test? I
don't expect any problems, but it is now only sending one add instead
of two which is a change.
I verified two KOBJ_ADD notifications were going out prior to this
change, and only one after. There should only be one, also verified
that the w1_bq27000 add notification went out before the slave device
id, showing the suppression was working. Used
udevadm monitor --kernel --property
for notification monitoring.

drivers/w1/w1.c | 32 ++++++--------------------------
1 file changed, 6 insertions(+), 26 deletions(-)

diff --git a/drivers/w1/w1.c b/drivers/w1/w1.c
index b96f61b..ff52618 100644
--- a/drivers/w1/w1.c
+++ b/drivers/w1/w1.c
@@ -614,27 +614,11 @@ end:
return err;
}

-/*
- * Handle sysfs file creation and removal here, before userspace is told that
- * the device is added / removed from the system
- */
-static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
- void *data)
+static int w1_family_notify(unsigned long action, struct w1_slave *sl)
{
- struct device *dev = data;
- struct w1_slave *sl;
struct w1_family_ops *fops;
int err;

- /*
- * Only care about slave devices at the moment. Yes, we should use a
- * separate "type" for this, but for now, look at the release function
- * to know which type it is...
- */
- if (dev->release != w1_slave_release)
- return 0;
-
- sl = dev_to_w1_slave(dev);
fops = sl->family->fops;

if (!fops)
@@ -673,10 +657,6 @@ static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
return 0;
}

-static struct notifier_block w1_bus_nb = {
- .notifier_call = w1_bus_notify,
-};
-
static int __w1_attach_slave_device(struct w1_slave *sl)
{
int err;
@@ -698,6 +678,9 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
dev_dbg(&sl->dev, "%s: registering %s as %p.\n", __func__,
dev_name(&sl->dev), sl);

+ /* suppress for w1_family_notify before sending KOBJ_ADD */
+ dev_set_uevent_suppress(&sl->dev, true);
+
err = device_register(&sl->dev);
if (err < 0) {
dev_err(&sl->dev,
@@ -705,7 +688,7 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
dev_name(&sl->dev), err);
return err;
}
-
+ w1_family_notify(BUS_NOTIFY_ADD_DEVICE, sl);

dev_set_uevent_suppress(&sl->dev, false);
kobject_uevent(&sl->dev.kobj, KOBJ_ADD);
@@ -799,6 +782,7 @@ int w1_unref_slave(struct w1_slave *sl)
msg.type = W1_SLAVE_REMOVE;
w1_netlink_send(sl->master, &msg);

+ w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
device_unregister(&sl->dev);
#ifdef DEBUG
memset(sl, 0, sizeof(*sl));
@@ -1186,10 +1170,6 @@ static int __init w1_init(void)
goto err_out_exit_init;
}

- retval = bus_register_notifier(&w1_bus_type, &w1_bus_nb);
- if (retval)
- goto err_out_bus_unregister;
-
retval = driver_register(&w1_master_driver);
if (retval) {
printk(KERN_ERR
--
1.7.10.4

2014-04-16 20:46:36

by H. Nikolaus Schaller

[permalink] [raw]
Subject: Re: [PATCH] w1: avoid recursive device_add

Hi David,

Am 16.04.2014 um 08:21 schrieb David Fries:

> __w1_attach_slave_device calls device_add which calls w1_bus_notify
> which calls the w1_bq27000 slave driver, which calls
> platform_device_add and device_add and deadlocks on getting
> &(&priv->bus_notifier)->rwsem as it is still held in the previous
> device_add. This avoids the problem by processing the family
> add/remove outside of the slave device_add call.
>
> Commit 47eba33a0997fc7362a introduced this deadlock and added
> a KOBJ_ADD, as the add was already reported in device_register two add
> events were being sent. This change suppresses the device_register
> add so that any slave device sysfs entries are setup before the add
> goes out.
>
> Belisko Marek reported this change fixed the deadlock he was seeing on
> ARM device tree, while testing on my x86-64 system never saw the
> deadlock.
>
> Reported-by: Belisko Marek <[email protected]>
> Signed-off-by: David Fries <[email protected]>
> ---
> Evgeniy any comments? The patch looks good to me.
> Dr. H. Nikolaus Schaller, Can you give this patch another test? I
> don't expect any problems, but it is now only sending one add instead
> of two which is a change.

It merges fine and we don't see the problematic deadlock.

BR,
Nikolaus

> I verified two KOBJ_ADD notifications were going out prior to this
> change, and only one after. There should only be one, also verified
> that the w1_bq27000 add notification went out before the slave device
> id, showing the suppression was working. Used
> udevadm monitor --kernel --property
> for notification monitoring.
>
> drivers/w1/w1.c | 32 ++++++--------------------------
> 1 file changed, 6 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/w1/w1.c b/drivers/w1/w1.c
> index b96f61b..ff52618 100644
> --- a/drivers/w1/w1.c
> +++ b/drivers/w1/w1.c
> @@ -614,27 +614,11 @@ end:
> return err;
> }
>
> -/*
> - * Handle sysfs file creation and removal here, before userspace is told that
> - * the device is added / removed from the system
> - */
> -static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
> - void *data)
> +static int w1_family_notify(unsigned long action, struct w1_slave *sl)
> {
> - struct device *dev = data;
> - struct w1_slave *sl;
> struct w1_family_ops *fops;
> int err;
>
> - /*
> - * Only care about slave devices at the moment. Yes, we should use a
> - * separate "type" for this, but for now, look at the release function
> - * to know which type it is...
> - */
> - if (dev->release != w1_slave_release)
> - return 0;
> -
> - sl = dev_to_w1_slave(dev);
> fops = sl->family->fops;
>
> if (!fops)
> @@ -673,10 +657,6 @@ static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
> return 0;
> }
>
> -static struct notifier_block w1_bus_nb = {
> - .notifier_call = w1_bus_notify,
> -};
> -
> static int __w1_attach_slave_device(struct w1_slave *sl)
> {
> int err;
> @@ -698,6 +678,9 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
> dev_dbg(&sl->dev, "%s: registering %s as %p.\n", __func__,
> dev_name(&sl->dev), sl);
>
> + /* suppress for w1_family_notify before sending KOBJ_ADD */
> + dev_set_uevent_suppress(&sl->dev, true);
> +
> err = device_register(&sl->dev);
> if (err < 0) {
> dev_err(&sl->dev,
> @@ -705,7 +688,7 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
> dev_name(&sl->dev), err);
> return err;
> }
> -
> + w1_family_notify(BUS_NOTIFY_ADD_DEVICE, sl);
>
> dev_set_uevent_suppress(&sl->dev, false);
> kobject_uevent(&sl->dev.kobj, KOBJ_ADD);
> @@ -799,6 +782,7 @@ int w1_unref_slave(struct w1_slave *sl)
> msg.type = W1_SLAVE_REMOVE;
> w1_netlink_send(sl->master, &msg);
>
> + w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
> device_unregister(&sl->dev);
> #ifdef DEBUG
> memset(sl, 0, sizeof(*sl));
> @@ -1186,10 +1170,6 @@ static int __init w1_init(void)
> goto err_out_exit_init;
> }
>
> - retval = bus_register_notifier(&w1_bus_type, &w1_bus_nb);
> - if (retval)
> - goto err_out_bus_unregister;
> -
> retval = driver_register(&w1_master_driver);
> if (retval) {
> printk(KERN_ERR
> --
> 1.7.10.4