2004-04-06 18:11:35

by Brian King

[permalink] [raw]
Subject: call_usermodehelper hang

I have been running into some kernel hangs due to call_usermodehelper. Looking
at the backtrace, it looks to me like there are deadlock issues with adding
devices from work queues. Attached is a sample backtrace from one of the
hangs I experienced. My question is why does call_usermodehelper do 2 different
things depending on whether or not it is called from the kevent task? It appears
that the simple way to fix the hang would be to never have call_usermodehelper
use a work_queue since it must be called from process context anyway, or
am I missing something?


0xc0000000017df300 1 0 0 0 D 0xc0000000017df7b0 swapper
SP(esp) PC(eip) Function(args)
0xc00000003fc9f460 0x0000000000000000 NO_SYMBOL or Userspace
0xc00000003fc9f4f0 0xc000000000058c40 .schedule +0xb4
0xc00000003fc9f5c0 0xc00000000005a464 .wait_for_completion +0x138
0xc00000003fc9f6c0 0xc00000000007c594 .call_usermodehelper +0x104
0xc00000003fc9f810 0xc00000000022d3e8 .kobject_hotplug +0x3c4
0xc00000003fc9f900 0xc00000000022d67c .kobject_add +0x134
0xc00000003fc9f9a0 0xc00000000012b3d8 .register_disk +0x70
0xc00000003fc9fa40 0xc00000000027dfe4 .add_disk +0x60
0xc00000003fc9fad0 0xc0000000002dc7dc .sd_probe +0x290
0xc00000003fc9fb80 0xc00000000026fbe8 .bus_match +0x94
0xc00000003fc9fc10 0xc00000000026ff70 .driver_attach +0x8c
0xc00000003fc9fca0 0xc000000000270104 .bus_add_driver +0x110
0xc00000003fc9fd50 0xc000000000270a18 .driver_register +0x38
0xc00000003fc9fdd0 0xc0000000002cd8f8 .scsi_register_driver +0x28
0xc00000003fc9fe50 0xc0000000004941d8 .init_sd +0x8c
0xc00000003fc9fee0 0xc00000000000c720 .init +0x25c
0xc00000003fc9ff90 0xc0000000000183ec .kernel_thread +0x4c

0xc00000003fab3380 4 1 0 0 D 0xc00000003fab3830 events/0
SP(esp) PC(eip) Function(args)
0xc00000003faaf6e0 0x0000000000000000 NO_SYMBOL or Userspace
0xc00000003faaf770 0xc000000000058c40 .schedule +0xb4
0xc00000003faaf840 0xc00000000022fa20 .rwsem_down_write_failed +0x14c
0xc00000003faaf910 0xc00000000026fed0 .bus_add_device +0x11c
0xc00000003faaf9b0 0xc00000000026e288 .device_add +0xd0
0xc00000003faafa50 0xc0000000002cdb00 .scsi_sysfs_add_sdev +0x8c
0xc00000003faafb00 0xc0000000002cbff8 .scsi_probe_and_add_lun +0xb04
0xc00000003faafc00 0xc0000000002ccca0 .scsi_add_device +0x90
0xc00000003faafcb0 0xc0000000002d9458 .ipr_worker_thread +0xc60
0xc00000003faafdc0 0xc00000000007cd9c .worker_thread +0x268
0xc00000003faafee0 0xc0000000000839cc .kthread +0x160
0xc00000003faaff90 0xc0000000000183ec .kernel_thread +0x4c



--
Brian King
eServer Storage I/O
IBM Linux Technology Center



--
Brian King
eServer Storage I/O
IBM Linux Technology Center


2004-04-07 00:27:24

by Andrew Morton

[permalink] [raw]
Subject: Re: call_usermodehelper hang

Brian King <[email protected]> wrote:
>
> I have been running into some kernel hangs due to call_usermodehelper. Looking
> at the backtrace, it looks to me like there are deadlock issues with adding
> devices from work queues. Attached is a sample backtrace from one of the
> hangs I experienced. My question is why does call_usermodehelper do 2 different
> things depending on whether or not it is called from the kevent task? It appears
> that the simple way to fix the hang would be to never have call_usermodehelper
> use a work_queue since it must be called from process context anyway, or
> am I missing something?
>

swapper is running call_usermodehelper() while holding
down_write(&bus->subsys.rwsem); via bus_add_driver().

Meanwhile, keventd is blocked on the same lock in bus_add_device().

I'd say that the bug lies in the kobject code - we should not call
call_usermodehelper() while holding any locks which keventd may ever
acquire.


> 0xc0000000017df300 1 0 0 0 D 0xc0000000017df7b0 swapper
> SP(esp) PC(eip) Function(args)
> 0xc00000003fc9f460 0x0000000000000000 NO_SYMBOL or Userspace
> 0xc00000003fc9f4f0 0xc000000000058c40 .schedule +0xb4
> 0xc00000003fc9f5c0 0xc00000000005a464 .wait_for_completion +0x138
> 0xc00000003fc9f6c0 0xc00000000007c594 .call_usermodehelper +0x104
> 0xc00000003fc9f810 0xc00000000022d3e8 .kobject_hotplug +0x3c4
> 0xc00000003fc9f900 0xc00000000022d67c .kobject_add +0x134
> 0xc00000003fc9f9a0 0xc00000000012b3d8 .register_disk +0x70
> 0xc00000003fc9fa40 0xc00000000027dfe4 .add_disk +0x60
> 0xc00000003fc9fad0 0xc0000000002dc7dc .sd_probe +0x290
> 0xc00000003fc9fb80 0xc00000000026fbe8 .bus_match +0x94
> 0xc00000003fc9fc10 0xc00000000026ff70 .driver_attach +0x8c
> 0xc00000003fc9fca0 0xc000000000270104 .bus_add_driver +0x110
> 0xc00000003fc9fd50 0xc000000000270a18 .driver_register +0x38
> 0xc00000003fc9fdd0 0xc0000000002cd8f8 .scsi_register_driver +0x28
> 0xc00000003fc9fe50 0xc0000000004941d8 .init_sd +0x8c
> 0xc00000003fc9fee0 0xc00000000000c720 .init +0x25c
> 0xc00000003fc9ff90 0xc0000000000183ec .kernel_thread +0x4c
>
> 0xc00000003fab3380 4 1 0 0 D 0xc00000003fab3830 events/0
> SP(esp) PC(eip) Function(args)
> 0xc00000003faaf6e0 0x0000000000000000 NO_SYMBOL or Userspace
> 0xc00000003faaf770 0xc000000000058c40 .schedule +0xb4
> 0xc00000003faaf840 0xc00000000022fa20 .rwsem_down_write_failed +0x14c
> 0xc00000003faaf910 0xc00000000026fed0 .bus_add_device +0x11c
> 0xc00000003faaf9b0 0xc00000000026e288 .device_add +0xd0
> 0xc00000003faafa50 0xc0000000002cdb00 .scsi_sysfs_add_sdev +0x8c
> 0xc00000003faafb00 0xc0000000002cbff8 .scsi_probe_and_add_lun +0xb04
> 0xc00000003faafc00 0xc0000000002ccca0 .scsi_add_device +0x90
> 0xc00000003faafcb0 0xc0000000002d9458 .ipr_worker_thread +0xc60
> 0xc00000003faafdc0 0xc00000000007cd9c .worker_thread +0x268
> 0xc00000003faafee0 0xc0000000000839cc .kthread +0x160
> 0xc00000003faaff90 0xc0000000000183ec .kernel_thread +0x4c

2004-04-07 00:41:53

by Chris Wright

[permalink] [raw]
Subject: Re: call_usermodehelper hang

* Brian King ([email protected]) wrote:
> I have been running into some kernel hangs due to call_usermodehelper. Looking
> at the backtrace, it looks to me like there are deadlock issues with adding
> devices from work queues. Attached is a sample backtrace from one of the
> hangs I experienced. My question is why does call_usermodehelper do 2 different
> things depending on whether or not it is called from the kevent task? It appears
> that the simple way to fix the hang would be to never have call_usermodehelper
> use a work_queue since it must be called from process context anyway, or
> am I missing something?

It does two different things because it's trying to run from keventd.
In the case that current is not keventd, it schedules the work, so
keventd will pick that work up later to run in it's process context.

How early is this hang? It looks like init thread adds work and waits
for it's completion while holding a semaphore. It is never woken up by
keventd which is sleeping waiting for wakeup from semaphore that init
thread took.

Seems troubling to hold the sem while calling call_usermodehelper, as that
could go off for a long time.

swapper events/0
------- --------
bus_add_driver() ipr_worker_thread()
down_write(&bus->subsys.rwsem) /* here */ scsi_add_device()
driver_attach() bus_add_device()
... /* sem is already taken */
sd_probe() down_write(&bus->subsys.rwsem)
add_disk() schedule() /* bye! */
...
call_usermodehelper()
wait_for_completion() /* never awoken by keventd */

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2004-04-07 01:46:50

by Brian King

[permalink] [raw]
Subject: Re: call_usermodehelper hang

Chris Wright wrote:
> * Brian King ([email protected]) wrote:
>
>>I have been running into some kernel hangs due to call_usermodehelper. Looking
>>at the backtrace, it looks to me like there are deadlock issues with adding
>>devices from work queues. Attached is a sample backtrace from one of the
>>hangs I experienced. My question is why does call_usermodehelper do 2 different
>>things depending on whether or not it is called from the kevent task? It appears
>>that the simple way to fix the hang would be to never have call_usermodehelper
>>use a work_queue since it must be called from process context anyway, or
>>am I missing something?
>
>
> It does two different things because it's trying to run from keventd.
> In the case that current is not keventd, it schedules the work, so
> keventd will pick that work up later to run in it's process context.
>
> How early is this hang?

Pretty early. Boot time, loading scsi drivers. While initializing the
third scsi adapter on the system a device shows up dynamically on the
first adapter, the LLD schedules work to call scsi_add_device and we
end up with the hang.


It looks like init thread adds work and waits
> for it's completion while holding a semaphore. It is never woken up by
> keventd which is sleeping waiting for wakeup from semaphore that init
> thread took.
>
> Seems troubling to hold the sem while calling call_usermodehelper, as that
> could go off for a long time.

I agree. There is another similar hang I ran into recently with the scsi
core in that I was calling scsi_add_device from keventd, which ended up
sleeping on the scan_mutex since the module load process was calling
scsi_scan_host. scsi_scan_host was in the same kobject hotplug code,
calling call_usermodehelper. I figured my LLD shouldn't be doing that,
so I synchronized the events. Perhaps that was the wrong fix... I
suppose I could have changed the LLD to always call
scsi_scan_host/scsi_add_device/etc. from keventd...

-Brian


2004-04-07 06:12:07

by Greg KH

[permalink] [raw]
Subject: Re: call_usermodehelper hang

On Tue, Apr 06, 2004 at 05:29:03PM -0700, Andrew Morton wrote:
> Brian King <[email protected]> wrote:
> >
> > I have been running into some kernel hangs due to call_usermodehelper. Looking
> > at the backtrace, it looks to me like there are deadlock issues with adding
> > devices from work queues. Attached is a sample backtrace from one of the
> > hangs I experienced. My question is why does call_usermodehelper do 2 different
> > things depending on whether or not it is called from the kevent task? It appears
> > that the simple way to fix the hang would be to never have call_usermodehelper
> > use a work_queue since it must be called from process context anyway, or
> > am I missing something?
> >
>
> swapper is running call_usermodehelper() while holding
> down_write(&bus->subsys.rwsem); via bus_add_driver().
>
> Meanwhile, keventd is blocked on the same lock in bus_add_device().
>
> I'd say that the bug lies in the kobject code - we should not call
> call_usermodehelper() while holding any locks which keventd may ever
> acquire.

How is keventd calling sysfs code? Is scsi using it to drive device
detection somehow? I don't see how the kobject core code itself can do
this on its own.

thanks,

greg k-h

2004-04-07 14:00:40

by Brian King

[permalink] [raw]
Subject: Re: call_usermodehelper hang

Greg KH wrote:
> On Tue, Apr 06, 2004 at 05:29:03PM -0700, Andrew Morton wrote:
>
>>Brian King <[email protected]> wrote:
>>
>>>I have been running into some kernel hangs due to call_usermodehelper. Looking
>>>at the backtrace, it looks to me like there are deadlock issues with adding
>>>devices from work queues. Attached is a sample backtrace from one of the
>>>hangs I experienced. My question is why does call_usermodehelper do 2 different
>>>things depending on whether or not it is called from the kevent task? It appears
>>>that the simple way to fix the hang would be to never have call_usermodehelper
>>>use a work_queue since it must be called from process context anyway, or
>>>am I missing something?
>>>
>>
>>swapper is running call_usermodehelper() while holding
>>down_write(&bus->subsys.rwsem); via bus_add_driver().
>>
>>Meanwhile, keventd is blocked on the same lock in bus_add_device().
>>
>>I'd say that the bug lies in the kobject code - we should not call
>>call_usermodehelper() while holding any locks which keventd may ever
>>acquire.
>
>
> How is keventd calling sysfs code? Is scsi using it to drive device
> detection somehow? I don't see how the kobject core code itself can do
> this on its own.

scsi_add_device is being called from keventd by the ipr LLD. I am assuming this
is a reasonable thing to do. If not, then I would argue the scsi_add_device API
is broken as it must be called from task level, and if a LLD cannot use keventd,
then every LLD that wants to use it would need to create its own workqueue thread.


--
Brian King
eServer Storage I/O
IBM Linux Technology Center

2004-04-16 08:52:46

by Heiko Carstens

[permalink] [raw]
Subject: Re: call_usermodehelper hang

> I have been running into some kernel hangs due to call_usermodehelper.
Looking
> at the backtrace, it looks to me like there are deadlock issues with
adding
> devices from work queues. Attached is a sample backtrace from one of the
> hangs I experienced. My question is why does call_usermodehelper do
> 2 different
> things depending on whether or not it is called from the kevent
> task? It appears
> that the simple way to fix the hang would be to never have
call_usermodehelper
> use a work_queue since it must be called from process context anyway, or
> am I missing something?

Do you have a solution for this problem? As it appears we have the very
same
problem with the zfcp lldd, since we also trigger a scsi_add_device call
by
using schedule_work. IMO the right way to solve this problem would be to
not
trigger any hotplug events while holding a semaphore.

> 0xc0000000017df300 1 0 0 0 D 0xc0000000017df7b0
swapper
> SP(esp) PC(eip) Function(args)
> 0xc00000003fc9f460 0x0000000000000000 NO_SYMBOL or Userspace
> 0xc00000003fc9f4f0 0xc000000000058c40 .schedule +0xb4
> 0xc00000003fc9f5c0 0xc00000000005a464 .wait_for_completion +0x138
> 0xc00000003fc9f6c0 0xc00000000007c594 .call_usermodehelper +0x104
> 0xc00000003fc9f810 0xc00000000022d3e8 .kobject_hotplug +0x3c4
> 0xc00000003fc9f900 0xc00000000022d67c .kobject_add +0x134
> 0xc00000003fc9f9a0 0xc00000000012b3d8 .register_disk +0x70
> 0xc00000003fc9fa40 0xc00000000027dfe4 .add_disk +0x60
> 0xc00000003fc9fad0 0xc0000000002dc7dc .sd_probe +0x290
> 0xc00000003fc9fb80 0xc00000000026fbe8 .bus_match +0x94
> 0xc00000003fc9fc10 0xc00000000026ff70 .driver_attach +0x8c
> 0xc00000003fc9fca0 0xc000000000270104 .bus_add_driver +0x110
> 0xc00000003fc9fd50 0xc000000000270a18 .driver_register +0x38
> 0xc00000003fc9fdd0 0xc0000000002cd8f8 .scsi_register_driver +0x28
> 0xc00000003fc9fe50 0xc0000000004941d8 .init_sd +0x8c
> 0xc00000003fc9fee0 0xc00000000000c720 .init +0x25c
> 0xc00000003fc9ff90 0xc0000000000183ec .kernel_thread +0x4c
>
> 0xc00000003fab3380 4 1 0 0 D 0xc00000003fab3830
events/0
> SP(esp) PC(eip) Function(args)
> 0xc00000003faaf6e0 0x0000000000000000 NO_SYMBOL or Userspace
> 0xc00000003faaf770 0xc000000000058c40 .schedule +0xb4
> 0xc00000003faaf840 0xc00000000022fa20 .rwsem_down_write_failed +0x14c
> 0xc00000003faaf910 0xc00000000026fed0 .bus_add_device +0x11c
> 0xc00000003faaf9b0 0xc00000000026e288 .device_add +0xd0
> 0xc00000003faafa50 0xc0000000002cdb00 .scsi_sysfs_add_sdev +0x8c
> 0xc00000003faafb00 0xc0000000002cbff8 .scsi_probe_and_add_lun +0xb04
> 0xc00000003faafc00 0xc0000000002ccca0 .scsi_add_device +0x90
> 0xc00000003faafcb0 0xc0000000002d9458 .ipr_worker_thread +0xc60
> 0xc00000003faafdc0 0xc00000000007cd9c .worker_thread +0x268
> 0xc00000003faafee0 0xc0000000000839cc .kthread +0x160
> 0xc00000003faaff90 0xc0000000000183ec .kernel_thread +0x4c

2004-04-16 14:07:07

by Brian King

[permalink] [raw]
Subject: Re: call_usermodehelper hang

The fix is in Andrew Morton's tree. The solution was to add a
call_usermodehelper_async, which can be called with semaphores held.
Grab the following patches:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm6/broken-out/call_usermodehelper_async.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm6/broken-out/call_usermodehelper_async-always.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm6/broken-out/kstrdup-and-friends.patch

You will get a compile error since the mm tree has another patch in it.
Change the offending (system_state != SYSTEM_RUNNING) to (!system_running)

-Brian


Heiko Carstens wrote:
>>I have been running into some kernel hangs due to call_usermodehelper.
>
> Looking
>
>>at the backtrace, it looks to me like there are deadlock issues with
>
> adding
>
>>devices from work queues. Attached is a sample backtrace from one of the
>>hangs I experienced. My question is why does call_usermodehelper do
>>2 different
>>things depending on whether or not it is called from the kevent
>>task? It appears
>>that the simple way to fix the hang would be to never have
>
> call_usermodehelper
>
>>use a work_queue since it must be called from process context anyway, or
>>am I missing something?
>
>
> Do you have a solution for this problem? As it appears we have the very
> same
> problem with the zfcp lldd, since we also trigger a scsi_add_device call
> by
> using schedule_work. IMO the right way to solve this problem would be to
> not
> trigger any hotplug events while holding a semaphore.
>
>
>>0xc0000000017df300 1 0 0 0 D 0xc0000000017df7b0
>
> swapper
>
>> SP(esp) PC(eip) Function(args)
>>0xc00000003fc9f460 0x0000000000000000 NO_SYMBOL or Userspace
>>0xc00000003fc9f4f0 0xc000000000058c40 .schedule +0xb4
>>0xc00000003fc9f5c0 0xc00000000005a464 .wait_for_completion +0x138
>>0xc00000003fc9f6c0 0xc00000000007c594 .call_usermodehelper +0x104
>>0xc00000003fc9f810 0xc00000000022d3e8 .kobject_hotplug +0x3c4
>>0xc00000003fc9f900 0xc00000000022d67c .kobject_add +0x134
>>0xc00000003fc9f9a0 0xc00000000012b3d8 .register_disk +0x70
>>0xc00000003fc9fa40 0xc00000000027dfe4 .add_disk +0x60
>>0xc00000003fc9fad0 0xc0000000002dc7dc .sd_probe +0x290
>>0xc00000003fc9fb80 0xc00000000026fbe8 .bus_match +0x94
>>0xc00000003fc9fc10 0xc00000000026ff70 .driver_attach +0x8c
>>0xc00000003fc9fca0 0xc000000000270104 .bus_add_driver +0x110
>>0xc00000003fc9fd50 0xc000000000270a18 .driver_register +0x38
>>0xc00000003fc9fdd0 0xc0000000002cd8f8 .scsi_register_driver +0x28
>>0xc00000003fc9fe50 0xc0000000004941d8 .init_sd +0x8c
>>0xc00000003fc9fee0 0xc00000000000c720 .init +0x25c
>>0xc00000003fc9ff90 0xc0000000000183ec .kernel_thread +0x4c
>>
>>0xc00000003fab3380 4 1 0 0 D 0xc00000003fab3830
>
> events/0
>
>> SP(esp) PC(eip) Function(args)
>>0xc00000003faaf6e0 0x0000000000000000 NO_SYMBOL or Userspace
>>0xc00000003faaf770 0xc000000000058c40 .schedule +0xb4
>>0xc00000003faaf840 0xc00000000022fa20 .rwsem_down_write_failed +0x14c
>>0xc00000003faaf910 0xc00000000026fed0 .bus_add_device +0x11c
>>0xc00000003faaf9b0 0xc00000000026e288 .device_add +0xd0
>>0xc00000003faafa50 0xc0000000002cdb00 .scsi_sysfs_add_sdev +0x8c
>>0xc00000003faafb00 0xc0000000002cbff8 .scsi_probe_and_add_lun +0xb04
>>0xc00000003faafc00 0xc0000000002ccca0 .scsi_add_device +0x90
>>0xc00000003faafcb0 0xc0000000002d9458 .ipr_worker_thread +0xc60
>>0xc00000003faafdc0 0xc00000000007cd9c .worker_thread +0x268
>>0xc00000003faafee0 0xc0000000000839cc .kthread +0x160
>>0xc00000003faaff90 0xc0000000000183ec .kernel_thread +0x4c
>
>
>


--
Brian King
eServer Storage I/O
IBM Linux Technology Center