2005-11-25 13:42:36

by Guillaume Thouvenin

[permalink] [raw]
Subject: [BUG linux-2.6.15-rc] process events connector - soft lockup detected

Hello,

I compiled a kernel 2.6.15-rc1 and a 2.6.15-rc2 with Process Events
Connector enabled. The machine has two processors. When I use the
process events connector, a soft lockup is detected (for both releases):

---------B<---------------------------------------
Pid: 2770, comm: sh
EIP: 0060:[<c039ae28>] CPU: 1
EIP is at __read_lock_failed+0x8/0x14
EFLAGS: 00000297 Not tainted (2.6.15-rc2)
EAX: c046fc00 EBX: f5946000 ECX: f5947f7c EDX: 00000286
ESI: 00000000 EDI: ffffffff EBP: f5946000 DS: 007b ES: 007b
CR0: 8005003b CR2: 080e173c CR3: 358a0000 CR4: 000006d0
[<c039c566>] _read_lock+0xb/0xc
[<c011e7ee>] do_wait+0x9d/0x40c
[<c0116fd0>] default_wake_function+0x0/0x12
[<c0116fd0>] default_wake_function+0x0/0x12
[<c011ec32>] sys_wait4+0x43/0x45
[<c0102e39>] syscall_call+0x7/0xb
BUG: soft lockup detected on CPU#1!

Pid: 2770, comm: sh
EIP: 0060:[<c039ae28>] CPU: 1
EIP is at __read_lock_failed+0x8/0x14
EFLAGS: 00000297 Not tainted (2.6.15-rc2)
EAX: c046fc00 EBX: f5946000 ECX: f5947f7c EDX: 00000286
ESI: 00000000 EDI: ffffffff EBP: f5946000 DS: 007b ES: 007b
CR0: 8005003b CR2: 080e173c CR3: 358a0000 CR4: 000006d0
[<c039c566>] _read_lock+0xb/0xc
[<c011e7ee>] do_wait+0x9d/0x40c
[<c0116fd0>] default_wake_function+0x0/0x12
[<c0116fd0>] default_wake_function+0x0/0x12
[<c011ec32>] sys_wait4+0x43/0x45
[<c0102e39>] syscall_call+0x7/0xb
BUG: soft lockup detected on CPU#1!

Pid: 2770, comm: sh
EIP: 0060:[<c039ae28>] CPU: 1
EIP is at __read_lock_failed+0x8/0x14
EFLAGS: 00000297 Not tainted (2.6.15-rc2)
EAX: c046fc00 EBX: f5946000 ECX: f5947f7c EDX: 00000286
ESI: 00000000 EDI: ffffffff EBP: f5946000 DS: 007b ES: 007b
CR0: 8005003b CR2: 080e173c CR3: 358a0000 CR4: 000006d0
[<c039c566>] _read_lock+0xb/0xc
[<c011e7ee>] do_wait+0x9d/0x40c
[<c0116fd0>] default_wake_function+0x0/0x12
[<c0116fd0>] default_wake_function+0x0/0x12
[<c011ec32>] sys_wait4+0x43/0x45
[<c0102e39>] syscall_call+0x7/0xb
BUG: soft lockup detected on CPU#1!

Pid: 2770, comm: sh
EIP: 0060:[<c039ae25>] CPU: 1
EIP is at __read_lock_failed+0x5/0x14
EFLAGS: 00000297 Not tainted (2.6.15-rc2)
EAX: c046fc00 EBX: f5946000 ECX: f5947f7c EDX: 00000286
ESI: 00000000 EDI: ffffffff EBP: f5946000 DS: 007b ES: 007b
CR0: 8005003b CR2: 080e173c CR3: 358a0000 CR4: 000006d0
[<c039c566>] _read_lock+0xb/0xc
[<c011e7ee>] do_wait+0x9d/0x40c
[<c0116fd0>] default_wake_function+0x0/0x12
[<c0116fd0>] default_wake_function+0x0/0x12
[<c011ec32>] sys_wait4+0x43/0x45
[<c0102e39>] syscall_call+0x7/0xb

.... etc, etc where EIP is sometimes equal to 0060:[<c039ae25>] CPU: 1
and sometimes equal to 0060:[<c039ae28>] CPU: 1
---------------------------------------------------

I think that the problem is in kernel/fork.c. The function
proc_fork_connector(p) is called inside a
write_lock_irq(&tasklist_lock). The
cn_netlink_send(msg,CN_IDX_PROC,GFP_KERNEL) is called by the
proc_fork_connector(). Thus, the alloc_skb(size,GFP_KERNEL) is called
within a write_lock_irq(&tasklist_lock)... is it the problem?

If I replace GFP_KERNEL by GFP_ATOMIC the soft lockup disappear (but I
don't know if this solution is right...)


Best regards,

Guillaume


2005-11-25 19:48:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG linux-2.6.15-rc] process events connector - soft lockup detected

Guillaume Thouvenin <[email protected]> wrote:
>
> Hello,
>
> I compiled a kernel 2.6.15-rc1 and a 2.6.15-rc2 with Process Events
> Connector enabled. The machine has two processors. When I use the
> process events connector, a soft lockup is detected (for both releases):
>
> ---------B<---------------------------------------
> Pid: 2770, comm: sh
> EIP: 0060:[<c039ae28>] CPU: 1
> EIP is at __read_lock_failed+0x8/0x14
> EFLAGS: 00000297 Not tainted (2.6.15-rc2)
> EAX: c046fc00 EBX: f5946000 ECX: f5947f7c EDX: 00000286
> ESI: 00000000 EDI: ffffffff EBP: f5946000 DS: 007b ES: 007b
> CR0: 8005003b CR2: 080e173c CR3: 358a0000 CR4: 000006d0
> [<c039c566>] _read_lock+0xb/0xc
> [<c011e7ee>] do_wait+0x9d/0x40c
> [<c0116fd0>] default_wake_function+0x0/0x12
> [<c0116fd0>] default_wake_function+0x0/0x12
> [<c011ec32>] sys_wait4+0x43/0x45
> [<c0102e39>] syscall_call+0x7/0xb
> BUG: soft lockup detected on CPU#1!
>
> ...
>
> I think that the problem is in kernel/fork.c. The function
> proc_fork_connector(p) is called inside a
> write_lock_irq(&tasklist_lock). The
> cn_netlink_send(msg,CN_IDX_PROC,GFP_KERNEL) is called by the
> proc_fork_connector(). Thus, the alloc_skb(size,GFP_KERNEL) is called
> within a write_lock_irq(&tasklist_lock)... is it the problem?
>
> If I replace GFP_KERNEL by GFP_ATOMIC the soft lockup disappear (but I
> don't know if this solution is right...)

Gad, how ddd that get in there?

--- devel/kernel/fork.c~forkc-proc_fork_connector-called-under-write_lock 2005-11-25 11:46:36.000000000 -0800
+++ devel-akpm/kernel/fork.c 2005-11-25 11:46:36.000000000 -0800
@@ -1135,13 +1135,13 @@ static task_t *copy_process(unsigned lon
__get_cpu_var(process_counts)++;
}

- proc_fork_connector(p);
if (!current->signal->tty && p->signal->tty)
p->signal->tty = NULL;

nr_threads++;
total_forks++;
write_unlock_irq(&tasklist_lock);
+ proc_fork_connector(p);
retval = 0;

fork_out:
_


2005-11-25 19:58:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG linux-2.6.15-rc] process events connector - soft lockup detected

Andrew Morton <[email protected]> wrote:
>
> Gad, how ddd that get in there?

That's how to pronounce "did" with a headcold.

I note that copy_process() also calls cpuset_fork() under
write_lock_irq(&tasklist_lock) which is a) inefficient and b) forbidden
according to the nice comment over task_lock().

2005-11-25 22:30:07

by Paul Jackson

[permalink] [raw]
Subject: Re: [BUG linux-2.6.15-rc] process events connector - soft lockup detected

Andrew wrote:
> I note that copy_process() also calls cpuset_fork() under
> write_lock_irq(&tasklist_lock) which is a) inefficient and b) forbidden
> according to the nice comment over task_lock().

Nice catch - thanks. It was (perhaps) only inefficient until recently, because
cpuset_fork() didn't do any locking - just an atomic cpuset counter increment.

But with the locking overhaul that went into Linus's tree Oct 31, 2005,
cpuset_fork() now takes task_lock(), to guard against cpuset.c attach_task()
invalidating the task->cpuset pointer as cpuset_fork is dereferencing it.

I will look into preparing a patch that moves cpuset_fork() outside the
write_lock_irq().

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401