2022-03-29 13:28:51

by Zhang Qiao

Subject: Question about kill a process group

hello everyone,

I got a hardlockup panic when running the LTP syscall testcases.

[348439.713178] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[348439.713236] irq event stamp: 0
[348439.713237] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[348439.713238] hardirqs last disabled at (0): [<ffffffff87cd1ea5>] copy_process+0x7f5/0x2160
[348439.713239] softirqs last enabled at (0): [<ffffffff87cd1ea5>] copy_process+0x7f5/0x2160
[348439.713240] softirqs last disabled at (0): [<0000000000000000>] 0x0
[348439.713241] CPU: 32 PID: 1151212 Comm: fork12 Kdump: loaded Tainted: G S 5.10.0+ #1
[348439.713242] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
[348439.713243] RIP: 0010:queued_write_lock_slowpath+0x4d/0x80
[348439.713245] RSP: 0018:ffffa3a6bed4fe60 EFLAGS: 00000006
[348439.713246] RAX: 0000000000000500 RBX: ffffffff892060c0 RCX: 00000000000000ff
[348439.713247] RDX: 0000000000000500 RSI: 0000000000000100 RDI: ffffffff892060c0
[348439.713248] RBP: ffffffff892060c4 R08: 0000000000000001 R09: 0000000000000000
[348439.713249] R10: ffffa3a6bed4fde8 R11: 0000000000000000 R12: ffff96dfd3b68001
[348439.713250] R13: ffff96dfd3b68000 R14: ffff96dfd3b68c38 R15: ffff96e2cf1f51c0
[348439.713251] FS: 0000000000000000(0000) GS:ffff96edbc200000(0000) knlGS:0000000000000000
[348439.713252] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[348439.713253] CR2: 0000000000416ea0 CR3: 0000002d91812004 CR4: 00000000001706e0
[348439.713254] Call Trace:
[348439.713255] do_raw_write_lock+0xa9/0xb0
[348439.713256] _raw_write_lock_irq+0x5a/0x70
[348439.713256] do_exit+0x429/0xd00
[348439.713257] do_group_exit+0x39/0xb0
[348439.713258] __x64_sys_exit_group+0x14/0x20
[348439.713259] do_syscall_64+0x33/0x40
[348439.713260] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[348439.713260] RIP: 0033:0x7f59295a7066
[348439.713261] Code: Unable to access opcode bytes at RIP 0x7f59295a703c.
[348439.713262] RSP: 002b:00007fff0afeac38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[348439.713264] RAX: ffffffffffffffda RBX: 00007f5929694530 RCX: 00007f59295a7066
[348439.713265] RDX: 0000000000000002 RSI: 000000000000003c RDI: 0000000000000002
[348439.713266] RBP: 0000000000000002 R08: 00000000000000e7 R09: ffffffffffffff80
[348439.713267] R10: 0000000000000002 R11: 0000000000000246 R12: 00007f5929694530
[348439.713268] R13: 0000000000000001 R14: 00007f5929697f68 R15: 0000000000000000
[348439.713269] Kernel panic - not syncing: Hard LOCKUP
[348439.713270] CPU: 32 PID: 1151212 Comm: fork12 Kdump: loaded Tainted: G S 5.10.0+ #1
[348439.713272] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
[348439.713272] Call Trace:
[348439.713273] <NMI>
[348439.713274] dump_stack+0x77/0x97
[348439.713275] panic+0x10c/0x2fb
[348439.713275] nmi_panic+0x35/0x40
[348439.713276] watchdog_hardlockup_check+0xeb/0x110
[348439.713277] __perf_event_overflow+0x52/0xf0
[348439.713278] handle_pmi_common+0x21a/0x320
[348439.713286] intel_pmu_handle_irq+0xc9/0x1b0
[348439.713287] perf_event_nmi_handler+0x24/0x40
[348439.713288] nmi_handle+0xc3/0x2a0
[348439.713289] default_do_nmi+0x49/0xf0
[348439.713289] exc_nmi+0x146/0x160
[348439.713290] end_repeat_nmi+0x16/0x55
[348439.713291] RIP: 0010:queued_write_lock_slowpath+0x4d/0x80
[348439.713293] RSP: 0018:ffffa3a6bed4fe60 EFLAGS: 00000006
[348439.713295] RAX: 0000000000000500 RBX: ffffffff892060c0 RCX: 00000000000000ff
[348439.713296] RDX: 0000000000000500 RSI: 0000000000000100 RDI: ffffffff892060c0
[348439.713296] RBP: ffffffff892060c4 R08: 0000000000000001 R09: 0000000000000000
[348439.713297] R10: ffffa3a6bed4fde8 R11: 0000000000000000 R12: ffff96dfd3b68001
[348439.713298] R13: ffff96dfd3b68000 R14: ffff96dfd3b68c38 R15: ffff96e2cf1f51c0
[348439.713300] </NMI>
[348439.713301] do_raw_write_lock+0xa9/0xb0
[348439.713302] _raw_write_lock_irq+0x5a/0x70
[348439.713303] do_exit+0x429/0xd00
[348439.713303] do_group_exit+0x39/0xb0
[348439.713304] __x64_sys_exit_group+0x14/0x20
[348439.713305] do_syscall_64+0x33/0x40
[348439.713306] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[348439.713307] RIP: 0033:0x7f59295a7066
[348439.713308] Code: Unable to access opcode bytes at RIP 0x7f59295a703c.


When analyzing the vmcore, I noticed lots of fork12 processes waiting for the tasklist read lock or write
lock (see the attached file all_cpu_stack.log). Every fork12 process (all in the same process group)
calls kill(0, SIGQUIT) in its signal handler [1]; this traverses all the processes in the process group
and sends the signal to them one by one, which is very time-costly work and holds the tasklist read
lock for a long time. At the same time, other processes exit after receiving the signal and try to take
the tasklist write lock at exit_notify().

[1] fork12 testcase: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/fork/fork12.c
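
The pattern fork12 ends up exercising boils down to the sketch below
(illustrative only, not the actual LTP source; see [1] for that): N
children in one process group, each of whose SIGQUIT handler
re-broadcasts the signal to the whole group.

	#include <signal.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static void handler(int sig)
	{
		(void)sig;
		kill(0, SIGQUIT);	/* re-broadcast to the whole group */
		_exit(0);
	}

	int main(void)
	{
		pid_t pid;

		signal(SIGQUIT, handler);	/* inherited by every child */

		do {
			pid = fork();
			if (pid == 0)
				for (;;)
					pause();	/* child waits for the signal */
		} while (pid > 0);		/* fork until fork() fails */

		signal(SIGQUIT, SIG_IGN);	/* parent survives its own kill(0) */
		kill(0, SIGQUIT);		/* each child's handler re-broadcasts */
		while (wait(NULL) > 0)
			;
		return 0;
	}

Each kill(0, ...) walks the entire process group under the tasklist read
lock, so N children re-broadcasting means N full group walks.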

Some processes are waiting for the tasklist read lock:

#5 [ffff972a9b16fd78] native_queued_spin_lock_slowpath at ffffffff9931ed47
#6 [ffff972a9b16fd78] queued_read_lock_slowpath at ffffffff99320a58
#7 [ffff972a9b16fd90] do_wait at ffffffff992bc17d
#8 [ffff972a9b16fdd0] kernel_wait4 at ffffffff992bd88d
#9 [ffff972a9b16fe58] __do_sys_wait4 at ffffffff992bd9e5
#10 [ffff972a9b16ff30] do_syscall_64 at ffffffff9920432d
#11 [ffff972a9b16ff50] entry_SYSCALL_64_after_hwframe at ffffffff99c000ad

At the same time, some processes are exiting and waiting for the tasklist write lock:

#5 [ffff972aa49a7e60] native_queued_spin_lock_slowpath at ffffffff9931ecb0
#6 [ffff972aa49a7e60] queued_write_lock_slowpath at ffffffff993209e4
#7 [ffff972aa49a7e78] do_raw_write_lock at ffffffff99320834
#8 [ffff972aa49a7e88] do_exit at ffffffff992bcd78
#9 [ffff972aa49a7f00] do_group_exit at ffffffff992bd719
#10 [ffff972aa49a7f28] __x64_sys_exit_group at ffffffff992bd7a4
#11 [ffff972aa49a7f30] do_syscall_64 at ffffffff9920432d
#12 [ffff972aa49a7f50] entry_SYSCALL_64_after_hwframe at ffffffff99c000ad

In this scenario, lots of processes are waiting for the tasklist read lock or write lock, so they
queue up. If the wait queue is long enough, it can cause a hardlockup when a process waits to take
the write lock at exit_notify().
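
For reference, the two paths that pile up on tasklist_lock look roughly
like this (abridged from the 5.10-era kernel/signal.c and kernel/exit.c;
error paths and unrelated work trimmed):

	/* kill(0, sig): kill_something_info() takes the tasklist read
	 * lock and then walks every process in the caller's group: */
	int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp)
	{
		struct task_struct *p = NULL;
		int retval = -ESRCH, success = 0;

		/* caller holds read_lock(&tasklist_lock) */
		do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
			int err = group_send_sig_info(sig, info, p, PIDTYPE_PGID);
			success |= !err;
			retval = err;
		} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

		return success ? 0 : retval;
	}

	/* meanwhile every exiting child does: */
	static void exit_notify(struct task_struct *tsk, int group_dead)
	{
		write_lock_irq(&tasklist_lock);	/* spins with irqs disabled */
		/* ... reparenting, do_notify_parent(), ... */
		write_unlock_irq(&tasklist_lock);
	}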

I tried to solve this problem by avoiding traversing the process group multiple times when kill(0, xxxx)
is called multiple times from the same process group, but it doesn't look like a good solution.

Is there any good idea for fixing this problem?

Thanks!

Qiao


Attachments:
all_cpu_stack.log (110.93 kB)

2022-04-02 15:46:04

by Zhang Qiao

Subject: Re: Question about kill a process group

ping...

Any suggestions for this problem?

thanks!
Qiao


On 2022/3/29 16:07, Zhang Qiao wrote:
> [...]

2022-04-13 08:22:53

by Zhang Qiao

Subject: Re: Question about kill a process group



Gentle ping. Any comments on this problem?

On 2022/4/2 10:22, Zhang Qiao wrote:
> ping...
>
> Any suggestions for this problem?
>
> thank!
> Qiao
>
>
> On 2022/3/29 16:07, Zhang Qiao wrote:
>> [...]

2022-04-14 05:20:09

by Eric W. Biederman

Subject: Re: Question about kill a process group

Zhang Qiao <[email protected]> writes:

> Gentle ping. Any comments on this problem?

Is fork12 a new test?

Is there a real world use case that connects to this?

How many children are being created in this test? Several million?

I would like to blame this on the old issue of tasklist_lock being
a global lock. Given the number of child processes (as many as can be
created) I don't think we are hurt much by using a global lock. The
problem for solvability is that we have a lock.

Fundamentally there must be a lock taken to maintain the parent's
list of children.

I only see SIGQUIT being called once in the parent process so that
should not be an issue.

There is a minor issue in fork12 that it calls exit(0) instead of
_exit(0) in the children. Not the problem you are dealing with
but it does look like it can be a distraction.

I suspect the issue really is the thundering herd of a million+
processes synchronizing on a single lock.

I don't think this is a hard lockup, just a global slow down.
I expect everything will eventually exit.


To do something about this is going to take a deep and fundamental
redesign of how we maintain process lists to handle a parent
with millions of children well.

Is there any real world reason to care about this case? Without
real world motivation I am inclined to just note that this is
something that is handled poorly, and leave it as is.

Eric

> [...]

2022-04-15 05:44:26

by Zhang Qiao

Subject: Re: Question about kill a process group



On 2022/4/13 23:47, Eric W. Biederman wrote:
> Zhang Qiao <[email protected]> writes:
>
>> Gentle ping. Any comments on this problem?
>
> Is fork12 a new test?


fork12 is an LTP testcase.
(https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/fork/fork12.c)


>
> Is there a real world use case that connects to this?
>
> How many children are being created in this test? Several million?

There are about 300,000+ processes.

>
> I would like to blame this on the old issue of tasklist_lock being
> a global lock. Given the number of child processes (as many as can be
> created) I don't think we are hurt much by using a global lock. The
> problem for solvability is that we have a lock.
>
> Fundamentally there must be a lock taken to maintain the parent's
> list of children.
>
> I only see SIGQUIT being called once in the parent process so that
> should not be an issue.


In fork12, every child will call kill(0, SIGQUIT) at cleanup().
There are a lot of kill(0, SIGQUIT) calls.

>
> There is a minor issue in fork12 that it calls exit(0) instead of
> _exit(0) in the children. Not the problem you are dealing with
> but it does look like it can be a distraction.
>
> I suspect the issue really is the thundering herd of a million+
> processes synchronizing on a single lock.
>
> I don't think this is a hard lockup, just a global slow down.
> I expect everything will eventually exit.
>

But according to the vmcore, this is a hardlockup issue, and I think
the following scenario is possible:

rl = read_lock(tasklist_lock);
ru = read_unlock(tasklist_lock);
wl = write_lock_irq(tasklist_lock);
wu = write_unlock_irq(tasklist_lock);

        t0    t1    t2    t3    t4    t5    t6    t7    t8    ......
cpu0:   rl<--------------spend 1s--------------->ru
            // a fork12 calls kill(0, SIGQUIT) at t0 on cpu0, taking the
            // tasklist read lock at __kill_pgrp_info()

cpu1:   wl<---------wait lock------------>|<--get lock-->wu
            // a fork12 exits; it disables irqs and spins waiting for the
            // tasklist write lock at exit_notify() until cpu0 unlocks

cpu2:   rl<--------wait read lock------------------.....-->ru
            // a fork12 calls kill(0, SIGQUIT), spinning until cpu1 unlocks

cpu3:   wl<------------------------------......--------->wu
            // a fork12 exits, spinning until cpu2 unlocks

.....

cpux:   rl<-------------------......------------------->ru
            // a fork12 calls kill(0, SIGQUIT), spinning until the
            // previous cpu unlocks

cpux+1: wl<-------------------......------------------->wu
            // a fork12 exits, spinning until cpux unlocks; this cpu may
            // trigger a hardlockup if too many fork12 processes are
            // spinning to acquire the tasklist read/write lock


As shown above, fork12 takes a long time to send the signal to each child
process in __kill_pgrp_info(); the whole traversal takes more than a
second (300,000+ children).

While a fork12 process holds the tasklist read lock for over a second in
__kill_pgrp_info(), a large number of children call exit() and
kill(0, SIGQUIT); they alternately acquire the tasklist lock (a queued
rwlock) and spin in the wait queue.

Because each __kill_pgrp_info() caller in the queue takes so long, an
exiting process at the tail of the wait queue waits a very long time in
exit_notify(), which causes the hardlockup.
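
A rough back-of-the-envelope with the numbers above: one
__kill_pgrp_info() pass over 300,000+ children holds the read lock for
more than a second, and queued rwlock waiters are served roughly in
arrival order, so a writer that queues behind a handful of such readers
spins in write_lock_irq() with interrupts disabled for many seconds,
longer than the hardlockup detector's window (10s with the default
watchdog_thresh).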


>
> To do something about this is going to take a deep and fundamental
> redesign of how we maintain process lists to handle a parent
> with millions of children well.
>
> Is there any real world reason to care about this case? Without
> real world motivation I am inclined to just note that this is

I just found it while running the LTP tests.


thanks!

qiao.

> something that is handled poorly, and leave it as is.




>
> Eric
>
>> [...]

2022-04-22 19:24:05

by Eric W. Biederman

Subject: Re: Question about kill a process group

Zhang Qiao <[email protected]> writes:

> On 2022/4/13 23:47, Eric W. Biederman wrote:
>> To do something about this is going to take a deep and fundamental
>> redesign of how we maintain process lists to handle a parent
>> with millions of children well.
>>
>> Is there any real world reason to care about this case? Without
>> real world motivation I am inclined to just note that this is
>
> I just found it while running the LTP tests.

So I looked and fork12 has been around since 2002 in largely its
current form. So I am puzzled why you have run into problems
and other people have not.

Did you perhaps have lock debugging enabled?

Did you run on a very large machine where a ridiculous number of processes
could be created?

Did you happen to run fork12 on a machine where locks are much more
expensive than on most machines?


>> Is there a real world use case that connects to this?
>>
>> How many children are being created in this test? Several million?
>
> There are about 300,000+ processes.

Not as many as I was guessing, but still enough to cause a huge
wait on locks.

>> I would like to blame this on the old issue of tasklist_lock being
>> a global lock. Given the number of child processes (as many as can be
>> created) I don't think we are hurt much by using a global lock. The
>> problem for solvability is that we have a lock.
>>
>> Fundamentally there must be a lock taken to maintain the parent's
>> list of children.
>>
>> I only see SIGQUIT being called once in the parent process so that
>> should not be an issue.
>
>
> In fork12, every child will call kill(0, SIGQUIT) at cleanup().
> There are a lot of kill(0, SIGQUIT) calls.

I had missed that. I can see that stressing things out a lot.

At the same time as I read fork12.c that is very much a bug. The
children in fork12.c should call _exit() instead of exit(). Which
would suppress calling the atexit() handlers and let fork12.c
test what it is trying to test.
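
To illustrate the difference (a hypothetical example, not fork12
itself): atexit() handlers registered before fork() are inherited by the
child and run on exit(), but are skipped on _exit():

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static void cleanup(void)
	{
		/* in fork12 this is where the group-wide SIGQUIT lives */
		printf("cleanup from pid %d\n", getpid());
	}

	int main(void)
	{
		atexit(cleanup);
		if (fork() == 0)
			_exit(0);	/* child: atexit handlers skipped */
		wait(NULL);
		exit(0);		/* parent: cleanup() runs exactly once */
	}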

That doesn't mean there isn't a mystery here, but more that if
we really want to test lots of processes sending the same
signal at the same time, it should be a test that means to do that.


>> There is a minor issue in fork12 that it calls exit(0) instead of
>> _exit(0) in the children. Not the problem you are dealing with
>> but it does look like it can be a distraction.
>>
>> I suspect the issue really is the thundering herd of a million+
>> processes synchronizing on a single lock.
>>
>> I don't think this is a hard lockup, just a global slow down.
>> I expect everything will eventually exit.
>>
>
> But according to the vmcore, this is a hardlockup issue, and i think
> there may be the following scenarios:

Let me rewind a second. I just realized that I don't have a clue what
a hard lockup is (outside of the linux hard lockup detector).

The two kinds of lockups that I understand with a technical meaning are
deadlock (such as taking two locks in opposite orders which can never be
escaped), and livelock (where things are so busy no progress is made for
an extended period of time).

I meant to say this is not a deadlock situation. This looks like a
livelock, but I think given enough time the code would make progress and
get out of it.

I do agree over 1 second for holding a spin lock is ridiculous and a
denial of service attack.



What I unfortunately do not see is a real world scenario where this will
happen. Without a real world scenario it is hard to find motivation to
spend the year or so it would take to rework all of the data structures.
The closest I can imagine to a real world scenario is that this
situation can be used as a denial of service attack.

The hardest part of the problem is that signals sent to a group need to
be sent to the group atomically. That is, the signal needs to be sent to
every member of the group.

Anyway I am very curious why you are the only one seeing a problem with
fork12. That we can definitely investigate as tracking down what is
different about your setup versus other people who have run ltp seems
much easier than redesigning all of the signal processing data
structures from scratch.

Eric

2022-04-28 09:46:07

by Zhang Qiao

Subject: Re: Question about kill a process group


hi,

On 2022/4/22 0:12, Eric W. Biederman wrote:
> Zhang Qiao <[email protected]> writes:
>
>> On 2022/4/13 23:47, Eric W. Biederman wrote:
>>> To do something about this is going to take a deep and fundamental
>>> redesign of how we maintain process lists to handle a parent
>>> with millions of children well.
>>>
>>> Is there any real world reason to care about this case? Without
>>> real world motivation I am inclined to just note that this is
>>
>> I just found it while running the LTP tests.
>
> So I looked and fork12 has been around since 2002 in largely it's
> current form. So I am puzzled why you have run into problems
> and other people have not.
>
> Did you perhaps have lock debugging enabled?
>
> Did you run on a very large machine where a ridiculous number processes
> could be created?
>
> Did you happen to run fork12 on a machine where locks are much more
> expensive than on most machines?


I don't think so; I reproduced this problem on two servers with
different configurations. One of the servers is:

    CPU: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 CPUs
    RAM: 377 GB

Do you need any other information?

> [...]
>
> Anyway I am very curious why you are the only one seeing a problem with
> fork12. That we can definitely investigate as tracking down what is
> different about your setup versus other people who have run ltp seems
> much easier than redesigning all of the signal processing data
> structures from scratch.

The test steps are as follows:

1. git clone https://github.com/linux-test-project/ltp.git --depth=1
2. cd ltp/
3. make autotools
4. ./configure
5. cd testcases/kernel/syscalls/
6. make -j64
7. find ./ -type f -executable > newlist
8. while read line;do ./$line -I 30;done < newlist
9. After ten hours, I trigger Ctrl+C repeatedly.

>
> Eric
> .
>

2022-04-28 13:18:01

by Thomas Gleixner

Subject: Re: Question about kill a process group

On Thu, Apr 21 2022 at 11:12, Eric W. Biederman wrote:
> Zhang Qiao <[email protected]> writes:
>>> How many children are being created in this test? Several million?
>>
>> There are about 300,000+ processes.
>
> Not as many as I was guessing, but still enough to cause a huge
> wait on locks.

Indeed. It's about 4-5us per process to send the signal on a 2GHz
SKL-X. So with 200k processes the tasklist lock is read-held for 1 second.

> I do agree over 1 second for holding a spin lock is ridiculous and a
> denial of service attack.

Exactly. Even holding it for 100ms (20k forks) is daft.

So unless the number of PIDs for a user is limited this _is_ an
unprivileged DoS vector.

> Anyway I am very curious why you are the only one seeing a problem with
> fork12.

It's fully reproducible. It's just a question of how big the machine is and
what the PID limits are on the box you are testing on.

>>> I suspect the issue really is the thundering herd of a million+
>>> processes synchronizing on a single lock.

There are several issues:

 1) The parent sending the signal is holding the lock for an
    obscenely long time.

 2) Every signaled child runs into tasklist lock contention as all of
    them need to acquire it for write in do_exit(). That means within
    (NR_CPUS - 1) * 5usec all CPUs are spinning on tasklist lock with
    interrupts disabled up to the point where #1 has finished.

So depending on the number of children and the configured limits of a
lockup detector this is sufficient to trigger a warning.

Thanks,

tglx

2022-05-12 06:28:16

by Thomas Gleixner

Subject: Re: Question about kill a process group

On Wed, May 11 2022 at 13:33, Eric W. Biederman wrote:
> Thomas Gleixner <[email protected]> writes:
>> So unless the number of PIDs for a user is limited this _is_ an
>> unprivileged DoS vector.
>
> After having slept on this a bit it finally occurred to me the
> semi-obvious solution to this issue is to convert tasklist_lock
> from a rw-spinlock to rw-semaphore. The challenge is finding
> the users (tty layer?) that generate signals from interrupt
> context and redirect that signal generation.

From my outdated notes where I looked at this before:

[soft]interrupt context which acquires tasklist_lock:
  sysrq-e	send_sig_all()
  sysrq-i	send_sig_all()
  sysrq-n	normalize_rt_tasks()

tasklist_lock nesting into other locks:
  fs/fcntl.c: send_sigio(), send_sigurg()

  send_sigurg() is called from the network stack ...

  Some very obscure stuff in arch/ia64/kernel/mca.c which is called
  from a DIE notifier.

Plus quite a bunch of read_lock() instances which nest inside
rcu_read_lock() held sections.

This is probably incomplete, but the scope of the problem has been
greatly reduced vs. the point where I looked at it last time a couple of
years ago. But that's still a herculean task.

> Once signals holding tasklist_lock are no longer generated from
> interrupt context irqs no longer need to be disabled and
> after verifying tasklist_lock isn't held under any other spinlocks
> it can be converted to a semaphore.

Going to take a while. :)

> It won't help the signal delivery times, but it should reduce
> the effect on the rest of the system, and prevent watchdogs from
> firing.

The signal delivery time itself is the least of the worries, but this
still prevents any other operations which require tasklist_lock from
making progress for quite some time, i.e. fork/exec for unrelated
processes/users will have to wait too. So you converted the 'visible'
DoS to an 'invisible' one.

The real problem is that the scope of tasklist_lock is too broad for
most use cases. That does not change when you actually can convert it to
a rwsem. The underlying problem still persists.

Let's take a step back and look at what most sane use cases (sysrq-* is
not in that category) require:

    preventing that tasks are added or removed.

Do they require that no task at all is added or removed? No.

They require preventing add/remove only for the intended scope.

That's the thing we need to focus on: reducing the protection scope.

If we can segment the protection for the required scope of e.g. kill(2)
then we still can let unrelated processes/tasks make progress and just
inflict the damage on the affected portion of processes/tasks.

For example:

	read_lock(&tasklist_lock);
	for_each_process(p) {
		if (task_pid_vnr(p) > 1 &&
		    !same_thread_group(p, current)) {

			group_send_sig_info(...., p);
		}
	}
	read_unlock(&tasklist_lock);

same_thread_group() does:

	return p->signal == current->signal;

Ideally we can do:

	read_lock(&tasklist_lock);
	prevent_add_remove(current->signal);
	read_unlock(&tasklist_lock);

	rcu_read_lock();
	for_each_process(p) {
		if (task_pid_vnr(p) > 1 &&
		    !same_thread_group(p, current)) {

			group_send_sig_info(...., p);
		}
	}
	rcu_read_unlock();

	allow_add_remove(current->signal);

Where prevent_add_remove() sets a state which has to be waited for to be
cleared by anything which wants to add/remove a task in that scope or
change $relatedtask->signal until allow_add_remove() removes that
blocker. I'm sure it's way more complicated, but you get the idea.
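
One way such a blocker might be sketched (purely illustrative;
prevent_add_remove()/allow_add_remove() and the fields below are
hypothetical, not existing kernel API):

	/* hypothetical per-signal_struct add/remove blocker */
	struct signal_struct {
		/* ... existing fields ... */
		atomic_t		add_remove_blocked;
		wait_queue_head_t	add_remove_wq;
	};

	static void prevent_add_remove(struct signal_struct *sig)
	{
		atomic_inc(&sig->add_remove_blocked);
	}

	static void allow_add_remove(struct signal_struct *sig)
	{
		if (atomic_dec_and_test(&sig->add_remove_blocked))
			wake_up_all(&sig->add_remove_wq);
	}

	/* fork()/exit() in the affected scope would then have to wait: */
	static void wait_for_add_remove(struct signal_struct *sig)
	{
		wait_event(sig->add_remove_wq,
			   atomic_read(&sig->add_remove_blocked) == 0);
	}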

If we find a solution to this scope reduction problem, then it will not
only squash the issue which started this discussion. This will have a
benefit in general.

Thanks,

tglx

2022-05-13 12:16:04

by Eric W. Biederman

Subject: Re: Question about kill a process group

Thomas Gleixner <[email protected]> writes:

> On Wed, May 11 2022 at 13:33, Eric W. Biederman wrote:
>> Thomas Gleixner <[email protected]> writes:
>>> So unless the number of PIDs for a user is limited this _is_ an
>>> unprivileged DoS vector.
>>
>> After having slept on this a bit it finally occurred to me the
>> semi-obvious solution to this issue is to convert tasklist_lock
>> from a rw-spinlock to rw-semaphore. The challenge is finding
>> the users (tty layer?) that generate signals from interrupt
>> context and redirect that signal generation.
>
> From my outdated notes where I looked at this before:
>
> [soft]interrupt context which acquires tasklist_lock:
>   sysrq-e	send_sig_all()
>   sysrq-i	send_sig_all()
>   sysrq-n	normalize_rt_tasks()
>
> tasklist_lock nesting into other locks:
>   fs/fcntl.c: send_sigio(), send_sigurg()
>
>   send_sigurg() is called from the network stack ...
>
>   Some very obscure stuff in arch/ia64/kernel/mca.c which is called
>   from a DIE notifier.

I think we are very close to the point that if ia64 is the only user
problem case we can just do "git rm arch/ia64". I am not certain
there is even anyone left that cares enough to report breakage
on ia64.

> Plus quite a bunch of read_lock() instances which nest inside
> rcu_read_lock() held sections.
>
> This is probably incomplete, but the scope of the problem has been
> greatly reduced vs. the point where I looked at it last time a couple of
> years ago. But that's still a herculean task.

I won't argue.

>> Once signals holding tasklist_lock are no longer generated from
>> interrupt context irqs no longer need to be disabled and
>> after verifying tasklist_lock isn't held under any other spinlocks
>> it can be converted to a semaphore.
>
> Going to take a while. :)

It is a very tractable problem that people can work on incrementally.

>> It won't help the signal delivery times, but it should reduce
>> the effect on the rest of the system, and prevent watchdogs from
>> firing.
>
> The signal delivery time itself is the least of the worries, but this
> still prevents any other operations which require tasklist_lock from
> making progress for quite some time, i.e. fork/exec for unrelated
> processes/users will have to wait too. So you converted the 'visible'
> DoS to an 'invisible' one.
>
> The real problem is that the scope of tasklist_lock is too broad for
> most use cases. That does not change when you actually can convert it to
> a rwsem. The underlying problem still persists.
>
> Let's take a step back and look at what most sane use cases (sysrq-*
> is not in that category) require:
>
> Preventing tasks from being added or removed
>
> Do they require preventing that for every task in the system? No.
>
> They require preventing add/remove only within the intended scope.
>
> That's the thing we need to focus on: reducing the protection scope.
>
> If we can segment the protection for the required scope of e.g. kill(2)
> then we still can let unrelated processes/tasks make progress and just
> inflict the damage on the affected portion of processes/tasks.
>
> For example:
>
> read_lock(&tasklist_lock);
> for_each_process(p) {
>         if (task_pid_vnr(p) > 1 &&
>             !same_thread_group(p, current)) {
>                 group_send_sig_info(...., p);
>         }
> }
> read_unlock(&tasklist_lock);
>
> same_thread_group() does:
>
> return p->signal == current->signal;

Yes. So the sender cannot send a signal to itself.
Basically it is a test to see if a thread is a member of a process.

> Ideally we can do:
>
> read_lock(&tasklist_lock);
> prevent_add_remove(current->signal);
> read_unlock(&tasklist_lock);
>
> rcu_read_lock();
> for_each_process(p) {
>         if (task_pid_vnr(p) > 1 &&
>             !same_thread_group(p, current)) {
>                 group_send_sig_info(...., p);
>         }
> }
> rcu_read_unlock();
>
> allow_add_remove(current->signal);
>
> Where prevent_add_remove() sets a blocker state which anything that
> wants to add/remove a task in that scope, or change
> $relatedtask->signal, has to wait on until allow_add_remove() clears
> it. I'm sure it's way more complicated, but you get the idea.

Hmm.

For sending signals what is needed is the guarantee that the signal is
sent to an atomic snapshot of the appropriate group of processes, so
that SIGKILL sent to the group will reliably kill all of the
processes. It should be ok for a process to exit on its own from the
group, as long as it logically looks like the process exited before
the signal was sent.

There is also ptrace_attach/__ptrace_unlink, reparenting,
kill_orphaned_pgrp, zap_pid_ns_processes, and pid hash table
maintenance in release_task.

I have a patch I am playing with that protects task->parent and
task->real_parent with siglock, and with a little luck that can
be generalized so that sending signals to parents and ptrace don't
need tasklist_lock at all.

For reparenting of children the new parent's list of children
needs protection, but that should not need tasklist_lock.

For kill_orphaned_pgrp, with some additional per process group
maintenance state so that will_become_orphaned_pgrp and has_stopped_jobs
don't need to traverse the process group, it should be possible to
just have it send a process group signal.

zap_pid_ns_processes is already called without the tasklist_lock.

Maintenance of the pid hash table certainly needs a write lock in
__exit_signal but it doesn't need to be tasklist_lock.

Which is a long way of saying that, semantically, all we need is to
prevent addition to the group of processes a signal will be sent to. We
have one version of that prevention today in fork, where it tests
fatal_signal_pending after taking tasklist_lock and siglock. For the
case you are describing the code would just need to check each group of
processes the new process is put into.
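
For reference, that existing check looks roughly like this (condensed
from copy_process() in kernel/fork.c):

        write_lock_irq(&tasklist_lock);
        /* ... */
        spin_lock(&current->sighand->siglock);

        /* Let kill terminate clone/fork in the middle */
        if (fatal_signal_pending(current)) {
                retval = -EINTR;
                goto bad_fork_cancel_cgroup;
        }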


Hmm.

When I boil it all down in my head I wind up with something like:

rwlock_t *lock = signal_lock(enum pid_type type);
read_lock(lock);
/* Do the work of sending the signal */
read_unlock(lock);

With fork needing to grab all of those possible locks for write
as it adds the process to the group.
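
Hand-waving over lock ordering, the fork side could then look roughly
like this, with signal_lock() being the hypothetical helper above:

        enum pid_type type;

        /* Block group-signal senders while the new task is added */
        for (type = PIDTYPE_TGID; type < PIDTYPE_MAX; type++)
                write_lock(signal_lock(type));

        /* ... attach the new task to its tgid/pgid/sid groups ... */

        for (type = PIDTYPE_TGID; type < PIDTYPE_MAX; type++)
                write_unlock(signal_lock(type));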

Maybe it could be something like:

struct group_signal {
        struct hlist_node node;
        struct kernel_siginfo *info;
};

void start_group_signal(struct group_signal *group,
                        struct kernel_siginfo *info,
                        enum pid_type type);
void end_group_signal(struct group_signal *group);

struct group_signal group_sig;
start_group_signal(&group_sig, info, PIDTYPE_PGID);

/* Walk through the list and deliver the signal */

end_group_signal(&group_sig);

That would allow fork to see all signals that are being delivered to a
group even if the signal has not been delivered to the parent process
yet. At which point the signal could be delivered to the parent before
the fork. I just need something to ensure that the signal delivery loop
between start_group_signal and end_group_signal skips processes that
hurried up and delivered the signal to themselves, and does not
deliver to newly added processes. A generation counter perhaps.
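
The generation check could be shaped something like this; every name
here is invented (a per-group 'signal_gen', a per-task
'seen_signal_gen' which fork would initialize to the group's current
generation, and the walk helper):

        u64 gen = atomic64_inc_return(&group->signal_gen);

        for_each_group_member(group, p) {
                /* Skip tasks that self-delivered or were added later */
                if (READ_ONCE(p->seen_signal_gen) >= gen)
                        continue;
                group_send_sig_info(...., p);
                WRITE_ONCE(p->seen_signal_gen, gen);
        }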

There is a lot to flesh out, and I am buried alive in other cleanups,
but I think that could work and remove the need to hold tasklist_lock
during signal delivery.


> If we find a solution to this scope reduction problem, then it will
> not only squash the issue which started this discussion; it will be a
> benefit in general.

We need to go farther than simple scope reduction to benefit the
original problem: all of the processes in that problem were sending a
signal to the same process group, so they all needed to wait for each
other.

If we need to block adds, then the adds effectively need to take a
write_lock against the read_lock taken during signal delivery. Because
all of the blocking is the same, we won't see an improvement in the
original problem.

If, in addition to scope reduction, a barrier is implemented so that
it is guaranteed that past a certain point processes will see the
signal before they fork (or do anything else by which userspace could
tell the signal was not delivered atomically), then I think we can
eliminate blocking in the same places and see an improvement in the
issue that started this discussion.


I will queue it up on my list of things I would like to do. I am
buried in other signal related cleanups at the moment so I don't know
when I will be able to get to anything like that. But I really
appreciate the idea.

Eric


2022-05-13 17:23:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Question about kill a process group

Thomas Gleixner <[email protected]> writes:

> On Thu, Apr 21 2022 at 11:12, Eric W. Biederman wrote:
>> Zhang Qiao <[email protected]> writes:
>>>> How many children are being created in this test? Several million?
>>>
>>> There are about 300,000+ processes.
>>
>> Not as many as I was guessing, but still enough to cause a huge
>> wait on locks.
>
> Indeed. It's about 4-5us per process to send the signal on a 2GHz
> SKL-X. So with 200k processes tasklist_lock is read held for 1 second.
>
>> I do agree that holding a spin lock for over 1 second is ridiculous
>> and a denial of service attack.
>
> Exactly. Even holding it for 100ms (20k forks) is daft.
>
> So unless the number of PIDs for a user is limited this _is_ an
> unprivileged DoS vector.

After having slept on this a bit it finally occurred to me that the
semi-obvious solution to this issue is to convert tasklist_lock
from a rw-spinlock to a rw-semaphore. The challenge is finding
the users (tty layer?) that generate signals from interrupt
context and redirecting that signal generation.

Once signals holding tasklist_lock are no longer generated from
interrupt context, irqs no longer need to be disabled, and after
verifying tasklist_lock isn't held under any other spinlocks
it can be converted to a semaphore.
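
Purely to illustrate the end state (tasklist_lock is an rwlock_t
today; the rwsem variant below is hypothetical):

        /* today: rw-spinlock, readers disable irqs and spin */
        read_lock_irq(&tasklist_lock);
        /* ... */
        read_unlock_irq(&tasklist_lock);

        /* after the conversion: contended lockers sleep instead */
        down_read(&tasklist_lock);
        /* ... */
        up_read(&tasklist_lock);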

It won't help the signal delivery times, but it should reduce
the effect on the rest of the system, and prevent watchdogs from
firing.

I don't know if I have time to do any of that now, but it does seem a
reasonable direction to move the code in.

Eric

2022-09-26 07:46:26

by Zhang Qiao

[permalink] [raw]
Subject: Re: Question about kill a process group



On 2022/5/13 2:23, Eric W. Biederman wrote:
> Thomas Gleixner <[email protected]> writes:
>
>> On Wed, May 11 2022 at 13:33, Eric W. Biederman wrote:
>>> Thomas Gleixner <[email protected]> writes:
>>>> So unless the number of PIDs for a user is limited this _is_ an
>>>> unprivileged DoS vector.
>>>
>>> After having slept on this a bit it finally occurred to me that the
>>> semi-obvious solution to this issue is to convert tasklist_lock
>>> from a rw-spinlock to a rw-semaphore. The challenge is finding
>>> the users (tty layer?) that generate signals from interrupt
>>> context and redirecting that signal generation.
>>
>> From my outdated notes where I looked at this before:
>>
>> [soft]interrupt context which acquires tasklist_lock:
>>         sysrq-e    send_sig_all()
>>         sysrq-i    send_sig_all()
>>         sysrq-n    normalize_rt_tasks()
>>
>> tasklist_lock nesting into other locks:
>>         fs/fcntl.c: send_sigio(), send_sigurg()
>>
>> send_sigurg() is called from the network stack ...
>>
>> Some very obscure stuff in arch/ia64/kernel/mca.c which is called
>> from a DIE notifier.
>
> I think we are very close to the point that if ia64 is the only user
> problem case we can just do "git rm arch/ia64". I am not certain
> there is even anyone left that cares enough to report breakage
> on ia64.
>
>> Plus quite a bunch of read_lock() instances which nest inside
>> rcu_read_lock() held sections.
>>
>> This is probably incomplete, but the scope of the problem has been
>> greatly reduced vs. the point where I looked at it last time a couple of
>> years ago. But that's still a herculean task.
>
> I won't argue.
>
>>> Once signals holding tasklist_lock are no longer generated from
>>> interrupt context, irqs no longer need to be disabled, and after
>>> verifying tasklist_lock isn't held under any other spinlocks
>>> it can be converted to a semaphore.
>>
>> Going to take a while. :)
>
> It is a very tractable problem that people can work on incrementally.
>
>>> It won't help the signal delivery times, but it should reduce
>>> the effect on the rest of the system, and prevent watchdogs from
>>> firing.
>>
>> The signal delivery time itself is the least of the worries, but this
>> still prevents any other operations which require tasklist_lock from
>> making progress for quite some time, i.e. fork/exec for unrelated
>> processes/users will have to wait too. So you converted the 'visible'
>> DoS to an 'invisible' one.
>>
>> The real problem is that the scope of tasklist_lock is too broad for
>> most use cases. That does not change when you actually can convert it to
>> a rwsem. The underlying problem still persists.
>>
>> Let's take a step back and look at what most sane use cases (sysrq-*
>> is not in that category) require:
>>
>> Preventing tasks from being added or removed
>>
>> Do they require preventing that for every task in the system? No.
>>
>> They require preventing add/remove only within the intended scope.
>>
>> That's the thing we need to focus on: reducing the protection scope.
>>
>> If we can segment the protection for the required scope of e.g. kill(2)
>> then we still can let unrelated processes/tasks make progress and just
>> inflict the damage on the affected portion of processes/tasks.
>>
>> For example:
>>
>> read_lock(&tasklist_lock);
>> for_each_process(p) {
>>         if (task_pid_vnr(p) > 1 &&
>>             !same_thread_group(p, current)) {
>>                 group_send_sig_info(...., p);
>>         }
>> }
>> read_unlock(&tasklist_lock);
>>
>> same_thread_group() does:
>>
>> return p->signal == current->signal;
>
> Yes. So the sender cannot send a signal to itself.
> Basically it is a test to see if a thread is a member of a process.
>
>> Ideally we can do:
>>
>> read_lock(&tasklist_lock);
>> prevent_add_remove(current->signal);
>> read_unlock(&tasklist_lock);
>>
>> rcu_read_lock();
>> for_each_process(p) {
>>         if (task_pid_vnr(p) > 1 &&
>>             !same_thread_group(p, current)) {
>>                 group_send_sig_info(...., p);
>>         }
>> }
>> rcu_read_unlock();
>>
>> allow_add_remove(current->signal);
>>
>> Where prevent_add_remove() sets a blocker state which anything that
>> wants to add/remove a task in that scope, or change
>> $relatedtask->signal, has to wait on until allow_add_remove() clears
>> it. I'm sure it's way more complicated, but you get the idea.
>
> Hmm.
>
> For sending signals what is needed is the guarantee that the signal is
> sent to an atomic snapshot of the appropriate group of processes, so
> that SIGKILL sent to the group will reliably kill all of the
> processes. It should be ok for a process to exit on its own from the
> group, as long as it logically looks like the process exited before
> the signal was sent.
>
> There is also ptrace_attach/__ptrace_unlink, reparenting,
> kill_orphaned_pgrp, zap_pid_ns_processes, and pid hash table
> maintenance in release_task.
>
> I have a patch I am playing with that protects task->parent and
> task->real_parent with siglock, and with a little luck that can
> be generalized so that sending signals to parents and ptrace don't
> need tasklist_lock at all.
>
> For reparenting of children the new parent's list of children
> needs protection, but that should not need tasklist_lock.
>
> For kill_orphaned_pgrp, with some additional per process group
> maintenance state so that will_become_orphaned_pgrp and has_stopped_jobs
> don't need to traverse the process group, it should be possible to
> just have it send a process group signal.
>
> zap_pid_ns_processes is already called without the tasklist_lock.
>
> Maintenance of the pid hash table certainly needs a write lock in
> __exit_signal but it doesn't need to be tasklist_lock.
>
> Which is a long way of saying that, semantically, all we need is to
> prevent addition to the group of processes a signal will be sent to. We
> have one version of that prevention today in fork, where it tests
> fatal_signal_pending after taking tasklist_lock and siglock. For the
> case you are describing the code would just need to check each group of
> processes the new process is put into.
>
>
> Hmm.
>
> When I boil it all down in my head I wind up with something like:
>
> rwlock_t *lock = signal_lock(enum pid_type type);
> read_lock(lock);
> /* Do the work of sending the signal */
> read_unlock(lock);
>
> With fork needing to grab all of those possible locks for write
> as it adds the process to the group.
>
> Maybe it could be something like:
>
> struct group_signal {
>         struct hlist_node node;
>         struct kernel_siginfo *info;
> };
>
> void start_group_signal(struct group_signal *group,
>                         struct kernel_siginfo *info,
>                         enum pid_type type);
> void end_group_signal(struct group_signal *group);
>
> struct group_signal group_sig;
> start_group_signal(&group_sig, info, PIDTYPE_PGID);
>
> /* Walk through the list and deliver the signal */
>
> end_group_signal(&group_sig);
>
> That would allow fork to see all signals that are being delivered to a
> group even if the signal has not been delivered to the parent process
> yet. At which point the signal could be delivered to the parent before
> the fork. I just need something to ensure that the signal delivery loop
> between start_group_signal and end_group_signal skips processes that
> hurried up and delivered the signal to themselves, and does not
> deliver to newly added processes. A generation counter perhaps.
>
> There is a lot to flesh out, and I am buried alive in other cleanups,
> but I think that could work and remove the need to hold tasklist_lock
> during signal delivery.
>
>
>> If we find a solution to this scope reduction problem, then it will
>> not only squash the issue which started this discussion; it will be a
>> benefit in general.
>
> We need to go farther than simple scope reduction to benefit the
> original problem: all of the processes in that problem were sending a
> signal to the same process group, so they all needed to wait for each
> other.
>
> If we need to block adds, then the adds effectively need to take a
> write_lock against the read_lock taken during signal delivery. Because
> all of the blocking is the same, we won't see an improvement in the
> original problem.
>
> If, in addition to scope reduction, a barrier is implemented so that
> it is guaranteed that past a certain point processes will see the
> signal before they fork (or do anything else by which userspace could
> tell the signal was not delivered atomically), then I think we can
> eliminate blocking in the same places and see an improvement in the
> issue that started this discussion.
>
>
> I will queue it up on my list of things I would like to do. I am


Hi Eric,

Do you have any plans to fix this? I look forward to your patches.

Thanks!
-Zhang Qiao



> buried in other signal related cleanups at the moment so I don't know
> when I will be able to get to anything like that. But I really
> appreciate the idea.
>
> Eric