Hello,
We are facing a soft lockup on our systems which appears to be related
to RCU scheduling.
The bug appears as high CPU usage. Dmesg shows a soft lockup
associated with "zap_pid_ns_processes". I have confirmed the behavior on
the 5.15 and 6.8 kernels.
This example was taken from an Ubuntu 22.04 VM running in a hyper-v
environment.
rachel@ubuntu:~$ uname -a
Linux ubuntu 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC
2024 x86_64 x86_64 x86_64 GNU/Linux
dmesg snippet:
watchdog: BUG: soft lockup - CPU#0 stuck for 212s! [npm start:306207]
Modules linked in: veth nf_conntrack_netlink xt_conntrack nft_chain_nat
xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables
nfnetlink binfmt_misc nls_iso8859_1 intel_rapl_msr serio_raw
intel_rapl_common hyperv_fb hv_balloon joydev mac_hid sch_fq_codel
dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua overlay
iptable_filter ip6table_filter ip6_tables br_netfilter bridge stp llc
arp_tables msr efi_pstore ip_tables x_tables autofs4 btrfs
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
raid0 multipath linear hyperv_drm drm_kms_helper syscopyarea sysfillrect
sysimgblt fb_sys_fops crct10dif_pclmul cec hv_storvsc crc32_pclmul
hid_generic hv_netvsc ghash_clmulni_intel scsi_transport_fc rc_core
sha256_ssse3 hid_hyperv drm sha1_ssse3 hv_utils hid hyperv_keyboard
aesni_intel crypto_simd cryptd hv_vmbus
CPU: 0 PID: 306207 Comm: npm start Tainted: G L
5.15.0-107-generic #117-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine,
BIOS Hyper-V UEFI Release v4.1 04/06/2022
RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
Code: eb 8d cc cc cc 0f 1f 44 00 00 55 48 89 e5 e8 3a b8 36 ff 66 90 f7
c6 00 02 00 00 75 06 5d e9 e2 cb 22 00 fb 66 0f 1f 44 00 00 <5d> e9 d5
cb 22 00 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 8b 07
RSP: 0018:ffffb15fc915bc60 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffb15fc915bcf8 RCX: 0000000000000000
RDX: ffff9d4713f9c828 RSI: 0000000000000246 RDI: ffff9d4713f9c820
RBP: ffffb15fc915bc60 R08: ffff9d4713f9c828 R09: ffff9d4713f9c828
R10: 0000000000000228 R11: ffffb15fc915bcf0 R12: ffff9d4713f9c820
R13: 0000000000000004 R14: ffff9d47305a9980 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9d4643c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd63a1b6008 CR3: 0000000288bd6003 CR4: 0000000000370ef0
Call Trace:
<IRQ>
? show_trace_log_lvl+0x1d6/0x2ea
? show_trace_log_lvl+0x1d6/0x2ea
? add_wait_queue+0x6b/0x80
? show_regs.part.0+0x23/0x29
? show_regs.cold+0x8/0xd
? watchdog_timer_fn+0x1be/0x220
? lockup_detector_update_enable+0x60/0x60
? __hrtimer_run_queues+0x107/0x230
? read_hv_clock_tsc_cs+0x9/0x30
? hrtimer_interrupt+0x101/0x220
? hv_stimer0_isr+0x20/0x30
? __sysvec_hyperv_stimer0+0x32/0x70
? sysvec_hyperv_stimer0+0x7b/0x90
</IRQ>
<TASK>
? asm_sysvec_hyperv_stimer0+0x1b/0x20
? _raw_spin_unlock_irqrestore+0x25/0x30
add_wait_queue+0x6b/0x80
do_wait+0x52/0x310
kernel_wait4+0xaf/0x150
? thread_group_exited+0x50/0x50
zap_pid_ns_processes+0x111/0x1a0
forget_original_parent+0x348/0x360
exit_notify+0x4a/0x210
do_exit+0x24f/0x3c0
do_group_exit+0x3b/0xb0
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0x1937/0x1fa0
do_syscall_64+0x56/0xb0
? do_user_addr_fault+0x1e7/0x670
? exit_to_user_mode_prepare+0x37/0xb0
? irqentry_exit_to_user_mode+0x17/0x20
? irqentry_exit+0x1d/0x30
? exc_page_fault+0x89/0x170
entry_SYSCALL_64_after_hwframe+0x67/0xd1
RIP: 0033:0x7f60019daf8e
Code: Unable to access opcode bytes at RIP 0x7f60019daf64.
RSP: 002b:00007fff2812a468 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f5ffeda01b0 RCX: 00007f60019daf8e
RDX: 00007f6001a560c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007fff2812a4b0 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 00007f60016f4a90 R14: 0000000000000000 R15: 00007f5ffede4d50
</TASK>
Looking at the running processes, there are zombie processes:
root@ubuntu:/home/rachel# ps aux | grep Z
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
rachel 305832 0.5 0.0 0 0 ? Zsl 01:55 0:00 [npm
start] <defunct>
rachel 308234 0.3 0.0 0 0 ? Zl 01:55 0:00 [npm
run zombie] <defunct>
rachel 308987 0.0 0.0 0 0 ? Z 01:55 0:00 [sh]
<defunct>
root 345328 0.0 0.0 6480 2220 pts/5 S+ 01:56 0:00 grep
--color=auto Z
"308234" zombie thread group shows a thread is stuck on
synchronize_rcu_expedited
root@ubuntu:/home/rachel# ls /proc/308234/task
308234 308312
root@ubuntu:/home/rachel# cat /proc/308312/stack
[<0>] exp_funnel_lock+0x1eb/0x230
[<0>] synchronize_rcu_expedited+0x6d/0x1b0
[<0>] namespace_unlock+0xd6/0x1b0
[<0>] put_mnt_ns+0x74/0xa0
[<0>] free_nsproxy+0x1c/0x1b0
[<0>] switch_task_namespaces+0x5e/0x70
[<0>] exit_task_namespaces+0x10/0x20
[<0>] do_exit+0x212/0x3c0
[<0>] io_sq_thread+0x457/0x5b0
[<0>] ret_from_fork+0x22/0x30
To consistently reproduce the issue, disable "CONFIG_PREEMPT_RCU". It is
unclear whether enabling it completely prevents the issue, but the problem is
much easier to reproduce with preemption off. I was able to reproduce it on
Ubuntu 22.04 (5.15.0-107-generic) and Ubuntu 24.04 (6.8.0-30-generic). There
are 2 methods of reproducing. Both methods are hosted at
https://github.com/rlmenge/rcu-soft-lock-issue-repro .
Repro using npm and docker:
Get the script here:
https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/main/rcu-npm-repro.sh
# pull the image first so that the script doesn't keep pulling images
$ sudo docker run telescope.azurecr.io/issue-repro/zombie:v1.1.11
$ sudo ./rcu-npm-repro.sh
This script creates several containers. Each container runs in new pid
and mount namespaces. The container's entrypoint is `npm run task && npm
start`.
npm run task: runs `npm run zombie & npm run done`.
npm run zombie: runs `while true; do echo zombie; sleep 1; done`, an
infinite loop that prints "zombie".
npm run done: runs `echo done`; a short-lived process.
npm start: also a short-lived process; it exits after a few seconds.
When `npm start` exits, the process tree in that pid namespace will look like:
npm start (pid 1)
|__npm run zombie
|__ sh -c "while true; do echo zombie; sleep 1; done"
Repro using golang:
Use the go module found here:
https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/main/rcudeadlock.go
Run
$ go mod init rcudeadlock.go
$ go mod tidy
$ CGO_ENABLED=0 go build -o ./rcudeadlock ./
$ sudo ./rcudeadlock
This Go program simulates the npm reproducer without requiring docker as a
dependency. The binary re-execs itself to support multiple subcommands. It
also sets up processes in new pid and mount namespaces via unshare, since
`put_mnt_ns` is a critical kernel code path for reproducing this issue; both
mount and pid namespaces are required in this issue.
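For reference, a minimal C sketch of that namespace setup step. This is only
an illustration of unshare(CLONE_NEWPID | CLONE_NEWNS), not the reproducer's
actual code, and it needs to run as root:
```
/* Minimal sketch: enter new pid and mount namespaces, then fork a child
 * that becomes pid 1 of the new pid namespace. Run as root. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* CLONE_NEWPID only affects children created after unshare(). */
	if (unshare(CLONE_NEWPID | CLONE_NEWNS) != 0) {
		perror("unshare");
		return 1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		/* This child is pid 1 in the new pid namespace. */
		execlp("sh", "sh", "-c", "echo pid in new ns: $$", (char *)NULL);
		_exit(127);
	}
	waitpid(pid, NULL, 0);
	return 0;
}
```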
The entrypoint in the new pid and mount namespaces is `rcudeadlock task &&
rcudeadlock start`.
rcudeadlock task: runs `rcudeadlock zombie & rcudeadlock done`.
rcudeadlock zombie: runs `bash -c "while true; do echo zombie; sleep 1;
done"`, an infinite loop that prints "zombie".
rcudeadlock done: prints "done" and exits.
rcudeadlock start: prints `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA` 10 times
and exits.
When `rcudeadlock start` exits, the process tree in that pid namespace
will look like:
rcudeadlock start (pid 1)
|__rcudeadlock zombie
|__bash -c "while true; do echo zombie; sleep 1; done".
Each rcudeadlock process sets up 4 idle io_uring threads before handling
its command (`task`, `zombie`, `done`, or `start`). That is similar to the
npm reproducer. We are not sure whether io_uring itself is required, but with
idle io_uring threads it is easy to reproduce this issue.
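For reference, a rough C/liburing sketch of how an idle io_uring SQPOLL thread
(the io_sq_thread seen in the stack above) can be created. This is only an
assumption about the mechanism, not the reproducer's actual code:
```
/* Rough sketch (liburing): create one idle SQPOLL io_uring thread.
 * Build: gcc sqpoll.c -o sqpoll -luring */
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p;
	int ret;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;	/* kernel spawns an io_sq_thread for this ring */
	p.sq_thread_idle = 2000;	/* ms of inactivity before the thread goes idle */

	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
		return 1;
	}

	/* Keep the process alive so the idle SQPOLL thread can be observed. */
	sleep(30);
	io_uring_queue_exit(&ring);
	return 0;
}
```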
Thank you,
Rachel
Add Eric.
Well, due to unfortunate design zap_pid_ns_processes() can hang "forever"
if this namespace has a (zombie) task injected from the parent ns, this
task should be reaped by its parent.
But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
sleep in kernel_wait4().
Any chance you can test the patch below? This patch makes sense anyway,
I'll send it later. But I am not sure it can fix your problem.
Oleg.
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index dc48fecfa1dc..25f3cf679b35 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
*/
do {
clear_thread_flag(TIF_SIGPENDING);
+ clear_thread_flag(TIF_NOTIFY_SIGNAL);
rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
On 06/05, Rachel Menge wrote:
>
> Hello,
>
> We are facing a soft lockup on our systems which appears to be related to
> rcu scheduling.
>
> The bug appears as high CPU usage. Dmesg shows a soft lock which is
> associated with "zap_pid_ns_processes". I have confirmed the behavior on
> 5.15 and 6.8 kernels.
>
> This example was taken from an Ubuntu 22.04 VM running in a hyper-v
> environment.
> rachel@ubuntu:~$ uname -a
> Linux ubuntu 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024
> x86_64 x86_64 x86_64 GNU/Linux
>
> dmesg snippet:
> watchdog: BUG: soft lockup - CPU#0 stuck for 212s! [npm start:306207]
> Modules linked in: veth nf_conntrack_netlink xt_conntrack nft_chain_nat
> xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user
> xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink binfmt_misc
> nls_iso8859_1 intel_rapl_msr serio_raw intel_rapl_common hyperv_fb
> hv_balloon joydev mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc
> scsi_dh_alua overlay iptable_filter ip6table_filter ip6_tables br_netfilter
> bridge stp llc arp_tables msr efi_pstore ip_tables x_tables autofs4 btrfs
> blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
> async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
> linear hyperv_drm drm_kms_helper syscopyarea sysfillrect sysimgblt
> fb_sys_fops crct10dif_pclmul cec hv_storvsc crc32_pclmul hid_generic
> hv_netvsc ghash_clmulni_intel scsi_transport_fc rc_core sha256_ssse3
> hid_hyperv drm sha1_ssse3 hv_utils hid hyperv_keyboard aesni_intel
> crypto_simd cryptd hv_vmbus
> CPU: 0 PID: 306207 Comm: npm start Tainted: G L
> 5.15.0-107-generic #117-Ubuntu
> Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS
> Hyper-V UEFI Release v4.1 04/06/2022
> RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
> Code: eb 8d cc cc cc 0f 1f 44 00 00 55 48 89 e5 e8 3a b8 36 ff 66 90 f7 c6
> 00 02 00 00 75 06 5d e9 e2 cb 22 00 fb 66 0f 1f 44 00 00 <5d> e9 d5 cb 22 00
> 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 8b 07
> RSP: 0018:ffffb15fc915bc60 EFLAGS: 00000206
> RAX: 0000000000000001 RBX: ffffb15fc915bcf8 RCX: 0000000000000000
> RDX: ffff9d4713f9c828 RSI: 0000000000000246 RDI: ffff9d4713f9c820
> RBP: ffffb15fc915bc60 R08: ffff9d4713f9c828 R09: ffff9d4713f9c828
> R10: 0000000000000228 R11: ffffb15fc915bcf0 R12: ffff9d4713f9c820
> R13: 0000000000000004 R14: ffff9d47305a9980 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff9d4643c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd63a1b6008 CR3: 0000000288bd6003 CR4: 0000000000370ef0
> Call Trace:
> <IRQ>
> ? show_trace_log_lvl+0x1d6/0x2ea
> ? show_trace_log_lvl+0x1d6/0x2ea
> ? add_wait_queue+0x6b/0x80
> ? show_regs.part.0+0x23/0x29
> ? show_regs.cold+0x8/0xd
> ? watchdog_timer_fn+0x1be/0x220
> ? lockup_detector_update_enable+0x60/0x60
> ? __hrtimer_run_queues+0x107/0x230
> ? read_hv_clock_tsc_cs+0x9/0x30
> ? hrtimer_interrupt+0x101/0x220
> ? hv_stimer0_isr+0x20/0x30
> ? __sysvec_hyperv_stimer0+0x32/0x70
> ? sysvec_hyperv_stimer0+0x7b/0x90
> </IRQ>
> <TASK>
> ? asm_sysvec_hyperv_stimer0+0x1b/0x20
> ? _raw_spin_unlock_irqrestore+0x25/0x30
> add_wait_queue+0x6b/0x80
> do_wait+0x52/0x310
> kernel_wait4+0xaf/0x150
> ? thread_group_exited+0x50/0x50
> zap_pid_ns_processes+0x111/0x1a0
> forget_original_parent+0x348/0x360
> exit_notify+0x4a/0x210
> do_exit+0x24f/0x3c0
> do_group_exit+0x3b/0xb0
> __x64_sys_exit_group+0x18/0x20
> x64_sys_call+0x1937/0x1fa0
> do_syscall_64+0x56/0xb0
> ? do_user_addr_fault+0x1e7/0x670
> ? exit_to_user_mode_prepare+0x37/0xb0
> ? irqentry_exit_to_user_mode+0x17/0x20
> ? irqentry_exit+0x1d/0x30
> ? exc_page_fault+0x89/0x170
> entry_SYSCALL_64_after_hwframe+0x67/0xd1
> RIP: 0033:0x7f60019daf8e
> Code: Unable to access opcode bytes at RIP 0x7f60019daf64.
> RSP: 002b:00007fff2812a468 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> RAX: ffffffffffffffda RBX: 00007f5ffeda01b0 RCX: 00007f60019daf8e
> RDX: 00007f6001a560c0 RSI: 0000000000000000 RDI: 0000000000000001
> RBP: 00007fff2812a4b0 R08: 0000000000000024 R09: 0000000800000000
> R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
> R13: 00007f60016f4a90 R14: 0000000000000000 R15: 00007f5ffede4d50
> </TASK>
>
> Looking at the running processes, there are zombie threads
> root@ubuntu:/home/rachel# ps aux | grep Z
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> rachel 305832 0.5 0.0 0 0 ? Zsl 01:55 0:00 [npm
> start] <defunct>
> rachel 308234 0.3 0.0 0 0 ? Zl 01:55 0:00 [npm run
> zombie] <defunct>
> rachel 308987 0.0 0.0 0 0 ? Z 01:55 0:00 [sh]
> <defunct>
> root 345328 0.0 0.0 6480 2220 pts/5 S+ 01:56 0:00 grep
> --color=auto Z
>
> "308234" zombie thread group shows a thread is stuck on
> synchronize_rcu_expedited
> root@ubuntu:/home/rachel# ls /proc/308234/task
> 308234 308312
> root@ubuntu:/home/rachel# cat /proc/308312/stack
> [<0>] exp_funnel_lock+0x1eb/0x230
> [<0>] synchronize_rcu_expedited+0x6d/0x1b0
> [<0>] namespace_unlock+0xd6/0x1b0
> [<0>] put_mnt_ns+0x74/0xa0
> [<0>] free_nsproxy+0x1c/0x1b0
> [<0>] switch_task_namespaces+0x5e/0x70
> [<0>] exit_task_namespaces+0x10/0x20
> [<0>] do_exit+0x212/0x3c0
> [<0>] io_sq_thread+0x457/0x5b0
> [<0>] ret_from_fork+0x22/0x30
>
> To consistently reproduce the issue, disable "CONFIG_PREEMPT_RCU". It is
> unclear if this completely prevents the issue, but it is much easier to
> reproduce with preemption off. I was able to reproduce on the Ubuntu 22.04
> 5.15.0-107-generic and 24.04 6.8.0-30-generic. There are 2 methods of
> reproducing. Both methods are hosted at
> https://github.com/rlmenge/rcu-soft-lock-issue-repro .
>
> Repro using npm and docker:
> Get the script here: https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/main/rcu-npm-repro.sh
> # get image so that script doesn't keep pulling for images
> $ sudo docker run telescope.azurecr.io/issue-repro/zombie:v1.1.11
> $ sudo ./rcu-npm-repro.sh
>
> This script creates several containers. Each container runs in new pid and
> mount namespaces. The container's entrypoint is `npm run task && npm start`.
> npm run task: This command is to run `npm run zombie & npm run done`
> command.
> npm run zombie: It's to run `while true; do echo zombie; sleep 1; done`.
> Infinite loop to print zombies.
> npm run done: It's to run `echo done`. Short live process.
> npm start: It's also a short live process. It will exit in a few seconds.
>
> When `npm start` exits, the process tree in that pid namespace will be like
> npm start (pid 1)
> |__npm run zombie
> |__ sh -c "whle true; do echo zombie; sleep 1; done"
>
> Repro using golang:
> Use the go module found here:
> https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/main/rcudeadlock.go
>
> Run
> $ go mod init rcudeadlock.go
> $ go mod tidy
> $ CGO_ENABLED=0 go build -o ./rcudeadlock ./
> $ sudo ./rcudeadlock
>
> This golang program is to simulate the npm reproducer without involving
> docker as dependency. This binary is using re-exec self to support multiple
> subcommands. It also sets up processes in new pid and mount namespaces by
> unshare, since the `put_mnt_ns` is a critical code path in the kernel to
> reproduce this issue. Both mount and pid namespaces are required in this
> issue.
>
> The entrypoint of new pid and mount namespaces is `rcudeadlock task &&
> rcudeadlock start`.
> rcudeadlock task: This command is to run `rcudeadlock zombie & rcudeadlock
> done`
> rcudeadlock zombie: It's to run `bash -c "while true; do echo zombie; sleep
> 1; done"`. Infinite loop to print zombies.
> rcudeadlock done: Prints done and exits.
> rcudeadlock start: Prints `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA` 10 times and
> exits.
>
> When `rcudeadlock start` exits, the process tree in that pid namespace will
> be like
> rcudeadlock start (pid 1)
> |__rcudeadlock zombie
> |__bash -c "while true; do echo zombie; sleep 1; done".
>
> Each rcudeadlock process will set up 4 idle io_uring threads before handling
> commands, like `task`, `zombie`, `done` and `start`. That is similar to npm
> reproducer. Not sure that it's related to io_uring. But with io_uring idle
> threads, it's easy to reproduce this issue.
>
> Thank you,
> Rachel
>
Hi!
> Add Eric.
>
> Well, due to unfortunate design zap_pid_ns_processes() can hang "forever"
> if this namespace has a (zombie) task injected from the parent ns, this
> task should be reaped by its parent.
That zombie task was cloned by the pid-1 process in that pid namespace. In my
latest reproduction, the process tree in that pid namespace looks like
```
# unshare(CLONE_NEWPID | CLONE_NEWNS)
npm start (pid 2522045)
|__npm run zombie (pid 2522605)
|__ sh -c "while true; do echo zombie; sleep 1; done" (pid 2522869)
```
The `npm start (pid 2522045)` process was stuck in kernel_wait4. Its child,
`npm run zombie (pid 2522605)`, has two threads, and one of them was in D
state. As far as I know, pid-2522605 can't be reaped by its parent
pid-2522045 until that thread returns from `synchronize_rcu_expedited`.
```
$ sudo cat /proc/2522605/task/*/stack
[<0>] synchronize_rcu_expedited+0x177/0x1f0
[<0>] namespace_unlock+0xd6/0x1b0
[<0>] put_mnt_ns+0x73/0xa0
[<0>] free_nsproxy+0x1c/0x1b0
[<0>] switch_task_namespaces+0x5d/0x70
[<0>] exit_task_namespaces+0x10/0x20
[<0>] do_exit+0x2ce/0x500
[<0>] io_sq_thread+0x48e/0x5a0
[<0>] ret_from_fork+0x3c/0x60
[<0>] ret_from_fork_asm+0x1b/0x30
$ sudo cat /proc/2522605/task/2522645/status
Name: iou-sqp-2522605
State: D (disk sleep)
Tgid: 2522605
Ngid: 0
Pid: 2522645
PPid: 2522045
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
FDSize: 0
Groups: 1000
NStgid: 2522605 25
NSpid: 2522645 40
NSpgid: 2522045 1
NSsid: 2522045 1
Kthread: 0
Threads: 2
SigQ: 0/128311
SigPnd: 0000000000000000
ShdPnd: 0000000000000100
SigBlk: fffffffffffbfeff
SigIgn: 0000000001001000
SigCgt: 0000000000014602
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 2
Seccomp_filters: 1
Speculation_Store_Bypass: vulnerable
SpeculationIndirectBranch: always enabled
Cpus_allowed: ff
Cpus_allowed_list: 0-7
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 11
nonvoluntary_ctxt_switches: 21
```
>
> But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
> sleep in kernel_wait4().
I ran `cat /proc/2522045/status` repeatedly and found that the state kept
switching between running and sleeping, but the kernel was still reporting
the soft lockup. CPU 5 wasn't able to report a quiescent state; it seems that
[rcu_flavor_sched_clock_irq][1] was never able to call [rcu_qs][2]. The dmesg
log is further below.
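For the !PREEMPT_RCU case, that function looks roughly like this (paraphrased
from kernel/rcu/tree_plugin.h, so treat it as a sketch rather than a verbatim
quote):
```
static void rcu_flavor_sched_clock_irq(int user)
{
	if (user || rcu_is_cpu_rrupt_from_idle()) {
		/*
		 * The tick interrupted user mode or the idle loop, so this
		 * CPU is in a quiescent state: note it. A task busy-looping
		 * in kernel mode never takes this branch, so the CPU never
		 * reports a quiescent state.
		 */
		rcu_qs();
	}
}
```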
```
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 5-....: (15000 ticks this GP) idle=db4c/1/0x4000000000000000 softirq=14924115/14924115 fqs=7430
rcu: hardirqs softirqs csw/system
rcu: number: 0 833 0
rcu: cputime: 0 0 29996 ==> 30000(ms)
rcu: (t=15003 jiffies g=44379053 q=145851 ncpus=8)
CPU: 5 PID: 2522045 Comm: npm start Tainted: G L 6.5.0-1021-azure #22~22.04.1-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x20
Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 62 06 00 00 90 f7 c6 00 02 00 00 74 01 fb 5d <e9> d2 19 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffa666c4bafc30 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffa666c4bafcc0 RCX: 0000000000000020
RDX: ffff8a3d82130928 RSI: 0000000000000282 RDI: ffff8a3d82130920
RBP: ffffa666c4bafc48 R08: ffff8a3d82130928 R09: ffff8a3d82130928
R10: 0000000000000040 R11: 0000000000000002 R12: ffff8a3d82130920
R13: ffff8a44f3db9980 R14: ffff8a44f3db9980 R15: ffff8a44f3db9970
FS: 0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0
Call Trace:
<IRQ>
? show_regs+0x6a/0x80
? dump_cpu_task+0x71/0x90
? rcu_dump_cpu_stacks+0xe8/0x180
? print_cpu_stall+0x131/0x290
? load_balance+0x160/0x870
? check_cpu_stall+0x1d8/0x270
? rcu_pending+0x32/0x1e0
? rcu_sched_clock_irq+0x16e/0x290
? update_process_times+0x63/0xa0
? tick_sched_handle+0x28/0x70
? tick_sched_timer+0x77/0x90
? __pfx_tick_sched_timer+0x10/0x10
? __hrtimer_run_queues+0x111/0x240
? srso_alias_return_thunk+0x5/0x7f
? hrtimer_interrupt+0x101/0x240
? hv_stimer0_isr+0x20/0x30
? __sysvec_hyperv_stimer0+0x32/0x70
? sysvec_hyperv_stimer0+0x7b/0x90
</IRQ>
<TASK>
? asm_sysvec_hyperv_stimer0+0x1b/0x20
? _raw_spin_unlock_irqrestore+0x19/0x20
? remove_wait_queue+0x47/0x50
do_wait+0x19f/0x300
kernel_wait4+0xaf/0x150
? __pfx_child_wait_callback+0x10/0x10
zap_pid_ns_processes+0x105/0x190
forget_original_parent+0x2e4/0x360
exit_notify+0x4a/0x210
do_exit+0x30b/0x500
? srso_alias_return_thunk+0x5/0x7f
? wake_up_state+0x10/0x20
? srso_alias_return_thunk+0x5/0x7f
do_group_exit+0x35/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0xd95/0x1ff0
do_syscall_64+0x56/0x80
? srso_alias_return_thunk+0x5/0x7f
? handle_mm_fault+0x128/0x290
? srso_alias_return_thunk+0x5/0x7f
? srso_alias_return_thunk+0x5/0x7f
? exit_to_user_mode_prepare+0x49/0x100
? srso_alias_return_thunk+0x5/0x7f
? irqentry_exit_to_user_mode+0x19/0x30
? srso_alias_return_thunk+0x5/0x7f
? irqentry_exit+0x1d/0x30
? srso_alias_return_thunk+0x5/0x7f
? exc_page_fault+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x75fce9367f8e
Code: Unable to access opcode bytes at 0x75fce9367f64.
RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e
RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50
</TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 5-.... } 15359 jiffies s: 90777 root: 0x20/.
rcu: blocking rcu_node structures (internal RCU debug):
Sending NMI from CPU 4 to CPUs 5:
NMI backtrace for cpu 5
CPU: 5 PID: 2522045 Comm: npm start Tainted: G L 6.5.0-1021-azure #22~22.04.1-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
RIP: 0010:do_wait+0x11/0x300
Code: 8b 4d d4 e9 28 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 <53> 48 89 fb 48 83 ec 08 48 8b 77 08 0f 1f 44 00 00 65 4c 8b 34 25
RSP: 0018:ffffa666c4bafc68 EFLAGS: 00000202
RAX: 0000000000000000 RBX: 0000000040000004 RCX: 0000000000000000
RDX: 0000000040000000 RSI: 0000000000000000 RDI: ffffa666c4bafc98
RBP: ffffa666c4bafc88 R08: ffff8a3d82130928 R09: ffff8a3d82130928
R10: 0000000000000040 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0
Call Trace:
<NMI>
? show_regs+0x6a/0x80
? nmi_cpu_backtrace+0x9c/0x100
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x62/0x160
? default_do_nmi+0x45/0x120
? exc_nmi+0x19f/0x250
? end_repeat_nmi+0x16/0x67
? do_wait+0x11/0x300
? do_wait+0x11/0x300
? do_wait+0x11/0x300
</NMI>
<TASK>
kernel_wait4+0xaf/0x150
? __pfx_child_wait_callback+0x10/0x10
zap_pid_ns_processes+0x105/0x190
forget_original_parent+0x2e4/0x360
exit_notify+0x4a/0x210
do_exit+0x30b/0x500
? srso_alias_return_thunk+0x5/0x7f
? wake_up_state+0x10/0x20
? srso_alias_return_thunk+0x5/0x7f
do_group_exit+0x35/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0xd95/0x1ff0
do_syscall_64+0x56/0x80
? srso_alias_return_thunk+0x5/0x7f
? handle_mm_fault+0x128/0x290
? srso_alias_return_thunk+0x5/0x7f
? srso_alias_return_thunk+0x5/0x7f
? exit_to_user_mode_prepare+0x49/0x100
? srso_alias_return_thunk+0x5/0x7f
? irqentry_exit_to_user_mode+0x19/0x30
? srso_alias_return_thunk+0x5/0x7f
? irqentry_exit+0x1d/0x30
? srso_alias_return_thunk+0x5/0x7f
? exc_page_fault+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x75fce9367f8e
Code: Unable to access opcode bytes at 0x75fce9367f64.
RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e
RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50
</TASK>
watchdog: BUG: soft lockup - CPU#5 stuck for 85s! [npm start:2522045]
Modules linked in: tls raw_diag unix_diag af_packet_diag netlink_diag udp_diag tcp_diag inet_diag xt_statistic xt_mark veth xt_comment xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_nat xt_MASQUERADE nft_chain_nat nf_nat nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay binfmt_misc nls_iso8859_1 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_owner xt_tcpudp nft_compat crct10dif_pclmul crc32_pclmul nf_tables polyval_clmulni polyval_generic ghash_clmulni_intel libcrc32c sha256_ssse3 joydev sha1_ssse3 hid_generic nfnetlink aesni_intel crypto_simd hyperv_drm cryptd hid_hyperv serio_raw drm_kms_helper hv_netvsc hid hyperv_keyboard pata_acpi drm_shmem_helper dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel msr drm efi_pstore i2c_core ip_tables x_tables autofs4
CPU: 5 PID: 2522045 Comm: npm start Tainted: G L 6.5.0-1021-azure #22~22.04.1-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x20
Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 62 06 00 00 90 f7 c6 00 02 00 00 74 01 fb 5d <e9> d2 19 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffa666c4bafc30 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffa666c4bafcc0 RCX: 0000000000000020
RDX: ffff8a3d82130928 RSI: 0000000000000282 RDI: ffff8a3d82130920
RBP: ffffa666c4bafc48 R08: ffff8a3d82130928 R09: ffff8a3d82130928
R10: 0000000000000040 R11: 0000000000000002 R12: ffff8a3d82130920
R13: ffff8a44f3db9980 R14: ffff8a44f3db9980 R15: ffff8a44f3db9970
FS: 0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0
Call Trace:
<IRQ>
? show_regs+0x6a/0x80
? watchdog_timer_fn+0x1ce/0x230
? __pfx_watchdog_timer_fn+0x10/0x10
? __hrtimer_run_queues+0x111/0x240
? srso_alias_return_thunk+0x5/0x7f
? hrtimer_interrupt+0x101/0x240
? hv_stimer0_isr+0x20/0x30
? __sysvec_hyperv_stimer0+0x32/0x70
? sysvec_hyperv_stimer0+0x7b/0x90
</IRQ>
<TASK>
? asm_sysvec_hyperv_stimer0+0x1b/0x20
? _raw_spin_unlock_irqrestore+0x19/0x20
? remove_wait_queue+0x47/0x50
do_wait+0x19f/0x300
kernel_wait4+0xaf/0x150
? __pfx_child_wait_callback+0x10/0x10
zap_pid_ns_processes+0x105/0x190
forget_original_parent+0x2e4/0x360
exit_notify+0x4a/0x210
do_exit+0x30b/0x500
? srso_alias_return_thunk+0x5/0x7f
? wake_up_state+0x10/0x20
? srso_alias_return_thunk+0x5/0x7f
do_group_exit+0x35/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0xd95/0x1ff0
do_syscall_64+0x56/0x80
? srso_alias_return_thunk+0x5/0x7f
? handle_mm_fault+0x128/0x290
? srso_alias_return_thunk+0x5/0x7f
? srso_alias_return_thunk+0x5/0x7f
? exit_to_user_mode_prepare+0x49/0x100
? srso_alias_return_thunk+0x5/0x7f
? irqentry_exit_to_user_mode+0x19/0x30
? srso_alias_return_thunk+0x5/0x7f
? irqentry_exit+0x1d/0x30
? srso_alias_return_thunk+0x5/0x7f
? exc_page_fault+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x75fce9367f8e
Code: Unable to access opcode bytes at 0x75fce9367f64.
RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e
RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50
</TASK>
```
>
> Any chance you can test the patch below? This patch makes sense anyway,
> I'll send it later. But I am not sure it can fix your problem.
Sure! Will do! Thanks
>
> Oleg.
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> */
> do {
> clear_thread_flag(TIF_SIGPENDING);
> + clear_thread_flag(TIF_NOTIFY_SIGNAL);
> rc = kernel_wait4(-1, NULL, __WALL, NULL);
> } while (rc != -ECHILD);
>
>
>
Wei Fu
[1]: https://elixir.bootlin.com/linux/v6.5/source/kernel/rcu/tree_plugin.h#L964
[2]: https://elixir.bootlin.com/linux/v6.5/source/kernel/rcu/tree_plugin.h#L848
Hi Wei, thanks for more info.
On 06/06, Wei Fu wrote:
>
> > Well, due to unfortunate design zap_pid_ns_processes() can hang "forever"
> > if this namespace has a (zombie) task injected from the parent ns, this
> > task should be reaped by its parent.
>
> That zombie task was cloned by pid-1 process in that pid namespace. In my last
> reproduced log, the process tree in that pid namespace looks like
OK,
> ```
> # unshare(CLONE_NEWPID | CLONE_NEWNS)
>
> npm start (pid 2522045)
> |__npm run zombie (pid 2522605)
> |__ sh -c "whle true; do echo zombie; sleep 1; done" (pid 2522869)
> ```
only 3 processes? nothing is running? Is the last process 2522869 a
zombie too?
Could you show your .config? In particular, CONFIG_PREEMPT...
> The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child,
so this is the init task in this namespace,
> `npm run zombie (pid 2522605)`, has two threads. One of them was in D status.
...
> $ sudo cat /proc/2522605/task/*/stack
> [<0>] synchronize_rcu_expedited+0x177/0x1f0
> [<0>] namespace_unlock+0xd6/0x1b0
> [<0>] put_mnt_ns+0x73/0xa0
> [<0>] free_nsproxy+0x1c/0x1b0
> [<0>] switch_task_namespaces+0x5d/0x70
> [<0>] exit_task_namespaces+0x10/0x20
> [<0>] do_exit+0x2ce/0x500
> [<0>] io_sq_thread+0x48e/0x5a0
> [<0>] ret_from_fork+0x3c/0x60
> [<0>] ret_from_fork_asm+0x1b/0x30
so I guess this is the trace of its sub-thread 2522645.
What about the process 2522605? Has it exited too?
> > But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
> > sleep in kernel_wait4().
>
> I run `cat /proc/2522045/status` and found that the status was kept switching
> between running and sleeping.
OK, this shouldn't happen in this case. So it really looks like it spins
in a busy-wait loop because TIF_NOTIFY_SIGNAL is not cleared. It can be
reported as sleeping because do_wait() sets/clears TASK_INTERRUPTIBLE,
although the window is small...
Oleg.
>
> > ```
> > # unshare(CLONE_NEWPID | CLONE_NEWNS)
> >
> > npm start (pid 2522045)
> > |__npm run zombie (pid 2522605)
> > |__ sh -c "whle true; do echo zombie; sleep 1; done" (pid 2522869)
> > ```
>
> only 3 processes? nothing is running? Is the last process 2522869 a
> zombie too?
Yes. The pid-2522045 process sent SIGKILL to all the processes in that pid
namespace when it exited. The last process, 2522869, was a zombie as well.
Sometimes `npm start` exits before `npm run zombie` forks `sh`, so you might
see only two processes in that pid namespace.
>
> Could you show your .config? In particular, CONFIG_PREEMPT...
I'm using the [6.5.0-1021-azure][1] kernel and preemption is disabled.
Here are the relevant parts of the .config:
```
$ cat /boot/config-6.5.0-1021-azure | grep _RCU
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
CONFIG_RCU_CPU_STALL_CPUTIME=y
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
$ cat /boot/config-6.5.0-1021-azure | grep _PREEMPT
CONFIG_PREEMPT_VOLUNTARY_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_DYNAMIC is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
CONFIG_DRM_I915_PREEMPT_TIMEOUT_COMPUTE=7500
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
$ cat /boot/config-6.5.0-1021-azure | grep HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_MACHZ_WDT=m
```
>
> > The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child,
>
> so this is the init task in this namespace,
Yes~
>
> > `npm run zombie (pid 2522605)`, has two threads. One of them was in D status.
> ...
> > $ sudo cat /proc/2522605/task/*/stack
> > [<0>] synchronize_rcu_expedited+0x177/0x1f0
> > [<0>] namespace_unlock+0xd6/0x1b0
> > [<0>] put_mnt_ns+0x73/0xa0
> > [<0>] free_nsproxy+0x1c/0x1b0
> > [<0>] switch_task_namespaces+0x5d/0x70
> > [<0>] exit_task_namespaces+0x10/0x20
> > [<0>] do_exit+0x2ce/0x500
> > [<0>] io_sq_thread+0x48e/0x5a0
> > [<0>] ret_from_fork+0x3c/0x60
> > [<0>] ret_from_fork_asm+0x1b/0x30
>
> so I guess this is the trace of its sub-thread 2522645.
Sorry for the unclear message.
Yes~
>
> What about the process 2522605? Has it exited too?
The process 2522605 has two threads. The main thread, 2522605, was in zombie
state, so yes, it has exited as well. Only thread 2522645 was still
stuck in synchronize_rcu_expedited.
>
> > > But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
> > > sleep in kernel_wait4().
> >
> > I run `cat /proc/2522045/status` and found that the status was kept switching
> > between running and sleeping.
>
> OK, this shouldn't happen in this case. So it really looks like it spins
> in a busy-wait loop because TIF_NOTIFY_SIGNAL is not cleared. It can be
> reported as sleeping because do_wait() sets/clears TASK_INTERRUPTIBLE,
> although the window is small...
>
I can reproduce this issue on v5.15, v6.1, v6.5, v6.8, v6.9 and v6.10-rc2.
All of these kernels have CONFIG_PREEMPT and PREEMPT_RCU disabled. It's very
easy to reproduce on v5.15.x with 8 vcores in a few minutes; for the other
kernel versions it can take 30 minutes or a few hours.
Rachel provides a [golang repro][2] which is similar to the docker repro. It
can be built as a static binary, which makes it easy to use for reproduction.
I hope this information helps.
Thanks,
Wei
[1]: https://gist.github.com/fuweid/ae8bad349fee3e00a4f1ce82397831ac
[2]: https://github.com/rlmenge/rcu-soft-lock-issue-repro?tab=readme-ov-file#golang-repro
Thanks for this info,
On 06/07, Wei Fu wrote:
>
> All the kernels disable CONFIG_PREEMPT and PREEMPT_RCU.
Ah, this can explain both the soft lockup and the synchronize_rcu() hang, if
my theory is correct.
Can you try the patch I sent?
Oleg.
Hi!
>
> On 06/07, Wei Fu wrote:
> >
> > All the kernels disable CONFIG_PREEMPT and PREEMPT_RCU.
>
> Ah, this can explain both soft-lockup and synchronize_rcu() hang. If my theory
> is correct.
>
> Can you try the patch I sent?
>
> Oleg.
>
Yes. I applied your patch on v5.15.160 and ran the reproducer for 5 hours.
I didn't see the issue. So far it looks good! I will continue the test
over the weekend.
In your last reply, you mentioned TIF_NOTIFY_SIGNAL in relation to the
busy-wait loop. Would you please explain why clearing the flag works here?
Thanks,
Wei
```
➜ linux git:(v5.15.160) ✗ git --no-pager show
commit c61bd26ae81a896c8660150b4e356153da30880a (HEAD, tag: v5.15.160, origin/linux-5.15.y)
Author: Greg Kroah-Hartman <[email protected]>
Date: Sat May 25 16:20:19 2024 +0200
Linux 5.15.160
Link: https://lore.kernel.org/r/[email protected]
Tested-by: SeongJae Park <[email protected]>
Tested-by: Mark Brown <[email protected]>
Tested-by: Florian Fainelli <[email protected]>
Tested-by: Harshit Mogalapalli <[email protected]>
Tested-by: Linux Kernel Functional Testing <[email protected]>
Tested-by: Shuah Khan <[email protected]>
Tested-by: Ron Economos <[email protected]>
Tested-by: Kelsey Steele <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
diff --git a/Makefile b/Makefile
index 5cbfe2be72dd..bfc863d71978 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
VERSION = 5
PATCHLEVEL = 15
-SUBLEVEL = 159
+SUBLEVEL = 160
EXTRAVERSION =
NAME = Trick or Treat
➜ linux git:(v5.15.160) ✗ git --no-pager diff .
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 259fc4ca0d9c..40b011f88067 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -214,6 +214,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
*/
do {
clear_thread_flag(TIF_SIGPENDING);
+ clear_thread_flag(TIF_NOTIFY_SIGNAL);
rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
```
On 06/07, Wei Fu wrote:
>
> Yes. I applied your patch on v5.15.160 and run reproducer for 5 hours.
> I didn't see this issue. Currently, it looks good!. I will continue that test
> on this weekend.
Great, thanks!
> In last reply, you mentioned TIF_NOTIFY_SIGNAL related to busy-wait loop.
> Would you please explain why flag-clear works here?
Sure, I'll write the changelog with the explanation and send the patch this
weekend, if it passes your testing.
But in short it is very simple. zap_pid_ns_processes() clears TIF_SIGPENDING
exactly because we want to avoid the busy-wait loop. But today this is no
longer enough to make signal_pending() return false, see
include/linux/sched/signal.h:signal_pending().
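For reference, that function is roughly (paraphrased, after commit
12db8b690010):
```
static inline int signal_pending(struct task_struct *p)
{
	/*
	 * TIF_NOTIFY_SIGNAL isn't really a signal, but it needs the same
	 * behavior of breaking out of wait loops, so it makes
	 * signal_pending() return true as well.
	 */
	if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
		return 1;
	return task_sigpending(p);	/* checks TIF_SIGPENDING */
}
```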
Thanks,
Oleg.
kernel_wait4() doesn't sleep and returns -EINTR if there is no
eligible child and signal_pending() is true.
That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
return false and avoid a busy-wait loop.
Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Rachel Menge <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/pid_namespace.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index dc48fecfa1dc..25f3cf679b35 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
*/
do {
clear_thread_flag(TIF_SIGPENDING);
+ clear_thread_flag(TIF_NOTIFY_SIGNAL);
rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
--
2.25.1.362.g51ebf55
On 06/07, Oleg Nesterov wrote:
>
> On 06/07, Wei Fu wrote:
> >
> > Yes. I applied your patch on v5.15.160 and run reproducer for 5 hours.
> > I didn't see this issue. Currently, it looks good!. I will continue that test
> > on this weekend.
>
> Great, thanks!
>
> > In last reply, you mentioned TIF_NOTIFY_SIGNAL related to busy-wait loop.
> > Would you please explain why flag-clear works here?
>
> Sure, I'll write the changelog with the explanation and send the patch on
> weekend. If it passes your testing.
Please see the patch I've sent. The changelog doesn't bother to describe this
particular problem because busy-waiting can obviously cause multiple problems,
especially without CONFIG_PREEMPT or if rt_task().
So let me add more details about this particular deadlock here.
The sub-namespace init task T spins in a tight loop calling kernel_wait4()
which returns -EINTR without sleeping because its child C has not exited
yet and signal_pending(T) is true due to TIF_NOTIFY_SIGNAL.
The exiting child C sleeps in synchronize_rcu() which hangs exactly because
T never calls schedule/rcu_note_context_switch, it can't be preempted because
CONFIG_PREEMPT is not enabled.
Note also that without PREEMPT_RCU __rcu_read_lock() is just preempt_disable()
which is nop without CONFIG_PREEMPT.
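Roughly, paraphrased for the !CONFIG_PREEMPT_RCU case:
```
/* include/linux/rcupdate.h, !CONFIG_PREEMPT_RCU, roughly: */
static inline void __rcu_read_lock(void)
{
	preempt_disable();
}

static inline void __rcu_read_unlock(void)
{
	preempt_enable();
}

/*
 * With !CONFIG_PREEMPT(_COUNT), preempt_disable()/preempt_enable() reduce to
 * compiler barriers, so nothing ever forces the busy-looping task to
 * schedule and let its CPU report a quiescent state.
 */
```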
Oleg.
The comment above the idr_for_each_entry_continue() loop tries to explain
why we have to signal each thread in the namespace, but it is outdated.
This code no longer uses kill_proc_info(), we have a target task so we can
check thread_group_leader() and avoid the unnecessary group_send_sig_info.
Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID,
this way it returns NULL if this pid is not a group-leader pid.
Also, change this code to check SIGNAL_GROUP_EXIT; the exiting process /
thread doesn't necessarily have a pending SIGKILL. Either way these checks
are racy without siglock, so the patch uses data_race() to shut up KCSAN.
Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/pid_namespace.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 25f3cf679b35..0f9bd67c9e75 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -191,21 +191,14 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
* The last thread in the cgroup-init thread group is terminating.
* Find remaining pid_ts in the namespace, signal and wait for them
* to exit.
- *
- * Note: This signals each threads in the namespace - even those that
- * belong to the same thread group, To avoid this, we would have
- * to walk the entire tasklist looking a processes in this
- * namespace, but that could be unnecessarily expensive if the
- * pid namespace has just a few processes. Or we need to
- * maintain a tasklist for each pid namespace.
- *
*/
rcu_read_lock();
read_lock(&tasklist_lock);
nr = 2;
idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
- task = pid_task(pid, PIDTYPE_PID);
- if (task && !__fatal_signal_pending(task))
+ task = pid_task(pid, PIDTYPE_TGID);
+ /* reading signal->flags is racy without sighand->siglock */
+ if (task && !(data_race(task->signal->flags) & SIGNAL_GROUP_EXIT))
group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
}
read_unlock(&tasklist_lock);
--
2.25.1.362.g51ebf55
On Sat, Jun 08, 2024 at 02:06:16PM +0200, Oleg Nesterov wrote:
> kernel_wait4() doesn't sleep and returns -EINTR if there is no
> eligible child and signal_pending() is true.
>
> That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> return false and avoid a busy-wait loop.
>
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Rachel Menge <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Boqun Feng <[email protected]>
Wei, I would appreciate it if you could share some test results and provide a
Tested-by tag. Thanks!
Regards,
Boqun
> ---
> kernel/pid_namespace.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> */
> do {
> clear_thread_flag(TIF_SIGPENDING);
> + clear_thread_flag(TIF_NOTIFY_SIGNAL);
> rc = kernel_wait4(-1, NULL, __WALL, NULL);
> } while (rc != -ECHILD);
>
> --
> 2.25.1.362.g51ebf55
>
>
> kernel_wait4() doesn't sleep and returns -EINTR if there is no
> eligible child and signal_pending() is true.
>
> That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> return false and avoid a busy-wait loop.
>
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Rachel Menge <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Wei Fu <[email protected]>
This change looks good to me!
I used [rcudeadlock-v1][1] to verify this patch on v5.15.160 for more than 30
hours. The soft lockup didn't show up. Without this patch, the same test
triggers the soft lockup within 10 minutes.
```
root@(none):/# uname -a
Linux (none) 5.15.160-dirty #7 SMP Fri Jun 7 15:25:30 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@(none):/# ps -ef | grep rcu
root 3 2 0 Jun07 ? 00:00:00 [rcu_gp]
root 4 2 0 Jun07 ? 00:00:00 [rcu_par_gp]
root 11 2 0 Jun07 ? 00:00:00 [rcu_tasks_rude_]
root 12 2 0 Jun07 ? 00:00:00 [rcu_tasks_trace]
root 15 2 0 Jun07 ? 00:03:31 [rcu_sched]
root 145 141 0 Jun07 ? 00:15:29 ./rcudeadlock
root 5372 141 0 13:37 ? 00:00:00 grep rcu
root@(none):/# date
Sun Jun 9 13:37:38 UTC 2024
```
I used [rcudeadlock-v2][2] to verify this patch on v6.10-rc2 for more than 2
hours. The soft lockup didn't show up. Without this patch, the same test
triggers the soft lockup within 1 minute.
```
root@(none):/# uname -a
Linux (none) 6.10.0-rc2-dirty #4 SMP Sun Jun 9 11:19:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@(none):/# ps -ef | grep rcu
root 4 2 0 11:20 ? 00:00:00 [kworker/R-rcu_g]
root 13 2 0 11:20 ? 00:00:00 [rcu_tasks_rude_kthread]
root 14 2 0 11:20 ? 00:00:00 [rcu_tasks_trace_kthread]
root 16 2 0 11:20 ? 00:00:03 [rcu_sched]
root 17 2 0 11:20 ? 00:00:00 [rcu_exp_par_gp_kthread_worker/0]
root 18 2 0 11:20 ? 00:00:12 [rcu_exp_gp_kthread_worker]
root 117 108 0 11:21 ? 00:01:06 ./rcudeadlock
root 14451 108 0 13:37 ? 00:00:00 grep rcu
root@(none):/# date
Sun Jun 9 13:37:15 UTC 2024
```
It's about a data race while cleaning up an active iou-wrk thread. I share
the idea of how to verify this patch below.
> ---
> kernel/pid_namespace.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> */
> do {
> clear_thread_flag(TIF_SIGPENDING);
> + clear_thread_flag(TIF_NOTIFY_SIGNAL);
> rc = kernel_wait4(-1, NULL, __WALL, NULL);
> } while (rc != -ECHILD);
>
> --
> 2.25.1.362.g51ebf55
>
>
>
Let's assume there is a new pid namespace, unshared from the host pid
namespace, named `PA`. There are two processes in `PA`: the init process is
named `X` and its child is named `Y`.
```
unshare(CLONE_NEWPID|CLONE_NEWNS)
X
|__ Y
```
The main thread of process X creates one active io_uring worker thread,
`iou-wrk-X`. When process X exits, the main thread of process X wakes up and
sets the `TIF_NOTIFY_SIGNAL` flag on the `iou-wrk-X` thread.
However, when the `iou-wrk-X` thread receives the signal from the main thread
and wakes up, it has no chance to clear the `TIF_NOTIFY_SIGNAL` flag. Since
`iou-wrk-X` is the last thread in process X, it carries `TIF_NOTIFY_SIGNAL`
into `zap_pid_ns_processes`. This can be described by the following timeline.
```
== X main-thread == == X iou-wrk-X == == Y main-thread ==
do_exit
kill iou-wrk-X thread
io_uring_files_cancel io_wq_worker
set TIF_NOTIFY_SIGNAL on
iou-wrk-X thread
do_exit(0)
exit_task_namespace
exit_task_namespace
do_task_dead
exit_notify
forget_original_parent
find_child_reaper
zap_pid_ns_processes do_exit
exit_task_namespace
...
namespace_unlock
synchronize_rcu_expedited
```
The `iou-wrk-X` thread kills process Y, which is the only one holding the
mount namespace reference. Process Y will then get into
`synchronize_rcu_expedited`. Since the kernel doesn't enable preemption and
the `iou-wrk-X` thread has the `TIF_NOTIFY_SIGNAL` flag set, the `iou-wrk-X`
thread gets into an infinite loop, which causes the soft lockup.
So, in the [rcudeadlock-v2][2] test, I create more active iou-wrk threads in
the init process so that there is a higher chance of having an iou-wrk thread
in the `zap_pid_ns_processes` function.
Hope this helps.
Thanks,
Wei
[1]: https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/662b8e414ff15d75419e2286b8121b7c2049a37c/rcudeadlock.go#L1
[2]: https://github.com/rlmenge/rcu-soft-lock-issue-repro/pull/1
>
> On 06/07, Oleg Nesterov wrote:
> >
> > On 06/07, Wei Fu wrote:
> > >
> > > Yes. I applied your patch on v5.15.160 and run reproducer for 5 hours.
> > > I didn't see this issue. Currently, it looks good!. I will continue that test
> > > on this weekend.
> >
> > Great, thanks!
> >
> > > In last reply, you mentioned TIF_NOTIFY_SIGNAL related to busy-wait loop.
> > > Would you please explain why flag-clear works here?
> >
> > Sure, I'll write the changelog with the explanation and send the patch on
> > weekend. If it passes your testing.
>
> Please see the patch I've sent. The changelog doesn't bother to describe this
> particular problem because busy-waiting can obviously cause multiple problems,
> especially without CONFIG_PREEMPT or if rt_task().
>
> So let me add more details about this particular deadlock here.
>
> The sub-namespace init task T spins in a tight loop calling kernel_wait4()
> which returns -EINTR without sleeping because its child C has not exited
> yet and signal_pending(T) is true due to TIF_NOTIFY_SIGNAL.
>
> The exiting child C sleeps in synchronize_rcu() which hangs exactly because
> T never calls schedule/rcu_note_context_switch, it can't be preempted because
> CONFIG_PREEMPT is not enabled.
>
> Note also that without PREEMPT_RCU __rcu_read_lock() is just preempt_disable()
> which is nop without CONFIG_PREEMPT.
>
> Oleg.
>
>
Thanks for the update. That's really helpful!
Wei
On 6/8/24 6:06 AM, Oleg Nesterov wrote:
> kernel_wait4() doesn't sleep and returns -EINTR if there is no
> eligible child and signal_pending() is true.
>
> That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> return false and avoid a busy-wait loop.
Reviewed-by: Jens Axboe <[email protected]>
--
Jens Axboe
Oleg Nesterov <[email protected]> writes:
> kernel_wait4() doesn't sleep and returns -EINTR if there is no
> eligible child and signal_pending() is true.
>
> That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> return false and avoid a busy-wait loop.
I took a look through the code. It used to be that TIF_NOTIFY_SIGNAL
was all about waking up a task so that task_work_run can be used.
io_uring still mostly uses it that way. There is also a use in
kthread_stop that just uses it as a TIF_SIGPENDING without having a
pending signal.
At the point in do_exit where exit_notify and thus zap_pid_ns_processes
is called I can't possibly see a use for TIF_NOTIFY_SIGNAL.
exit_task_work, exit_signals, and io_uring_cancel have all been called.
So TIF_NOTIFY_SIGNAL should be spurious at this point and safe to clear.
Why it remains set is a mystery to me.
If I had infinite time and energy the ideal would be to rework the pid
namespace exit logic so that waiting for everything to exit works like
delay_group_leader in wait_consider_task: simply block reaping of
the pid namespace leader until everything in the pid namespace has been
reaped. I think acct_exit_ns is the only piece of code that needs
to be moved to allow that, and acct_exit_ns is purely bookkeeping so
does not affect userspace visible semantics.
This active waiting is weird and non-standard in the kernel and winds up
causing a problem every couple of years because of that.
>
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Rachel Menge <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/pid_namespace.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> */
> do {
> clear_thread_flag(TIF_SIGPENDING);
> + clear_thread_flag(TIF_NOTIFY_SIGNAL);
> rc = kernel_wait4(-1, NULL, __WALL, NULL);
> } while (rc != -ECHILD);
Oleg Nesterov <[email protected]> writes:
> The comment above the idr_for_each_entry_continue() loop tries to explain
> why we have to signal each thread in the namespace, but it is outdated.
> This code no longer uses kill_proc_info(), we have a target task so we can
> check thread_group_leader() and avoid the unnecessary group_send_sig_info.
> Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID,
> this way it returns NULL if this pid is not a group-leader pid.
>
> Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process /
> thread doesn't necessarily has a pending SIGKILL. Either way these checks
> are racy without siglock, so the patch uses data_race() to shut up KCSAN.
You remove the comment but the meat of what it was trying to say remains
true. For processes in a session or processes in a process group a list
of all such processes is kept. No such list is kept for a pid
namespace. So the best we can do is walk through the allocated pid
numbers in the pid namespace.
It would also help if this explains that in the case of SIGKILL
complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good
check to use to see if the process has been killed (with SIGKILL).
There are races with coredump here but *shrug* I don't think this
changes behavior in that situation.
Eric
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/pid_namespace.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 25f3cf679b35..0f9bd67c9e75 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -191,21 +191,14 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> * The last thread in the cgroup-init thread group is terminating.
> * Find remaining pid_ts in the namespace, signal and wait for them
> * to exit.
> - *
> - * Note: This signals each threads in the namespace - even those that
> - * belong to the same thread group, To avoid this, we would have
> - * to walk the entire tasklist looking a processes in this
> - * namespace, but that could be unnecessarily expensive if the
> - * pid namespace has just a few processes. Or we need to
> - * maintain a tasklist for each pid namespace.
> - *
> */
> rcu_read_lock();
> read_lock(&tasklist_lock);
> nr = 2;
> idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
> - task = pid_task(pid, PIDTYPE_PID);
> - if (task && !__fatal_signal_pending(task))
> + task = pid_task(pid, PIDTYPE_TGID);
> + /* reading signal->flags is racy without sighand->siglock */
> + if (task && !(data_race(task->signal->flags) & SIGNAL_GROUP_EXIT))
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> }
> read_unlock(&tasklist_lock);
>
> Oleg Nesterov <[email protected]> writes:
>
> > kernel_wait4() doesn't sleep and returns -EINTR if there is no
> > eligible child and signal_pending() is true.
> >
> > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> > return false and avoid a busy-wait loop.
>
> I took a look through the code. It used to be that TIF_NOTIFY_SIGNAL
> was all about waking up a task so that task_work_run can be used.
> io_uring still mostly uses it that way. There is also a use in
> kthread_stop that just uses it as a TIF_SIGPENDING without having a
> pending signal.
>
> At the point in do_exit where exit_notify and thus zap_pid_ns_processes
> is called I can't possibly see a use for TIF_NOTIFY_SIGNAL.
> exit_task_work, exit_signals, and io_uring_cancel have all been called.
>
> So TIF_NOTIFY_SIGNAL should be spurious at this point and safe to clear.
> Why it remains set is a mystery to me.
I think there is a case where TIF_NOTIFY_SIGNAL remains set.
The init process has a main-thread, a sub-thread-X and an iou-wrk-thread-X
(created by sub-thread-X). When the main-thread enters exit_group, both
sub-thread-X and iou-wrk-thread-X get TIF_SIGPENDING set and wake up.
sub-thread-X can then call io_uring_cancel, which sets TIF_NOTIFY_SIGNAL on
iou-wrk-thread-X, and iou-wrk-thread-X never gets a chance to clear it.
iou-wrk-thread-X then enters zap_pid_ns_processes with TIF_NOTIFY_SIGNAL
still set. If there are active processes in that pid namespace, it will run
into this issue.
Wei
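For context, the busy-wait Wei describes happens because signal_pending()
reports TIF_NOTIFY_SIGNAL as well as TIF_SIGPENDING, so clearing only
TIF_SIGPENDING is not enough. Roughly, from include/linux/sched/signal.h in
recent kernels:

	static inline int signal_pending(struct task_struct *p)
	{
		/*
		 * TIF_NOTIFY_SIGNAL isn't really a signal, but it has to break
		 * wait loops the same way so that pending task_work can run.
		 */
		if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
			return 1;
		return task_sigpending(p);	/* tests TIF_SIGPENDING */
	}

With a stale TIF_NOTIFY_SIGNAL, kernel_wait4() in zap_pid_ns_processes()
keeps returning -EINTR and the wait loop never sleeps.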
On 06/13, Wei Fu wrote:
>
> I think there is a case where TIF_NOTIFY_SIGNAL remains set.
[...snip...]
Of course! But please forget about io_uring, even if currently io_uring/
is the only user of TWA_SIGNAL.
Just suppose that the exiting task/thread races with task_work_add(TWA_SIGNAL);
TIF_NOTIFY_SIGNAL won't be cleared.
This is fine in the sense that the exiting task T will do exit_task_work(), and
after that task_work_add(T) can't succeed with or without TWA_SIGNAL, so it
can't miss the pending work.
But I think we can forget about TIF_NOTIFY_SIGNAL. To me, the problem is that
the state of signal_pending() of the exiting task was never clearly defined, and
I can't even recall how many times I mentioned this in the previous discussions.
TIF_NOTIFY_SIGNAL doesn't add more confusion, imo.
Oleg.
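The wait loop in question looks roughly like this in kernel/pid_namespace.c;
the extra clear of TIF_NOTIFY_SIGNAL below is only a sketch of the shape of
the fix being discussed, not the actual patch:

	/*
	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
	 * kernel_wait4() will also block until our children traced from the
	 * parent namespace are detached and become EXIT_DEAD.
	 */
	do {
		clear_thread_flag(TIF_SIGPENDING);
		/* sketch of the proposed fix: also clear TIF_NOTIFY_SIGNAL so
		 * a stale flag can't keep signal_pending() true forever */
		clear_thread_flag(TIF_NOTIFY_SIGNAL);
		rc = kernel_wait4(-1, NULL, __WALL, NULL);
	} while (rc != -ECHILD);

If signal_pending() stays true, kernel_wait4() returns -EINTR immediately on
every iteration, the loop spins without sleeping, and the soft lockup being
discussed is the result.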
On 06/13, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > The comment above the idr_for_each_entry_continue() loop tries to explain
> > why we have to signal each thread in the namespace, but it is outdated.
> > This code no longer uses kill_proc_info(), we have a target task so we can
> > check thread_group_leader() and avoid the unnecessary group_send_sig_info.
> > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID;
> > this way it returns NULL if this pid is not a group-leader pid.
> >
> > Also, change this code to check SIGNAL_GROUP_EXIT; the exiting process /
> > thread doesn't necessarily have a pending SIGKILL. Either way these checks
> > are racy without siglock, so the patch uses data_race() to shut up KCSAN.
>
> You remove the comment but the meat of what it was trying to say remains
> true. For processes in a session or processes in a process group, a list
> of all such processes is kept. No such list is kept for a pid
> namespace. So the best we can do is walk through the allocated pid
> numbers in the pid namespace.
OK, I'll recheck tomorrow. Yet I think it doesn't make sense to send
SIGKILL to sub-threads, and the comment looks misleading today. This was
the main motivation, but again, I'll recheck.
> It would also help if this explained that in the case of SIGKILL
> complete_signal always sets SIGNAL_GROUP_EXIT, which makes that a good
> check to use to see if the process has been killed (with SIGKILL).
Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was
killed or not. It (the whole thread group) is going to exit, that is all.
We can even remove this check; it is just an optimization, just like
the current fatal_signal_pending() check.
Oleg.
On 06/13, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > kernel_wait4() doesn't sleep and returns -EINTR if there is no
> > eligible child and signal_pending() is true.
> >
> > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> > return false and avoid a busy-wait loop.
>
> I took a look through the code. It used to be that TIF_NOTIFY_SIGNAL
> was all about waking up a task so that task_work_run can be used.
> io_uring still mostly uses it that way. There is also a use in
> kthread_stop that just uses it as a TIF_SIGPENDING without having a
> pending signal.
>
> At the point in do_exit where exit_notify and thus zap_pid_ns_processes
> is called I can't possibly see a use for TIF_NOTIFY_SIGNAL.
> exit_task_work, exit_signals, and io_uring_cancel have all been called.
>
> So TIF_NOTIFY_SIGNAL should be spurious at this point and safe to clear.
> Why it remains set is a mystery to me.
Because exit_task_work() -> task_work_run() doesn't clear TIF_NOTIFY_SIGNAL.
So yes, it is spurious, but to me a possible TIF_SIGPENDING is even more
"spurious". See my reply to Wei.
We don't need to clear TIF_NOTIFY_SIGNAL inside the loop; task_work_add()
can't succeed after exit_task_work() sets ->task_works = &work_exited. But
this is another story, and it doesn't (well, shouldn't) differ from
TIF_SIGPENDING.
> If I had infinite time and energy the ideal is to rework the pid
> namespace exit logic
Perhaps in this case you could take a look at the next loop waiting for
pid_ns->pid_allocated == init_pids ;)
I always hated the fact that the exiting sub-namespace init can
"hang forever" if this namespace has tasks injected from the parent
namespace. And I had enough hard-to-debug internal bug reports which
blamed the kernel.
> This active waiting is weird and non-standard in the kernel and winds up
> causing a problem every couple of years because of that.
Agreed.
Oleg.
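For reference, the loop Oleg mentions looks roughly like this in
kernel/pid_namespace.c; child_reaper sleeps here until every pid in the
namespace has been freed:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (pid_ns->pid_allocated == init_pids)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

A task injected into the namespace from the parent namespace that never
exits keeps pid_allocated above init_pids, so the namespace init can indeed
sit here forever.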
Oleg Nesterov <[email protected]> writes:
> On 06/13, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > The comment above the idr_for_each_entry_continue() loop tries to explain
>> > why we have to signal each thread in the namespace, but it is outdated.
>> > This code no longer uses kill_proc_info(), we have a target task so we can
>> > check thread_group_leader() and avoid the unnecessary group_send_sig_info.
>> > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID;
>> > this way it returns NULL if this pid is not a group-leader pid.
>> >
>> > Also, change this code to check SIGNAL_GROUP_EXIT; the exiting process /
>> > thread doesn't necessarily have a pending SIGKILL. Either way these checks
>> > are racy without siglock, so the patch uses data_race() to shut up KCSAN.
>>
>> You remove the comment but the meat of what it was trying to say remains
>> true. For processes in a session or processes in a process group, a list
>> of all such processes is kept. No such list is kept for a pid
>> namespace. So the best we can do is walk through the allocated pid
>> numbers in the pid namespace.
>
> OK, I'll recheck tomorrow. Yet I think it doesn't make sense to send
> SIGKILL to sub-threads, and the comment looks misleading today. This was
> the main motivation, but again, I'll recheck.
Yes, we need to send SIGKILL to only one thread.
Of course there are a few weird cases with zombie leader threads,
but I think the pattern you are using handles them.
>> It would also help if this explained that in the case of SIGKILL
>> complete_signal always sets SIGNAL_GROUP_EXIT, which makes that a good
>> check to use to see if the process has been killed (with SIGKILL).
>
> Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was
> killed or not. It (the whole thread group) is going to exit, that is all.
>
> We can even remove this check; it is just an optimization, just like
> the current fatal_signal_pending() check.
I just meant that the optimization is effective because
group_send_sig_info calls complete_signal, which sets SIGNAL_GROUP_EXIT.
That makes it an almost 100% accurate test, which in turn makes it a very
good optimization, especially in the case of multi-threaded processes
where the code will arrive there for every thread.
Eric
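For reference, the part of complete_signal() (kernel/signal.c) that makes
this test nearly always accurate, heavily simplified: once a fatal signal
such as SIGKILL has been queued and the process is not dumping core, the
whole thread group is marked and every thread is woken with SIGKILL pending:

	/* simplified: the surrounding "is this signal fatal to the whole
	 * group?" checks are elided */
	signal->flags = SIGNAL_GROUP_EXIT;
	signal->group_exit_code = sig;
	signal->group_stop_count = 0;
	t = p;
	do {
		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
		sigaddset(&t->pending.signal, SIGKILL);
		signal_wake_up(t, 1);
	} while_each_thread(p, t);

So by the time zap_pid_ns_processes() looks at a process it has already
signalled, SIGNAL_GROUP_EXIT is almost certainly set and the extra
group_send_sig_info() can be skipped.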