2018-10-03 11:22:51

by Håkon Bugge

Subject: Bug introduced by commit ebeeb1ad9b8a

Hi Greg,


I hope you will find this note appropriate.

The stable cherry-pick of upstream commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") provokes the following stack trace when running with debug:


kernel: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:748
kernel: =============================
kernel: in_atomic(): 1, irqs_disabled(): 0, pid: 4392, name: rds-stress
kernel: 1 lock held by rds-stress/4392:
kernel: #0: 00000000df837d5e
kernel: WARNING: suspicious RCU usage
kernel: 4.18.8 #1 Not tainted
kernel: -----------------------------
kernel: ./include/linux/rcupdate.h:303 Illegal context switch in RCU read-side critical section!
kernel: (
kernel: #012other info that might help us debug this:
kernel: #012rcu_scheduler_active = 2, debug_locks = 1
kernel: rcu_read_lock){....}
kernel: 1 lock held by rds-stress/4393:
kernel: #0:
kernel: , at: __rds_conn_create+0x604/0x960 [rds]
kernel: 00000000df837d5e
kernel: CPU: 38 PID: 4392 Comm: rds-stress Not tainted 4.18.8 #1
kernel: Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
kernel: (rcu_read_lock
kernel: Call Trace:
kernel: ){....}
kernel: dump_stack+0x81/0xb8
kernel: , at: __rds_conn_create+0x604/0x960 [rds]
kernel: #012stack backtrace:
kernel: ___might_sleep+0x239/0x260
kernel: __might_sleep+0x4a/0x80
kernel: __mutex_lock+0x58/0x9c0
kernel: ? __lock_acquire+0x47f/0x7e0
kernel: ? pcpu_alloc+0x429/0x860
kernel: ? find_held_lock+0x40/0xb0
kernel: ? create_object+0x22f/0x320
kernel: ? _raw_write_unlock_irqrestore+0x36/0x60
kernel: mutex_lock_killable_nested+0x1b/0x20
kernel: pcpu_alloc+0x429/0x860
kernel: ? create_object+0x22f/0x320
kernel: __alloc_percpu+0x15/0x20
kernel: rds_ib_recv_alloc_cache+0x1c/0x80 [rds_rdma]
kernel: rds_ib_recv_alloc_caches+0x1d/0x60 [rds_rdma]
kernel: rds_ib_conn_alloc+0x46/0x170 [rds_rdma]
kernel: __rds_conn_create+0x68d/0x960 [rds]
kernel: ? __rds_conn_create+0x604/0x960 [rds]
kernel: rds_conn_create_outgoing+0x14/0x20 [rds]
kernel: rds_sendmsg+0x2e8/0xcd0 [rds]
kernel: ? copy_msghdr_from_user+0xdb/0x140
kernel: sock_sendmsg+0x38/0x50
kernel: ___sys_sendmsg+0x27b/0x290
kernel: ? __lock_acquire+0x47f/0x7e0
kernel: ? find_held_lock+0x40/0xb0
kernel: ? __audit_syscall_entry+0xdf/0x160
kernel: ? ktime_get_coarse_real_ts64+0x6e/0xe0
kernel: ? trace_hardirqs_on_caller+0x128/0x1b0
kernel: ? trace_hardirqs_on+0xd/0x10
kernel: ? __audit_syscall_entry+0xdf/0x160
kernel: ? __audit_syscall_entry+0xdf/0x160
kernel: __sys_sendmsg+0x5d/0xb0
kernel: __x64_sys_sendmsg+0x1f/0x30
kernel: do_syscall_64+0x5f/0x220
kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe

Command line:

$ rds-stress -r <IB port 1 IP>& sleep 1; rds-stress -r <IB port 2 IP> -s <IB port 1 IP> -T 10

Deliberately or accidentally, Ka-Cheong's commit f394ad28feff ("rds: rds_ib_recv_alloc_cache() should call alloc_percpu_gfp() instead") fixes the bug introduced by commit ebeeb1ad9b8a. Kudos to Zhu Yanjun who quickly detected this.

But be aware, commit f394ad28feff does not contain the "Fixes:" tag.

Hence, I suggest that in all stable releases containing commit ebeeb1ad9b8a, f394ad28feff must be included as well.
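
As a rough illustration of the check suggested above, here is a minimal Python sketch (not part of any existing stable tooling). It assumes that backports in a stable tree carry the usual "commit <sha> upstream." or "[ Upstream commit <sha> ]" line in their commit message; the function names are made up for this note, and the two abbreviated hashes are the ones mentioned in this mail.

```python
import re

# Matches the two upstream-reference lines commonly found in stable
# backport commit messages (assumed convention, see the note above):
#   commit <sha> upstream.
#   [ Upstream commit <sha> ]
UPSTREAM_RE = re.compile(
    r"^[ \t]*commit ([0-9a-f]{12,40}) upstream\.?[ \t]*$"
    r"|^[ \t]*\[ Upstream commit ([0-9a-f]{12,40}) \][ \t]*$",
    re.MULTILINE,
)


def upstream_ids(log_text):
    """Return the set of upstream hashes referenced in the log text."""
    ids = set()
    for match in UPSTREAM_RE.finditer(log_text):
        ids.add(match.group(1) or match.group(2))
    return ids


def has_commit(ids, abbrev):
    """True if any referenced hash starts with the abbreviated hash."""
    return any(full.startswith(abbrev) for full in ids)


def needs_followup(log_text, culprit="ebeeb1ad9b8a", fix="f394ad28feff"):
    """True if the culprit was backported but the fix was not."""
    ids = upstream_ids(log_text)
    return has_commit(ids, culprit) and not has_commit(ids, fix)
```

The input would be the commit messages of the stable branch in question, for example the output of `git log --format=%B` over the relevant range. A result of True means the branch has ebeeb1ad9b8a without f394ad28feff.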


Thxs, Håkon

2018-10-03 11:28:53

by Greg Kroah-Hartman

Subject: Re: Bug introduced by commit ebeeb1ad9b8a

On Wed, Oct 03, 2018 at 01:20:44PM +0200, Håkon Bugge wrote:
> Hi Greg,
>
>
> I hope you will find this note appropriate.
>
> The stable cherry-pick of upstream commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") provokes the following stack trace when running with debug:
>
>
> kernel: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:748
> kernel: =============================
> kernel: in_atomic(): 1, irqs_disabled(): 0, pid: 4392, name: rds-stress
> kernel: 1 lock held by rds-stress/4392:
> kernel: #0: 00000000df837d5e
> kernel: WARNING: suspicious RCU usage
> kernel: 4.18.8 #1 Not tainted
> kernel: -----------------------------
> kernel: ./include/linux/rcupdate.h:303 Illegal context switch in RCU read-side critical section!
> kernel: (
> kernel: #012other info that might help us debug this:
> kernel: #012rcu_scheduler_active = 2, debug_locks = 1
> kernel: rcu_read_lock){....}
> kernel: 1 lock held by rds-stress/4393:
> kernel: #0:
> kernel: , at: __rds_conn_create+0x604/0x960 [rds]
> kernel: 00000000df837d5e
> kernel: CPU: 38 PID: 4392 Comm: rds-stress Not tainted 4.18.8 #1
> kernel: Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
> kernel: (rcu_read_lock
> kernel: Call Trace:
> kernel: ){....}
> kernel: dump_stack+0x81/0xb8
> kernel: , at: __rds_conn_create+0x604/0x960 [rds]
> kernel: #012stack backtrace:
> kernel: ___might_sleep+0x239/0x260
> kernel: __might_sleep+0x4a/0x80
> kernel: __mutex_lock+0x58/0x9c0
> kernel: ? __lock_acquire+0x47f/0x7e0
> kernel: ? pcpu_alloc+0x429/0x860
> kernel: ? find_held_lock+0x40/0xb0
> kernel: ? create_object+0x22f/0x320
> kernel: ? _raw_write_unlock_irqrestore+0x36/0x60
> kernel: mutex_lock_killable_nested+0x1b/0x20
> kernel: pcpu_alloc+0x429/0x860
> kernel: ? create_object+0x22f/0x320
> kernel: __alloc_percpu+0x15/0x20
> kernel: rds_ib_recv_alloc_cache+0x1c/0x80 [rds_rdma]
> kernel: rds_ib_recv_alloc_caches+0x1d/0x60 [rds_rdma]
> kernel: rds_ib_conn_alloc+0x46/0x170 [rds_rdma]
> kernel: __rds_conn_create+0x68d/0x960 [rds]
> kernel: ? __rds_conn_create+0x604/0x960 [rds]
> kernel: rds_conn_create_outgoing+0x14/0x20 [rds]
> kernel: rds_sendmsg+0x2e8/0xcd0 [rds]
> kernel: ? copy_msghdr_from_user+0xdb/0x140
> kernel: sock_sendmsg+0x38/0x50
> kernel: ___sys_sendmsg+0x27b/0x290
> kernel: ? __lock_acquire+0x47f/0x7e0
> kernel: ? find_held_lock+0x40/0xb0
> kernel: ? __audit_syscall_entry+0xdf/0x160
> kernel: ? ktime_get_coarse_real_ts64+0x6e/0xe0
> kernel: ? trace_hardirqs_on_caller+0x128/0x1b0
> kernel: ? trace_hardirqs_on+0xd/0x10
> kernel: ? __audit_syscall_entry+0xdf/0x160
> kernel: ? __audit_syscall_entry+0xdf/0x160
> kernel: __sys_sendmsg+0x5d/0xb0
> kernel: __x64_sys_sendmsg+0x1f/0x30
> kernel: do_syscall_64+0x5f/0x220
> kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Command line:
>
> $ rds-stress -r <IB port 1 IP>& sleep 1; rds-stress -r <IB port 2 IP> -s <IB port 1 IP> -T 10
>
> Deliberately or accidentally, Ka-Cheong's commit f394ad28feff ("rds: rds_ib_recv_alloc_cache() should call alloc_percpu_gfp() instead") fixes the bug introduced by commit ebeeb1ad9b8a. Kudos to Zhu Yanjun who quickly detected this.
>
> But be aware, commit f394ad28feff does not contain the "Fixes:" tag.
>
> Hence, I suggest that in all stable releases containing commit ebeeb1ad9b8a, f394ad28feff must be included as well.

Great, thanks for the information. Can you submit this info to the
netdev developers who will queue it up for a stable release? Or, as
David is already on the cc: list here, he can just tell me to
cherry-pick it and I can do it on my own :)

thanks,

greg k-h