2019-07-17 07:21:17

by Juri Lelli

[permalink] [raw]
Subject: [PATCH] net/xfrm/xfrm_ipcomp: Use {get,put}_cpu_light

The following BUG has been reported while running ipsec tests.

BUG: scheduling while atomic: irq/78-eno3-rx-/12023/0x00000002
Modules linked in: ipcomp xfrm_ipcomp ...
Preemption disabled at:
[<ffffffffc0b29730>] ipcomp_input+0xd0/0x9a0 [xfrm_ipcomp]
CPU: 1 PID: 12023 Comm: irq/78-eno3-rx- Kdump: loaded Not tainted [...] #1
Hardware name: [...]
Call Trace:
dump_stack+0x5c/0x80
? ipcomp_input+0xd0/0x9a0 [xfrm_ipcomp]
__schedule_bug.cold.81+0x44/0x51
__schedule+0x5bf/0x6a0
schedule+0x39/0xd0
rt_spin_lock_slowlock_locked+0x10e/0x2b0
rt_spin_lock_slowlock+0x50/0x80
get_page_from_freelist+0x609/0x1560
? zlib_updatewindow+0x5a/0xd0
__alloc_pages_nodemask+0xd9/0x280
ipcomp_input+0x299/0x9a0 [xfrm_ipcomp]
xfrm_input+0x5e3/0x960
xfrm4_ipcomp_rcv+0x34/0x50
ip_local_deliver_finish+0x22d/0x250
ip_local_deliver+0x6d/0x110
? ip_rcv_finish+0xac/0x480
ip_rcv+0x28e/0x3f9
? packet_rcv+0x43/0x4c0
__netif_receive_skb_core+0xb7c/0xd10
? inet_gro_receive+0x8e/0x2f0
netif_receive_skb_internal+0x4a/0x160
napi_gro_receive+0xee/0x110
tg3_rx+0x2a8/0x810 [tg3]
tg3_poll_work+0x3b3/0x830 [tg3]
tg3_poll_msix+0x3b/0x170 [tg3]
net_rx_action+0x1ff/0x470
? __switch_to_asm+0x41/0x70
do_current_softirqs+0x223/0x3e0
? irq_thread_check_affinity+0x20/0x20
__local_bh_enable+0x51/0x60
irq_forced_thread_fn+0x5e/0x80
? irq_finalize_oneshot.part.45+0xf0/0xf0
irq_thread+0x13d/0x1a0
? wake_threads_waitq+0x30/0x30
kthread+0x112/0x130
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x35/0x40

The problem resides in the fact that get_cpu() called from ipcomp_input()
disables preemption, and that triggers the scheduling while atomic BUG further
down the callpath chain of get_page_from_freelist(), i.e.,

ipcomp_input
ipcomp_decompress
<-- get_cpu()
alloc_page(GFP_ATOMIC)
alloc_pages(GFP_ATOMIC, 0)
alloc_pages_current
__alloc_pages_nodemask
get_page_from_freelist
(try_this_zone:) rmqueue
rmqueue_pcplist
__rmqueue_pcplist
rmqueue_bulk
<-- spin_lock(&zone->lock) - BUG

Fix this by using {get,put}_cpu_light() in ipcomp_decompress().

Signed-off-by: Juri Lelli <[email protected]>
---
Hi,

This has been found on a 4.19.x-rt kernel, but 5.x-rt(s) are affected as
well.

Best,

Juri
---
net/xfrm/xfrm_ipcomp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index a00ec715aa46..39d9e663384f 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -45,7 +45,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct sk_buff *skb)
const int plen = skb->len;
int dlen = IPCOMP_SCRATCH_SIZE;
const u8 *start = skb->data;
- const int cpu = get_cpu();
+ const int cpu = get_cpu_light();
u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
int err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
@@ -103,7 +103,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct sk_buff *skb)
err = 0;

out:
- put_cpu();
+ put_cpu_light();
return err;
}

--
2.17.2


Subject: Re: [PATCH] net/xfrm/xfrm_ipcomp: Use {get,put}_cpu_light



On 17/07/2019 09:20, Juri Lelli wrote:
> The following BUG has been reported while running ipsec tests.
>
> BUG: scheduling while atomic: irq/78-eno3-rx-/12023/0x00000002
> Modules linked in: ipcomp xfrm_ipcomp ...
> Preemption disabled at:
> [<ffffffffc0b29730>] ipcomp_input+0xd0/0x9a0 [xfrm_ipcomp]
> CPU: 1 PID: 12023 Comm: irq/78-eno3-rx- Kdump: loaded Not tainted [...] #1
> Hardware name: [...]
> Call Trace:
> dump_stack+0x5c/0x80
> ? ipcomp_input+0xd0/0x9a0 [xfrm_ipcomp]
> __schedule_bug.cold.81+0x44/0x51
> __schedule+0x5bf/0x6a0
> schedule+0x39/0xd0
> rt_spin_lock_slowlock_locked+0x10e/0x2b0
> rt_spin_lock_slowlock+0x50/0x80
> get_page_from_freelist+0x609/0x1560
> ? zlib_updatewindow+0x5a/0xd0
> __alloc_pages_nodemask+0xd9/0x280
> ipcomp_input+0x299/0x9a0 [xfrm_ipcomp]
> xfrm_input+0x5e3/0x960
> xfrm4_ipcomp_rcv+0x34/0x50
> ip_local_deliver_finish+0x22d/0x250
> ip_local_deliver+0x6d/0x110
> ? ip_rcv_finish+0xac/0x480
> ip_rcv+0x28e/0x3f9
> ? packet_rcv+0x43/0x4c0
> __netif_receive_skb_core+0xb7c/0xd10
> ? inet_gro_receive+0x8e/0x2f0
> netif_receive_skb_internal+0x4a/0x160
> napi_gro_receive+0xee/0x110
> tg3_rx+0x2a8/0x810 [tg3]
> tg3_poll_work+0x3b3/0x830 [tg3]
> tg3_poll_msix+0x3b/0x170 [tg3]
> net_rx_action+0x1ff/0x470
> ? __switch_to_asm+0x41/0x70
> do_current_softirqs+0x223/0x3e0
> ? irq_thread_check_affinity+0x20/0x20
> __local_bh_enable+0x51/0x60
> irq_forced_thread_fn+0x5e/0x80
> ? irq_finalize_oneshot.part.45+0xf0/0xf0
> irq_thread+0x13d/0x1a0
> ? wake_threads_waitq+0x30/0x30
> kthread+0x112/0x130
> ? kthread_create_worker_on_cpu+0x70/0x70
> ret_from_fork+0x35/0x40
>
> The problem resides in the fact that get_cpu() called from ipcomp_input()
> disables preemption, and that triggers the scheduling while atomic BUG further
> down the callpath chain of get_page_from_freelist(), i.e.,
>
> ipcomp_input
> ipcomp_decompress
> <-- get_cpu()
> alloc_page(GFP_ATOMIC)
> alloc_pages(GFP_ATOMIC, 0)
> alloc_pages_current
> __alloc_pages_nodemask
> get_page_from_freelist
> (try_this_zone:) rmqueue
> rmqueue_pcplist
> __rmqueue_pcplist
> rmqueue_bulk
> <-- spin_lock(&zone->lock) - BUG
>
> Fix this by using {get,put}_cpu_light() in ipcomp_decompress().
>
> Signed-off-by: Juri Lelli <[email protected]>

Reviewed-by: Daniel Bristot de Oliveira <[email protected]>

Thanks!
-- Daniel

2019-07-17 14:00:26

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH] net/xfrm/xfrm_ipcomp: Use {get,put}_cpu_light


>
> The following BUG has been reported while running ipsec tests.
>
> BUG: scheduling while atomic: irq/78-eno3-rx-/12023/0x00000002
> Modules linked in: ipcomp xfrm_ipcomp ...
> Preemption disabled at:
> [<ffffffffc0b29730>] ipcomp_input+0xd0/0x9a0 [xfrm_ipcomp]
> CPU: 1 PID: 12023 Comm: irq/78-eno3-rx- Kdump: loaded Not tainted [...] #1
> Hardware name: [...]
> Call Trace:
> dump_stack+0x5c/0x80
> ? ipcomp_input+0xd0/0x9a0 [xfrm_ipcomp]
> __schedule_bug.cold.81+0x44/0x51
> __schedule+0x5bf/0x6a0
> schedule+0x39/0xd0
> rt_spin_lock_slowlock_locked+0x10e/0x2b0
> rt_spin_lock_slowlock+0x50/0x80
> get_page_from_freelist+0x609/0x1560
> ? zlib_updatewindow+0x5a/0xd0
> __alloc_pages_nodemask+0xd9/0x280
> ipcomp_input+0x299/0x9a0 [xfrm_ipcomp]
> xfrm_input+0x5e3/0x960
> xfrm4_ipcomp_rcv+0x34/0x50
> ip_local_deliver_finish+0x22d/0x250
> ip_local_deliver+0x6d/0x110
> ? ip_rcv_finish+0xac/0x480
> ip_rcv+0x28e/0x3f9
> ? packet_rcv+0x43/0x4c0
> __netif_receive_skb_core+0xb7c/0xd10
> ? inet_gro_receive+0x8e/0x2f0
> netif_receive_skb_internal+0x4a/0x160
> napi_gro_receive+0xee/0x110
> tg3_rx+0x2a8/0x810 [tg3]
> tg3_poll_work+0x3b3/0x830 [tg3]
> tg3_poll_msix+0x3b/0x170 [tg3]
> net_rx_action+0x1ff/0x470
> ? __switch_to_asm+0x41/0x70
> do_current_softirqs+0x223/0x3e0
> ? irq_thread_check_affinity+0x20/0x20
> __local_bh_enable+0x51/0x60
> irq_forced_thread_fn+0x5e/0x80
> ? irq_finalize_oneshot.part.45+0xf0/0xf0
> irq_thread+0x13d/0x1a0
> ? wake_threads_waitq+0x30/0x30
> kthread+0x112/0x130
> ? kthread_create_worker_on_cpu+0x70/0x70
> ret_from_fork+0x35/0x40
>
> The problem resides in the fact that get_cpu() called from ipcomp_input()
> disables preemption, and that triggers the scheduling while atomic BUG
> further
> down the callpath chain of get_page_from_freelist(), i.e.,
>
> ipcomp_input
> ipcomp_decompress
> <-- get_cpu()
> alloc_page(GFP_ATOMIC)
> alloc_pages(GFP_ATOMIC, 0)
> alloc_pages_current
> __alloc_pages_nodemask
> get_page_from_freelist
> (try_this_zone:) rmqueue
> rmqueue_pcplist
> __rmqueue_pcplist
> rmqueue_bulk
> <-- spin_lock(&zone->lock) - BUG
>
> Fix this by using {get,put}_cpu_light() in ipcomp_decompress().
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> Hi,
>
> This has been found on a 4.19.x-rt kernel, but 5.x-rt(s) are affected as
> well.
>
> Best,
>
> Juri
> ---
> net/xfrm/xfrm_ipcomp.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
> index a00ec715aa46..39d9e663384f 100644
> --- a/net/xfrm/xfrm_ipcomp.c
> +++ b/net/xfrm/xfrm_ipcomp.c
> @@ -45,7 +45,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct
> sk_buff *skb)
> const int plen = skb->len;
> int dlen = IPCOMP_SCRATCH_SIZE;
> const u8 *start = skb->data;
> - const int cpu = get_cpu();
> + const int cpu = get_cpu_light();
> u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
> struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
> int err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
> @@ -103,7 +103,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct
> sk_buff *skb)
> err = 0;
>
> out:
> - put_cpu();
> + put_cpu_light();
> return err;
> }
>
> --

Reviewed-by: Pankaj Gupta <[email protected]>

> 2.17.2
>
>

Subject: Re: [PATCH] net/xfrm/xfrm_ipcomp: Use {get,put}_cpu_light

On 2019-07-17 09:20:19 [+0200], Juri Lelli wrote:
> The following BUG has been reported while running ipsec tests.

> Hi,
>
> This has been found on a 4.19.x-rt kernel, but 5.x-rt(s) are affected as
> well.
>
> Best,
>
> Juri
> ---
> net/xfrm/xfrm_ipcomp.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
> index a00ec715aa46..39d9e663384f 100644
> --- a/net/xfrm/xfrm_ipcomp.c
> +++ b/net/xfrm/xfrm_ipcomp.c
> @@ -45,7 +45,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct sk_buff *skb)
> const int plen = skb->len;
> int dlen = IPCOMP_SCRATCH_SIZE;
> const u8 *start = skb->data;
> - const int cpu = get_cpu();
> + const int cpu = get_cpu_light();

By using get_cpu_light() you don't forbid another function to invoke
ipcomp_decompress() on the same CPU. That means that

> u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);

scratch buffer here could be used by two tasks on the same CPU. You are
aware of that right?
According to your backtrace you get here from NAPI which means BH which
means it is enough to use smp_processor_id() in such a case.

ipcomp_compress() is using the very same buffer while invoking
local_bh_disable() before using the buffer to ensure nothing else is
using the buffer on this CPU. This will work in v5.2-RT because the new
softirq code uses a local_lock() as part of local_bh_disable(). This
does not work on v4.19-RT and earlier.

For v4.19 and earlier I suggest to use a local_lock().
For v5.2 and later I suggest to replace get_cpu() with
smp_processor_id(). Ideally a with a lockdep annotation to ensure that
BH is disabled (which we don't have).

> struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
> int err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);

Sebastian