2015-04-06 04:00:28

by Sasha Levin

[permalink] [raw]
Subject: Hang on large copy_from_user with PREEMPT_NONE

Hi all,

I'm seeing an interesting hang when trinity is trying to load a large module, where
the size passed by userspace is larger than the amount of memory actually allocated
in userspace and passed in the 'from' parameter:

[ 1549.080032] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [trinity-c11:12699]
[ 1549.080032] Modules linked in:
[ 1549.080032] hardirqs last enabled at (4202427): restore_args (arch/x86/kernel/entry_64.S:750)
[ 1549.080032] hardirqs last disabled at (4202428): apic_timer_interrupt (arch/x86/kernel/entry_64.S:890)
[ 1549.080032] softirqs last enabled at (4202426): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:300)
[ 1549.080032] softirqs last disabled at (4202421): irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[ 1549.080032] CPU: 11 PID: 12699 Comm: trinity-c11 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2125
[ 1549.080032] task: ffff880260f30000 ti: ffff88025ebe8000 task.ti: ffff88025ebe8000
[ 1549.080032] RIP: copy_user_handle_tail (arch/x86/lib/usercopy_64.c:85)
[ 1549.080032] RSP: 0000:ffff88025ebefe38 EFLAGS: 00010202
[ 1549.080032] RAX: 00000000356cb494 RBX: 1ffff1004bd7dfcb RCX: 0000000000000000
[ 1549.080032] RDX: 000000009277d652 RSI: ffffc90078bba001 RDI: ffffc90078bba000
[ 1549.080032] RBP: ffff88025ebefe38 R08: 0000000000000000 R09: 0000000002f75001
[ 1549.080032] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025ebefe78
[ 1549.080032] R13: ffff88025ebeff18 R14: 0000000094949494 R15: 0000000000da91be
[ 1549.080032] FS: 00007f1f669df700(0000) GS:ffff8803f2800000(0000) knlGS:0000000000000000
[ 1549.080032] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1549.080032] CR2: 0000000002f75000 CR3: 000000025d031000 CR4: 00000000000007a0
[ 1549.080032] Stack:
[ 1549.080032] ffff88025ebeff48 ffffffff963c5ef0 ffffc9001993c000 0000000000da91be
[ 1549.080032] 0000000041b58ab3 ffffffffa32918f3 ffffffff963c5da0 0000000000000286
[ 1549.080032] ffffc9001993c000 0000000094949494 0000000000000000 0000000000000000
[ 1549.080032] Call Trace:
[ 1549.080032] SyS_init_module (kernel/module.c:2505 kernel/module.c:3401 kernel/module.c:3388)
[ 1549.080032] ? load_module (kernel/module.c:3388)
[ 1549.080032] ia32_do_call (arch/x86/ia32/ia32entry.S:501)
[ 1549.080032] Code: d0 66 66 90 8a 06 66 66 90 45 85 c0 74 d3 85 c9 74 23 89 d0 45 31 c0 eb 08 83 e8 01 48 89 f7 74 14 48 8d 77 01 44 89 c1 66 66 90 <c6> 07 00 66 66 90 85 c9 74 e4 89 d0 66 66 90 5d c3 0f 1f 44 00
All code
========
0: d0 66 66 shlb 0x66(%rsi)
3: 90 nop
4: 8a 06 mov (%rsi),%al
6: 66 66 90 data32 xchg %ax,%ax
9: 45 85 c0 test %r8d,%r8d
c: 74 d3 je 0xffffffffffffffe1
e: 85 c9 test %ecx,%ecx
10: 74 23 je 0x35
12: 89 d0 mov %edx,%eax
14: 45 31 c0 xor %r8d,%r8d
17: eb 08 jmp 0x21
19: 83 e8 01 sub $0x1,%eax
1c: 48 89 f7 mov %rsi,%rdi
1f: 74 14 je 0x35
21: 48 8d 77 01 lea 0x1(%rdi),%rsi
25: 44 89 c1 mov %r8d,%ecx
28: 66 66 90 data32 xchg %ax,%ax
2b:* c6 07 00 movb $0x0,(%rdi) <-- trapping instruction
2e: 66 66 90 data32 xchg %ax,%ax
31: 85 c9 test %ecx,%ecx
33: 74 e4 je 0x19
35: 89 d0 mov %edx,%eax
37: 66 66 90 data32 xchg %ax,%ax
3a: 5d pop %rbp
3b: c3 retq
3c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

Code starting with the faulting instruction
===========================================
0: c6 07 00 movb $0x0,(%rdi)
3: 66 66 90 data32 xchg %ax,%ax
6: 85 c9 test %ecx,%ecx
8: 74 e4 je 0xffffffffffffffee
a: 89 d0 mov %edx,%eax
c: 66 66 90 data32 xchg %ax,%ax
f: 5d pop %rbp
10: c3 retq
11: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)


This is the result of getting copy_user_handle_tail to zero out a large block of
kernel memory very inefficiently:

for (c = 0, zero_len = len; zerorest && zero_len; --zero_len)
if (__put_user_nocheck(c, to++, sizeof(char)))
break;


Thanks,
Sasha


2015-04-06 11:26:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Sun, Apr 05, 2015 at 11:59:52PM -0400, Sasha Levin wrote:
> Hi all,
>
> I'm seeing an interesting hang when trinity is trying to load a large module, where
> the size passed by userspace is larger than the amount of memory actually allocated
> in userspace and passed in the 'from' parameter:
>
> [ 1549.080032] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [trinity-c11:12699]
> [ 1549.080032] Modules linked in:
> [ 1549.080032] hardirqs last enabled at (4202427): restore_args (arch/x86/kernel/entry_64.S:750)
> [ 1549.080032] hardirqs last disabled at (4202428): apic_timer_interrupt (arch/x86/kernel/entry_64.S:890)
> [ 1549.080032] softirqs last enabled at (4202426): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:300)
> [ 1549.080032] softirqs last disabled at (4202421): irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [ 1549.080032] CPU: 11 PID: 12699 Comm: trinity-c11 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2125
> [ 1549.080032] task: ffff880260f30000 ti: ffff88025ebe8000 task.ti: ffff88025ebe8000
> [ 1549.080032] RIP: copy_user_handle_tail (arch/x86/lib/usercopy_64.c:85)
> [ 1549.080032] RSP: 0000:ffff88025ebefe38 EFLAGS: 00010202
> [ 1549.080032] RAX: 00000000356cb494 RBX: 1ffff1004bd7dfcb RCX: 0000000000000000

Wow, that's some serious len in %rax. Almost a gigabyte AFAICT.

> [ 1549.080032] RDX: 000000009277d652 RSI: ffffc90078bba001 RDI: ffffc90078bba000

Btw, what's happening to that user pointer in %rdi, is it kosher?

RSI: ffffc90078bba001
RDI: ffffc90078bba000

Hohumm, @from and @to overlap by a byte... Interesting...

> [ 1549.080032] RBP: ffff88025ebefe38 R08: 0000000000000000 R09: 0000000002f75001
> [ 1549.080032] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025ebefe78
> [ 1549.080032] R13: ffff88025ebeff18 R14: 0000000094949494 R15: 0000000000da91be
> [ 1549.080032] FS: 00007f1f669df700(0000) GS:ffff8803f2800000(0000) knlGS:0000000000000000
> [ 1549.080032] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1549.080032] CR2: 0000000002f75000 CR3: 000000025d031000 CR4: 00000000000007a0
> [ 1549.080032] Stack:
> [ 1549.080032] ffff88025ebeff48 ffffffff963c5ef0 ffffc9001993c000 0000000000da91be
> [ 1549.080032] 0000000041b58ab3 ffffffffa32918f3 ffffffff963c5da0 0000000000000286
> [ 1549.080032] ffffc9001993c000 0000000094949494 0000000000000000 0000000000000000
> [ 1549.080032] Call Trace:
> [ 1549.080032] SyS_init_module (kernel/module.c:2505 kernel/module.c:3401 kernel/module.c:3388)
> [ 1549.080032] ? load_module (kernel/module.c:3388)
> [ 1549.080032] ia32_do_call (arch/x86/ia32/ia32entry.S:501)
> [ 1549.080032] Code: d0 66 66 90 8a 06 66 66 90 45 85 c0 74 d3 85 c9 74 23 89 d0 45 31 c0 eb 08 83 e8 01 48 89 f7 74 14 48 8d 77 01 44 89 c1 66 66 90 <c6> 07 00 66 66 90 85 c9 74 e4 89 d0 66 66 90 5d c3 0f 1f 44 00
> All code
> ========
> 0: d0 66 66 shlb 0x66(%rsi)
> 3: 90 nop
> 4: 8a 06 mov (%rsi),%al
> 6: 66 66 90 data32 xchg %ax,%ax
> 9: 45 85 c0 test %r8d,%r8d
> c: 74 d3 je 0xffffffffffffffe1
> e: 85 c9 test %ecx,%ecx
> 10: 74 23 je 0x35
> 12: 89 d0 mov %edx,%eax
> 14: 45 31 c0 xor %r8d,%r8d
> 17: eb 08 jmp 0x21
> 19: 83 e8 01 sub $0x1,%eax
> 1c: 48 89 f7 mov %rsi,%rdi
> 1f: 74 14 je 0x35
> 21: 48 8d 77 01 lea 0x1(%rdi),%rsi
> 25: 44 89 c1 mov %r8d,%ecx
> 28: 66 66 90 data32 xchg %ax,%ax
> 2b:* c6 07 00 movb $0x0,(%rdi) <-- trapping instruction
> 2e: 66 66 90 data32 xchg %ax,%ax
> 31: 85 c9 test %ecx,%ecx
> 33: 74 e4 je 0x19
> 35: 89 d0 mov %edx,%eax
> 37: 66 66 90 data32 xchg %ax,%ax
> 3a: 5d pop %rbp
> 3b: c3 retq
> 3c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

Hmm, lotsa alternatives patching around that area even though I can't
see anything wrong from looking at your dump. The "66 66 90" nops are
the STAC/CLAC things optimized to NOPs which are asm volatile within the
__put_user_asm()'s own asm volatile. And I thought the labels might be
fudged but my usercopy_64.s version here looks ok.

Can you boot that box with "debug-alternative" and send me dmesg? Also,
vmlinux too please. Privately's fine too.

>
> Code starting with the faulting instruction
> ===========================================
> 0: c6 07 00 movb $0x0,(%rdi)
> 3: 66 66 90 data32 xchg %ax,%ax
> 6: 85 c9 test %ecx,%ecx
> 8: 74 e4 je 0xffffffffffffffee
> a: 89 d0 mov %edx,%eax
> c: 66 66 90 data32 xchg %ax,%ax
> f: 5d pop %rbp
> 10: c3 retq
> 11: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>
>
> This is the result of getting copy_user_handle_tail to zero out a large block of
> kernel memory very inefficiently:
>
> for (c = 0, zero_len = len; zerorest && zero_len; --zero_len)

Btw, that zerorest is being tested on every loop iteration! AFAICT,

if (!zerorest) {
clac();
return len;
}

before the loop should be nicer. Or am I missing something?

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-06 14:52:40

by Sasha Levin

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On 04/06/2015 07:24 AM, Borislav Petkov wrote:
> On Sun, Apr 05, 2015 at 11:59:52PM -0400, Sasha Levin wrote:
>> Hi all,
>>
>> I'm seeing an interesting hang when trinity is trying to load a large module, where
>> the size passed by userspace is larger than the amount of memory actually allocated
>> in userspace and passed in the 'from' parameter:
>>
>> [ 1549.080032] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [trinity-c11:12699]
>> [ 1549.080032] Modules linked in:
>> [ 1549.080032] hardirqs last enabled at (4202427): restore_args (arch/x86/kernel/entry_64.S:750)
>> [ 1549.080032] hardirqs last disabled at (4202428): apic_timer_interrupt (arch/x86/kernel/entry_64.S:890)
>> [ 1549.080032] softirqs last enabled at (4202426): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:300)
>> [ 1549.080032] softirqs last disabled at (4202421): irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> [ 1549.080032] CPU: 11 PID: 12699 Comm: trinity-c11 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2125
>> [ 1549.080032] task: ffff880260f30000 ti: ffff88025ebe8000 task.ti: ffff88025ebe8000
>> [ 1549.080032] RIP: copy_user_handle_tail (arch/x86/lib/usercopy_64.c:85)
>> [ 1549.080032] RSP: 0000:ffff88025ebefe38 EFLAGS: 00010202
>> [ 1549.080032] RAX: 00000000356cb494 RBX: 1ffff1004bd7dfcb RCX: 0000000000000000
>
> Wow, that's some serious len in %rax. Almost a gigabyte AFAICT.

Yup, you can load (or, the kernel would allow you to attempt and load) a module of up to 4GB in size.

>> [ 1549.080032] RDX: 000000009277d652 RSI: ffffc90078bba001 RDI: ffffc90078bba000
>
> Btw, what's happening to that user pointer in %rdi, is it kosher?

Probably not. Trinity passes random addresses into that syscall, so it's probably one of it's magic
values.

> RSI: ffffc90078bba001
> RDI: ffffc90078bba000
>
> Hohumm, @from and @to overlap by a byte... Interesting...
>
>> [ 1549.080032] RBP: ffff88025ebefe38 R08: 0000000000000000 R09: 0000000002f75001
>> [ 1549.080032] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025ebefe78
>> [ 1549.080032] R13: ffff88025ebeff18 R14: 0000000094949494 R15: 0000000000da91be
>> [ 1549.080032] FS: 00007f1f669df700(0000) GS:ffff8803f2800000(0000) knlGS:0000000000000000
>> [ 1549.080032] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 1549.080032] CR2: 0000000002f75000 CR3: 000000025d031000 CR4: 00000000000007a0
>> [ 1549.080032] Stack:
>> [ 1549.080032] ffff88025ebeff48 ffffffff963c5ef0 ffffc9001993c000 0000000000da91be
>> [ 1549.080032] 0000000041b58ab3 ffffffffa32918f3 ffffffff963c5da0 0000000000000286
>> [ 1549.080032] ffffc9001993c000 0000000094949494 0000000000000000 0000000000000000
>> [ 1549.080032] Call Trace:
>> [ 1549.080032] SyS_init_module (kernel/module.c:2505 kernel/module.c:3401 kernel/module.c:3388)
>> [ 1549.080032] ? load_module (kernel/module.c:3388)
>> [ 1549.080032] ia32_do_call (arch/x86/ia32/ia32entry.S:501)
>> [ 1549.080032] Code: d0 66 66 90 8a 06 66 66 90 45 85 c0 74 d3 85 c9 74 23 89 d0 45 31 c0 eb 08 83 e8 01 48 89 f7 74 14 48 8d 77 01 44 89 c1 66 66 90 <c6> 07 00 66 66 90 85 c9 74 e4 89 d0 66 66 90 5d c3 0f 1f 44 00
>> All code
>> ========
>> 0: d0 66 66 shlb 0x66(%rsi)
>> 3: 90 nop
>> 4: 8a 06 mov (%rsi),%al
>> 6: 66 66 90 data32 xchg %ax,%ax
>> 9: 45 85 c0 test %r8d,%r8d
>> c: 74 d3 je 0xffffffffffffffe1
>> e: 85 c9 test %ecx,%ecx
>> 10: 74 23 je 0x35
>> 12: 89 d0 mov %edx,%eax
>> 14: 45 31 c0 xor %r8d,%r8d
>> 17: eb 08 jmp 0x21
>> 19: 83 e8 01 sub $0x1,%eax
>> 1c: 48 89 f7 mov %rsi,%rdi
>> 1f: 74 14 je 0x35
>> 21: 48 8d 77 01 lea 0x1(%rdi),%rsi
>> 25: 44 89 c1 mov %r8d,%ecx
>> 28: 66 66 90 data32 xchg %ax,%ax
>> 2b:* c6 07 00 movb $0x0,(%rdi) <-- trapping instruction
>> 2e: 66 66 90 data32 xchg %ax,%ax
>> 31: 85 c9 test %ecx,%ecx
>> 33: 74 e4 je 0x19
>> 35: 89 d0 mov %edx,%eax
>> 37: 66 66 90 data32 xchg %ax,%ax
>> 3a: 5d pop %rbp
>> 3b: c3 retq
>> 3c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>
> Hmm, lotsa alternatives patching around that area even though I can't
> see anything wrong from looking at your dump. The "66 66 90" nops are
> the STAC/CLAC things optimized to NOPs which are asm volatile within the
> __put_user_asm()'s own asm volatile. And I thought the labels might be
> fudged but my usercopy_64.s version here looks ok.
>
> Can you boot that box with "debug-alternative" and send me dmesg? Also,
> vmlinux too please. Privately's fine too.

I've attached dmesg with debug-alternative, and my .config. vmlinux is pretty
large (1GB compressed) so hopefully the config would be enough - let me know
if it isn't.

>> Code starting with the faulting instruction
>> ===========================================
>> 0: c6 07 00 movb $0x0,(%rdi)
>> 3: 66 66 90 data32 xchg %ax,%ax
>> 6: 85 c9 test %ecx,%ecx
>> 8: 74 e4 je 0xffffffffffffffee
>> a: 89 d0 mov %edx,%eax
>> c: 66 66 90 data32 xchg %ax,%ax
>> f: 5d pop %rbp
>> 10: c3 retq
>> 11: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>>
>>
>> This is the result of getting copy_user_handle_tail to zero out a large block of
>> kernel memory very inefficiently:
>>
>> for (c = 0, zero_len = len; zerorest && zero_len; --zero_len)
>
> Btw, that zerorest is being tested on every loop iteration! AFAICT,
>
> if (!zerorest) {
> clac();
> return len;
> }
>
> before the loop should be nicer. Or am I missing something?

Looks good to me. I suspect the compiler optimizes that anyway though.


Thanks,
Sasha


Attachments:
sasha-config (159.66 kB)
dmesg.gz (118.38 kB)
Download all attachments

2015-04-06 16:54:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Mon, Apr 06, 2015 at 10:51:42AM -0400, Sasha Levin wrote:
> > Btw, what's happening to that user pointer in %rdi, is it kosher?
>
> Probably not. Trinity passes random addresses into that syscall, so it's probably one of it's magic
> values.

So what is the expectation? If that pointer goes away from under us,
userspace is supposed to get some signal instead of hanging?

> I've attached dmesg with debug-alternative, and my .config. vmlinux is pretty
> large (1GB compressed) so hopefully the config would be enough - let me know
> if it isn't.

Thanks, I'll give it a try and do some staring tomorrow. Still a holiday
today here :-)

> > Btw, that zerorest is being tested on every loop iteration! AFAICT,
> >
> > if (!zerorest) {
> > clac();
> > return len;
> > }
> >
> > before the loop should be nicer. Or am I missing something?
>
> Looks good to me. I suspect the compiler optimizes that anyway though.

Yeah, doesn't look like it. See the

e: 85 c9 test %ecx,%ecx

?

That's:

---
testl %esi, %esi # __pu_err
je .L34 #,
.L21:
testl %ecx, %ecx # zerorest <---
je .L33 #,
movl %edx, %eax # len, len
xorl %r8d, %r8d # tmp194
jmp .L27 #
.p2align 4,,7
.p2align 3
.L35:
decl %eax # len
movq %rsi, %rdi # to, to
je .L33 #,
.L27:
leaq 1(%rdi), %rsi #, to
---

and my asm is not so different from yours so that I can match your asm
with the .s output here.

Anyway, that's a different story, I'll do a fix for it.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-06 17:26:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Sun, Apr 5, 2015 at 8:59 PM, Sasha Levin <[email protected]> wrote:
>
> This is the result of getting copy_user_handle_tail to zero out a large block of
> kernel memory very inefficiently:

Ugh.

Normally we should be able to just do

if (zerorest)
memset(to, 0, len);

and be done with it.

The only reason we don't do that actually looks like a buglet in
'copy_in_user()', which can have a source fault but should *not*
necessarily try to clear the rest of the destination buffer. But it
uses the shared "copy_user_generic()" logic, so it doesn't even
realize that.

I call it a "buglet", because there's not necessarily anything
horribly wrong with clearing the tail, it's just completely wasted
work. And it makes the "oops, source is bad" case unnecessarily
horribly slow.

In fact, the whole "zerorest" thing is garbage, I think. The
copy_user_generic() code seems to always set it to just 'len', and
it's because it doesn't even know or care about the actual direction.

The *real* test should just be "is the destination a kernel space
buffer" (we have done the "access_ok()" things independently). And
that test we can do without any 'zerorest' parameter.

So the attached (untested) patch should

(a) remove the pointless 'zerorest' parameter

(b) fix the 'copy_in_user()' buglet

(c) make the kernel destination case be much more efficient with just
a simple 'memset()'

Hmm? Comments? Sasha, do you have the initial random number state to
recreate this easily?

Linus


Attachments:
patch.diff (1.54 kB)

2015-04-06 19:09:05

by Sasha Levin

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On 04/06/2015 01:26 PM, Linus Torvalds wrote:
> On Sun, Apr 5, 2015 at 8:59 PM, Sasha Levin <[email protected]> wrote:
>>
>> This is the result of getting copy_user_handle_tail to zero out a large block of
>> kernel memory very inefficiently:
>
> Ugh.
>
> Normally we should be able to just do
>
> if (zerorest)
> memset(to, 0, len);
>
> and be done with it.
>
> The only reason we don't do that actually looks like a buglet in
> 'copy_in_user()', which can have a source fault but should *not*
> necessarily try to clear the rest of the destination buffer. But it
> uses the shared "copy_user_generic()" logic, so it doesn't even
> realize that.
>
> I call it a "buglet", because there's not necessarily anything
> horribly wrong with clearing the tail, it's just completely wasted
> work. And it makes the "oops, source is bad" case unnecessarily
> horribly slow.
>
> In fact, the whole "zerorest" thing is garbage, I think. The
> copy_user_generic() code seems to always set it to just 'len', and
> it's because it doesn't even know or care about the actual direction.
>
> The *real* test should just be "is the destination a kernel space
> buffer" (we have done the "access_ok()" things independently). And
> that test we can do without any 'zerorest' parameter.
>
> So the attached (untested) patch should
>
> (a) remove the pointless 'zerorest' parameter
>
> (b) fix the 'copy_in_user()' buglet
>
> (c) make the kernel destination case be much more efficient with just
> a simple 'memset()'
>
> Hmm? Comments? Sasha, do you have the initial random number state to
> recreate this easily?

Your patch just makes it hang in memset instead:

[ 963.104556] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c224:9845]
[ 963.104556] Modules linked in:
[ 963.104556] irq event stamp: 3773324
[ 963.104556] hardirqs last enabled at (3773323): [<ffffffffb7af8e3c>] restore_args+0x0/0x34
[ 963.104556] hardirqs last disabled at (3773324): [<ffffffffb7af919e>] apic_timer_interrupt+0x6e/0x80
[ 963.104556] softirqs last enabled at (3773322): [<ffffffffad1f6589>] __do_softirq+0x709/0xd40
[ 963.104556] softirqs last disabled at (3773317): [<ffffffffad1f746d>] irq_exit+0x29d/0x320
[ 963.104556] CPU: 1 PID: 9845 Comm: trinity-c224 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2130
[ 963.104556] task: ffff8802a3560000 ti: ffff8802a3568000 task.ti: ffff8802a3568000
[ 963.104556] RIP: 0010:[<ffffffffaefbd173>] [<ffffffffaefbd173>] memset_orig+0x33/0xb0
[ 963.104556] RSP: 0000:ffff8802a356fdf0 EFLAGS: 00010212
[ 963.104556] RAX: 0000000000000000 RBX: 000000007e777000 RCX: 000000000042f8bf
[ 963.104556] RDX: 000000007e777000 RSI: 0000000000000000 RDI: ffffc9008f766000
[ 963.104556] RBP: ffff8802a356fe18 R08: 000000000fceee00 R09: 0000000000000000
[ 963.104556] R10: ffffc90021bd2000 R11: fffff52014069200 R12: 000000000fceee00
[ 963.104556] R13: fffff520136892d8 R14: 000000000fceee00 R15: fffff520140691ff
[ 963.104556] FS: 00007fd40c4fe700(0000) GS:ffff8800a2800000(0000) knlGS:0000000000000000
[ 963.104556] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 963.104556] CR2: 00007fd40c018000 CR3: 00000002a351a000 CR4: 00000000000007a0
[ 963.104556] DR0: 00007fd409b8f000 DR1: 0000000000000000 DR2: 0000000000000000
[ 963.104556] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 963.104556] Stack:
[ 963.104556] ffffffffad63b4ac ffffffffb7afd69c 000000007e777000 ffff8802a356fe78
[ 963.104556] ffff8802a356ff18 ffff8802a356fe38 ffffffffaefbd759 0000000000010206
[ 963.104556] 1ffff100546adfcb ffff8802a356ff48 ffffffffad3c5ef0 ffffc90020b49000
[ 963.104556] Call Trace:
[ 963.104556] [<ffffffffad63b4ac>] ? memset+0x2c/0x40
[ 963.104556] [<ffffffffb7afd69c>] ? bad_to_user+0x66/0x1391
[ 963.104556] [<ffffffffaefbd759>] copy_user_handle_tail+0x69/0x80
[ 963.104556] [<ffffffffad3c5ef0>] SyS_init_module+0x150/0x210
[ 963.104556] [<ffffffffad3c5da0>] ? load_module+0x96a0/0x96a0
[ 963.104556] [<ffffffffaefbd3b8>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 963.104556] [<ffffffffb7af83b6>] system_call_fastpath+0x16/0x84
[ 963.104556] Code: b8 01 01 01 01 01 01 01 01 48 0f af c1 41 89 f9 41 83 e1 07 75 70 48 89 d1 48 c1 e9 06 74 39 66 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89

It's easy to reproduce it with trinity by adding -cinit_module, no need to touch the seed.


Thanks,
Sasha

2015-04-06 19:38:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Mon, Apr 06, 2015 at 03:08:24PM -0400, Sasha Levin wrote:
> Your patch just makes it hang in memset instead:
>
> [ 963.104556] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c224:9845]
> [ 963.104556] Modules linked in:
> [ 963.104556] irq event stamp: 3773324
> [ 963.104556] hardirqs last enabled at (3773323): [<ffffffffb7af8e3c>] restore_args+0x0/0x34
> [ 963.104556] hardirqs last disabled at (3773324): [<ffffffffb7af919e>] apic_timer_interrupt+0x6e/0x80
> [ 963.104556] softirqs last enabled at (3773322): [<ffffffffad1f6589>] __do_softirq+0x709/0xd40
> [ 963.104556] softirqs last disabled at (3773317): [<ffffffffad1f746d>] irq_exit+0x29d/0x320
> [ 963.104556] CPU: 1 PID: 9845 Comm: trinity-c224 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2130
> [ 963.104556] task: ffff8802a3560000 ti: ffff8802a3568000 task.ti: ffff8802a3568000
> [ 963.104556] RIP: 0010:[<ffffffffaefbd173>] [<ffffffffaefbd173>] memset_orig+0x33/0xb0

memset_orig?

Hmmm, your box doesn't set X86_FEATURE_REP_GOOD. And looking at your
dmesg I can see why - that's a kvm guest and you're emulating some qemu
-cpu kvm64 or so thing, AFAICT.

Can you give me your full command line?

Btw, maybe you should try to trigger this on real metal, just in case,
to rule out virt issues. Or have you, already?

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-06 20:42:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Mon, Apr 6, 2015 at 12:08 PM, Sasha Levin <[email protected]> wrote:
>
> Your patch just makes it hang in memset instead:

So it's certainly a big memset (2GB or so: original count in RDX:
0x7e777000, and "%rcx << 6" is bytes left, so it has done about 85% of
it), which is certainly going to be slow, but it shouldn't *hang*. The
kernel memory should be all there and allocated, so it should be just
limited by memory speeds, which shouldn't be enough to take 22s. The
previous "one byte at a time" case I could easily have seen being slow
enough to , but 2GB of pre-allocated memory? Weird. Any half-way
normal memory subsystem should write memory at tens of GB/s.

So it's a bit odd that the watchdog triggers.

That said, maybe there is some virtualization thing that slows down
these things by an order of magniture or two (for example, paging in
the host). At that point I can easily see the 2GB memset() taking a
long time.

The main (only, really) reason we zero the target kernel buffer is for
security reasons, but that's really mainly for copying structures from
user space or for the data copy for write() system calls etc. So we
could easily say that we limit the clearing to a single hugepage or
something, since anything bigger than that is going to be into vmalloc
space and the copyer had *better* check the return value anyway.

Alternatively, we could just limit module loading size to some (fairly
arbitrary) big number.

Linus

2015-04-06 23:45:43

by Sasha Levin

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On 04/06/2015 03:36 PM, Borislav Petkov wrote:
> On Mon, Apr 06, 2015 at 03:08:24PM -0400, Sasha Levin wrote:
>> > Your patch just makes it hang in memset instead:
>> >
>> > [ 963.104556] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c224:9845]
>> > [ 963.104556] Modules linked in:
>> > [ 963.104556] irq event stamp: 3773324
>> > [ 963.104556] hardirqs last enabled at (3773323): [<ffffffffb7af8e3c>] restore_args+0x0/0x34
>> > [ 963.104556] hardirqs last disabled at (3773324): [<ffffffffb7af919e>] apic_timer_interrupt+0x6e/0x80
>> > [ 963.104556] softirqs last enabled at (3773322): [<ffffffffad1f6589>] __do_softirq+0x709/0xd40
>> > [ 963.104556] softirqs last disabled at (3773317): [<ffffffffad1f746d>] irq_exit+0x29d/0x320
>> > [ 963.104556] CPU: 1 PID: 9845 Comm: trinity-c224 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bdae3-dirty #2130
>> > [ 963.104556] task: ffff8802a3560000 ti: ffff8802a3568000 task.ti: ffff8802a3568000
>> > [ 963.104556] RIP: 0010:[<ffffffffaefbd173>] [<ffffffffaefbd173>] memset_orig+0x33/0xb0
> memset_orig?
>
> Hmmm, your box doesn't set X86_FEATURE_REP_GOOD. And looking at your
> dmesg I can see why - that's a kvm guest and you're emulating some qemu
> -cpu kvm64 or so thing, AFAICT.
>
> Can you give me your full command line?

I'm running kvmtool rather than qemu.

I've changed the cpu that the guest sees, and it just moved to the optimized memset:

[ 843.403237] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c123:9780]
[ 843.403265] Modules linked in:
[ 843.403265] irq event stamp: 7340472
[ 843.403265] hardirqs last enabled at (7340471): [<ffffffffafaf8e3c>] restore_args+0x0/0x34
[ 843.403265] hardirqs last disabled at (7340472): [<ffffffffafaf919e>] apic_timer_interrupt+0x6e/0x80
[ 843.403265] softirqs last enabled at (7340470): [<ffffffffa51f6589>] __do_softirq+0x709/0xd40
[ 843.403265] softirqs last disabled at (7340455): [<ffffffffa51f746d>] irq_exit+0x29d/0x320
[ 843.403265] CPU: 0 PID: 9780 Comm: trinity-c123 Not tainted 4.0.0-rc6-next-20150402-sasha-00039-ge0bda
e3-dirty #2130
[ 843.403265] task: ffff88034ffb0000 ti: ffff88034ffb8000 task.ti: ffff88034ffb8000
[ 843.403265] RIP: 0010:[<ffffffffa6fbd124>] [<ffffffffa6fbd124>] __memset+0x24/0x30
[ 843.403265] RSP: 0000:ffff88034ffbfdf0 EFLAGS: 00010206
[ 843.403265] RAX: 0000000000000000 RBX: 00000000fffff000 RCX: 0000000004a8a000
[ 843.403265] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc900ecdbd000
[ 843.403265] RBP: ffff88034ffbfe18 R08: 000000001ffffe00 R09: ffffc9001220e000
[ 843.403265] R10: ffffc9011220cfff R11: fffff52022441a00 R12: 000000001ffffe00
[ 843.403265] R13: fffff52022399988 R14: 000000001ffffe00 R15: fffff520224419ff
[ 843.403265] FS: 00007f0600ec4700(0000) GS:ffff880050600000(0000) knlGS:0000000000000000
[ 843.403265] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 843.403265] CR2: 00000000ff355000 CR3: 000000034f73f000 CR4: 00000000000007f0
[ 843.403265] Stack:
[ 843.403265] ffffffffa563b4ac ffffffffafafd69c 00000000fffff000 ffff88034ffbfe78
[ 843.403265] ffff88034ffbff18 ffff88034ffbfe38 ffffffffa6fbd759 0000000000010246
[ 843.403265] 1ffff10069ff7fcb ffff88034ffbff48 ffffffffa53c5ef0 ffffc9001220e000
[ 843.403265] Call Trace:
[ 843.403265] [<ffffffffa563b4ac>] ? memset+0x2c/0x40
[ 843.403265] [<ffffffffafafd69c>] ? bad_to_user+0x66/0x1391
[ 843.403265] [<ffffffffa6fbd759>] copy_user_handle_tail+0x69/0x80
[ 843.403265] [<ffffffffa53c5ef0>] SyS_init_module+0x150/0x210
[ 843.403265] [<ffffffffa53c5da0>] ? load_module+0x96a0/0x96a0
[ 843.403265] [<ffffffffafafad49>] ia32_do_call+0x13/0x13
[ 843.403265] Code: 90 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1
>
> Btw, maybe you should try to trigger this on real metal, just in case,
> to rule out virt issues. Or have you, already?

I'll give it a go tomorrow.


Thanks,
Sasha

2015-04-08 01:17:05

by Rusty Russell

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

Linus Torvalds <[email protected]> writes:
> Alternatively, we could just limit module loading size to some (fairly
> arbitrary) big number.

Yeah, we used to do that because vmalloc would BUG rather than
returning NULL.

But you're already CAP_SYS_MODULE in this path, which kind of makes it
hard to care very much about it being slow under virtualization.

Cheers,
Rusty.

2015-04-07 09:31:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE


* Linus Torvalds <[email protected]> wrote:

> On Mon, Apr 6, 2015 at 12:08 PM, Sasha Levin <[email protected]> wrote:
> >
> > Your patch just makes it hang in memset instead:
>
> So it's certainly a big memset (2GB or so: original count in RDX:
> 0x7e777000, and "%rcx << 6" is bytes left, so it has done about 85%
> of it), which is certainly going to be slow, but it shouldn't
> *hang*. [...]

Would be nice to see the actual kernel .config, as there's a few debug
features that can slow things down significantly, such as
CONFIG_KMEMCHECK=y or in some cases CONFIG_DEBUG_PAGEALLOC=y.

Thanks,

Ingo

2015-04-07 10:42:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Tue, Apr 07, 2015 at 11:31:45AM +0200, Ingo Molnar wrote:
> Would be nice to see the actual kernel .config, as there's a few
> debug features that can slow things down significantly, such as
> CONFIG_KMEMCHECK=y or in some cases CONFIG_DEBUG_PAGEALLOC=y.

He sent it here: https://lkml.kernel.org/r/[email protected]

I'm attaching it again.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--


Attachments:
(No filename) (436.00 B)
sasha-config (159.66 kB)
Download all attachments

2015-04-07 11:05:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE


* Borislav Petkov <[email protected]> wrote:

> On Tue, Apr 07, 2015 at 11:31:45AM +0200, Ingo Molnar wrote:
> > Would be nice to see the actual kernel .config, as there's a few
> > debug features that can slow things down significantly, such as
> > CONFIG_KMEMCHECK=y or in some cases CONFIG_DEBUG_PAGEALLOC=y.
>
> He sent it here: https://lkml.kernel.org/r/[email protected]
>
> I'm attaching it again.

Thanks. Nothing particularly expensive stands out to me at a quick
glance:-/

Thanks,

Ingo

2015-04-07 14:30:17

by Michal Hocko

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Tue 07-04-15 13:05:31, Ingo Molnar wrote:
>
> * Borislav Petkov <[email protected]> wrote:
>
> > On Tue, Apr 07, 2015 at 11:31:45AM +0200, Ingo Molnar wrote:
> > > Would be nice to see the actual kernel .config, as there's a few
> > > debug features that can slow things down significantly, such as
> > > CONFIG_KMEMCHECK=y or in some cases CONFIG_DEBUG_PAGEALLOC=y.
> >
> > He sent it here: https://lkml.kernel.org/r/[email protected]
> >
> > I'm attaching it again.
>
> Thanks. Nothing particularly expensive stands out to me at a quick
> glance:-/

Sasha uses KASAN very often and it is enabled here as well. I guess this
might slow things really down.
--
Michal Hocko
SUSE Labs

2015-04-07 14:37:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE


* Michal Hocko <[email protected]> wrote:

> On Tue 07-04-15 13:05:31, Ingo Molnar wrote:
> >
> > * Borislav Petkov <[email protected]> wrote:
> >
> > > On Tue, Apr 07, 2015 at 11:31:45AM +0200, Ingo Molnar wrote:
> > > > Would be nice to see the actual kernel .config, as there's a few
> > > > debug features that can slow things down significantly, such as
> > > > CONFIG_KMEMCHECK=y or in some cases CONFIG_DEBUG_PAGEALLOC=y.
> > >
> > > He sent it here: https://lkml.kernel.org/r/[email protected]
> > >
> > > I'm attaching it again.
> >
> > Thanks. Nothing particularly expensive stands out to me at a quick
> > glance:-/
>
> Sasha uses KASAN very often and it is enabled here as well. I guess
> this might slow things really down.

Indeed, that kind of instrumentation might explain the overhead ...

Thanks,

Ingo

2015-04-07 17:01:30

by Sasha Levin

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On 04/07/2015 10:37 AM, Ingo Molnar wrote:
>
> * Michal Hocko <[email protected]> wrote:
>
>> On Tue 07-04-15 13:05:31, Ingo Molnar wrote:
>>>
>>> * Borislav Petkov <[email protected]> wrote:
>>>
>>>> On Tue, Apr 07, 2015 at 11:31:45AM +0200, Ingo Molnar wrote:
>>>>> Would be nice to see the actual kernel .config, as there's a few
>>>>> debug features that can slow things down significantly, such as
>>>>> CONFIG_KMEMCHECK=y or in some cases CONFIG_DEBUG_PAGEALLOC=y.
>>>>
>>>> He sent it here: https://lkml.kernel.org/r/[email protected]
>>>>
>>>> I'm attaching it again.
>>>
>>> Thanks. Nothing particularly expensive stands out to me at a quick
>>> glance:-/
>>
>> Sasha uses KASAN very often and it is enabled here as well. I guess
>> this might slow things really down.
>
> Indeed, that kind of instrumentation might explain the overhead ...

It locks up quickly without KASan as well.


Thanks,
Sasha

2015-04-07 17:33:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Tue, Apr 7, 2015 at 10:00 AM, Sasha Levin <[email protected]> wrote:
>
> It locks up quickly without KASan as well.

I suspect it's some virtualization artifact, where the writes cause
COW faults (or just memory allocations) in the L0 domain.

Whatever. It's probably not worth fighting. Either we just decide that
"copy_from_user()" shouldn't bother to zero huge areas (limit the
zeroing to some arbitrary size), or we just special-case the module
loading.

Something like the attached patch.

Entirely untested, as usual.

Linus


Attachments:
patch.diff (1.09 kB)

2015-04-07 17:59:02

by Dave Jones

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Tue, Apr 07, 2015 at 10:33:49AM -0700, Linus Torvalds wrote:
> On Tue, Apr 7, 2015 at 10:00 AM, Sasha Levin <[email protected]> wrote:
> >
> > It locks up quickly without KASan as well.
>
> I suspect it's some virtualization artifact, where the writes cause
> COW faults (or just memory allocations) in the L0 domain.
>
> Whatever. It's probably not worth fighting. Either we just decide that
> "copy_from_user()" shouldn't bother to zero huge areas (limit the
> zeroing to some arbitrary size), or we just special-case the module
> loading.
>
> +#define COPY_CHUNK_SIZE (16*PAGE_SIZE)

This needs to be bigger, just to accomodate things like xfs, which
on my system is over 1mb. I think some out-of-tree modules
(not that I care deeply) are even bigger.

Special casing might be an easier call.

Dave

2015-04-07 21:10:14

by Dave Jones

[permalink] [raw]
Subject: Re: Hang on large copy_from_user with PREEMPT_NONE

On Tue, Apr 07, 2015 at 11:12:31AM -0700, Linus Torvalds wrote:
> On Apr 7, 2015 10:58 AM, "Dave Jones" <[email protected]> wrote:
> > >
> > > +#define COPY_CHUNK_SIZE (16*PAGE_SIZE)
> >
> > This needs to be bigger, just to accomodate things like xfs, which
> > on my system is over 1mb. I think some out-of-tree modules
> > (not that I care deeply) are even bigger.
>
> Umm. It will happily copy megabytes of data. Or gigabytes. It just doors it
> in smaller chunks, and make sure to test the return value and possibly
> reschedule in between.

brain no worky. You are of course, correct.

Dave