2022-04-20 23:26:15

by kernel test robot

[permalink] [raw]
Subject: [fs/pipe] 5a519c8fe4: WARNING:at_mm/page_alloc.c:#__alloc_pages



Greeting,

FYI, we noticed the following commit (built with clang-15):

commit: 5a519c8fe4d620912385f94372fc8472fa98c662 ("fs/pipe: use kvcalloc to allocate a pipe_buffer array")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: trinity
version: trinity-static-i386-x86_64-1c734c75-1_2020-01-06
with following parameters:

runtime: 300s
group: group-04

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 32.170781][ T3729] WARNING: The mand mount option has been deprecated and
[ 32.170781][ T3729] and is ignored by this kernel. Remove the mand
[ 32.170781][ T3729] option from the mount to silence this warning.
[ 32.170781][ T3729] =======================================================
[ 32.763485][ T3736] _warn_unseeded_randomness: 547 callbacks suppressed
[ 32.763492][ T3736] random: get_random_u32 called from arch_pick_mmap_layout+0x4b/0x110 with crng_init=1
[ 32.763503][ T3736] random: get_random_u32 called from randomize_stack_top+0x1a/0x40 with crng_init=1
[ 32.763510][ T3736] random: get_random_u32 called from arch_align_stack+0x23/0x40 with crng_init=1
[ 33.428192][ C0] random: crng init done
[ 33.428690][ C0] random: 33 get_random_xx warning(s) missed due to ratelimiting
[ 35.180590][ T3728] raw_sendmsg: trinity-c4 forgot to set AF_INET. Fix it!
[ 224.552132][ T3730] ------------[ cut here ]------------
[ 224.552771][ T3730] WARNING: CPU: 1 PID: 3730 at mm/page_alloc.c:5364 __alloc_pages (mm/page_alloc.c:5364)
[ 224.553628][ T3730] Modules linked in: crc32_pclmul evdev serio_raw
[ 224.554257][ T3730] CPU: 1 PID: 3730 Comm: trinity-c6 Not tainted 5.17.0-00020-g5a519c8fe4d6 #1 b54824caf35f46f3e2520057d525606ee99ab517
[ 224.555542][ T3730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 224.558294][ T3730] EIP: __alloc_pages (mm/page_alloc.c:5364)
[ 224.558801][ T3730] Code: 74 42 89 d9 89 fa ff 75 e4 56 e8 61 1f 00 00 83 c4 08 89 d8 83 c4 1c 5e 5f 5b 5d 31 c9 31 d2 c3 31 db f7 c6 00 20 00 00 75 e8 <0f> 0b eb e4 a9 00 00 08 00 75 2a a9 00 00 04 00 74 29 81 e6 7f ff
All code
========
0: 74 42 je 0x44
2: 89 d9 mov %ebx,%ecx
4: 89 fa mov %edi,%edx
6: ff 75 e4 pushq -0x1c(%rbp)
9: 56 push %rsi
a: e8 61 1f 00 00 callq 0x1f70
f: 83 c4 08 add $0x8,%esp
12: 89 d8 mov %ebx,%eax
14: 83 c4 1c add $0x1c,%esp
17: 5e pop %rsi
18: 5f pop %rdi
19: 5b pop %rbx
1a: 5d pop %rbp
1b: 31 c9 xor %ecx,%ecx
1d: 31 d2 xor %edx,%edx
1f: c3 retq
20: 31 db xor %ebx,%ebx
22: f7 c6 00 20 00 00 test $0x2000,%esi
28: 75 e8 jne 0x12
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb e4 jmp 0x12
2e: a9 00 00 08 00 test $0x80000,%eax
33: 75 2a jne 0x5f
35: a9 00 00 04 00 test $0x40000,%eax
3a: 74 29 je 0x65
3c: 81 .byte 0x81
3d: e6 7f out %al,$0x7f
3f: ff .byte 0xff

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb e4 jmp 0xffffffffffffffe8
4: a9 00 00 08 00 test $0x80000,%eax
9: 75 2a jne 0x35
b: a9 00 00 04 00 test $0x40000,%eax
10: 74 29 je 0x3b
12: 81 .byte 0x81
13: e6 7f out %al,$0x7f
15: ff .byte 0xff
[ 224.560750][ T3730] EAX: 00040dc0 EBX: 00000000 ECX: 00000000 EDX: 0000000b
[ 224.561496][ T3730] ESI: 00040dc0 EDI: 00000002 EBP: bb457dd4 ESP: bb457dac
[ 224.562277][ T3730] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
[ 224.564399][ T3730] CR0: 80050033 CR2: 0000000e CR3: 7ec7b000 CR4: 00040690
[ 224.565092][ T3730] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 224.565777][ T3730] DR6: fffe0ff0 DR7: 00000400
[ 224.566305][ T3730] Call Trace:
[ 224.567299][ T3730] kmalloc_order (include/linux/gfp.h:572 include/linux/gfp.h:595 include/linux/gfp.h:609 mm/slab_common.c:944)
[ 224.567707][ T3730] kmalloc_order_trace (mm/slab_common.c:960)
[ 224.568173][ T3730] __kmalloc (include/linux/slab.h:510 mm/slub.c:4413)
[ 224.568571][ T3730] iter_file_splice_write (include/linux/slab.h:? include/linux/slab.h:652 fs/splice.c:628)
[ 224.569069][ T3730] ? lock_acquire (kernel/locking/lockdep.c:5639)
[ 224.569563][ T3730] ? splice_from_pipe (fs/splice.c:620)
[ 224.570060][ T3730] do_splice (fs/splice.c:767 fs/splice.c:1079)
[ 224.570487][ T3730] ? rcu_read_lock_sched_held (kernel/rcu/update.c:125)
[ 224.571073][ T3730] ? rcu_lock_acquire (include/linux/rcupdate.h:272)
[ 224.572386][ T3730] __ia32_sys_splice (fs/splice.c:1144 fs/splice.c:1350 fs/splice.c:1332 fs/splice.c:1332)
[ 224.572869][ T3730] ? syscall_enter_from_user_mode (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 kernel/entry/common.c:107)
[ 224.573469][ T3730] do_int80_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:132)
[ 224.573933][ T3730] ? irqentry_exit (kernel/entry/common.c:441)
[ 224.574406][ T3730] ? irqentry_exit_to_user_mode (kernel/entry/common.c:316)
[ 224.575003][ T3730] ? irqentry_exit (kernel/entry/common.c:441)
[ 224.575461][ T3730] ? sysvec_call_function_single (arch/x86/kernel/apic/apic.c:1097)
[ 224.576048][ T3730] entry_INT80_32 (init_task.c:?)
[ 224.576536][ T3730] EIP: 0x80a3392
[ 224.576891][ T3730] Code: 89 c8 c3 90 8d 74 26 00 85 c0 c7 01 01 00 00 00 75 d8 a1 c8 a9 ac 08 eb d1 66 90 66 90 66 90 66 90 66 90 66 90 66 90 90 cd 80 <c3> 8d b6 00 00 00 00 8d bc 27 00 00 00 00 8b 10 a3 f0 a9 ac 08 85
All code
========
0: 89 c8 mov %ecx,%eax
2: c3 retq
3: 90 nop
4: 8d 74 26 00 lea 0x0(%rsi,%riz,1),%esi
8: 85 c0 test %eax,%eax
a: c7 01 01 00 00 00 movl $0x1,(%rcx)
10: 75 d8 jne 0xffffffffffffffea
12: a1 c8 a9 ac 08 eb d1 movabs 0x9066d1eb08aca9c8,%eax
19: 66 90
1b: 66 90 xchg %ax,%ax
1d: 66 90 xchg %ax,%ax
1f: 66 90 xchg %ax,%ax
21: 66 90 xchg %ax,%ax
23: 66 90 xchg %ax,%ax
25: 66 90 xchg %ax,%ax
27: 90 nop
28: cd 80 int $0x80
2a:* c3 retq <-- trapping instruction
2b: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
31: 8d bc 27 00 00 00 00 lea 0x0(%rdi,%riz,1),%edi
38: 8b 10 mov (%rax),%edx
3a: a3 .byte 0xa3
3b: f0 lock
3c: a9 .byte 0xa9
3d: ac lods %ds:(%rsi),%al
3e: 08 .byte 0x8
3f: 85 .byte 0x85

Code starting with the faulting instruction
===========================================
0: c3 retq
1: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
7: 8d bc 27 00 00 00 00 lea 0x0(%rdi,%riz,1),%edi
e: 8b 10 mov (%rax),%edx
10: a3 .byte 0xa3
11: f0 lock
12: a9 .byte 0xa9
13: ac lods %ds:(%rsi),%al
14: 08 .byte 0x8
15: 85 .byte 0x85


To reproduce:

# build kernel
cd linux
cp config-5.17.0-00020-g5a519c8fe4d6 .config
make HOSTCC=clang-15 CC=clang-15 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=clang-15 CC=clang-15 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (8.80 kB)
config-5.17.0-00020-g5a519c8fe4d6 (136.84 kB)
job-script (4.44 kB)
dmesg.xz (13.37 kB)
trinity (16.89 kB)
Download all attachments

2022-04-22 21:39:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [fs/pipe] 5a519c8fe4: WARNING:at_mm/page_alloc.c:#__alloc_pages

On Wed, Apr 20, 2022 at 12:37 AM kernel test robot
<[email protected]> wrote:
>
> commit: 5a519c8fe4d6 ("fs/pipe: use kvcalloc to allocate a pipe_buffer array")
>
> [ 32.170781][ T3729] WARNING: The mand mount option has been deprecated and
> [ 32.170781][ T3729] and is ignored by this kernel. Remove the mand
> [ 32.170781][ T3729] option from the mount to silence this warning.

Heh. Not that warning.

This warning:

> [ 224.552771][ T3730] WARNING: CPU: 1 PID: 3730 at mm/page_alloc.c:5364 __alloc_pages (mm/page_alloc.c:5364)

That's just the

5363 if (unlikely(order >= MAX_ORDER)) {
5364 WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
5365 return NULL;
5366 }

so somebody is doing a big allocation that will fail, and doesn't use
__GFP_NOWARN.

That someone being iter_file_splice_write():

> [ 224.567299][ T3730] kmalloc_order (include/linux/gfp.h:572 include/linux/gfp.h:595 include/linux/gfp.h:609 mm/slab_common.c:944)
> [ 224.567707][ T3730] kmalloc_order_trace (mm/slab_common.c:960)
> [ 224.568173][ T3730] __kmalloc (include/linux/slab.h:510 mm/slub.c:4413)
> [ 224.568571][ T3730] iter_file_splice_write (include/linux/slab.h:? include/linux/slab.h:652 fs/splice.c:628)
> [ 224.570060][ T3730] do_splice (fs/splice.c:767 fs/splice.c:1079)
> [ 224.572386][ T3730] __ia32_sys_splice (fs/splice.c:1144 fs/splice.c:1350 fs/splice.c:1332 fs/splice.c:1332)

and that's the

int nbufs = pipe->max_usage;
struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
GFP_KERNEL);

thing, and no, using __GFP_NOWARN here isn't what we'd want to do,
because the code in question has no fallback (it will just return
-ENOMEM).

Now, technically, returning -ENOMEM is a "fallback", but not really.
It just means the kernel won't crash, it doesn't mean that this is
acceptable behavior.

Basically, that commit 5a519c8fe4d6 made it possible to create a pipe
that is effectively "too large to be used". It used to be that such a
pipe could never be created before, because the 'pipe->bufs' resizing
allocation used to be

bufs = kcalloc(nr_slots, sizeof(*bufs),
GFP_KERNEL_ACCOUNT | __GFP_NOWARN);

and 'sizeof(struct pipe_buffer)' is bigger than 'sizeof(struct
bio_vec)', so if the resizing was successful, then the pipe buffer
count was guaranteed to be smaller than what that file_splice code
would use.

So it really does look like this whole "allow the pipe size to grow
almost unlimited" change was a fundamental mistake. It has these kinds
of subtle issues.

I'm inclined to revert commit 5a519c8fe4d6 - doing multiple iterations
really shouldn't be so expensive, and this shows that the whole "try
to do it in one big go" is fundamentally broken.

Could 'iter_file_splice_write()' be changed to limit it some way? Yes.

Could it be changed to use kvcalloc() too? Yes again.

But I'm not convinced that some odd corner-case CRIU optimization is
worth this kind of pain.

Linus

2022-04-22 22:02:25

by Andrei Vagin

[permalink] [raw]
Subject: Re: [fs/pipe] 5a519c8fe4: WARNING:at_mm/page_alloc.c:#__alloc_pages

On Wed, Apr 20, 2022 at 12:07 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Apr 20, 2022 at 12:37 AM kernel test robot
> <[email protected]> wrote:
> >
> > commit: 5a519c8fe4d6 ("fs/pipe: use kvcalloc to allocate a pipe_buffer array")
> >
> > [ 32.170781][ T3729] WARNING: The mand mount option has been deprecated and
> > [ 32.170781][ T3729] and is ignored by this kernel. Remove the mand
> > [ 32.170781][ T3729] option from the mount to silence this warning.
>
> Heh. Not that warning.
>
> This warning:
>
> > [ 224.552771][ T3730] WARNING: CPU: 1 PID: 3730 at mm/page_alloc.c:5364 __alloc_pages (mm/page_alloc.c:5364)
>
> That's just the
>
> 5363 if (unlikely(order >= MAX_ORDER)) {
> 5364 WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
> 5365 return NULL;
> 5366 }
>
> so somebody is doing a big allocation that will fail, and doesn't use
> __GFP_NOWARN.
>
> That someone being iter_file_splice_write():
>
> > [ 224.567299][ T3730] kmalloc_order (include/linux/gfp.h:572 include/linux/gfp.h:595 include/linux/gfp.h:609 mm/slab_common.c:944)
> > [ 224.567707][ T3730] kmalloc_order_trace (mm/slab_common.c:960)
> > [ 224.568173][ T3730] __kmalloc (include/linux/slab.h:510 mm/slub.c:4413)
> > [ 224.568571][ T3730] iter_file_splice_write (include/linux/slab.h:? include/linux/slab.h:652 fs/splice.c:628)
> > [ 224.570060][ T3730] do_splice (fs/splice.c:767 fs/splice.c:1079)
> > [ 224.572386][ T3730] __ia32_sys_splice (fs/splice.c:1144 fs/splice.c:1350 fs/splice.c:1332 fs/splice.c:1332)
>
> and that's the
>
> int nbufs = pipe->max_usage;
> struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
> GFP_KERNEL);
>
> thing, and no, using __GFP_NOWARN here isn't what we'd want to do,
> because the code in question has no fallback (it will just return
> -ENOMEM).
>
> Now, technically, returning -ENOMEM is a "fallback", but not really.
> It just means the kernel won't crash, it doesn't mean that this is
> acceptable behavior.
>
> Basically, that commit 5a519c8fe4d6 made it possible to create a pipe
> that is effectively "too large to be used". It used to be that such a
> pipe could never be created before, because the 'pipe->bufs' resizing
> allocation used to be
>
> bufs = kcalloc(nr_slots, sizeof(*bufs),
> GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
>
> and 'sizeof(struct pipe_buffer)' is bigger than 'sizeof(struct
> bio_vec)', so if the resizing was successful, then the pipe buffer
> count was guaranteed to be smaller than what that file_splice code
> would use.
>
> So it really does look like this whole "allow the pipe size to grow
> almost unlimited" change was a fundamental mistake. It has these kinds
> of subtle issues.
>
> I'm inclined to revert commit 5a519c8fe4d6 - doing multiple iterations
> really shouldn't be so expensive, and this shows that the whole "try
> to do it in one big go" is fundamentally broken.
>
> Could 'iter_file_splice_write()' be changed to limit it some way? Yes.
>
> Could it be changed to use kvcalloc() too? Yes again.
>
> But I'm not convinced that some odd corner-case CRIU optimization is
> worth this kind of pain.

Sorry for the inconvenience. This change is critical for the live
(iterative) migration scenario. For this case, we have the pre-dump
operation that dumps processes memory without freezing processes. The
memory tracker is used to detect what pages have been changed from the
previous iterations. The pre-dump operation requires being able to grub
process pages with minimal process downtime. We use pipes for that. CRIU
injects a parasite code to a target process, vmsplice-s memory to pipes,
resumes the process, and dumps pages from pipes while the process
continues doing its job. Right now, we can do reliably pipes with the 3MB
capacity (1 << PAGE_ALLOC_COSTLY_ORDER) * PAGE_SIZE / sizeof(struct
kernel_pipe_buffer).
It means that 1000 pipes allow us to dump only 3GB of memory.
The operation of injecting/curing parasite code from the
process is expensive and requires freezing a process. This is why we
want to maximize the load that we can carry on for each iteration. And I
am not sure that we can solve this problem without any kernel changes.
For us, it is the optimization of the often used code path. I know CRIU itself
is an odd corner case, but anyway it would be good if we can find a solution.

We made another attempt to solve the problem by introducing
process_vmsplice https://lkml.org/lkml/2018/1/9/32. It is stuck with a
very similar explanation:

> All seems fairly straightforward. The big question is: do we know
> that people will actually use this, and get sufficient value from it
> to justify its addition?

Thanks,
Andrei


>
> Linus