2019-04-30 13:25:32

by Mark Rutland

Subject: [PATCH] io_uring: avoid page allocation warnings

In io_sqe_buffer_register() we allocate a number of arrays based on the
iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
we can still attempt to allocate arrays exceeding MAX_ORDER.

On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
(16-byte) bio_vecs then requires 4194320 bytes, which is greater than
4MiB. This results in SLUB warning that we're trying to allocate
greater than MAX_ORDER, and failing the allocation.

Avoid this by passing __GFP_NOWARN when allocating arrays for the
user-provided iov_len. We'll gracefully handle the failed allocation,
returning -ENOMEM to userspace.

We should probably consider lowering the limit below SZ_1G, or reworking
the array allocations.

Full splat from before this patch:

WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x110/0x190 lib/dump_stack.c:113
panic+0x384/0x68c kernel/panic.c:214
__warn+0x2bc/0x2c0 kernel/panic.c:571
report_bug+0x228/0x2d8 lib/bug.c:186
bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
el1_dbg+0x18/0x8c
__alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
alloc_pages include/linux/gfp.h:509 [inline]
kmalloc_order+0x20/0x50 mm/slab_common.c:1231
kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
kmalloc_large include/linux/slab.h:480 [inline]
__kmalloc+0x3dc/0x4f0 mm/slub.c:3791
kmalloc_array include/linux/slab.h:670 [inline]
io_sqe_buffer_register fs/io_uring.c:2472 [inline]
__io_uring_register fs/io_uring.c:2962 [inline]
__do_sys_io_uring_register fs/io_uring.c:3008 [inline]
__se_sys_io_uring_register fs/io_uring.c:2990 [inline]
__arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
SMP: stopping secondary CPUs
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
CPU features: 0x002,23000438
Memory Limit: none
Rebooting in 1 seconds..

Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
fs/io_uring.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 48fa6e86bfd6..25fc8cb56fc5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2453,10 +2453,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
kfree(vmas);
kfree(pages);
pages = kmalloc_array(nr_pages, sizeof(struct page *),
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_NOWARN);
vmas = kmalloc_array(nr_pages,
sizeof(struct vm_area_struct *),
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_NOWARN);
if (!pages || !vmas) {
ret = -ENOMEM;
if (ctx->account_mem)
@@ -2467,7 +2467,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
}

imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_NOWARN);
ret = -ENOMEM;
if (!imu->bvec) {
if (ctx->account_mem)
--
2.11.0


2019-04-30 14:20:52

by Matthew Wilcox

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> In io_sqe_buffer_register() we allocate a number of arrays based on the
> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> we can still attempt to allocate arrays exceeding MAX_ORDER.
>
> On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
> iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
> (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
> 4MiB. This results in SLUB warning that we're trying to allocate
> greater than MAX_ORDER, and failing the allocation.
>
> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> user-provided iov_len. We'll gracefully handle the failed allocation,
> returning -ENOMEM to userspace.
>
> We should probably consider lowering the limit below SZ_1G, or reworking
> the array allocations.

I'd suggest that kvmalloc is probably our friend here ... we don't really
want to return -ENOMEM to userspace for this case, I don't think.

2019-04-30 15:01:22

by Mark Rutland

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> > In io_sqe_buffer_register() we allocate a number of arrays based on the
> > iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> > we can still attempt to allocate arrays exceeding MAX_ORDER.
> >
> > On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
> > iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
> > (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
> > 4MiB. This results in SLUB warning that we're trying to allocate
> > greater than MAX_ORDER, and failing the allocation.
> >
> > Avoid this by passing __GFP_NOWARN when allocating arrays for the
> > user-provided iov_len. We'll gracefully handle the failed allocation,
> > returning -ENOMEM to userspace.
> >
> > We should probably consider lowering the limit below SZ_1G, or reworking
> > the array allocations.
>
> I'd suggest that kvmalloc is probably our friend here ... we don't really
> want to return -ENOMEM to userspace for this case, I don't think.

Sure. I'll go verify that the uring code doesn't assume this memory is
physically contiguous.

I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
GFP_KERNEL.

Thanks,
Mark.

2019-04-30 16:24:05

by Jens Axboe

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On 4/30/19 8:59 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>
>>> On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
>>> iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
>>> (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
>>> 4MiB. This results in SLUB warning that we're trying to allocate
>>> greater than MAX_ORDER, and failing the allocation.
>>>
>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>> returning -ENOMEM to userspace.
>>>
>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>> the array allocations.
>>
>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>> want to return -ENOMEM to userspace for this case, I don't think.
>
> Sure. I'll go verify that the uring code doesn't assume this memory is
> physically contiguous.
>
> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> GFP_KERNEL.

kvmalloc() is fine, the io_uring code doesn't care about the layout of
the memory, it just uses it as an index.

--
Jens Axboe

2019-04-30 17:04:38

by Mark Rutland

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> On 4/30/19 8:59 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> >> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> >>> In io_sqe_buffer_register() we allocate a number of arrays based on the
> >>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> >>> we can still attempt to allocate arrays exceeding MAX_ORDER.
> >>>
> >>> On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
> >>> iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
> >>> (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
> >>> 4MiB. This results in SLUB warning that we're trying to allocate
> >>> greater than MAX_ORDER, and failing the allocation.
> >>>
> >>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> >>> user-provided iov_len. We'll gracefully handle the failed allocation,
> >>> returning -ENOMEM to userspace.
> >>>
> >>> We should probably consider lowering the limit below SZ_1G, or reworking
> >>> the array allocations.
> >>
> >> I'd suggest that kvmalloc is probably our friend here ... we don't really
> >> want to return -ENOMEM to userspace for this case, I don't think.
> >
> > Sure. I'll go verify that the uring code doesn't assume this memory is
> > physically contiguous.
> >
> > I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> > GFP_KERNEL.
>
> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> the memory, it just uses it as an index.

I've just had a go at that, but when using kvmalloc() with or without
GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
syzkaller prog below:

----
Syzkaller reproducer:
# {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
r0 = io_uring_setup(0x378, &(0x7f00000000c0))
sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
----

... I'm a bit worried that this opens up a trivial DoS.

Thoughts?

Thanks,
Mark.

2019-04-30 18:13:26

by Jens Axboe

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On 4/30/19 11:03 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
>> On 4/30/19 8:59 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>>>
>>>>> On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
>>>>> iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
>>>>> (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
>>>>> 4MiB. This results in SLUB warning that we're trying to allocate
>>>>> greater than MAX_ORDER, and failing the allocation.
>>>>>
>>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>>>> returning -ENOMEM to userspace.
>>>>>
>>>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>>>> the array allocations.
>>>>
>>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>>>> want to return -ENOMEM to userspace for this case, I don't think.
>>>
>>> Sure. I'll go verify that the uring code doesn't assume this memory is
>>> physically contiguous.
>>>
>>> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
>>> GFP_KERNEL.
>>
>> kvmalloc() is fine, the io_uring code doesn't care about the layout of
>> the memory, it just uses it as an index.
>
> I've just had a go at that, but when using kvmalloc() with or without
> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> syzkaller prog below:
>
> ----
> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> ----
>
> ... I'm a bit worried that this opens up a trivial DoS.
>
> Thoughts?

Can you post the patch you used?

--
Jens Axboe

2019-05-01 10:31:40

by Mark Rutland

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> On 4/30/19 11:03 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> >> On 4/30/19 8:59 AM, Mark Rutland wrote:
> >>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> >>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> >>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
> >>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> >>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
> >>>>>
> >>>>> On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
> >>>>> iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
> >>>>> (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
> >>>>> 4MiB. This results in SLUB warning that we're trying to allocate
> >>>>> greater than MAX_ORDER, and failing the allocation.
> >>>>>
> >>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> >>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
> >>>>> returning -ENOMEM to userspace.
> >>>>>
> >>>>> We should probably consider lowering the limit below SZ_1G, or reworking
> >>>>> the array allocations.
> >>>>
> >>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
> >>>> want to return -ENOMEM to userspace for this case, I don't think.
> >>>
> >>> Sure. I'll go verify that the uring code doesn't assume this memory is
> >>> physically contiguous.
> >>>
> >>> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> >>> GFP_KERNEL.
> >>
> >> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> >> the memory, it just uses it as an index.
> >
> > I've just had a go at that, but when using kvmalloc() with or without
> > GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> > syzkaller prog below:
> >
> > ----
> > Syzkaller reproducer:
> > # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> > r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> > sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> > io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> > ----
> >
> > ... I'm a bit worried that this opens up a trivial DoS.
> >
> > Thoughts?
>
> Can you post the patch you used?

Diff below.

Mark.

---->8----
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 578b5494f9e5..c6dad90fab14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2364,7 +2364,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)

if (ctx->account_mem)
io_unaccount_mem(ctx->user, imu->nr_bvecs);
- kfree(imu->bvec);
+ kvfree(imu->bvec);
imu->nr_bvecs = 0;
}

@@ -2456,9 +2456,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
if (!pages || nr_pages > got_pages) {
kfree(vmas);
kfree(pages);
- pages = kmalloc_array(nr_pages, sizeof(struct page *),
+ pages = kvmalloc_array(nr_pages, sizeof(struct page *),
GFP_KERNEL);
- vmas = kmalloc_array(nr_pages,
+ vmas = kvmalloc_array(nr_pages,
sizeof(struct vm_area_struct *),
GFP_KERNEL);
if (!pages || !vmas) {
@@ -2470,7 +2470,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
got_pages = nr_pages;
}

- imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+ imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
GFP_KERNEL);
ret = -ENOMEM;
if (!imu->bvec) {
@@ -2531,12 +2531,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,

ctx->nr_user_bufs++;
}
- kfree(pages);
- kfree(vmas);
+ kvfree(pages);
+ kvfree(vmas);
return 0;
err:
- kfree(pages);
- kfree(vmas);
+ kvfree(pages);
+ kvfree(vmas);
io_sqe_buffer_unregister(ctx);
return ret;
}
--
2.11.0

2019-05-01 12:44:11

by Jens Axboe

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On 5/1/19 4:30 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>> On 4/30/19 11:03 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
>>>> On 4/30/19 8:59 AM, Mark Rutland wrote:
>>>>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>>>>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>>>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>>>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>>>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>>>>>
>>>>>>> On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and
>>>>>>> iov_len = SZ_1G gives nr_pages = 262145. The corresponding array of
>>>>>>> (16-byte) bio_vecs then requires 4194320 bytes, which is greater than
>>>>>>> 4MiB. This results in SLUB warning that we're trying to allocate
>>>>>>> greater than MAX_ORDER, and failing the allocation.
>>>>>>>
>>>>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>>>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>>>>>> returning -ENOMEM to userspace.
>>>>>>>
>>>>>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>>>>>> the array allocations.
>>>>>>
>>>>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>>>>>> want to return -ENOMEM to userspace for this case, I don't think.
>>>>>
>>>>> Sure. I'll go verify that the uring code doesn't assume this memory is
>>>>> physically contiguous.
>>>>>
>>>>> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
>>>>> GFP_KERNEL.
>>>>
>>>> kvmalloc() is fine, the io_uring code doesn't care about the layout of
>>>> the memory, it just uses it as an index.
>>>
>>> I've just had a go at that, but when using kvmalloc() with or without
>>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
>>> syzkaller prog below:
>>>
>>> ----
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
>>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
>>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
>>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
>>> ----
>>>
>>> ... I'm a bit worried that this opens up a trivial DoS.
>>>
>>> Thoughts?
>>
>> Can you post the patch you used?
>
> Diff below.

And the reproducer, that was never posted. Patch looks fine to me. Note
that buffer registration is under the protection of RLIMIT_MEMLOCK.
That's usually very limited for non-root, as root you can of course
consume as much as you want and OOM the system.

--
Jens Axboe

2019-05-01 15:19:09

by Mark Rutland

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
> On 5/1/19 4:30 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> >> On 4/30/19 11:03 AM, Mark Rutland wrote:
> >>> I've just had a go at that, but when using kvmalloc() with or without
> >>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> >>> syzkaller prog below:
> >>>
> >>> ----
> >>> Syzkaller reproducer:
> >>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> >>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> >>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> >>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> >>> ----
> >>>
> >>> ... I'm a bit worried that this opens up a trivial DoS.
> >>>
> >>> Thoughts?
> >>
> >> Can you post the patch you used?
> >
> > Diff below.
>
> And the reproducer, that was never posted.

It was; the "Syzkaller reproducer" above is the reproducer I used with
syz-repro.

I've manually minimized that to C below. AFAICT, that hits a leak, which
is what's triggering the OOM after the program is run a number of times
with the previously posted kvmalloc patch.

Per /proc/meminfo, that memory isn't accounted anywhere.

> Patch looks fine to me. Note
> that buffer registration is under the protection of RLIMIT_MEMLOCK.
> That's usually very limited for non-root, as root you can of course
> consume as much as you want and OOM the system.

Sure.

As above, it looks like there's a leak, regardless.

Thanks,
Mark.

---->8----
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/uio.h>

// NOTE: arm64 syscall numbers
#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

#define __IORING_REGISTER_BUFFERS 0

struct __io_sqring_offsets {
uint32_t head;
uint32_t tail;
uint32_t ring_mask;
uint32_t ring_entries;
uint32_t flags;
uint32_t dropped;
uint32_t array;
uint32_t resv1;
uint64_t resv2;
};

struct __io_uring_params {
uint32_t sq_entries;
uint32_t cq_entries;
uint32_t flags;
uint32_t sq_thread_cpu;
uint32_t sq_thread_idle;
uint32_t resv[5];
struct __io_sqring_offsets sq_off;
struct __io_sqring_offsets cq_off;

};

static struct __io_uring_params params;

static struct iovec iov = {
.iov_base = (void *)0x10,
.iov_len = 1024 * 1024 * 1024,
};

int main(void)
{
int fd;

fd = syscall(__NR_io_uring_setup, 0x1, &params);
syscall(__NR_io_uring_register, fd, __IORING_REGISTER_BUFFERS, &iov, 1);

return 0;
}

2019-05-01 15:30:37

by Jens Axboe

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On 5/1/19 9:09 AM, Mark Rutland wrote:
> On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
>> On 5/1/19 4:30 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>>>> On 4/30/19 11:03 AM, Mark Rutland wrote:
>>>>> I've just had a go at that, but when using kvmalloc() with or without
>>>>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
>>>>> syzkaller prog below:
>>>>>
>>>>> ----
>>>>> Syzkaller reproducer:
>>>>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
>>>>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
>>>>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
>>>>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
>>>>> ----
>>>>>
>>>>> ... I'm a bit worried that this opens up a trivial DoS.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Can you post the patch you used?
>>>
>>> Diff below.
>>
>> And the reproducer, that was never posted.
>
> It was; the "Syzkaller reproducer" above is the reproducer I used with
> syz-repro.
>
> I've manually minimized that to C below. AFAICT, that hits a leak, which
> is what's triggering the OOM after the program is run a number of times
> with the previously posted kvmalloc patch.
>
> Per /proc/meminfo, that memory isn't accounted anywhere.
>
>> Patch looks fine to me. Note
>> that buffer registration is under the protection of RLIMIT_MEMLOCK.
>> That's usually very limited for non-root, as root you can of course
>> consume as much as you want and OOM the system.
>
> Sure.
>
> As above, it looks like there's a leak, regardless.

The leak is that we're not releasing imu->bvec in case of error. I fixed
a missing kfree -> kvfree as well in your patch; with this rolled-up
version it works for me.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 18cecb6a0151..3e817d40fb96 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2443,7 +2443,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)

if (ctx->account_mem)
io_unaccount_mem(ctx->user, imu->nr_bvecs);
- kfree(imu->bvec);
+ kvfree(imu->bvec);
imu->nr_bvecs = 0;
}

@@ -2533,11 +2533,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,

ret = 0;
if (!pages || nr_pages > got_pages) {
- kfree(vmas);
- kfree(pages);
- pages = kmalloc_array(nr_pages, sizeof(struct page *),
+ kvfree(vmas);
+ kvfree(pages);
+ pages = kvmalloc_array(nr_pages, sizeof(struct page *),
GFP_KERNEL);
- vmas = kmalloc_array(nr_pages,
+ vmas = kvmalloc_array(nr_pages,
sizeof(struct vm_area_struct *),
GFP_KERNEL);
if (!pages || !vmas) {
@@ -2549,7 +2549,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
got_pages = nr_pages;
}

- imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+ imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
GFP_KERNEL);
ret = -ENOMEM;
if (!imu->bvec) {
@@ -2588,6 +2588,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
}
if (ctx->account_mem)
io_unaccount_mem(ctx->user, nr_pages);
+ kvfree(imu->bvec);
goto err;
}

@@ -2610,12 +2611,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,

ctx->nr_user_bufs++;
}
- kfree(pages);
- kfree(vmas);
+ kvfree(pages);
+ kvfree(vmas);
return 0;
err:
- kfree(pages);
- kfree(vmas);
+ kvfree(pages);
+ kvfree(vmas);
io_sqe_buffer_unregister(ctx);
return ret;
}

--
Jens Axboe

2019-05-01 15:57:18

by Mark Rutland

Subject: Re: [PATCH] io_uring: avoid page allocation warnings

On Wed, May 01, 2019 at 09:29:25AM -0600, Jens Axboe wrote:
> On 5/1/19 9:09 AM, Mark Rutland wrote:
> > I've manually minimized that to C below. AFAICT, that hits a leak, which
> > is what's triggering the OOM after the program is run a number of times
> > with the previously posted kvmalloc patch.
> >
> > Per /proc/meminfo, that memory isn't accounted anywhere.
> >
> >> Patch looks fine to me. Note
> >> that buffer registration is under the protection of RLIMIT_MEMLOCK.
> >> That's usually very limited for non-root, as root you can of course
> >> consume as much as you want and OOM the system.
> >
> > Sure.
> >
> > As above, it looks like there's a leak, regardless.
>
> The leak is that we're not releasing imu->bvec in case of error. I fixed
> a missing kfree -> kvfree as well in your patch; with this rolled-up
> version it works for me.

That works for me too.

I'll fold that into v2, and send that out momentarily.

Thanks,
Mark.