A number of drivers call page_frag_alloc() with a
fragment's size > PAGE_SIZE.
In low memory conditions, __page_frag_cache_refill() may fail the order 3
cache allocation and fall back to order 0;
If this happens, the cache will be smaller than the fragment, causing
memory corruptions.
Prevent this from happening by checking if the newly allocated cache
is large enough for the fragment; if not, the allocation will fail
and page_frag_alloc() will return NULL.
Signed-off-by: Maurizio Lombardi <[email protected]>
---
mm/page_alloc.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e008a3df0485..7fb000d7e90c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5611,12 +5611,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
/* if size can vary use size else just use PAGE_SIZE */
size = nc->size;
#endif
- /* OK, page count is 0, we can safely set it */
- set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
-
/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
offset = size - fragsz;
+ if (unlikely(offset < 0)) {
+ free_the_page(page, compound_order(page));
+ nc->va = NULL;
+ return NULL;
+ }
+
+ /* OK, page count is 0, we can safely set it */
+ set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
}
nc->pagecnt_bias--;
--
2.31.1
Tested with this kernel module:
http://bsdbackstore.eu/misc/oomk/
It requires 2 parameters: the first one is the amount of memory you
want to allocate with page_frag_alloc(), the second one is the size of
the fragment
I tested it on a machine with ~7Gb of free memory.
Without the patch:
-------------------------------------------------
3Gb of memory will be used with frag size = 1024 byte. No issue:
#insmod oomk.ko memory_size_gb=3 fragsize=1024
[ 177.875107] Test begins, memory size = 3 fragsize = 1024
[ 177.974538] Test completed!
10 Gb of memory, 1024 byte frag. page allocation failure but the
kernel handles it and doesn't crash:
#insmod oomk.ko memory_size_gb=10 fragsize=1024
[ 215.104801] Test begins, memory size = 10 fragsize = 1024
[ 215.227854] insmod: page allocation failure: order:0,
mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[ 215.230231] CPU: 1 PID: 1738 Comm: insmod Kdump: loaded Tainted: G
OE --------- --- 5.14.0-124.kpq0.el9.x86_64 #1
[ 215.232344] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 215.233523] Call Trace:
[ 215.234001] dump_stack_lvl+0x34/0x44
[ 215.234894] warn_alloc+0x134/0x160
[ 215.235592] __alloc_pages_slowpath.constprop.0+0x809/0x840
[ 215.236687] ? get_page_from_freelist+0xc6/0x500
[ 215.237569] __alloc_pages+0x1fa/0x230
[ 215.238381] page_frag_alloc_align+0x16c/0x1a0
[...]
[ 215.315722] allocation number 7379888 failed!
[ 215.426227] Test completed!
10Gb, 4097 byte frag. Kernel crashes:
#insmod oomk.ko memory_size_gb=10 fragsize=4097
[ 623.461505] BUG: Bad page state in process insmod pfn:10a80c
[ 623.462634] page:000000000654dc14 refcount:0 mapcount:0
mapping:000000007a56d6cd index:0x0 pfn:0x10a80c
[ 623.464401] memcg:ffff900343a5b501
[ 623.465058] aops:0xffff9003409e5d38 with invalid host inode 00003524480055f0
[ 623.466394] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 623.467632] raw: 0017ffffc0000000 dead000000000100 dead000000000122
ffff900346cf2900
[ 623.469069] raw: 0000000000000000 0000000000100010 00000000ffffffff
ffff900343a5b501
[ 623.470521] page dumped because: page still charged to cgroup
[...]
[ 626.632838] general protection fault, probably for non-canonical
address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
[ 626.633913] ------------[ cut here ]------------
[ 626.639981] CPU: 0 PID: 722 Comm: agetty Kdump: loaded Tainted: G
B OE --------- --- 5.14.0-124.kpq0.el9.x86_64 #1
[ 626.640923] WARNING: CPU: 1 PID: 22 at mm/slub.c:4566 __ksize+0xc4/0xe0
[ 626.645018] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 626.645021] RIP: 0010:___slab_alloc+0x1b7/0x5c0
------------------------------------------
With the patch the kernel doesn't crash:
#insmod oomk.ko memory_size_gb=10 fragsize=4097
[ 4859.358496] Test begins, memory size = 10 fragsize = 4097
[ 4859.459674] allocation number 607754 failed!
[ 4859.495489] Test completed!
#insmod oomk.ko memory_size_gb=10 fragsize=40000
[ 8428.021491] Test begins, memory size = 10 fragsize = 40000
[ 8428.024308] allocation number 0 failed!
[ 8428.025709] Test completed!
Maurizio
On Mon, Jul 11, 2022 at 12:52 AM Maurizio Lombardi <[email protected]> wrote:
>
> A number of drivers call page_frag_alloc() with a
> fragment's size > PAGE_SIZE.
> In low memory conditions, __page_frag_cache_refill() may fail the order 3
> cache allocation and fall back to order 0;
> If this happens, the cache will be smaller than the fragment, causing
> memory corruptions.
>
> Prevent this from happening by checking if the newly allocated cache
> is large enough for the fragment; if not, the allocation will fail
> and page_frag_alloc() will return NULL.
>
> Signed-off-by: Maurizio Lombardi <[email protected]>
> ---
> mm/page_alloc.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..7fb000d7e90c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5611,12 +5611,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
> /* if size can vary use size else just use PAGE_SIZE */
> size = nc->size;
> #endif
> - /* OK, page count is 0, we can safely set it */
> - set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
> -
> /* reset page count bias and offset to start of new frag */
> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> offset = size - fragsz;
> + if (unlikely(offset < 0)) {
> + free_the_page(page, compound_order(page));
> + nc->va = NULL;
> + return NULL;
> + }
> +
> + /* OK, page count is 0, we can safely set it */
> + set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
> }
>
> nc->pagecnt_bias--;
Rather than forcing us to free the page it might be better to move the
lines getting the size and computing the offset to the top of the "if
(unlikely(offset < 0)) {" block. Then instead of freeing the page we
could just return NULL and don't have to change the value of any
fields in the page_frag_cache.
That way a driver performing bad requests can't force us to start
allocating and freeing pages like mad by repeatedly flushing the
cache.
po 11. 7. 2022 v 17:34 odesílatel Alexander Duyck
<[email protected]> napsal:
>
> Rather than forcing us to free the page it might be better to move the
> lines getting the size and computing the offset to the top of the "if
> (unlikely(offset < 0)) {" block. Then instead of freeing the page we
> could just return NULL and don't have to change the value of any
> fields in the page_frag_cache.
>
> That way a driver performing bad requests can't force us to start
> allocating and freeing pages like mad by repeatedly flushing the
> cache.
>
I understand. On the other hand, if we free the cache page then the
next time __page_frag_cache_refill() runs it may be successful
at allocating the order=3 cache, the normal page_frag_alloc() behaviour will
therefore be restored.
Maurizio
On Mon, Jul 11, 2022 at 9:17 AM Maurizio Lombardi <[email protected]> wrote:
>
> po 11. 7. 2022 v 17:34 odesílatel Alexander Duyck
> <[email protected]> napsal:
> >
> > Rather than forcing us to free the page it might be better to move the
> > lines getting the size and computing the offset to the top of the "if
> > (unlikely(offset < 0)) {" block. Then instead of freeing the page we
> > could just return NULL and don't have to change the value of any
> > fields in the page_frag_cache.
> >
> > That way a driver performing bad requests can't force us to start
> > allocating and freeing pages like mad by repeatedly flushing the
> > cache.
> >
>
> I understand. On the other hand, if we free the cache page then the
> next time __page_frag_cache_refill() runs it may be successful
> at allocating the order=3 cache, the normal page_frag_alloc() behaviour will
> therefore be restored.
That is a big "maybe". My concern is that it will actually make memory
pressure worse by forcing us to reduce the number of uses for a lower
order page. One bad actor will have us flushing memory like mad so a
guy expecting a small fragment may end up allocating 32K pages because
someone else is trying to allocate them.
I recommend we do not optimize for a case which this code was not
designed for. Try to optimize for the standard case that most of the
drivers are using. These drivers that are allocating higher order
pages worth of memory should really be using alloc_pages. Using this
to allocate pages over 4K in size is just a waste since they are not
likely to see page reuse which is what this code expects to see.
st 13. 7. 2022 v 16:59 odesílatel Maurizio Lombardi
<[email protected]> napsal:
>
> A number of drivers call page_frag_alloc() with a
> fragment's size > PAGE_SIZE.
> In low memory conditions, __page_frag_cache_refill() may fail the order 3
> cache allocation and fall back to order 0;
> In this case, the cache will be smaller than the fragment, causing
> memory corruptions.
Oops, I didn't modify the subject, I'm going to resend it.
Maurizio
>
> Prevent this from happening by checking if the newly allocated cache
> is large enough for the fragment; if not, the allocation will fail
> and page_frag_alloc() will return NULL.
>
> V2: do not free the cache page because this could make memory pressure
> even worse, just return NULL.
>
> Signed-off-by: Maurizio Lombardi <[email protected]>
> ---
> mm/page_alloc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..b1407254a826 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5617,6 +5617,8 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
> /* reset page count bias and offset to start of new frag */
> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> offset = size - fragsz;
> + if (unlikely(offset < 0))
> + return NULL;
> }
>
> nc->pagecnt_bias--;
> --
> 2.31.1
>