2022-04-22 09:32:22

by Nicholas Piggin

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm:
> On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote:
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index e163372d3967..70933f4ed069 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>  			if (nr != nr_pages_request)
>>  				break;
>>  		}
>> -	} else
>> -		/*
>> -		 * Compound pages required for remap_vmalloc_page if
>> -		 * high-order pages.
>> -		 */
>> -		gfp |= __GFP_COMP;
>> +	}
>> 
>>  	/* High-order pages or fallback path if "bulk" fails. */
>>
>> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>  			page = alloc_pages_node(nid, gfp, order);
>>  		if (unlikely(!page))
>>  			break;
>> +		/*
>> +		 * Higher order allocations must be able to be treated
>> +		 * as independent small pages by callers (as they can
>> +		 * with small page allocs).
>> +		 */
>> +		if (order)
>> +			split_page(page, order);
>> 
>>  		/*
>>  		 * Careful, we allocate and map page-order pages, but
>
> FWIW, I like this direction. I think it needs to free them differently
> though? Since it currently assumes they are high-order pages in that path.
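>
> For reference, that free path currently looks roughly like this (a
> simplified sketch of __vunmap()'s loop as it stands in mainline, error
> handling elided):
>
>	unsigned int page_order = vm_area_page_order(area);
>	int i, step = 1U << page_order;
>
>	for (i = 0; i < area->nr_pages; i += step) {
>		struct page *page = area->pages[i];
>
>		BUG_ON(!page);
>		/* Frees one order-page_order block per step; wrong once
>		 * the backing pages have been split into order-0 pages. */
>		__free_pages(page, page_order);
>	}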

Yeah, I got a bit excited there, but I'm fairly sure that's the bug.
I'll do a proper patch.

> I also wonder if we wouldn't need vm_struct->page_order anymore, and
> all the places that percolates out to. Basically all the places
> where the code iterates through vm_struct->pages with page_order stepping.
>
> Besides fixing the bisected issue (hopefully), it also more cleanly
> separates the mapping logic from the backing allocation logic. And then,
> since all the pages are 4k (from the page allocator's perspective), it
> would be easier to support sizes that aren't huge-page aligned, i.e. not
> use up a whole additional 2MB page if you only need 4k more of
> allocation size.
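>
> A rough sketch of that sizing idea (purely hypothetical, nothing like
> this is in the patches here): with order-0 backing pages throughout,
> only the PMD-aligned head of an allocation would need huge mappings,
> e.g.:
>
>	/* Hypothetical: map the aligned head with huge pages and the
>	 * remainder with small pages, instead of rounding the whole
>	 * size up to PMD_SIZE. */
>	unsigned long huge_sz  = size & PMD_MASK;	/* PMD-mapped head */
>	unsigned long tail_sz  = size - huge_sz;	/* PTE-mapped tail */
>	unsigned long nr_small = tail_sz >> PAGE_SHIFT;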

Somewhat. I'm not sure about freeing pages back for general use while
they still have page tables pointing to them. That will be a whole
different kettle of fish, I suspect. Honestly, I don't think fragmentation
has proven to be a bad enough problem that we'd want to do something
like that yet.

But there are other reasons to decouple them in general. Some ppc32
platforms, I think, had a different page size that was not a huge page,
which they could use to map vmalloc memory, for example.

In any case I have no objections to making that part of it more
flexible.

Thanks,
Nick


2022-04-22 17:35:54

by Nicholas Piggin

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Nicholas Piggin's message of April 22, 2022 1:08 pm:
> Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm:
>> On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote:
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index e163372d3967..70933f4ed069 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>  			if (nr != nr_pages_request)
>>>  				break;
>>>  		}
>>> -	} else
>>> -		/*
>>> -		 * Compound pages required for remap_vmalloc_page if
>>> -		 * high-order pages.
>>> -		 */
>>> -		gfp |= __GFP_COMP;
>>> +	}
>>> 
>>>  	/* High-order pages or fallback path if "bulk" fails. */
>>>
>>> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>  			page = alloc_pages_node(nid, gfp, order);
>>>  		if (unlikely(!page))
>>>  			break;
>>> +		/*
>>> +		 * Higher order allocations must be able to be treated
>>> +		 * as independent small pages by callers (as they can
>>> +		 * with small page allocs).
>>> +		 */
>>> +		if (order)
>>> +			split_page(page, order);
>>> 
>>>  		/*
>>>  		 * Careful, we allocate and map page-order pages, but
>>
>> FWIW, I like this direction. I think it needs to free them differently
>> though? Since it currently assumes they are high-order pages in that path.
>
> Yeah, I got a bit excited there, but I'm fairly sure that's the bug.
> I'll do a proper patch.

So here's the patch on top of the revert. Only tested on a lowly
powerpc machine, but it does fix this simple test case that does
what the drm driver is obviously doing:

/* PMD_SIZE so the allocation gets huge-page backing */
size_t sz = PMD_SIZE;
void *mem = vmalloc(sz);
/* Take a sub-page of the huge backing page and use its struct
 * page fields the way fb_deferred_io does */
struct page *p = vmalloc_to_page(mem + PAGE_SIZE*3);
p->mapping = NULL;
p->index = 0;
INIT_LIST_HEAD(&p->lru);
vfree(mem);

Without the fix below, the exact same problem reproduces:

BUG: Bad page state in process swapper/0 pfn:00743
page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: corrupted mapping in tail page
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
Call Trace:
[c000000002383940] [c0000000006ebb00] dump_stack_lvl+0x74/0xa8 (unreliable)
[c000000002383980] [c0000000003dabdc] bad_page+0x12c/0x170
[c000000002383a00] [c0000000003dad08] free_tail_pages_check+0xe8/0x190
[c000000002383a30] [c0000000003dc45c] free_pcp_prepare+0x31c/0x4e0
[c000000002383a90] [c0000000003df9f0] free_unref_page+0x40/0x1b0
[c000000002383ad0] [c0000000003d7fc8] __vunmap+0x1d8/0x420
[c000000002383b70] [c00000000102e0d8] proc_vmalloc_init+0xdc/0x108
[c000000002383bf0] [c000000000011f80] do_one_initcall+0x60/0x2c0
[c000000002383cc0] [c000000001001658] kernel_init_freeable+0x32c/0x3cc
[c000000002383da0] [c000000000012564] kernel_init+0x34/0x1a0
[c000000002383e10] [c00000000000ce64] ret_from_kernel_thread+0x5c/0x64

Any other concerns with the fix?

Thanks,
Nick

--
mm/vmalloc: huge vmalloc backing pages should be split rather than compound

Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).

However, a similar problem exists for the other struct page fields that
callers use; for example, fb_deferred_io_fault() takes a vmalloc'ed page
and not only refcounts it but also uses ->lru, ->mapping, and ->index.
This is not compatible with compound sub-pages.

The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.
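
For background, a minimal illustration of the split_page() semantics the
fix relies on (assumed core-MM behavior for non-compound pages; not part
of the change itself):

	/* Illustrative only: one non-compound order-2 allocation becomes
	 * four independently refcounted order-0 pages after split_page(),
	 * each of which can then be freed on its own. */
	struct page *page = alloc_pages(GFP_KERNEL, 2);
	unsigned int i;

	if (page) {
		split_page(page, 2);
		for (i = 0; i < 4; i++)
			__free_pages(page + i, 0);
	}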

Signed-off-by: Nicholas Piggin <[email protected]>
---
mm/vmalloc.c | 36 +++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0b17498a34f1..09470361dc03 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2653,15 +2653,18 @@ static void __vunmap(const void *addr, int deallocate_pages)
 	vm_remove_mappings(area, deallocate_pages);
 
 	if (deallocate_pages) {
-		unsigned int page_order = vm_area_page_order(area);
-		int i, step = 1U << page_order;
+		int i;
 
-		for (i = 0; i < area->nr_pages; i += step) {
+		for (i = 0; i < area->nr_pages; i++) {
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
-			mod_memcg_page_state(page, MEMCG_VMALLOC, -step);
-			__free_pages(page, page_order);
+			mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+			/*
+			 * High-order allocs for huge vmallocs are split, so
+			 * can be freed as an array of order-0 allocations
+			 */
+			__free_pages(page, 0);
 			cond_resched();
 		}
 		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
@@ -2914,12 +2917,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			if (nr != nr_pages_request)
 				break;
 		}
-	} else
-		/*
-		 * Compound pages required for remap_vmalloc_page if
-		 * high-order pages.
-		 */
-		gfp |= __GFP_COMP;
+	}
 
 	/* High-order pages or fallback path if "bulk" fails. */

@@ -2933,6 +2931,15 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;
+		/*
+		 * Higher order allocations must be able to be treated as
+		 * independent small pages by callers (as they can with
+		 * small-page vmallocs). Some drivers do their own refcounting
+		 * on vmalloc_to_page() pages, some use page->mapping,
+		 * page->lru, etc.
+		 */
+		if (order)
+			split_page(page, order);
 
 		/*
 		 * Careful, we allocate and map page-order pages, but
@@ -2992,11 +2999,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
 	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 	if (gfp_mask & __GFP_ACCOUNT) {
-		int i, step = 1U << page_order;
+		int i;
 
-		for (i = 0; i < area->nr_pages; i += step)
-			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC,
-					     step);
+		for (i = 0; i < area->nr_pages; i++)
+			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
 	}
 
 	/*
--
2.35.1

2022-04-22 22:44:55

by Edgecombe, Rick P

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, 2022-04-22 at 14:31 +1000, Nicholas Piggin wrote:
> Any other concerns with the fix?

I had another concern with the vmalloc huge pages. Possibly getting
into paranoid territory here, but it's around the new behavior of
freeing vmalloc page tables. At least on x86 this doesn't happen very
often. AFAICT it all happens during boot or during memory hot unplug.

The risk this entails is racing against all the software page table
walks and walking a freed table. At least on x86, the walks of the
kernel tables are done with no locks, which works because the PTE
updates are atomic and the tables themselves are pretty much never
freed. Some of the kernel page table walks in the fault handler are
actually triggerable from userspace.

So it's in the category of "there is already a bug", or of a little
trouble someone could cause with a lot of effort. At least until PKS
vmalloc memory shows up. Anyway, as long as we are going over this all
again, I thought it was worth bringing up.
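
To make that concrete, the kind of walk at issue looks roughly like this
(a hypothetical sketch shaped after the x86 fault-path walks, not actual
mainline code):

	static pte_t *kernel_walk(unsigned long addr)
	{
		pgd_t *pgd = pgd_offset_k(addr);
		p4d_t *p4d;
		pud_t *pud;
		pmd_t *pmd;

		/* Nothing here pins the intermediate tables; a concurrent
		 * free of one of them is the race in question. */
		if (pgd_none(*pgd))
			return NULL;
		p4d = p4d_offset(pgd, addr);
		if (p4d_none(*p4d))
			return NULL;
		pud = pud_offset(p4d, addr);
		if (pud_none(*pud) || pud_leaf(*pud))
			return NULL;
		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd) || pmd_leaf(*pmd))
			return NULL;
		return pte_offset_kernel(pmd, addr);
	}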

2022-04-22 23:02:10

by Edgecombe, Rick P

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, 2022-04-22 at 10:10 -0700, Edgecombe, Richard P wrote:
> The risk this entails is racing against all the software page table
> walks and walking a freed table. At least on x86, the walks of the
> kernel tables are done with no locks, which works because the PTE
> updates are atomic and the tables themselves are pretty much never
> freed. Some of the kernel page table walks in the fault handler are
> actually triggerable from userspace.

Argh, please ignore this. I guess interrupts getting disabled in the
fault handler forces the freeing of the page table to synchronize with
the fault handler walks via the TLB flush IPIs that precede the free.
Sorry for the noise.